Deploy a Web Scraper using Puppeteer, Node.js and Docker on Koyeb

January 17, 2022

Samuele Zaza


Introduction

Web scraping is the process of extracting meaningful data from websites. Although it can be done manually, nowadays there are several developer-friendly tools that can automate the process for you.

In this tutorial, we are going to create a web scraper using Puppeteer, a Node library developed by Google to perform automated tasks using the Chromium engine. Web scraping is just one of the several applications that make Puppeteer shine. In fact, according to the official documentation on GitHub, Puppeteer can be used to:

  • Generate screenshots and PDFs of web pages.
  • Crawl a Single-Page Application and generate pre-rendered content.
  • Automate form submission, UI testing, keyboard input, and interface interactions.
  • Create an up-to-date, automated testing environment.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome Extensions.

In this guide, we will first create scripts to showcase Puppeteer capabilities, then we will create an API to scrape pages via a simple HTTP API call using Express.js and deploy our application on Koyeb.

Requirements

To successfully follow and complete this guide, you need:

  • Basic knowledge of JavaScript.
  • A local development environment with Node.js installed
  • Basic knowledge of Express.js
  • Docker installed on your machine
  • A Koyeb account to deploy and run the application
  • A GitHub account to version and deploy your application code on Koyeb

This tutorial does not require any prior knowledge of Puppeteer as we will go through every step of setting up and running a web scraper. However, make sure your version of Node.js is at least 10.18.1 as we are using Puppeteer v3+.
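
If you want to verify the Node.js requirement from code rather than by eye, a minimal sketch could look like the following. The helper names parseVersion and isAtLeast are illustrative (they are not part of Puppeteer), and the check assumes plain major.minor.patch version strings.

```javascript
// Minimal sketch: compare the running Node.js version with a required minimum.
// MIN_NODE_VERSION mirrors the 10.18.1 requirement stated above.
const MIN_NODE_VERSION = '10.18.1';

// Turn "10.18.1" into [10, 18, 1].
const parseVersion = (version) =>
  version.split('.').map((part) => Number.parseInt(part, 10));

// Return true when `current` is greater than or equal to `minimum`.
const isAtLeast = (current, minimum) => {
  const a = parseVersion(current);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true;
};

// process.versions.node holds the version without the leading "v".
if (!isAtLeast(process.versions.node, MIN_NODE_VERSION)) {
  console.warn(`Node.js ${MIN_NODE_VERSION} or newer is required.`);
}
```

Running node --version in your terminal gives you the same information without any code.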

For more information, take a look at the official README in the Puppeteer GitHub repository.

Steps

To deploy a web scraper using Puppeteer, Express.js, and Docker on Koyeb, you need to follow these steps:

  1. Initializing the project
  2. Your first Puppeteer application
  3. Puppeteer in action
  4. Scrape pages via a simple API call using Express
  5. Deploy the app on Koyeb

Initializing the project

Get started by creating a new directory that will hold the application code. In a location of your choice, create and navigate to a new directory by executing the following commands:

    mkdir puppeteer-on-koyeb
    cd puppeteer-on-koyeb

Inside the freshly created folder, we will create a Node.js application skeleton containing the Express.js dependencies that we will need to build our scraping API. In your terminal run:

npx express-generator

You will be prompted with a set of questions to populate the initial package.json file including the project name, version, and description. Once the command has been completed, your package.json content should be similar to the following:

    {
      "name": "puppeteer-on-koyeb",
      "version": "1.0.0",
      "description": "Deploy a Web scraper using Puppeteer, ExpressJS and Docker on Koyeb",
      "private": true,
      "scripts": {
        "start": "node ./bin/www"
      },
      "dependencies": {
        "cookie-parser": "~1.4.4",
        "debug": "~2.6.9",
        "express": "~4.16.1",
        "http-errors": "~1.6.3",
        "jade": "~1.11.0",
        "morgan": "~1.9.1"
      },
      "author": "Samuel Zaza",
      "license": "ISC"
    }

Creating the skeleton of the application will come in handy to organize our files, especially later on when we create our API endpoints.

Next, add Puppeteer, the library we will use to perform the scraping, as a project dependency by running:

npm install --save puppeteer

Last, install and configure nodemon. While optional in this guide, nodemon will allow us to automatically restart our server when file changes are detected. This is a great tool to improve the development experience when developing locally. To install nodemon in your terminal, run:

npm install nodemon --save-dev

Then, in your package.json, add the following section so that we can launch the application in development with nodemon by running npm run dev, and in production by running npm start.

      {
        "name": "puppeteer-on-koyeb",
        "version": "1.0.0",
        "description": "Deploy a Web scraper using Puppeteer, ExpressJS and Docker on Koyeb",
        "private": true,
    +   "scripts": {
    +     "start": "node ./bin/www",
    +     "dev": "nodemon ./bin/www"
    +   },
        "dependencies": {
          "cookie-parser": "~1.4.4",
          "debug": "~2.6.9",
          "express": "~4.16.1",
          "http-errors": "~1.6.3",
          "jade": "~1.11.0",
          "morgan": "~1.9.1"
        },
        "author": "Samuel Zaza",
        "license": "ISC"
      }

Execute the following command to launch the application and ensure everything is working as expected:

npm start # or npm run dev

Open your browser at http://localhost:3000 and you should see the Express welcome message.

Your first Puppeteer application

Before diving into more advanced Puppeteer web scraping capabilities, we will create a minimal application that takes a screenshot of a webpage and saves the result in our current directory. As mentioned previously, Puppeteer provides several features to control Chrome/Chromium, and the ability to take screenshots of a webpage comes in very handy.

For this example, we will not dig into each parameter of the screenshot method as we mainly want to confirm our installation works properly.

Create a new JavaScript file named screenshot.js in your project directory puppeteer-on-koyeb by running:

touch screenshot.js

To take a screenshot of a webpage, our application will:

  1. Use Puppeteer, and create a new instance of Browser.
  2. Open a webpage
  3. Take a screenshot
  4. Close the page and browser

Add the code below to the screenshot.js file:

    const puppeteer = require('puppeteer');

    const URL = 'https://koyeb.com';

    const screenshot = async () => {
      console.log('Opening the browser...');
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      console.log(`Go to ${URL}`);
      await page.goto(URL);

      console.log('Taking a screenshot...');
      await page.screenshot({
        path: './screenshot.png',
        fullPage: true,
      });

      console.log('Closing the browser...');
      await page.close();
      await browser.close();
      console.log('Job done!');
    };

    screenshot();

As you can see, we are taking a screenshot of the Koyeb homepage and saving the result as a png file in the root folder of the project.

You can now run the application by running:

    $ node screenshot.js
    Opening the browser...
    Go to https://koyeb.com
    Taking a screenshot...
    Closing the browser...
    Job done!

Once the execution is completed, a screenshot is saved in the root folder of the application. You created your first automation using Puppeteer!

Puppeteer in action

Simple Scraper

In this section, we will create a more advanced scenario to scrape and retrieve information from a website's page. For this example, we will use the Stack Overflow questions page and instruct Puppeteer to extract each question and excerpt present in the webpage HTML.

Before jumping into the code, open the DevTools in your browser to inspect the webpage source code. You should see a similar block for each question in the HTML:

    <div class="question-summary" id="question-summary-11227809">
      <!-- we can ignore the stats wrapper -->
      <div class="statscontainer">...</div>
      <div class="summary">
        <h3><a href="/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array" class="question-hyperlink">Why is processing a sorted array faster than processing an unsorted array?</a></h3>
        <div class="excerpt">
          Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the loop almost six times faster. #include &...
        </div>
        <!-- unnecessary wrapper -->
        <div class="d-flex ai-start fw-wrap">...</div>
      </div>
    </div>

We will use the JavaScript methods querySelectorAll and querySelector to extract both the question and the excerpt, and return the result as an array of objects.

  1. querySelectorAll: used to collect each question element with document.querySelectorAll('.question-summary').
  2. querySelector: used to extract the question title by calling querySelector('.question-hyperlink').innerText and the excerpt by calling querySelector('.excerpt').innerText.
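
One thing to keep in mind is that querySelector returns null when nothing matches, so reading innerText directly would throw if a summary had no excerpt. Below is a minimal sketch of a more defensive mapping. The helpers textOf and toQuestion are hypothetical names introduced for illustration, and whether Stack Overflow ever omits these elements is an assumption, not something this guide verified.

```javascript
// Read the trimmed text of a child element, or null when it is missing.
const textOf = (element, selector) => {
  const node = element.querySelector(selector);
  return node ? node.innerText.trim() : null;
};

// Map one `.question-summary` element to a { question, excerpt } pair.
const toQuestion = (summary) => ({
  question: textOf(summary, '.question-hyperlink'),
  excerpt: textOf(summary, '.excerpt'),
});

// In a browser context the helpers would be used like this:
// [...document.querySelectorAll('.question-summary')].map(toQuestion);
```

Because code passed to Puppeteer runs inside the page and not in Node.js, helpers like these would have to be declared inside the function given to page.evaluate.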

Back in your terminal, create a new folder lib containing a file called scraper.js:

    mkdir lib
    touch lib/scraper.js

Inside the file scraper.js add the following code:

    const puppeteer = require('puppeteer');

    const URL = 'https://stackoverflow.com/questions';

    const singlePageScraper = async () => {
      console.log('Opening the browser...');
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      console.log(`Navigating to ${URL}...`);
      await page.goto(URL, { waitUntil: 'load' });

      console.log(`Collecting the questions...`);
      const questions = await page.evaluate(() => {
        return [...document.querySelectorAll('.question-summary')]
          .map((question) => {
            return {
              question: question.querySelector('.question-hyperlink').innerText,
              excerpt: question.querySelector('.excerpt').innerText,
            };
          });
      });

      console.log('Closing the browser...');
      await page.close();
      await browser.close();

      console.log('Job done!');
      console.log(questions);
      return questions;
    };

    module.exports = {
      singlePageScraper,
    };

Although it looks way more complex than the screenshot.js script, it is actually performing the same actions except for the scraping one. Let's list them:

  1. Create an instance of Browser and open a page.
  2. Go to the URL (and wait for the website to load).
  3. Extract the information from the website and collect the questions into an array of objects.
  4. Close the browser and return the list of question-excerpt pairs.

You might feel confused about the scraping syntax of:

    const questions = await page.evaluate(() => {
      return [...document.querySelectorAll('.question-summary')]
        .map((question) => {
          return {
            question: question.querySelector('.question-hyperlink').innerText,
            excerpt: question.querySelector('.excerpt').innerText,
          };
        });
    });

We first call page.evaluate to interact with the page DOM, and then we extract the question and the excerpt. Because document.querySelectorAll returns a NodeList rather than an array, we spread its result into a JavaScript array to be able to call map on it and return the pair { question, excerpt } for each question.

To run the function, we can import it into a new file singlePageScraper.js, and run it. Create the file in the root directory of your application:

touch singlePageScraper.js

Then, copy the code below, which imports the singlePageScraper function and calls it:

    const { singlePageScraper } = require('./lib/scraper');

    singlePageScraper();

Run the script by executing the following command in your terminal:

node singlePageScraper.js

The following output appears in your terminal showing the questions and excerpts retrieved from the Stack Overflow questions page:

    Opening the browser...
    Navigating to https://stackoverflow.com/questions...
    Collecting the questions...
    Closing the browser...
    Job done!
    [
      {
        question: 'Google Places Autocomplete for WPF Application',
        excerpt: 'I have a windows desktop application developed in WPF( .Net Framework.)I want to implement Autocomplete textbox using google places api autocomplete, I found few reference which used Web browser to do ...'
      },
      {
        question: 'Change the field of a struct in Go',
        excerpt: "I'm trying to change a parameter of n1's left variable, but n1.left is not available, neither is n1.value or n1.right. What's wrong with this declarations? // lib/tree.go package lib type TreeNode ..."
      },
      ...
    ]

Multi page scraper

In the previous example, we learned how to scrape and retrieve information from a single page. We can now go even further and instruct Puppeteer to explore and extract information from multiple pages.

For this scenario, we will scrape and extract questions and excerpts from a pre-defined number of Stack Overflow questions pages.

Our script will:

  1. Receive the number of pages to scrape as a parameter.
  2. Extract questions and excerpts from a page.
  3. Programmatically click on the "next page" element.
  4. Repeat steps 2 and 3 until the number of pages to scrape is reached.
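
The control flow of these steps can be sketched independently of the browser. In the sketch below, collectPages, scrapePage, and goToNextPage are hypothetical names: the two callbacks stand in for the Puppeteer calls, which makes the loop easy to reason about and to test on its own.

```javascript
// Collect results from `totalPages` pages.
// `scrapePage(n)` resolves to the items found on page n.
// `goToNextPage()` moves the browser to the following page.
const collectPages = async (totalPages, scrapePage, goToNextPage) => {
  let results = [];
  for (let current = 1; current <= totalPages; current++) {
    // Step 2: extract the items of the current page
    results = results.concat(await scrapePage(current));
    // Step 3: only navigate when another page remains to be scraped
    if (current < totalPages) {
      await goToNextPage();
    }
  }
  return results;
};
```

Note that the navigation callback is invoked one time fewer than the scraping callback, since there is no need to leave the last page.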

Based on the previous function singlePageScraper we created in lib/scraper.js we will create a new function taking as an argument the number of pages to scrape.

Let's take a look at the HTML source code of the page to find the button element we need to click to go to the next page:

    <div class="s-pagination site1 themed pager float-left">
      <div class="s-pagination--item is-selected">1</div>
      <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&amp;page=2" rel="" title="Go to page 2">2</a>
      ...
      <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&amp;page=1470621" rel="" title="Go to page 1470621">1470621</a>
      <!-- This is the button we need to select and click on -->
      <a class="s-pagination--item js-pagination-item" href="/questions?tab=votes&amp;page=2" rel="next" title="Go to page 2"> Next</a>
    </div>

The Puppeteer class Page provides a handy method click that accepts CSS selectors to simulate a click on an element. In our case, to go to the next page, we decide to use the .pager > a:last-child selector.
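
Since clicking this link loads a new page, it is usually safer to start waiting for the navigation before the click is sent, so the navigation event cannot be missed. Here is a minimal sketch of that pattern using page.waitForNavigation and page.click. The helper name goToNextPage and the NEXT_SELECTOR constant are illustrative, and the helper expects a Puppeteer Page instance as its first argument.

```javascript
// Selector of the "Next" link, as chosen above.
const NEXT_SELECTOR = '.pager > a:last-child';

// Click the "Next" link and resolve once the new page has loaded.
// Both promises are created before either is awaited, so the
// navigation listener is already in place when the click happens.
const goToNextPage = (page, selector = NEXT_SELECTOR) =>
  Promise.all([
    page.waitForNavigation({ waitUntil: 'load' }),
    page.click(selector),
  ]);
```

The HTML above also carries rel="next" on the link, so a selector such as a[rel="next"] might be a less position-dependent choice. Treat that as an assumption to verify against the live page.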

In the lib/scraper.js file, create a new function called multiPageScraper:

      const puppeteer = require('puppeteer');

      const URL = 'https://stackoverflow.com/questions';

      const singlePageScraper = async () => {
        console.log('Opening the browser...');
        const browser = await puppeteer.launch();
        const page = await browser.newPage();

        console.log(`Navigating to ${URL}...`);
        await page.goto(URL, { waitUntil: 'load' });

        console.log(`Collecting the questions...`);
        const questions = await page.evaluate(() => {
          return [...document.querySelectorAll('.question-summary')]
            .map((question) => {
              return {
                question: question.querySelector('.question-hyperlink').innerText,
                excerpt: question.querySelector('.excerpt').innerText,
              };
            });
        });

        console.log('Closing the browser...');
        await page.close();
        await browser.close();

        console.log('Job done!');
        console.log(questions);
        return questions;
      };

    + const multiPageScraper = async (pages = 1) => {
    +   console.log('Opening the browser...');
    +   const browser = await puppeteer.launch();
    +   const page = await browser.newPage();
    +
    +   console.log(`Navigating to ${URL}...`);
    +   await page.goto(URL, { waitUntil: 'load' });
    +
    +   const totalPages = pages;
    +   let questions = [];
    +
    +   for (let initialPage = 1; initialPage <= totalPages; initialPage++) {
    +     console.log(`Collecting the questions of page ${initialPage}...`);
    +     let pageQuestions = await page.evaluate(() => {
    +       return [...document.querySelectorAll('.question-summary')]
    +         .map((question) => {
    +           return {
    +             question: question.querySelector('.question-hyperlink').innerText,
    +             excerpt: question.querySelector('.excerpt').innerText,
    +           }
    +         });
    +     });
    +
    +     questions = questions.concat(pageQuestions);
    +     console.log(questions);
    +     // Go to next page until the total number of pages to scrape is reached
    +     if (initialPage < totalPages) {
    +       // Start waiting for the navigation before clicking. Waiting only for
    +       // '.question-summary' would resolve at once on the current page.
    +       await Promise.all([
    +         page.waitForNavigation({ waitUntil: 'load' }),
    +         page.click('.pager > a:last-child'),
    +       ]);
    +     }
    +   }
    +
    +   console.log('Closing the browser...');
    +
    +   await page.close();
    +   await browser.close();
    +
    +   console.log('Job done!');
    +   return questions;
    + };

      module.exports = {
        singlePageScraper,
    +   multiPageScraper,
      };

Since we are collecting questions and related excerpts for multiple pages, we use a for loop to retrieve the list of questions for each page. The questions retrieved for each page are then concatenated into the questions array, which is returned once the fetching is completed.

As we did for the single page scraper example, create a new file multiPageScraper.js in the root directory to import and call the multiPageScraper function:

touch multiPageScraper.js

Then, add the following code:

    const { multiPageScraper } = require('./lib/scraper');

    multiPageScraper(2);

For the purpose of the script, we are hardcoding the number of pages to fetch to 2. We will make this dynamic when we build the API.

In your terminal execute the following command to run the script:

node multiPageScraper.js

The following output appears in your terminal showing the questions and excerpts retrieved for each page scraped:

    Opening the browser...
    Navigating to https://stackoverflow.com/questions...
    Collecting the questions of page 1...
    [
      {
        question: 'Blazor MAIUI know platform',
        excerpt: 'there is some way to known the platform where is running my Blazor maui app?. Select not work property in "Windows" (you need use size=2 or the list not show), i would read the platform in ...'
      },
      ...
    ]
    Collecting the questions of page 2...
    [
      {
        question: 'Blazor MAIUI know platform',
        excerpt: 'there is some way to known the platform where is running my Blazor maui app?. Select not work property in "Windows" (you need use size=2 or the list not show), i would read the platform in ...'
      },
      ...
    ]
    Closing the browser...
    Job done!

In the next section, we will write a simple API server containing one endpoint to scrape a user-defined number of pages and return the list of questions and excerpts scraped.

Scrape pages via a simple API call using Express

We are going to create a simple Express.js API server having an endpoint /questions that accepts a query parameter pages and returns the list of questions and excerpts from page 1 to the page sent as parameter.

For instance, to retrieve the first three pages of questions and their excerpts from Stack Overflow, we will call:

http://localhost:3000/questions?pages=3
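
Query parameters always arrive as strings, and a caller could send something unexpected such as pages=abc or a very large number. Below is a minimal sketch of how the value could be normalized before it reaches the scraper. The helper parsePages and the MAX_PAGES cap of 5 are assumptions made for illustration, not part of the API built in this guide.

```javascript
// Assumption: an arbitrary upper bound to keep a single request short.
const MAX_PAGES = 5;

// Convert the raw "pages" query value into an integer between 1 and `max`.
const parsePages = (raw, max = MAX_PAGES) => {
  const parsed = Number.parseInt(raw, 10);
  // Missing, non-numeric, zero, or negative values fall back to one page
  if (Number.isNaN(parsed) || parsed < 1) return 1;
  return Math.min(parsed, max);
};
```

With such a helper, a request without the pages parameter would scrape a single page, while pages=100 would be capped at MAX_PAGES.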

To create the questions endpoint, go to the routes directory and create a new file questions.js:

    cd routes
    touch questions.js

Then, add the code below to the questions.js file:

    const express = require('express');
    const scraper = require('../lib/scraper');

    const router = express.Router();

    router.get('/', async (req, res, next) => {
      // 1. Get the parameter "pages"
      const { pages } = req.query;
      // 2. Call the scraper function
      const questions = await scraper.multiPageScraper(pages);
      // 3. Return the array of questions to the client
      res.status(200).json({
        statusCode: 200,
        message: 'Questions correctly retrieved',
        data: { questions },
      });
    });

    module.exports = router;

What the code does is:

  1. Get the query parameter pages.
  2. Call the newly created function multiPageScraper and pass the pages value.
  3. Return the array of questions back to the client.
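
Express 4 does not catch errors thrown inside async route handlers on its own, so a failure in the scraper (a timeout, for example) would leave the request hanging. One common way to handle this is a small wrapper that forwards rejected promises to next. The sketch below uses a hypothetical helper named asyncHandler and is one possible approach, not something the generated project includes.

```javascript
// Wrap an async route handler so that any rejection is passed to `next`,
// which hands the error over to the Express error-handling middleware.
const asyncHandler = (handler) => (req, res, next) =>
  Promise.resolve(handler(req, res, next)).catch(next);

// Hypothetical usage in a route file:
// router.get('/', asyncHandler(async (req, res) => { ... }));
```

The error handler that the Express generator adds at the end of app.js would then render the error response.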

We now need to define the questions route in the Express router. To do so, open app.js and add the following lines:

      const createError = require('http-errors');
      const express = require('express');
      const path = require('path');
      const cookieParser = require('cookie-parser');
      const logger = require('morgan');

      const indexRouter = require('./routes/index');
    + const questionsRouter = require('./routes/questions');

      const app = express();

      // view engine setup
      app.set('views', path.join(__dirname, 'views'));
      app.set('view engine', 'jade');

      app.use(logger('dev'));
      app.use(express.json());
      app.use(express.urlencoded({ extended: false }));
      app.use(cookieParser());
      app.use(express.static(path.join(__dirname, 'public')));

      app.use('/', indexRouter);
    + app.use('/questions', questionsRouter);

Note that thanks to the Express generator we used to initialize our project, we do not have to set up our server from scratch: a few middlewares are already set up for us. Run the server again and try to call the /questions API endpoint using either the browser or cURL.

Here is the output you should get when running it from your terminal using cURL:

    $ curl http://localhost:3000/questions\?pages\=2 | jq '.'
    {
      "statusCode": 200,
      "message": "Questions correctly retrieved",
      "data": {
        "questions": [
          {
            "question": "Is there a way to return float or integer from a conditional True/False",
            "excerpt": "n_level = range(1, steps + 2) steps is user input,using multi-index dataframe for i in n_level: if df['Crest'] >= df[f'L{i}K']: df['Marker'] = i elif df['Trough'] &..."
          },
          {
            "question": "Signin With Popup - Firebase and Custom Provider",
            "excerpt": "I am working on an application that authenticates users with a Spotify account. I have a working login page, however, I would prefer users to not be sent back to the home page when they sign in. I ..."
          },
          ...

And we are done! We now have a web scraper that, with minimal changes, can scrape other websites.

Deploy the app on Koyeb

Now that we have a working server, we can demonstrate how to deploy the application on Koyeb. Koyeb's simple user interface gives you two choices to deploy your app:

  • Deploy native code using git-driven deployment
  • Deploy pre-built Docker containers from any public or private registries.

Since we are using Puppeteer, we need some extra system packages installed so we will deploy on Koyeb using Docker.

In this guide, I won't explain Docker itself in detail, but if you are interested in learning more, I suggest you start with the official documentation.

Before we start working with Docker, we have to make a change to the lib/scraper.js file:

      const puppeteer = require('puppeteer');

      const URL = 'https://stackoverflow.com/questions';

      const singlePageScraper = async () => {
        console.log('Opening the browser...');
        const browser = await puppeteer.launch();
        const page = await browser.newPage();

        console.log(`Navigating to ${URL}...`);
        await page.goto(URL, { waitUntil: 'load' });

        console.log(`Collecting the questions...`);
        const questions = await page.evaluate(() => {
          return [...document.querySelectorAll('.question-summary')]
            .map((question) => {
              return {
                question: question.querySelector('.question-hyperlink').innerText,
                excerpt: question.querySelector('.excerpt').innerText,
              };
            });
        });

        console.log('Closing the browser...');
        await page.close();
        await browser.close();

        console.log('Job done!');
        console.log(questions);
        return questions;
      };

      const multiPageScraper = async (pages = 1) => {
        console.log('Opening the browser...');
    -   const browser = await puppeteer.launch();
    +   const browser = await puppeteer.launch({
    +     headless: true,
    +     executablePath: '/usr/bin/chromium-browser',
    +     args: [
    +       '--no-sandbox',
    +       '--disable-gpu',
    +     ]
    +   });
        const page = await browser.newPage();

        console.log(`Navigating to ${URL}...`);
        await page.goto(URL, { waitUntil: 'load' });

        const totalPages = pages;
        let questions = [];

        for (let initialPage = 1; initialPage <= totalPages; initialPage++) {
          console.log(`Collecting the questions of page ${initialPage}...`);
          let pageQuestions = await page.evaluate(() => {
            return [...document.querySelectorAll('.question-summary')]
              .map((question) => {
                return {
                  question: question.querySelector('.question-hyperlink').innerText,
                  excerpt: question.querySelector('.excerpt').innerText,
                }
              });
          });

          questions = questions.concat(pageQuestions);
          console.log(questions);
          // Go to next page until the total number of pages to scrape is reached
          if (initialPage < totalPages) {
            // Start waiting for the navigation before clicking
            await Promise.all([
              page.waitForNavigation({ waitUntil: 'load' }),
              page.click('.pager > a:last-child'),
            ]);
          }
        }

        console.log('Closing the browser...');

        await page.close();
        await browser.close();

        console.log('Job done!');
        return questions;
      };

      module.exports = {
        singlePageScraper,
        multiPageScraper,
      };

These extra parameters are required to properly run Puppeteer inside a Docker container.
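
A side effect of hardcoding executablePath is that the scraper no longer starts on a machine where Chromium is not installed at that location. If you want the same file to work both locally and in the container, one option is to derive the launch options from an environment variable. This is a minimal sketch: buildLaunchOptions is a hypothetical helper and CHROMIUM_PATH is an assumed variable name that you would have to set yourself in the container (it is not defined anywhere in this guide).

```javascript
// Build the options object passed to puppeteer.launch().
// When CHROMIUM_PATH is set (for example in the Docker image), use the
// system Chromium with the container-friendly flags shown above.
// Otherwise, fall back to the Chromium bundled with Puppeteer.
const buildLaunchOptions = (env = process.env) => {
  const options = { headless: true };
  if (env.CHROMIUM_PATH) {
    options.executablePath = env.CHROMIUM_PATH;
    options.args = ['--no-sandbox', '--disable-gpu'];
  }
  return options;
};

// Hypothetical usage:
// const browser = await puppeteer.launch(buildLaunchOptions());
```

Keeping the flags behind the environment check also means --no-sandbox is only applied inside the container.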

Dockerize the application and push it to the Docker Hub

Get started by creating a Dockerfile containing the following:

    FROM node:lts-alpine

    WORKDIR /app

    RUN apk update && apk add --no-cache nmap && \
      echo @edge http://nl.alpinelinux.org/alpine/edge/community >> /etc/apk/repositories && \
      echo @edge http://nl.alpinelinux.org/alpine/edge/main >> /etc/apk/repositories && \
      apk update && \
      apk add --no-cache \
        chromium \
        harfbuzz \
        "freetype>2.8" \
        ttf-freefont \
        nss

    ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

    COPY . /app

    RUN npm install

    EXPOSE 3000

    CMD ["npm", "start"]

This is a fairly simple Dockerfile. We inherit from the Node alpine base image, install the dependencies required by Puppeteer, add our web scraper application code and indicate how to run it.

We can now build the Docker image by running the following command:

docker build . -t <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb

Take care to replace <YOUR_DOCKER_USERNAME> with your Docker Hub username.

Once the build succeeds, we can push our image to the Docker Hub by running:

docker push <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb

Deploy the app on Koyeb

Let's log in to the Koyeb Control Panel and click on the Create App button. You land on the App creation page.

  1. In "Deployment method", select Docker.
  2. Enter the Docker image you just pushed to the Docker Hub: <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb. We do not need to configure the extra args or command fields.
  3. Pick the container size, server region, and number of instances you'd like for your application.
  4. In the Ports section, change the port value from 8080 to 3000. Koyeb uses this port to determine whether your service is healthy, and 3000 is the port our application listens on.
  5. Give your Koyeb App a name.

Once you click the Create App button, you will automatically be redirected to the Koyeb App page where you can follow the progress of your application deployment. Once your app is deployed, click on the Public URL ending with koyeb.app.

Then, ensure everything is working as expected by retrieving the first two pages of questions from Stack Overflow by running:

    curl "http://<KOYEB_APP_NAME>-<KOYEB_ORG_NAME>.koyeb.app/questions?pages=2"

If everything is working fine, you should see the list of questions returned by the API.

Conclusion

First of all, congratulations on reaching this point! It was a long journey but we now have all the basic knowledge to successfully create and deploy a web scraper.

Starting from the beginning, we played with Puppeteer's screenshot capabilities and gradually built up a scraper that can automatically change pages to retrieve questions from Stack Overflow.

Next, we moved from a simple script to a running Express API server that exposes an endpoint to call the script and scrape a dynamic number of pages based on the query parameter sent along with the API call.

Finally, the cherry on top is the deployment of our server on Koyeb: thanks to the simplicity of deploying pre-built Docker images, we can now perform our scraping in a production environment.

If you have any questions or suggestions regarding this guide, feel free to reach out on Slack.
