Web scraping is a technique for extracting data from a particular website. Websites use HTML to describe their content, and if the HTML is clean and semantic, it’s easy to use it to locate useful data.
You’ll typically use a web scraper to obtain data, monitor it, and track future changes to it.

jQuery Concepts Worth Knowing Before You Use Cheerio
jQuery is one of the most popular JavaScript packages in existence. It makes it easier to work with the Document Object Model (DOM), handle events, create animations, and more. Cheerio is a web scraping package that implements a subset of core jQuery, sharing the same syntax and API while making it easier to parse HTML or XML documents.
Before you learn how to use Cheerio, it is important to know how to select HTML elements with jQuery. Thankfully, jQuery supports most CSS3 selectors, which makes it easier to grab elements from the DOM. Take a look at the following code:
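The sketch below assumes the page contains an element with the id of “container”.

```javascript
// jQuery: select the element whose id is "container"
const container = $("#container");
```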
In the code block above, jQuery selects the element with the id of “container”. A similar implementation in plain JavaScript would look something like this:
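Again, a sketch using the same hypothetical element.

```javascript
// Plain JavaScript: query the document for the same element
const container = document.querySelector("#container");
```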
Comparing the two code blocks, you can see that the former is much easier to read than the latter. That is the beauty of jQuery.
jQuery also has useful methods like text(), html(), and more that make it possible to manipulate HTML elements. There are several methods you can use to traverse the DOM, like parent(), siblings(), prev(), and next().
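For instance, this sketch reads an element’s text and walks to its relatives (the id “item” is hypothetical):

```javascript
const label = $("#item").text();    // the element's text content
const parent = $("#item").parent(); // its direct parent
const sibling = $("#item").next();  // the sibling right after it
```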
The each() method in jQuery is very popular in many Cheerio projects. It allows you to iterate over objects and arrays. The syntax for the each() method looks like this:
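In sketch form, where collection and callback are placeholders:

```javascript
// callback receives the index (or key) and value of each entry
$.each(collection, callback);
```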
In the code block above, callback runs for each iteration of the array or object argument.
Loading HTML With Cheerio
To begin parsing HTML or XML data with Cheerio, you may use the cheerio.load() method. Take a look at this example:
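A minimal sketch, using a throwaway HTML string:

```javascript
const cheerio = require("cheerio");

// Load the markup, then query it with the familiar jQuery syntax
const $ = cheerio.load("<h1>Hello, world!</h1>");

console.log($("h1").text()); // Hello, world!
```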
This code block uses the jQuery text() method to retrieve the text content of the h1 element. The full syntax for the load() method looks like this:
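In outline, with the parameter names described below:

```javascript
const $ = cheerio.load(content, options, isDocument);
```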
The content parameter refers to the actual HTML or XML data you pass to the load() method. options is an optional object that can modify the behavior of the method. By default, the load() method introduces html, head, and body elements if they’re missing. If you want to stop this behavior, make sure that you set isDocument to false.
Scraping Hacker News With Cheerio
The code used in this project is available in a GitHub repository and is free for you to use under the MIT license.
It’s time to combine everything you have learned thus far and create a simple web scraper. Hacker News is a popular website for entrepreneurs and innovators. It is also a perfect website to sharpen your web scraping skills on because it loads fast, has a very simple interface, and does not serve any ads.
Make sure you have Node.js and the Node Package Manager (npm) installed on your machine. Create an empty folder, then a package.json file, and add the following JSON inside the file:
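A minimal manifest along these lines should work; the project name and version numbers are illustrative.

```json
{
  "name": "hacker-news-scraper",
  "version": "1.0.0",
  "scripts": {
    "start": "nodemon index.js"
  },
  "dependencies": {
    "cheerio": "^1.0.0-rc.12",
    "express": "^4.18.2"
  },
  "devDependencies": {
    "nodemon": "^3.0.1"
  }
}
```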
After doing that, open the terminal and run:
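```bash
# Install the dependencies declared in package.json
npm install
```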
This should install the dependencies you need to build the scraper: Cheerio for parsing the HTML, ExpressJS for creating the server, and, as a development dependency, Nodemon, a utility that listens for changes in the project and automatically restarts the server.
Setting Things Up and Creating the Necessary Functions
Create an index.js file, and in that file, create a constant variable called PORT. Set PORT to 5500 (or whatever number you choose), then import the Cheerio and Express packages.
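A sketch of the opening lines; the Express app instance is created here too, since the routes at the end of the file will need it:

```javascript
// index.js
const PORT = 5500;

const cheerio = require("cheerio");
const express = require("express");

// The Express app the routes will hang off later
const app = express();
```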
Next, define three variables: url, html, and finishedPage. Set url to the Hacker News URL.
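Only url needs a value up front; the other two are assigned later:

```javascript
const url = "https://news.ycombinator.com/";

let html;         // the raw markup fetched from Hacker News
let finishedPage; // the final HTML sent back to the browser
```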
Now create a function called getHeader() that returns some HTML that the browser should render.
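A sketch; the exact markup is an assumption, here a plain navigation bar linking to each feed:

```javascript
function getHeader() {
  // A small navigation bar rendered above the post titles
  return `
    <div class="header">
      <a href="/">News</a>
      <a href="/ask">Ask</a>
      <a href="/best">Best</a>
      <a href="/newest">Newest</a>
      <a href="/jobs">Jobs</a>
    </div>
  `;
}
```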
Then create another function, getScript(), that returns some JavaScript for the browser to run. Make sure you pass in the variable type as an argument when you call it.
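Another sketch; here the returned script simply logs the feed type, which is only a placeholder behavior:

```javascript
function getScript(type) {
  // Client-side script appended to the page
  return `
    <script>
      console.log("Rendered the ${type} feed");
    </script>
  `;
}
```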
Finally, create an asynchronous function called fetchAndRenderPage(). This function does exactly what you would expect: it scrapes a page from Hacker News, parses and formats it with Cheerio, then sends some HTML back to the client for rendering.
On Hacker News, there are different types of posts available. There is “news”, the stuff on the front page; posts seeking answers from other Hacker News members have the label “ask”; trending posts have the label “best”; the latest posts have the label “newest”; and posts regarding job vacancies have the label “jobs”.
fetchAndRenderPage() fetches the list of posts from the Hacker News page based on the type you pass in as an argument. If the fetch operation is successful, the function binds the html variable to the response text.
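A sketch of the function’s opening lines, assuming Node.js 18 or newer so that fetch is available globally:

```javascript
async function fetchAndRenderPage(type, res) {
  // Fetch the requested feed, e.g. https://news.ycombinator.com/ask
  const response = await fetch(`${url}${type}`);

  if (response.ok) {
    html = await response.text();
  }

  // ...the lines added in the next sections go here
}
```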
Next, add the following lines to the function:
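A sketch of those lines:

```javascript
// inside fetchAndRenderPage(), after the fetch
res.set("Content-Type", "text/html");
res.write(getHeader());

// hand the fetched markup to Cheerio for parsing
const $ = cheerio.load(html);
```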
In the code block above, the set() method sets the HTTP header field. The write() method is responsible for sending a chunk of the response body. The load() function takes in html as an argument.
Next, add the following lines to select the respective children of all elements with the class “titleline”.
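A sketch; wrapping each title in a p tag is an assumption:

```javascript
// inside fetchAndRenderPage(), after loading the HTML
const articles = [];

// every post title sits inside an element with the class "titleline"
$(".titleline").children("a").each((index, element) => {
  const title = $(element).text();
  articles.push(`<p>${title}</p>`);
});
```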
In this code block, each iteration retrieves the text content of the target HTML element and stores it in the title variable.
Next, push the response from the getScript() function into the articles array. Then set finishedPage to the finished HTML you’ll send to the browser. Lastly, use the write() method to send finishedPage as a chunk, and end the response process with the end() method.
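The closing lines of the function might look like this:

```javascript
// the closing lines of fetchAndRenderPage()
articles.push(getScript(type));
finishedPage = articles.join("");

res.write(finishedPage);
res.end();
```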
Defining the Routes to Handle GET Requests
Right under the fetchAndRenderPage() function, use the Express get() method to define the respective routes for the different types of posts. Then use the listen() method to listen for connections to the specified port on your local machine.
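A sketch of the five routes and the listener:

```javascript
app.get("/", (req, res) => fetchAndRenderPage("news", res));
app.get("/ask", (req, res) => fetchAndRenderPage("ask", res));
app.get("/best", (req, res) => fetchAndRenderPage("best", res));
app.get("/newest", (req, res) => fetchAndRenderPage("newest", res));
app.get("/jobs", (req, res) => fetchAndRenderPage("jobs", res));

app.listen(PORT, () => console.log(`Listening on port ${PORT}`));
```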
In the code block above, every get() method has a callback function that calls the fetchAndRenderPage() function, passing in the respective type and the res object.
Now open your terminal and run npm run start. The server should start up, and you can visit localhost:5500 in your browser to see the results.
Congratulations, you just managed to scrape Hacker News and fetch the post titles without the need for an external API.
Taking Things Further With Web Scraping
With the data you scrape from Hacker News, you can create various visualizations like charts, graphs, and word clouds to present insights and trends in a more digestible format.
You can also scrape user profiles to analyze the reputation of users on the platform based on factors such as upvotes received, comments made, and more.