JavaScript, NodeJS and Cheerio Integration
Learn how to integrate the ZenRows API with Axios and Cheerio to scrape any website, from the most basic calls to advanced features such as concurrency and auto-retry. We will go step-by-step, from installation to the final code, explaining everything along the way.
To just grab the code, go to the final snippet and copy it. It is commented with the parts that must be filled in and helpful remarks for the complicated details.
For the code to work, you will need Node.js (or nvm) and npm installed; some systems ship with them pre-installed. After that, install the libraries used in this guide (Axios, Cheerio, and the ZenRows SDK) by running npm install axios cheerio zenrows.
You will also need to register to get your API Key.
Using Axios to Get a Page
The first library we will see is Axios, a “promise based HTTP client for the browser and node.js”. It exposes a get method that will call a URL and return a response with the HTML. For the moment, we won’t be using any parameters, just as a demo to see how it works.
Careful! This script will run without any proxy, and the server will see your real IP. You don’t need to execute this snippet.
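For reference, a minimal version of that snippet could look like this; the target URL is just a placeholder:

```js
// Plain Axios GET request: no proxy, the server sees your real IP.
const axios = require("axios");

axios
  .get("https://httpbin.org/anything") // placeholder target page
  .then(({ data }) => console.log(data)) // data contains the response body
  .catch((error) => console.error(error));
```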
Calling ZenRows API with Axios
Connecting Axios to the ZenRows API is straightforward. axios.get's target will be the API base, and the second parameter is an object with params: apikey for authentication and url for the page to scrape. URLs must be encoded, but Axios will handle that when using params.
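As a sketch, the call could look like this; fill in your API Key and the page you want to scrape:

```js
const axios = require("axios");

const apiKey = "YOUR_ZENROWS_API_KEY"; // fill in your API Key
const url = "https://httpbin.org/anything"; // the page you want to scrape

axios
  .get("https://api.zenrows.com/v1/", {
    // Axios URL-encodes the params for us, including the target url
    params: { apikey: apiKey, url },
  })
  .then(({ data }) => console.log(data))
  .catch((error) => console.error(error));
```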
With this simple change, ZenRows will handle all the hassles of scraping for us, such as proxy rotation, bypassing CAPTCHAs and anti-bot solutions, setting the correct headers, and many more. However, there are still some challenges that we will address next. Continue reading.
Extracting Basic Data with Cheerio
We will now parse the page's HTML with Cheerio and extract some data. We'll create a simple function, extractContent, that returns the URL, title, and h1 content. Your custom extracting logic goes there.
Cheerio offers a "jQuery-like" syntax and is designed to work on the server. Its load method receives plain HTML and creates a querying function that allows us to find elements. Then you can query with CSS Selectors and navigate, manipulate, or extract content as a browser would. The resulting selector exposes text, which gives us the content as plain text, without tags. Check the docs for more advanced features.
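A sketch of that extraction, reusing the ZenRows call from the previous section:

```js
const axios = require("axios");
const cheerio = require("cheerio");

const apiKey = "YOUR_ZENROWS_API_KEY"; // fill in your API Key
const url = "https://httpbin.org/anything"; // the page you want to scrape

// Your custom extracting logic goes here.
const extractContent = (url, html) => {
  const $ = cheerio.load(html); // create the querying function
  return {
    url,
    title: $("title").text(), // plain text content, without tags
    h1: $("h1").text(),
  };
};

axios
  .get("https://api.zenrows.com/v1/", { params: { apikey: apiKey, url } })
  .then(({ data }) => console.log(extractContent(url, data)))
  .catch((error) => console.error(error));
```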
List of URLs with Concurrency
We've seen how to scrape a single URL. We will now introduce a list of URLs, which is closer to an actual use case. We'll also set up concurrency so we don't have to wait for a sequential process to finish. It allows the script to process several URLs simultaneously, always capped at a maximum. That number depends on the plan you are on.
Since plain JavaScript offers no straightforward way to cap the number of requests in flight, we will use the ZenRows JavaScript SDK, which provides full concurrency support.
It will enqueue and execute all our requests, handling the parallelism for us and keeping the number of simultaneous requests at or below the limit (10 in the example). Once all the requests finish, we will print the results; in a real case, you would store them in a database, for example.
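A sketch of that flow, assuming the SDK constructor accepts a concurrency option as described above; the URLs are placeholders:

```js
const { ZenRows } = require("zenrows");
const cheerio = require("cheerio");

const apiKey = "YOUR_ZENROWS_API_KEY"; // fill in your API Key
// At most 10 requests in flight at once; adjust to your plan's limit.
const client = new ZenRows(apiKey, { concurrency: 10 });

const urls = [
  "https://httpbin.org/anything?page=1", // placeholder URLs
  "https://httpbin.org/anything?page=2",
];

(async () => {
  // Fire all the requests; the SDK enqueues them and enforces the cap.
  const results = await Promise.allSettled(urls.map((url) => client.get(url)));

  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      const $ = cheerio.load(result.value.data);
      console.log({ url: urls[index], title: $("title").text() });
    } else {
      console.error(`Request failed for ${urls[index]}`);
    }
  });
})();
```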
Auto-Retry Failed Requests
The last step to having a robust scraper is to retry failed requests. We could use axios-retry, but the SDK already does that.
The basic idea goes like this:
- Identify the failed requests based on the returned status code.
- Wait an arbitrary amount of time. Using the library's exponentialDelay will increase the delay exponentially, plus a random margin, between attempts (see the sketch below).
- Retry the request until it succeeds or reaches a maximum number of retries.
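For illustration, a sketch of those three steps with axios-retry, assuming the library's classic CommonJS interface:

```js
const axios = require("axios");
const axiosRetry = require("axios-retry");

axiosRetry(axios, {
  retries: 3, // stop after a maximum of 3 attempts
  retryDelay: axiosRetry.exponentialDelay, // exponential backoff plus a random margin
  // Identify the failed requests worth retrying from the status code.
  retryCondition: (error) =>
    axiosRetry.isNetworkOrIdempotentRequestError(error) ||
    error.response?.status === 429,
});
```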
Keep in mind that all the retries for a request take place on the same concurrency slot, effectively blocking it. Not all errors are temporary, so retrying might not solve the issue. For those cases, a better strategy would be to store the URL as failed and enqueue it again after some minutes.
Passing an integer value to the SDK constructor is enough to set the number of retries you want. Visit the article on Retry Failed Requests for more info.
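Putting it all together, here is a sketch of the final snippet, assuming the SDK constructor accepts concurrency and retries options as described above; the parts that must be filled in are commented:

```js
const { ZenRows } = require("zenrows");
const cheerio = require("cheerio");

const apiKey = "YOUR_ZENROWS_API_KEY"; // fill in your API Key

// concurrency: maximum simultaneous requests (depends on your plan).
// retries: how many times a failed request is re-attempted.
const client = new ZenRows(apiKey, { concurrency: 10, retries: 2 });

// Replace with your list of target URLs.
const urls = [
  "https://httpbin.org/anything?page=1",
  "https://httpbin.org/anything?page=2",
];

// Your custom extracting logic goes here.
const extractContent = (url, html) => {
  const $ = cheerio.load(html);
  return { url, title: $("title").text(), h1: $("h1").text() };
};

(async () => {
  const results = await Promise.allSettled(urls.map((url) => client.get(url)));

  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      // In a real case, store the extracted content in a database.
      console.log(extractContent(urls[index], result.value.data));
    } else {
      // Failed even after the retries; consider re-enqueuing it later.
      console.error(`Request failed for ${urls[index]}`);
    }
  });
})();
```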
If the implementation does not work for your use case or you have any problem, contact us and we’ll gladly help you.