```bash
npm install
```
## Using Axios to Get a Page
The first library we will see is Axios, a “promise based HTTP client for the browser and node.js”. It exposes a `get` method that calls a URL and returns a response with the HTML. For the moment, we won't use any parameters; this is just a demo to see how it works. Careful! This script runs without any proxy, so the server will see your real IP. You don't need to execute this snippet.

## Calling ZenRows API with Axios
Connecting Axios to the ZenRows API is straightforward. `axios.get`'s target will be the API base, and the second parameter is an object with `params`: `apikey` for authentication and `url` for the target page. URLs must be encoded, but Axios will handle that automatically when using `params`.
With this simple change, we handle all the hassles of scraping, such as proxy rotation, bypassing CAPTCHAs and anti-bot solutions, setting correct headers, and many more. However, there are still some challenges, which we will address next. Continue reading.
## Extracting Basic Data with Cheerio
We will now parse the page's HTML with Cheerio and extract some data. We'll create a simple function, `extractContent`, that returns the URL, title, and `h1` content. Your custom extraction logic goes there.
Cheerio offers a “jQuery-like” syntax and is designed to work on the server. Its `load` method receives plain HTML and creates a querying function that lets us find elements. You can then query with CSS selectors and navigate, manipulate, or extract content as a browser would. The resulting selection exposes `text()`, which returns the content as plain text, without tags. Check the docs for more advanced features.
## List of URLs with Concurrency
We've seen how to scrape a single URL. We will now move to a list of URLs, which is closer to an actual use case. We'll also set up concurrency so we don't have to wait for a sequential process to finish: the script can process several URLs simultaneously, always capped at a maximum. That number depends on the plan you are on. The ZenRows JavaScript SDK provides full concurrency support, since JavaScript's built-in support for limiting concurrency is limited.

## Auto-Retry Failed Requests
The last step to having a robust scraper is to retry failed requests. We could use axios-retry, but the SDK already does that. The basic idea goes like this:

- Identify failed requests based on the returned status code.
- Wait some amount of time between attempts. Using the library's `exponentialDelay`, the wait grows exponentially, plus a random margin.
- Retry the request until it succeeds or reaches a maximum number of retries.