Python Requests and BeautifulSoup Integration
Learn how to integrate the ZenRows API with Python Requests and BeautifulSoup to extract the data you want, from basic calls to advanced features such as auto-retry and concurrency. We will walk through each stage of the process, from installation to the final code, explaining everything along the way.
For a short version, go to the final code and copy it. It includes comments marking the parts you must complete and helpful suggestions for the trickier details.
For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install requests beautifulsoup4.
You will also need to register to get your API Key.
Using Requests to Get a Page
The first library we will see is requests, an HTTP library for Python. It exposes a get method that will call a URL and return its HTML. For the time being, we won’t be using any parameters; this is simply a demo to see how it works.
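Here is a minimal sketch of that demo. The target URL is just a placeholder you can swap for any page.

```python
import requests

url = "https://www.zenrows.com"  # placeholder target page
response = requests.get(url)

print(response.status_code)  # 200 means the request succeeded
print(response.text)  # the page's HTML
```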
Careful! This script will run without any proxy, so the server will see your actual IP. You don’t need to run this snippet.
Calling ZenRows API with Requests
Connecting requests to the ZenRows API is straightforward. The get target will be the API base, followed by two params: apikey for authentication and url for the target page. URLs must be encoded; however, requests will handle that when using params.
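Here is a sketch of that call, assuming the API base https://api.zenrows.com/v1/ and a placeholder API Key and target URL.

```python
import requests

apikey = "YOUR_ZENROWS_API_KEY"  # replace with your API Key
url = "https://www.zenrows.com"  # placeholder target page

# requests encodes the target URL for us when it builds the query string
response = requests.get("https://api.zenrows.com/v1/", params={
    "apikey": apikey,
    "url": url,
})

print(response.text)
```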
With this simple update, we can handle most scraping problems, such as proxy rotation, setting correct headers, and avoiding CAPTCHAs and anti-bot solutions. But there are still a few issues, which we will address now. Keep reading.
Extracting Basic Data with BeautifulSoup
We’ll now use BeautifulSoup to parse the HTML on the page and extract some data. We will write a simple function called extract_content that returns the URL, title, and h1 content. This is where you can put your custom extraction logic.
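A minimal version of extract_content might look like the following sketch; the exact fields and parsing logic are yours to adapt.

```python
from bs4 import BeautifulSoup

def extract_content(url, html):
    # Parse the HTML and return the URL, title, and h1 content
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "h1": soup.h1.get_text(strip=True) if soup.h1 else None,
    }

# Example usage with the response from the previous snippet:
# print(extract_content(url, response.text))
```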
List of URLs with Concurrency
Up until now, we were scraping a single URL. We will now introduce a list of URLs, which is closer to a real-world use case. In addition, we will set up concurrency so we don’t have to wait for a sequential process to complete. It allows the script to process multiple URLs simultaneously, always up to a maximum determined by the plan you are on.
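Below is a sketch of how this could be wired together, reusing the same ZenRows call and extraction function as before, with a placeholder list of URLs and a limit of 10 concurrent requests.

```python
import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

apikey = "YOUR_ZENROWS_API_KEY"
concurrency = 10  # keep this at or below your plan's limit

urls = [
    "https://www.zenrows.com",
    "https://www.zenrows.com/blog",
    # ... the rest of your URLs
]

def extract_content(url, html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "h1": soup.h1.get_text(strip=True) if soup.h1 else None,
    }

def scrape_with_zenrows(url):
    response = requests.get("https://api.zenrows.com/v1/", params={
        "apikey": apikey,
        "url": url,
    })
    return extract_content(url, response.text)

# ThreadPool queues the URLs and runs at most `concurrency` requests at once
pool = ThreadPool(concurrency)
results = pool.map(scrape_with_zenrows, urls)
pool.close()
pool.join()

print(results)  # in a real project, store these in a database instead
```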
In short, the multiprocessing package implements a ThreadPool that will queue and execute all our requests. It handles the parallelism for us, capping the number of requests running at the same time so it never goes over the limit (10 in the example). Once all the requests finish, it groups the results in a single variable, and we print them. In a real case, you would store them in a database, for example.
Note that this is not a queue; we cannot add new URLs once the process starts. If that is your use case, check out our guide on how to Scrape and Crawl from a Seed URL.
Auto-Retry Failed Requests
The final step in creating a robust scraper is to retry failed requests. We will be using Retry from urllib3 and HTTPAdapter from requests.
The basic idea is as follows:
- Identify the failed requests using the returned status code.
- Wait an arbitrary amount of time. In our example, it will grow exponentially between tries.
- Retry the request until it succeeds or reaches a maximum number of retries.
Fortunately, we can use these two libraries to implement that behavior. We must first configure Retry and then mount the HTTPAdapter on a requests session. Unlike the previous snippets, we won’t be calling requests.get directly but requests_session.get. Once the session is created, it will use the same adapter for all subsequent calls.
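A sketch of that setup might look like this; the retry count, backoff factor, and status codes below are example values to adjust to your needs.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

apikey = "YOUR_ZENROWS_API_KEY"
url = "https://www.zenrows.com"

# Retry up to 3 times on these status codes, with exponentially growing waits between tries
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)

requests_session = requests.Session()
requests_session.mount("https://", HTTPAdapter(max_retries=retries))

# Every call made through the session reuses the mounted adapter and its retry policy
response = requests_session.get("https://api.zenrows.com/v1/", params={
    "apikey": apikey,
    "url": url,
})
print(response.status_code)
```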
For more information, visit the article on Retry Failed Requests.
If you have any problem with the implementation or it does not work for your use case, contact us and we’ll help you.