Scrape from a List of URLs
When embarking on web scraping, you often need to go beyond a single page and process a whole list of URLs. If you start with a single URL and need to discover additional links, refer to our guide on scraping and crawling from a seed URL.
We’ll explore two approaches to scraping a list of URLs: sequential and parallel processing. Sequential processing is straightforward and works well for a small number of URLs. However, parallel processing can significantly reduce the time required when dealing with a large set of URLs by handling multiple requests simultaneously.
Prerequisites
Ensure you have Python 3 installed. After setting up Python, install the necessary libraries:
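The examples below assume requests for HTTP and beautifulsoup4 for HTML parsing, which you can install with pip:

```bash
pip install requests beautifulsoup4
```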
These libraries will help you make HTTP requests and parse HTML content.
Sequential Processing
In sequential processing, URLs are processed one after another. This method is simple and involves iterating over the list of URLs using a for loop. This approach is suitable when dealing with a manageable number of URLs or when parallelism isn't a concern.
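A minimal sketch of this loop, assuming an example urls list and the requests library installed above:

```python
import requests

# Example list of pages to scrape
urls = [
    "https://www.example.com",
    "https://www.example.org",
]

# Process each URL one after another
for url in urls:
    response = requests.get(url)
    print(response.text)
```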
The above code fetches the HTML content for each URL and prints it. However, to make this useful, we can extract specific data, like the page title, using BeautifulSoup:
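One way to do that, keeping the same loop and assuming the page title is the data you're after:

```python
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.example.com",
    "https://www.example.org",
]

def extract_content(html):
    # Parse the HTML and return the page title, if present
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

for url in urls:
    response = requests.get(url)
    print(url, "->", extract_content(response.text))
```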
The extract_content function can be customized to extract any required data from the page.
Parallel Processing
Sequential processing is straightforward but can be slow for many URLs. Parallel processing allows multiple URLs to be processed simultaneously, significantly reducing the overall time. However, it’s essential to respect the server’s limitations and not overwhelm it with too many simultaneous requests.
Using Python's asyncio library and the ZenRows Python SDK, we can efficiently manage concurrency. The SDK handles creating a pool of workers and ensures the number of concurrent requests does not exceed the specified limit.
First, install the ZenRows SDK:
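The SDK is distributed on PyPI:

```bash
pip install zenrows
```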
We introduce the asyncio library and its asyncio.gather function to wait for all the calls to finish. Then, we can use client.get_async to request each URL. Internally, the SDK creates a pool with a maximum number of workers, efficiently managing their availability and ensuring tasks are processed within the set limit.
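A sketch of the asynchronous version, assuming a YOUR_ZENROWS_API_KEY placeholder, the example urls list from before, and the extract_content helper defined in the sequential section:

```python
import asyncio
from bs4 import BeautifulSoup
from zenrows import ZenRowsClient

urls = [
    "https://www.example.com",
    "https://www.example.org",
]

# Limit the number of simultaneous requests handled by the SDK
client = ZenRowsClient("YOUR_ZENROWS_API_KEY", concurrency=5)

def extract_content(html):
    # Parse the HTML and return the page title, if present
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

async def fetch(url):
    # Request the page through the SDK's async client
    response = await client.get_async(url)
    return extract_content(response.text)

async def main():
    # Run all requests concurrently and wait for every result
    results = await asyncio.gather(*(fetch(url) for url in urls))
    for url, title in zip(urls, results):
        print(url, "->", title)

asyncio.run(main())
```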
This script initializes a ZenRowsClient with a concurrency limit, fetches data asynchronously for each URL, and processes it using BeautifulSoup. The asyncio.gather function handles the concurrent execution of all requests.
For more detailed information on handling concurrency, refer to our concurrency guide, which includes examples in JavaScript.