Scrape from a List of URLs
When embarking on web scraping, you often need to go beyond a single page and process a whole list of URLs. If you start with a single URL and need to discover additional links, refer to our guide on scraping and crawling from a seed URL.
We’ll explore two approaches to scraping a list of URLs: sequential and parallel processing. Sequential processing is straightforward and works well for a small number of URLs. However, parallel processing can significantly reduce the time required when dealing with a large set of URLs by handling multiple requests simultaneously.
Prerequisites
Ensure you have Python 3 installed. After setting up Python, install the necessary libraries:
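The examples below assume requests for HTTP and beautifulsoup4 for HTML parsing, which you can install with pip:

```bash
pip install requests beautifulsoup4
```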
These libraries will help you make HTTP requests and parse HTML content.
Sequential Processing
In sequential processing, URLs are processed one after another. This method is simple and involves iterating over the list of URLs using a for loop. This approach is suitable when dealing with a manageable number of URLs or when parallelism isn't a concern.
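A minimal sketch of this loop, assuming an example urls list and the requests library installed above:

```python
import requests

# Example list of pages to scrape
urls = [
    "https://www.example.com",
    "https://www.example.org",
]

# Process each URL one after another
for url in urls:
    response = requests.get(url)
    print(response.text)
```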
The above code fetches the HTML content for each URL and prints it. However, to make this useful, we can extract specific data, like the page title, using BeautifulSoup:
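One way to do that, keeping the same loop and assuming the page title is the data you're after:

```python
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.example.com",
    "https://www.example.org",
]

def extract_content(html):
    # Parse the HTML and return the page title, if present
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

for url in urls:
    response = requests.get(url)
    print(url, "->", extract_content(response.text))
```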
The extract_content function can be customized to extract any required data from the page.
Parallel Processing
Sequential processing is straightforward but can be slow for many URLs. Parallel processing allows multiple URLs to be processed simultaneously, significantly reducing the overall time. However, it’s essential to respect the server’s limitations and not overwhelm it with too many simultaneous requests.
Using Python's asyncio library and the ZenRows Python SDK, we can efficiently manage concurrency. The SDK handles creating a pool of workers and ensures the number of concurrent requests does not exceed the specified limit.
First, install the ZenRows SDK:
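The SDK is distributed on PyPI:

```bash
pip install zenrows
```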
We introduce the asyncio library and its asyncio.gather function to wait for all the calls to finish. Then, we can use client.get_async to request each URL. Internally, the SDK creates a pool with a maximum number of workers, efficiently managing their availability and ensuring tasks are processed within the set limit.
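A sketch of the asynchronous version, assuming a YOUR_ZENROWS_API_KEY placeholder, the example urls list from before, and the extract_content helper defined in the sequential section:

```python
import asyncio
from bs4 import BeautifulSoup
from zenrows import ZenRowsClient

urls = [
    "https://www.example.com",
    "https://www.example.org",
]

# Limit the number of simultaneous requests handled by the SDK
client = ZenRowsClient("YOUR_ZENROWS_API_KEY", concurrency=5)

def extract_content(html):
    # Parse the HTML and return the page title, if present
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

async def fetch(url):
    # Request the page through the SDK's async client
    response = await client.get_async(url)
    return extract_content(response.text)

async def main():
    # Run all requests concurrently and wait for every result
    results = await asyncio.gather(*(fetch(url) for url in urls))
    for url, title in zip(urls, results):
        print(url, "->", title)

asyncio.run(main())
```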
This script initializes a ZenRowsClient with a concurrency limit, fetches data asynchronously for each URL, and processes it using BeautifulSoup. The asyncio.gather function handles the concurrent execution of all requests.
For more detailed information on handling concurrency, refer to our concurrency guide, which includes examples in JavaScript.