When you need to extract data from multiple web pages, you can process a list of URLs using either sequential or parallel approaches. Sequential processing handles URLs one at a time, while parallel processing—handling multiple requests simultaneously—significantly reduces scraping time for large URL lists.
This guide assumes you already have a list of URLs ready for scraping.
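If your URLs live in a file rather than hard-coded in a script, a minimal way to load them into a list looks like this (urls.txt is a hypothetical filename, assuming one URL per line):

```python
# Load URLs from a plain-text file, one URL per line
# (urls.txt is a hypothetical filename; adjust to your setup)
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(urls)} URLs")
```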
The techniques shown in this guide work with any programming language that supports HTTPS requests (PHP, Java, Ruby, Go, C#, etc.). We’re using Python and Node.js examples for simplicity, but you can adapt these patterns to your preferred language.
Sequential processing handles URLs one after another in a simple loop. This approach works well for small URL lists or when you want to avoid overwhelming target servers with concurrent requests.
```python
import requests

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

for url in urls:
    response = requests.get(zenrows_api_base, params={"url": url})
    print(response.text)
```
This code sends each URL to ZenRows sequentially and prints the returned HTML. The loop processes URLs in order, waiting for each request to complete before moving to the next.
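Printing raw HTML is rarely the end goal. The sketch below adds an extract_content helper (the same pattern used in the error-handling example further down) that parses each response with BeautifulSoup and pulls out the page title and main heading:

```python
import requests
from bs4 import BeautifulSoup

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def extract_content(url, soup):
    # Fall back to placeholder values when a page has no <title> or <h1>
    return {
        "url": url,
        "title": soup.title.string if soup.title else "No title",
        "h1": soup.find("h1").text if soup.find("h1") else "No H1"
    }

results = []

for url in urls:
    response = requests.get(zenrows_api_base, params={"url": url})
    soup = BeautifulSoup(response.text, "html.parser")
    results.append(extract_content(url, soup))

print(results)
```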
The extract_content function parses each page and extracts the title and main heading, storing the results in a list for further processing. It includes safety checks for pages that are missing a title or H1 element.
For production use, add error handling to manage failed requests:
```python
import requests
from bs4 import BeautifulSoup
import time

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def extract_content(url, soup):
    return {
        "url": url,
        "title": soup.title.string if soup.title else "No title",
        "h1": soup.find("h1").text if soup.find("h1") else "No H1"
    }

results = []

for url in urls:
    try:
        response = requests.get(zenrows_api_base, params={"url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        results.append(extract_content(url, soup))

        # Add delay between requests
        time.sleep(1)
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        results.append({"url": url, "error": str(e)})

print(results)
```
This version wraps each request in a try/except block to handle network errors, validates HTTP responses with raise_for_status(), and adds a delay between requests to stay respectful to target servers.
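If you expect transient failures, one option is to retry a request a few times before recording the error. The following sketch reuses zenrows_api_base from the example above; the max_retries and backoff values are illustrative, not ZenRows recommendations:

```python
import time
import requests

def fetch_with_retries(url, max_retries=3, backoff=2):
    """Retry a failed request with exponential backoff before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(zenrows_api_base, params={"url": url})
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt == max_retries:
                raise  # let the caller record the error as before
            time.sleep(backoff ** attempt)  # exponential backoff: 2s, then 4s, ...
```

You would then call fetch_with_retries(url) in place of the plain requests.get call inside the loop.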
Parallel processing handles multiple URLs simultaneously, dramatically reducing total scraping time. However, you must manage concurrency carefully to avoid overwhelming servers or exceeding rate limits.
While these examples use Python and Node.js, the same parallel processing concepts apply to other languages. Most modern programming languages provide similar concurrency features (async/await, thread pools, promises, etc.).
```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time

zenrows_api_base = "https://api.zenrows.com/v1/"
api_key = "YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def scrape_single_url(url):
    try:
        response = requests.get(zenrows_api_base, params={"apikey": api_key, "url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else "No title",
            "h1": soup.find("h1").text if soup.find("h1") else "No H1"
        }
    except Exception as e:
        return {"url": url, "error": str(e)}

def scrape_parallel_threads():
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_single_url, urls))

    end_time = time.time()
    print(f"Completed in {end_time - start_time:.2f} seconds")
    print(results)

if __name__ == "__main__":
    scrape_parallel_threads()
```
This approach uses ThreadPoolExecutor in Python and Promise.all in Node.js to handle multiple requests simultaneously. In Python, concurrency is capped by the max_workers parameter; in Node.js, Promise.all starts all requests at once and the event loop interleaves them.
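executor.map returns results in input order; if you would rather handle each result as soon as it finishes (for progress logging or incremental saving), concurrent.futures.as_completed is a standard alternative. A minimal sketch, reusing scrape_single_url and urls from the example above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_parallel_as_completed(max_workers=5):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every URL up front, then consume results as they finish
        future_to_url = {executor.submit(scrape_single_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            results.append(future.result())
            print(f"Finished {future_to_url[future]} ({len(results)}/{len(urls)})")
    return results
```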
For better control over concurrency, process URLs in batches:
```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time

zenrows_api_base = "https://api.zenrows.com/v1/"
api_key = "YOUR_ZENROWS_API_KEY"

def scrape_single_url(url):
    try:
        response = requests.get(zenrows_api_base, params={"apikey": api_key, "url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else "No title",
            "h1": soup.find("h1").text if soup.find("h1") else "No H1"
        }
    except Exception as e:
        return {"url": url, "error": str(e)}

def process_batch(urls_batch, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_single_url, urls_batch))

def scrape_with_batching(all_urls, batch_size=10, max_workers=5):
    all_results = []

    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        print(f"Processing batch {i//batch_size + 1} ({len(batch)} URLs)")

        batch_results = process_batch(batch, max_workers)
        all_results.extend(batch_results)

        # Optional delay between batches
        time.sleep(1)

    return all_results

if __name__ == "__main__":
    urls = [
        # ... your URLs here
    ]

    results = scrape_with_batching(urls, batch_size=5, max_workers=3)
    print(f"Total results: {len(results)}")
```
This batching approach processes URLs in smaller groups, giving you better control over server load and memory usage. You can adjust batch size and concurrency based on your specific needs.
ZenRows SDKs simplify parallel web scraping by providing built-in concurrency management, automatic error handling, and connection pooling. When you’re scraping multiple URLs, the SDKs handle the complex orchestration automatically, so you can focus on extracting the data you need rather than managing technical details.
First, install the ZenRows Python SDK:
```bash
pip install zenrows
```
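Before scraping your full list, you can sanity-check the installation and your API key with a single synchronous request. A minimal sketch (https://example.com stands in for any URL you want to test):

```python
from zenrows import ZenRowsClient

# Quick single-request check that the SDK and API key work
client = ZenRowsClient("YOUR_ZENROWS_API_KEY")
response = client.get("https://example.com")
print(response.status_code, len(response.text))
```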
With the SDK installed, you can scrape your URL list concurrently using the async client:
```python
from zenrows import ZenRowsClient
import asyncio
from bs4 import BeautifulSoup

client = ZenRowsClient("YOUR_ZENROWS_API_KEY", concurrency=5, retries=1)

urls = [
    # ... your URLs here
]

async def scrape_url(url):
    try:
        response = await client.get_async(url)

        if response.ok:
            soup = BeautifulSoup(response.text, "html.parser")
            return {
                "url": url,
                "title": soup.title.string if soup.title else "No title",
                "h1": soup.find("h1").text if soup.find("h1") else "No H1"
            }
        else:
            return {"url": url, "error": f"HTTP {response.status_code}"}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def main():
    results = await asyncio.gather(*[scrape_url(url) for url in urls])
    valid_results = [r for r in results if r is not None]
    print(valid_results)

if __name__ == "__main__":
    asyncio.run(main())
```
The SDKs handle connection pooling, retries, and concurrency limits for you. The concurrency=5 parameter ensures no more than 5 requests run simultaneously.
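Whichever approach you choose, you will usually want to persist the collected data rather than just print it. As a simple follow-on (assuming results is the list of dictionaries built in any of the examples above), you could write it to a JSON file with the standard library:

```python
import json

# Save the scraped results (a list of dicts) for later processing
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"Saved {len(results)} results to results.json")
```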