When you need to extract data from multiple web pages, you can process a list of URLs using either sequential or parallel approaches. Sequential processing handles URLs one at a time, while parallel processing—handling multiple requests simultaneously—significantly reduces scraping time for large URL lists.
This guide assumes you already have a list of URLs ready for scraping.
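If your URLs live in a file rather than hard-coded in a script, a minimal way to load them into a list looks like this (urls.txt is a hypothetical filename, assuming one URL per line):

```python
# Load URLs from a plain-text file, one URL per line
# (urls.txt is a hypothetical filename; adjust to your setup)
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(urls)} URLs")
```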
The techniques shown in this guide work with any programming language that supports HTTPS requests (PHP, Java, Ruby, Go, C#, etc.). We’re using Python and Node.js examples for simplicity, but you can adapt these patterns to your preferred language.
Sequential processing handles URLs one after another in a simple loop. This approach works well for small URL lists or when you want to avoid overwhelming target servers with concurrent requests.
```python
import requests

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

for url in urls:
    response = requests.get(zenrows_api_base, params={"url": url})
    print(response.text)
```
This code sends each URL to ZenRows sequentially and prints the returned HTML. The loop processes URLs in order, waiting for each request to complete before moving to the next.
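Printing raw HTML is rarely the end goal. The sketch below adds an extract_content helper (the same pattern used in the error-handling example further down) that parses each response with BeautifulSoup and pulls out the page title and main heading:

```python
import requests
from bs4 import BeautifulSoup

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def extract_content(url, soup):
    # Fall back to placeholder values when a page has no <title> or <h1>
    return {
        "url": url,
        "title": soup.title.string if soup.title else "No title",
        "h1": soup.find("h1").text if soup.find("h1") else "No H1"
    }

results = []

for url in urls:
    response = requests.get(zenrows_api_base, params={"url": url})
    soup = BeautifulSoup(response.text, "html.parser")
    results.append(extract_content(url, soup))

print(results)
```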
The extract_content function parses each page and extracts the title and main heading, storing the results in a list for further processing. It includes safety checks for pages that are missing a title or H1 element.
For production use, add error handling to manage failed requests:
```python
import requests
from bs4 import BeautifulSoup
import time

zenrows_api_base = "https://api.zenrows.com/v1/?apikey=YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def extract_content(url, soup):
    return {
        "url": url,
        "title": soup.title.string if soup.title else "No title",
        "h1": soup.find("h1").text if soup.find("h1") else "No H1"
    }

results = []

for url in urls:
    try:
        response = requests.get(zenrows_api_base, params={"url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        results.append(extract_content(url, soup))

        # Add delay between requests
        time.sleep(1)
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        results.append({"url": url, "error": str(e)})

print(results)
```
This version wraps each request in a try/except block to handle network errors, validates HTTP responses with raise_for_status(), and adds a delay between requests to stay respectful to target servers.
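If you expect transient failures, one option is to retry a request a few times before recording the error. The following sketch reuses zenrows_api_base from the example above; the max_retries and backoff values are illustrative, not ZenRows recommendations:

```python
import time
import requests

def fetch_with_retries(url, max_retries=3, backoff=2):
    """Retry a failed request with exponential backoff before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(zenrows_api_base, params={"url": url})
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt == max_retries:
                raise  # let the caller record the error as before
            time.sleep(backoff ** attempt)  # exponential backoff: 2s, then 4s, ...
```

You would then call fetch_with_retries(url) in place of the plain requests.get call inside the loop.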
Parallel processing handles multiple URLs simultaneously, dramatically reducing total scraping time. However, you must manage concurrency carefully to avoid overwhelming servers or exceeding rate limits.
While these examples use Python and Node.js, the same parallel processing concepts apply to other languages. Most modern programming languages provide similar concurrency features (async/await, thread pools, promises, etc.).
```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time

zenrows_api_base = "https://api.zenrows.com/v1/"
api_key = "YOUR_ZENROWS_API_KEY"

urls = [
    # ... your URLs here
]

def scrape_single_url(url):
    try:
        response = requests.get(zenrows_api_base, params={"apikey": api_key, "url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else "No title",
            "h1": soup.find("h1").text if soup.find("h1") else "No H1"
        }
    except Exception as e:
        return {"url": url, "error": str(e)}

def scrape_parallel_threads():
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_single_url, urls))

    end_time = time.time()
    print(f"Completed in {end_time - start_time:.2f} seconds")
    print(results)

if __name__ == "__main__":
    scrape_parallel_threads()
```
This approach uses ThreadPoolExecutor in Python and Promise.all in Node.js to handle multiple requests simultaneously. In Python, concurrency is capped by the max_workers parameter; in Node.js, Promise.all starts all requests at once and the event loop interleaves them.
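executor.map returns results in input order; if you would rather handle each result as soon as it finishes (for progress logging or incremental saving), concurrent.futures.as_completed is a standard alternative. A minimal sketch, reusing scrape_single_url and urls from the example above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_parallel_as_completed(max_workers=5):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every URL up front, then consume results as they finish
        future_to_url = {executor.submit(scrape_single_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            results.append(future.result())
            print(f"Finished {future_to_url[future]} ({len(results)}/{len(urls)})")
    return results
```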
For better control over concurrency, process URLs in batches:
```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time

zenrows_api_base = "https://api.zenrows.com/v1/"
api_key = "YOUR_ZENROWS_API_KEY"

def scrape_single_url(url):
    try:
        response = requests.get(zenrows_api_base, params={"apikey": api_key, "url": url})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else "No title",
            "h1": soup.find("h1").text if soup.find("h1") else "No H1"
        }
    except Exception as e:
        return {"url": url, "error": str(e)}

def process_batch(urls_batch, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_single_url, urls_batch))

def scrape_with_batching(all_urls, batch_size=10, max_workers=5):
    all_results = []

    for i in range(0, len(all_urls), batch_size):
        batch = all_urls[i:i + batch_size]
        print(f"Processing batch {i//batch_size + 1} ({len(batch)} URLs)")

        batch_results = process_batch(batch, max_workers)
        all_results.extend(batch_results)

        # Optional delay between batches
        time.sleep(1)

    return all_results

if __name__ == "__main__":
    urls = [
        # ... your URLs here
    ]

    results = scrape_with_batching(urls, batch_size=5, max_workers=3)
    print(f"Total results: {len(results)}")
```
This batching approach processes URLs in smaller groups, giving you better control over server load and memory usage. You can adjust batch size and concurrency based on your specific needs.
ZenRows SDKs simplify parallel web scraping by providing built-in concurrency management, automatic error handling, and connection pooling. When you’re scraping multiple URLs, the SDKs handle the complex orchestration automatically, so you can focus on extracting the data you need rather than managing technical details.
First, install the ZenRows Python SDK:
```bash
pip install zenrows
```
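Before scraping your full list, you can sanity-check the installation and your API key with a single synchronous request. A minimal sketch (https://example.com stands in for any URL you want to test):

```python
from zenrows import ZenRowsClient

# Quick single-request check that the SDK and API key work
client = ZenRowsClient("YOUR_ZENROWS_API_KEY")
response = client.get("https://example.com")
print(response.status_code, len(response.text))
```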
With the SDK installed, you can scrape your URL list concurrently using the async client:
```python
from zenrows import ZenRowsClient
import asyncio
from bs4 import BeautifulSoup

client = ZenRowsClient("YOUR_ZENROWS_API_KEY", concurrency=5, retries=1)

urls = [
    # ... your URLs here
]

async def scrape_url(url):
    try:
        response = await client.get_async(url)

        if response.ok:
            soup = BeautifulSoup(response.text, "html.parser")
            return {
                "url": url,
                "title": soup.title.string if soup.title else "No title",
                "h1": soup.find("h1").text if soup.find("h1") else "No H1"
            }
        else:
            return {"url": url, "error": f"HTTP {response.status_code}"}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def main():
    results = await asyncio.gather(*[scrape_url(url) for url in urls])
    valid_results = [r for r in results if r is not None]
    print(valid_results)

if __name__ == "__main__":
    asyncio.run(main())
```
The SDKs handle connection pooling, retries, and concurrency limits for you. The concurrency=5 parameter ensures no more than 5 requests run simultaneously.
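Whichever approach you choose, you will usually want to persist the collected data rather than just print it. As a simple follow-on (assuming results is the list of dictionaries built in any of the examples above), you could write it to a JSON file with the standard library:

```python
import json

# Save the scraped results (a list of dicts) for later processing
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"Saved {len(results)} results to results.json")
```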