Concurrency in web scraping is essential for efficient data extraction, especially when dealing with multiple URLs. Managing the number of concurrent requests helps prevent overwhelming the target server and ensures you stay within rate limits. Depending on your subscription plan, you can perform twenty or more concurrent requests.

In short, concurrency refers to the number of API requests you can have in progress (or running) simultaneously. If your plan supports 5 concurrent requests, you can process up to 5 requests simultaneously. You’ll get an error if you send a sixth request while five are already processing.

Understanding Concurrency

Concurrency is a fundamental concept in web scraping, referring to the ability to handle multiple tasks simultaneously. In the context of ZenRows, it defines how many scraping requests can be processed at the same time.

Think of concurrency like a team of workers in a factory. Each worker represents a “concurrent request slot.” If you have 5 workers, you can assign them 5 tasks (or requests) simultaneously. If you try to assign a 6th task while all workers are occupied, you will need to wait until one of them finishes their current task before the new one can be started.

In ZenRows, each “task” is an API request, and each “worker” is a concurrent request slot available to you based on your subscription.

Impact of Request Duration on Throughput

The duration that each request takes to complete significantly influences how many requests you can process in a given timeframe. This concept is crucial for optimizing your scraping efficiency and maximizing throughput. Here’s how it works:

  • Fast Requests: If each request takes 1 second to complete and you have 5 concurrent slots available, you can process 5 requests every second. Over a 60-second period, this means you can handle 300 requests (5 requests/second × 60 seconds).
  • Slow Requests: Conversely, if each request takes 10 seconds to complete, you can process 5 requests every 10 seconds. Over the same 60-second period, you’ll only manage 30 requests (5 requests/10 seconds × 60 seconds).

This demonstrates that reducing the duration of each request increases the number of requests you can process in the same amount of time.
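
To make the arithmetic concrete, here is a back-of-the-envelope calculation that reproduces both figures above (the slot count and durations are the example values from this section, not real measurements):

throughput.py
concurrency = 5       # concurrent request slots on the plan
window_seconds = 60   # measurement window

for avg_duration in (1, 10):  # fast (1 s) vs. slow (10 s) requests
	throughput = concurrency / avg_duration * window_seconds
	print(f"{avg_duration}s per request -> {throughput:.0f} requests per minute")

# Output:
# 1s per request -> 300 requests per minute
# 10s per request -> 30 requests per minute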

Example Scenario

To better understand this, consider a situation where your plan allows 5 concurrent requests:

Scenario:

  • 1st Request: Takes 10 seconds to finish.
  • 2nd Request: Takes 7 seconds to finish.
  • 3rd Request: Takes 8 seconds to finish.
  • 4th Request: Takes 9 seconds to finish.
  • 5th Request: Takes 14 seconds to finish.

You start all 5 requests simultaneously. Each request occupies one of the 5 available slots. If you then attempt to send a:

  • 6th & 7th Request: Since all 5 slots are occupied, you will receive ”429 Too Many Requests” errors. The system can only process additional requests once one of the initial 5 requests finishes. In this example, the quickest request (the 2nd request) completes in 7 seconds, freeing up a slot for new requests.

Concurrency Headers

To help you manage and optimize your API usage, each response from our API includes two important HTTP headers related to concurrency:

  1. Concurrency-Limit: Indicates the total number of concurrent requests allowed by your current plan. This header helps you understand the maximum concurrency capacity available to you.
  2. Concurrency-Remaining: Shows the number of available concurrency slots at the time the request was received by the server. This provides insight into how many slots are still free.

For example, if your plan supports 20 concurrent requests and you send 3 requests simultaneously, the response headers might be:

  • Concurrency-Limit: 20
  • Concurrency-Remaining: 17

This means that at the time of the request, 17 slots were available, while 3 were occupied by the requests you had in progress.
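
Reading these headers in code is straightforward. A minimal example with the requests library (the target URL is just a placeholder):

check_concurrency.py
import requests

response = requests.get(
	"https://api.zenrows.com/v1/",
	params={"url": "https://example.com", "apikey": "YOUR_ZENROWS_API_KEY"},
)

# Both headers are included in every API response.
limit = int(response.headers.get("Concurrency-Limit", 0))
remaining = int(response.headers.get("Concurrency-Remaining", 0))
print(f"{remaining} of {limit} concurrency slots are free")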

Using Concurrency Headers for Optimization

These headers are valuable tools for optimizing your scraping tasks. By monitoring and interpreting these headers in real time, you can adjust your request patterns to make the most efficient use of your concurrency slots.

Optimization Tips:

  1. Before sending a batch of requests, inspect the Concurrency-Remaining header of the most recent response.
  2. Based on the value of this header, adjust the number of parallel requests you send. For example, if Concurrency-Remaining is 5, avoid sending more than 5 simultaneous requests.

By adapting your request strategy based on these headers, you can reduce the likelihood of encountering “429 Too Many Requests” errors and ensure a smoother, more efficient interaction with the API.
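
One way to put this into practice is to probe with a single request, read Concurrency-Remaining, and size the next batch accordingly. This is only a sketch of that pattern, reusing the ThreadPool approach described later in this guide; the probe-then-batch logic is illustrative, not part of the API:

adaptive_batch.py
import requests
from multiprocessing.pool import ThreadPool

apikey = "YOUR_ZENROWS_API_KEY"
urls = [
	# ... your URLs here
]

def scrape(url):
	return requests.get(
		"https://api.zenrows.com/v1/",
		params={"url": url, "apikey": apikey},
	)

# Probe with one request to see how many slots are currently free.
probe = scrape(urls[0])
remaining = int(probe.headers.get("Concurrency-Remaining", 1))

# Cap the batch at the number of free slots (at least 1) to avoid 429 errors.
pool = ThreadPool(max(1, remaining))
results = pool.map(scrape, urls[1:])
pool.close()
pool.join()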

Using Concurrency

Most programming languages and HTTP clients don't manage request concurrency for you out of the box, so you may need to implement your own solution. Alternatively, you can use the solutions provided below. 😉

ZenRows SDK for Python

To run the examples, ensure you have Python 3 installed. Install the necessary libraries with:

pip install zenrows

The ZenRows Python SDK comes with built-in concurrency and retries. You can set both parameters in the constructor. Keep in mind that each client instance has its own limit, so running multiple scripts simultaneously might lead to 429 Too Many Requests errors.

The asyncio.gather function will wait for all the calls to finish and collect all the responses in a list. Afterward, you can loop over the list and extract the necessary data. Each response will include the status, request, response content, and other values. Remember to run the script with asyncio.run to avoid a “coroutine 'main' was never awaited” error.

scraper.py
from zenrows import ZenRowsClient
import asyncio
from urllib.parse import urlparse, parse_qs

# Set the concurrency to match your plan's limit; retries apply to failed requests.
client = ZenRowsClient("YOUR_ZENROWS_API_KEY", concurrency=5, retries=1)

urls = [
	# ...
]

async def main():
	# Fire all requests at once; the SDK throttles them to the configured concurrency.
	responses = await asyncio.gather(*[client.get_async(url) for url in urls])

	for response in responses:
		# The target URL travels as a query parameter, so recover it from the request URL.
		# parse_qs returns a list of values, hence the [0].
		original_url = parse_qs(urlparse(response.request.url).query)["url"][0]
		print({
			"response": response,
			"status_code": response.status_code,
			"request_url": original_url,
		})

asyncio.run(main())

Python with requests

If you prefer using the requests library and want to handle multiple requests concurrently, Python’s multiprocessing package can be an effective solution. This approach is particularly useful when you’re dealing with a large list of URLs and need to speed up the data collection process by sending multiple requests simultaneously.

pip install requests

The multiprocessing package in Python includes a ThreadPool class, which allows you to manage a pool of worker threads. Each thread can handle a separate task, enabling multiple HTTP requests to be processed in parallel. This is particularly beneficial when scraping data from a large number of websites, as it reduces the overall time required.

scraper.py
import requests
from multiprocessing.pool import ThreadPool

apikey = "YOUR_ZENROWS_API_KEY"
concurrency = 10  # set this to match your plan's concurrency limit
urls = [
	# ... your URLs here
]

def scrape_with_zenrows(url):
	# Each worker thread sends one URL through the ZenRows API endpoint.
	response = requests.get(
		url="https://api.zenrows.com/v1/",
		params={
			"url": url,
			"apikey": apikey,
		},
	)

	return {
		"content": response.text,
		"status_code": response.status_code,
		"request_url": url,
	}

# One worker thread per concurrent slot; map() blocks until every URL is processed.
pool = ThreadPool(concurrency)
results = pool.map(scrape_with_zenrows, urls)
pool.close()
pool.join()

for result in results:
	print(result)

ZenRows SDK for JavaScript

When working with JavaScript for web scraping, managing concurrency and handling retries can be challenging.

The ZenRows JavaScript SDK simplifies these tasks by providing built-in concurrency and retry options. This is particularly useful for developers who need to scrape multiple URLs efficiently while avoiding rate limits.

To get started, install the ZenRows SDK using npm:

npm i zenrows

ZenRows allows you to control the concurrency level by passing a number in the constructor. It’s important to set this according to your subscription plan’s limits to prevent 429 (Too Many Requests) errors. Remember, each client instance has its own concurrency limit, so running multiple scripts won’t share this limit.

const { ZenRows } = require('zenrows');

const apiKey = 'YOUR_ZENROWS_API_KEY';

(async () => {
	const client = new ZenRows(apiKey, { concurrency: 5, retries: 1 });

	const urls = [
		// ...
	];
	const promises = urls.map(url => client.get(url));

	const results = await Promise.allSettled(promises);
	console.log(results);
	/*
	[
		{
			status: 'fulfilled',
			value: {
				status: 200,
				statusText: 'OK',
				data: ...
		...
	*/

	// separate results list into rejected and fulfilled for later processing
	const rejected = results.filter(({ status }) => status === 'rejected');
	const fulfilled = results.filter(({ status }) => status === 'fulfilled');
})();

In this example, we use Promise.allSettled() to handle multiple asynchronous requests. This method is available in Node.js 12.9 and later. It waits for all the promises to settle, meaning it doesn’t stop if some requests fail. Instead, it returns an array of objects, each with a status of either fulfilled or rejected.

This approach makes your scraping more robust, as it ensures that all URLs in your list are processed, even if some requests encounter issues. You can then handle the fulfilled and rejected responses separately, allowing you to log errors or retry failed requests as needed.

Frequently Asked Questions (FAQ)