Scrapy is a powerful web scraping framework, but anti-scraping measures can make it challenging. A ZenRows Scrapy integration can overcome these obstacles.

In this tutorial, you’ll learn how to get your ZenRows credentials and integrate ZenRows with Scrapy in two ways: directly via Residential Proxies and through the Scraper API with the ZenRows middleware.

Use ZenRows’ Proxies with Scrapy to Avoid Blocks

ZenRows offers premium proxies in 190+ countries that auto-rotate the IP address for you; the Scraper API also rotates the User-Agent header. Integrate them into Scrapy to appear as a different user on every request and drastically reduce your chances of getting blocked.

ZenRows provides two options for integrating proxies with Scrapy:

  1. Residential Proxies: With Residential Proxies, you can directly access our dedicated proxy network, billed by bandwidth usage. This option is ideal if you need flexible, on-demand proxy access.

  2. Scraper API with ZenRows Middleware: Our Scraper API is optimized for high-demand scraping scenarios and is billed per request based on the chosen parameters. Using the ZenRows Middleware for Scrapy allows you to seamlessly connect your Scrapy project to the Scraper API, automatically routing requests through the Premium Proxy and handling API-specific configurations.
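If you opt for Residential Proxies instead, no extra middleware is needed: Scrapy already supports upstream proxies through the standard proxy meta key. Here is a minimal sketch; the username, password, host, and port below are placeholders, so use the endpoint shown in your ZenRows dashboard:

```python
# Placeholder credentials and endpoint for illustration only;
# copy the real values from your ZenRows dashboard.
PROXY_USER = "YOUR_PROXY_USERNAME"
PROXY_PASS = "YOUR_PROXY_PASSWORD"
PROXY_HOST = "superproxy.zenrows.com"  # assumed host; verify in your dashboard
PROXY_PORT = 1337                      # assumed port; verify in your dashboard

def residential_proxy_url() -> str:
    # Scrapy accepts an upstream proxy as a full URL with embedded credentials.
    return f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

# In a spider, route any request through the proxy with Scrapy's
# standard "proxy" meta key:
#   yield scrapy.Request(url, meta={"proxy": residential_proxy_url()})
```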

In this tutorial, we’ll focus on using the Scraper API with the ZenRows Middleware, the recommended setup for seamless Scrapy integration.

Let’s assume you have set up a Scrapy project with the initial script below.

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]

    def parse(self, response):
        pass
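The parse callback above is still a stub. Once requests start flowing, https://httpbin.io/ip returns a small JSON body whose origin field is the IP address the site saw. A minimal helper you could call from parse (for example, self.log(parse_ip(response.text))) might look like this sketch:

```python
import json

def parse_ip(body: str) -> str:
    # httpbin.io/ip responds with JSON like {"origin": "203.0.113.7"};
    # pull out the "origin" field to see which IP the site observed.
    return json.loads(body)["origin"]
```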

Follow the steps below to integrate ZenRows proxies into this scraper!

Integrate the ZenRows Middleware into Scrapy!

The ZenRows Middleware for Scrapy allows seamless integration of the ZenRows Scraper API into Scrapy projects. This middleware helps you manage proxy settings, enable advanced features like JavaScript rendering, and apply custom headers and cookies.

Installation

First, install the scrapy-zenrows package, which provides the necessary middleware for integrating ZenRows with Scrapy.

pip install scrapy-zenrows

Usage

To use the ZenRows Scraper API with Scrapy, sign in to ZenRows to obtain your API key. The API key gives you access to the Premium Proxy, JavaScript rendering, and other advanced scraping features.

Setting Up Global Middleware

To enable ZenRows as the default proxy across all Scrapy requests, add ZenRows Middleware to your project’s settings.py file. This setup configures your Scrapy spiders to use the ZenRows API for every request automatically.

settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zenrows.ZenRowsMiddleware": 543,  # Add ZenRows Middleware
}

# ZenRows API Key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"
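Conceptually, the middleware proxies each Scrapy request through the Scraper API endpoint, passing your API key, the target URL, and any feature flags as query parameters. A rough stdlib sketch of that URL construction (an illustration of the idea, not the middleware’s actual internals):

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.zenrows.com/v1/"

def zenrows_api_url(target_url: str, api_key: str, **params) -> str:
    # Build the Scraper API call: the API key, the page to fetch, and any
    # feature flags (premium_proxy, js_render, ...) go in the query string.
    query = {"apikey": api_key, "url": target_url, **params}
    return f"{API_ENDPOINT}?{urlencode(query)}"
```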

Enabling Premium Proxy and JavaScript Rendering

ZenRows offers Premium Proxy and JavaScript rendering features, which are essential for handling websites that require complex interactions or are protected by anti-bot systems. To enable these features for all requests, configure them in settings.py:

settings.py
# ...

USE_ZENROWS_PREMIUM_PROXY = True  # Enable Premium Proxy for all requests (default is False)
USE_ZENROWS_JS_RENDER = True      # Enable JavaScript rendering for all requests (default is False)

By default, both features are disabled to keep requests lean and cost-effective.

Customizing ZenRows Middleware for Specific Requests

In scenarios where you don’t need Premium Proxy or JavaScript rendering for every request (e.g., for only certain pages or spiders), you can override global settings and apply these features only to specific requests. This is done using the ZenRowsRequest class, which provides a flexible way to configure ZenRows on a per-request basis.

scraper.py
from scrapy_zenrows import ZenRowsRequest

class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/ip"]

    def start_requests(self):
        # Use ZenRowsRequest to customize settings per request
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",       # Enable JavaScript rendering for this request
                    "premium_proxy": "true",   # Enable Premium Proxy for this request
                },
            )

In this example, ZenRowsRequest is configured with js_render and premium_proxy set to true, ensuring that only this specific request uses JavaScript rendering and Premium Proxy.
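The precedence is simple: per-request parameters win over the project-wide defaults from settings.py. A tiny model of that merge logic (an illustration of the behavior, not the middleware’s actual code):

```python
# Project-wide defaults, mirroring the USE_ZENROWS_* settings (illustrative).
GLOBAL_DEFAULTS = {"premium_proxy": "true", "js_render": "false"}

def effective_params(request_params: dict) -> dict:
    # Start from the global defaults, then let per-request values override.
    merged = dict(GLOBAL_DEFAULTS)
    merged.update(request_params)
    return merged
```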

Using Additional Request Parameters

The ZenRowsRequest class supports several other parameters, allowing you to customize each request to meet specific requirements. Here are some useful parameters:

  • proxy_country: Specifies the country for the proxy, useful for geo-targeting.
  • js_instructions: Allows custom JavaScript actions on the page, such as waiting for elements to load.
  • autoparse: Automatically extracts data from supported websites.
  • outputs: Extracts specific content types like tables, images, or links.
  • css_extractor: Allows CSS-based content extraction.

Here’s an example of using these advanced parameters:

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/ip"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",       # Enable JavaScript rendering for this request
                    "premium_proxy": "true",   # Enable Premium Proxy for this request
                    "proxy_country": "ca",     # Use a proxy from Canada
                    "js_instructions": '[{"wait": 500}]',    # Wait 500ms after page load
                    "autoparse": "true",                     # Enable automatic parsing
                    "outputs": "tables",                     # Extract tables from the page
                    "css_extractor": '{"links":"a @href","images":"img @src"}'  # Extract links and images
                },
            )

Refer to the ZenRows Scraper API documentation for a complete list of supported parameters.
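Note that when outputs or css_extractor is set, the API responds with JSON rather than raw HTML, so your parse callback should decode it. A small sketch, assuming the css_extractor configuration shown above:

```python
import json

def unpack_extraction(body: str):
    # With css_extractor='{"links":"a @href","images":"img @src"}', the API
    # returns JSON such as {"links": [...], "images": [...]}.
    data = json.loads(body)
    return data.get("links", []), data.get("images", [])
```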

Customizing Headers with ZenRows

Certain websites require specific headers (such as Referer or Origin) for successful scraping. ZenRows Middleware allows you to set custom headers on a per-request basis. When using custom headers, set the custom_headers parameter to "true" so that ZenRows includes your headers while managing essential browser headers on its end.

Here’s an example of setting a custom Referer header:

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/anything"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "custom_headers": "true",  # Enable custom headers for this request
                },
                headers={
                    "Referer": "https://www.google.com/",  # Set a custom Referer header
                },
            )

For cookies, pass them as a dictionary via the request’s cookies argument. Just as with custom headers, custom_headers must be set to "true" for ZenRows to forward your cookies. This is particularly useful for handling sessions or accessing region-specific content.

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/anything"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "custom_headers": "true",  # Allow custom cookies
                },
                cookies={
                    "currency": "USD",
                    "country": "UY",
                },
            )

Cookies are often required to maintain user sessions or comply with location-based content restrictions. For more information on cookies and headers, see the ZenRows headers feature documentation.
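For reference, cookies passed this way end up serialized into a single standard Cookie header of name=value pairs, per ordinary HTTP behavior:

```python
def cookie_header(cookies: dict) -> str:
    # Standard HTTP serialization: "name=value" pairs joined by "; ".
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, cookie_header({"currency": "USD", "country": "UY"}) yields "currency=USD; country=UY".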

Pricing

ZenRows’ Scraper API operates on a pay-per-success model: you only pay for requests that return the desired result. Residential Proxies are billed by bandwidth used.

To maximize your scraper’s success rate, route all your Scrapy requests through ZenRows. Different pages on the same site may have different levels of protection, but the parameters recommended above will have you covered.

ZenRows offers a range of plans, starting at just $69 monthly. For more detailed information, please refer to our pricing page.
