Scrapy is a powerful web scraping framework, but anti-scraping measures can make it challenging. A ZenRows Scrapy integration can overcome these obstacles.

In this tutorial, you’ll learn how to get your ZenRows credentials and integrate ZenRows with Scrapy in two ways: directly via Residential Proxies and through the Scraper API with the ZenRows middleware.

Use ZenRows’ Proxies with Scrapy to Avoid Blocks

ZenRows offers premium proxies in 190+ countries that auto-rotate the IP address for you; the Scraper API also rotates the User-Agent header. Integrate them into Scrapy to appear as a different user on every request and drastically reduce your chances of getting blocked.

ZenRows provides two options for integrating proxies with Scrapy:

  1. Residential Proxies: With Residential Proxies, you can directly access our dedicated proxy network, billed by bandwidth usage. This option is ideal if you need flexible, on-demand proxy access.

  2. Scraper API with ZenRows Middleware: Our Scraper API is optimized for high-demand scraping scenarios and is billed per request based on the chosen parameters. Using the ZenRows Middleware for Scrapy allows you to seamlessly connect your Scrapy project to the Scraper API, automatically routing requests through the Premium Proxy and handling API-specific configurations.
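If you opt for Residential Proxies instead, no extra middleware is needed: Scrapy already supports upstream proxies through the standard proxy meta key. Here is a minimal sketch; the username, password, host, and port below are placeholders, so use the endpoint shown in your ZenRows dashboard:

```python
# Placeholder credentials and endpoint for illustration only;
# copy the real values from your ZenRows dashboard.
PROXY_USER = "YOUR_PROXY_USERNAME"
PROXY_PASS = "YOUR_PROXY_PASSWORD"
PROXY_HOST = "superproxy.zenrows.com"  # assumed host; verify in your dashboard
PROXY_PORT = 1337                      # assumed port; verify in your dashboard

def residential_proxy_url() -> str:
    # Scrapy accepts an upstream proxy as a full URL with embedded credentials.
    return f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

# In a spider, route any request through the proxy with Scrapy's
# standard "proxy" meta key:
#   yield scrapy.Request(url, meta={"proxy": residential_proxy_url()})
```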

In this tutorial, we’ll focus on using the Scraper API with the ZenRows Middleware, the recommended setup for seamless Scrapy integration.

Let’s assume you have set up a Scrapy project with the initial script below.

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]

    def parse(self, response):
        pass
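The parse callback above is still a stub. Once requests start flowing, https://httpbin.io/ip returns a small JSON body whose origin field is the IP address the site saw. A minimal helper you could call from parse (for example, self.log(parse_ip(response.text))) might look like this sketch:

```python
import json

def parse_ip(body: str) -> str:
    # httpbin.io/ip responds with JSON like {"origin": "203.0.113.7"};
    # pull out the "origin" field to see which IP the site observed.
    return json.loads(body)["origin"]
```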

Follow the steps below to integrate ZenRows proxies into this scraper!

Integrate the ZenRows Middleware into Scrapy!

The ZenRows Middleware for Scrapy allows seamless integration of the ZenRows Scraper API into Scrapy projects. This middleware helps you manage proxy settings, enable advanced features like JavaScript rendering, and apply custom headers and cookies.

Installation

First, install the scrapy-zenrows package, which provides the necessary middleware for integrating ZenRows with Scrapy.

pip install scrapy-zenrows

Usage

To use the ZenRows Scraper API with Scrapy, sign in to ZenRows to obtain your API key. The API key gives you access to the Premium Proxy, JavaScript rendering, and other advanced scraping features.

Setting Up Global Middleware

To enable ZenRows as the default proxy across all Scrapy requests, add ZenRows Middleware to your project’s settings.py file. This setup configures your Scrapy spiders to use the ZenRows API for every request automatically.

settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zenrows.ZenRowsMiddleware": 543,  # Add ZenRows Middleware
}

# ZenRows API Key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"
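Conceptually, the middleware proxies each Scrapy request through the Scraper API endpoint, passing your API key, the target URL, and any feature flags as query parameters. A rough stdlib sketch of that URL construction (an illustration of the idea, not the middleware’s actual internals):

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.zenrows.com/v1/"

def zenrows_api_url(target_url: str, api_key: str, **params) -> str:
    # Build the Scraper API call: the API key, the page to fetch, and any
    # feature flags (premium_proxy, js_render, ...) go in the query string.
    query = {"apikey": api_key, "url": target_url, **params}
    return f"{API_ENDPOINT}?{urlencode(query)}"
```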

Enabling Premium Proxy and JavaScript Rendering

ZenRows offers Premium Proxy and JavaScript rendering features, which are essential for handling websites that require complex interactions or are protected by anti-bot systems. To enable these features for all requests, configure them in settings.py:

settings.py
# ...

USE_ZENROWS_PREMIUM_PROXY = True  # Enable Premium Proxy for all requests (default is False)
USE_ZENROWS_JS_RENDER = True      # Enable JavaScript rendering for all requests (default is False)

By default, both features are disabled to keep requests lean and cost-effective.

Customizing ZenRows Middleware for Specific Requests

In scenarios where you don’t need Premium Proxy or JavaScript rendering for every request (e.g., for only certain pages or spiders), you can override global settings and apply these features only to specific requests. This is done using the ZenRowsRequest class, which provides a flexible way to configure ZenRows on a per-request basis.

scraper.py
from scrapy_zenrows import ZenRowsRequest

class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/ip"]

    def start_requests(self):
        # Use ZenRowsRequest to customize settings per request
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",       # Enable JavaScript rendering for this request
                    "premium_proxy": "true",   # Enable Premium Proxy for this request
                },
            )

In this example, ZenRowsRequest is configured with js_render and premium_proxy set to true, ensuring that only this specific request uses JavaScript rendering and Premium Proxy.
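The precedence is simple: per-request parameters win over the project-wide defaults from settings.py. A tiny model of that merge logic (an illustration of the behavior, not the middleware’s actual code):

```python
# Project-wide defaults, mirroring the USE_ZENROWS_* settings (illustrative).
GLOBAL_DEFAULTS = {"premium_proxy": "true", "js_render": "false"}

def effective_params(request_params: dict) -> dict:
    # Start from the global defaults, then let per-request values override.
    merged = dict(GLOBAL_DEFAULTS)
    merged.update(request_params)
    return merged
```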

Using Additional Request Parameters

The ZenRowsRequest class supports several other parameters, allowing you to customize each request to meet specific requirements. Here are some useful parameters:

  • proxy_country: Specifies the country for the proxy, useful for geo-targeting.
  • js_instructions: Allows custom JavaScript actions on the page, such as waiting for elements to load.
  • autoparse: Automatically extracts data from supported websites.
  • outputs: Extracts specific content types like tables, images, or links.
  • css_extractor: Allows CSS-based content extraction.

Here’s an example of using these advanced parameters:

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/ip"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",       # Enable JavaScript rendering for this request
                    "premium_proxy": "true",   # Enable Premium Proxy for this request
                    "proxy_country": "ca",     # Use a proxy from Canada
                    "js_instructions": '[{"wait": 500}]',    # Wait 500ms after page load
                    "autoparse": "true",                     # Enable automatic parsing
                    "outputs": "tables",                     # Extract tables from the page
                    "css_extractor": '{"links":"a @href","images":"img @src"}'  # Extract links and images
                },
            )

Refer to the ZenRows Scraper API documentation for a complete list of supported parameters.
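Note that when outputs or css_extractor is set, the API responds with JSON rather than raw HTML, so your parse callback should decode it. A small sketch, assuming the css_extractor configuration shown above:

```python
import json

def unpack_extraction(body: str):
    # With css_extractor='{"links":"a @href","images":"img @src"}', the API
    # returns JSON such as {"links": [...], "images": [...]}.
    data = json.loads(body)
    return data.get("links", []), data.get("images", [])
```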

Customizing Headers with ZenRows

Certain websites require specific headers (such as Referer or Origin) for successful scraping. ZenRows Middleware allows you to set custom headers on a per-request basis. When using custom headers, set the custom_headers parameter to "true" so that ZenRows includes your headers while managing essential browser headers on its end.

Here’s an example of setting a custom Referer header:

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/anything"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "custom_headers": "true",  # Enable custom headers for this request
                },
                headers={
                    "Referer": "https://www.google.com/",  # Set a custom Referer header
                },
            )

For cookies, pass them as a dictionary via the request’s cookies argument. Just as with custom headers, custom_headers must be set to "true" for ZenRows to forward your cookies. This is particularly useful for handling sessions or accessing region-specific content.

scraper.py
class YourSpider(scrapy.Spider):
    name = "your_spider"
    start_urls = ["https://httpbin.io/anything"]

    def start_requests(self):
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "custom_headers": "true",  # Allow custom cookies
                },
                cookies={
                    "currency": "USD",
                    "country": "UY",
                },
            )

Cookies are often required to maintain user sessions or comply with location-based content restrictions. For more information on cookies and headers, see the ZenRows headers feature documentation.
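For reference, cookies passed this way end up serialized into a single standard Cookie header of name=value pairs, per ordinary HTTP behavior:

```python
def cookie_header(cookies: dict) -> str:
    # Standard HTTP serialization: "name=value" pairs joined by "; ".
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, cookie_header({"currency": "USD", "country": "UY"}) yields "currency=USD; country=UY".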

Pricing

ZenRows’ Scraper API operates on a pay-per-success model: you only pay for requests that return the desired result. Residential Proxies are billed by bandwidth used.

To maximize your scraper’s success rate, route all your Scrapy requests through ZenRows. Different pages on the same site may have different levels of protection, but the parameters recommended above will have you covered.

ZenRows offers a range of plans, starting at just $69 monthly. For more detailed information, please refer to our pricing page.
