Extract web data with AI agents using ZenRows’ enterprise-grade scraping infrastructure. The langchain-zenrows integration enables large language models (LLMs) to access real-time web data through ZenRows. This guide covers how to scrape data with LLMs using the langchain-zenrows module.

What is LangChain?

LangChain is a framework that connects large language models to external data sources and applications. It provides a composable architecture that enables you to create AI workflows by chaining LLM operations, from simple prompt-response patterns to autonomous agents.

One key advantage of LangChain is that it allows for easy swapping, coupling, and decoupling of LLMs.
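
For a quick sense of that composability, here’s a minimal, hedged sketch that chains a prompt template with a chat model using LangChain’s pipe syntax. It assumes the langchain-openai package is installed and an OPENAI_API_KEY environment variable is set; swapping the model provider only means replacing the ChatOpenAI line:

Python
# Minimal sketch of LangChain's composable chain pattern.
# Assumes: pip install langchain-openai, and OPENAI_API_KEY is set.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Prompt template -> LLM -> plain-string output, chained with the pipe operator
prompt = ChatPromptTemplate.from_template("Summarize this page content:\n\n{page}")
llm = ChatOpenAI(model="gpt-4o-mini")  # swap this line to change providers
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"page": "Example scraped text..."}))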

Key Benefits of Integrating LangChain With ZenRows

The langchain-zenrows integration brings the following benefits:

  • Integrate ZenRows with LLMs: Easily integrate scraping capabilities into your desired LLM.
  • Build an agentic data pipeline: Assign different data pipeline roles to each LLM agent based on its capabilities.
  • Real-time web access without getting blocked: Fetch live web content without being stopped by antibot protections or JavaScript-heavy pages.
  • Multiple output formats: Fetch website data in various formats, including HTML, Markdown, plaintext, PDF, and screenshots.
  • Specific data point extraction: Extract specific data from web pages, such as emails, tables, phone numbers, images, and more.
  • Support for custom parsing: Fetch specific information from web elements using ZenRows’ advanced CSS selector feature (the last three options are illustrated in the sketch after this list).
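
To make those last three options concrete, here’s a minimal, hedged sketch of how they can be passed to the ZenRowsUniversalScraper tool (the parameters themselves are documented in the API Reference later in this guide). The target URL and the CSS selectors are illustrative assumptions, not values from a real project:

Python
# Hedged sketch: output format, data point extraction, and custom parsing.
# The URL and selectors below are illustrative assumptions.
import os
from langchain_zenrows import ZenRowsUniversalScraper

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
scraper = ZenRowsUniversalScraper()

# 1) Multiple output formats: return plaintext instead of raw HTML
text = scraper.invoke({
    "url": "https://www.scrapingcourse.com/ecommerce/",
    "response_type": "plaintext",
})

# 2) Specific data point extraction: pull emails and links automatically
data_points = scraper.invoke({
    "url": "https://www.scrapingcourse.com/ecommerce/",
    "outputs": "emails,links",
})

# 3) Custom parsing: a CSS selector map passed as a JSON string
products = scraper.invoke({
    "url": "https://www.scrapingcourse.com/ecommerce/",
    "css_extractor": '{"names": ".product-name", "prices": ".price"}',
})

print(text, data_points, products)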

Use Cases

Here are some use cases of the langchain-zenrows integration:

  • Real-time monitoring: Develop an AI application that scrapes and monitors website content changes in real-time.
  • Market research and demand forecasting: Scrape demand signals, such as reviews, social comments, engagement metrics, price trends, and more. Then, pass the data to an LLM model for forecasting.
  • Finding the best deals: Spot the best deals for a specific product from several e-commerce websites using ZenRows.
  • Review summarization: Summarize scraped reviews using a selected model.
  • Sentiment analysis: Scrape and analyze sentiment in social comments or product reviews.
  • Product research and comparison: Compare products across multiple retail websites and e-commerce platforms to identify the best options.
  • Consistent data pipeline update: Keep your data pipeline up to date with fresh data by integrating langchain-zenrows into your pipeline operations.

Getting Started: Basic Usage

Let’s start with a simple example that uses the langchain-zenrows package to scrape the Antibot Challenge page and return its content in Markdown format.

Install the langchain-zenrows package using pip:

pip3 install langchain-zenrows

Import the ZenRowsUniversalScraper class from the langchain_zenrows module, set your ZenRows API key as an environment variable, instantiate the universal scraper, and specify the ZenRows parameters with response_type set to markdown:

Python
from langchain_zenrows import ZenRowsUniversalScraper
import os

# Set your ZenRows API key
os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"

# Instantiate the universal scraper
scraper = ZenRowsUniversalScraper()

url = "https://www.scrapingcourse.com/antibot-challenge"

# Set ZenRows parameters
params = {
    "url": url,
    "js_render": "true",
    "premium_proxy": "true",
    "response_type": "markdown",
}

# Get content in markdown format
result = scraper.invoke(params)
print(result)

The integration bypasses the target site’s antibot measure and returns its content as Markdown:

Output
[![](https://www.scrapingcourse.com/assets/images/logo.svg) Scraping Course](http://www.scrapingcourse.com/)

# Antibot Challenge

![](https://www.scrapingcourse.com/assets/images/challenge.svg)

## You bypassed the Antibot challenge! :D

You’ve successfully integrated ZenRows with LangChain and bypassed an antibot challenge. Let’s build an AI research assistant with this integration.

Advanced Usage: Building an AI Research Assistant

Let’s take things a step further by building an AI-powered pricing research assistant for Etsy. Using the langchain-zenrows integration together with OpenAI’s gpt-4o-mini model, our assistant will automatically visit Etsy’s accessories category and extract key product details such as names, prices, and URLs.

Here’s the prompt we’ll use to guide the assistant:

Example Prompt

Prompt
Go to the Accessories category page of https://www.etsy.com/. Scrape the page in markdown format and return the 4 cheapest products in JSON format.

Step 1: Install the packages

pip install langgraph langchain-openai langchain-zenrows

Step 2: Add ZenRows as a scraping tool for the AI model

Import the necessary modules and set your ZenRows and OpenAI API keys as environment variables. Instantiate OpenAI’s chat model and the ZenRows universal scraper, then configure the LLM agent to use ZenRows as a scraping tool:

Python
# pip install langgraph langchain-openai langchain-zenrows
from langchain_zenrows import ZenRowsUniversalScraper
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
import os

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

def scraper():
    # initialize the model
    llm = ChatOpenAI(model="gpt-4o-mini")

    # initialize the universal scraper
    zenrows_tool = ZenRowsUniversalScraper()

    # create an agent that uses ZenRows as a tool
    agent = create_react_agent(llm, [zenrows_tool])

Step 3: Prompt the AI Agent

Invoke the AI agent with the research prompt and execute the scraper. As stated in the prompt, the agent uses ZenRows to scrape the target page in Markdown format, then analyzes the result and returns the 4 cheapest products:

Python
# ...
def scraper():
    # ...

    try:
        # create a prompt
        result = agent.invoke(
            {
                "messages": "Go to the Accessories category page of https://www.etsy.com/. Scrape the page in markdown format and return the 4 cheapest products in JSON format."
            }
        )
        # extract the response
        for message in result["messages"]:
            print(f"{message.content}")

    except NameError:
        print(
            "⚠️  Agent not available."
        )
    except Exception as e:
        print(f"❌ Error running agent: {e}")

scraper()

The agent uses ZenRows to visit and scrape the product information. Once scraped, the agent returns the items in the desired format.

Complete Code Example

Combine the snippets from the two steps, and you’ll get the following code:

Python
# pip install langgraph langchain-openai langchain-zenrows
from langchain_zenrows import ZenRowsUniversalScraper
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
import os

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

def scraper():

    # initialize the model
    llm = ChatOpenAI(model="gpt-4o-mini")

    # initialize the universal scraper
    zenrows_tool = ZenRowsUniversalScraper()

    # create an agent that uses ZenRows as a tool
    agent = create_react_agent(llm, [zenrows_tool])

    try:
        # create a prompt
        result = agent.invoke(
            {
                "messages": "Go to the Accessories category page of https://www.etsy.com/. Scrape the page in markdown format and return the 4 cheapest products in JSON format."
            }
        )
        # extract the response
        for message in result["messages"]:
            print(f"{message.content}")

    except NameError:
        print(
            "⚠️  Agent not available."
        )
    except Exception as e:
        print(f"❌ Error running agent: {e}")


scraper()

The above code returns the names, prices, and URLs of the 4 cheapest products in JSON format as expected.

Example Output

Output
[
    {
        "title": "Lovely Cat Keychain Gift For Pet Mom",
        "price": "$4.68",
        "url": "https://www.etsy.com/listing/1812260433/lovel..."
    },
    {
        "title": "Personalized slim leather keychain, key fob, custom keychain, leather initial keychain, quick shipping anniversary gift",
        "price": "$4.79",
        "url": "https://www.etsy.com/listing/876501930/personalized..."
    },
    {
        "title": "Custom OWALA Name Tag Back to School for daughter Owala Cup accessory for son waterbottle Tumbler Name Plate for sports tumbler athlete tag",
        "price": "$4.50",
        "url": "https://www.etsy.com/listing/1796331543/custom-..."
    },
    {
        "title": "Set of Blue and White Striped Hair Bows - 3-Inch Handmade Clips for Girls & Toddlers",
        "price": "$6.00",
        "url": "https://www.etsy.com/listing/4328846122/set..."
    }
]

Congratulations! 🎉 You’ve now integrated ZenRows as a web scraping tool for an AI agent using the langchain-zenrows module.

API Reference

Parameter | Type | Description
zenrows_api_key | string | Your ZenRows API key. If not provided, the setup looks for the ZENROWS_API_KEY environment variable.
url | string | Required. The URL to scrape.
js_render | boolean | Enable JavaScript rendering with a headless browser. Essential for modern web apps, SPAs, and sites with dynamic content (default: False).
js_instructions | string | Execute custom JavaScript on the page to interact with elements, scroll, click buttons, or manipulate content.
premium_proxy | boolean | Use residential IPs to bypass antibot protection. Essential for accessing protected sites (default: False).
proxy_country | string | Set the country of the IP used for the request (two-letter country code). Use for accessing geo-restricted content.
session_id | integer | Maintain the same IP for multiple requests for up to 10 minutes. Essential for multi-step processes.
custom_headers | boolean | Include custom headers in your request to mimic browser behavior.
wait_for | string | Wait for a specific CSS selector to appear in the DOM before returning content.
wait | integer | Wait a fixed amount of milliseconds after page load.
block_resources | string | Block specific resources (images, fonts, etc.) from loading to speed up scraping.
response_type | string | Convert HTML to other formats. Options: “markdown”, “plaintext”, “pdf”.
css_extractor | string | Extract specific elements using CSS selectors (JSON format).
autoparse | boolean | Automatically extract structured data from HTML (default: False).
screenshot | string | Capture an above-the-fold screenshot of the page (default: “false”).
screenshot_fullpage | string | Capture a full-page screenshot (default: “false”).
screenshot_selector | string | Capture a screenshot of a specific element using a CSS selector.
screenshot_format | string | Choose between “png” (default) and “jpeg” formats for screenshots.
screenshot_quality | integer | For JPEG format, set the quality from 1 to 100. Lower values reduce file size but decrease quality.
original_status | boolean | Return the original HTTP status code from the target page (default: False).
allowed_status_codes | string | Return the content even if the target page fails with the specified status codes. Useful for debugging or when you need content from error pages.
json_response | boolean | Capture network requests in JSON format, including XHR or Fetch data. Ideal for intercepting API calls made by the web page (default: False).
outputs | string | Specify which data types to extract from the scraped HTML. Accepted values: emails, phone numbers, headings, images, audios, videos, links, menus, hashtags, metadata, tables, favicon.

For complete parameter documentation and details, see the official ZenRows API Reference.
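
The hedged sketch below combines several of these parameters in a single tool call; the target URL, country code, and selector are placeholder assumptions rather than tested settings:

Python
# Hedged sketch: combining several ZenRows parameters in one tool call.
# The URL, country code, and selector are placeholders.
import os
from langchain_zenrows import ZenRowsUniversalScraper

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
scraper = ZenRowsUniversalScraper()

result = scraper.invoke({
    "url": "https://www.example.com/pricing",  # placeholder target
    "js_render": "true",                       # render JavaScript
    "premium_proxy": "true",                   # residential IPs
    "proxy_country": "us",                     # geo-targeted request
    "wait_for": ".pricing-table",              # wait for this element
    "response_type": "markdown",               # return Markdown
})
print(result)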

Troubleshooting

Token limit exceeded

  • Solution 1: If you hit the LLM token limit, the scraped output is larger than the model can process in a single request. Use ZenRows to extract only the specific data points you need (for example, via css_extractor or outputs) before passing the result to the LLM, as shown in the sketch below.
  • Solution 2: If the issue is related to usage-based token quotas or the model’s capabilities, consider upgrading your plan or switching to a model with a larger context window. For instance, moving from gpt-3.5-turbo to gpt-4o-mini increases the available context significantly.
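
For Solution 1, here’s a hedged sketch of the idea: extract only the fields you need with css_extractor, then pass that much smaller payload to the LLM. The target URL and selector map are illustrative assumptions:

Python
# Hedged sketch for Solution 1: shrink the payload before it reaches the LLM.
# The URL and selector map are illustrative assumptions.
import os
from langchain_zenrows import ZenRowsUniversalScraper
from langchain_openai import ChatOpenAI

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

scraper = ZenRowsUniversalScraper()
llm = ChatOpenAI(model="gpt-4o-mini")

# Extract only the fields you need instead of the full page
data = scraper.invoke({
    "url": "https://www.example.com/products",  # placeholder target
    "css_extractor": '{"names": ".product-name", "prices": ".price"}',
})

# The reduced payload is far less likely to exceed the model's context limit
summary = llm.invoke(f"Summarize this pricing data:\n{data}")
print(summary.content)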

API key error

  • Solution 1: Ensure you’ve added your ZenRows and LLM API keys to your environment variables (a quick check is sketched below).
  • Solution 2: Cross-check the API keys and ensure you’ve entered the correct keys.
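
As a quick check for both solutions, you can verify that the expected environment variables are present before running the agent; the variable names below match the ones used earlier in this guide:

Python
# Quick sanity check: confirm the API keys are set before running the agent.
import os

for key in ("ZENROWS_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Missing environment variable: {key}")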

Empty or incomplete data/tool response

  • Solution 1: Activate JS rendering to handle dynamic content and increase the success rate.
  • Solution 2: Increase the wait time using the ZenRows wait or wait_for parameter. The wait parameter adds a fixed delay so the entire page can load, whereas wait_for pauses until a specific element appears before scraping continues (both are contrasted in the sketch after this list).
  • Solution 3: If you’ve used the css_extractor parameter to target specific elements, ensure you’ve entered the correct selectors.
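
The difference between wait and wait_for in Solution 2 looks like this in practice; the URL, delay, and selector are placeholder assumptions:

Python
# Hedged sketch contrasting the wait and wait_for parameters.
# The URL, delay, and selector are placeholder assumptions.
import os
from langchain_zenrows import ZenRowsUniversalScraper

os.environ["ZENROWS_API_KEY"] = "YOUR_ZENROWS_API_KEY"
scraper = ZenRowsUniversalScraper()

# Fixed delay: give the whole page 5 seconds to settle after load
result_fixed = scraper.invoke({
    "url": "https://www.example.com/reviews",  # placeholder target
    "js_render": "true",
    "wait": 5000,                              # milliseconds
})

# Targeted wait: pause until the reviews container appears in the DOM
result_targeted = scraper.invoke({
    "url": "https://www.example.com/reviews",
    "js_render": "true",
    "wait_for": ".review-list",                # placeholder selector
})

print(result_fixed)
print(result_targeted)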

Helpful Resources

Frequently Asked Questions (FAQ)