The `langchain-zenrows` integration enables large language models (LLMs) to access real-time web data using ZenRows’ robust scraping infrastructure. This guide covers how to scrape data with LLMs using the `langchain-zenrows` module.
The `langchain-zenrows` integration brings the following benefits:
`langchain-zenrows` integration:
`langchain-zenrows` into your pipeline operations.
The following example uses the `langchain-zenrows` package to scrape the Antibot Challenge page and return its content in Markdown format.
Install the `langchain-zenrows` package using `pip`:
Import the `ZenRowsUniversalScraper` class from the `langchain_zenrows` module, instantiate the universal scraper with your ZenRows API key, and specify ZenRows parameters with the `response_type` set to `markdown`:
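The steps above can be sketched roughly as follows. This is a minimal sketch, not the guide's original listing: it assumes `ZenRowsUniversalScraper` accepts the `zenrows_api_key` argument listed in the parameter table and, as a LangChain tool, is called through `invoke` with a dictionary of parameters. The helper names `build_scrape_params` and `scrape_markdown`, the target URL, and the choice to enable `js_render` and `premium_proxy` are illustrative assumptions.

```python
# Install first (shell): pip install langchain-zenrows
import os


def build_scrape_params(url: str) -> dict:
    """Return ZenRows parameters that request the page as Markdown."""
    return {
        "url": url,
        "js_render": True,      # assumption: the page needs JavaScript rendering
        "premium_proxy": True,  # assumption: residential IPs help with antibot checks
        "response_type": "markdown",
    }


def scrape_markdown(url: str) -> str:
    """Scrape `url` with ZenRows and return the result (performs a network call)."""
    # Imported lazily so build_scrape_params can be used without the package installed.
    from langchain_zenrows import ZenRowsUniversalScraper

    # Assumes the ZENROWS_API_KEY environment variable holds your API key.
    scraper = ZenRowsUniversalScraper(zenrows_api_key=os.environ["ZENROWS_API_KEY"])
    return scraper.invoke(build_scrape_params(url))


# Assumed URL for the Antibot Challenge page; replace it with your own target.
ANTIBOT_URL = "https://www.scrapingcourse.com/antibot-challenge"
# print(scrape_markdown(ANTIBOT_URL))
```

Uncomment the last line once the package is installed and the API key is set. Keeping the parameter dictionary in its own function makes it easy to inspect or adjust before any request is sent.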
Using the `langchain-zenrows` integration together with OpenAI’s `gpt-4o-mini` model, our assistant will automatically visit Etsy’s accessories category and extract key product details such as names, prices, and URLs.
Here’s the prompt we’ll use to guide the assistant:
Set up the `langchain-zenrows` integration with the relevant API keys. Configure the LLM agent to use ZenRows as a scraping tool:
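One way this configuration might look is sketched below. It assumes the `langchain-openai` and `langgraph` packages are installed and uses LangGraph's prebuilt `create_react_agent` as the agent wrapper, which may differ from the setup the original guide used. The prompt text, the Etsy URL, and the helper names `build_prompt`, `missing_keys`, and `run_agent` are hypothetical.

```python
import os

# Assumed URL for Etsy's accessories category.
ETSY_URL = "https://www.etsy.com/c/accessories"


def build_prompt(url: str, top_n: int = 4) -> str:
    """Compose a hypothetical instruction for the agent."""
    return (
        f"Go to {url} and scrape the page in Markdown format. "
        f"Return the {top_n} cheapest products with their names, prices, and URLs."
    )


def missing_keys(env=os.environ) -> list:
    """List the required API keys that are not set in `env`."""
    return [k for k in ("ZENROWS_API_KEY", "OPENAI_API_KEY") if not env.get(k)]


def run_agent(prompt: str) -> str:
    """Let a gpt-4o-mini agent call ZenRows as a tool (performs network calls)."""
    # Imported lazily so the helpers above work without these packages installed.
    from langchain_openai import ChatOpenAI
    from langchain_zenrows import ZenRowsUniversalScraper
    from langgraph.prebuilt import create_react_agent

    unset = missing_keys()
    if unset:
        raise RuntimeError(f"Set these environment variables first: {unset}")

    llm = ChatOpenAI(model="gpt-4o-mini")  # reads OPENAI_API_KEY
    scraper = ZenRowsUniversalScraper()    # reads ZENROWS_API_KEY
    agent = create_react_agent(llm, [scraper])
    result = agent.invoke({"messages": [("user", prompt)]})
    # The last message in the returned state holds the agent's final answer.
    return result["messages"][-1].content


# print(run_agent(build_prompt(ETSY_URL)))
```

Checking for the keys before building the agent gives a clearer error than a failed request halfway through a run.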
The agent uses the `markdown` response to scrape the target page in Markdown format. It then analyzes the result and returns the 4 cheapest products:
The table below lists the parameters supported by the `langchain-zenrows` module.
Parameter | Type | Description |
---|---|---|
zenrows_api_key | string | Your ZenRows API key. If not provided, the setup looks for the ZENROWS_API_KEY environment variable. |
url | string | Required. The URL to scrape. |
js_render | boolean | Enable JavaScript rendering with a headless browser. Essential for modern web apps, SPAs, and sites with dynamic content (default: False). |
js_instructions | string | Execute custom JavaScript on the page to interact with elements, scroll, click buttons, or manipulate content. |
premium_proxy | boolean | Use residential IPs to bypass antibot protection. Essential for accessing protected sites (default: False). |
proxy_country | string | Set the country of the IP used for the request. Use for accessing geo-restricted content. Two-letter country code. |
session_id | integer | Maintain the same IP for multiple requests for up to 10 minutes. Essential for multi-step processes. |
custom_headers | boolean | Include custom headers in your request to mimic browser behavior. |
wait_for | string | Wait for a specific CSS Selector to appear in the DOM before returning content. |
wait | integer | Wait a fixed amount of milliseconds after page load. |
block_resources | string | Block specific resources (images, fonts, etc.) from loading to speed up scraping. |
response_type | string | Convert HTML to other formats. Options: “markdown”, “plaintext”, “pdf”. |
css_extractor | string | Extract specific elements using CSS selectors (JSON format). |
autoparse | boolean | Automatically extract structured data from HTML (default: False). |
screenshot | string | Capture an above-the-fold screenshot of the page (default: “false”). |
screenshot_fullpage | string | Capture a full-page screenshot (default: “false”). |
screenshot_selector | string | Capture a screenshot of a specific element using CSS Selector. |
screenshot_format | string | Choose between “png” (default) and “jpeg” formats for screenshots. |
screenshot_quality | integer | For JPEG format, set the quality from 1 to 100. Lower values reduce file size but decrease quality. |
original_status | boolean | Return the original HTTP status code from the target page (default: False). |
allowed_status_codes | string | Returns the content even if the target page fails with the specified status codes. Useful for debugging or when you need content from error pages. |
json_response | boolean | Capture network requests in JSON format, including XHR or Fetch data. Ideal for intercepting API calls made by the web page (default: False). |
outputs | string | Specify which data types to extract from the scraped HTML. Accepted values: emails, phone numbers, headings, images, audios, videos, links, menus, hashtags, metadata, tables, favicon. |
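As an illustration of how several of these parameters might be combined, the sketch below builds a parameter dictionary for a page that loads content dynamically. It assumes `css_extractor` takes a JSON object mapping output keys to CSS selectors and `js_instructions` takes a JSON array of instruction objects, both passed as strings. The helper name `build_dynamic_page_params`, the URL, and the selectors are hypothetical.

```python
import json


def build_dynamic_page_params(url, selectors, wait_for=None, instructions=None):
    """Build ZenRows parameters for extracting elements from a JS-heavy page."""
    params = {
        "url": url,
        # wait_for and js_instructions act on the rendered page,
        # so JavaScript rendering is enabled here (assumption).
        "js_render": True,
        # css_extractor is passed as a JSON string (see the table above).
        "css_extractor": json.dumps(selectors),
    }
    if wait_for:
        # Pause until this CSS selector appears in the DOM.
        params["wait_for"] = wait_for
    if instructions:
        # Assumed format: a JSON array of instruction objects.
        params["js_instructions"] = json.dumps(instructions)
    return params


params = build_dynamic_page_params(
    "https://example.com/products",                  # placeholder URL
    selectors={"title": "h1", "prices": ".price"},   # hypothetical selectors
    wait_for=".price",
    instructions=[{"click": "#load-more"}, {"wait": 500}],
)
print(json.dumps(params, indent=2))
```

The resulting dictionary can be passed to the scraper's `invoke` method in the same way as any other set of parameters.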
Use the `wait` or `wait_for` parameter. The `wait` parameter introduces a general delay to allow the entire page to load, whereas `wait_for` targets a specific element, pausing execution until that element appears before scraping continues.
If you use the `css_extractor` parameter to target specific elements, ensure you’ve entered the correct selectors.
Which LLMs does langchain-zenrows support?
`langchain-zenrows` is compatible with all LLMs supported by LangChain. Check LangChain’s official chat models documentation for more information.
Can I use selectors with the LLM agent option?
Does langchain-zenrows support custom JavaScript execution?
Yes, through the `js_instructions` parameter. Check our JavaScript instructions guide for more.
Is antibot bypass automatic with the LLM agent option?
Does the LLM agent integration handle JS rendering?
How do I extract specific data with CSS selectors?
Use the `css_extractor` parameter to specify the selectors of the elements containing the data you want to scrape.
Can I take screenshots with the LLM agent integration?
What's the difference between this and other web scraping tools in LangChain?