CSS Selectors

You can use CSS selectors for data extraction. Add &css_extractor={"links":"a @href"} to the request to enable this feature. The table below lists examples of how to use it.
| Extraction rules | Sample HTML | Value | JSON output |
| --- | --- | --- | --- |
| {"divs":"div"} | <div>text0</div> | text | {"divs": "text0"} |
| {"divs":"div"} | <div>text1</div><div>text2</div> | text | {"divs": ["text1", "text2"]} |
| {"links":"a @href"} | <a href="#register">Register</a> | href attribute | {"links": "#register"} |
| {"hidden":"input[type=hidden] @value"} | <input type="hidden" name="_token" value="f23g23g.b9u1bg91g.zv97" /> | value attribute | {"hidden": "f23g23g.b9u1bg91g.zv97"} |
| {"class":"button.submit @data-v"} | <button class="submit" data-v="register-user">click</button> | data-v attribute with submit class | {"class": "register-user"} |
| {"emails":"a[href^='mailto:'] @href"} | <a href="mailto:test1@domain.com">email 1</a><a href="mailto:test2@domain.com">email 2</a> | href attribute for links starting with mailto: | {"emails": ["test1@domain.com", "test2@domain.com"]} |
| {"id":"#my-id"} | <div id="my-id">Content here</div> | content of element with id | {"id": "Content here"} |
| {"links":"a[id='register-link'] @href"} | <a id="register-link" href="#signup">Sign up</a> | href attribute of element with specific id | {"links": "#signup"} |
| {"xpath":"//h1"} | <h1>Welcome</h1> | text extracted using XPath | {"xpath": "Welcome"} |
| {"xpath":"//img @src"} | <img src="image.png" alt="image description" /> | src attribute extracted using XPath | {"xpath": "image.png"} |
If you are interested in learning more, you can find a complete reference of CSS Selectors here.
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'css_extractor': """{"links":"a @href","images":"img @src"}""",
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
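The css_extractor value is itself a JSON string, so quoting mistakes are easy to make when writing it by hand. One way to avoid them (a sketch, not part of the API) is to build the selector map as a Python dict and serialize it with json.dumps:

```python
import json

# Build the selector map as a dict, then serialize it for the css_extractor param
selectors = {"links": "a @href", "images": "img @src"}
css_extractor = json.dumps(selectors)
print(css_extractor)
```

You can then pass css_extractor directly as the parameter value in the request above.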

Auto Parsing

ZenRows® API will return the HTML of the URL by default. Enabling Autoparse uses our extraction algorithms to parse data in JSON format automatically.
Learn more about the autoparse feature in What Is Autoparse?
Add &autoparse=true to the request for this feature.
# pip install requests
import requests

url = 'https://www.amazon.com/dp/B01LD5GO7I/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'autoparse': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Output Filters

The outputs parameter lets you specify which data types to extract from the scraped HTML. This allows you to efficiently retrieve only the data types you’re interested in, reducing processing time and focusing on the most relevant information. The parameter accepts a comma-separated list of filter names and returns the results in a structured JSON format.
Use outputs=* to retrieve all available data types.
Here’s an example of how to use the outputs parameter:
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'outputs': 'emails,headings,menus',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
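The response is a JSON object keyed by the requested filter names, as shown in the per-filter examples below. A minimal parsing sketch over an illustrative (not real) payload:

```python
import json

# Illustrative payload shaped like the per-filter examples in this section
body = '{"emails": ["info@example.com"], "headings": ["Welcome", "FAQ"], "menus": ["Home"]}'
data = json.loads(body)
for name, values in data.items():
    print(f"{name}: {len(values)} item(s)")
```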
Supported Filters and Examples:

Emails

Extracts email addresses using CSS selectors and regular expressions. This includes standard email formats like example@example.com and obfuscated versions like example[at]example.com. Example: outputs=emails
output
{
  "emails": [
    "example@example.com",
    "info@website.com",
    "contact[at]domain.com",
    "support at support dot com"
  ]
}
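Obfuscated addresses like the last two above are returned as-is. A small post-processing sketch can normalize them into plain addresses; the patterns handled here are assumptions, not an exhaustive list:

```python
import re

def normalize_email(raw: str) -> str:
    # Rewrite common obfuscations such as "[at]" / " at " and "[dot]" / " dot "
    s = re.sub(r"\s*\[at\]\s*|\s+at\s+", "@", raw)
    s = re.sub(r"\s*\[dot\]\s*|\s+dot\s+", ".", s)
    return s

print(normalize_email("contact[at]domain.com"))      # contact@domain.com
print(normalize_email("support at support dot com"))  # support@support.com
```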

Phone Numbers

Extracts phone numbers using CSS selectors and regular expressions, focusing on links with tel: protocol. Example: outputs=phone_numbers
output
{
  "phone_numbers": [
    "+1-800-555-5555",
    "(123) 456-7890",
    "+44 20 7946 0958"
  ]
}

Headings

Extracts heading text from HTML elements h1 through h6. Example: outputs=headings
output
{
  "headings": [
    "Welcome to Our Website",
    "Our Services",
    "Contact Us",
    "FAQ"
  ]
}

Images

Extracts image sources from img tags. Only the src attribute is returned. Example: outputs=images
output
{
  "images": [
    "https://example.com/image1.jpg",
    "https://example.com/image2.png"
  ]
}

Audios

Extracts audio sources from source elements inside audio tags. Only the src attribute is returned. Example: outputs=audios
output
{
  "audios": [
    "https://example.com/audio1.mp3",
    "https://example.com/audio2.wav"
  ]
}

Videos

Extracts video sources from source elements inside video tags. Only the src attribute is returned. Example: outputs=videos
output
{
  "videos": [
    "https://example.com/video1.mp4",
    "https://example.com/video2.webm"
  ]
}

Links

Extracts URLs from a tags. Only the href attribute is returned. Example: outputs=links
output
{
  "links": [
    "https://example.com/page1",
    "https://example.com/page2"
  ]
}

Menus

Extracts menu items from li elements inside menu tags. Example: outputs=menus
output
{
  "menus": [
    "Home",
    "About Us",
    "Services",
    "Contact"
  ]
}

Hashtags

Extracts hashtags using regular expressions, matching typical hashtag formats like #example. Example: outputs=hashtags
output
{
  "hashtags": [
    "#vacation",
    "#summer2024",
    "#travel"
  ]
}

Metadata

Extracts meta-information from meta tags inside the head section. Returns name and content attributes in the format name: content. Example: outputs=metadata
output
{
  "metadata": [
    "description: This is an example webpage.",
    "keywords: example, demo, website",
    "author: John Doe"
  ]
}

Tables

Extracts data from table elements and returns the table data in JSON format, including dimensions, headings, and content. Example: outputs=tables
output
{
  "dimensions": {
    "rows": 4,
    "columns": 4,
    "heading": true
  },
  "heading": ["A", "B", "C", "D"],
  "content": [
    {"A": "1", "B": "1", "C": "1", "D": "1"},
    {"A": "2", "B": "2", "C": "2", "D": "2"},
    {"A": "3", "B": "3", "C": "3", "D": "3"},
    {"A": "4", "B": "4", "C": "4", "D": "4"}
  ]
}
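If you need the extracted table in a flat format, the JSON above maps cleanly onto csv.DictWriter, since heading gives the field names and content is already a list of row dicts. A sketch using a trimmed copy of the example payload:

```python
import csv
import io
import json

# Trimmed copy of the example "tables" payload above
table = json.loads("""
{
  "heading": ["A", "B", "C", "D"],
  "content": [
    {"A": "1", "B": "1", "C": "1", "D": "1"},
    {"A": "2", "B": "2", "C": "2", "D": "2"}
  ]
}
""")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=table["heading"])
writer.writeheader()
writer.writerows(table["content"])
print(buf.getvalue())
```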

Favicon

Extracts the favicon URL from the link element in the head section of the HTML. Example: outputs=favicon
output
{
  "favicon": "https://example.com/favicon.ico"
}

Markdown Response

By adding response_type=markdown to the request parameters, the ZenRows API will return the content in Markdown format, which can be easier to read and work with than raw HTML if you prefer Markdown’s simplicity and readability.
You can’t use the Markdown Response in conjunction with other outputs.
Add response_type=markdown to the request:
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'response_type': 'markdown',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Let’s say the HTML content of the ScrapingCourse product page includes a product title, a description, and a list of features. In HTML, it might look something like this:
<h1>Product Title</h1>
<p>This is a great product that does many things.</p>
<ul>
    <li>Feature 1</li>
    <li>Feature 2</li>
    <li>Feature 3</li>
</ul>
When you enable the Markdown response feature, ZenRows Universal Scraper API will convert this HTML content into Markdown like this:
# Product Title

This is a great product that does many things.

- Feature 1
- Feature 2
- Feature 3

Plain Text Response

The plaintext feature is an output option that returns the scraped content as plain text instead of HTML or Markdown. This feature can be helpful when you want a clean, unformatted version of the content without any HTML tags or Markdown formatting. It simplifies the content extraction process and makes processing or analyzing the text easier.
You can’t use the Plain Text Response in conjunction with other outputs.
Add response_type=plaintext to the request:
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'response_type': 'plaintext',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Let’s say the HTML content of the ScrapingCourse product page includes a product title, a description, and a list of features. In HTML, it might look something like this:
<h1>Product Title</h1>
<p>This is a great product that does many things.</p>
<ul>
    <li>Feature 1</li>
    <li>Feature 2</li>
    <li>Feature 3</li>
</ul>
When you enable the Plain Text response feature, ZenRows Universal Scraper API will convert this HTML content into plain text like this:
Product Title

This is a great product that does many things.

Feature 1
Feature 2
Feature 3

PDF Response

Saving web scraping results as PDF can make data easier to share and archive. To use the PDF response feature, include the js_render=true parameter alongside response_type=pdf in your request. This instructs the API to generate a PDF file from the scraped content.
Check our documentation about the JS Rendering
You can’t use the PDF Response in conjunction with other outputs.
The resulting PDF file will contain the same information as the web page you scraped.
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'response_type': 'pdf',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
After getting the PDF response, you can save it to disk with the following Python example:
scraper.py
# Save the response as a binary file
with open('output.pdf', 'wb') as file:
    file.write(response.content)

print("Response saved into output.pdf")

Page Screenshot

Capture an above-the-fold screenshot of the target page by adding screenshot=true to the request. By default, the image will be in PNG format.

Additional Options

  • screenshot_fullpage=true takes a full-page screenshot.
  • screenshot_selector=<CSS Selector> takes a screenshot of the element given in the CSS Selector.
screenshot_selector and screenshot_fullpage are mutually exclusive, and JavaScript rendering (js_render=true) is required for both. These screenshot features can be combined with other options like wait, wait_for, or js_instructions to ensure that the page or elements are fully loaded before capturing the image. When using json_response, the result will include a JSON object with the screenshot data encoded in base64, allowing for easy integration into your workflows.

Image Format and Quality

In addition to the basic screenshot functionality, ZenRows offers customization options to optimize the output. These features are particularly useful for reducing file size, especially when taking full-page screenshots where the image might exceed 10MB, causing errors.
  • screenshot_format: Choose between png and jpeg formats, with PNG being the default. PNG is great for high-quality images and transparency, while JPEG offers efficient compression.
  • screenshot_quality: Applicable when using JPEG, this parameter allows you to set the quality from 1 to 100. Useful for balancing image clarity and file size, especially in scenarios where storage or bandwidth is limited.
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'screenshot_fullpage': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
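To keep the image, write the binary response body to disk instead of printing it. A sketch, assuming the response body is the raw image bytes (as in the PDF example above):

```python
# Sketch: persist screenshot bytes returned by the API
# (assumes the response body is the raw image, as in the PDF example above)
def save_screenshot(content: bytes, path: str = 'screenshot.png') -> str:
    with open(path, 'wb') as f:
        f.write(content)
    return path
```

For the request above, you would call save_screenshot(response.content) after the request completes.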

Download Files and Pictures

ZenRows® lets you download images, PDFs, and other files directly from web pages. This feature is handy when extracting non-text content, like product images, manuals, or downloadable reports, as part of your web scraping workflow. Example:
# pip install requests
import requests

url = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
    'url': url,
    'apikey': apikey,
}
response = requests.get('https://api.zenrows.com/v1/', params=params)

# Save the PDF file as .pdf if the request is successful
if response.status_code == 200:
    with open('output.pdf', 'wb') as f:
        f.write(response.content)
    print("File downloaded and saved successfully!")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")
Supported file download scenarios:
1. Direct File Response

If the URL you request returns the file directly, such as an image or PDF link, ZenRows will fetch the file so you can save it in its original format. This is the most reliable method.
2. Triggered Downloads Using JS Instructions

If a file download is started by a user action, such as clicking a button or link, you can use ZenRows’ JS Instructions to simulate these actions. If the download begins automatically, without prompting for a directory or further user input, ZenRows can capture and return the file.
Downloads are only possible when the file is delivered directly in the HTTP response. If the website asks the user to choose a download location or requires more interaction, ZenRows cannot capture the file. In these cases, we recommend using our Scraping Browser, which gives you more control over the browser session and supports more complex interactions.

File Size Limits

ZenRows enforces a maximum file size per request to ensure stable performance. If you try downloading a file larger than your plan allows, you will receive a 413 Content Too Large error.
You can find more details on the plan limits in our Pricing Documentation.

Frequently Asked Questions (FAQ)