Output
CSS Selectors
You can use CSS Selectors to extract data from the page. The table below lists example extraction rules and the output each one produces.
To use this feature, add &css_extractor={"links":"a @href"} to the request.
Here are some examples:
Extraction rules | Sample HTML | Value | JSON output |
---|---|---|---|
{"divs":"div"} | <div>text0</div> | text | {"divs": "text0"} |
{"divs":"div"} | <div>text1</div><div>text2</div> | text | {"divs": ["text1", "text2"]} |
{"links":"a @href"} | <a href="#register">Register</a> | href attribute | {"links": "#register"} |
{"hidden":"input[type=hidden] @value"} | <input type="hidden" name="_token" value="f23g23g.b9u1bg91g.zv97" /> | value attribute | {"hidden": "f23g23g.b9u1bg91g.zv97"} |
{"class":"button.submit @data-v"} | <button class="submit" data-v="register-user">click</button> | data-v attribute with submit class | {"class": "register-user"} |
{"emails":"a[href^='mailto:'] @href"} | <a href="mailto:test1@domain.com">email 1</a><a href="mailto:test2@domain.com">email 2</a> | href attribute for links starting with mailto: | {"emails": ["test1@domain.com", "test2@domain.com"]} |
{"id":"#my-id"} | <div id="my-id">Content here</div> | Content from element with id | {"id": "Content here"} |
{"links":"a[id='register-link'] @href"} | <a id="register-link" href="#signup">Sign up</a> | href attribute of element with specific id | {"links": "#signup"} |
{"xpath":"//h1"} | <h1>Welcome</h1> | Extract text using XPath | {"xpath": "Welcome"} |
{"xpath":"//img @src"} | <img src="image.png" alt="image description" /> | Extract src attribute using XPath | {"xpath": "image.png"} |
If you are interested in learning more, you can find a complete reference of CSS Selectors here.
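As a minimal sketch of the css_extractor parameter described above, the Python snippet below builds a request URL with the extraction rules serialized as JSON and percent-encoded. The endpoint https://api.zenrows.com/v1/, the apikey and url parameter names, the placeholder key, and the helper name build_css_extractor_url are assumptions that do not appear in this section, so check them against your account before use.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint and placeholder key; replace both with your own values.
API_ENDPOINT = "https://api.zenrows.com/v1/"
API_KEY = "YOUR_ZENROWS_API_KEY"


def build_css_extractor_url(target_url, rules):
    """Return a request URL whose css_extractor parameter carries the rules as JSON."""
    params = {
        "apikey": API_KEY,
        "url": target_url,
        # The rules travel as a JSON string; urlencode percent-encodes it.
        "css_extractor": json.dumps(rules),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


request_url = build_css_extractor_url("https://www.example.com", {"links": "a @href"})
print(request_url)
```

Encoding the rules with json.dumps and urlencode avoids hand-escaping quotes, brackets, and the @ sign in the query string.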
Auto Parsing
ZenRows® API will return the HTML of the URL by default. Enabling Autoparse uses our extraction algorithms to parse data in JSON format automatically.
Learn more about the autoparse feature in: What Is Autoparse?
Add &autoparse=true to the request to enable this feature.
Output Filters
The outputs parameter lets you specify which data types to extract from the scraped HTML. This lets you retrieve only the data types you are interested in, reducing processing time and keeping the response focused on the most relevant information.
The parameter accepts a comma-separated list of filter names and returns the results in a structured JSON format.
Use outputs=* to retrieve all available data types.
Here's an example of how to use the outputs parameter:
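The Python sketch below is one way to build such a request. It joins the chosen filter names into the comma-separated outputs value and rejects names that are not in the Supported Filters list. The endpoint, the apikey and url parameter names, the placeholder key, and the helper name build_outputs_url are assumptions not taken from this section.

```python
from urllib.parse import urlencode

# Assumed endpoint and placeholder key; replace both with your own values.
API_ENDPOINT = "https://api.zenrows.com/v1/"
API_KEY = "YOUR_ZENROWS_API_KEY"

# Filter names documented under "Supported Filters and Examples".
SUPPORTED_FILTERS = {
    "emails", "phone_numbers", "headings", "images", "audios", "videos",
    "links", "menus", "hashtags", "metadata", "tables", "favicon",
}


def build_outputs_url(target_url, filters):
    """Return a request URL asking only for the given output filters.

    Pass ["*"] to request every available data type.
    """
    filters = list(filters)
    if filters != ["*"]:
        unknown = set(filters) - SUPPORTED_FILTERS
        if unknown:
            raise ValueError(f"Unsupported filters: {sorted(unknown)}")
    params = {
        "apikey": API_KEY,
        "url": target_url,
        "outputs": ",".join(filters),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_outputs_url("https://www.example.com", ["emails", "headings", "links"]))
```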
Supported Filters and Examples:
Emails
Extracts email addresses using CSS selectors and regular expressions. This includes standard email formats like example@example.com and obfuscated versions like example[at]example.com.
Example: outputs=emails
Phone Numbers
Extracts phone numbers using CSS selectors and regular expressions, focusing on links with the tel: protocol.
Example: outputs=phone_numbers
Headings
Extracts heading text from HTML elements h1 through h6.
Example: outputs=headings
Images
Extracts image sources from img tags. Only the src attribute is returned.
Example: outputs=images
Audios
Extracts audio sources from source elements inside audio tags. Only the src attribute is returned.
Example: outputs=audios
Videos
Extracts video sources from source elements inside video tags. Only the src attribute is returned.
Example: outputs=videos
Links
Extracts URLs from a tags. Only the href attribute is returned.
Example: outputs=links
Menus
Extracts menu items from li elements inside menu tags.
Example: outputs=menus
Hashtags
Extracts hashtags using regular expressions, matching typical hashtag formats like #example.
Example: outputs=hashtags
Metadata
Extracts meta-information from meta tags inside the head section. Returns name and content attributes in the format name: content.
Example: outputs=metadata
Tables
Extracts data from table elements and returns the table data in JSON format, including dimensions, headings, and content.
Example: outputs=tables
Favicon
Extracts the favicon URL from the link element in the head section of the HTML.
Example: outputs=favicon
Markdown Response
Adding response_type=markdown to the request parameters makes the ZenRows API return the content in Markdown format, which can be simpler to read and work with than HTML.
Add response_type=markdown to the request:
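A minimal Python sketch of a Markdown request, using only the standard library, could look like this. The endpoint, the apikey and url parameter names, the placeholder key, the helper names, and the assumption that the body is UTF-8 text are all illustrative and not taken from this section.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and placeholder key; replace both with your own values.
API_ENDPOINT = "https://api.zenrows.com/v1/"
API_KEY = "YOUR_ZENROWS_API_KEY"


def build_markdown_url(target_url):
    """Return a request URL that asks for the page content as Markdown."""
    params = {
        "apikey": API_KEY,
        "url": target_url,
        "response_type": "markdown",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_markdown(target_url, timeout=60):
    """Perform the request and return the body as text (assumes UTF-8)."""
    with urlopen(build_markdown_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; fetch_markdown does.
print(build_markdown_url("https://www.example.com"))
```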
Let’s say the HTML content of the ScrapingCourse product page includes a product title, a description, and a list of features. In HTML, it might look something like this:
When you enable the Markdown response feature, ZenRows Universal Scraper API will convert this HTML content into Markdown like this:
Plain Text Response
The plaintext feature is an output option that returns the scraped content as plain text instead of HTML or Markdown.
This feature can be helpful when you want a clean, unformatted version of the content without any HTML tags or Markdown formatting. It simplifies the content extraction process and makes processing or analyzing the text easier.
Add response_type=plaintext to the request:
Let’s say the HTML content of the ScrapingCourse product page includes a product title, a description, and a list of features. In HTML, it might look something like this:
When you enable the plain text response feature, ZenRows Universal Scraper API will convert this HTML content into plain text like this:
PDF Response
The PDF response feature returns the scraped page as a PDF file, which makes the results easier to archive and share.
To use it, include the js_render=true parameter together with response_type=pdf in your request. This instructs the API to generate a PDF file from the scraped content.
The resulting PDF file will contain the same information as the web page you scraped.
After getting the response as a .pdf file, you can save it using the following example in Python:
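The original example is not reproduced here, so what follows is a hedged sketch of the same idea: request the page with js_render=true and response_type=pdf, then write the raw response bytes to disk in binary mode. The endpoint, the apikey and url parameter names, the placeholder key, and the helper names are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and placeholder key; replace both with your own values.
API_ENDPOINT = "https://api.zenrows.com/v1/"
API_KEY = "YOUR_ZENROWS_API_KEY"


def build_pdf_url(target_url):
    """Return a request URL that asks for a PDF rendering of the page."""
    params = {
        "apikey": API_KEY,
        "url": target_url,
        "js_render": "true",  # required for the PDF response
        "response_type": "pdf",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def save_pdf(pdf_bytes, path):
    """Write raw PDF bytes to path and return the number of bytes written."""
    with open(path, "wb") as pdf_file:  # binary mode keeps the bytes intact
        return pdf_file.write(pdf_bytes)


def download_pdf(target_url, path, timeout=120):
    """Perform the request (network call) and store the PDF at path."""
    with urlopen(build_pdf_url(target_url), timeout=timeout) as response:
        return save_pdf(response.read(), path)


# Example call, commented out because it needs a valid key and network access:
# download_pdf("https://www.example.com", "result.pdf")
```

Opening the file with "wb" matters: treating the response as text would corrupt the PDF.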
Page Screenshot
Capture an above-the-fold screenshot of the target page by adding screenshot=true to the request. By default, the image will be in PNG format.
Additional Options
screenshot_fullpage=true takes a full-page screenshot.
screenshot_selector=<CSS Selector> takes a screenshot of the element given in the CSS Selector.
Because of how these parameters work, screenshot_selector and screenshot_fullpage are mutually exclusive. Additionally, JavaScript rendering (js_render=true) is required.
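The constraints above can be sketched as a small Python helper that assembles the screenshot parameters and refuses to combine the two mutually exclusive options. The helper name screenshot_params is hypothetical, and sending screenshot=true alongside the additional option is an assumption, since the text does not say whether the additional options can be used on their own.

```python
def screenshot_params(fullpage=False, selector=None):
    """Return the screenshot-related query parameters for one request."""
    if fullpage and selector:
        raise ValueError(
            "screenshot_fullpage and screenshot_selector are mutually exclusive"
        )
    params = {
        "js_render": "true",   # JavaScript rendering is required
        "screenshot": "true",  # assumption: kept alongside the extra option
    }
    if fullpage:
        params["screenshot_fullpage"] = "true"
    elif selector:
        params["screenshot_selector"] = selector
    return params


print(screenshot_params(selector="#main"))
```

The returned dictionary can be merged with the rest of the request parameters before the query string is encoded.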
These screenshot features can be combined with other options like wait, wait_for, or js_instructions to ensure that the page or elements are fully loaded before capturing the image. When using json_response, the result will include a JSON object with the screenshot data encoded in base64, allowing for easy integration into your workflows.
Image Format and Quality
In addition to the basic screenshot functionality, ZenRows offers customization options to optimize the output. These features are particularly useful for reducing file size, especially when taking full-page screenshots where the image might exceed 10 MB, causing errors.
screenshot_format: Choose between png and jpeg formats, with PNG being the default. PNG is great for high-quality images and transparency, while JPEG offers efficient compression.
screenshot_quality: Applicable when using JPEG, this parameter allows you to set the quality from 1 to 100. Useful for balancing image clarity and file size, especially in scenarios where storage or bandwidth is limited.
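As an illustration of these two options, the Python sketch below validates the format and quality values locally before they are added to a request. The helper name image_format_params is hypothetical, and rejecting a quality value for PNG is a choice made in this sketch, based on the statement that quality applies when using JPEG.

```python
def image_format_params(image_format="png", quality=None):
    """Return screenshot_format and screenshot_quality query parameters."""
    if image_format not in ("png", "jpeg"):
        raise ValueError("screenshot_format must be 'png' or 'jpeg'")
    params = {"screenshot_format": image_format}
    if quality is not None:
        if image_format != "jpeg":
            raise ValueError("screenshot_quality applies only to JPEG")
        if not 1 <= quality <= 100:
            raise ValueError("screenshot_quality must be between 1 and 100")
        params["screenshot_quality"] = str(quality)
    return params


# A compressed JPEG, for example to keep a full-page capture small.
print(image_format_params("jpeg", quality=60))
```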
Download Files and Pictures
ZenRows® will download images, PDFs or any type of file. Instead of reading the response’s content as text, you can store it directly in a file.
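A minimal Python sketch of storing the response directly in a file follows. It copies the response body to disk in chunks instead of decoding it as text. The endpoint, the apikey and url parameter names, the placeholder key, and the helper names are assumptions not taken from this section.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and placeholder key; replace both with your own values.
API_ENDPOINT = "https://api.zenrows.com/v1/"
API_KEY = "YOUR_ZENROWS_API_KEY"


def save_stream(source, path, chunk_size=64 * 1024):
    """Copy a binary file-like object to path in chunks."""
    with open(path, "wb") as target:
        shutil.copyfileobj(source, target, chunk_size)


def download_file(file_url, path, timeout=120):
    """Request file_url through the API (network call) and store the body at path."""
    params = {"apikey": API_KEY, "url": file_url}
    with urlopen(f"{API_ENDPOINT}?{urlencode(params)}", timeout=timeout) as response:
        save_stream(response, path)


# Example call, commented out because it needs a valid key and network access:
# download_file("https://www.example.com/logo.png", "logo.png")
```

Copying in chunks keeps memory use flat for large files, and binary mode preserves images and PDFs byte for byte.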