Add `&css_extractor={"links":"a @href"}` to the request to use this feature.
Here are some examples:
| Extraction Rules | Sample HTML | Extracted Value | JSON Output |
|---|---|---|---|
| `{"divs":"div"}` | `<div>text0</div>` | text | `{"divs": "text0"}` |
| `{"divs":"div"}` | `<div>text1</div><div>text2</div>` | text | `{"divs": ["text1", "text2"]}` |
| `{"links":"a @href"}` | `<a href="#register">Register</a>` | href attribute | `{"links": "#register"}` |
| `{"hidden":"input[type=hidden] @value"}` | `<input type="hidden" name="_token" value="f23g23g.b9u1bg91g.zv97" />` | value attribute | `{"hidden": "f23g23g.b9u1bg91g.zv97"}` |
| `{"class":"button.submit @data-v"}` | `<button class="submit" data-v="register-user">click</button>` | data-v attribute of elements with the submit class | `{"class": "register-user"}` |
| `{"emails":"a[href^='mailto:'] @href"}` | `<a href="mailto:test1@domain.com">email 1</a><a href="mailto:test2@domain.com">email 2</a>` | href attribute of links starting with mailto: | `{"emails": ["test1@domain.com", "test2@domain.com"]}` |
| `{"id":"#my-id"}` | `<div id="my-id">Content here</div>` | content of the element with the given id | `{"id": "Content here"}` |
| `{"links":"a[id='register-link'] @href"}` | `<a id="register-link" href="#signup">Sign up</a>` | href attribute of the element with the given id | `{"links": "#signup"}` |
| `{"xpath":"//h1"}` | `<h1>Welcome</h1>` | text extracted using XPath | `{"xpath": "Welcome"}` |
| `{"xpath":"//img @src"}` | `<img src="image.png" alt="image description" />` | src attribute extracted using XPath | `{"xpath": "image.png"}` |
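For instance, the first `links` rule above can be sent like this in Python (a minimal sketch, assuming the standard ZenRows endpoint; `YOUR_API_KEY` and the target URL are placeholders):

```python
# Minimal sketch: pass CSS extraction rules as a JSON string.
import json
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    # Extract the href attribute of every <a> tag.
    "css_extractor": json.dumps({"links": "a @href"}),
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.json())  # e.g. {"links": ["#register", ...]}
```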
Learn more about the `autoparse` feature on the dedicated page: What Is Autoparse? Add `&autoparse=true` to the request to use this feature.
The `outputs` parameter lets you specify which data types to extract from the scraped HTML. This allows you to efficiently retrieve only the data types you're interested in, reducing processing time and focusing on the most relevant information.
The parameter accepts a comma-separated list of filter names and returns the results in a structured JSON format. Use `outputs=*` to retrieve all available data types.
These are the filters you can use with the `outputs` parameter (see the example request after the list):
- `emails`: Extracts email addresses, including standard ones like `example@example.com` and obfuscated versions like `example[at]example.com`. Example: `outputs=emails`
- `phone_numbers`: Extracts phone numbers, including those using the `tel:` protocol. Example: `outputs=phone_numbers`
- `headings`: Extracts all HTML headings, `h1` through `h6`. Example: `outputs=headings`
- `images`: Extracts `img` tags. Only the `src` attribute is returned. Example: `outputs=images`
- `audios`: Extracts `source` elements inside `audio` tags. Only the `src` attribute is returned. Example: `outputs=audios`
- `videos`: Extracts `source` elements inside `video` tags. Only the `src` attribute is returned. Example: `outputs=videos`
- `links`: Extracts `a` tags. Only the `href` attribute is returned. Example: `outputs=links`
- `menus`: Extracts `li` elements inside `menu` tags. Example: `outputs=menus`
- `hashtags`: Extracts hashtags like `#example`. Example: `outputs=hashtags`
- `metadata`: Extracts `meta` tags inside the `head` section. Returns the `name` and `content` attributes in the format `name: content`. Example: `outputs=metadata`
- `tables`: Extracts `table` elements and returns the table data in JSON format, including dimensions, headings, and content. Example: `outputs=tables`
- `favicon`: Extracts the favicon `link` element in the `head` section of the HTML. Example: `outputs=favicon`
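For example, a request retrieving only emails and links might look like this in Python (a minimal sketch, assuming the standard ZenRows endpoint; the API key and target URL are placeholders):

```python
# Minimal sketch: request only the "emails" and "links" data types.
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    "outputs": "emails,links",         # comma-separated list of filters
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.json())  # structured JSON keyed by data type
```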
By adding `response_type=markdown` to the request parameters, the ZenRows API will return the content in Markdown format, making it easier to read and work with, especially if you prefer Markdown's simplicity and readability over HTML.
Add `response_type=markdown` to the request.
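A minimal Python sketch, assuming the standard ZenRows endpoint (the API key and target URL are placeholders):

```python
# Minimal sketch: request the scraped content rendered as Markdown.
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    "response_type": "markdown",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)  # the page content as Markdown
```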
The `plaintext` feature is an output option that returns the scraped content as plain text instead of HTML or Markdown.
This feature can be helpful when you want a clean, unformatted version of the content without any HTML tags or Markdown formatting. It simplifies the content extraction process and makes processing or analyzing the text easier.
Add `response_type=plaintext` to the request. With the plaintext response feature, the ZenRows Universal Scraper API will convert the page's HTML content into plain text.
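As a rough illustration (the sample HTML and output in the comments below are hypothetical, not taken from the original docs):

```python
# Hypothetical illustration of the plaintext conversion.
# Given HTML like:  <h1>Welcome</h1><p>Sign up <a href="#register">here</a>.</p>
# the API returns roughly:  "Welcome Sign up here."
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    "response_type": "plaintext",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)  # clean text, no HTML tags or Markdown formatting
```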
To generate a PDF, include the `js_render=true` parameter along with `response_type=pdf` in your request. This instructs the API to generate a PDF file from the scraped content.
Since the response will be a `.pdf` file, you can save it using the following example in Python:
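The sketch below assumes the standard `https://api.zenrows.com/v1/` endpoint; `YOUR_API_KEY` and the target URL are placeholders.

```python
# Minimal sketch: request a page as PDF and save the binary response to disk.
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    "js_render": "true",               # required for PDF generation
    "response_type": "pdf",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
response.raise_for_status()

# The response body is the PDF file itself; write it in binary mode.
with open("page.pdf", "wb") as f:
    f.write(response.content)
```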
To capture a screenshot, add `screenshot=true` to the request. By default, the image will be in PNG format.

- `screenshot_fullpage=true` takes a full-page screenshot.
- `screenshot_selector=<CSS Selector>` takes a screenshot of the element given in the CSS Selector.

`screenshot_selector` and `screenshot_fullpage` are mutually exclusive. Additionally, JavaScript rendering (`js_render=true`) is required.
These screenshot features can be combined with other options like `wait`, `wait_for`, or `js_instructions` to ensure that the page or elements are fully loaded before capturing the image. When using `json_response`, the result will include a JSON object with the screenshot data encoded in base64, allowing for easy integration into your workflows.
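A minimal sketch combining these options, assuming the standard ZenRows endpoint; the API key, target URL, and `.content` selector are placeholders:

```python
# Minimal sketch: capture a full-page PNG screenshot once the page has loaded.
import requests

params = {
    "url": "https://www.example.com",  # placeholder target URL
    "apikey": "YOUR_API_KEY",          # placeholder API key
    "js_render": "true",               # JavaScript rendering is required
    "screenshot": "true",
    "screenshot_fullpage": "true",
    "wait_for": ".content",            # hypothetical selector to wait for
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
response.raise_for_status()

# Save the binary PNG response.
with open("screenshot.png", "wb") as f:
    f.write(response.content)
```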
Two more parameters fine-tune the output:

- `screenshot_format`: Choose between `png` and `jpeg` formats, with PNG being the default. PNG is great for high-quality images and transparency, while JPEG offers efficient compression.
- `screenshot_quality`: Applicable when using JPEG, this parameter allows you to set the quality from `1` to `100`. Useful for balancing image clarity and file size, especially in scenarios where storage or bandwidth is limited.

See also: Direct File Response and Triggered Downloads Using JS Instructions. Note that responses exceeding the maximum allowed size return a `413 Content Too Large` error.
**Can I use multiple response_type formats together?**

No. `response_type` formats like Markdown, Plain Text, and PDF cannot be used together.

**Why am I getting the original content instead of the specified response_type (Markdown, Plain Text, or PDF)?**

To convert a response to the specified `response_type` (Markdown, Plain Text, or PDF), we need to be able to parse the response as HTML. If we can't parse the response as HTML, we'll return the original response. When can this happen? When the response type is not `text/html` or when the response is not rendered.

**How do I control the image quality and format for screenshots?**

Choose the format with `screenshot_format`. For JPEGs, you can control the quality using `screenshot_quality`, with a value between `1` and `100`, to balance image clarity and file size.

**How can I ensure that dynamic pages are fully loaded before scraping?**

Enable JavaScript rendering (`js_render=true`) and pair it with parameters like `wait` or `wait_for`. This ensures that ZenRows waits until the necessary elements are present on the page before scraping.