This guide walks you through advanced CSS selector strategies to help you extract structured data from a wide variety of web layouts using the ZenRows API.
Basic API Call Structure
All examples use the following basic API call pattern with the ZenRows API:
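const axios = require('axios');

// In a CommonJS script, await is only valid inside an async function
async function run() {
  const response = await axios.get('https://api.zenrows.com/v1/', {
    params: {
      apikey: 'YOUR_ZENROWS_API_KEY',
      url: 'https://example.com',
      js_render: true, // Optional, needed for JavaScript-rendered content
      css_extractor: JSON.stringify({
        // Your selectors here
      })
    }
  });
  console.log(response.data); // Extracted data, keyed by the names used in css_extractor
}

run().catch(console.error);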
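Since every example in this guide changes only the selector map, the repeated parameters can be factored into a small helper. The following is a minimal sketch: `buildParams` is a hypothetical name, not part of the ZenRows API, and it only assembles the query parameters already shown above.

```javascript
// Hypothetical helper: assembles the query parameters for a css_extractor request.
function buildParams(apiKey, targetUrl, selectors, { jsRender = false } = {}) {
  const params = {
    apikey: apiKey,
    url: targetUrl,
    // The selector map is sent as a JSON string, not as a nested object
    css_extractor: JSON.stringify(selectors),
  };
  // Only add js_render when it was explicitly requested
  if (jsRender) params.js_render = true;
  return params;
}

const params = buildParams('YOUR_ZENROWS_API_KEY', 'https://example.com', {
  headings: 'h1',
  linkUrls: 'a @href',
});
console.log(params);
```

The resulting object can be passed as the `params` option of `axios.get`, so each later example only needs to supply its own selector map.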
Essential CSS Selector Techniques
Basic Selectors
Target elements by their tag names, classes, or IDs:
css_extractor: JSON.stringify({
  headings: "h1", // Select all <h1> elements
  products: ".product", // Select elements with the 'product' class
  mainContent: "#main-content", // Select the element with the 'main-content' ID
  productTitles: ".product h2.title", // Select <h2> elements with the 'title' class inside 'product' class
  topLevelNav: "nav > a" // Select direct child <a> elements of <nav>
})
Attribute Selectors
Extract content based on HTML attributes:
css_extractor: JSON.stringify({
  imageUrls: "img @src", // Extract 'src' attribute from <img> tags
  linkUrls: "a @href", // Extract 'href' attribute from <a> tags
  premiumItems: "[data-premium='true']", // Select elements with a specific attribute value
  externalLinks: "[href^='https://'] @href", // Extract 'href' starting with 'https://'
  pdfDownloads: "[href$='.pdf'] @href" // Extract 'href' ending with '.pdf'
})
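Extracted `href` and `src` values are often relative paths, so it can help to resolve them against the page URL before using them. The sketch below assumes the extractor returns attribute values as written in the HTML, either as a single string or as an array. `toAbsoluteUrls` is a hypothetical helper name, and the sample values are made up for illustration.

```javascript
// Hypothetical helper: resolves extracted attribute values against the page URL.
// Accepts a single string, an array of strings, or a missing value.
function toAbsoluteUrls(values, pageUrl) {
  const list = values == null ? [] : [].concat(values);
  const resolved = [];
  for (const value of list) {
    try {
      // The WHATWG URL constructor resolves relative references against a base
      resolved.push(new URL(value, pageUrl).href);
    } catch {
      // Skip values that cannot be parsed as URLs
    }
  }
  return resolved;
}

// Illustrative sample values, as a key such as linkUrls might return them
const links = toAbsoluteUrls(
  ['/docs/guide.pdf', 'https://other.example/a', '#top'],
  'https://example.com/products/'
);
console.log(links);
```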
Positional Selectors
Target elements based on their position in the document:
css_extractor: JSON.stringify({
  firstProduct: ".product:first-child", // Selects 'product' elements that are the first child of their parent
  lastProduct: ".product:last-child", // Selects 'product' elements that are the last child of their parent
  thirdProduct: ".product:nth-child(3)", // Selects 'product' elements that are the third child of their parent
  evenProducts: ".product:nth-child(even)", // Selects 'product' elements at even positions among their siblings
  oddProducts: ".product:nth-child(odd)", // Selects 'product' elements at odd positions among their siblings
  firstHeading: "h2:nth-of-type(1)", // Selects each <h2> that is the first <h2> among its siblings
  tableHeaders: "table th", // Selects all <th> elements inside any <table>
  secondColumnCells: "tr td:nth-child(2)" // Selects <td> elements that are the second child of their row
})
Combining Multiple Selectors
Combine selectors for more specific targeting:
css_extractor: JSON.stringify({
  headingsAndLinks: "h1, h2, a", // Multiple selectors with commas
  cardTitles: ".card .title", // Descendant combinator (space)
  directListItems: "ul > li", // Child combinator (>)
  labelValues: "label + input @value", // Adjacent sibling combinator (+)
  relatedItems: ".main-item ~ .related-item" // General sibling combinator (~)
})
Extract product details from a listing page:
css_extractor: JSON.stringify({
  productNames: ".product-item .product-title", // Selects the product title within each product item
  productPrices: ".product-item .price", // Selects the price element within each product item
  productRatings: ".product-item .rating @data-score", // Extracts the 'data-score' attribute from the rating element
  productImages: ".product-item img.product-image @src", // Extracts the 'src' attribute from the product image
  productUrls: ".product-item a.product-link @href", // Extracts the 'href' attribute from the product link
  productAvailability: ".product-item .availability-badge", // Selects the availability badge within each product item
  productDiscounts: ".product-item .discount-tag" // Selects the discount tag element within each product item
})
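Each key in a selector map like the one above yields its own list of matches, so building one record per product means combining those parallel lists by index. The sketch below assumes each key holds matches in document order, with a single match possibly arriving as a plain string. `zipColumns` and the sample data are hypothetical.

```javascript
// Hypothetical helper: combines parallel result lists into one record per index.
// fieldMap maps the record field name to the key used in css_extractor.
function zipColumns(result, fieldMap) {
  const columns = Object.entries(fieldMap).map(([field, key]) => [
    field,
    [].concat(result[key] ?? []), // normalize string / array / missing to an array
  ]);
  const length = Math.max(0, ...columns.map(([, values]) => values.length));
  // If lists differ in length, some items lacked a field and indexes may not line up
  const aligned = columns.every(([, values]) => values.length === length);
  const records = Array.from({ length }, (_, i) =>
    Object.fromEntries(columns.map(([field, values]) => [field, values[i] ?? null]))
  );
  return { records, aligned };
}

// Illustrative sample: two products, only one discount tag on the page
const sample = {
  productNames: ['Laptop', 'Phone'],
  productPrices: ['$999', '$599'],
  productDiscounts: '10% off',
};
const { records, aligned } = zipColumns(sample, {
  name: 'productNames',
  price: 'productPrices',
  discount: 'productDiscounts',
});
console.log(aligned, records);
```

When `aligned` is false, optional elements such as discount tags cannot be reliably attributed to a product by index. In that case it is safer to treat that field as unreliable, or to fetch the HTML and parse each product item separately.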
Product Specification Tables
Extracting structured data from specification tables:
css_extractor: JSON.stringify({
  specLabels: ".specs-table tr td:first-child", // Spec labels (first cell of each row)
  specValues: ".specs-table tr td:last-child", // Spec values (last cell of each row)
  processor: ".tech-specs .processor", // Processor details
  memory: ".tech-specs .memory", // Memory details
  storage: ".tech-specs .storage", // Storage details
  graphics: ".tech-specs .graphics" // Graphics details
})
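The `specLabels` and `specValues` lists can be paired into a single lookup object. This is a minimal sketch that assumes both lists arrive in row order. `pairSpecs` is a hypothetical helper, and the sample labels and values are invented.

```javascript
// Hypothetical helper: pairs label and value lists into a { label: value } object.
function pairSpecs(labels, values) {
  const keys = [].concat(labels ?? []);
  const vals = [].concat(values ?? []);
  // A length mismatch means the two selectors did not match the same rows
  if (keys.length !== vals.length) {
    throw new Error(`Label/value count mismatch: ${keys.length} vs ${vals.length}`);
  }
  return Object.fromEntries(
    // Trim whitespace and drop a trailing colon from labels such as "Memory:"
    keys.map((label, i) => [label.trim().replace(/:$/, ''), vals[i].trim()])
  );
}

const specs = pairSpecs(['Processor:', 'Memory:'], [' 8-core CPU ', '16 GB']);
console.log(specs);
```

Note that in a row with only one cell, `td:first-child` and `td:last-child` match the same element, so such rows produce a label that equals its value.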
Real Estate Listings
Extracting property information:
css_extractor: JSON.stringify({
  propertyAddresses: ".property-listing .address", // Property addresses
  propertyPrices: ".property-listing .price", // Prices
  propertyBedrooms: ".property-listing .bedrooms", // Number of bedrooms
  propertyBathrooms: ".property-listing .bathrooms", // Number of bathrooms
  propertyArea: ".property-listing .square-footage", // Square footage
  propertyTypes: ".property-listing .property-type", // Property type
  propertyAgents: ".property-listing .agent-name", // Agent names
  propertyImages: ".property-listing .property-image @src" // Image URLs
})
News Articles and Blog Posts
Extracting content from articles:
css_extractor: JSON.stringify({
  articleTitle: "article h1", // Article title
  articleSubtitle: "article h2", // Article subtitle
  articleDate: "article .publication-date", // Publication date
  articleAuthor: "article .author-name", // Author name
  articleContent: "article .content p", // Article content
  articleCategories: "article .category-tag", // Categories
  articleImages: "article .article-image @src", // Image URLs
  relatedArticles: ".related-articles .article-link @href" // Related article links
})
Advanced Selection Techniques
Identify and extract pagination information:
css_extractor: JSON.stringify({
  currentPage: ".pagination .current @data-page", // Current page number
  totalPages: ".pagination @data-total-pages", // Total number of pages
  nextPageUrl: ".pagination .next @href", // Next page URL
  prevPageUrl: ".pagination .prev @href", // Previous page URL
  pageNumbers: ".pagination .page-number", // All page numbers
  isLastPage: ".pagination .next @disabled" // Check if it's the last page
})
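To follow pagination, the extracted `nextPageUrl` value has to be turned into the next URL to request. The sketch below relies only on whether a next link was extracted, since how a boolean attribute such as `disabled` appears in the result is not specified here. `getNextPageUrl` is a hypothetical helper and the URLs are illustrative.

```javascript
// Hypothetical helper: returns the absolute URL of the next page, or null when there is none.
function getNextPageUrl(result, currentUrl) {
  // Take the first match if several "next" links were extracted
  const href = [].concat(result.nextPageUrl ?? [])[0];
  if (!href) return null; // No next link was extracted

  const next = new URL(href, currentUrl);
  const current = new URL(currentUrl);
  // Ignore fragments so that placeholder links such as "#" count as "no next page"
  next.hash = '';
  current.hash = '';
  return next.href === current.href ? null : next.href;
}

const here = 'https://example.com/products?page=2';
console.log(getNextPageUrl({ nextPageUrl: '?page=3' }, here));
console.log(getNextPageUrl({ nextPageUrl: '#' }, here));
console.log(getNextPageUrl({}, here));
```

A crawl loop can call this after each request and stop when it returns `null`. Adding a maximum page count as well guards against sites whose last page keeps linking forward.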
Extract multi-level navigation structures:
css_extractor: JSON.stringify({
  // Main navigation
  mainNavLinks: ".main-nav > li > a @href",
  mainNavText: ".main-nav > li > a",
  // Second level categories
  subNavLinks: ".main-nav > li > .dropdown > a @href",
  subNavText: ".main-nav > li > .dropdown > a",
  // Third level
  deepNavLinks: ".main-nav > li > .dropdown > .sub-dropdown > a @href"
})
Social Media Content
Extract content from social media-style layouts:
css_extractor: JSON.stringify({
  postAuthors: ".post .author-name",
  postTimestamps: ".post .timestamp",
  postContent: ".post .content-text",
  postImages: ".post .post-image @src",
  postLikes: ".post .like-count",
  postComments: ".post .comment-count",
  postShares: ".post .share-count",
  // Comments
  commentAuthors: ".comments .comment .author",
  commentContent: ".comments .comment .text",
  commentTimestamps: ".comments .comment .time"
})
Extract structured data from tables:
css_extractor: JSON.stringify({
  // Table headers
  tableHeaders: "table thead th",
  // First column (often labels)
  rowLabels: "table tbody tr td:first-child",
  // Specific cells using nth-child
  secondColValues: "table tbody tr td:nth-child(2)",
  thirdColValues: "table tbody tr td:nth-child(3)",
  // Cell with specific data attributes
  highlightedCells: "table td[data-highlight='true']"
})
Extract form field values and attributes:
css_extractor: JSON.stringify({
  formLabels: "form label", // Form labels
  inputValues: "form input @value", // 'value' attributes of input fields
  inputPlaceholders: "form input @placeholder", // Input placeholders
  selectedOptions: "form select option[selected]", // Options marked 'selected' in dropdowns
  checkboxStatus: "form input[type='checkbox'] @checked", // 'checked' attribute of checkboxes (present when checked in the HTML)
  radioStatus: "form input[type='radio'] @checked", // 'checked' attribute of radio buttons (present when checked in the HTML)
  formActionUrl: "form @action", // Form action URL
  formMethod: "form @method" // Form method (GET/POST)
})
Extract metadata from the HTML <head>:
css_extractor: JSON.stringify({
  pageTitle: "title", // Page title
  metaDescription: "meta[name='description'] @content", // Meta description
  canonicalUrl: "link[rel='canonical'] @href", // Canonical URL
  ogTitle: "meta[property='og:title'] @content", // Open Graph title
  ogImage: "meta[property='og:image'] @content", // Open Graph image
  ogDescription: "meta[property='og:description'] @content", // Open Graph description
  twitterCard: "meta[name='twitter:card'] @content" // Twitter card type
})
Troubleshooting Selectors
When your selectors aren’t working as expected, try these approaches:
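- Make Selectors More Specific
  // Too general
  { title: ".title" }
  // More specific
  { title: "article .main-content .title" }
- Check for iframes
  Content inside an <iframe> belongs to a separate document, so selectors run against the parent page will not match it and it requires additional handling.
- Handle Special Characters
  // Escape special characters in class names (here the class is 'price-$')
  { price: ".price-\\$" }
- Use Developer Tools to Verify
  Always test your selectors in the browser developer tools first, for example with document.querySelectorAll('.product') in the Console. Remove any trailing @attribute part before testing, since that suffix is extractor syntax rather than CSS.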
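When a selector map is large, it helps to see at a glance which keys came back empty before applying the steps above to each one. The sketch below treats a missing key, an empty string, and an empty array as "no match". How the API actually represents a non-match is an assumption here, and `findEmptyKeys` is a hypothetical helper.

```javascript
// Hypothetical helper: lists the selector keys whose result is missing or empty.
function findEmptyKeys(selectors, result) {
  return Object.keys(selectors).filter((key) => {
    const value = result[key];
    if (value == null) return true; // key absent, null, or undefined
    if (Array.isArray(value)) return value.length === 0; // no matches in the list
    return String(value).trim() === ''; // blank text
  });
}

// Illustrative selector map and response
const selectors = { title: '.title', price: '.price-\\$', rating: '.rating' };
const result = { title: 'Widget', price: [], rating: '' };

for (const key of findEmptyKeys(selectors, result)) {
  console.warn(`No match for "${key}" using selector: ${selectors[key]}`);
}
```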
Testing Workflow
We recommend this workflow for developing and testing selectors:
- Test the selector in the browser using DevTools
- Extract Full HTML First
  const htmlResponse = await axios.get('https://api.zenrows.com/v1/', {
    params: {
      apikey: 'YOUR_ZENROWS_API_KEY',
      url: 'https://example.com',
      js_render: true
    }
  });
- Test Selectors Locally with Cheerio
  const cheerio = require('cheerio');
  const $ = cheerio.load(htmlResponse.data);
  console.log($('h1.product-title').text()); // Test selector
- Refine and Apply CSS Selectors with ZenRows
  const extractedData = await axios.get('https://api.zenrows.com/v1/', {
    params: {
      apikey: 'YOUR_ZENROWS_API_KEY',
      url: 'https://example.com',
      js_render: true,
      css_extractor: JSON.stringify({
        // Your refined selectors
      })
    }
  });
By following these techniques, you can effectively extract data from even the most complex web layouts using ZenRows and CSS selectors.