This guide will teach you how to use advanced CSS selector techniques to extract specific data from challenging websites. Whether dealing with dynamic content or intricate page structures, these strategies will help you scrape data with precision.
Beyond Basic Selectors
While simple selectors like .class and #id work well for simple tasks, complex websites often require more sophisticated approaches. Advanced CSS selectors allow you to:
- Target elements with specific attributes or patterns.
- Combine multiple conditions for greater accuracy.
- Extract data based on element relationships.
Selector Types and Examples
The selector types below help when a page lacks convenient classes and exposes data attributes or dynamic IDs instead:
Selector Type | Example | Description |
---|---|---|
Attribute Contains | [attr*="value"] | Selects elements with an attribute containing “value” |
Attribute Starts With | [attr^="value"] | Selects elements with an attribute starting with “value” |
Attribute Ends With | [attr$="value"] | Selects elements with an attribute ending with “value” |
Not Selector | :not(selector) | Excludes elements that match the selector |
Nth-child | :nth-child(n) | Selects the nth child of its parent |
Nth-of-type | :nth-of-type(n) | Selects the nth sibling of its type |
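To make the three attribute operators in the table concrete, here is a minimal Python sketch of how each one compares an attribute value with the quoted string. The helper name attr_matches is hypothetical and exists only for illustration; a real selector engine also handles details (such as case-sensitivity flags) that this sketch ignores.

```python
def attr_matches(value, operator, needle):
    """Return True if an attribute value satisfies a CSS attribute operator.

    Hypothetical helper for illustration only. `value` is the attribute's
    value (None if the attribute is missing), `operator` is one of
    "=", "*=", "^=", "$=", and `needle` is the quoted string in the selector.
    """
    if value is None:
        return False  # the element does not carry the attribute at all
    if operator == "=":
        return value == needle
    if needle == "":
        return False  # an empty string never matches for *=, ^= or $=
    if operator == "*=":
        return needle in value            # [attr*="value"]: contains
    if operator == "^=":
        return value.startswith(needle)   # [attr^="value"]: starts with
    if operator == "$=":
        return value.endswith(needle)     # [attr$="value"]: ends with
    raise ValueError(f"unsupported operator: {operator}")


src = "/images/hoodie-blue.jpg"
print(attr_matches(src, "$=", ".jpg"))     # img[src$='.jpg'] would match
print(attr_matches(src, "^=", "/images"))  # img[src^='/images'] would match
print(attr_matches(src, "*=", "shorts"))   # img[src*='shorts'] would not
```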
Attribute-Based Selection
Many websites use dynamic IDs or data attributes instead of simple classes. Here’s how you can target these elements:
# Import the necessary libraries
import json
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'

# Build the extractor as a dict so the comments stay in Python
# and json.dumps() produces a valid JSON string
css_extractor = {
    # Selects product names within <a> tags that have a data-product_id attribute
    'products_with_data_id': 'a[data-product_id] .product-name',
    # Selects the product name where SKU equals 'MS01'
    'products_with_sku_ms01': "a[data-product_sku='MS01'] .product-name",
    # Extracts image URLs that end with .jpg
    'image_jpg': "img[src$='.jpg'] @src",
    # Extracts the link for 'Add to Cart' buttons
    'add_to_cart_buttons': 'a.button.add_to_cart_button @href',
    # Selects product names in list items belonging to the 'shorts' category
    'shorts_only': "li[class*='product_cat-shorts'] .product-name",
}

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
    'css_extractor': json.dumps(css_extractor),
}

# Send the request and print the extracted data
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
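Once a request like the one above returns, the extracted values still have to be read. The sketch below assumes the response body is a JSON object keyed by the names used in css_extractor, and that a key may hold either a single string or a list of strings depending on how many elements matched. Both points are assumptions to check against a real response. The sample payload and the helper name as_list are made up for illustration.

```python
import json


def as_list(value):
    """Normalize an extractor value so callers can always iterate over it."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical response body, not real output from any website
sample_body = json.dumps({
    "products_with_sku_ms01": "Example Product",
    "image_jpg": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})

data = json.loads(sample_body)
results = {key: as_list(value) for key, value in data.items()}

for key, values in results.items():
    print(f"{key}: {len(values)} match(es)")
```

Normalizing up front means the rest of the scraper never has to branch on whether a selector matched one element or several.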
Combinatorial Selectors
Combine multiple conditions to pinpoint specific elements:
// Complex selector combinations
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/search',
  // css_extractor is sent as a JSON string
  css_extractor: JSON.stringify({
    // Items that are on sale AND in stock
    available_sale_items: '.product[data-sale="true"]:not([data-inventory="0"])',
    // Every third featured product
    featured_third: '.featured-products > div:nth-child(3n)',
    // First paragraph in each section except the intro
    first_paragraphs: 'section:not(#intro) p:first-of-type'
  })
});
// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
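The :nth-child(3n) selector used above is a case of the general An+B pattern: a child at 1-based position i matches when i = A·n + B for some integer n ≥ 0. The following Python sketch shows that rule on its own; the function name nth_matches is hypothetical, and the code only models the position test, not a full selector engine.

```python
def nth_matches(a, b, index):
    """Return True if a 1-based child index satisfies the An+B pattern.

    Hypothetical helper: `a` and `b` are the integers from :nth-child(An+B),
    and `index` is the element's position among its siblings (starting at 1).
    """
    if a == 0:
        return index == b          # :nth-child(3) matches only position 3
    n, remainder = divmod(index - b, a)
    return remainder == 0 and n >= 0


# :nth-child(3n) -> a=3, b=0
print([i for i in range(1, 10) if nth_matches(3, 0, i)])   # [3, 6, 9]
# :nth-child(2n+1), the same as "odd" -> a=2, b=1
print([i for i in range(1, 10) if nth_matches(2, 1, i)])   # [1, 3, 5, 7, 9]
# :nth-child(-n+3), the first three children -> a=-1, b=3
print([i for i in range(1, 10) if nth_matches(-1, 3, i)])  # [1, 2, 3]
```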
Selecting by Relationships
Use sibling and parent-child relationships to locate elements:
Selector | Syntax | Description |
---|---|---|
Adjacent | A + B → h2 + p | Select p immediately after an h2 |
General Sibling | A ~ B → .a ~ .b | Select all .b siblings after .a |
Direct Child | A > B → ul > li | Select li that is a direct child of ul |
// Relationship-based selectors
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/blog',
  // css_extractor is sent as a JSON string
  css_extractor: JSON.stringify({
    // Extract the paragraph immediately following each heading
    heading_descriptions: 'h2 + p',
    // Get all list items after the "featured" item
    items_after_featured: '.featured ~ li',
    // Extract labels that are direct children of form fields
    form_labels: '.form-field > label'
  })
});
// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
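The difference between the adjacent (+) and general (~) sibling combinators is easy to misread, so here is a small Python sketch that reproduces their matching rules on a sample fragment using only the standard library. The markup and the helper names are invented for illustration, and the helpers match on tag names only, which is far less than a real selector engine does.

```python
import xml.etree.ElementTree as ET

# Invented sample markup
SAMPLE = """
<article>
  <h2>First heading</h2>
  <p>Intro for first</p>
  <p>More text</p>
  <h2>Second heading</h2>
  <div>Not a paragraph</div>
  <p>Later paragraph</p>
</article>
""".strip()


def adjacent_siblings(parent, first_tag, second_tag):
    """Model `first_tag + second_tag`: second_tag directly after first_tag."""
    children = list(parent)
    return [
        nxt
        for prev, nxt in zip(children, children[1:])
        if prev.tag == first_tag and nxt.tag == second_tag
    ]


def general_siblings(parent, first_tag, second_tag):
    """Model `first_tag ~ second_tag`: any second_tag after a first_tag."""
    seen_first = False
    matches = []
    for child in parent:
        if seen_first and child.tag == second_tag:
            matches.append(child)
        if child.tag == first_tag:
            seen_first = True
    return matches


root = ET.fromstring(SAMPLE)
print([p.text for p in adjacent_siblings(root, "h2", "p")])
# ['Intro for first']
print([p.text for p in general_siblings(root, "h2", "p")])
# ['Intro for first', 'More text', 'Later paragraph']
```

Note that the second h2 contributes nothing to h2 + p, because a div sits between it and the next paragraph.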
Dynamic Content Selection
When dealing with dynamic or JavaScript-rendered content, enable js_render and use flexible selectors:
// Adapting to dynamic content
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/dynamic-page',
  js_render: 'true',
  // css_extractor is sent as a JSON string
  css_extractor: JSON.stringify({
    // Multiple potential selectors for the same data type
    prices: [
      '.price-new',
      '[data-testid="price"]',
      '.product-price .amount',
      'span.current-price'
    ].join(', '),
    // Get all elements with a specific text pattern.
    // Note: :contains() is a non-standard (jQuery-style) pseudo-class;
    // confirm the extractor supports it before relying on it.
    availability_elements: [
      '[data-availability]',
      ':not(script):not(style):contains("In Stock")',
      ':not(script):not(style):contains("Available")'
    ].join(', ') // A comma-separated list matches elements that satisfy any of the selectors
  })
});
// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
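A comma-joined selector list returns every element that matches any of the selectors, so it cannot tell you which selector produced a result. If you need an ordered fallback, one option is to request each candidate under its own key and pick the first non-empty result in client code. The Python sketch below illustrates that idea. The key naming scheme (prices_0, prices_1, ...), the helper names, and the sample response are all assumptions for illustration, as is the premise that the response is a JSON object keyed by extractor name.

```python
import json

PRICE_SELECTORS = [
    ".price-new",
    '[data-testid="price"]',
    ".product-price .amount",
    "span.current-price",
]


def build_extractor(selectors, prefix):
    """Give each candidate selector its own key: prefix_0, prefix_1, ..."""
    return {f"{prefix}_{i}": selector for i, selector in enumerate(selectors)}


def first_non_empty(data, prefix, count):
    """Return the value of the first candidate key that produced a result."""
    for i in range(count):
        value = data.get(f"{prefix}_{i}")
        if value:  # skips missing keys, empty strings and empty lists
            return value
    return None


# Value to pass as the css_extractor request parameter
css_extractor_param = json.dumps(build_extractor(PRICE_SELECTORS, "prices"))

# Hypothetical parsed response: only the second selector matched
sample_response = {"prices_0": [], "prices_1": ["$24.00", "$31.50"]}
print(first_non_empty(sample_response, "prices", len(PRICE_SELECTORS)))
```

The trade-off is a larger extractor payload in exchange for knowing exactly which selector is still working on the target page.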
Debugging Selectors
When selectors don’t work as expected:
- Inspect the Full HTML: Use ZenRows with js_render: true to see what the DOM actually contains.
- Start Broad, Then Narrow Down:
  // Start general and get more specific
  css_extractor: {
    all_elements: '*', // Get everything to inspect structure
    possible_containers: 'div, section, article', // All potential containers
    // CSS has no wildcard for attribute names, so list the data attributes you expect
    with_attributes: '[data-product_id], [data-product_sku]'
  }
- Use Text-Based Debugging: Find elements by their text content:
js_instructions: [
'const findElementsByText = (text) => {',
' const walker = document.createTreeWalker(',
' document.body,',
' NodeFilter.SHOW_TEXT,',
' { acceptNode: node => node.textContent.includes(text) ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT }',
' );',
' const results = [];',
' let node;',
' while (node = walker.nextNode()) {',
' results.push({',
' text: node.textContent.trim(),',
' path: getNodePath(node.parentElement)',
' });',
' }',
' ',
' function getNodePath(el) {',
' const path = [];',
' while (el && el.nodeType === Node.ELEMENT_NODE) {',
' let selector = el.nodeName.toLowerCase();',
' if (el.id) selector += `#${el.id}`;',
' else if (el.className) selector += `.${Array.from(el.classList).join(".")}`;',
' path.unshift(selector);',
' el = el.parentElement;',
' }',
' return path.join(" > ");',
' }',
' ',
' // Create a visible element with results',
' const debugDiv = document.createElement("div");',
' debugDiv.id = "zenrows-debug-results";',
' debugDiv.setAttribute("data-results", JSON.stringify(results));',
' debugDiv.style.display = "none";',
' document.body.appendChild(debugDiv);',
'};',
'findElementsByText("Price");' // Replace with text you're looking for
]
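The script above stores its findings as JSON in the data-results attribute of a hidden div with the id zenrows-debug-results. Assuming the response you receive is the rendered HTML after the script has run (an assumption to verify for your setup), a standard-library Python sketch for reading those results back could look like this. The class name and the sample HTML are invented for illustration.

```python
import json
from html.parser import HTMLParser


class DebugResultsParser(HTMLParser):
    """Collect the JSON stored on the #zenrows-debug-results element."""

    def __init__(self):
        super().__init__()
        self.results = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == "zenrows-debug-results":
            # HTMLParser has already unescaped entities in attribute values
            self.results = json.loads(attributes.get("data-results") or "[]")


# Invented sample of what the rendered page might contain
sample_html = (
    '<html><body><div id="zenrows-debug-results" style="display: none;" '
    'data-results="[{&quot;text&quot;: &quot;Price: $24&quot;, '
    '&quot;path&quot;: &quot;div.product &gt; span.price&quot;}]">'
    "</div></body></html>"
)

parser = DebugResultsParser()
parser.feed(sample_html)
for hit in parser.results:
    print(hit["path"], "->", hit["text"])
```

Each reported path is a ready-made starting point for a selector, which you can then shorten to the most stable class or ID in the chain.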
Optimize your selectors for both accuracy and performance:
- Avoid Universal Selectors: * is slow; use more specific selectors. Use class (.class) and ID (#id) selectors over attribute selectors for speed.
- Minimize Selector Depth: .product-grid .product .title is faster than body div.container div.products div.product-grid div.product div.title
- Prefer ID and Class Selectors: #product-123 is faster than [data-product-id="123"]
- Avoid Descendant Selectors When Possible: Child (>) and adjacent (+) selectors are faster than descendant selectors (space)
CSS Selector Cheat Sheet
Selector | Purpose | Example |
---|---|---|
element | Select by tag | div, span, h1 |
.class | Select by class | .product, .price |
#id | Select by ID | #main, #product-123 |
[attr] | Has attribute | [data-id] |
[attr="val"] | Exact attribute | [type="submit"] |
[attr*="val"] | Contains value | [href*="product"] |
[attr^="val"] | Starts with value | [class^="product-"] |
[attr$="val"] | Ends with value | [src$=".jpg"] |
:nth-child(n) | By position | li:nth-child(2) |
:first-child | First child | li:first-child |
:last-child | Last child | li:last-child |
:not(selector) | Negation | .item:not(.featured) |
A > B | Direct child | .product > .title |
A + B | Adjacent sibling | h2 + p |
A ~ B | General sibling | .featured ~ .product |
A, B | Multiple selectors | .price, .discount |
A B | Descendant | .product .price |
Use these advanced CSS selector techniques to create precise data extraction patterns for even the most complex websites.