This guide will teach you how to use advanced CSS selector techniques to extract specific data from challenging websites. Whether dealing with dynamic content or intricate page structures, these strategies will help you scrape data with precision.

Beyond Basic Selectors

While basic selectors like .class and #id handle simple tasks, complex websites often require more sophisticated approaches. Advanced CSS selectors allow you to:

  • Target elements with specific attributes or patterns.
  • Combine multiple conditions for greater accuracy.
  • Extract data based on element relationships.

Selector Types and Examples

Websites often don’t provide convenient classes but use data attributes or dynamic IDs:

| Selector Type | Example | Description |
| --- | --- | --- |
| Attribute Contains | `[attr*="value"]` | Selects elements with an attribute containing “value” |
| Attribute Starts With | `[attr^="value"]` | Selects elements with an attribute starting with “value” |
| Attribute Ends With | `[attr$="value"]` | Selects elements with an attribute ending with “value” |
| Not Selector | `:not(selector)` | Excludes elements that match the selector |
| Nth-child | `:nth-child(n)` | Selects the nth child of its parent |
| Nth-of-type | `:nth-of-type(n)` | Selects the nth sibling of its type |
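
To make the three attribute operators concrete, the following sketch mimics how each one compares an attribute value, using plain Python string methods. It only illustrates the matching rules; it is not how ZenRows or a browser implements them, and `attribute_matches` is a hypothetical helper introduced for this example.

```python
def attribute_matches(value: str, operator: str, needle: str) -> bool:
    """Mimic how CSS compares an attribute value for =, *=, ^= and $=."""
    if operator == "=":
        return value == needle
    if needle == "":
        # In CSS, the substring operators never match an empty string.
        return False
    if operator == "*=":
        return needle in value
    if operator == "^=":
        return value.startswith(needle)
    if operator == "$=":
        return value.endswith(needle)
    raise ValueError(f"unsupported operator: {operator}")


# Sample attribute values, made up for illustration.
print(attribute_matches("product type-product product_cat-shorts", "*=", "product_cat-shorts"))  # True
print(attribute_matches("/images/shirt.jpg", "$=", ".jpg"))  # True
print(attribute_matches("/images/shirt.jpg", "^=", "https://"))  # False
```

Note that `*=` matches anywhere in the value, which is why it suits long class strings where the interesting token sits in the middle.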

Attribute-Based Selection

Many websites use dynamic IDs or data attributes instead of simple classes. Here’s how you can target these elements:

# Import the necessary libraries
import json
import requests

url = 'https://www.scrapingcourse.com/ecommerce/'
apikey = 'YOUR_ZENROWS_API_KEY'

# Build the extraction rules as a dict so the comments stay out of the JSON string
css_extractor = {
    # Selects product names within <a> tags that have a data-product_id attribute
    'products_with_data_id': 'a[data-product_id] .product-name',
    # Selects the product name where SKU equals 'MS01'
    'products_with_sku_ms01': "a[data-product_sku='MS01'] .product-name",
    # Extracts image URLs that end with .jpg
    'image_jpg': "img[src$='.jpg'] @src",
    # Extracts the link for 'Add to Cart' buttons
    'add_to_cart_buttons': 'a.button.add_to_cart_button @href',
    # Selects product names in list items belonging to the 'shorts' category
    'shorts_only': "li[class*='product_cat-shorts'] .product-name",
}

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
    # Serialize the rules into the JSON string the API expects
    'css_extractor': json.dumps(css_extractor),
}

response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
You can find more details about the CSS extractor in the CSS Extractor documentation.
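
Because css_extractor travels as a JSON string, a stray comment or trailing comma is enough to make it invalid. The sketch below shows one way to catch that locally before sending a request. `validate_extractor` is a hypothetical helper, and the rule that every value must be a non-empty selector string is an assumption made for this example, not a documented API constraint.

```python
import json


def validate_extractor(raw: str) -> dict:
    """Parse a css_extractor string and check it maps names to selector strings."""
    rules = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(rules, dict):
        raise ValueError("css_extractor must be a JSON object")
    for name, selector in rules.items():
        if not isinstance(selector, str) or not selector.strip():
            raise ValueError(f"rule {name!r} must be a non-empty selector string")
    return rules


good = '{"image_jpg": "img[src$=\'.jpg\'] @src"}'
print(validate_extractor(good))

# A comment inside the string makes it invalid JSON.
commented = '{\n  # a comment\n  "image_jpg": "img @src"\n}'
try:
    validate_extractor(commented)
except ValueError as err:
    print("invalid css_extractor:", err)
```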

Combinatorial Selectors

Combine multiple conditions to pinpoint specific elements:

// Complex selector combinations
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/search',
  // css_extractor is sent as a JSON string
  css_extractor: JSON.stringify({
    // Items that are on sale AND in stock
    available_sale_items: '.product[data-sale="true"]:not([data-inventory="0"])',
    // Every third featured product
    featured_third: '.featured-products > div:nth-child(3n)',
    // First paragraph in each section except the intro
    first_paragraphs: 'section:not(#intro) p:first-of-type'
  })
});

// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
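
The `3n` in `:nth-child(3n)` is a case of the general `an+b` formula: an element matches when its 1-based position among its siblings equals `a*n + b` for some integer `n >= 0`. The sketch below works through that arithmetic in plain Python; `nth_child_positions` is a hypothetical helper used only to show which positions a formula selects.

```python
def nth_child_positions(a: int, b: int, sibling_count: int) -> list[int]:
    """Return the 1-based positions matched by :nth-child(an+b)."""
    matched = []
    for position in range(1, sibling_count + 1):
        offset = position - b
        if a == 0:
            hit = offset == 0
        else:
            # position = a*n + b must hold for some integer n >= 0
            hit = offset % a == 0 and offset // a >= 0
        if hit:
            matched.append(position)
    return matched


print(nth_child_positions(3, 0, 10))  # :nth-child(3n)    -> [3, 6, 9]
print(nth_child_positions(2, 1, 6))   # :nth-child(2n+1)  -> [1, 3, 5]
print(nth_child_positions(-1, 3, 6))  # :nth-child(-n+3)  -> [1, 2, 3]
```

Keep in mind that positions count every sibling, whatever its tag, so `div:nth-child(3n)` skips a third child that is not a `div`.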

Selecting by Relationships

Use sibling and parent-child relationships to locate elements:

| Selector | Syntax | Example | Description |
| --- | --- | --- | --- |
| Adjacent | `A + B` | `h2 + p` | Select p immediately after an h2 |
| General Sibling | `A ~ B` | `.a ~ .b` | Select all .b siblings after .a |
| Direct Child | `A > B` | `ul > li` | Select li that is a direct child of ul |

// Relationship-based selectors
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/blog',
  // css_extractor is sent as a JSON string
  css_extractor: JSON.stringify({
    // Extract paragraph immediately following each heading
    heading_descriptions: 'h2 + p',
    // Get all list items after the "featured" item
    items_after_featured: '.featured ~ li',
    // Extract labels that are direct children of form fields
    form_labels: '.form-field > label'
  })
});

// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
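
The difference between `+`, `~`, and `>` is easiest to see on a tiny document. The sketch below emulates the three combinators by walking a small XML fragment with Python's standard `xml.etree.ElementTree`; the fragment, the ids, and the helper names are all made up for illustration, and real selector engines work on full HTML rather than this simplified tree.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<article>
  <h2 id="h-1"/>
  <p id="p-1"/>
  <p id="p-2"/>
  <h2 id="h-2"/>
  <ul id="list"/>
  <p id="p-3"/>
</article>
"""


def adjacent(parent, first_tag, second_tag):
    """Emulate 'first + second': second must directly follow first."""
    children = list(parent)
    return [
        nxt.get("id")
        for prev, nxt in zip(children, children[1:])
        if prev.tag == first_tag and nxt.tag == second_tag
    ]


def general_sibling(parent, first_tag, second_tag):
    """Emulate 'first ~ second': second comes anywhere after a first sibling."""
    matched, seen_first = [], False
    for child in parent:
        if seen_first and child.tag == second_tag:
            matched.append(child.get("id"))
        if child.tag == first_tag:
            seen_first = True
    return matched


def direct_child(parent, child_tag):
    """Emulate 'parent > child': only immediate children count."""
    return [child.get("id") for child in parent if child.tag == child_tag]


root = ET.fromstring(FRAGMENT)
print(adjacent(root, "h2", "p"))         # ['p-1']
print(general_sibling(root, "h2", "p"))  # ['p-1', 'p-2', 'p-3']
print(direct_child(root, "p"))           # ['p-1', 'p-2', 'p-3']
```

Here `h2 + p` skips `p-3` because a `ul` sits between it and the preceding heading, while `h2 ~ p` still picks it up.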

Dynamic Content Selection

When dealing with dynamic or JavaScript-rendered content, enable js_render and use flexible selectors:

// Adapting to dynamic content
const params = new URLSearchParams({
  apikey: 'YOUR_ZENROWS_API_KEY',
  url: 'https://example.com/dynamic-page',
  js_render: 'true',
  css_extractor: JSON.stringify({
    // Multiple potential selectors for the same data type.
    // A comma-separated list matches elements that satisfy ANY of the selectors.
    prices: [
      '.price-new',
      '[data-testid="price"]',
      '.product-price .amount',
      'span.current-price'
    ].join(', '),
    // Get all elements with a specific text pattern.
    // Note: :contains() is a jQuery extension, not standard CSS, so it only
    // works if the selector engine in use supports it.
    availability_elements: [
      '[data-availability]',
      ':not(script):not(style):contains("In Stock")',
      ':not(script):not(style):contains("Available")'
    ].join(', ')
  })
});

// A GET request cannot carry a body, so the parameters go in the query string
const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
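
A comma-separated selector list returns the union of everything its parts match, so it cannot say which candidate produced a value. If you want a strict "first selector that works" fallback, one option is to give each candidate its own key and pick the first non-empty result afterwards. The sketch below assumes the response is a JSON object keyed by the names used in css_extractor; that shape, the sample response, and both helper names are assumptions made for illustration.

```python
def build_fallback_rules(field: str, selectors: list[str]) -> dict[str, str]:
    """Give each candidate selector its own key: field_0, field_1, ..."""
    return {f"{field}_{index}": selector for index, selector in enumerate(selectors)}


def first_non_empty(result: dict, field: str, candidate_count: int):
    """Return the value of the first candidate key that extracted something."""
    for index in range(candidate_count):
        value = result.get(f"{field}_{index}")
        if value:  # skips missing keys, None, "" and []
            return value
    return None


candidates = [".price-new", '[data-testid="price"]', "span.current-price"]
rules = build_fallback_rules("prices", candidates)
# rules can be serialized with json.dumps(...) and sent as css_extractor

# Hypothetical response body: only the second candidate matched
sample_result = {"prices_0": [], "prices_1": ["$24.00", "$32.00"], "prices_2": []}
print(first_non_empty(sample_result, "prices", len(candidates)))  # ['$24.00', '$32.00']
```

The trade-off is a larger response, since every candidate is evaluated, in exchange for knowing exactly which selector still works on the page.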

Debugging Selectors

When selectors don’t work as expected:

  1. Inspect the Full HTML: Use ZenRows with js_render: true to see what the DOM actually contains

  2. Start Broad, Then Narrow Down:

    // Start general and get more specific
    css_extractor: {
      all_elements: '*', // Get everything to inspect structure
      possible_containers: 'div, section, article', // All potential containers
      with_attributes: '[data-product_id]' // CSS has no attribute-name wildcard ([data-*] is invalid), so name each data attribute
    }
    
  3. Use Text-Based Debugging: Find elements by their text content:

    js_instructions: [
      'const findElementsByText = (text) => {',
      '  const walker = document.createTreeWalker(',
      '    document.body,',
      '    NodeFilter.SHOW_TEXT,',
      '    { acceptNode: node => node.textContent.includes(text) ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT }',
      '  );',
      '  const results = [];',
      '  let node;',
      '  while (node = walker.nextNode()) {',
      '    results.push({',
      '      text: node.textContent.trim(),',
      '      path: getNodePath(node.parentElement)',
      '    });',
      '  }',
      '  ',
      '  function getNodePath(el) {',
      '    const path = [];',
      '    while (el && el.nodeType === Node.ELEMENT_NODE) {',
      '      let selector = el.nodeName.toLowerCase();',
      '      if (el.id) selector += `#${el.id}`;',
      '      else if (el.className) selector += `.${Array.from(el.classList).join(".")}`;',
      '      path.unshift(selector);',
      '      el = el.parentElement;',
      '    }',
      '    return path.join(" > ");',
      '  }',
      '  ',
      '  // Store the results in a hidden element so they appear in the returned HTML',
      '  const debugDiv = document.createElement("div");',
      '  debugDiv.id = "zenrows-debug-results";',
      '  debugDiv.setAttribute("data-results", JSON.stringify(results));',
      '  debugDiv.style.display = "none";',
      '  document.body.appendChild(debugDiv);',
      '};',
      'findElementsByText("Price");' // Replace with text you're looking for
    ]
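
Once the script above has run, the matches sit in the data-results attribute of the #zenrows-debug-results element in the returned HTML. The following sketch reads them back with Python's standard `html.parser`. It assumes the returned HTML actually contains that element; the sample HTML, `DebugResultsParser`, and `read_debug_results` are hypothetical stand-ins for this example.

```python
import json
from html.parser import HTMLParser


class DebugResultsParser(HTMLParser):
    """Collect the data-results attribute of the debug element."""

    def __init__(self, element_id: str = "zenrows-debug-results"):
        super().__init__()
        self.element_id = element_id
        self.results = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id") == self.element_id and self.results is None:
            # HTMLParser has already unescaped entities such as &quot;
            self.results = json.loads(attributes.get("data-results") or "[]")


def read_debug_results(html: str) -> list:
    """Return the list stored by the debug script, or [] if it is absent."""
    parser = DebugResultsParser()
    parser.feed(html)
    return parser.results or []


# Stand-in for the HTML returned by the request; a real page would be much larger.
sample_html = (
    '<html><body><span class="amount">Price: $24</span>'
    '<div id="zenrows-debug-results" style="display: none" '
    'data-results="[{&quot;text&quot;:&quot;Price: $24&quot;,'
    '&quot;path&quot;:&quot;html &gt; body &gt; span.amount&quot;}]"></div>'
    "</body></html>"
)
for match in read_debug_results(sample_html):
    print(match["path"], "->", match["text"])
```

Each reported path is a chain of tag, id, and class names that you can trim down into a working selector.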
    

Selector Performance Tips

Optimize your selectors for both accuracy and performance:

  1. Avoid Universal Selectors: * has to be tested against every element on the page; use a more specific selector instead.
  2. Minimize Selector Depth: .product-grid .product .title is faster than body div.container div.products div.product-grid div.product div.title
  3. Prefer ID and Class Selectors: #product-123 is faster than [data-product-id="123"]
  4. Avoid Descendant Selectors When Possible: Child (>) and adjacent (+) selectors are faster than descendant selectors (space)

CSS Selector Cheat Sheet

| Selector | Purpose | Example |
| --- | --- | --- |
| `element` | Select by tag | `div`, `span`, `h1` |
| `.class` | Select by class | `.product`, `.price` |
| `#id` | Select by ID | `#main`, `#product-123` |
| `[attr]` | Has attribute | `[data-id]` |
| `[attr="val"]` | Exact attribute | `[type="submit"]` |
| `[attr*="val"]` | Contains value | `[href*="product"]` |
| `[attr^="val"]` | Starts with value | `[class^="product-"]` |
| `[attr$="val"]` | Ends with value | `[src$=".jpg"]` |
| `:nth-child(n)` | By position | `li:nth-child(2)` |
| `:first-child` | First child | `li:first-child` |
| `:last-child` | Last child | `li:last-child` |
| `:not(selector)` | Negation | `.item:not(.featured)` |
| `A > B` | Direct child | `.product > .title` |
| `A + B` | Adjacent sibling | `h2 + p` |
| `A ~ B` | General sibling | `.featured ~ .product` |
| `A, B` | Multiple selectors | `.price, .discount` |
| `A B` | Descendant | `.product .price` |

Use these advanced CSS selector techniques to create precise data extraction patterns for even the most complex websites.