Extract Data from Lists, Tables, and Grids
We’ll explore popular use cases for scraping, such as lists, tables, and product grids. Use these as inspiration and a guide for your scrapers.
Scraping from Lists
We will use the Wikipedia page on Web scraping for testing. A section at the bottom, “See also”, contains links in a list. We can get the content by using the CSS selector for the list items: {"items": ".div-col > ul li"}.
That will get the text, but what about the links? To access attributes, we need a non-standard selector syntax: @href. It won’t work with the previous selector, since that selector ends at the li element, which does not have an href attribute. So we must change it to target the link element instead: {"links": ".div-col > ul a @href"}.
In some languages, the CSS selectors must be URL-encoded to avoid problems with special characters in the URL.
Our Builder can help you write and test the selectors and output code in several languages.
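As a minimal sketch of what that encoding looks like, the snippet below serializes the link extractor from this section to JSON and percent-encodes it with Python's standard library. The parameter names "url" and "css_extractor" are assumptions made for illustration and are not confirmed by this page; adapt them to the request format you actually use.

```python
import json
from urllib.parse import urlencode, parse_qs

# The extractor from this section: link targets inside the "See also" list.
css_extractor = {"links": ".div-col > ul a @href"}

# Hypothetical query parameters; the names "url" and "css_extractor" are
# assumptions for illustration, not a confirmed API.
params = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "css_extractor": json.dumps(css_extractor),
}

# urlencode percent-encodes reserved characters such as {, ", > and @,
# and turns spaces into "+".
query_string = urlencode(params)
print(query_string)

# Decoding the query string gives back the original extractor unchanged.
decoded = parse_qs(query_string)["css_extractor"][0]
assert json.loads(decoded) == css_extractor
```

Encoding by hand like this is only needed when the HTTP client does not encode query parameters for you.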
Scraping from Tables
Assuming regular tables (no empty cells, no rows with fewer items, and so on), we can extract table data with CSS selectors. We’ll use a list of countries, the first table on the page, the one with the class wikitable.
To extract the rank, which is the first column, we can use "table.wikitable tr > :first-child". It will return an array with 243 items: 2 header lines and 241 ranks. For the country name, in the second column, we can use something similar but add an a to avoid capturing the flags: "table.wikitable tr > :nth-child(2) a". In this case, the array will have one fewer item since the second heading has no link. That might be a problem if we want to match items by array index.
Outputs:
As stated above, this might prove difficult for non-regular tables. For those, we might prefer to get the Plain HTML and scrape the content with a tool or library so we can add conditionals and logic.
This example lists items by column, not row, which might prove helpful in various cases. However, there are no easy ways to extract structured data from tables using CSS Selectors and group it by row.
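For the Plain HTML route, a minimal sketch of row-by-row parsing could look like the following. It uses Python's built-in html.parser rather than any particular scraping library, and the sample HTML is made up for illustration (it is not the Wikipedia table). The point is that once rows are kept together, custom logic such as skipping incomplete rows becomes straightforward.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample: the second data row has an empty cell.
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td><a href="/wiki/A">Country A</a></td></tr>
  <tr><td>2</td><td></td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(html)
header, *data_rows = parser.rows

# Custom logic: keep only rows where every cell has content.
complete = [dict(zip(header, row)) for row in data_rows if all(row)]
print(complete)
```

This sketch handles simple, non-nested tables only; cells spanning several rows or columns would need extra logic.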
Scraping from Product Grids
As with the tables, non-regular grids might cause problems. We’ll scrape the price, product name, and link from an online store. By manually searching the page’s content, we arrive at cards with the class .product. Those contain all the data we want.
It is essential to avoid duplicates, so we have to use some precise selectors. For example, ".product-item .product-link @href" for the links. We added the .product-link class because it is unique to the product cards. The same goes for name and price, which also have unique classes.
All in all, the final selector would be:
Several items are on the page at the time of this writing. And each array has the same number of elements, so everything looks fine. If we were to group them, we could zip the arrays.
For example, in Python, we can take advantage of the auto-encoding that requests.get applies to parameters. Remember to encode the URL and CSS extractor yourself in scenarios where that is not available.
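The grouping step itself can be sketched as follows. This assumes the extracted data has already been loaded into a dictionary of arrays; the keys "names", "prices", and "links" and the sample values are hypothetical, chosen only to illustrate the zip. The length check guards against silently pairing the wrong items.

```python
def group_products(extracted):
    """Turn column-style arrays into one dict per product."""
    # Hypothetical keys; use whatever names your extractor defines.
    names = extracted["names"]
    prices = extracted["prices"]
    links = extracted["links"]

    # Zipping only makes sense when every array has the same length.
    if not (len(names) == len(prices) == len(links)):
        raise ValueError("arrays differ in length; items cannot be matched by index")

    return [
        {"name": name, "price": price, "link": link}
        for name, price, link in zip(names, prices, links)
    ]


# Made-up sample shaped like a result with three extractor keys.
sample = {
    "names": ["Product A", "Product B"],
    "prices": ["$10.00", "$12.50"],
    "links": ["/product/a", "/product/b"],
}

products = group_products(sample)
print(products)
```

Raising an error on mismatched lengths is a deliberate choice here: a plain zip would stop at the shortest array and quietly drop or misalign items.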
Remember that this approach won’t work properly if, for example, some products have no price. Not all the arrays would have the same length, and the zipping would misassign data. Getting the Plain HTML and parsing the content with a library and custom logic is a better solution for those cases.
If you encounter any problems or cannot correctly set up your scraper, contact us, and we’ll help you.