Scrape and Crawl from a Seed URL
Web scraping at scale involves more than extracting data from a single page: you also need to continually discover new URLs so that you cover every relevant page, especially on sites with dynamically generated content or paginated sections.
We'll start from a single URL (the "seed URL") and extract its internal links, that is, links that stay within the same website rather than pointing to external sites. This approach is commonly used to gather all of a website's pages, such as every product page of an e-commerce store or every article on a blog.
Prerequisites
Before starting, ensure you have Python 3 installed. Some systems come with Python pre-installed. Once you have Python set up, install the necessary libraries by running the following command:
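```
pip install requests beautifulsoup4
```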
The requests library will handle HTTP requests, while BeautifulSoup will parse the HTML content and extract the links.
Extracting Links from the Seed URL
We’ll use BeautifulSoup to extract the links from the page’s HTML content. Although BeautifulSoup isn’t required for the ZenRows® API to function, it simplifies parsing and filtering the extracted data. We’ll also define separate functions for better code organization and maintainability.
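Below is a minimal sketch of this step. It assumes the ZenRows® API is called as a GET request to https://api.zenrows.com/v1/ with apikey and url query parameters (check the API reference for the exact format); the function names, the fragment-stripping logic, and the placeholder API key are illustrative.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

ZENROWS_API_URL = "https://api.zenrows.com/v1/"  # assumed endpoint; see the API reference
API_KEY = "YOUR_ZENROWS_API_KEY"                 # placeholder: use your own key


def fetch_html(url):
    """Fetch a page's HTML through the ZenRows API."""
    response = requests.get(ZENROWS_API_URL, params={"apikey": API_KEY, "url": url})
    response.raise_for_status()
    return response.text


def extract_internal_links(html, base_url):
    """Parse the HTML and return absolute URLs that stay on the same domain."""
    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(base_url).netloc
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        if urlparse(absolute).netloc == base_domain:
            links.add(absolute.split("#")[0])  # drop fragments to avoid duplicates
    return links
```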
Managing Crawl Limits and Visited URLs
Setting a maximum number of requests and tracking visited URLs is crucial: it stops the script from looping endlessly over the same pages and from making thousands of unnecessary API calls.
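One simple way to enforce these limits is a crawl cap plus a set of already-seen URLs; the names and the cap of 50 below are illustrative:

```python
MAX_CRAWL = 50        # illustrative cap on the number of pages to fetch
visited_urls = set()  # URLs that have already been queued or fetched


def should_crawl(url):
    """Only crawl URLs we haven't seen yet while the crawl budget remains."""
    return url not in visited_urls and len(visited_urls) < MAX_CRAWL
```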
Setting Up a Queue and Worker Threads
The next step involves setting up a queue and worker threads to manage the URLs that must be crawled. This allows for concurrent processing of multiple URLs, improving the efficiency of the scraping process.
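A sketch of that setup using Python's standard-library Queue and threading modules; the worker count is arbitrary, and crawl() stands in for the per-URL logic shown in the full implementation below:

```python
import threading
from queue import Queue

NUM_WORKERS = 5       # illustrative number of concurrent workers
url_queue = Queue()   # thread-safe queue of URLs waiting to be crawled


def worker():
    """Pull URLs off the queue and process them until the program exits."""
    while True:
        url = url_queue.get()
        try:
            crawl(url)  # per-URL fetch-and-extract logic, defined in the full implementation
        finally:
            url_queue.task_done()


def start_workers():
    """Launch the workers as daemon threads so they stop when the main thread does."""
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
```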
Full Implementation: Crawling and Data Extraction
Finally, we combine these elements to create a fully functional crawler. The script manages the queue, processes URLs, and extracts the desired content.
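The sketch below shows one way those pieces can fit together, under the same assumptions as above (ZenRows GET endpoint, placeholder API key); the seed URL, crawl limit, worker count, and the title-only extraction are illustrative.

```python
import threading
from queue import Queue
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

ZENROWS_API_URL = "https://api.zenrows.com/v1/"  # assumed endpoint; see the API reference
API_KEY = "YOUR_ZENROWS_API_KEY"                 # placeholder: use your own key
SEED_URL = "https://www.example.com/"            # illustrative seed URL
MAX_CRAWL = 50                                   # illustrative crawl budget
NUM_WORKERS = 5                                  # illustrative worker count

url_queue = Queue()       # thread-safe queue of URLs waiting to be crawled
visited_urls = set()      # URLs already queued or fetched
lock = threading.Lock()   # protects visited_urls across worker threads


def fetch_html(url):
    """Fetch a page's HTML through the ZenRows API."""
    response = requests.get(ZENROWS_API_URL, params={"apikey": API_KEY, "url": url})
    response.raise_for_status()
    return response.text


def extract_internal_links(html, base_url):
    """Return absolute URLs on the same domain as base_url."""
    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(base_url).netloc
    return {
        urljoin(base_url, a["href"]).split("#")[0]
        for a in soup.find_all("a", href=True)
        if urlparse(urljoin(base_url, a["href"])).netloc == base_domain
    }


def extract_data(html, url):
    """Example extraction: print the page title. Replace with your own parsing logic."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    print(f"{url} -> {title}")


def crawl(url):
    """Fetch one page, extract its data, and enqueue any new internal links."""
    html = fetch_html(url)
    extract_data(html, url)
    for link in extract_internal_links(html, url):
        with lock:
            if link not in visited_urls and len(visited_urls) < MAX_CRAWL:
                visited_urls.add(link)
                url_queue.put(link)


def worker():
    """Process URLs from the queue until the main thread exits."""
    while True:
        url = url_queue.get()
        try:
            crawl(url)
        except requests.RequestException as error:
            print(f"Failed to crawl {url}: {error}")
        finally:
            url_queue.task_done()


if __name__ == "__main__":
    visited_urls.add(SEED_URL)
    url_queue.put(SEED_URL)
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()  # returns once every queued URL has been processed
```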
This code sets up a basic web crawler using ZenRows®, requests, and BeautifulSoup. It starts from a seed URL, extracts links, and follows them up to a defined limit. Be cautious when using this method on large websites, as it can quickly generate a massive number of pages to crawl. Proper error handling, rate limiting, and data storage should be added for production use.
For a more detailed guide and additional techniques, check out our scraping with Python series. If you have any questions, feel free to contact us.