Install requests, an HTTP library for Python, with pip install requests. It exposes a get method that will call a URL and return its HTML. For the time being, we won't be using any parameters; this is simply a demo to see how it works.
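Here is a minimal sketch of that demo; the target URL is a placeholder, not one from the original snippet.

```python
import requests

# Plain call: no parameters, no proxy. The server sees our real IP.
response = requests.get("https://example.com")  # placeholder URL

print(response.status_code)  # e.g., 200
print(response.text[:200])   # the first characters of the returned HTML
```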
Careful! This script runs without any proxy, so the server will see your actual IP. You don't need to run this snippet.
get's target will be the API base and then two params: apikey for authentication and url for the page we want to scrape. URLs must be encoded, but requests will handle that for us when using params.
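A sketch of that call could look like the one below; the API base URL and apikey value are placeholders, not the real endpoint or credentials.

```python
import requests

API_BASE = "https://api.example.com/v1/"  # placeholder for the API base
API_KEY = "YOUR_API_KEY"                  # placeholder for your apikey

target = "https://example.com/some page?with=query"  # requests will URL-encode this

response = requests.get(API_BASE, params={"apikey": API_KEY, "url": target})
print(response.text)  # HTML of the target page, fetched through the API
```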
With this simple update, we will handle most scraping problems, such as proxy rotation, setting the correct headers, avoiding CAPTCHAs and anti-bot solutions, and many more. But a few issues remain that we will address now. Keep reading.
Next, we will write a function, extract_content, that returns the URL, title, and h1 content. That is where you can put your custom extraction logic.
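As a sketch, extract_content could look like this, assuming BeautifulSoup for parsing (the original may use a different parser):

```python
from bs4 import BeautifulSoup  # assumption: parsing with BeautifulSoup
import requests

def extract_content(url, html):
    # Parse the HTML and return only the fields we care about.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "h1": soup.h1.get_text(strip=True) if soup.h1 else None,
    }

html = requests.get("https://example.com").text  # placeholder URL
print(extract_content("https://example.com", html))
```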
The multiprocessing package implements a ThreadPool that will queue and execute all our requests. It handles the parallelism for us and caps the number of requests running simultaneously, never exceeding the limit (10 in the example). Once all the requests finish, it groups the results in a single variable, and we print them. In a real case, you would store them in a database, for example.
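A minimal sketch of that pool, assuming the extract_content helper from the previous step and placeholder URLs:

```python
from multiprocessing.pool import ThreadPool

import requests

NUM_THREADS = 10  # maximum number of simultaneous requests

urls = [
    "https://example.com/page/1",  # placeholder URLs
    "https://example.com/page/2",
    "https://example.com/page/3",
]

def scrape(url):
    # Fetch the page and extract the content we want.
    html = requests.get(url).text
    return extract_content(url, html)  # the helper defined earlier

# ThreadPool queues the URLs and never runs more than NUM_THREADS at once.
with ThreadPool(NUM_THREADS) as pool:
    results = pool.map(scrape, urls)

print(results)  # in a real case, store them in a database instead
```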
Note that this is not a queue; we cannot add new URLs once the process starts. If that is your use case, check out our guide on how to Scrape and Crawl from a Seed URL.
To retry failed requests, we will use Retry from urllib3 and HTTPAdapter from requests. The basic idea is as follows: configure a Retry strategy and then mount an HTTPAdapter on a requests Session. Unlike the previous snippets, we won't be calling requests.get directly but requests_session.get. Once the session is created, it will use the same adapter for all subsequent calls.
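Here is a sketch of that setup; the retry count, backoff, and status codes are illustrative choices, not necessarily the exact values used in the article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts, on common transient errors.
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retries)

requests_session = requests.Session()
# Mount the adapter for both schemes; every call through the session reuses it.
requests_session.mount("http://", adapter)
requests_session.mount("https://", adapter)

response = requests_session.get("https://example.com")  # placeholder URL
print(response.status_code)
```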
For more information, visit the article on Retry Failed Requests.