Do you want to learn how to build a web crawler from scratch? Join me as I show you how, using Python as the language of choice for this tutorial.
Have you ever wondered what the Internet would be like without search engines? Well, what if I told you that web crawlers are one of the secrets that have made search engines what they are today?
They have proven to be incredibly important, not only in general web search but also in academic research, lead generation, and even Search Engine Optimization (SEO).
Any project that intends to extract data from many pages on a website, or across the wider Internet, without a prior list of links to extract the data from will most likely make use of a web crawler to achieve that.
If you are interested in developing a web crawler for a project, then you need to know that the basics of a web crawler are easy, and anyone can design and develop one. However, depending on the complexity and size of your project, a fitting crawler could be difficult to build and maintain. In this article, you will learn how to build web crawlers yourself. Before going into the tutorial proper, let's take a look at what a web crawler actually is.
What are Web Crawlers?
The terms web crawler and web scraper are used interchangeably, and many think they mean the same thing. While they loosely mean the same thing, if you dig deeper, you will discover that web scraping and web crawling are not the same thing – and you can even see that from the way web crawlers and web scrapers are designed.
Web crawlers, also known as web spiders, spiderbots, or simply crawlers, are web bots that have been developed to systematically visit webpages on the World Wide Web for the purpose of web indexing and collecting other data from the pages they visit.
How Do They Differ from Web Scrapers?
From the above, you can tell that they are different from web scrapers. They are both bots for web data extraction. However, you can see web scrapers as more streamlined and specialized workers designed for extracting specific data from a specific and defined list of web pages, such as Yelp reviews, Instagram posts, Amazon price data, Shopify product data, and so on. A web scraper is fed a list of URLs, visits those URLs, and scrapes the required data.
This is not the case for web crawlers. They are also fed a list of URLs, but from this list, the web crawler is meant to discover other URLs to crawl by itself, following some set of rules. The reason marketers use the terms interchangeably is that web scraping is involved in the process of web crawling – and some web scrapers incorporate aspects of web crawling.
How Do Web Crawlers Work?
Depending on the complexity and use case of a web crawler, it could work in the basic way described below or include some modifications to that working mechanism. At its most basic level, you can see a web crawler as a web browser that browses pages on the Internet, collecting information.
The working mechanism for web crawlers is simple. For a web crawler to work, you will have to provide it with a list of URLs – these URLs are known as seed URLs. These seed URLs are added to a list of URLs to be visited. The crawler then goes through the list of URLs to be visited and visits them one after the other.
For each URL the crawler visits, it extracts all the hyperlinks on the page and adds them to the list of URLs to be visited. Aside from collecting hyperlinks in order to cover the breadth of the site (or of the web, in the case of crawlers not designed for one specific website), web crawlers also collect other information.
Take, for instance, Googlebot, the most popular web crawler on the Internet: aside from link data, it also indexes the content of each page to make it easier to search. A web archive, on the other hand, takes a snapshot of the pages it visits – and other crawlers extract whatever data they are interested in. Aside from the list of URLs to be visited, the crawler also keeps a list of URLs that have already been visited, to avoid adding an already-crawled URL back into the list of pages to be crawled.
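The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration only, not the crawler we build later in this article: the fake_web dictionary and the get_links callback are made-up stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

def basic_crawl(seed_urls, get_links, max_pages=100):
    """Minimal crawl loop: keep a frontier of URLs to be visited
    and a set of already-visited URLs so no page is crawled twice."""
    frontier = deque(seed_urls)   # URLs to be visited, seeded with the start URLs
    visited = set()               # URLs already crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):   # hyperlinks found on the page
            if link not in visited:
                frontier.append(link)
    return visited

# Offline demo with a tiny hand-made "web" instead of real HTTP requests:
fake_web = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["c"],
}
print(sorted(basic_crawl(["a"], lambda u: fake_web.get(u, []))))
# → ['a', 'b', 'c', 'd'] – every reachable page, each visited exactly once
```

Starting from the single seed "a", the loop discovers every reachable page while the visited set prevents the a↔b cycle from looping forever.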
There are a good number of considerations you will have to look into, including a crawling policy that sets the rules for which URLs to visit, a re-visit policy that dictates when to check for changes on a web page, a politeness policy that determines whether you respect a site's robots.txt rules, and lastly, a parallelization policy for coordinating distributed web crawling, among others.
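As an illustration of a politeness policy, Python's standard library ships a robots.txt parser. The rules below are a made-up example, not those of any real site; a real crawler would fetch the site's actual robots.txt and feed its lines to parse():

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks everything under /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask before crawling: is this URL allowed for our user agent?
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
```

A polite crawler would call can_fetch() on every URL before adding it to the list of URLs to be visited.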
Developing a Web Crawler with Python
From the above, we expect you to have an idea of what web crawlers are. It is now time to learn how to develop one yourself. Web crawlers are computer programs that can be written in any general-purpose programming language.
In this article, we are going to be making use of Python because of its simplicity, ease of use, beginner-friendliness, and extensive library support. Even if you are not a Python programmer, you can take a crash course in Python programming in order to understand what will be discussed, as all of the code will be written in Python.
Project Idea: Page Title Extractor
The project we will be building is a very simple one – what can be called a proof of concept. The crawler we will be developing will accept a seed URL and visit all pages on the website, outputting each link and page title to the screen.
We won’t be respecting robots.txt files or adding proxy usage, multithreading, or any other complexities – we are keeping it easy for you to follow and understand.
Requirements for the Project
Earlier, I stated that Python has extensive library support for web crawling. The most important library of all for web crawling is Scrapy, a web crawling framework that makes it easy to develop web crawlers in fewer lines of code. However, we won’t be using Scrapy, as it hides some details; let's make use of the Requests and BeautifulSoup combination for the development instead.
- Python: While many Operating Systems come with Python preinstalled, the version installed is usually old, and as such, you will need to install a recent version of Python. You can visit the official download page to download an updated version of the Python programming language.
- Requests: Dubbed HTTP for Humans, Requests is the best third-party library for sending HTTP requests to web servers. It is very simple and easy to use. Under the hood, this library makes use of the urllib package but abstracts it away and provides you with better APIs for handling HTTP requests and responses. Since it is a third-party library, you will need to install it. You can use the command
pip install requests
to do so.
- BeautifulSoup: While the Requests library is for sending HTTP requests, BeautifulSoup is for parsing HTML and XML documents. With BeautifulSoup, you do not have to deal with regular expressions or the standard HTML parser, which are not easy to use and are prone to errors if you are not skilled in their usage. BeautifulSoup makes it easy for you to traverse HTML documents and parse out the required data. This tool is also a third-party library and is not included in the standard Python distribution. You can download it using the pip command:
pip install beautifulsoup4
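To see what BeautifulSoup does before we use it in the crawler, here is a small stand-alone example. The HTML snippet is hand-written for illustration, but the two calls – find("title") and find_all("a") – are exactly the ones our crawler will rely on:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.find("title").text)                      # Example Page
print([a.get("href") for a in soup.find_all("a")])  # ['https://example.com/a', 'https://example.com/b']
```

find() returns the first matching tag (or None if there is no match), while find_all() returns a list of every matching tag.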
Steps for Coding the Page Title Extractor Project
As stated earlier, the process of developing a web crawler can be complex, but the crawler we are developing in this tutorial is very simple. In fact, if you already know how to scrape data from web pages, there is a high chance that you already know how to develop a simple web crawler. The Page Title Extractor project will be contained in only one module, so create a new Python file for it. The module will have a class named TitleCrawler with two methods: crawl, which defines the main crawling logic, and start, which feeds the crawl method the URLs to crawl.
Import the Necessary Libraries
Let's start by importing the required libraries for the project. We require requests, BeautifulSoup, and urlparse. Requests is for sending web requests; BeautifulSoup is for parsing titles and URLs out of the pages downloaded by Requests. The urlparse function is bundled in the standard Python library and is used for parsing URLs.
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup
Web Crawler Class Definition
After importing the required libraries, let's create a new class named TitleCrawler. This will be the crawler class.
class TitleCrawler:
    """
    Crawler class accepts a URL as argument.
    This seed URL will be the URL from which other URLs will be discovered.
    """
    def __init__(self, start_url):
        self.urls_to_be_visited = []
        self.urls_to_be_visited.append(start_url)
        self.visited = []
        self.domain = "https://" + urlparse(start_url).netloc
From the above, you can see the initialization function – it accepts a URL as an argument. There are three variables: urls_to_be_visited is for keeping a list of URLs to visit, the visited variable is for keeping a list of visited URLs to avoid crawling a URL more than once, and the domain variable holds the domain name of the site you are scraping from. You will need it so that only links from that domain are crawled.
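If you are wondering what urlparse contributes here, it splits a URL into its components, and netloc is the part we keep as the domain:

```python
from urllib.parse import urlparse

# Split a URL into components; netloc is the host part.
parsed = urlparse("https://cop.guru/sneaker-bots/")
print(parsed.netloc)               # cop.guru
print("https://" + parsed.netloc)  # https://cop.guru
```

Prepending "https://" rebuilds the domain prefix that discovered links will be compared against.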
Start Method Coding
    def start(self):
        for url in self.urls_to_be_visited:
            self.crawl(url)

x = TitleCrawler("https://cop.guru/")
x.start()
The start method above belongs to the TitleCrawler class. You can see a for loop that loops through urls_to_be_visited and passes each URL to the crawl method, which is also a method of the TitleCrawler class. The x variable creates an instance of the TitleCrawler class, and calling the start method gets the crawler to start crawling. From the above code snippets, nothing has actually been done yet. The main work is done in the crawl method. Below is the code for the crawl method.
Crawl Method Coding
    def crawl(self, link):
        page_content = requests.get(link).text
        soup = BeautifulSoup(page_content, "html.parser")
        title = soup.find("title")
        print("PAGE BEING CRAWLED: " + title.text + "|" + link)
        self.visited.append(link)
        urls = soup.find_all("a")
        for url in urls:
            url = url.get("href")
            if url is not None:
                if url.startswith(self.domain):
                    if url not in self.visited:
                        self.urls_to_be_visited.append(url)
        print("Number of Crawled pages:" + str(len(self.visited)))
        print("Number of Links to be crawled:" + str(len(self.urls_to_be_visited)))
        print("::::::::::::::::::::::::::::::::::::::")
The URL to be crawled is passed into the crawl method by the start method, which does so by iterating through the urls_to_be_visited list variable. The first line in the code above sends a request to the URL and returns the content of the page.
Using BeautifulSoup, the title of the page and the URLs present on the page are scraped. The web crawler is meant for crawling only URLs on the target website, and as such, URLs from external sources are not considered – you can see that in the second if statement. For a URL to be added to the list of URLs to be visited, it must be a valid URL on the target domain and must not have been visited before.
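One limitation of the startswith check is that it silently drops relative links such as /sneaker-bots/, which many sites use in their markup. If you want to handle them, the standard library's urljoin can resolve a relative link against the page it was found on. This is a possible improvement, not part of the crawler above:

```python
from urllib.parse import urljoin, urlparse

page = "https://cop.guru/sneaker-bots/"
domain = "https://" + urlparse(page).netloc

# A mix of relative, absolute-internal, and external hrefs for illustration:
for href in ["/aio-bots/", "https://cop.guru/adidas-bots/", "https://other.site/x"]:
    absolute = urljoin(page, href)  # resolves relative links against the page URL
    if absolute.startswith(domain):
        print(absolute)  # only the two cop.guru links are printed
```

With this normalization in place, both relative and absolute internal links survive the domain filter, while external links are still discarded.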
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup


class TitleCrawler:
    """
    Crawler class accepts a URL as argument.
    This seed URL will be the URL from which other URLs will be discovered.
    """
    def __init__(self, start_url):
        self.urls_to_be_visited = []
        self.urls_to_be_visited.append(start_url)
        self.visited = []
        self.domain = "https://" + urlparse(start_url).netloc

    def crawl(self, link):
        page_content = requests.get(link).text
        soup = BeautifulSoup(page_content, "html.parser")
        title = soup.find("title")
        print("PAGE BEING CRAWLED: " + title.text + "|" + link)
        self.visited.append(link)
        urls = soup.find_all("a")
        for url in urls:
            url = url.get("href")
            if url is not None:
                if url.startswith(self.domain):
                    if url not in self.visited:
                        self.urls_to_be_visited.append(url)
        print("Number of Crawled pages:" + str(len(self.visited)))
        print("Number of Links to be crawled:" + str(len(self.urls_to_be_visited)))
        print("::::::::::::::::::::::::::::::::::::::")

    def start(self):
        for url in self.urls_to_be_visited:
            self.crawl(url)


x = TitleCrawler("https://cop.guru/")
x.start()
You can change the seed URL to any other URL. In the code above, we use https://cop.guru/. If you run the code above, you will get something like the result below.
PAGE BEING CRAWLED: Sneaker Bots • Cop Guru|https://cop.guru/sneaker-bots/
Number of Crawled pages:4
Number of Links to be crawled:1535
::::::::::::::::::::::::::::::::::::::
PAGE BEING CRAWLED: All in One Bots • Cop Guru|https://cop.guru/aio-bots/
Number of Crawled pages:5
Number of Links to be crawled:1666
::::::::::::::::::::::::::::::::::::::
PAGE BEING CRAWLED: Adidas Bots • Cop Guru|https://cop.guru/adidas-bots/
Number of Crawled pages:6
Number of Links to be crawled:1763
A Catch: There Is a Lot of Room for Improvement in This Project
Looking at the above code, you will most likely run it without any problem, but as soon as an exception occurs, the code will stop running. No exception handling was included in the code, for simplicity.
Aside from exception handling, you will notice that no anti-bot evasion techniques were incorporated, while in reality, many popular websites have anti-bot systems in place to discourage bot access. There is also the issue of speed, which you can address by making the bot multithreaded and the code more efficient. Beyond these, there is still other room for improvement.
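To give a flavour of the multithreading improvement, here is a sketch using Python's ThreadPoolExecutor. The fetch_title function below is a stub standing in for the real Requests/BeautifulSoup work, so the example runs without network access; in a real crawler it would download the page and parse its title.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_title(url):
    # Stub: a real implementation would fetch the page and parse <title>.
    return "title of " + url

urls = ["https://cop.guru/a/", "https://cop.guru/b/", "https://cop.guru/c/"]

# Fetch a batch of URLs concurrently instead of one after the other.
with ThreadPoolExecutor(max_workers=3) as pool:
    titles = list(pool.map(fetch_title, urls))

print(titles)
```

pool.map preserves the input order of the URLs, so the results line up with the batch you submitted; the speedup comes from the downloads overlapping in time rather than running sequentially.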
Looking at the code of the web crawler we developed, you will agree with me that web crawlers are like web scrapers but with a wider scope. Another thing you need to know is that, depending on the number of URLs discovered, the running time of the crawler can be long, though multithreading can shorten it. Also, keep at the back of your mind that complex web crawlers for real projects will require a more planned approach.