Web Scraping with Python [Library + Framework]

As a Python developer, you have many web scraping tools available to you. Read on to discover these tools and learn how to put them to use.

Scraping with Python

It is no news that Python is the most popular programming language for web scraping. It is simple to learn and use, and it has a good number of popular web scraping libraries and frameworks you can leverage.

As a Python developer, you need to know these tools and learn how to use them so you can write better code for your web scraping tasks.

In this article, you will learn about the most popular Python libraries and frameworks used for building web scrapers. You will also learn how to install each of them and see code examples to give you a head start.

Python Web Scraping Libraries

Python web scraping libraries are modules and packages written to carry out specific tasks in the web scraping workflow: sending HTTP requests, driving headless browsers to render JavaScript and simulate human-web interaction, or parsing data out of the downloaded page.

Urllib

Urllib is a package in the Python standard library with modules for handling URLs and HTTP requests. It comes with four modules: urllib.request for handling HTTP requests; urllib.error, which contains the exceptions raised by urllib.request; urllib.parse for parsing URLs; and urllib.robotparser for parsing the content of robots.txt files. Urllib is not as easy to use as higher-level libraries, but it can help you handle authentication, cookies, URL encoding, and proxies, among other things. You should only use it when you need advanced control over your requests.
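To make two of these modules concrete, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without any network access. The example.com URLs, the "my-scraper" user agent name, and the robots.txt rules are placeholders made up for illustration, not taken from any real site.

```python
from urllib.parse import urlencode, urlparse, parse_qs
from urllib.robotparser import RobotFileParser

# Build a URL with a safely encoded query string (example.com is a placeholder).
query = urlencode({"q": "web scraping", "page": 2})
url = "https://example.com/search?" + query

# Split the URL back into its components.
parts = urlparse(url)
params = parse_qs(parts.query)

# Parse hypothetical robots.txt rules from text instead of fetching them.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

print(url)
print(parts.netloc, parts.path, params)
print(robots.can_fetch("my-scraper", "https://example.com/private/data"))
print(robots.can_fetch("my-scraper", "https://example.com/search"))
```

In a real scraper you would more likely point RobotFileParser at a site's robots.txt with set_url() and read(), then call can_fetch() before requesting each page.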

How to Install Urllib

As stated earlier, the Urllib package is included in the Python standard library, so there is no need to install it. Just import it in your code and use it.

Urllib Code Sample

The code below will send a GET request to Wikipedia’s Homepage and print out the response. The response will be the whole HTML for the page.

import urllib.request as rq

get_content = rq.urlopen("https://en.wikipedia.org/wiki/Main_Page")

print(get_content.read().decode("utf-8"))
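When you need the extra control mentioned earlier, the same request can carry custom headers, a timeout, and error handling. The sketch below is one possible way to do that; the helper names build_request and fetch and the "my-scraper/0.1" User-Agent string are hypothetical choices for this example, and the error handling is not exhaustive.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="my-scraper/0.1"):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return its text, or None if the request fails."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:
        # No usable response arrived (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

Calling fetch("https://en.wikipedia.org/wiki/Main_Page") would then return the page HTML as a string, or None if something went wrong. HTTPError is caught first because it is a subclass of URLError.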

Python Requests

Dubbed "HTTP for Humans", the Python Requests library is a third-party library that was introduced to ease the process of handling HTTP requests and URLs. It is built on top of urllib3 (a separate third-party package, not the standard library's Urllib) and provides an easy-to-use interface.

Aside from being easier to use than Urllib, it also has better documentation. When it comes to popularity, Requests is one of the most downloaded Python packages. It supports international domains and URLs, session cookies, and authentication, as well as connection pooling, timeouts, and multi-part file uploads.

How to Install

Python Requests is a third-party package, so you need to install it before you can use it. The recommended way of installing it is with the pip command.

pip install requests

Python Requests Code Sample

The code below downloads the same page as the Urllib sample so you can compare the two. For a simple GET request they look alike; the difference shows up when you use their advanced features.

import requests

get_content = requests.get("https://en.wikipedia.org/wiki/Main_Page")

print(get_content.text)

Selenium

Selenium WebDriver is a browser automation tool, and what you use it for is entirely up to you. It has become popular among web scrapers because it can be used for scraping data from JavaScript-rich websites. Traditional tools such as the Python Requests library and Scrapy are not able to render JavaScript, so you need Selenium for that.

Selenium can be used for automating a good number of browsers, including Chrome and Firefox. When running in a headless mode, you won’t actually see a browser open, but it will simulate actions in browser environments. With Selenium, you can simulate mouse and keyboard actions, visit sites, and scrape required contents.

How to Install Selenium

You need two things to automate browsers with Selenium WebDriver: the Selenium Python binding and a browser driver. In this article, we are going to be using Chrome, so you need to download ChromeDriver; make sure it matches the version of Chrome you are using. After downloading, unzip it and place the chromedriver.exe file in the same directory as your Python script. Recent Selenium releases (4.6 and later) can also locate or download a matching driver automatically through Selenium Manager. You can then install the Selenium Python binding using the pip command below.

pip install selenium

Selenium Code Sample

Below is code that shows how to search Amazon using Selenium. Remember, the script has to be in the same directory as the chromedriver.exe file.

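from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
# Open Amazon's homepage; the search box id used below belongs to Amazon.
driver.get("https://www.amazon.com/")
# find_element_by_id was removed in newer Selenium 4 releases; use By.ID instead.
amazon_search = driver.find_element(By.ID, "twotabsearchtextbox")
amazon_search.send_keys("Web scraping for python developers")
amazon_search.send_keys(Keys.RETURN)
# quit() closes the browser window and ends the driver session.
driver.quit()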

With Python and Selenium, you can also find current open positions and aggregate data on Python developer jobs from different job platforms, such as Indeed, Glassdoor, FlexJobs, and Monster.

BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML files. It converts a web page document into a parse tree so that you can traverse and manipulate it in a Pythonic way. With BeautifulSoup, you can parse out any required data, provided it is available in the HTML. It is important to know that BeautifulSoup does not have its own parser; it sits on top of other parsers such as lxml, and even the html.parser available in the Python standard library. When it comes to parsing data from web pages, BeautifulSoup is the most popular choice. It is also easy to learn and master. When parsing a web page using BeautifulSoup, you do not run into problems even when the page HTML is messy and complex.
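To get a feel for what the underlying parser does before BeautifulSoup wraps it in a friendlier interface, here is a minimal sketch that uses only the standard library's html.parser to pull links out of a small, deliberately messy snippet. The LinkCollector class and the sample markup are made up for this illustration; this is not how BeautifulSoup itself is implemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Deliberately messy sample markup: unclosed <li> tags and inconsistent case.
sample_html = """
<ul>
  <li><A HREF="/wiki/Lagos">Lagos</a>
  <li><a href="/wiki/Abuja">Abuja </a>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Writing this kind of event-handling class by hand is the work BeautifulSoup saves you: with it, the same task becomes a single find_all("a") call on the parse tree.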

How to Install BeautifulSoup

Just like the other third-party libraries discussed, you can use pip to install it. Enter the following command in the command prompt.

pip install beautifulsoup4

BeautifulSoup Code Sample

Below is code that scrapes the list of local government areas (LGAs) in Nigeria and prints it to the console. BeautifulSoup does not have the capability of downloading web pages, so we will use the Python Requests library for that.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Local_government_areas_of_Nigeria"
page_content = requests.get(url).text
soup = BeautifulSoup(page_content, "html.parser")

# find() returns the first table that has the "wikitable" class.
table = soup.find("table", {"class": "wikitable"})
# Skip the header row.
lga_trs = table.find_all("tr")[1:]

for i in lga_trs:
    tds = i.find_all("td")
    # The first cell links to the LGA, the second to its state.
    td1 = tds[0].find("a")
    td2 = tds[1].find("a")
    l_name = td1.contents[0]
    l_url = td1["href"]
    l_state = td2["title"]
    l_state_url = td2["href"]
    print([l_name, l_url, l_state, l_state_url])

Lxml

From the name of this library, you can tell that it has something to do with XML. It is a parser in its own right, unlike BeautifulSoup, which sits on top of a parser to act as a parsing library. Aside from XML files, lxml can also be used for parsing HTML files. It might interest you to know that lxml is one of the parsers that BeautifulSoup uses to turn a web page document into a tree to be parsed.

Lxml is very fast when it comes to parsing. However, it is difficult to learn and master. Most web scrapers do not use it alone but use it as the parser to be used by BeautifulSoup. Because of this, there is really no need for a code sample as you won’t be working with it alone.

How to Install Lxml

Lxml is available in the PyPI repository, so you can use the pip command to install it. The command for installing lxml is below.

pip install lxml

Python Web Scraping Frameworks

Unlike libraries, which are meant for only one function, a framework is a complete tool that incorporates a good number of the functions you need when developing web scrapers, including sending HTTP requests and parsing the responses.
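To illustrate what "complete" means here, the toy sketch below combines the pieces a framework bundles: a queue of URLs to visit, a fetch step, a parse step, and link following. It is only an illustration under stated assumptions, not how Scrapy or Pyspider work internally. The crawl function, the _LinkParser class, and the in-memory FAKE_SITE are all hypothetical names invented for this example so that it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Gather the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, parse its links, queue the new ones.

    `fetch` is any callable that takes a URL and returns HTML text (or None).
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be downloaded, move on
        visited.append(url)
        parser = _LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A fake, in-memory "website" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "https://example.com/b": '<a href="/missing">Gone</a>',
}

print(crawl("https://example.com/", FAKE_SITE.get))
```

A real framework adds much more on top of this loop, such as politeness delays, retries, concurrency, and data export, which is why it pays off for larger projects.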

Scrapy

Scrapy is the most popular and arguably the best web scraping framework publicly available as an open-source tool. It was created by Scrapinghub (now Zyte), which still actively maintains it.

Scrapy is a complete framework, as it takes care of both sending requests and parsing the required data out of the downloaded page. Scrapy is asynchronous rather than multithreaded (it is built on the Twisted networking library), which makes it one of the fastest Python scraping tools. It makes the development of complex web scrapers easy. However, it does not render or execute JavaScript, so you need to pair it with Selenium or Splash for that. It is also important to know that it has a steep learning curve.

How to Install Scrapy

Scrapy is available on PyPI, so you can install it using the pip command. Below is the command to run in your command prompt or terminal to download and install Scrapy.

pip install scrapy

Scrapy Code Sample

As stated earlier, Scrapy is a full framework and does not have an easy learning curve. A working example requires a project with a fair amount of code, so it cannot be shown as a short snippet like the libraries described above. For a code sample of Scrapy, visit the official tutorial page on the Scrapy website.


Pyspider

Pyspider is another web scraping framework for Python programmers. It is a powerful web crawling framework you can use to create web scrapers for the modern web.

Unlike Scrapy, which does not render JavaScript on its own, Pyspider is good at getting that done. However, when it comes to reliability and maturity, Scrapy is very much ahead of Pyspider. Pyspider allows for a distributed architecture and provides support for both Python 2 and Python 3.

It supports a good number of database systems and comes with a powerful WebUI for monitoring the performance of your crawler/scraper. For it to run, it needs to be on a server.

How to Install Pyspider

Pyspider can be installed using the pip command below.

pip install pyspider

PySpider Code Sample

The below code is the sample code provided by Pyspider on its documentation page. It scrapes links on the Scrapy homepage.

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    # Re-run the entry point once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl("https://scrapy.org/", callback=self.index_page)

    # Treat a fetched page as fresh for 10 days before crawling it again.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}

As stated earlier, Pyspider runs on a server. Your computer can act as the server and will listen on localhost, so run the pyspider command and visit http://localhost:5000/ in your browser.


Conclusion

There are many tools, libraries, and frameworks available for web scraping in the Python programming language.

However, you cannot go about learning each and every one of them. If you are developing a simple scraper that does not require a complex architecture, then the duo of Requests and BeautifulSoup will work, with the addition of Selenium if the site is JavaScript-heavy.

Selenium can even be used alone in such cases. However, when you want to develop a complex web scraper or crawler, Scrapy is the framework to use.