lxml Tutorial: Parsing HTML Documents & Web Scraping with lxml

lxml makes processing XML and HTML documents easy and straightforward. If you have not used it before, this tutorial shows you how to get started with it.


If you have tried processing an HTML or XML document programmatically before, you will know how difficult and frustrating it can be, especially when the document is badly written and poorly structured. Parsing HTML is generally hard if you rely on the out-of-the-box solutions shipped with most popular programming languages, with PHP being an exception.

In Python, the best the standard library offers is the html.parser module, and it is not beginner-friendly. Libraries like lxml were introduced as more robust solutions.

Libraries for handling and processing HTML and XML abstract away these difficulties. While HTML retains its place as the language for coding the structure of a page, XML has been losing ground.

For instance, JSON is becoming the preferred choice for data exchange, and HTML has displaced XML on the web frontend. XML still has its place as a popular data format, and some developers use it at the frontend. Our focus in this tutorial will be on HTML.


What is lxml?

Put simply, lxml is a Python library for processing XML and HTML documents. It is feature-rich, though: besides handling and parsing existing HTML and XML documents, it can also create new ones from scratch.

This library is built on the libxml2 and libxslt C libraries and can be seen as their Pythonic binding. This has made it a very fast and efficient library while being simple because of the Python API.


Aside from its easy Python API and speed, this library is a popular choice among Python developers because of its thorough documentation, and because you do not have to deal with manual memory management or segmentation faults.


XML and HTML: How they stand

Before we continue, it is important to know what these two actually mean and what they were made for. Both are markup languages. XML stands for eXtensible Markup Language, and its tags are not predefined: you define the tags yourself. HTML (Hypertext Markup Language), on the other hand, has predefined tags.

In terms of case sensitivity, XML is case sensitive while HTML is not. XML is mainly used for transferring data, while HTML is meant for presenting it. Another important difference is that XML is extensible while HTML is not.

There are many other differences, but the ones stated are some of the most important.


Lxml Coding Guide

As stated earlier, this tool is for processing HTML and XML files, but our focus will be on HTML documents. By the end of this tutorial, you will have learned how to create HTML documents using lxml, and how to parse and process them.

  • Installation Guide

There are two requirements to get started with the lxml library. The first is having the Python programming language installed on your computer; without Python, lxml has no environment to run in. I assume you already have Python installed. If not, visit the download page on the official Python website for a guide on how to download and install it.

With Python installed correctly, you are set to install the lxml library itself. There are a number of ways to install lxml. Let us take a look at them below.

  • pip

If you have pip installed on your computer, you can install lxml easily. pip is the Python package manager, which installs libraries and packages along with all of their dependencies in a single command. To install lxml via pip, use the command below.

pip install lxml

  • apt-get

If you are using a Debian-based Linux distribution such as Ubuntu, you can install lxml for Python 3 with the system package manager instead.

sudo apt-get install python3-lxml
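Whichever route you take, it helps to confirm that the install worked before moving on. The snippet below is a minimal sketch of such a check; it only assumes that lxml exposes its version tuples on the etree module, and the variable name lxml_available is made up for this illustration.

```python
# Confirm that lxml is importable and report the versions it was built with.
try:
    from lxml import etree
except ImportError:
    lxml_available = False
    print("lxml is not installed; run: pip install lxml")
else:
    lxml_available = True
    # LXML_VERSION and LIBXML_VERSION are tuples of integers.
    print("lxml version:", ".".join(str(part) for part in etree.LXML_VERSION))
    print("libxml2 version:", ".".join(str(part) for part in etree.LIBXML_VERSION))
```

If the first message appears, go back to one of the install commands above.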

  • How to Create Html Documents Using lxml

The lxml package has an etree module, which is useful for creating elements/nodes. With this module, you can create an HTML/XML document from the ground up.

My style of tutorial is to come up with a project and work through it, so that you gain a better understanding than you would from descriptions of classes and methods alone; if that is what you want, the official documentation is your best source. Below is a simple HTML document that will be replicated programmatically using Python and lxml.

<html>
<head>
<title>lxml tutorials</title>
</head>
<body>
<h1>Learn lxml on this page </h1>
</body>
</html>

To create the above HTML document using lxml, create a new Python file (module) named lxml_tutorial.py. As stated earlier, the etree module is the one used for creating document nodes. Looking at the document above, we have the html opening and closing tags, a head tag with a title tag inside, and a body tag with an h1 tag inside. Let us start by importing the etree module into the lxml_tutorial.py file.

from lxml import etree

Two functions in the etree module are of importance to us: Element and SubElement. We will use the Element function to create the html tag, while the other tags will be created using the SubElement function. The code for creating the above document is below.

from lxml import etree

# Create the root html element
html = etree.Element("html")
# Create the head element and the title inside it
head = etree.SubElement(html, "head")
title = etree.SubElement(head, "title")
title.text = "lxml tutorials"
# Create the body element and the h1 heading inside it
body = etree.SubElement(html, "body")
heading = etree.SubElement(body, "h1")
heading.text = "Learn lxml on this page "
# Serialize the tree to bytes, then decode it to a string for printing
print(etree.tostring(html, pretty_print=True).decode())

If you run the code above, the output below is displayed on your screen. Apart from the indentation added by pretty_print, it is the same as the HTML document we were trying to replicate.

<html>
  <head>
    <title>lxml tutorials</title>
  </head>
  <body>
    <h1>Learn lxml on this page </h1>
  </body>
</html>
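Printing to the console is useful for checking, but you will usually want the document saved as a file. The sketch below rebuilds the same tree and writes it out with HTML serialization rules and a doctype. The file name lxml_tutorial_output.html and the use of the temporary directory are assumptions made for this illustration, not part of the project above.

```python
import tempfile
from pathlib import Path

from lxml import etree

# Rebuild the small document from the tutorial.
html = etree.Element("html")
head = etree.SubElement(html, "head")
title = etree.SubElement(head, "title")
title.text = "lxml tutorials"
body = etree.SubElement(html, "body")
heading = etree.SubElement(body, "h1")
heading.text = "Learn lxml on this page "

# method="html" applies HTML serialization rules; doctype prepends the declaration.
markup = etree.tostring(
    html,
    method="html",
    doctype="<!DOCTYPE html>",
    pretty_print=True,
)

# Hypothetical output location: a file in the system temporary directory.
output_path = Path(tempfile.gettempdir()) / "lxml_tutorial_output.html"
output_path.write_bytes(markup)  # tostring returned bytes, so write bytes
print(output_path.read_text(encoding="utf-8"))
```

Writing the bytes returned by tostring directly avoids having to decode and re-encode the markup.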

Now let us take a look at the code. etree.Element takes a minimum of one argument, while etree.SubElement takes a minimum of two. This is because the Element function creates the root element, which in this case is the html element, so it needs only a tag name.

SubElement accepts at least two arguments: the first is the parent element, and the second is the name of the element/tag. As you can see from the code, the title tag is a child of the head element, and as such, the code is title = etree.SubElement(head, "title").

Aside from the compulsory arguments, you can add as many keyword arguments as you like; they define the attributes of the element. Let us say you want to set the font size of the h1 element to 15pt. You can do that by adding a style argument when creating the h1 element.

heading = etree.SubElement(body, "h1", style="font-size:15pt")

It is that simple.
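Keyword arguments are not the only way to work with attributes. The sketch below shows a few related calls on an element: set for adding an attribute afterwards, get for reading one, and attrib for seeing them all. The attribute values main-title and page-heading are invented for this example.

```python
from lxml import etree

body = etree.Element("body")

# Keyword arguments become attributes on the new element.
heading = etree.SubElement(body, "h1", style="font-size:15pt", id="main-title")

# "class" is a reserved word in Python, so add that attribute with set() instead.
heading.set("class", "page-heading")
heading.text = "Learn lxml on this page"

print(heading.get("style"))       # read a single attribute
print(heading.get("lang", "en"))  # fall back to a default when it is missing
print(dict(heading.attrib))       # all attributes as a plain dict
print(etree.tostring(body).decode())
```

The set method is also the way to change an attribute on an element that already exists.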


  • Parsing HTML Documents With lxml

Now that you know how to create HTML and XML documents using lxml, it is time to move on to traversing and parsing them. Parsing is also handled by the etree module, through its parse function. For the parsing tutorial, create a new HTML document named htmlinput.html and paste the HTML code below into it.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Learning how to parse HTML document using lxml</title>
</head>
<body>
<a href="https://fjhfgjhfdjsjd.com" class="urls">First URL</a>
<a href="https://fjhfjdsjdfhdsjhsdd.com" class="urls">Second URL</a>
<a href="https://fjhfjdsjddfjkdfjds.com" class="urls">Third URL</a>
<a href="https://fjhfjdsjdhjfjhds.com" id="oly">Fourth URLS</a>
<h1>HELLO PARSER</h1>
</body>
</html>

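Before you can parse the file, you need to load it into memory using the parse function. Because the file is HTML rather than well-formed XML (for example, its meta tag is never closed), pass an etree.HTMLParser to parse; the default XML parser would reject it. After loading, call the getroot() method to get the root element, and from there you can start parsing.

from lxml import etree

# Parse with the HTML parser, since the file is not well-formed XML
html = etree.parse("htmlinput.html", etree.HTMLParser())
html = html.getroot()
etree.dump(html)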
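Markup does not have to come from a file, and it does not have to be tidy either. The sketch below feeds a deliberately broken string to lxml's HTML parser through etree.HTML to show how it repairs the structure. The sample markup is invented for this illustration.

```python
from lxml import etree

# Deliberately broken markup: <p> and <b> are never closed,
# and there is no html or body wrapper.
broken = "<h1>HELLO PARSER</h1><p>An unclosed <b>paragraph"

# etree.HTML parses a string with the HTML parser and returns the root element.
root = etree.HTML(broken)

print(root.tag)
print(etree.tostring(root, pretty_print=True).decode())
```

The parser supplies the missing html and body elements and closes the open tags, which is what makes it practical for badly written pages.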

The last line prints the content of the HTML file to the console; it is there only for debugging purposes. Now that we have the file content in memory, it is time to start parsing. If we want to find all the URL elements, we can do that using the code below.

from lxml import etree

# The HTML parser is needed because the file is not well-formed XML
html = etree.parse("htmlinput.html", etree.HTMLParser())
html = html.getroot()
urls = html.findall("body/a")
all_url = []
for i in urls:
    # Map the link text to its first attribute value, which is the href here
    d = {i.text: i.values()[0]}
    all_url.append(d)
print(all_url)

If you run the code above, you will get the response below in the console.

[{'First URL': 'https://fjhfgjhfdjsjd.com'}, {'Second URL': 'https://fjhfjdsjdfhdsjhsdd.com'}, {'Third URL': 'https://fjhfjdsjddfjkdfjds.com'}, {'Fourth URLS': 'https://fjhfjdsjdhjfjhds.com'}]
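Reading the first attribute by position works for this file, but it breaks as soon as an element lists another attribute before href. As a hedged alternative, the sketch below asks for the attribute by name with get, and also narrows the search to links with class="urls" using a path predicate. It uses an inline copy of the links so that it runs without htmlinput.html.

```python
from lxml import etree

# Inline copy of the links from htmlinput.html so the sketch runs on its own.
markup = """
<html><body>
<a href="https://fjhfgjhfdjsjd.com" class="urls">First URL</a>
<a href="https://fjhfjdsjdfhdsjhsdd.com" class="urls">Second URL</a>
<a href="https://fjhfjdsjddfjkdfjds.com" class="urls">Third URL</a>
<a href="https://fjhfjdsjdhjfjhds.com" id="oly">Fourth URLS</a>
</body></html>
"""
root = etree.HTML(markup)

# The predicate keeps only the <a> elements whose class attribute is "urls".
class_links = root.findall("body/a[@class='urls']")

# get("href") names the attribute instead of relying on its position.
links_by_text = {link.text: link.get("href") for link in class_links}
print(links_by_text)
```

The fourth link is left out because it carries an id rather than the urls class.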

You can see how we collected all the URLs from the document using findall("body/a"). The argument is the path to traverse to reach the required data. If you want a single element rather than a list, you can use the find method. Here is an example.

from lxml import etree

# The HTML parser is needed because the file is not well-formed XML
html = etree.parse("htmlinput.html", etree.HTMLParser())
html = html.getroot()
# find returns the first matching element, or None if nothing matches
heading = html.find("body/h1")
print(heading.text)

The above should print “HELLO PARSER” to the console.
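Two related calls are worth knowing once find and findall feel familiar. The sketch below uses findtext, which returns the text directly and accepts a default, and xpath, which can search at any depth and return attribute values. The markup is a shortened inline copy of htmlinput.html so that the example is self-contained.

```python
from lxml import etree

# Shortened inline copy of htmlinput.html.
markup = """
<html><head><title>Learning how to parse HTML document using lxml</title></head>
<body>
<a href="https://fjhfgjhfdjsjd.com" class="urls">First URL</a>
<a href="https://fjhfjdsjdhjfjhds.com" id="oly">Fourth URLS</a>
<h1>HELLO PARSER</h1>
</body></html>
"""
root = etree.HTML(markup)

# findtext returns the text of the first match, or the default if nothing matches.
heading_text = root.findtext("body/h1", default="")
missing_text = root.findtext("body/h2", default="(no h2)")

# xpath searches the whole tree; "//" matches at any depth.
hrefs = root.xpath("//a/@href")
oly_text = root.xpath("//a[@id='oly']/text()")

print(heading_text)
print(missing_text)
print(hrefs)
print(oly_text)
```

Unlike find, which returns None for a missing element and then fails on .text, findtext with a default never raises for a missing match.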


How to Web Scrape Using Requests and lxml

Most web scraping tutorials feature Requests and Beautiful Soup. What many do not know is that Beautiful Soup is not an actual parser, and as such, it requires one to function. If you do not specify a parser, a warning is printed to the console and Beautiful Soup picks the best parser it finds installed, falling back to Python's built-in html.parser.

lxml is one of the fastest parsers you can use with Beautiful Soup. Interestingly, you can also use it as a standalone tool. Let me show you how to use the Requests and lxml combo for web scraping. The code below shows you how to scrape all the links on a page using Requests and lxml.

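import requests
from lxml import html

page_download = requests.get("https://www.worldometers.info/geography/how-many-countries-in-asia/")
# Stop early if the server returned an error status
page_download.raise_for_status()
doc = html.fromstring(page_download.content)
# "//a/@href" collects the href of every link, wherever it sits in the page
links = doc.xpath("//a/@href")
print(links)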
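Scraped links are often relative, such as /about, which is not much use outside the page they came from. The sketch below is one way to turn them into absolute URLs with lxml.html. The helper name extract_links, the sample page, and the example.com base URL are all assumptions made for this illustration; it works on a string so that it runs without a network connection.

```python
from lxml import html


def extract_links(page_source, base_url):
    """Return the absolute target of every <a href> found in page_source."""
    doc = html.fromstring(page_source)
    # Rewrite relative URLs such as "/about" against base_url.
    doc.make_links_absolute(base_url)
    # iterlinks yields (element, attribute, link, position) tuples.
    return [
        link
        for element, attribute, link, _ in doc.iterlinks()
        if element.tag == "a" and attribute == "href"
    ]


# Hypothetical page content standing in for a downloaded page.
sample_page = """
<html><body>
<div><a href="/about">About</a></div>
<p>See <a href="https://example.com/docs">the docs</a>.</p>
</body></html>
"""
links = extract_links(sample_page, "https://example.com")
print(links)
```

With Requests, you would pass the downloaded page text and the page URL to the same helper.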

You can see how easy it is to use lxml for web scraping. However, this is because we are only extracting links; it can get messy with more complex pages. I must confess, Beautiful Soup does it for me: it is still the easiest to use because of its simplicity. I would rather use lxml as the parser behind Beautiful Soup.
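Using lxml behind Beautiful Soup takes a single extra argument. The sketch below assumes the beautifulsoup4 package is installed alongside lxml (pip install beautifulsoup4); the sample page and its example.com URLs are invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded page.
sample_page = """
<html><body>
<a href="https://example.com/one" class="urls">First URL</a>
<a href="https://example.com/two" class="urls">Second URL</a>
<a href="https://example.com/three" id="oly">Third URL</a>
</body></html>
"""

# Naming "lxml" tells Beautiful Soup to build its tree with lxml's HTML parser.
soup = BeautifulSoup(sample_page, "lxml")

# class_ has a trailing underscore because "class" is a reserved word in Python.
hrefs = [anchor["href"] for anchor in soup.find_all("a", class_="urls")]
print(hrefs)
```

Naming the parser explicitly also avoids the warning Beautiful Soup prints when it has to guess one.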


What Next After this Introduction?

If you have gone through the tutorial, you can tell that it is shallow. That is intentional: we had to hide many of the details and focus on the basics so that slower learners can grasp them faster.

Even though the guide is shallow, you cannot deny the fact that the ground has been watered for you. Now that you can code a little bit using lxml, it is time to read more about the lxml library.

While there are many websites and blogs that provide guides on how to make use of lxml, I will advise you to head straight to the official lxml website to learn more about it. Make sure you pay attention to the tutorial and documentation section of the website. From the official website, you can now explore other content online.

Conclusion 

If you have gone through the tutorial from the beginning to the end, I expect you to have acquired basic knowledge of HTML and XML processing using lxml. HTML and XML are not the easiest data to handle, and they can be difficult to parse depending on the tool you use.

Without a library, you would have to depend on regular expressions, which are known to be messy. Thanks to the likes of lxml, parsing HTML and XML has been made easier, faster, and less error-prone.

Aside from parsing, another thing you will come to like about this XML and HTML parsing library is its support for creating and modifying HTML and XML files. As far as lxml is concerned, what is contained in this tutorial article is just the tip of the iceberg. There is more to learn – go and explore.