The easiest websites to scrape are static pages, where all the content arrives in the response to the initial request. Sadly, such sites are gradually fading out as dynamic websites take over.
Selenium WebDriver – an Overview
Selenium was not originally developed for web scraping; it was built for testing web applications and has since found a use in scraping. In technical terms, Selenium or, more appropriately, Selenium WebDriver is a portable framework for testing web applications.
In simple terms, all Selenium does is automate web browsers. And as the team behind Selenium rightly puts it, what you do with that power is up to you! Selenium supports Windows, macOS, and Linux. In terms of browser support, you can use it to automate Chrome, Firefox, Internet Explorer, Edge, and Safari. Also important is the fact that Selenium can be extended with third-party plugins.
With Selenium, you can automate filling in forms, clicking buttons, taking a screenshot of a page, and other specific tasks online. One of these tasks is web data extraction. While you can use it for web scraping, it is certainly not the Swiss Army knife of web scraping; it has downsides that will make you avoid it for certain use cases.
The most notable downside is its slow speed. If you have tried Scrapy or the combination of Requests and BeautifulSoup, you have a speed benchmark against which Selenium will rank as slow. This is because it drives a real browser, and the page has to be rendered.
For static sites where all content is downloaded on page load, or sites whose API requests you can easily replicate, you will want to use the better option, which is Scrapy or the duo of Requests and BeautifulSoup.
Selenium is a third-party library, and as such, you will need to install it before you can make use of it. Before installing Selenium, make sure you already have Python installed. To install Python, you can visit the official Python download page. For Selenium to work, you will need to install the Selenium package and then the driver for the specific browser you want to automate. You can install the library using pip.
    pip install selenium
Selenium has drivers for Chrome, Firefox, and many other browsers. Our focus in this article is on Chrome. If you don’t have Chrome installed on your computer, you can download it from the official Google Chrome page. With Chrome installed, you can then go ahead and download the ChromeDriver binary.
Make sure you download the driver that matches the version of Chrome you have installed. The download is a zip file with the actual driver inside it. Extract the driver (chromedriver.exe on Windows) and place it in the same folder as the Selenium script you are writing.
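Before writing any scraper, it can help to confirm that both pieces are in place. The sketch below is one way to do that with only the standard library. The helper name check_setup and the default driver name chromedriver are my own choices, not part of Selenium, and the check only looks on PATH and in the given folder.

```python
import importlib.util
import shutil
from pathlib import Path


def check_setup(driver_name="chromedriver", folder="."):
    """Report whether the selenium package and a driver binary can be found.

    check_setup is a hypothetical helper written for this guide.
    """
    # find_spec returns None when the package cannot be imported.
    selenium_installed = importlib.util.find_spec("selenium") is not None

    # Look for the driver on PATH first, then in the given folder.
    on_path = shutil.which(driver_name) is not None
    candidates = [Path(folder) / driver_name, Path(folder) / f"{driver_name}.exe"]
    in_folder = any(p.is_file() for p in candidates)

    return {
        "selenium_installed": selenium_installed,
        "driver_found": on_path or in_folder,
    }


print(check_setup())
```

If driver_found comes back False, that is not always fatal: Selenium 4.6 and later bundle Selenium Manager, which tries to locate or download a matching driver on its own.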
Selenium Hello World
As is the tradition in coding tutorials, we start this Selenium guide with the classic hello world program. The code does not scrape any data at this point. All it does is attempt to log in to an imaginary Twitter account. Let’s take a look at the code below.
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # Dummy credentials for an imaginary account
    username = "concanated"
    password = "djhhfhfhjdghsd"

    # Launch Chrome and open the login page
    driver = webdriver.Chrome()
    driver.get("https://twitter.com/login")

    # Fill in the username and password fields, located by their name attribute
    name_form = driver.find_element(By.NAME, "session[username_or_email]")
    name_form.send_keys(username)
    pass_form = driver.find_element(By.NAME, "session[password]")
    pass_form.send_keys(password)

    # Press Enter to submit, wait 5 seconds, then close the browser
    pass_form.send_keys(Keys.RETURN)
    time.sleep(5)
    driver.quit()

The values of the username and password variables are dummies. When you run the above code, it will launch Chrome and then open the Twitter login page. The username and password will be typed in and then submitted.
Because the username and password are not correct, the page displays an error message, and after 5 seconds, the browser is closed. As you can see from the above, you need to specify the web browser to automate, which we did with webdriver.Chrome(). The get method sends a GET request for the given URL. After the page has loaded successfully, we use find_element with By.NAME to find the username and password input elements and then use send_keys to fill the input fields with the appropriate data. Selenium releases before 4.3 also offered find_element_by_name for the same lookup.
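The fixed time.sleep(5) in the script is a blunt way of waiting. A common alternative is to poll for a condition and stop as soon as it holds. The sketch below shows that idea in plain Python. Note that wait_until is a hypothetical helper written for illustration, not a Selenium function, and page_ready is a stand-in for a real check. Selenium ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "has the page finished loading?": true on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(page_ready, timeout=2.0, interval=0.01))
```

With a real driver, the condition could be something like lambda: driver.find_elements(By.NAME, "session[password]"), since an empty list is falsy until the field appears.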
Sending Web Requests
Sending web requests using Selenium is one of the easiest tasks to do. Unlike HTTP libraries, which make you choose between GET and POST requests, Selenium simply navigates the browser to a URL; a POST only happens when you submit a form on the page, as in the login example. All that’s required is for you to call the get method on the driver, passing the URL as an argument. Let’s see how that is done in action below.
    from selenium import webdriver

    driver = webdriver.Chrome()
    # visit Twitter homepage
    driver.get("https://twitter.com/")
    # page source
    print(driver.page_source)
    driver.quit()
Running the code above will launch Chrome in automation mode, visit the Twitter homepage, and print the HTML source code of the page using the page_source attribute. You will see a notification below the address bar telling you that Chrome is being controlled by automated test software.
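Once you have the page source as a string, any HTML parser can take over. As a rough sketch, here is how the standard library’s html.parser could pull the title and links out of it. The LinkCollector class is my own illustration, and the sample HTML is invented to stand in for what driver.page_source would return.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Invented HTML standing in for the string returned by driver.page_source.
sample_source = """
<html><head><title>Example page</title></head>
<body><a href="/login">Log in</a> <a href="/about">About</a></body></html>
"""

parser = LinkCollector()
parser.feed(sample_source)
print(parser.title)
print(parser.links)
```

In a real script you would call parser.feed(driver.page_source) instead of feeding the sample string.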
Chrome in Headless Mode
In the examples above, a Chrome window is launched. This is the headful approach, used mainly for debugging. When you are ready to run your script on a server or in a production environment, you will not want a Chrome window to open; you will want it to work in the background. Running Chrome without a visible window is known as headless mode. Below is how to run Selenium with Chrome in headless mode.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Pay attention to the code below
    options = Options()
    # The original options.headless = True is deprecated in Selenium 4
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    # visit Twitter homepage
    driver.get("https://twitter.com/")
    # page source
    print(driver.page_source)
    driver.quit()
Running the code above will not launch Chrome for you to see – all you see is the source code of the page visited. The only difference between this code and the one before it is that this one is running in the headless mode.
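Since you will likely switch between headful runs for debugging and headless runs in production, it can be convenient to build the Chrome switches in one place. The following is a minimal sketch of that. The names chrome_arguments and make_driver are hypothetical, and the fixed window size is an assumption of mine, often set so that pages render with a desktop layout.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the list of Chrome command-line switches for a scraping run."""
    args = []
    if headless:
        args.append("--headless")
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def make_driver(headless=True):
    """Start Chrome with the switches above (needs Selenium and Chrome)."""
    # Imported here so chrome_arguments can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments())
print(chrome_arguments(headless=False))
```

Calling make_driver(headless=False) while debugging and make_driver() on the server keeps the rest of the script identical in both cases.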
Accessing Elements on a Page
There are basically 3 things involved in web scraping – sending web requests, parsing page source, and then processing or saving the parsed data. The first two are usually the focus as they present more challenges.
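You have already learned how to send web requests. Now let me show you how to access elements in order to parse data out of them or carry out a task with them. In the code above, we used the page_source attribute to access the page source. This is only useful when you want to parse using BeautifulSoup or other parsing libraries. If you want to stay within Selenium, you do not have to use page_source, because the driver has its own attributes and methods for this:
- driver.title is for retrieving the page title.
- driver.current_url is for retrieving the URL of the page in view.
- driver.find_element(By.NAME, "password") is for retrieving an element by its name, e.g., a password input with the name password.
- driver.find_element(By.TAG_NAME, "h1") is for retrieving an element by tag name such as a, div, span, body, h1, etc.
- driver.find_element(By.CLASS_NAME, "thin-long") is for retrieving an element by class name.
- driver.find_element(By.ID, "main") is for finding an element by id.
By is imported from selenium.webdriver.common.by. Older tutorials, and Selenium releases before 4.3, spell these lookups as find_element_by_name, find_element_by_tag_name, find_element_by_class_name, and find_element_by_id.
For each of the find_element lookups, there is a corresponding find_elements call that retrieves a list of elements instead of one. The plural form also accepts By.ID, although an id is supposed to be unique on a page, so it is rarely needed there. Take, for instance, if you want to retrieve all elements with the “thin-long” class, you can make use of driver.find_elements(By.CLASS_NAME, "thin-long"). The difference is the plural elements in the method name.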
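If you are working from older code, keep in mind that Selenium 4.3 removed the find_element_by_* helpers in favour of find_element(By.X, value). The sketch below is a small translation aid for the lookups covered here. The function modern_call is hypothetical, written only to show how the old names map onto the new call; the strategy strings are the values that By.NAME, By.TAG_NAME, By.CLASS_NAME, and By.ID hold in Selenium 4.

```python
import re

# Legacy suffix -> locator strategy string. In Selenium 4 these strings are
# the values of By.NAME, By.TAG_NAME, By.CLASS_NAME and By.ID.
STRATEGIES = {
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "id": "id",
}


def modern_call(legacy_method):
    """Translate a legacy helper name to (method name, strategy string)."""
    match = re.fullmatch(r"(find_elements?)_by_(\w+)", legacy_method)
    if match is None or match.group(2) not in STRATEGIES:
        raise ValueError(f"not a recognised legacy helper: {legacy_method}")
    return match.group(1), STRATEGIES[match.group(2)]


print(modern_call("find_element_by_name"))
print(modern_call("find_elements_by_class_name"))
```

So a legacy call such as driver.find_elements_by_class_name("thin-long") becomes driver.find_elements(By.CLASS_NAME, "thin-long").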
Interacting with Elements on a Page
With the above, you can find specific elements on a page. However, you do not find them just for the sake of it; you will need to interact with them, either to trigger certain events or to retrieve data from them. Let’s take a look at some of the interactions you can have with elements on a page using Selenium and Python.
- element.text will retrieve the text attached to an element.
- element.click() will trigger the click action and the events that follow it.
- element.send_keys() is meant for filling input forms.
- element.is_displayed() is for detecting whether an element is visible to real users or not; this is perfect for honeypot detection.
- element.get_attribute("class") is for retrieving the value of an element’s attribute. You can swap “class” for any other attribute name.
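To see how these fit together, here is a small sketch that reads text, visibility, and an attribute from elements, and skips anything hidden. The helpers visible_texts and summarize are my own, and FakeElement is a stand-in for a Selenium WebElement so that the example runs without a browser; with a real driver you would pass in the elements returned by find_elements.

```python
def visible_texts(elements):
    """Return the text of every element a real user could see."""
    return [el.text for el in elements if el.is_displayed()]


def summarize(element):
    """Gather a few commonly needed facts about one element."""
    return {
        "text": element.text,
        "visible": element.is_displayed(),
        "class": element.get_attribute("class"),
    }


class FakeElement:
    """Stand-in for a Selenium WebElement so the sketch runs without a browser."""

    def __init__(self, text, displayed, css_class):
        self.text = text
        self._displayed = displayed
        self._css_class = css_class

    def is_displayed(self):
        return self._displayed

    def get_attribute(self, name):
        return self._css_class if name == "class" else None


links = [
    FakeElement("Home", True, "nav"),
    FakeElement("Secret trap", False, "nav hidden"),  # honeypot-style hidden link
]
print(visible_texts(links))
print(summarize(links[0]))
```

Filtering on is_displayed() before clicking or following links is the honeypot check mentioned above: a link no human can see is one a scraper should probably leave alone.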
With the above, you have what is required to start scraping data from web pages. I will use it to scrape the list of US states with their capitals, populations (census), and estimated populations from the Britannica website. Take a look at the code below.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    # Pay attention to the code below
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.britannica.com/topic/list-of-state-capitals-in-the-United-States-2119210")

    list_states = []
    # Every tr inside the page's only tbody is one state
    trs = driver.find_element(By.TAG_NAME, "tbody").find_elements(By.TAG_NAME, "tr")
    for i in trs:
        # Collect the text of each td cell in this row
        tr = i.find_elements(By.TAG_NAME, "td")
        tr_data = []
        for x in tr:
            tr_data.append(x.text)
        list_states.append(tr_data)

    print(list_states)
    driver.quit()
Looking at the above, we put into practice almost all of what we discussed. Pay attention to the trs variable. If you look at the source code of the page, you will discover that the list of states and the associated information are contained in a table. Neither the table nor its body has a class.
Conveniently, it is the only table on the page, so we can look up the tbody element by its tag name (find_element_by_tag_name("tbody") in the legacy API, find_element(By.TAG_NAME, "tbody") in Selenium 4). Each row in the tbody element represents a state, with each piece of its information embedded in a td element. We call the plural find_elements form with the td tag name to retrieve those td elements.
The first loop iterates through the tr elements. The second one iterates through the td elements of each tr element. The text attribute (element.text) was used to retrieve the text attached to an element.
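The third step of scraping, processing or saving the parsed data, is not covered by the script, which only prints a list of lists. One possible follow-up, sketched below, turns those rows into records and CSV text. The column names are my own, based on the fields described above, and the sample rows are invented placeholders in the shape list_states would have.

```python
import csv
import io

# Column names based on the fields described in the text; the exact header
# wording on the page may differ.
COLUMNS = ["state", "capital", "population_census", "population_estimate"]


def rows_to_records(rows, columns=COLUMNS):
    """Turn lists of cell text into dicts, skipping rows without enough cells."""
    records = []
    for row in rows:
        if len(row) < len(columns):
            continue  # e.g. a row that contains no td cells at all
        records.append(dict(zip(columns, row)))
    return records


def records_to_csv(records, columns=COLUMNS):
    """Serialise the records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder rows in the shape list_states would have; the values are invented.
sample_rows = [
    [],
    ["State A", "City A", "1,000", "1,100"],
    ["State B", "City B", "2,000", "2,200"],
]
records = rows_to_records(sample_rows)
print(records_to_csv(records))
```

In the real script you would pass list_states to rows_to_records and write the resulting text to a file.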
You Have Learnt the Basics: Now What?
We have now shown you how to scrape a page using Selenium and Python. However, what you have learned is just the basics, and there is more to learn. For example, you will need to know how to carry out mouse movements and keyboard actions.
One thing you will come to like about Selenium is that it makes the whole process of scraping easy, as you do not have to deal with cookies or with replicating hard-to-replicate web requests. Better still, it is easy to use.