Big data comes with challenges that conventional data-collection techniques cannot handle. Let me show you how to collect big data on the Internet.
Data has become a crucial part of most businesses, which rely on it to make business decisions. Fortunately, unlike in the past, when the availability of data was a huge problem and businesses had to go the extra mile to collect the data they required for their operations, data is now readily available. The major problem businesses face now is collecting data that is not generated by their own systems. One of the major sources of big data is the Internet.
It might interest you to know that the New York Stock Exchange generates about one terabyte of data daily. Even this is small compared to user-generated data on social media platforms: Facebook, for instance, generates about 500 terabytes of data daily. At these volumes, you will realize that the traditional method of web scraping is not feasible for collecting and processing the data. The volume, speed of generation, and variety of data types associated with big data require special techniques, which will be discussed below. Before that, let's take a look at what big data is.
What is Big Data?
The "Big" in Big Data is what differentiates it from regular data, and that one word changes everything. The term Big Data describes a collection of data that is huge in volume and grows exponentially over time. Big Data is so large and complex that traditional data-handling tools and techniques are not suitable for handling it.
Big data can exist as structured data, as in the case of data stored in spreadsheet formats. It can also be semi-structured data, such as data stored in XML files. Another type of data associated with Big Data is unstructured data, such as free-form text. Big Data can exist as a mix of structured, semi-structured, and unstructured data.
Characteristics of Big Data
The characteristics of Big Data can be identified by looking at what is known as the 5 Vs of Big Data: volume, velocity, variety, veracity, and value.
What makes a data collection Big Data in the first place is its volume. If a dataset is not substantially bigger than regular data that can be processed with traditional data processing tools, you cannot call it Big Data.
Another thing that characterizes Big Data is velocity – the speed at which the data is generated and processed. It might interest you to know that a jet engine can generate about 10 terabytes of data within 30 minutes of operation. Big Data is generated at a very high speed.
Variety refers to the types of data. Big Data contains data from many sources, and the data is heterogeneous – the datasets that make up Big Data include structured, semi-structured, and unstructured data types, and they usually come from more than one source.
Another very important characteristic of Big Data is veracity – its quality. For a data collection to be referred to as Big Data, the data must be accurate, with no missing values that would lead to a distorted analysis.
Lastly, Big Data adds value to the business or research agency that possesses it. Big Data has helped boost retail sales in e-commerce, prevent serious health problems, and even uncover fraud, among other things.
How to Collect Big Data
Big Data can come from a variety of sources. Some of it is automatically generated, as in the case of Automatic Identification and Data Capture (AIDC). Examples of devices that generate data via AIDC are smart card readers, voice recognition systems, QR code readers, facial recognition systems, and optical character recognition systems. Humans can also be directly involved in collecting Big Data. Another source of Big Data is the Internet, where you can use automated bots to collect data publicly available on web pages.
Each of these sources has its own unique collection method. We'll focus mainly on how to collect Big Data from the Internet via web scraping. It might interest you to know that you cannot collect Big Data from the Internet and process it the way you do smaller data; some engineering has to be involved before you can get it right. Let's take a look at this below. We assume you already have web scraping skills and know about the use of proxies, CAPTCHA solvers, and other advanced techniques for evading anti-scraping systems.
Web Scrape in Parallel Using Distributed Web Scrapers
When collecting small amounts of data from a few pages, you can get away with a simple crawler that works on a single thread – it goes to a page, downloads it, scrapes it, and stores the result before moving to the next page. This is the simplest kind of scraper – and also the slowest. If you intend to collect data from hundreds of thousands of pages, a single-threaded scraper will take a whole day or even more to finish. For this reason, multithreaded web scrapers that can scrape data from many web pages concurrently were introduced.
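The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: the `scrape_page` function is a placeholder where your real HTTP request and parsing logic would go.

```python
# Minimal sketch of a multithreaded scraper using Python's standard
# library. scrape_page is a placeholder for your real fetch-and-parse
# logic; here it just returns a stub record.
from concurrent.futures import ThreadPoolExecutor

def scrape_page(url):
    # In a real scraper this would issue an HTTP request and parse
    # the HTML. Stubbed out so the sketch is self-contained.
    return {"url": url, "status": "scraped"}

def scrape_all(urls, max_workers=8):
    # Each worker pulls the next URL as soon as it is free, so slow
    # pages no longer block the whole run as on a single thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_page, urls))

results = scrape_all([f"https://example.com/page/{i}" for i in range(20)])
```

With real network calls, the speedup comes from overlapping the time each thread spends waiting on I/O.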
Multithreaded web scrapers work on one computer but have many workers (scrapers) that scrape on different threads. This speeds up the whole process of collecting data from thousands and even millions of pages. However, when faced with the task of collecting Big Data from the Internet, a single machine still lacks the processing power to get the job done in the required time. For Big Data collection, you will need distributed web scrapers that work on computer clusters in parallel to speed up the process.
For distributed web scrapers, you can have them working on many computers (clusters). They all work together in parallel to achieve the same result. The web scraper should have a way to distribute the workload among the computers in the cluster. By doing this, the tasks are not left for one computer to handle but are spread over many. If you are not ready to manage a computer/server cluster on your own, you can make use of a public cloud provider that will provide you with the storage space and servers you require to collect the data. Some of the most popular offerings include Amazon EMR, Google Cloud Dataprep, and Microsoft Azure HDInsight.
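One simple way to distribute the workload is to hash each URL and assign it to a node by the hash value, so the same URL always lands on the same machine. This is a sketch only; the node count of 4 is illustrative, and real clusters often use a coordinator such as a message queue instead of static assignment.

```python
# Sketch of splitting a URL workload across cluster nodes by hashing.
# Hashing makes the assignment deterministic: the same URL always maps
# to the same node, which also helps avoid duplicate work.
import hashlib

def assign_node(url, num_nodes):
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [f"https://example.com/item/{i}" for i in range(1000)]
by_node = {}
for url in urls:
    by_node.setdefault(assign_node(url, 4), []).append(url)
# Every URL now belongs to exactly one of the 4 nodes' work queues.
```

Each node then scrapes only its own partition, and no two nodes fetch the same page.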
These public cloud providers are Big Data platforms and come with tools that simplify your Big Data project. By using your own computer clusters or the service of any of the public cloud providers, you will achieve the speed your data collection task requires. Take Googlebot, for example: it is a distributed web crawler that works on server clusters to achieve its goal of updating the Google index. Without a distributed system, it would be extremely difficult for Google to crawl the Internet as fast as it does.
Use a Distributed File Storage System
We are talking about big data here, and most likely the data you need to collect will be larger than a single machine can store. Even if you're not at that point yet, you'll keep collecting until a single computer can no longer hold the data. So when planning on collecting Big Data, you will have to ditch your regular database systems. What you need is a distributed file storage system. One of the most popular distributed file storage systems used for storing Big Data is the Hadoop Distributed File System (HDFS).
Your distributed web scraper will require a distributed storage system to function. Because you're using a web scraper working in parallel on a computer cluster, you will need to guard against a lot of things, most notably read and write errors and database corruption. HDFS is scalable, fault-tolerant, and simple to use and expand. Aside from the collection phase, distributed storage systems also have their place in the processing phase, as the processing tools for Big Data are often distributed too.
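HDFS itself needs a running cluster, so the sketch below only mirrors the storage layout it popularized: instead of one ever-growing file or database table, scraped records are spread across numbered "part" files, the same convention HDFS and most distributed processing tools expect. The record format and part size here are illustrative.

```python
# Sketch of HDFS-style output: records are written as newline-delimited
# JSON across numbered part files (part-00000, part-00001, ...), so
# downstream distributed tools can process the parts in parallel.
import json
import os
import tempfile

def write_parts(records, out_dir, records_per_part=2):
    paths = []
    for i in range(0, len(records), records_per_part):
        path = os.path.join(out_dir, f"part-{i // records_per_part:05d}.jsonl")
        with open(path, "w") as f:
            for rec in records[i:i + records_per_part]:
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths

records = [{"url": f"https://example.com/{i}"} for i in range(5)]
out_dir = tempfile.mkdtemp()
parts = write_parts(records, out_dir)  # part-00000 .. part-00002
```

On a real cluster you would write the same layout through the HDFS client instead of the local filesystem.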
Dealing with Edge Cases and Exceptions
When collecting data from the Internet, the collection process often does not turn out the way you expect, especially if you do not think through what could go wrong. When collecting Big Data, there are far more possible failures, including many you did not even think of. As a web scraper collecting Big Data, your worst nightmare is your bot throwing an exception and halting every now and then. You will need web scrapers that, in addition to being distributed, take a lot of edge cases into consideration and have laid-down methods of dealing with them gracefully.
Some of the edge cases you have to consider and have your scrapers handle gracefully include incompatible character encodings on some pages, changes in page structure meant to discourage scraping, all kinds of HTTP errors, very large web pages, incorrect HTML/XML markup, page loading problems, and many more. Distributed web scrapers meant for collecting Big Data should anticipate and handle lots of exceptions. Errors that can't be handled automatically should be logged so that you can take care of them later on. If you can take care of hangs, memory leaks, crashes, and other unexpected errors without the bot stopping work, then collecting Big Data becomes possible.
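The pattern described above – retry what might be transient, log what can't be fixed, and never crash the crawl – can be sketched as a small wrapper. The `fetch` callable here is a placeholder for your real HTTP client, and the retry count and backoff values are illustrative.

```python
# Sketch of a fetch wrapper that retries transient failures and logs
# unrecoverable ones instead of crashing the whole crawl.
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")

def fetch_safely(fetch, url, retries=3, backoff=0.0):
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Transient errors (timeouts, 5xx, etc.) get retried.
            log.warning("attempt %d failed for %s: %s", attempt, url, exc)
            time.sleep(backoff * attempt)
    # Unrecoverable failure: record it for later, return None, move on.
    log.error("giving up on %s after %d attempts", url, retries)
    return None

# A flaky fetcher that fails twice before succeeding:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timed out")
    return "<html>ok</html>"

page = fetch_safely(flaky, "https://example.com")  # succeeds on the 3rd try
```

The crawl keeps moving either way; the log, not the traceback, is where failures end up.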
Be Polite to Websites
You can scrape and collect data from thousands of pages without having any adverse effect on a website. Most times, website administrators will not even notice that web scrapers have been used to collect data from their website. However, this is not the case when collecting Big Data. Even if you use generic user agent strings, rotate proxies, and vary header values, website administrators will know their site has been hit by a distributed web scraper because of the unnatural traffic.
This is because the sheer number of requests sent from your distributed architecture will most likely affect the websites of interest – you can end up slowing them down, crashing their servers, or even adding to their running costs. For this reason, you're advised to be polite and respect the robots.txt files of websites. A robots.txt file contains a set of directives issued by a website to web bots.
It is important to stress here that you can choose to ignore the file, as most websites do not like automated access and in most cases will even want to prevent you from accessing their content. But for websites that support scraping and provide directives on crawl delay and other crawling rules, you should respect those directives. Web scraping is generally legal, provided the information is publicly available and not behind any form of paywall or authentication system. However, what you use the data for can make it illegal.
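Python's standard library already ships a robots.txt parser, so honoring the directives is cheap. In the sketch below the rules are fed in directly for illustration; in practice you would read them from the site's /robots.txt URL, and the "mybot" user agent name is an assumption.

```python
# Sketch of checking robots.txt before crawling, using Python's
# standard-library parser. The rules are supplied inline here; a real
# crawler would call rp.set_url(...) and rp.read() instead.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rp.can_fetch("mybot", "https://example.com/products/1")  # True
blocked = rp.can_fetch("mybot", "https://example.com/private/x")   # False
delay = rp.crawl_delay("mybot")  # 5 -> wait 5 seconds between requests
```

A polite distributed scraper checks `can_fetch` before every request and sleeps for `crawl_delay` between requests to the same host.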
Looking at the above, you can see that collecting Big Data from the Internet is quite different from collecting a smaller dataset. The most important difference is that the scraper, the file system, and any other architecture involved need to be distributed across computers/servers, because the requirements of Big Data collection and processing are more than one computer can handle.