Do you want to use the Octoparse scraping tool to collect public data on the Internet? Then you are on the right page, as this article provides a step-by-step guide on how to scrape data using Octoparse.
In the past, web scraping required you to write code yourself or get someone who could write code to build a web scraper for the data you were interested in. This is no longer the case, as a good number of web scrapers have been developed for people with no coding knowledge.
These range from visual web scrapers such as Octoparse, ParseHub, and WebHarvy to web scrapers that provide structured data without you carrying out any data identification task, such as Bright Data’s Data Collector.
Our focus is on Octoparse, which is a visual web scraping tool. In theory, Octoparse is an easy-to-use web scraper: all you have to do is use the point-and-click interface to select some of the data of interest, and the tool will automatically identify similar elements on the page.
In practice, if you have never learned how to use it, you might find it difficult, especially if you are not a technical person. In this article, we provide a tutorial on how to use Octoparse to collect data from the Internet. Before doing so, let’s take a look at an overview of Octoparse.
Overview of Octoparse
The Octoparse tool claims to make web scraping easy for everyone regardless of coding skills. With this web scraper, you will not need to write a single line of code to extract data from web pages on the Internet. All you need is the ability to use a mouse, since that is what the point-and-click interface requires.
However, you will need to provide proxies to avoid getting blocked. You can download scraped data in a good number of file formats such as CSV, Excel, JSON, and databases.
Octoparse is available as a Windows application (Mac is also supported). The service also offers a cloud scraping solution that you can use without installing any software. It even makes it possible to schedule scraping tasks and have the data delivered to you at specific intervals.
Latest Version of Octoparse
In February, version 8.5 of the Octoparse software was released, and Octoparse is now available on Mac as well.
What's new in Octoparse 8.5 version?
Live logs for troubleshooting local runs
Boost Mode for up to 3X faster local runs
Auto-backup local data to the Cloud
Manage your tasks with batch actions
Octoparse is one of the best web scrapers out there if you do not have coding skills. Octoparse also offers a professional data service. This service is for those who do not want to be involved in the web scraping workflow. All they need to do is provide information on the data they are interested in, and the Octoparse professionals will scrape and provide the data for them.
The Octoparse web scraper is a GUI application. That is, it provides you with a user interface that you use to access the service you want. This user interface is quite simple and easy to get used to, even without technical knowledge. In this section of the article, we describe the user interface and let you know what to expect if you intend to use this web scraper. This section relies on screenshots of the UI.
If you log into an account, you will see the interface above. The interface is divided into two sections. The left-hand side of the interface is the sidebar and contains the menu. You can also call it the navigation area. The area labeled the home screen is the main page.
The sidebar menu is the section where all of the navigation options are available. From the main page above, you can see that it holds the main navigation — dashboard, quick filters, recent tasks, team collaboration, data service, and contact us. A click on each of the options would lead to a change in the main interface.
However, the sidebar remains the same in most cases. It is only in a few cases, such as the workspace page shown below, that the content of the sidebar also changes.
The home screen as shown in the first screenshot is rightly called the main area of the application. Octoparse calls this section the workspace. This is because the main tasks, including point and click would be done in this section of the interface. The sidebar in most cases is just for navigation.
When you start creating a new scraping task, this section changes, and the navigation section changes to reflect that as well. With a new task started, the sidebar becomes the workflow area. Aside from the workflow section, three other interface elements exist: data preview, browser, and smart tips.
Workflow: The workflow section shows you a flow chart of all of the actions you take using the in-browser tool to access the data to be scraped. It is this workflow that the tool will follow to get the data you are interested in scraping.
In-browser: Remember that at the beginning of the article we stated that Octoparse is a visual web scraper? Well, it comes with its own browser, and this browser is what you use to access the page that holds the data you are interested in. It is from the browser that you point and click on the required data, so the browser is the point-and-click interface being touted.
Data Preview: This section provides you with details of the data you are scraping. Without this section, you might not know whether you are scraping the wrong data. You can hide this section if you wish.
Smart Tips: This UI element is available at the top right side of the page. The Smart Tips section comes with a toggle button which is meant for removing it if you do not want it. This element is optional and you wouldn’t need it when you know how to make use of the tool.
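To make the "click one item and similar items get selected" idea behind the built-in browser more concrete, here is a minimal Python sketch. It is not Octoparse's actual detection logic, which this article does not document. It simply assumes that "similar" means "same tag and same class attribute". The class name SimilarElementCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class SimilarElementCollector(HTMLParser):
    """Collect the text of every element that shares a tag and a class.

    Assumption: matching elements are not nested inside each other.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._inside = False  # True while parsing a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self.matches.append("")

    def handle_data(self, data):
        if self._inside:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        if self._inside and tag == self.tag:
            self.matches[-1] = self.matches[-1].strip()
            self._inside = False


# Hypothetical page fragment: three product items and one unrelated item.
SAMPLE_HTML = """
<ul>
  <li class="product">Laptop</li>
  <li class="product">Phone</li>
  <li class="ad">Sponsored</li>
  <li class="product">Tablet</li>
</ul>
"""

collector = SimilarElementCollector("li", "product")
collector.feed(SAMPLE_HTML)
print(collector.matches)
```

Picking one "product" item is enough to describe the whole group, which is roughly why a single click in a visual scraper can select a full list.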
Step-by-Step Guide on How to Use Octoparse
The Octoparse web scraper can be used to collect different kinds of data, ranging from job listings to price information and even emails on forums, among other things. While the process of using the software to scrape each kind of data might vary a bit, the fundamentals remain the same. All you need to do is use the point-and-click interface to identify the important data points.
Interestingly, you might not even need to do that manually for some data points, as prebuilt templates are available that you can simply select and use in order to save time and effort. Follow the steps below to learn how to use the Octoparse tool.
Step 1: Download the Octoparse desktop client. Make sure you download it from the official Octoparse website to avoid downloading a malicious copy from third parties. The client is available for Windows and Mac. If you are using an operating system other than these two, you will need to run it on a virtual machine.
Step 2: The download comes packed in a zip file, so you will need to unzip it to find the installer. Before installing it, you should turn off any anti-virus software you have running. This is because Octoparse has some features that make it look like a virus or malware to anti-virus software, so the anti-virus may prevent it from being installed. With the anti-virus out of the way, you can then install the software.
Step 3: Launch the Octoparse application and provide your authentication details. If you enter the right information, you will be greeted with the home screen and sidebar. There are basically two methods of scraping data using the Octoparse web scraper — template mode and advanced mode.
Template Mode: Octoparse has prebuilt templates for popular websites on the Internet, which you can use to collect data from them right away.
Advanced Mode: If you have a special need not captured in the template mode, then you can customize the scraper to collect the specific data you need. This is the mode you use for other kinds of websites not supported by the template mode.
Step 4: Let’s start a new task using the advanced mode. For the advanced mode, all you need to do is provide the URL of the site you want to collect data from. As you enter the URL in the URL input field, Octoparse will automatically detect the website.
Step 5: Once the website loads, you are taken to the built-in browser, which occupies the main area of the page. As you navigate the website, the actions you take are added to the workflow section of the page.
Step 6: The tool automatically identifies important data points on a page. If the data you are interested in is not highlighted, then you will need to do that yourself using the point and click tool. As you click on the data, similar elements are highlighted too.
Step 7: Once you are done, you can check the preview section, which is directly below the in-browser tool. From the preview section, you can remove any columns or rows you are not interested in. If you are satisfied with the preview, check the workflow to make sure it is correct and avoid getting undesirable results.
Step 8: Click on “Save” and then “Run”. You have the option of running the task on your machine or in the cloud. For now, we are using our own computer, so go with the first option. You then choose the format in which you want the data downloaded.
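Once a run has finished and the data is exported, a quick sanity check outside Octoparse can confirm that the export contains what you expect. The sketch below is one way to do that with Python's standard library, assuming a CSV export. The file name octoparse_export.csv, the UTF-8 encoding, and the column names title and price are hypothetical and will differ for your task.

```python
import csv
import io


def summarize_export(file_obj):
    """Return (row_count, {column: number_of_empty_cells}) for a CSV export."""
    reader = csv.DictReader(file_obj)
    empty_counts = {name: 0 for name in (reader.fieldnames or [])}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in empty_counts:
            if not (row.get(name) or "").strip():
                empty_counts[name] += 1
    return row_count, empty_counts


# Stand-in for a real export. With a file on disk you would instead use
# open("octoparse_export.csv", newline="", encoding="utf-8")  (hypothetical name).
sample = io.StringIO("title,price\nLaptop,999\nPhone,\nTablet,450\n")
rows, empties = summarize_export(sample)
print(rows, empties)
```

A high count of empty cells in one column is usually a hint that the corresponding field was not selected correctly in the workflow.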
How to Set Up Proxies for Octoparse
In the steps above, we did not set up proxies. While that might work if the tool only has to send a few requests, it will not work if many requests are sent. For this reason, we need to use proxies when sending many requests in order to anonymize them.
Below is the method to follow to configure proxies for Octoparse.
Step 1: The first step is to get your hands on high-quality proxies. You can purchase good proxies from Bright Data, Smartproxy, or Soax. There are many other providers out there. Regardless of the provider you choose, what you require are a proxy address (host) and port.
You will also require a username and password for authentication except if the service supports IP authentication. With the proxy address, port, username, and password, you can move to the next step.
Step 2: Create your task as described above. When you are done creating the task, do not click the “Save” or “Run” button yet.
Step 3: Instead, click on the “Settings” icon, and the settings interface will pop up.
Step 4: Scroll down to the Anti-blocking settings and check the “Use Proxies” option.
Step 5: This will open another interface for entering the list of proxies. You can set the interval of IP rotation from the interface. If you bought your proxies from either Bright Data, Smartproxy, or Soax, you can generate a list of proxies and paste them in the “IP Proxies” field.
However, because the proxies are rotating proxies, you can use just the proxy address and port pair generated earlier and the proxy service will take care of the IP rotation on your behalf.
Step 6: Click on the “OK” button, then go back to the top of the page and click on the “Save” button followed by “Run”. Your scraping task will then run through different IP addresses, making it possible for you to send many requests without getting detected and blocked.
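The idea behind a proxy list with IP rotation can be sketched in a few lines of Python. This is only an illustration of the concept, not how Octoparse implements rotation, and the exact format Octoparse expects in its proxy field is not covered here. The sketch uses the common http://user:password@host:port URL form, and the hosts, ports, and credentials are placeholders.

```python
from itertools import cycle
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Return a proxy URL; credentials are left out for IP-authenticated proxies."""
    if username and password:
        # Percent-encode credentials so characters like '@' or ':' stay valid.
        creds = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        creds = ""
    return f"http://{creds}{host}:{port}"


def assign_proxies(urls, proxies):
    """Pair each target URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Placeholder proxies: one with username/password, one IP-authenticated.
proxies = [
    build_proxy_url("proxy1.example.com", 8000, "user", "p@ss"),
    build_proxy_url("proxy2.example.com", 8000),
]
pages = [f"https://example.com/page/{n}" for n in range(1, 5)]
for page, proxy in assign_proxies(pages, proxies):
    print(page, "via", proxy)
```

With two proxies and four pages, each proxy handles two requests, so the target site sees half as many requests per IP address. A rotating proxy service does the same kind of distribution on its side behind a single host and port.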
How to Schedule a Web Scraping Task Using Octoparse
One good feature you will come to like in Octoparse is its support for scheduling scraping tasks. If there is data that you want to collect periodically, that is no reason to ditch this tool. You can create the task and then get it to run on its own.
However, this option is only available to paid users, as scheduled scraping only works in the cloud. Follow the steps below to schedule a scraping task.
Step 1: Make sure you have a paid plan before attempting to schedule a scraping task. Octoparse does have a free plan, but this plan does not give you access to the cloud platform, which is what supports scheduled scraping.
Step 2: Create a scraping task as described above. Click on the “Save” button at the top left corner of the page and then click on the “Run” button. Three run options are presented: run local extraction, cloud extraction, and set a schedule.
Step 3: Click on the “Set a Schedule” button, and the interface for configuring scheduled scraping will show in the main area of the page.
Step 4: The first thing to set is the frequency — once, weekly, monthly, or interval. The interval option gives you more control if the frequency is not once, weekly, or monthly.
Step 5: Next are the days of the week on which you want the task to run. You can choose any days you want.
Step 6: In the “Run at” section, choose the time at which you want the task to run on the days you chose above.
Step 7: When you are done, save the settings, and then you have successfully set up a scraping task that would run periodically. You do not even have to keep your computer ON as scraping would be done from the cloud.
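A weekly schedule made of chosen weekdays plus a “Run at” time boils down to computing the next matching date and time. The Python sketch below shows that calculation as a way to double-check when a schedule should fire. It is an illustration only, not Octoparse's scheduler, and the function name next_runs and the example dates are assumptions.

```python
from datetime import datetime, time, timedelta


def next_runs(start, weekdays, run_at, count):
    """Return the next `count` run times strictly after `start`.

    weekdays follows datetime's convention: Monday is 0, Sunday is 6.
    """
    if not weekdays or not set(weekdays) <= set(range(7)):
        raise ValueError("weekdays must be a non-empty set of values 0-6")
    runs = []
    day = start.date()
    while len(runs) < count:
        candidate = datetime.combine(day, run_at)
        if day.weekday() in weekdays and candidate > start:
            runs.append(candidate)
        day += timedelta(days=1)
    return runs


# Example: every Monday and Thursday at 09:30, starting from a Wednesday.
start = datetime(2024, 1, 3, 12, 0)  # 3 January 2024 was a Wednesday
for run in next_runs(start, {0, 3}, time(9, 30), 3):
    print(run.isoformat())
```

The same function can be used to compare the expected run times with the times at which new data actually shows up in the cloud.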
FAQs About Octoparse
Q. Is Octoparse Free to Use?
Octoparse is a commercial product, so it is meant to generate revenue for its developers. However, it does support free use. If you visit the homepage of the web scraper, you will see that you can use it for free for 14 days.
In addition, the pricing page shows that there is a free plan with a lot of limitations. Fortunately, the free plan is still useful to many people, who will not need to opt for a paid plan. However, the paid plans offer considerably more features.
Q. Can Octoparse Scrape Modern Websites?
The major requirement for scraping is that the process is replicable and predictable; once that is met, the data of interest can be reached. However, if you are dealing with complex data or websites, the process can be messy at times, and you will need to spend some time learning.
Q. Is Proxy Usage a Must for Octoparse?
The Octoparse web scraper does have support for proxies and IP rotation. However, it does not make that compulsory for you. This does not mean you can escape the use of proxies in most cases.
You can only avoid the use of proxies if you scrape data from just a few pages. If you need to send many requests, then proxies are non-negotiable, as websites will block your IP address if they receive too many requests from it.
Octoparse is one of the popular options out there in terms of code-free web scraping and a good number of businesses rely on it to gather data from public sources online. While it is developed to make web scraping easy, that does not mean its usage is easy at first glance.
You still need guidance and time to grasp the full skills required to use the tool at a pro-level. With the guide provided above, you have the base skill required to make use of Octoparse for web scraping.