Scrapy Proxy 101: How to Set Up Proxies for Scrapy

While Scrapy makes it easy to develop scalable web scrapers and crawlers, without proxies they are useless in many situations. Let me show you how to set up proxies for Scrapy and the best proxies to use.

Set Up Proxies for Scrapy

As a developer interested in web scraping, I need you to keep in mind that even though web scraping is ethical and legal, websites do not like their pages and data to be scraped, and they will do everything at their disposal to stop you. Fortunately for us, only a few effective techniques are available to them, with IP tracking the most important one, since IP addresses are the universal means of identification on computer networks such as the Internet. While IP tracking seems effective to them, we, as web scraper and automation developers, are fortunate to have proxies, which make IP tracking and blocking far less effective.

Scrapy is a popular web scraping framework you can use to develop scalable scrapers and crawlers. As a web scraping tool, Scrapy has support for proxies, and you will most likely make use of them in your scraping projects. If you do not know how to set up proxies in Scrapy, keep reading, and I will show you how to get it done.


Scrapy – An Overview

Unlike the likes of Requests and BeautifulSoup, Scrapy is a complete web scraping and crawling framework: you can use it not just for sending HTTP requests but also for parsing HTML documents and carrying out other tasks. In fact, Scrapy alone is like the combination of Requests, BeautifulSoup, and other scraping libraries. One thing you will come to like about this tool is that it is extensible, with the option to add custom functionality. With Scrapy, you can not only build a web scraper or crawler; you can also easily deploy it to the cloud.

This scraping framework was developed by Scrapinghub, a popular data service provider that also develops data extraction tools. Scrapy was first released in 2008. The tool is written in Python and is meant for Python spider development. It is arguably the fastest Python scraping framework, as well as the most popular, and it is quite powerful too. The major problems associated with Scrapy are its learning curve and the fact that it is helpless when faced with JavaScript-rich websites.


Scrapy Proxy Settings

Compared to the combo of Requests and BeautifulSoup, Scrapy has a steeper learning curve. However, there is no doubt that it is more scalable and better suited for complex development. When it comes to setting proxies, it might interest you to know that the procedure is actually quite easy. There are two methods you can follow to set up proxies in Scrapy. These are discussed below.

  • Method 1: Setting Proxies by Passing Them as a Request Parameter

The easiest method of setting proxies in Scrapy is by passing the proxy as a request parameter. This method is perfect if you want to make use of a specific proxy. Scrapy ships with a middleware called HttpProxyMiddleware, which takes the proxy value from the request meta and sets it up properly. Below is sample code showing how to set a proxy in Scrapy via the request's meta parameter.

from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # yield (not return) so a request is scheduled for every start URL
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": "scrape web"},
                      # placeholder address - replace with your own proxy
                      meta={"proxy": "http://154.112.82.262:8050"})
  • Method 2: Creating a Custom Proxy Middleware

For a more modular approach, I advise you to create a custom middleware. A middleware is basically a piece of code that Scrapy runs when processing requests. I will show you how to create a custom proxy middleware and add it to your list of middlewares. Below is a custom middleware template you can use.

from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # placeholder address - replace with your own proxy
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "<proxy_user>", "<proxy_pass>")

After writing the above, you can then enable it and put it before the HttpProxyMiddleware, as shown below.

DOWNLOADER_MIDDLEWARES = {

    'myproject.middlewares.CustomProxyMiddleware': 350,

    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,

}
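If you have a whole list of proxies rather than a single endpoint, the same middleware pattern can rotate through them. Below is a minimal sketch: the `ROTATING_PROXY_LIST` setting name and the proxy addresses are my own placeholder examples, not Scrapy built-ins.

```python
import random

# Placeholder proxy addresses - examples only, not real servers
PROXY_LIST = [
    "http://192.168.1.1:8050",
    "http://192.168.1.2:8050",
    "http://192.168.1.3:8050",
]

class RotatingProxyMiddleware(object):
    """Assigns a random proxy from a list to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list from a custom ROTATING_PROXY_LIST setting,
        # falling back to the module-level default above
        return cls(crawler.settings.getlist("ROTATING_PROXY_LIST") or PROXY_LIST)

    def process_request(self, request, spider):
        # Leave the proxy alone if the spider already set one explicitly
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
```

You would enable this in DOWNLOADER_MIDDLEWARES the same way, with a priority below 400 so it runs before HttpProxyMiddleware.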

How to Verify Scrapy Proxies

If you followed either of the procedures above and filled in the correct proxy details, then your proxies have been set up on the Scrapy end. But are they working? You will need to test them. To do that, send a request to an IP lookup tool such as WhatIsMyIP.com; if it shows your real IP address, then something is wrong with either your setup or the proxies you are trying to use.
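If you prefer to sanity-check a proxy outside of Scrapy first, you can do the same comparison in plain Python. The sketch below uses httpbin.org/ip as the lookup service; the proxy address in the commented usage is a placeholder, not a real endpoint.

```python
import json
import urllib.request

def ip_seen_by_server(proxy=None):
    """Return the IP address that https://httpbin.org/ip reports seeing."""
    handlers = []
    if proxy:
        # Route both http and https traffic through the given proxy
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open("https://httpbin.org/ip", timeout=10) as resp:
        return json.loads(resp.read().decode())["origin"]

# real_ip = ip_seen_by_server()
# proxy_ip = ip_seen_by_server("http://154.112.82.262:8050")  # placeholder
# A working proxy means the two values differ:
# print("Proxy working:", real_ip != proxy_ip)
```

If the two calls return the same address, the proxy is not being applied, which mirrors the WhatIsMyIP.com check described above.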


Rotating Proxies for Scrapy Projects

Scrapinghub has its own proxy service, known as Crawlera, which you can use together with Scrapy. Crawlera is priced per request. If you would rather use a general proxy service, you can make a choice from one of the proxy providers below. They all sell residential proxies, we have tested their proxies, and they can be trusted to sell only working ones. You will want rotating proxies so you can avoid the hurdle of rotating IPs yourself and managing a proxy list that could go bad quickly.


Luminati


  • IP Pool Size: Over 72 million
  • Locations: All countries in the world
  • Concurrency Allowed: Unlimited
  • Bandwidth Allowed: Starts at 40GB
  • Cost: Starts at $500 monthly for 40GB

Luminati is arguably the best proxy provider on the market. They sell residential proxies, mobile proxies, and datacenter proxies. One thing you will come to like about this provider is that they offer a 7-day free trial. They have the largest residential proxy pool, with over 72 million residential IPs from all countries and major cities in each supported country.

They have some of the fastest speeds as far as scraping is concerned, and they have proven to work with all the popular web services. Their proxies are rotating proxies and change IP address after every request or after a defined period of time.



Smartproxy


  • IP Pool Size: Over 40 million
  • Locations: 195 locations across the globe
  • Concurrency Allowed: Unlimited
  • Bandwidth Allowed: Starts at 5GB
  • Cost: Starts at $75 monthly for 5GB

While Luminati proxies are secure, reliable, and fast, the minimum monetary commitment will scare most small-scale scrapers away, as you are required to spend a minimum of $500. For a premium provider with a smaller minimum commitment, you can go for Smartproxy. Smartproxy has a pool of over 40 million residential IPs.

They have support for both high-rotating proxies and session-based proxies that help you maintain a session for up to 10 minutes. Smartproxy covers about 195 countries and 8 major cities around the world.



Shifter

Shifter (formerly Microleaves)

  • IP Pool Size: Over 31 million
  • Locations: 130 countries
  • Concurrency Allowed: Unlimited
  • Cost: Starts at $249.99 monthly for 10 ports

Shifter has support for many countries and major cities around the globe. Across the supported locations, it has a total of over 31 million residential backconnect IPs, making it one of the largest proxy networks on the market. Unlike the providers discussed above, Shifter proxies are priced based on ports rather than bandwidth.

With Shifter, you will enjoy unlimited bandwidth, which makes it perfect for scraping tasks that consume a lot of bandwidth. Shifter backconnect proxies change IP every 5 minutes. You can use their proxies for a good number of scraping tasks, as they are fast, secure, and reliable.


Conclusion

With or without Scrapy, proxies are a must if you intend to scrape websites. Scrapy has support for proxies, and setting them up is incredibly easy. With the methods discussed above, you should be able to set up proxies in Scrapy in no time. I also made recommendations on the best rotating proxies to use with Scrapy if you do not want to use Crawlera.