The world of proxy connections can be quite complex when you’re getting into automating software and connections. It’s easy enough to use a proxy for web browsing, but what about using half a dozen proxies for running something like Scrapebox targeting Google, notorious for detecting bots and stopping them in their tracks?
You have to worry about a lot more than you might realize. Do you know the difference between SOCKS4 and SOCKS5? Do you know which ports are required to forward proxy traffic? Do you know what a residential IP means? Thankfully, you have me to go over it for you.
I’m often asked questions about individual pieces of software as well. “Does application X work with your proxies?” Generally, the answer is going to be yes, but I’d rather educate you about why that is than just have you blindly accept it. So, first, I’m going to cover different important factors you might encounter with proxies and what they mean, then I’m going to cover some of the most popular applications you might use with proxies, and what they require.
Things to Check
As I said, the first thing I’m going to do is go over the common aspects you might find when dealing with proxy connections and using them to do your bidding. Some of this might get a little technical, so if all you’re looking for is application compatibility, feel free to skip to the next section.
HTTP vs SOCKS4 vs SOCKS5
This is the first and possibly most important compatibility issue; the type of connection the proxy can use. SOCKS is the default kind of proxy connection. A proxy server using SOCKS sits in the middle, between a client and a server destination. For example, it would sit between you and Google if you were using something like Scrapebox. SOCKS itself stands for SOCKet Secure.
The practical difference between SOCKS4 and SOCKS5 is that SOCKS5 supports authentication (it also adds support for UDP traffic and IPv6). With a SOCKS4 proxy, there is no way to require a login and password to use it, or to pass authentication information to the destination server. In other words, if you're trying to scrape data on a page you need to log in to access, you need a SOCKS5 proxy server.
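To make the authentication difference concrete, here is a minimal sketch of the bytes a SOCKS5 client actually sends, per RFC 1928 (greeting) and RFC 1929 (username/password subnegotiation). SOCKS4 has no equivalent step, which is exactly why login-protected proxies require SOCKS5. The credentials here are made up for illustration.

```python
# Sketch of the SOCKS5 handshake bytes (RFC 1928 / RFC 1929).
# SOCKS4 has no username/password step at all.

def socks5_greeting(with_auth: bool) -> bytes:
    """Client greeting: version byte, count, then the auth methods offered."""
    # 0x00 = no authentication, 0x02 = username/password
    methods = b"\x00\x02" if with_auth else b"\x00"
    return b"\x05" + len(methods).to_bytes(1, "big") + methods

def socks5_auth_request(username: str, password: str) -> bytes:
    """RFC 1929 subnegotiation: version 0x01, then length-prefixed fields."""
    user = username.encode()
    pw = password.encode()
    return (b"\x01"
            + len(user).to_bytes(1, "big") + user
            + len(pw).to_bytes(1, "big") + pw)

print(socks5_greeting(True).hex())               # 05020002
print(socks5_auth_request("user", "pass").hex()) # 0104757365720470617373
```

In practice your software or proxy library performs this exchange for you; the point is that the protocol has a dedicated slot for credentials, which SOCKS4 simply lacks.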
So what about HTTP? HTTP is more specialized, and thus more limited. You might recognize HTTP as the beginning of the common URL. That’s because it’s the common protocol used for standard web traffic.
SOCKS is a lower-level protocol and has no interpretation of the data; the proxy just passes it from point A through point B to point C unchanged.
An HTTP proxy at point B, however, has the chance to interpret and forward the traffic. This is useful for streamlining some aspects of scraping. For example, if you were scraping Amazon, an HTTP proxy is capable of recognizing and caching common page elements, to minimize what your scraper needs to download from Amazon itself.
That said, HTTP connections are limited to just HTTP communications. If you’re trying to access a server that does not allow HTTP connections, but your software requires you to use HTTP connections, you won’t be able to make the connection in the first place.
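A short sketch of what an HTTP proxy actually sees on the wire helps explain why it can interpret traffic while SOCKS cannot: for plain HTTP the client hands the proxy a full URL (so-called absolute-form), while for encrypted traffic it sends a CONNECT request, after which the proxy becomes a blind byte pipe. The hostnames here are illustrative.

```python
# Sketch of the two request shapes a client sends to an HTTP proxy.

def proxy_get(host: str, path: str = "/") -> str:
    """Absolute-form request: the proxy can read, cache, or rewrite this."""
    return (f"GET http://{host}{path} HTTP/1.1\r\n"
            f"Host: {host}\r\n\r\n")

def proxy_connect(host: str, port: int = 443) -> str:
    """CONNECT request: the proxy just tunnels raw bytes after this."""
    return (f"CONNECT {host}:{port} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n\r\n")

print(proxy_get("example.com").splitlines()[0])      # GET http://example.com/ HTTP/1.1
print(proxy_connect("example.com").splitlines()[0])  # CONNECT example.com:443 HTTP/1.1
```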
Ports Necessary for Communications
Ports are another part of internet communication that is both foundational and utterly ignored by most people unless they have a need to mess with them. They essentially act like radio frequencies or TV channels. Another analogy is an apartment building: the building has one street address, which is the IP address, while the port specifies the individual apartment.
Different ports are often used to differentiate the service being used to make the connection.
- Port 21 is typically used for FTP connections
- Port 22 is used for SSH connections
- Port 53 is used for DNS
- Port 80 is the standard port for HTTP communications, which matters for proxies as well
If your proxy only supports HTTP, it is effectively limited to web traffic; port 80 is the standard destination, though HTTP proxies themselves often listen on ports like 8080 or 3128. If the proxy uses SOCKS, it can typically relay any port, so you will have to tailor your port to the destination’s requirements.
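The port conventions above can be sketched as a small lookup: given a proxy URL, use the explicit port if one is written, otherwise fall back to the scheme's conventional default. The mapping below reflects common conventions, not a guarantee from any particular provider.

```python
# Sketch: choosing a port from a proxy URL's scheme.
from urllib.parse import urlparse

DEFAULT_PORTS = {
    "ftp": 21,      # file transfer
    "ssh": 22,      # secure shell
    "http": 80,     # standard web traffic
    "https": 443,   # TLS web traffic
    "socks5": 1080, # conventional SOCKS port
}

def proxy_port(proxy_url: str) -> int:
    """Explicit port if given, otherwise the scheme's conventional default."""
    parsed = urlparse(proxy_url)
    if parsed.port is not None:
        return parsed.port
    return DEFAULT_PORTS[parsed.scheme]

print(proxy_port("http://proxy.example.com"))         # 80
print(proxy_port("socks5://proxy.example.com:9050"))  # 9050
```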
Secure Transmission of Data
This is another concern you might have about a proxy server, but is unrelated to the SOCKS and Port factors above. It’s all about how secure the connection is through the proxy.
Many public proxies are not secure at all; you have no idea who operates the server, and some operators inject ads into the traffic or route it through an overlay. You never know what kind of software might be running on that server to snoop on the connections being made and the data being sent.
By contrast, private proxies tend to have more security, because the proxy servers themselves are located in more secure locations.
They are also designed for more advanced users, who would take umbrage at having their data snooped. You might also need a secure connection to access some websites, particularly when authenticating through SOCKS5. Never enter sensitive login information through an unsecured proxy.
Anonymous or Not
The issue of anonymity is one that is central to the idea of proxy connections. Many people use proxies for simple web browsing, because they don’t want their home IP address associated with their browsing habits. They might just not want to be tracked by large entities like Facebook, Google, or the big ad networks.
Alternatively, they might be doing something borderline illegal, or actually illegal, and want to hide from law enforcement or the NSA. This isn’t always possible, of course. Just look at the Silk Road, and all the users who thought they were safe before the FBI shut it down and arrested the worst offenders. They fell victim to a false sense of security: the idea that hiding behind a proxy makes you untraceable.
There are different levels of anonymity with proxies. Transparent proxies forward pretty much all the information you normally send, and don’t actually provide you any anonymity at all. They present their own IP address to the destination server, but effectively add “by the way, here’s the actual originating IP, in case that matters.” It won’t, unless someone wants to track you, in which case they can find your real IP right there.
Higher levels of anonymity don’t forward as much information. The next step up, usually called anonymous proxies, will not reveal your IP address, but WILL reveal that they are proxy connections. The destination server will know someone is connecting through a proxy, but won’t know the originating IP address.
The highest level of anonymity comes from elite (high-anonymity) proxies that emulate real connections. These don’t even reveal that they are proxies, though sometimes user behavior will give them away.
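The three levels above can be sketched as a simple classifier. So-called "proxy judge" pages echo back the request headers they received; given that echo as a dict, the telltale headers (`Via`, `X-Forwarded-For`, and similar) and the presence of your real IP determine the level. The header names are the common conventions; any given proxy may behave differently.

```python
# Sketch: classifying proxy anonymity from echoed request headers.

def anonymity_level(headers: dict, real_ip: str) -> str:
    """Transparent: leaks your IP. Anonymous: admits being a proxy.
    Elite: looks like a direct connection."""
    values = " ".join(headers.values())
    if real_ip in values:
        return "transparent"   # your real IP reached the server
    proxy_headers = {"Via", "X-Forwarded-For", "Forwarded", "Proxy-Connection"}
    if proxy_headers & set(headers):
        return "anonymous"     # IP hidden, but proxy use is visible
    return "elite"             # nothing reveals the proxy

print(anonymity_level(
    {"Via": "1.1 proxy", "X-Forwarded-For": "203.0.113.7"},
    "203.0.113.7"))  # transparent
```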
Capable of Passing Search Engine Blocks
This is a factor people refer to as a proxy being “Google safe,” and all it means is that the proxy’s IP address is not known to be a proxy server and has not been abused in the past. Google has aggressive anti-proxy and anti-bot measures in place, and will throttle or block your connections if it detects abuse and botting.
A proxy being Google safe is not necessarily a factor of the proxy itself; it is often more a matter of user behaviors. If you’re making a lot of similar repeated requests from one IP address, it looks like a bot. If you’re varying the IP address for those requests, and varying the timing on them, it looks more like organic users. This is why you should use proxy lists rather than a singular proxy, and why you should set delays and asynchronous connections.
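The two habits described above, rotating through a proxy list rather than hammering one IP and randomizing the delay between requests so the timing doesn't look mechanical, can be sketched as a small scheduler. The proxy addresses and delay bounds here are made up for illustration.

```python
# Sketch: rotate proxies and jitter delays so requests look less bot-like.
import itertools
import random

PROXIES = ["10.0.0.1:1080", "10.0.0.2:1080", "10.0.0.3:1080"]

def rotating_schedule(n_requests, proxies, min_delay=2.0, max_delay=8.0):
    """Yield (proxy, delay_seconds) pairs for n_requests requests."""
    pool = itertools.cycle(proxies)  # round-robin through the list
    for _ in range(n_requests):
        yield next(pool), random.uniform(min_delay, max_delay)

for proxy, delay in rotating_schedule(5, PROXIES):
    # A real scraper would time.sleep(delay), then send the request via proxy.
    print(f"{proxy} after {delay:.1f}s")
```

Tools like Scrapebox implement this internally; the takeaway is that both the IP and the timing should vary from request to request.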
Where the IP Comes From
This final factor is just a matter of where the proxy server is coming from. There are two main categories for this.
The first category is geographic. If you’re trying to log into a US-centric website, it’s probably not a good idea to use a proxy server located in Ukraine. A lot of sites commonly targeted by scrapers will block foreign IPs, or re-route them to foreign versions of the site; neither is valuable to your needs.
The other category is usage. Does the IP come from a data center, or does it come from a residential neighborhood? This is possibly the most important factor on this list. Many large entities, like Google and Amazon, will detect when a connection is being made from a data center. It’s one of the ways they can detect proxy and scraper abuse. It’s always better to be coming in from a residential IP location, because it’s more like their typical user behavior.
Apps and Their Compatibility
There are a number of common apps or pieces of software you might want to use with proxies. Most of them scrape data automatically in some form, though others submit data in bulk. Sites generally don’t like robots taking these kinds of actions, because that’s how spam and fake accounts come about. I’m not here to judge you on your usage, and I take no responsibility for how you choose to use proxies; I do not necessarily support or condone black hat usage of the following applications. All I’m doing is reviewing common programs and telling you what they require.
Scrapebox
Possibly one of the most powerful tools usable in both black and white hat operations, Scrapebox is an incredibly robust data harvester. It’s used equally by black hat SEOs and top Fortune 500 companies. Multithreaded operation supports numerous connections, and it is Google-safe as long as you’re using it properly. It is, of course, possible to be banned depending on your usage. That’s why you need numerous proxies, asynchronous and varied requests, and delays on submissions. Use it with caution.
- Supports both HTTP and SOCKS connections.
- Supports both private and public proxies, though private proxies are preferred.
- Highly recommended that you use a large, rotating list of proxies rather than a short, static list.
This is another link building SEO application that primarily focuses on web forums with some residual value. It also targets blog comments, journal guestbooks, link directories, social networks, social bookmarking sites, and more. It includes captcha bypassing for a number of common systems, including the textual Q&A systems. To avoid spam labels, it tries to customize posts according to the theme of the forum or board being targeted.
- Supports both HTTP and SOCKS connections.
- Prefers private proxies to avoid trying to use previously banned IP addresses.
SEnuke is an older program designed for SEO, which served as the base for SEnukeX and, more recently, SEnuke TNG, the current version. TNG was created from the ground up to include more features, including a basic tutorial, a process diagram, and scheduling weeks in advance. It strives to stay on the good side of Google by appearing as natural as possible. The application has a 14-day trial and a 30-day money-back guarantee.
- Requires HTTP connections exclusively.
- Prefers private proxies to avoid common issues with public proxy servers.
Tweet Attacks Pro 4, the current version of Tweet Attacks, is a piece of software made to manage as many as several thousand Twitter accounts at a time. It allows automatic follows, unfollows, return follows, tweets, retweets, replies, likes, deletes, and just about any other action you could want to take through Twitter. It also allows individual customization of those accounts, to eliminate the “egg” problem when running networks of simulated accounts. Costs vary depending on the tier of the program you prefer.
- Requires HTTP connections exclusively, due to Twitter’s authentication requirements.
- Supports both private and public proxies, though private proxies are preferred to avoid detection.
- Recommended that you use numerous proxies to manage your accounts, though you don’t need to go so far as to have a dedicated proxy for every account.
This is a general category covering any number of different Ticketmaster ticket-buying bots. There is a wide range of them, including one named TicketMaster, the TicketMaster Spinner, and TicketBots. All of them share the same requirements, because they access the same site with the same goal: buying numerous tickets to shows and re-selling them for a profit. Be aware that the legality of automated ticket buying varies; in the United States, the federal BOTS Act and various state laws restrict it, so check the rules that apply to you.
- Requires HTTP connections to the Ticketmaster website for authentication and appearance purposes.
- Residential IP addresses preferred, as Ticketmaster is prone to revoking sales made to data center IPs and other non-native IPs that signal a bot.
Twitter Account Creation
For using bots like the Twitter manager above, you need to mass-create Twitter accounts. There are a number of different bots that allow this, such as Twitter Mass Account Maker or Twitter Account Creator Bot. Like Ticketmaster bots, these all have similar requirements.
- Requires HTTP connection for authenticity and login authentication to Twitter’s servers.
- Prefers residential IP addresses, usually private rather than public, though the occasional data center IP is not unexpected, since agencies and corporations legitimately manage Twitter accounts from data centers.
Facebook Account Creation
This is the same in many ways as the Twitter bots listed above.
Some common Facebook account bots include Facebook Account Creator and FBDevil.
- Requires HTTP connection for authenticity and login authentication to Facebook’s servers.
- Prefers residential IP addresses, and typically prefers private over public addresses.
Email Account Creation
Email accounts are capable of being created in bulk much the same way as social profiles, though there are as many different bots as there are email providers. Every provider is different and every bot is different, so make sure the requirements are met before you buy or use a proxy list.
Typically the requirements will be the same as the social requirements above: HTTP connections and residential IPs. Some email systems are okay with other connections or with data center IPs, though.