Using Proxies to Scrape Whois Domain Data

Proxies are used for an awful lot of things. Illegal, legal, grey areas — you name it, proxies will be there. They’re a constant middle man to the portals of the web and a bouncer at your front door. I enjoy thinking of them in this animated way; it gives a relatively benign entity some character and makes it easier to imagine the hard work they do (or are programmed to do).

One of the many uses of proxies is the scraping of data. Proxies are effectively used as in-betweeners and workhorses so that you, the scraper, don’t get in trouble or have to stay up all night, every night in order to get your data. The data in question can take any number of forms. Testing proxy ports and speed is one type of data (and one that uses proxies to find proxies), while amassing domain names is another type of data.

The data I’ll address today is the Whois domain data of a website.


What is Whois Domain Data

You might need a refresher on what exactly Whois Domain data is; don’t worry, it’s not very complicated. Every single website in the world is registered and paid for by somebody. These websites are more officially known as “domains”, and are called by specific “domain names”. The registration process requires individuals or companies (whoever owns the website) to list contact information for their domain.

Often the contact information is a name, an email address, a phone number, and a physical address. It’s a bit like signing up for a credit card, except there’s no credit check.

The Whois Domain Data is the technical term for all of that information. In your quest for building a website and purchasing a domain, I’m sure you were asked at some point if you’d like to pay for Domain Privacy Protection. This is often offered by your web host or domain host and will make sure the Whois data (name, email address, phone number, and physical address) is not associated with you at all. Purchasing a Privacy Protection plan should cost around $1 per month, and is a good idea if you don’t want your information out in the public.


Finding Whois Domain Data

Having explained how to hide your information, know that the rest of this article will be dedicated to actually finding the people who haven’t put Privacy Protection on their domains. I will explain how to find Whois domain data in a simple way, and then how to do it with proxies.

ICANN

The Internet Corporation for Assigned Names and Numbers (ICANN) is the nonprofit organization that coordinates the internet’s domain names and IP address numbering. It is impartial, separate from any government, and holds a very large amount of data and power today.

ICANN is also the first real source for a Whois lookup. Any individual can go on the website, type in a domain, and find the information associated with that domain. If the information is private, it will either list nothing, or will list a generic corporation that is hosting that information. If it’s public, you’ll probably find a name and address of a very real person.

GoDaddy has a version of this Whois lookup as well. While both sites (and many others) are great for looking at a single point of data, they aren’t much help if you want to pull reams of data.
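Under the hood, those lookup pages are doing something quite simple. Here is a minimal sketch of a single Whois query in Python, assuming the standard Whois protocol (a plain TCP connection on port 43, as described in RFC 3912). The function names and the default server are my own choices for illustration; whois.verisign-grs.com is the registry server for .com and .net, and other domain types use other servers.

```python
import socket

WHOIS_PORT = 43  # standard Whois port (RFC 3912)


def build_query(domain: str) -> bytes:
    """A Whois request is just the domain name followed by CRLF."""
    return domain.strip().lower().encode("idna") + b"\r\n"


def whois_lookup(domain: str,
                 server: str = "whois.verisign-grs.com",  # assumption: a .com/.net lookup
                 timeout: float = 10.0) -> str:
    """Send one query and read the reply until the server closes the connection."""
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as conn:
        conn.sendall(build_query(domain))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # an empty read means the server is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling whois_lookup("example.com") returns the raw text record. That is fine for one domain, but it is exactly the one-at-a-time approach that doesn’t scale.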


Scraping for Whois Data

Scraping is the term used when pulling large (like, really huge) amounts of data from the internet. You can scrape all manner of data, but in this case you’d like to scrape Whois Data. ICANN won’t help you, nor will GoDaddy.

Before I get into the applications that will allow you to scrape in bulk, it would be helpful to talk about the purpose of such scraping. There are a few use cases for scraping Whois domain data:


1. Cold-calling new domains with business services.

This is probably the most common reason to scrape Whois data. In fact, I just registered a new URL and have yet to protect its public information. I woke up to a dozen emails, all of which stated that my website could grow in leaps and bounds if I used the development services offered by this person, or this corporation. I bet you’ve received some of these. Or, maybe more likely, you are one of the people sending these emails.

When your information is public, it is simply public. People can and will find it, then try to contact you. Some of these cold calls (or icy cold emails) might result in a new business venture, but are often viewed as spam.

One thing to note at this juncture: If somebody replies that they are “opting out” of being contacted in this way, you must comply. The CAN-SPAM Act of 2003 makes it very clear that anyone who receives commercial emails offering services can opt out of them. You can read more about the Act for the full details, but know that a single instance (a.k.a. a single email) that violates the Act could cost you up to $16,000.


2. Finding companies in your niche market.

The last example focused on offering your services to every business on the public domain block. This one is more about finesse. Let’s say you’ve started a small niche blog or company with a specific product. You’re beginning to connect with brands and influencers, maybe write for their websites. Doing the research for this can take a lot of time.

A Whois domain scrape, when used correctly, can give you the email addresses or phone numbers of people who own domains related to your business. This is more of a collaborative contact than a sales pitch.


3. Assembling a database of Whois information.

This use is frowned upon, so I won’t go into too much detail. Assembling large lists of “free” information that includes the personal details of domain registrants is technically legal because the information is just out there. Selling those lists, however, is a legal and moral grey area, though people do it.


Proxies and Scraping

The applications below work best with proxies, as does any retrieval of large reams of data.

Proxies work best for two reasons:

  • They provide you anonymity, so any questionable activities can’t be traced back to you.
  • They harness the incredible power and speed of bots.

Those two factors are at the core of why anyone uses proxies for anything, but specifically why they’re great for scraping data. In general, scraping large amounts of data is not looked upon fondly by the internet community. As such, proxies give you privacy and anonymity to act as you will. Be cautious with this, and don’t pursue super illegal activities. You’ve probably heard that before.

The second factor is speed, and that’s where these applications come in. There are countless applications out there that increase the ease with which you can scrape data as a whole, and Whois data specifically. Plenty of coders can write individual scripts to do this effectively, and if you want something like that you should visit Black Hat World to find that person.

However, there are a couple larger, more legitimate software applications that have Whois scraping built in.


ScrapeBox

ScrapeBox is software I will mention again and again. It is sort of the ultimate reigning king of the scraping world. This is the case because it’s a one-time purchase fee, there are dozens of free plugins, there are lots of tutorials for how to use ScrapeBox, and the app is coded to work well for your needs.

The Whois scraping part of ScrapeBox is actually a plugin. Once you purchase the software (a necessary step you’re going to have to take in most cases, unless you program the script yourself), open ScrapeBox and look in the AddOns drop-down menu. In there you will find the “ScrapeBox Whois Scraper.” Click Add, wait for it to install, then head back to the AddOns drop-down and you’ll see the Whois Scraper there.

Click that and a new window will pop up. This is the portal for finding Whois domain data. You can load domains from the ScrapeBox harvester (which is the main thing ScrapeBox is used for), or you can load them from a file.

It’s important to note here that you will need a list of domains to scrape. There are countless ways to do this that I won’t get into. One of the simplest is using another AddOn in ScrapeBox: the Domain Availability Checker.

This allows you to look up domains based on keywords, which are either taken or not taken, and you can use this to generate a list of already registered domains in your niche whose Whois data you want to scrape. There are other ways to do this as well.

Let’s assume you have a nice long list of domains you want to get the Whois Data for. Paste in the URLs and click “Start.” The application will run and, in a short time (depending on how many domains you entered), you’ll have a list of Whois Data for every single domain.
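If your list is a messy mix of full URLs and bare domains, it helps to clean it up before pasting it in. The following is a rough sketch of one way to do that with Python’s standard library; the function names are made up for this example, and it only strips a leading "www." rather than working out the true registered domain.

```python
from urllib.parse import urlsplit


def to_domain(entry: str) -> str:
    """Reduce a URL or bare hostname to a lowercase host without 'www.'."""
    entry = entry.strip()
    if "://" not in entry:
        entry = "//" + entry  # lets urlsplit treat a bare host as a network location
    host = urlsplit(entry).hostname or ""
    return host[4:] if host.startswith("www.") else host


def clean_domain_list(entries):
    """Normalize every entry, drop blanks and duplicates, keep first-seen order."""
    seen = set()
    result = []
    for entry in entries:
        domain = to_domain(entry)
        if domain and domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
```

Deduplicating first also means you spend fewer lookups (and fewer proxy requests) on the same domain.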

You can export this information in a number of ways. You can include all the information provided, just the names, just the emails, just the phone numbers, etc., and in doing so assemble a list of contact information from every domain you scraped. While this will work for a lot of domains, many will be protected. There’s no real way to crack the privacy of these, as people pay for services that are meant to keep them private.
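If you are scripting this yourself rather than using ScrapeBox, the export step could look something like the sketch below. It assumes the raw Whois text uses "Key: Value" lines with labels such as "Registrant Name" and "Registrant Phone"; real records vary a lot between registries, so treat those labels, the column names, and the function names as assumptions for illustration.

```python
import csv
import io
import re

# Loose email pattern; good enough for pulling addresses out of Whois text.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_record(raw: str) -> dict:
    """Collect 'Key: Value' lines, keeping the first value seen for each key."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip().lower(), value.strip())
    return fields


def contact_row(domain: str, raw: str) -> dict:
    """Build one export row; the field labels here are assumed, not guaranteed."""
    fields = parse_record(raw)
    return {
        "domain": domain,
        "name": fields.get("registrant name", ""),
        "phone": fields.get("registrant phone", ""),
        "emails": ";".join(sorted(set(EMAIL_RE.findall(raw)))),
    }


def to_csv(rows, columns=("domain", "name", "phone", "emails")) -> str:
    """Write only the requested columns, e.g. just the emails."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing columns=("domain", "emails") to to_csv gives you an email-only list. A privacy-protected domain will simply come back with blank fields or the privacy service’s generic details.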

ScrapeBox will cost you $97, but you can usually find a discount online.


Atomic Whois Explorer

Atomic Whois Explorer, made by AtomPark Software, is another Whois scraping tool. This one is not rolled into the ScrapeBox suite, so it may work better for people who want a standalone scraper or want to pay less. That said, Atomic Whois Explorer has many of the same functions as ScrapeBox. You can run a multithreaded search, filter your results, and download the information in a number of file types.

The Atomic Whois Explorer works with all kinds of domain names too — not just .com or .org. This is useful because new domain types are popping up all the time, and if you really want to get the full scope of Whois data, you’ll need to scrape a lot more than the common ones.

ScrapeBox has this capacity too, and its AddOn actually lets you add your own domain types, if any are missing.
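To give a feel for why domain types matter: each one is served by its own Whois server. The sketch below shows one way a script might pick the right server. It assumes you have already queried whois.iana.org for the domain’s ending and that the reply contains a "refer:" or "whois:" line naming the registry’s server. The override table is a hypothetical stand-in for the "add your own domain types" idea, and none of this reflects how either application works internally.

```python
# Hypothetical user-supplied overrides, keyed by domain ending (without the dot).
CUSTOM_SERVERS = {}


def tld_of(domain: str) -> str:
    """Return the last label of a domain, e.g. 'com' for 'example.com'."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def parse_referral(iana_response: str):
    """Find the registry's Whois server in an IANA reply, or None if absent."""
    for line in iana_response.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("refer", "whois") and value.strip():
            return value.strip()
    return None


def server_for(domain: str, iana_response: str, overrides=CUSTOM_SERVERS):
    """Prefer a manual override for the domain type, else use the IANA referral."""
    return overrides.get(tld_of(domain)) or parse_referral(iana_response)
```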

The Atomic Whois Explorer is $49.85, and you can trial the software for free.


Proxy Purpose and Type

You can run a Whois scrape on either of these applications with your regular IP address. You don’t need proxies to do this in small doses, just like you don’t need a proxy to search a single site’s Whois data on ICANN.

However, there are limits. If you use your single IP address your searches can be tracked back to you. You most likely don’t want that. You can also piss off your ISP quickly, because it’s clear what kind of data you’re bringing in through their services.

Proxies come in handy here (as they always do), to increase capacity and anonymity. Buying batches of proxies and plugging them into ScrapeBox will take the Whois Data scrape from 5 results in 20 seconds to 10,000 results in under a minute. It’s really not comparable; if you’re going for high volume Whois data scraping, you will need proxies.

The number of proxies is up to you, but the fewer you get, the more conservative your settings in the applications should be. Don’t ping too often, make sure to rotate and rest proxies, and open only a few threads. This is true for all proxy use, so apply this methodology to everything you do.
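For anyone writing their own script, "rotate and rest" can be sketched roughly as follows. This is an illustrative round-robin rotator, not what ScrapeBox does internally; the class name, the rest_seconds parameter, and the injectable clock are all my own assumptions.

```python
import itertools
import time


class ProxyRotator:
    """Hand out proxies in turn, skipping any that were used too recently."""

    def __init__(self, proxies, rest_seconds=5.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._count = len(proxies)
        self._rest = rest_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_used = {}

    def next_proxy(self):
        """Return the next rested proxy, or None if every proxy is still resting."""
        now = self._clock()
        for _ in range(self._count):
            proxy = next(self._cycle)
            last = self._last_used.get(proxy)
            if last is None or now - last >= self._rest:
                self._last_used[proxy] = now
                return proxy
        return None
```

When next_proxy() returns None, the caller should wait before trying again. With a small batch of proxies, a longer rest_seconds and only a few worker threads keep each proxy from hammering the same server.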

One unique aspect of scraping Whois domain data is that it only works with SOCKS proxies. Whois lookups are plain TCP connections on port 43 rather than HTTP requests, so they need a proxy that can relay arbitrary TCP traffic. If you’re not sure if the proxies you have (or are about to buy) are SOCKS, ask the provider. Plugging regular HTTP proxies into either ScrapeBox or Atomic Whois Explorer will not work.
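To make the SOCKS requirement a little more concrete, here is a sketch of the messages a client sends to a SOCKS5 proxy when asking it to open a connection to a Whois server. It follows the SOCKS5 message layout from RFC 1928 and assumes a proxy that needs no username or password; the function names are mine, and a real script would more likely lean on an existing SOCKS library than build these bytes by hand.

```python
import struct


def socks5_greeting() -> bytes:
    """VER=5, one auth method offered, METHOD=0 (no authentication)."""
    return b"\x05\x01\x00"


def socks5_connect_request(host: str, port: int = 43) -> bytes:
    """Ask the proxy to open a TCP connection to host:port on our behalf."""
    name = host.encode("idna")
    if not 0 < len(name) <= 255:
        raise ValueError("hostname must be 1-255 bytes")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), then length, name, port
    return b"\x05\x01\x00\x03" + bytes([len(name)]) + name + struct.pack(">H", port)


def socks5_reply_ok(reply: bytes) -> bool:
    """The second byte of the proxy's reply is 0 when the connection succeeded."""
    return len(reply) >= 2 and reply[0] == 5 and reply[1] == 0
```

Once the proxy answers with a success reply, everything sent down that socket goes straight to port 43 of the Whois server, which is the kind of raw relay a Whois lookup needs.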


In Summary

You should now know nearly everything you need to scrape Whois data successfully. Think hard about the reason you’re doing it, and try not to be a bulk email spammer. That’s no fun; you don’t want those emails, so dishing them out isn’t very kind.

That said, using proxies to effectively get contacts in your niche is perfectly legitimate, and cuts down on all those late-night hours when you would scour the internet. Assemble your email lists efficiently and get your work done fast.