
How to Verify Search Engine Bots Using IP Ranges

Frank Evans

If you run a website, it’s important to know which requests come from real search engine crawlers (like Googlebot) and which come from impostors pretending to be them. Accurately identifying bots lets you understand the crawling patterns of the search engines and bots you care about, without being misled by fakes.

One traditional method for verifying bots is called a reverse DNS lookup, but it has some downsides. A more reliable approach is to check the bot’s IP address against official lists provided by search engines.

Why Reverse DNS Lookups Aren’t Enough

In the past, website administrators relied on reverse DNS lookups to verify bot identities. The process involved:

  1. Checking the IP address of the bot.
  2. Running a reverse DNS lookup to get the associated hostname.
  3. Confirming if the hostname matches official search engine patterns.
  4. Running a forward DNS lookup on the hostname to confirm it resolves back to the original IP address.
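The four steps above can be sketched in Python with the standard library. This is a hedged example, not production code: the googlebot.com/google.com suffixes follow Google's documented hostname pattern, other crawlers use different suffixes, and real code should also handle DNS timeouts.

```python
import socket

def verify_bot_via_dns(ip, valid_suffixes=(".googlebot.com", ".google.com")):
    """Verify a crawler IP via reverse then forward DNS (sketch)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # step 2: reverse lookup
    except OSError:
        return False                                     # no PTR record configured
    if not hostname.endswith(valid_suffixes):            # step 3: pattern check
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # step 4: forward lookup
    except OSError:
        return False
    return ip in forward_ips                             # must resolve back to the same IP
```

Note that the function returns False for any IP without a PTR record, which is exactly the first weakness discussed below.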

While this works in some cases, it has a few problems:

  • Some legitimate bots don’t have properly configured DNS records, so a reverse DNS lookup isn’t possible.
  • Reverse DNS lookups are slow at scale because they require multiple network requests per IP address.

A More Reliable Approach: IP Range Validation

To address these issues, major tech companies now provide official lists of their bots' IP address ranges. By checking incoming bot traffic's IP addresses against these lists, you can quickly determine if a request is from a legitimate crawler.

How IP Validation Works

Many search engines and services publish their IP ranges in JSON or other formats. For example, Google publishes the IP ranges used by Googlebot as a JSON file, and Bing does the same for Bingbot.

When a bot visits your site, its IP address can be checked against the ranges in these official lists. Unlike a user-agent string, the source IP address of an established connection can’t practically be spoofed, so if there’s a match you can be confident the request came from a legitimate bot.
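As a concrete illustration, Google's list for Googlebot is a JSON file whose prefixes array holds objects with an ipv4Prefix or ipv6Prefix key. A minimal sketch of parsing that shape (the sample values here are illustrative, not the live list):

```python
import json
import ipaddress

# A sample in the shape of Google's googlebot.json (values are illustrative).
sample = json.loads("""
{"prefixes": [{"ipv4Prefix": "66.249.64.0/27"},
              {"ipv6Prefix": "2001:4860:4801:10::/64"}]}
""")

def extract_networks(doc):
    # Each prefix object carries either an ipv4Prefix or an ipv6Prefix key.
    return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
            for p in doc["prefixes"]]

networks = extract_networks(sample)
print(any(ipaddress.ip_address("66.249.64.5") in net for net in networks))  # True
```

In practice you would fetch the JSON file over HTTPS and cache the parsed networks rather than embedding a sample.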

Benefits of Using IP Ranges

The main benefit of using IP ranges is speed: checking an IP against a local list is far faster than a reverse DNS lookup, which requires multiple network requests.

Disadvantages of Using IP Ranges

The provided ranges can cover a very large set of IP addresses, which makes it difficult or impossible to check a specific IP address manually, particularly with IPv6 ranges, which are hard to interpret by eye. You usually need dedicated functions to check whether an IP falls within a given range.
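Python's standard ipaddress module provides exactly this kind of dedicated containment check, for IPv4 and IPv6 alike. A small sketch (the CIDR values are illustrative):

```python
import ipaddress

def ip_in_ranges(ip, cidr_ranges):
    # Works for both IPv4 and IPv6; mixed-version comparisons simply don't match.
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)

ranges = ["66.249.64.0/27", "2001:4860:4801:10::/64"]    # illustrative values
print(ip_in_ranges("66.249.64.5", ranges))               # True
print(ip_in_ranges("2001:4860:4801:10::2a", ranges))     # True
print(ip_in_ranges("203.0.113.9", ranges))               # False
```

For checking large volumes of traffic, it is worth parsing the networks once up front instead of re-parsing the CIDR strings on every call.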

The IP ranges are not published in a consistent way, which makes it hard to aggregate the ranges for different crawlers. The most common method is a JSON file listing the ranges, but some bots, like Common Crawl, simply include the ranges in the middle of a documentation page. Facebook relies on a completely different system, Autonomous System Numbers (ASNs): you use its ASN to look up the IP ranges with a whois request.

Because of these different approaches, you need multiple methods to pull together a complete list of IP ranges, and sometimes this involves manually copying from a web page, which makes full automation difficult.

The provided ranges can also change at any time, with no notification. This means you need to check the sources frequently to pick up any changes and incorporate them into your combined set of IP ranges.
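One common way to handle this is to cache the downloaded ranges and re-fetch them once they pass a maximum age. A hedged sketch (the RangeCache name and 24-hour default are assumptions for illustration, not part of any published API):

```python
import time

class RangeCache:
    """Re-runs the fetcher when the cached ranges are older than max_age seconds."""

    def __init__(self, fetcher, max_age=86400):
        self.fetcher = fetcher      # callable returning the current list of CIDR strings
        self.max_age = max_age
        self._ranges = None
        self._fetched_at = 0.0

    def ranges(self):
        if self._ranges is None or time.time() - self._fetched_at > self.max_age:
            self._ranges = self.fetcher()   # hypothetical fetcher hits the source list
            self._fetched_at = time.time()
        return self._ranges
```

A scheduled job (cron or similar) that rebuilds the combined list achieves the same thing for a multi-server setup.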

Our API

We created a bot IP validation API which does all the hard work of maintaining an up-to-date list of IP ranges and quickly checking a large number of IP addresses.

If you'd rather not write a script against the API, you can pull the results into a Google Sheet with a simple formula: put a list of IP addresses in one column, then add a formula in the next column to fetch the result from our API.

=IMPORTDATA(CONCATENATE("https://bot-ip-list-validation.overstand.dev/api/v1/?attribute=name&ip=", A2))

Conclusion

Validating bots against official IP ranges is a fast, reliable, and secure way to distinguish real search engine crawlers from fakes, and far more efficient than traditional DNS lookups. But collecting the ranges from different sources and keeping them up to date adds a lot of complexity.

Using our IP List Validation API can make bot validation a much simpler process.

Tags

bot validation