
Validating Bot and Crawler IP Addresses Using Reverse DNS Lookup

Frank Evans

In the world of web traffic, not all visitors are human. A significant portion comes from bots and crawlers - some legitimate (like search engines indexing your content) and others potentially malicious, such as scrapers. Being able to distinguish between these visitors is crucial for analytics accuracy, resource optimization, and protecting your content.

Bots identify themselves in HTTP requests through the User-Agent header. However, this header can be easily spoofed: a malicious crawler may set its User-Agent to the one used by Googlebot to bypass restrictions and go unnoticed.
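
For example, this is one of Googlebot's published User-Agent strings; nothing stops any client from sending the same header verbatim, which is why it proves nothing on its own:

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)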

One effective method for validating whether an IP address truly belongs to the bot it claims to be is a reverse DNS lookup. This technique helps verify whether a crawler is legitimate or an impostor, and unlike the User-Agent header, it cannot be spoofed by the client.

Why Reverse DNS Matters for Bot Validation

By performing a reverse DNS lookup on an IP address to get a hostname, then doing a forward DNS lookup on that hostname and checking that the returned IP matches the original, you can verify that the IP truly belongs to the organization it claims to represent. Unlike user-agents, this cannot be faked, because the DNS records involved are controlled by the IP's owner.
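
Here is a minimal Python sketch of this forward-confirmed reverse DNS check; the function name and suffix parameter are illustrative, not part of any standard API:

import socket

def is_verified_crawler(ip, allowed_suffixes):
    """Forward-confirmed reverse DNS: reverse-resolve the IP, check the
    hostname suffix, then confirm a forward lookup maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except socket.herror:
        return False  # no reverse DNS entry for this IP
    if not hostname.endswith(allowed_suffixes):
        return False  # hostname is not in the claimed organization's domain
    try:
        # Forward lookup: collect every address the hostname resolves to
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return ip in forward_ips  # must map back to the original IP

# Example: validate an IP that claims to be Googlebot
print(is_verified_crawler("66.249.66.1", (".googlebot.com", ".google.com")))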

Understanding Reverse DNS Lookup

In standard DNS (Domain Name System), a domain name is translated into an IP address. Reverse DNS does the opposite - it takes an IP address and attempts to find the associated domain name.

For example, if a visitor claims to be Googlebot with IP address 66.249.66.1, a reverse DNS lookup would return something like "crawl-66-249-66-1.googlebot.com".
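
You can reproduce this lookup in a couple of lines of Python; the exact hostname returned will vary with the IP being checked:

import socket

# Reverse lookup on an IP that claims to be Googlebot
hostname, _, _ = socket.gethostbyaddr("66.249.66.1")
print(hostname)  # e.g. crawl-66-249-66-1.googlebot.com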

Hostname Patterns for Major Crawlers

Different search engines and social media platforms use distinct patterns for their crawler hostnames (gathered into a small lookup table in the sketch after this list):

  • Google: Googlebot IPs typically resolve to hostnames ending with .googlebot.com or .google.com
  • Bing: Microsoft's crawler IPs resolve to .search.msn.com
  • Baidu: The Chinese search giant uses .baidu.com or .baidu.jp
  • Facebook: Social media crawlers might use .fbsv.net or .facebook.com
  • Yandex: The Russian search engine uses .yandex.net or .yandex.com
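
These patterns pair naturally with the is_verified_crawler() sketch above. The dictionary keys below are illustrative user-agent tokens, and the suffix tuples mirror the patterns just listed:

CRAWLER_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "Baiduspider": (".baidu.com", ".baidu.jp"),
    "facebookexternalhit": (".fbsv.net", ".facebook.com"),
    "YandexBot": (".yandex.net", ".yandex.com"),
}

# e.g. is_verified_crawler(ip, CRAWLER_SUFFIXES["Googlebot"])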

Limitations and Considerations

While reverse DNS lookup is a valuable technique, it has some limitations:

  1. Performance impact: DNS lookups add latency to request processing. Consider caching results (a minimal caching sketch follows this list) or performing lookups asynchronously.
  2. Incomplete coverage: Not all legitimate bots have reverse DNS entries set up correctly.
  3. IPv6 complexity: The verification process can be more complex with IPv6 addresses.
  4. Pattern changes: Search engines may change their hostname patterns over time, so validation logic needs to be kept up to date.
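
As a sketch of the caching idea, assuming the is_verified_crawler() function from earlier, an in-process cache avoids repeating lookups for IPs you have already seen (a production setup would also want TTL-based expiry, since DNS records change):

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_is_verified_crawler(ip, allowed_suffixes):
    # allowed_suffixes must be a tuple so the arguments are hashable
    return is_verified_crawler(ip, allowed_suffixes)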

Using Our API

You can test individual IP addresses quickly using our Reverse DNS Lookup tool, or use the API to test a larger number of IPs programmatically.

If you're not comfortable writing a script, you can integrate the API with a list of IPs in a Google Sheet, as explained in our API documentation. Essentially, you create a sheet with a list of IP addresses in one column, then add a formula in the next column to pull in the results from our API.

=IMPORTDATA(CONCATENATE("https://reversedns.overstand.dev/api/v2/?attribute=reverse_dns_valid&ip=", A2))
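
If you'd rather script it, here is a minimal Python sketch calling the same endpoint as the formula above. We're assuming the response is a simple plain-text value, as IMPORTDATA implies; adjust the parsing to match the actual API output:

import urllib.request

def reverse_dns_valid(ip):
    # Same endpoint and parameters as the Sheets formula above
    url = ("https://reversedns.overstand.dev/api/v2/"
           f"?attribute=reverse_dns_valid&ip={ip}")
    with urllib.request.urlopen(url, timeout=10) as resp:
        # Assumption: the API returns a simple text value
        return resp.read().decode().strip()

print(reverse_dns_valid("66.249.66.1"))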

Conclusion

Reverse DNS lookup provides a reliable method to verify the authenticity of crawlers that publish matching forward and reverse DNS records. By implementing this validation process, you can ensure that bots claiming to be from search engines actually are, making it easier to manage bot traffic and maintain the integrity of your web presence.