How to prevent ‘GoogleBot’ from repeatedly taking your WordPress site offline

If your website is powered by WordPress and seems to go offline at random times for no apparent reason, it’s possible that an automated system is requesting one or more URLs on your WordPress site frequently enough to overwhelm the server.

This can happen when an automated process probes a WordPress-based site for vulnerabilities. The requests can arrive often enough to keep your web server too busy to handle normal visitors, or to use so much CPU or memory that your site goes offline.

Rebooting the server can help, but if you find yourself on the receiving end of this, it won’t be long before the requests start again after a reboot and your site goes down once more.

The following assumes you are using a typical LAMP stack to host your WordPress site, consisting of a CentOS operating system, Apache web server, MySQL and PHP.

To find out if this is happening on your server, your first step is to look at your access logs to see a list of recent visitors to your site. This sounds easy, but if your server is in the middle of being overwhelmed with requests, your access to it will be either very slow or impossible.

To get around this, the only solution is to reset your server (if your hosting provider’s control panel gives you this option), then SSH in to the server as fast as you can and temporarily stop Apache from handling incoming requests using:

sudo service httpd stop

This will stop the web server from handling new requests, but there may still be several in memory. To clear them, kill any remaining Apache processes (on CentOS they are named httpd, so matching on “apache” would find nothing):

sudo pkill httpd

At this point your website will be offline, but you should have a stable system to use to diagnose the problem.

The next step is to find your access.log file, which records recent visits to your site. The location of this file depends on how your site is set up, but typically it will be in the root of your website, for example at /var/www/www.domain.com/access.log. (A default Apache install on CentOS logs to /var/log/httpd/access_log instead.) The following find command may help locate it:

find / -name access.log

Once you have found the file, change to its folder and type the following to show the most recent entries (tail prints the last ten lines by default):

tail access.log

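Or you can be more specific and show only the lines that mention Googlebot. The -i flag makes the match case-insensitive, which matters because the user agent is spelled “Googlebot” with a capital G:

grep -i googlebot access.log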

In the output, check whether “Googlebot” appears many times in a short space of time, based on the timestamps.
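Reading timestamps by eye gets tedious on a busy log, so here is a minimal sketch that counts Googlebot entries per minute. The function name googlebot_hits_per_minute is my own, and it assumes Apache’s common or combined log format, where the timestamp is the fourth whitespace-separated field; adjust the field number if your log format differs.

```shell
#!/bin/sh
# Count log entries mentioning "googlebot" (any case) per minute.
# Assumption: common/combined log format, so field 4 looks like
# "[10/Oct/2015:13:55:36" and characters 2-18 give "10/Oct/2015:13:55".
googlebot_hits_per_minute() {
    logfile="$1"
    # The log is already in time order, so uniq -c can count adjacent
    # identical minutes without a sort step.
    grep -i 'googlebot' "$logfile" \
        | awk '{ print substr($4, 2, 17) }' \
        | uniq -c
}
```

With the function loaded in your shell, run googlebot_hits_per_minute access.log. Each output line is a count followed by the minute it applies to; a long run of high counts suggests something is hammering the site.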

Viewing a server that’s under pressure – using New Relic

It’s worth mentioning that this is probably not the real Google attacking your site. It’s more likely some other automated process calling itself Googlebot. The real Googlebot is usually a good thing to see in your logs, because it means Google is indexing your site.

If you see a lot of mentions of ‘Googlebot’ in your access log, look for the IP address making the requests, usually the first field on the same line.
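To pull those IP addresses out in one go, the following is a minimal sketch that lists the ten addresses sending the most requests under a Googlebot user agent. The function name top_googlebot_ips is hypothetical, and it assumes the client IP is the first field on each line, as in Apache’s common and combined log formats.

```shell
#!/bin/sh
# List the client IPs that most often claim to be Googlebot.
# Assumption: the client IP is the first whitespace-separated field.
top_googlebot_ips() {
    logfile="$1"
    grep -i 'googlebot' "$logfile" \
        | awk '{ print $1 }' \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n 10
}
```

With the function loaded in your shell, run top_googlebot_ips access.log. Each output line shows a request count followed by the IP address, busiest first.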

To verify if the IP address that is contacting your site actually belongs to Google, use the following to do a reverse lookup:

host [IP.ADDRESS.HERE]

In the response text, you should see a hostname ending in googlebot.com or google.com. Google’s documentation on verifying Googlebot has more information on what you might see in the response.

Edit: I’ve been told that ‘host’ above can be fooled easily enough, so a more thorough way to determine the owner of the IP address is to use ‘whois’, a more authoritative source of information:

whois [IP.ADDRESS.HERE]

If you don’t see Google in the response, then this is something else pretending to be Google, and we should block it from accessing the site.
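One way to make the reverse lookup harder to fool is to confirm it in the other direction: look up the hostname you got back and check that it resolves to the same IP address. As far as I know this is the approach Google itself describes for verifying Googlebot. The sketch below is a rough version of that check. The function names are my own, it treats only hostnames ending in googlebot.com or google.com as Google’s, and it relies on the usual output wording of the host utility, so treat it as a starting point rather than a finished tool.

```shell
#!/bin/sh
# Succeed if the hostname ends in a domain used by Google's crawlers.
# The leading dot in each pattern stops "fakegooglebot.com" from matching.
is_google_hostname() {
    case "$1" in
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Forward-confirmed reverse DNS: IP -> hostname -> IP must round-trip.
# Needs network access and the `host` utility.
verify_googlebot_ip() {
    ip="$1"
    # A successful reverse lookup prints "... domain name pointer <name>."
    name=$(host "$ip" 2>/dev/null \
        | awk '/domain name pointer/ { print $NF; exit }')
    name=${name%.}                      # drop the trailing dot
    [ -n "$name" ] || return 1          # no reverse record at all
    is_google_hostname "$name" || return 1
    # The hostname must resolve back to exactly the IP we started with.
    host "$name" 2>/dev/null \
        | awk '/has address|has IPv6 address/ { print $NF }' \
        | grep -qxF "$ip"
}
```

With the functions loaded, verify_googlebot_ip [IP.ADDRESS.HERE] && echo genuine || echo fake gives a quick answer for each address you pulled from the log.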

To block the IP address using iptables, use the following:

iptables -A INPUT -s [IP.ADDRESS.HERE] -j DROP

There might be more than one IP address to block, so check your access.log again just in case.
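If there are several addresses, typing the rule for each one gets error-prone. Here is a minimal sketch that reads IP addresses from standard input, one per line, and prints the matching iptables command for each. It deliberately runs nothing itself, so you can review the output first. The function name print_block_rules is hypothetical, and the character check is only a rough guard against stray text, not a full IP address validator.

```shell
#!/bin/sh
# Print one iptables DROP rule per IP read from stdin (one IP per line).
# Nothing is executed here; the output is meant to be reviewed first.
print_block_rules() {
    while IFS= read -r ip; do
        # Skip blank lines and lines starting with '#'.
        case "$ip" in
            ''|'#'*) continue ;;
        esac
        # Rough sanity check: allow only characters that can appear in
        # an IPv4/IPv6 address or CIDR range; skip everything else.
        case "$ip" in
            *[!0-9A-Fa-f.:/]*)
                echo "skipping suspicious entry: $ip" >&2
                continue ;;
        esac
        printf 'iptables -A INPUT -s %s -j DROP\n' "$ip"
    done
}
```

For example, print_block_rules < bad_ips.txt prints the rules for every address in a file called bad_ips.txt (a made-up name). Once you are happy with the list, the same output can be piped to sudo sh to apply it.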

If you are worried that you might have blocked a legitimate IP address, you can use the following command to view the blocked IP addresses:

iptables -L INPUT -v -n

To remove an IP address from your list so that it can access your site again:

iptables -D INPUT -s [IP.ADDRESS.HERE] -j DROP

Of course, it’s possible for someone or something to keep putting pressure on your site from different IP addresses, but there is a cost to that (unless the attacker has many zombie machines acting on their behalf) and they’ll run out of IP addresses eventually :)

Once the addresses are blocked, make sure to start the Apache web server again so that normal visitors can access your site.

A calm server again

By performing the above steps, you’ll have blocked any IP addresses falsely identifying themselves as Googlebot, but you won’t have blocked Googlebot itself, so there won’t be any negative side effects for your SEO.

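If it turns out that Googlebot itself, or any other bot, is overwhelming your site, you could place a directive in your robots.txt file asking it to wait between requests and not visit your site so often. Note that Crawl-delay is a non-standard directive: some crawlers (Bingbot, for example) honor it, but Googlebot itself ignores it. The directive looks like this: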

User-agent: *
Crawl-delay: 30

I haven’t tried it, but I’ve been told that Wordfence is a good plugin for WordPress that can help with the above too.

 

 

 
