On 01 August, 2007, I noticed a significant drop in traffic to many of my sites, including my main website. Most webmasters will be familiar with the occasional blip when things go wrong, so you learn to sit tight and monitor the situation. These things can last anywhere from a day to 2 weeks, and the last thing you want to do is react in panic and make changes that aggravate the situation. However, after 3 weeks with no sign of recovery, I started to worry. I decided to do some checking on my own. Along the way, I learnt many lessons.

The symptoms

* Google Sitemaps reports “Errors” that alternated between :

  • Network unreachable: Network unreachable
    We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.
  • Network unreachable: robots.txt unreachable
    We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.

* Checking the Diagnostics tab, I started to see many pages listed as unreachable.

 robots.txt unreachable

* A look at the cached copy in Diagnostics revealed that Google’s last cached date for my robots.txt file was back on July 13, 2007. This would suggest that Googlebot had problems accessing my robots.txt file (and my site) after that.

* A look into my server logs showed Googlebot’s last single visit for the day was on August 10. That seemed strange when in the past, Googlebot would visit me at least once every hour on average.

* The graph of Google’s crawl rate showed Google stopped crawling my site in late July, and the second graph showed erratic download times coinciding with Googlebot’s absence.

googlebot_crawl_chart

googlebot_downloadtime_chart

I believe that because Googlebot was unable to access my pages, my site dropped out of the search engine results pages (SERPs), so the traffic that used to come from ranking on page one for popular terms stopped. This caused an unnerving drop in revenue.
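The server log check described in the symptoms can be approximated with a short script. This is only a sketch: it assumes the logs are in the Apache "combined" format, and the sample lines, IP addresses and the function name `googlebot_visits_per_day` are made up for illustration.

```python
import re
from collections import Counter

# Picks out the date and the final quoted field (the user agent) of an
# Apache "combined" log line. The log format is an assumption; adjust the
# pattern if your server logs differently.
LOG_LINE = re.compile(r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\].*"(?P<agent>[^"]*)"\s*$')

def googlebot_visits_per_day(lines):
    """Count requests per day whose user agent mentions Googlebot."""
    visits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match.group("agent").lower():
            visits[match.group("day")] += 1
    return visits

# Made-up sample lines, not taken from the real logs.
sample = [
    '66.249.66.1 - - [10/Aug/2007:04:12:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [10/Aug/2007:04:13:44 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U)"',
]
print(googlebot_visits_per_day(sample))
```

A day that is missing from the result, or a count far below the usual hourly visits, is the kind of gap that showed up in my logs. Note that the user agent string can be faked, so this counts claimed Googlebot visits only.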

Checking Google
As usual, my first stop when this sort of thing happens is the online forums. I needed to check if this was a widespread issue that webmasters were facing. I did find a number of threads on the same problem, but not enough to verify that this was a serious bug on Google’s part. However, a small number of forum threads about a particular problem doesn’t necessarily mean that it isn’t a Google bug. After all, Google IS a machine, made up of hundreds of thousands of servers and many, many algorithms. And if your site is not a heavy-weight traffic generator, it’s unlikely that a Google bug affecting your site is going to be top-priority for the guys at Mountain View.

Checking Myself and My Sites
Once I had more or less determined that this wasn’t a widespread Google bug, I needed to make sure Googlebot didn’t stop visiting because of my own doing.

* Did Googlebot stop visiting due to a penalty?
The only thing I could think of that would possibly cause a penalty by Google was my reciprocal links pages. I had stopped reciprocal linking a long time ago BUT maintained those pages out of courtesy to the sites that still kept their links to me. However, recent updates to Google’s Webmaster Guidelines suggest that reciprocal linking will hurt your site :

Don’t participate in link schemes designed to increase your site’s ranking or PageRank. In particular, avoid links to web spammers or “bad neighborhoods” on the web, as your own ranking may be affected adversely by those links.

I wasn’t sure if my reciprocal links pages were part of the problem, but in the interest of self-preservation, I decided to remove all reciprocal link pages from all my sites.

* Was my robots.txt file okay?
First, Google Webmaster Tools indicated that my robots.txt file was okay, although its cached version was about 30 days old. The long lapse between the current date and the last cached date of the robots.txt file SHOULD have indicated to me that Googlebot had problems accessing the file. The robots.txt validators I used, listed below, showed there was nothing wrong with my file :

http://www.invision-graphics.com/robotstxt_validator.html

http://validator.czweb.org/robots-txt.php

Conclusion : My robots.txt file seemed okay.
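For anyone who prefers to test the rules locally rather than rely on an online validator, Python's standard `urllib.robotparser` module can answer the question "would these rules let Googlebot fetch this URL?". The sketch below is an illustration only: the rules and the example.com URLs are invented, and `googlebot_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def googlebot_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Invented rules for illustration; paste in the contents of your own file.
rules = """User-agent: *
Disallow: /private/
"""

print(googlebot_allowed(rules, "http://www.example.com/index.html"))
print(googlebot_allowed(rules, "http://www.example.com/private/page.html"))
```

Keep in mind that this only checks what the rules say. It cannot tell you whether Googlebot is able to reach the file in the first place, which turned out to be the real issue in my case.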

* Was my sitemap file okay?
Google’s sitemaps can seem like an unruly monster to the uninitiated. Many webmasters have reported that the Google sitemap reports can be buggy at times and report errors when there are none. In any case, I used the following tools and references to check the validity of my sitemaps and their schemas :

http://www.validome.org/google/

http://www.smart-it-consulting.com/internet/google/submit-validate-sitemap/

http://www.xml-sitemaps.com/validate-xml-sitemap.html

https://www.google.com/webmasters/tools/docs/en/protocol.html

http://www.sitemaps.org/protocol.php

Conclusion : My sitemap.xml file seemed okay and passed validation by the tools above.
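A rough local sanity check is also possible. The sketch below assumes the sitemap uses the sitemaps.org 0.9 namespace (older Google-specific sitemaps used a different one) and only confirms that the file is well-formed XML with the expected root element. It is not a substitute for full schema validation, and `sitemap_urls` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol; assumed here, check your own file.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap; raise ValueError on a wrong root."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError(f"unexpected root element: {root.tag}")
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]

# Invented sample sitemap.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about.html</loc></url>
</urlset>"""

print(sitemap_urls(sample_sitemap))
```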

* Was my .htaccess file okay?
My investigations revealed that incorrect coding in the .htaccess file could cause unnecessary looping (meaning I could have been sending Googlebot in circles and it threw up a red flag), so this had to be checked out. There were no known validators that I could use to verify that my .htaccess file was okay so I took my problem to the forums :

Since I had no experience in .htaccess coding and debugging, I had to rely on the more experienced webmasters who contribute to the forum listed above. Thankfully, all who looked at the content of my .htaccess file cleared it, saying there wasn’t any problem with my .htaccess file.

Conclusion : My .htaccess file didn’t seem to contain coding errors
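To make the "looping" worry concrete, here is a small model of how a redirect loop can be spotted. It does not parse .htaccess syntax. It assumes you have already written down which URL redirects to which (the `redirects` dictionary below is invented), and `follow_redirects` is a hypothetical helper.

```python
def follow_redirects(start, redirects, max_hops=10):
    """Follow a chain of redirects.

    Returns (final_url, number_of_hops). Raises RuntimeError if the chain
    comes back to a URL it has already visited or exceeds max_hops.
    """
    seen = [start]
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:
            raise RuntimeError("redirect loop: " + " -> ".join(seen + [current]))
        seen.append(current)
        if len(seen) - 1 > max_hops:
            raise RuntimeError("too many redirects")
    return current, len(seen) - 1

# Invented redirect rules: non-www to www, and index.php to root.
redirects = {
    "http://example.com/": "http://www.example.com/",
    "http://www.example.com/index.php": "http://www.example.com/",
}
print(follow_redirects("http://example.com/", redirects))

# A broken pair of rules that send a visitor in circles.
looping = {"http://example.com/a": "http://example.com/b",
           "http://example.com/b": "http://example.com/a"}
try:
    follow_redirects("http://example.com/a", looping)
except RuntimeError as err:
    print(err)
```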

* Were my pages blocking Googlebot?
It was unlikely that my pages were blocking Googlebot since I had not made major changes and Googlebot was visiting them up till the last visit on July 13, 2007. However, I needed to make sure I had not inadvertently placed any tag like the following in my pages :

  • <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  • <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

Update : Another way you may be blocking Googlebot is if you load your pages with scripts that leave Googlebot “stranded”.

Conclusion : None of my pages contained any tags that would have blocked Googlebot
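Checking pages by eye gets tedious, so a scan like the following may help. It is a minimal sketch using Python's standard `html.parser` module. The class and function names are made up, and it only looks for the two meta tags shown above, not for scripts that might strand a crawler.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects robots/googlebot meta tags that contain noindex or nofollow."""

    def __init__(self):
        super().__init__()
        self.blocking = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = {key: (value or "") for key, value in attrs}
        name = attrs.get("name", "").lower()
        content = attrs.get("content", "").lower()
        if name in ("robots", "googlebot") and ("noindex" in content or "nofollow" in content):
            self.blocking.append((name, content))

def blocking_meta_tags(html):
    """Return a list of (name, content) pairs for blocking meta tags in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.blocking

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><body></body></html>'
print(blocking_meta_tags(page))
```

An empty list means no blocking meta tag was found on that page.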

Checking My Web Host Provider
The nature of the error in Google Sitemaps strongly suggested that Google was having problems accessing my site where other bots and search engines didn’t. It was a really perplexing situation. My web host initially replied that they had not blocked Googlebot, so it was left to me to find possible loopholes. I turned again to the boards. The difficult part initially was trying to find out the right question to ask. So I started by describing my problem. Then one by one, contributors made suggestions and I followed through on every one of those suggestions.

A contributor commented that, using her header check tool, she found all my URLs were returning a 403 (Forbidden) status. I have not placed a live link to her tool because she has since taken it offline for upgrading and maintenance. In any case, to verify the results her tool was giving me, I used another header check tool. Indeed, using this tool, most of my URLs returned an “Operation Timed Out” error.

I then wondered if my site was the only one experiencing the 403 Forbidden and timeout errors. I copied the URLs of all the clients listed on my web host’s “prominent clients” page and checked them with the 2 tools. Indeed, I found many of those sites returned the same 403 - Forbidden status on the first tool and timed out on the second tool. I reported this finding to my web host.
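A header check of this kind is easy to reproduce. The sketch below is not the contributor's tool. It is a minimal stand-in built on Python's standard `urllib`, and the function name `check_status` and the default user agent string are assumptions of mine.

```python
import socket
import urllib.error
import urllib.request

def check_status(url, user_agent="Mozilla/5.0 (compatible; header-check-sketch)", timeout=10):
    """Send a HEAD request and return the HTTP status code, or an error label."""
    request = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 Forbidden.
        return err.code
    except (urllib.error.URLError, socket.timeout) as err:
        # No usable answer at all: DNS failure, refused connection, timeout.
        return f"error: {getattr(err, 'reason', err)}"
```

Calling `check_status` on each URL in a list gives a quick table of which pages answer with 200, which are forbidden, and which time out. A firewall that blocks by IP address will treat your own machine differently from Googlebot, so a clean result from home does not prove that Googlebot can get through.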

The first sign that I was probably on to something was a reply from them stating they needed time to “conduct tests”. I signed up for a trial account with WatchMouse.com, a website monitoring service, to see if any more timeout errors were popping up, and indeed they were.

watch mouse website monitoring service

Again I reported this to my web host. Their reply stated that timeouts could be caused by many reasons other than the servers, which I accepted. However, I noted that they had asked their Security Team to investigate the matter, which was another indication that we were on the right track.

Hurray!
Checking my stats on August 30th, I was surprised to see that the server had registered over a hundred visits from Googlebot. I immediately contacted my web host and asked if they had made any changes and they confirmed that they did. So here’s what caused the inadvertent blocking of Googlebot according to them :

Our firewall has an automated mechanism which will block IP addresses deemed to be making too many concurrent connections to our server in a short time. Our security department has whitelisted the google network range that is noticed to make these connections. On top of that we have made the firewall less stringent in the sense we will allow a higher threshold of concurrent connections compared to previously. Based on your feedback, the configuration is just right.

It is not the server that has the problem but the datacenter network that is not reachable from certain locations. We have not change any settings at the time. However, it is possible that there are more users who use Google Sitemap, causing increased concurrent connections to the server. For the current issue, it appears that our firewall’s stringent policy has temporary block the bot.
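To picture what my web host described, here is a toy model of such a mechanism. It is my own sketch, not their firewall: it counts new connections per IP address within a sliding time window rather than truly concurrent ones, and the class name, the thresholds and the 192.0.2.0/24 documentation address range are all invented for illustration.

```python
from collections import defaultdict, deque
from ipaddress import ip_address, ip_network

class ConnectionLimiter:
    """Blocks an IP that opens too many connections within a time window."""

    def __init__(self, max_connections, window_seconds, whitelist=()):
        self.max_connections = max_connections
        self.window_seconds = window_seconds
        self.whitelist = [ip_network(net) for net in whitelist]
        self.recent = defaultdict(deque)  # ip -> timestamps of recent connections
        self.blocked = set()

    def allow(self, ip, now):
        """Return True if a connection from `ip` at time `now` is accepted."""
        addr = ip_address(ip)
        if any(addr in net for net in self.whitelist):
            return True                   # whitelisted ranges are never blocked
        if ip in self.blocked:
            return False                  # once blocked, the IP stays blocked
        stamps = self.recent[ip]
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()              # forget connections outside the window
        stamps.append(now)
        if len(stamps) > self.max_connections:
            self.blocked.add(ip)
            return False
        return True

times = (0.0, 0.1, 0.2, 0.3, 5.0)

strict = ConnectionLimiter(max_connections=3, window_seconds=1.0)
print([strict.allow("192.0.2.10", t) for t in times])

relaxed = ConnectionLimiter(max_connections=3, window_seconds=1.0, whitelist=["192.0.2.0/24"])
print([relaxed.allow("192.0.2.10", t) for t in times])
```

In the strict case the fourth connection trips the limit and every later one is refused, even after the burst has passed, which matches the way Googlebot stayed locked out of my site. Whitelisting the range or raising the threshold, the two changes my host said they made, both let the same traffic through.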

Lessons Learned
In hindsight, it occurred to me that the modifications to my .htaccess file could have caused an increase in the concurrent connections to the server. I had modified my .htaccess file to solve canonicalization problems by redirecting :

  • redundant URLs to new URLs
  • non-www URLs to www URLs
  • index.php to root

I theorize that since these redirections involved hundreds of URLs, deploying the changes to my .htaccess file in mid July may have triggered the “increase” in concurrent connections as the bots were redirected to the correct pages. In other words, Googlebot attempted to make 2 connections for every page: one to the old/non-www URL and then one to the new/www URL. As the concurrent connections increased, they triggered the automated mechanism that blocked Googlebot’s IP address. This in turn caused more time-out errors. The spikes in the Googlebot Download Time chart (above) indicate long download times which eventually ended in timeouts. Unfortunately, this affected one of the most important files, the robots.txt file, which every bot needs to read before it accesses a site’s pages. These time-outs also made my sitemap inaccessible, and since Googlebot could not access these 2 important files, it could not confirm my site still existed!
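The arithmetic behind this theory can be shown with a few lines of code. This is only an illustration of my guess, not a measurement: the URLs and redirect rules are invented and `requests_needed` is a hypothetical helper.

```python
def requests_needed(crawl_list, redirects, max_hops=10):
    """Count the requests a bot makes to fetch every URL in crawl_list."""
    total = 0
    for url in crawl_list:
        total += 1                      # the first request for the URL as crawled
        hops = 0
        while url in redirects and hops < max_hops:
            url = redirects[url]        # the bot follows the redirect ...
            total += 1                  # ... which costs one more request
            hops += 1
    return total

# Invented rules: non-www to www, then index.php to root.
redirects = {
    "http://example.com/index.php": "http://www.example.com/index.php",
    "http://www.example.com/index.php": "http://www.example.com/",
    "http://example.com/page.html": "http://www.example.com/page.html",
}
old_urls = ["http://example.com/index.php", "http://example.com/page.html"]
new_urls = ["http://www.example.com/", "http://www.example.com/page.html"]

print(requests_needed(old_urls, redirects))  # 5 requests for 2 pages
print(requests_needed(new_urls, redirects))  # 2 requests for 2 pages
```

Multiply that by hundreds of URLs that Googlebot still knew under their old addresses and it is easy to see how the number of connections could jump right after the redirects went live.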

However, my web host’s analysis of the situation brought them to the conclusion that more of their clients (like myself) had begun to use Google Sitemaps and this is the reason for the increase in concurrent connections. Whatever the reason, at last check, Googlebot has resumed its crawl of my site and I must say it is a welcome sight.

Googlebot visits

I’ve learnt a lot as I struggled with this problem, and I hope this post helps those of you who may be wondering why Googlebot has stopped visiting your site, or who are experiencing the “Network unreachable: robots.txt unreachable” error.
