Wednesday, April 2, 2008

Could the New Google Spider Be Causing Issues With Websites?

Around the time of the "Google Big Daddy" announcement, a new Googlebot began roaming the Web. Since then, I have heard stories from clients about hammered servers and about websites suddenly having previously unindexed content indexed.
I began to dig into this, and you would be surprised at what I found out.
First, let's look at the chronology of events:
In late September, some astute spider-watchers on WebmasterWorld spotted unusual Googlebot activity. In fact, it was in that thread that the bot was first reported, and some posters thought that perhaps it was regular users masquerading as the famous bot.
Early on, it also seemed that the new bot was not obeying the robots.txt file. This is the protocol that allows or denies crawlers access to parts of a website.
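For those who have never looked at one, here is a minimal sketch of how a well-behaved crawler is supposed to consult robots.txt before fetching anything. The site and rules are made up for illustration; Python's standard library happens to include a parser for exactly this:

    import urllib.robotparser

    # A hypothetical robots.txt: everyone is barred from /private/,
    # and Googlebot is additionally barred from /beta/.
    RULES = """
    User-agent: *
    Disallow: /private/

    User-agent: Googlebot
    Disallow: /beta/
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(RULES.splitlines())

    # A well-behaved crawler checks before every fetch:
    print(rp.can_fetch("Googlebot", "https://www.example.com/beta/page.html"))  # False
    print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))      # True

A bot that skips this check, as the new one appeared to do, will happily pull pages the site owner explicitly asked crawlers to leave alone.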
Speculation grew about what the new crawler was, until Matt Cutts mentioned a new Google test data center. For those who do not know, Matt Cutts is a senior engineer at Google and one of the few Google employees who talks with us "normal people." This happened in November.
There was not much mention of Big Daddy until early January of this year, when Matt blogged about it again, asking for feedback.
Much of the feedback was on the accuracy of the results. There were also those who asked whether the Mozilla Googlebot (which appears as "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" in your visitor logs) and Big Daddy were related, but no response was given.
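Incidentally, that user agent string is also the easiest way to gauge how hard the new spider is hitting your own site. A rough sketch that tallies its daily visits, assuming a combined-format Apache access log at a made-up path:

    from collections import Counter

    UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    hits = Counter()

    # The log location is hypothetical; adjust for your own server.
    with open("/var/log/apache2/access.log") as log:
        for line in log:
            if UA in line and "[" in line:
                # Combined format: ... [02/Apr/2008:13:55:36 -0700] ...
                day = line.split("[", 1)[1].split(":", 1)[0]
                hits[day] += 1

    for day, count in hits.items():
        print(day, count)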
Now, let me offer some speculation of my own:
I do believe the two are related. In fact, I think this new crawler will replace the old one, just as Big Daddy will replace the current data infrastructure.
Why is this so important?
Based on my observations, this new crawler can do much more than the old one.
For one, it emulates a newer browser. The old bot was based on the Lynx text-based browser. Although I am sure Google added features as time went on, the basic Lynx browser is just that: basic.
Which explains why Google could not deal with things like JavaScript, CSS and Flash.
However, with the new spider, based on the Mozilla engine, there are so many possibilities.
Just look at what your Mozilla or Firefox browser can do: render CSS, read and execute JavaScript and other scripting languages, even emulate other browsers.
But that is not all.
I've talked to a few of my clients, and their sites are being hammered by this new spider. It has gotten so bad that some of their servers have gone down because of the traffic from this spider!
On the positive side, I have clients who have gone from a few hundred thousand indexed pages to over 10 million in just a few weeks! Literally, since December 2005 there has been a 3,500% increase in indexed pages over an 8-week period! Just so you know, this is also the client whose servers went down due to the huge volume of crawling happening.
But that is not all.
I have another client who uses IP recognition to serve content based on a person's geographic location. If you live in the USA, you get American content and pricing; if you live in the United Kingdom, you receive UK content and pricing. As you can imagine, the UK, American, Canadian and Australian content is all very similar. In fact, the only significantly different aspect is the pricing.
This is my concern: if Google indexes that duplicate content, what will they do? There is a good chance that the site would be penalized, or even banned, for violating the webmaster quality guidelines set forth by Google.
This is the reason we implemented IP recognition: so that Googlebot, which crawls from US IP addresses only, sees just one version of the site.
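For the curious, IP recognition like this boils down to mapping the visitor's address to a country before deciding which version of the page to serve. A bare-bones sketch; the ranges below are placeholders (a real deployment would use a full geo-IP database), though 66.249.64.0/19 is one of the blocks Googlebot has been observed crawling from:

    import ipaddress

    # Placeholder data for illustration; real geo-targeting relies on a
    # geo-IP database with thousands of ranges per country.
    COUNTRY_RANGES = {
        "US": [ipaddress.ip_network("66.249.64.0/19")],  # includes Googlebot's crawl range
        "UK": [ipaddress.ip_network("81.0.0.0/12")],     # made-up UK block
    }

    def country_for(ip_str, default="US"):
        ip = ipaddress.ip_address(ip_str)
        for country, nets in COUNTRY_RANGES.items():
            if any(ip in net for net in nets):
                return country
        return default  # unknown visitors fall back to the US version

    # Googlebot crawls from US addresses, so it should only ever see one version:
    print(country_for("66.249.66.1"))  # -> US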
However, a review of the server logs shows that this new Googlebot visited not only the United States content but also the content for the other regions of the site. Naturally, I wanted to verify that the IP recognition was working. It is. This leads me to ask: can this crawler spoof its location and/or crawl through a proxy server?
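There is at least one way to answer part of that question. Google has documented a forward-confirmed reverse DNS check for verifying that a visitor claiming to be Googlebot really is one: reverse-resolve the IP, make sure the hostname falls under googlebot.com or google.com, then resolve that hostname forward and confirm it maps back to the same IP. A sketch:

    import socket

    def is_real_googlebot(ip):
        # Forward-confirmed reverse DNS: the hostname must belong to Google,
        # and resolving it forward must return the original IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except (socket.herror, socket.gaierror):
            return False

    print(is_real_googlebot("66.249.66.1"))  # a genuine Googlebot address

A request that passes this test genuinely came from Google, no matter what the user agent string claims; one that fails is an impostor or an unverifiable proxy.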
Imagine: a crawler smart enough to do some of its own checking by examining the site from several different IP addresses. If that is the case, then those who cloak their sites are in for problems.
In any case, from the limited observations I have made, this new Google, both the data center and the spider, will change the way we do things.
If you have experienced something similar with Google in the past few months, be sure to add it to the comments below.


