Search Engine Spider Detection With PHP

Any search engine optimisation strategy should ensure that users and search engines see the same thing. If you start delivering different content to each, your site will either perform poorly in the rankings or get banned outright. However, there are certain circumstances where you will want to detect the presence of a search engine spider. For example, let's say that you have a link to a section of your site and you want to count every time a user clicks on it. One way of doing this is to add a parameter to the URL of the link: if the parameter is present then a user is going through the site, so the action is registered. You don't want to register an action every time a search engine bot spiders the site, so the following function allows you to turn off this parameter for spiders.

function spiderDetect() {
 $agentArray = array("ArchitextSpider", "Googlebot", "TeomaAgent",
  "Zyborg", "Gulliver", "Architext spider", "FAST-WebCrawler",
  "Slurp", "Ask Jeeves", "ia_archiver", "Scooter", "Mercator",
  "crawler@fast", "Crawler", "InfoSeek Sidewinder",
  "almaden.ibm.com", "appie 1.1", "augurfind", "baiduspider",
  "bannana_bot", "bdcindexer", "docomo", "frooglebot", "geobot",
  "henrythemiragorobot", "sidewinder", "lachesis", "moget/1.0",
  "nationaldirectory-webspider", "naverrobot", "ncsa beta",
  "netresearchserver", "ng/1.0", "osis-project", "polybot",
  "pompos", "seventwentyfour", "steeler/1.3", "szukacz",
  "teoma", "turnitinbot", "vagabondo", "zao/0", "zyborg/1.0",
  "Lycos_Spider_(T-Rex)", "Lycos_Spider_Beta2(T-Rex)",
  "Fluffy the Spider", "Ultraseek", "MantraAgent","Moget",
  "T-H-U-N-D-E-R-S-T-O-N-E", "MuscatFerret", "VoilaBot",
  "Sleek Spider", "KIT_Fireball", "WISEnut", "WebCrawler",
  "asterias2.0", "suchtop-bot", "YahooSeeker", "ai_archiver",
  "Jetbot"
 );
 
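 // The User-Agent header is optional, so fall back to an empty string.
 $theAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
 
 foreach ($agentArray as $anAgent) {
  // stripos() is case-insensitive; compare strictly because a match at position 0 returns 0.
  if (stripos($theAgent, $anAgent) !== false) {
   return true;
  }
 }
 return false;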
}

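The function works by reading the visitor's user agent string and then comparing it against the list of known spider user agents in the array. The comparison is a case-insensitive substring match, so an entry such as "Googlebot" matches any user agent string that contains it. If a match is found then true is returned, otherwise the return value is false.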
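To make the matching step easier to try out, here is a minimal sketch of the same logic with the user agent and the list passed in as arguments. The name matchesSpider() and the shortened list are assumptions made for illustration and are not part of the function above.

```php
<?php
// Hypothetical helper: same matching logic as spiderDetect(), but the
// user agent and the list of signatures are passed in as arguments.
function matchesSpider($userAgent, array $signatures) {
    foreach ($signatures as $signature) {
        // stripos() is case-insensitive and returns false when nothing matches.
        // A match at the very start returns int 0, hence the strict comparison.
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Shortened list, for illustration only.
$signatures = array("Googlebot", "Slurp", "baiduspider");

var_dump(matchesSpider("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", $signatures)); // bool(true)
var_dump(matchesSpider("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0", $signatures)); // bool(false)
```

Passing the user agent in, rather than reading $_SERVER inside the function, means the check can be exercised from the command line with any string you like.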

With this function present you can now include an if statement to see if the user agent is a search engine spider or not.

if ( spiderDetect() ) {
 // do something for spiders
} else {
 // do something for users
}
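Going back to the link counter example from the start of the post, the check might be used along these lines. This is only a sketch: the helper buildSectionLink(), the parameter name "counted" and the /section/ URL are all made up for illustration, and the helper does not handle URLs that contain a #fragment.

```php
<?php
// Hypothetical helper: appends a tracking parameter unless the visitor is a spider.
// $isSpider would normally be the result of spiderDetect().
function buildSectionLink($url, $isSpider) {
    if ($isSpider) {
        // Spiders get the plain URL, so following it never registers an action.
        return $url;
    }
    // Use "&" if the URL already has a query string, "?" otherwise.
    $separator = (strpos($url, "?") === false) ? "?" : "&";
    return $url . $separator . "counted=1";
}

echo buildSectionLink("/section/", false), "\n"; // /section/?counted=1
echo buildSectionLink("/section/", true), "\n";  // /section/
```

The page content stays identical for users and spiders; only the counting parameter on the link differs.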

Please be careful with this function. If you serve different content to users and search engine spiders you will more than likely get banned for your efforts.

Also, this might be an incomplete list of search engine spider user agents. If you know of any more then please write a comment and I will add them to this list.

Update: Reworked part of the finding code to make it more efficient.

Comments

Hi, in the loop you use: $agentArray[$f] but I guess you meant: $agentArray[$i] Great function, thanks for publishing!
Nice spot! I've corrected that now :)
Philip Norton

