10.1: Bots and crawlers

As you have all noticed, thanks to LEO more and more bots are visiting, and sometimes overloading, your websites and ours. Most of them are good, but some are very evil. You will want to know which bots can be trusted and which cannot. To get them all listed we have started this page, and you may edit it or add more details about bots and crawlers.

NOTE: To disallow a bot, put the User-agent line given for that bot below into your robots.txt and add the following line directly below it.

Disallow: /
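
As a rough check that such a record does what you expect, the sketch below feeds a robots.txt text to Python's standard urllib.robotparser module and asks whether a given bot may fetch a page. The bot names ExampleBot and OtherBot and the example.com URL are hypothetical placeholders, not real crawlers from the list on this page. Keep in mind that this only shows how a well-behaved parser reads the file; a bad bot can simply ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one record that bans a single bot.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/index.php"  # placeholder URL
    print("ExampleBot allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", url))
    print("OtherBot allowed:", is_allowed(ROBOTS_TXT, "OtherBot", url))
```

Bots that have no record of their own (and no "User-agent: *" record to fall back on) are treated as allowed, which is why only the named bot is blocked here.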



Google

Google has several flavours of bots. They are friendly, and Google is perhaps the most used search engine ever.
  • Googlebot:
    the crawler for the main search engine; it respects your robots.txt and visits often
  • Googlebot-Image:
    used to collect all images from your site for the Google Images search
  • Mediapartners-Google:
    this one is used by the Google AdSense system. Every page that shows Google ads is also visited by this bot at the same time. This is done on purpose, so that visitors see advertisements that match the content of the page; everywhere you go, AdSense follows you. If you use AdSense, be aware that your website's bandwidth usage is doubled and your site statistics are twice as high as they would be without AdSense.
To ban these bots in robots.txt use:
  • User-agent: Googlebot
  • User-agent: Googlebot-Image
  • User-agent: Mediapartners-Google
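
If you need to ban several bots at once, the records can be generated instead of typed by hand. The following is a minimal sketch, assuming you want one separate record per bot, each with its own Disallow line; the function name build_robots_txt and the GOOGLE_BOTS list are made up for this illustration.

```python
def build_robots_txt(user_agents, path="/"):
    """Build robots.txt text with one Disallow record per user agent."""
    records = [f"User-agent: {agent}\nDisallow: {path}" for agent in user_agents]
    # Records are separated from each other by a blank line.
    return "\n\n".join(records) + "\n"


# The three Google user agents listed above.
GOOGLE_BOTS = ["Googlebot", "Googlebot-Image", "Mediapartners-Google"]

if __name__ == "__main__":
    print(build_robots_txt(GOOGLE_BOTS), end="")
```

The output can be pasted into the robots.txt file in the root of your site. Pass a different path argument if you only want to keep the bots out of part of the site.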

ia_archiver

Alexa's web crawler

User-agent: ia_archiver

MSIECrawler

This is not a true web robot. It usually shows up in site logs when someone bookmarks a page whilst offline. Internet Explorer will then download the page and all resources related to it, including links, images, JavaScript and style sheets, when the user is next online.

User-agent: MSIECrawler

msnbot

  • User-agent: searchpreview
  • User-agent: msnbot
This bot also supports the Crawl-delay directive, which sets the delay between requests in seconds:

Crawl-delay: 120
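
To get a feeling for what a given Crawl-delay means, the sketch below reads the value back with Python's standard urllib.robotparser module (its crawl_delay method is available from Python 3.6) and works out the most requests per day a bot that honours the delay could make. The helper names crawl_delay_for and max_requests_per_day are made up for this illustration, and the robots.txt text is only a sample.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt text using the Crawl-delay value shown above.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 120
"""

SECONDS_PER_DAY = 24 * 60 * 60


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) for user_agent, or None if not set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_requests_per_day(delay_seconds):
    """Upper bound on daily requests for a bot that waits delay_seconds each time."""
    return SECONDS_PER_DAY // delay_seconds


if __name__ == "__main__":
    delay = crawl_delay_for(ROBOTS_TXT, "msnbot")
    print("delay:", delay, "seconds")
    print("at most", max_requests_per_day(delay), "requests per day")
```

With a delay of 120 seconds the bot can fetch at most 720 pages per day, so pick a smaller value if you want your site indexed faster.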

Scooter

AltaVista's very friendly crawler. It will never exceed 1% of your server resources, thanks to a nice algorithm that measures how long it takes to fetch a page and multiplies that time by 100 before it fetches the next page.
To ban the bot in robots.txt use:

User-agent: scooter
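
The pacing rule described for Scooter can be worked through with a little arithmetic. The sketch below is only one reading of that description, not AltaVista's actual code: it assumes the bot waits 100 times the measured fetch time before its next request, and the function names are made up for this illustration.

```python
def scooter_style_delay(fetch_seconds, multiplier=100):
    """Seconds to wait before the next request: fetch time times the multiplier."""
    return fetch_seconds * multiplier


def load_fraction(fetch_seconds, multiplier=100):
    """Share of wall-clock time the server spends serving this one bot."""
    delay = scooter_style_delay(fetch_seconds, multiplier)
    return fetch_seconds / (fetch_seconds + delay)


if __name__ == "__main__":
    for fetch_seconds in (0.1, 0.5, 2.0):
        wait = scooter_style_delay(fetch_seconds)
        share = load_fraction(fetch_seconds)
        print(f"fetch {fetch_seconds}s -> wait {wait}s, load {share:.2%}")
```

Whatever the fetch time is, the share works out to 1/101, which is just under 1%. A slow server is automatically visited less often, which is what makes this kind of rule friendly.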

Teoma

Ask Jeeves' web crawler

User-agent: Teoma

Yahoo! Slurp

User-agent: Slurp


More to come, such as:
IBP, ccubee, FAST MetaWeb Crawler, NutchCVS, Findexa Crawler, Vagabondo, W3C-checklink, Wget, Openbot, noxtrumbot, Minuteman, btbot, Java, 1Noonbot, genieknows, YahooFeedSeeker, VoilaBot, w3search, RPT-HTTPClient, MJ12bot, BDFetch, aipbot, Filangy, Baiduspider, appie, Bilbo, Yahoo-MMCrawler, Pogodak, etc.
And of course the BAD bots, such as:
LinkWalker, WebReaper, Schmozilla, OmniExplorer, Picture Finder, etc.

 
Updated: Saturday, August 26, 2006 (05:07:46) by alva
Created:  Saturday, June 25, 2005 (15:50:30) by DJMaze

All content of this website is copyrighted by the Creative Commons NC-SA