
Bots and crawlers


As you have all noticed, thanks to LEO more and more bots are overloading your website and ours. Most of them are good, but some are downright evil. To help you tell which bots can be trusted and which cannot, we have started this page; feel free to edit it and add more details about bots and crawlers.

NOTE: To disallow a bot, use the specified User-agent from the list below and add the following line directly below the User-agent line:

Disallow: /
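
For example, a complete minimal robots.txt that shuts one bot out of the whole site while leaving it open to everyone else could look like this (the bot name is just a placeholder; substitute one of the User-agent values listed below):

# Block this particular bot from the entire site
User-agent: SomeBadBot
Disallow: /

# All other bots may crawl everything
User-agent: *
Disallow: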



Google

Google runs several flavours of bots. They are friendly, and Google is probably the most used search engine ever.
  • Googlebot:
    the main search crawler; it respects your robots.txt and visits often
  • Googlebot-Image:
    collects the images from your site for Google Images search
  • Mediapartners-Google:
    this one is used by the Google AdSense system; every page that shows Google ads is also visited by this bot at the same time. This is done on purpose so the visitor is shown advertisements that match the content of the page, so wherever you go, AdSense follows. If you use AdSense, be aware that your website's bandwidth usage is roughly doubled and your site statistics run twice as high as they would without AdSense.
To ban these bots in robots.txt, use one of the following User-agent lines (an example follows the list):
  • User-agent: Googlebot
  • User-agent: Googlebot-Image
  • User-agent: Mediapartners-Google
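
For instance, an illustrative robots.txt that keeps the AdSense bot out while still letting the main Googlebot crawl could look like this (whether you want that depends on how you use AdSense):

User-agent: Mediapartners-Google
Disallow: /

User-agent: Googlebot
Disallow: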

ia_archiver

Alexa's web crawler

User-agent: ia_archiver

MSIECrawler

This is not a true web robot. It usually turns up in site logs when someone bookmarks a page for offline viewing: the next time the user is online, Internet Explorer downloads the page and everything it references, including linked pages, images, JavaScript and style sheets.

User-agent: MSIECrawler

msnbot

  • User-agent: searchpreview
  • User-agent: msnbot
This bot also supports the Crawl-delay directive, for example:

Crawl-delay: 120
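
Putting those together, a sketch of a robots.txt group that throttles msnbot to one request every 120 seconds, and optionally keeps it out of a hypothetical /stats/ directory, would read:

# Ask msnbot to wait 120 seconds between requests
User-agent: msnbot
Crawl-delay: 120
# Hypothetical example: also keep it away from a statistics directory
Disallow: /stats/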

Scooter

AltaVista's very friendly crawler. It will never use more than about 1% of your server resources: it measures how long each page takes to fetch, multiplies that time by 100, and waits that long before fetching the next page.
To ban this bot in robots.txt use:

User-agent: scooter

Teoma

Ask Jeeves' web crawler

User-agent: Teoma

Yahoo! Slurp

User-agent: Slurp
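
Blocks for different bots can simply be listed one after another in the same robots.txt; each User-agent group is applied on its own. Purely as an illustration (you would normally only ban the bots you actually want to keep out), a combined file based on the entries above might look like:

User-agent: ia_archiver
Disallow: /

User-agent: scooter
Disallow: /

User-agent: msnbot
Crawl-delay: 120

User-agent: Slurp
Disallow: /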


More to come, such as:
IBP, ccubee, FAST MetaWeb Crawler, NutchCVS, Findexa Crawler, Vagabondo, W3C-checklink, Wget, Openbot, noxtrumbot, Minuteman, btbot, Java, 1Noonbot, genieknows, YahooFeedSeeker, VoilaBot, w3search, RPT-HTTPClient, MJ12bot, BDFetch, aipbot, Filangy, Baiduspider, appie, Bilbo, Yahoo-MMCrawler, Pogodak, etc.
And of course the BAD bots like:
LinkWalker, WebReaper, Schmozilla, OmniExplorer, Picture Finder, etc.

Created: Saturday, June 25, 2005 (15:50:30) by DJMaze
Updated: Saturday, August 26, 2006 (05:07:46) by alva