10.1: Bots and crawlers
As you have all noticed, thanks to LEO more and more bots are overloading your websites and ours.
Most of them are good, but some are very evil. To help you tell which bots can be trusted and which cannot, we've started this page; feel free to edit it and add more details about bots and crawlers.
NOTE: To disallow a bot, use the User-agent line listed for that bot below and add the following line directly below it.
Disallow: /
Google
Google runs several flavours of bots. They are friendly, and Google is perhaps the most used search engine ever.
- Googlebot:
the main search engine crawler; it respects your robots.txt and visits often
- Googlebot-Image:
used to collect the images on your site for Google Images search
- Mediapartners-Google:
this one is used by the Google AdSense system. Every page that shows Google ads is also visited by this bot at the same time. This is done on purpose, so that visitors see advertisements that match the content of the page; wherever you go, AdSense follows you. If you use AdSense, be aware that the bandwidth used by your website doubles and your site statistics are twice as high as they would be without AdSense.
To ban the bots in robots.txt use:
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Mediapartners-Google
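One way to read the lines above is as a single record: three User-agent lines that share the Disallow line from the NOTE. The sketch below checks that reading with Python's standard urllib.robotparser module. The helper name is_allowed and the example.com URL are assumptions made for illustration, and the result only shows how this particular parser interprets the record, not how Google's own bots behave.

```python
from urllib.robotparser import RobotFileParser

# One record: three User-agent lines followed by a single Disallow rule.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Mediapartners-Google
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/index.php"  # placeholder URL
    for bot in ("Googlebot", "Googlebot-Image", "Mediapartners-Google", "Slurp"):
        print(bot, is_allowed(ROBOTS_TXT, bot, url))
```

Bots that are not named in the record, such as Slurp here, are unaffected by it.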
ia_archiver
Alexa's web crawler
User-agent: ia_archiver
MSIECrawler
This is not a true web robot. It usually shows up in site logs when someone bookmarks a page for offline viewing. Internet Explorer then downloads the page and everything it refers to, including linked pages, images, JavaScript and style sheets, the next time the user is online.
User-agent: MSIECrawler
msnbot
User-agent: searchpreview
User-agent: msnbot
This bot also supports the Crawl-delay directive:
Crawl-delay: 120
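To see how a Crawl-delay line is read by a parser, here is a minimal sketch using Python's standard urllib.robotparser (its crawl_delay method needs Python 3.6 or later). The helper name crawl_delay_for is an assumption for illustration. Crawl-delay is not part of the original robots.txt convention, so bots that do not support it will simply ignore it; the value is usually taken to mean seconds between requests.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# A record that slows msnbot down instead of banning it.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 120
"""


def crawl_delay_for(robots_txt: str, user_agent: str) -> Optional[int]:
    """Return the Crawl-delay that applies to user_agent, or None (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "msnbot"))
    print(crawl_delay_for(ROBOTS_TXT, "Googlebot"))
```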
Scooter
AltaVista's very friendly crawler. It will never use more than 1% of your server's resources, thanks to a nice algorithm: it measures how long a page takes to fetch and waits 100 times that long before fetching the next page.
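The arithmetic behind that 1% figure can be sketched as follows. This is only an illustration of the rule as described here, not AltaVista's actual code; the names polite_delay, server_share and crawl, and the injectable fetch, sleep and clock parameters, are all assumptions.

```python
import time
from typing import Callable, Iterable


def polite_delay(fetch_seconds: float, factor: float = 100.0) -> float:
    """Pause to take after a fetch that lasted fetch_seconds."""
    return fetch_seconds * factor


def server_share(fetch_seconds: float, factor: float = 100.0) -> float:
    """Fraction of elapsed time the server spends serving this crawler."""
    return fetch_seconds / (fetch_seconds + polite_delay(fetch_seconds, factor))


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    sleep: Callable[[float], object] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    factor: float = 100.0,
) -> None:
    """Fetch each URL, then wait factor times as long as the fetch took."""
    for url in urls:
        started = clock()
        fetch(url)
        elapsed = clock() - started
        sleep(polite_delay(elapsed, factor))


if __name__ == "__main__":
    # A page that takes 0.5 s to fetch is followed by a 50 s pause.
    print(polite_delay(0.5), server_share(0.5))
```

With a factor of 100 the crawler is busy for 1 part in 101 of the elapsed time, which is just under 1% regardless of how slow the server is.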
To ban this bot in robots.txt use:
User-agent: scooter
Teoma
Ask Jeeves' web crawler
User-agent: Teoma
Yahoo! Slurp
User-agent: Slurp
More to come, such as:
IBP, ccubee, FAST MetaWeb Crawler, NutchCVS, Findexa Crawler, Vagabondo, W3C-checklink, Wget, Openbot, noxtrumbot, Minuteman, btbot, Java, 1Noonbot, genieknows, YahooFeedSeeker, VoilaBot, w3search, RPT-HTTPClient, MJ12bot, BDFetch, aipbot, Filangy, Baiduspider, appie, Bilbo, Yahoo-MMCrawler, Pogodak, etc.
And of course the BAD bots, such as:
LinkWalker, WebReaper, Schmozilla, OmniExplorer, Picture Finder, etc.
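If you want to ban a whole list of bots at once, the pattern from the NOTE near the top of this page can be generated rather than typed by hand. The sketch below is one possible way to do that; the function name build_ban_record is an assumption, and the User-agent strings are simply the names listed above, which may not match exactly what each bot sends. Keep in mind that robots.txt is advisory, so this only helps with bots that honour it.

```python
from typing import Iterable


def build_ban_record(bot_names: Iterable[str]) -> str:
    """Build one robots.txt record that disallows every named bot (hypothetical helper)."""
    lines = [f"User-agent: {name}" for name in bot_names]
    if not lines:
        return ""
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    bad_bots = ["LinkWalker", "WebReaper", "Schmozilla", "OmniExplorer", "Picture Finder"]
    print(build_ban_record(bad_bots), end="")
```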