Listing of web crawlers that do not support compression

If you are the author of any of these spiders, then please add support for content compression when you crawl the web. This will save you bandwidth on your crawling system, and it saves bandwidth on the servers that you crawl.

Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the addition of a single line of code will enable compression support.

$ua->default_header('Accept-Encoding' => 'gzip');
You then need to make sure that you always use 'decoded_content' (rather than 'content') when dealing with the response object, so that you see the uncompressed body.

For other languages, all you need to do is to add

Accept-encoding: gzip
to the HTTP request that you send, and then be prepared to decompress the body when the response comes back with a 'Content-Encoding: gzip' header.
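
As a rough illustration of the two steps above in another language, here is a minimal Python sketch using only the standard library. The function names (build_request, decode_body, fetch) and the 'ExampleSpider/0.1' user agent string are made up for this example and are not taken from any particular spider; treating 'x-gzip' as a synonym for 'gzip' follows the HTTP/1.1 specification.

```python
import gzip
import urllib.request


def build_request(url, user_agent="ExampleSpider/0.1"):
    """Build a GET request that advertises gzip support to the server.

    The user agent string is a placeholder for this sketch.
    """
    return urllib.request.Request(
        url,
        headers={"Accept-Encoding": "gzip", "User-Agent": user_agent},
    )


def decode_body(content_encoding, body):
    """Return the uncompressed body, given the Content-Encoding header value.

    Servers are free to ignore Accept-Encoding, so an absent or unrecognised
    encoding means the body is returned unchanged.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body


def fetch(url):
    """Fetch a URL, asking for gzip and decompressing the reply if needed."""
    with urllib.request.urlopen(build_request(url)) as response:
        raw = response.read()
        return decode_body(response.headers.get("Content-Encoding"), raw)
```

Calling fetch('http://www.example.com/') would then return the page bytes whether or not the server chose to compress them. The decoding step is kept separate from the network call so that it can be checked without touching the network.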

Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp do (to name but two). Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression was a bug, which would be fixed shortly.

I typically send an email of the form:

Hi,

I noticed that your web spider does not support content compression. This is a pity as it causes increased network bandwidth usage on my web server for no good reason. Of course, it also increases your bandwidth consumption, but that isn't my problem!

Adding support for content compression can be very easy depending on the implementation language of your spider. See the page http://www.gladstonefamily.net/cgi-bin/shame.pl for a list of the current spiders that do not support content compression, and more information on how to fix it.

Thanks, Philip

Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.

Crawler | Last IP used
envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.html) | 98.173.26.142
Gigabot/3.0 (http://www.gigablast.com/spider.html) | 66.231.188.22
GurujiBot/1.0 (+http://www.guruji.com/en/WebmasterFAQ.html) | 72.20.109.37
ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com) | 72.44.39.21
ia_archiver-web.archive.org | 207.241.229.141
Java/1.6.0_07 | 90.35.212.27
Mozilla/4.0 | 208.115.138.254
Mozilla/4.0 (compatible; MSIE 5.00; Windows 98) | 80.218.126.90
Mozilla/4.0 (compatible; MSIE 5.01; Windows NT) | 67.19.79.218
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; obot) | 194.153.113.20
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0) | 58.239.46.142
Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/delete_main.asp) | 114.111.36.26
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14 | 192.100.116.143
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7 | 69.58.178.30
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty) | 75.70.76.95
msnbot-media/1.0 (+http://search.msn.com/msnbot.htm) | 65.55.212.88
psbot/0.1 (+http://www.picsearch.com/bot.html) | 217.212.224.186
Python-urllib/2.5 | 195.46.41.175
Scooter/3.3 | 69.147.79.37
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/) | 61.247.222.53

Comments, problems etc. to Philip Gladstone

Last modified Saturday, 07 June 2008