Listing of web crawlers that do not support compression

If you are the author of any of these spiders, please add support for content compression when you crawl the web. It saves bandwidth both on your crawling system and on the servers that you crawl.

Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the addition of a single line of code will enable compression support.

    $ua->default_header('Accept-Encoding' => 'gzip');

You then need to make sure that you always use 'decoded_content' (rather than 'content', which returns the body still compressed) when dealing with the response object.

For other languages, all you need to do is add

    Accept-Encoding: gzip

to the HTTP request that you send, and then be prepared to handle a 'Content-Encoding: gzip' header in the response by decompressing the body.
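
As a rough illustration of the two steps above in another language, here is a minimal Python sketch using only the standard library. The names `fetch` and `decode_body` and the `example-spider/0.1` user agent are made up for this example, and it assumes the server replies either with gzip or with an uncompressed body.

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Return the body with any gzip Content-Encoding removed."""
    # Header values are case-insensitive and may carry stray whitespace.
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # Anything else is returned unchanged; since only gzip was requested,
    # a well-behaved server should not send another encoding.
    return body


def fetch(url: str, user_agent: str = "example-spider/0.1") -> bytes:
    """Fetch a URL, advertising gzip support and decoding the reply."""
    request = urllib.request.Request(
        url,
        headers={"Accept-Encoding": "gzip", "User-Agent": user_agent},
    )
    # urllib does not decompress responses by itself, so do it explicitly.
    with urllib.request.urlopen(request, timeout=30) as response:
        return decode_body(
            response.read(), response.headers.get("Content-Encoding")
        )
```

Keeping the decoding in its own function means the rest of the spider never sees compressed bytes, whichever way the server chose to answer.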

Happily, some of the large spiders do support compression -- Googlebot and Yahoo! Slurp, to name but two. Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression support was a bug that would be fixed shortly.

I typically send an email of the form:

Hi,

I noticed that your web spider does not support content compression. This is a pity as it causes increased network bandwidth usage on my web server for no good reason. Of course, it also increases your bandwidth consumption, but that isn't my problem!

Adding support for content compression can be very easy depending on the implementation language of your spider. See the page http://www.gladstonefamily.net/cgi-bin/shame.pl for a list of the current spiders that do not support content compression, and more information on how to fix it.

Thanks, Philip

Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.

| Crawler | Last IP used |
| --- | --- |
| envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.html) | 98.173.26.143 |
| Gigabot/3.0 (http://www.gigablast.com/spider.html) | 67.16.94.2 |
| ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com) | 67.202.55.193 |
| ia_archiver-web.archive.org | 207.241.227.92 |
| Jakarta Commons-HttpClient/3.1 | 209.133.125.59 |
| libwww-perl/5.805 | 72.18.151.7 |
| Mail.Ru/1.0 | 94.100.181.108 |
| Mozilla | 128.114.56.139 |
| Mozilla/4.0 | 208.115.138.254 |
| Mozilla/4.0 (compatible; MSIE 5.00; Windows 98) | 64.56.68.210 |
| Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2; .NET CLR 3.0.04506.648) | 147.174.1.24 |
| Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0) | 76.212.217.18 |
| Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/customer_webtxt_02.jsp) | 114.111.36.26 |
| Mozilla/4.5 [en] (Win98; I) | 196.205.136.227 |
| MOZILLA/5.0 | 211.203.180.144 |
| Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html) | 208.96.54.76 |
| Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14 | 204.137.64.112 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7 | 70.161.40.105 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4 | 94.38.84.125 |
| Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7 | 69.58.178.30 |
| Mozilla/5.0 (Yahoo-MMCrawler/4.0; mailto:vertical-crawl-support@yahoo-inc.com) | 76.13.20.59 |
| Ocelli/1.4 (http://www.globalspec.com/Ocelli) | 64.128.171.4 |
| SapphireWebCrawler/1.0 (Sapphire Web Crawler using Nutch; http://boston.lti.cs.cmu.edu/crawler/; mhoy@cs.cmu.edu) | 64.88.164.198 |
| Scooter/3.3 | 69.147.79.37 |
| Wget/1.10.2 | 212.49.89.136 |
| Wget/1.10.2 (Red Hat modified) | 189.27.8.19 |
| Yandex/1.01.001 (compatible; Win16; P) | 77.88.22.251 |
| Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/) | 61.247.222.54 |

Comments, problems etc to
Philip Gladstone

Last modified Saturday, 07 June 2008