Log Jam

Ever since I started my first web page back in 1995, I have been fascinated with the server logfiles. Laughing Squid provides access to my logs through the Plesk interface, so I get to scan them occasionally. It’s good for a laugh.
Huh? She laughs over server logs? What a geekess!
Well, duh! They’re, like, sooooo awesome!
Yahoo!’s crawler doesn’t sip, it slurps:

68.142.251.21 - - [23/Nov/2005:04:35:30 -0800] "GET /robots.txt HTTP/1.0" 200 440 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

But what does that jumble mean? It’s easy when you break it down and look at each part.
68.142.251.21 is the Internet Protocol (IP) address of the bot. The address can be traced back to the network the bot is on, usually its Internet Service Provider (ISP) or hosting company, and sometimes even to the exact computer. One way to get more information is to run a reverse DNS lookup or a traceroute on the address. If you don’t have those tools, try the online form at http://www.nwtools.com/. Cut and paste the IP address into the form, select “Lookup”, and there’s the host name, lj2411.inktomisearch.com. Try again, only select XWhois to find out who owns lj2411.
[23/Nov/2005:04:35:30 -0800] is the date and time of the access. Laughing Squid is in San Francisco, so the time is given in Pacific Standard Time, which is eight hours behind GMT (-0800).
"GET /robots.txt HTTP/1.0" is the request the bot sent to my server: the command (GET), the file it asked for (/robots.txt), and the protocol it is using (HTTP/1.0).
200 440 is two numbers. The “200” is the status code the server sent back to Yahoo! Slurp to say the GET command was successful, and the 440 is the size of the response in bytes.
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) is the identifying info sent by the bot, known as the user agent. In this case, Yahoo! Slurp presents itself to the server as a Mozilla-compatible program. Yahoo! Slurp is polite enough to give me a URL I can go to if I have questions. The “-” just before the user agent is the referrer, which is blank because the bot didn’t send one.
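If you'd rather let a script do the breaking down, here is a minimal Python sketch of the same idea. It assumes the log is in the common "combined" format and uses the plain ASCII quotes and hyphens that the server actually writes. The names LOG_PATTERN and parse_line are my own inventions, not something Plesk or the web server provides.

```python
import re
from datetime import datetime

# One named group per field of a "combined" format log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # "23/Nov/2005:04:35:30 -0800" becomes a timezone-aware datetime.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Some servers log "-" instead of a byte count when nothing was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

line = (
    '68.142.251.21 - - [23/Nov/2005:04:35:30 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 440 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"'
)
entry = parse_line(line)
print(entry["ip"], entry["status"], entry["size"])
print(entry["request"])
print(entry["time"].isoformat())
```

Each named group lines up with one of the parts described above, so entry["agent"] holds the Yahoo! Slurp string and entry["time"] keeps the -0800 offset.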
msnbot seems to be crawling the site almost every day. I should take a half hour and calculate what percentage of my traffic is msnbot. In the meantime, I’ve set a time delay between requests, plus disallowed msnbot from large sections of the site for now. MSN has a separate bot that crawls the net looking for images, and since I’m not an artiste I’ll block that one altogether.
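That half hour could probably be cut down with a few lines of Python. This is only a sketch: it assumes the user agent is the last quoted field on each line, as in the combined log format, and it counts a request as msnbot's whenever "msnbot" appears in that field. The function name bot_share and the sample lines are made up for illustration; they are not from my real logs.

```python
def bot_share(lines, bot_name="msnbot"):
    """Return the percentage of log lines whose user agent mentions bot_name."""
    total = 0
    hits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        # In the combined format the user agent is the last quoted field.
        parts = line.rsplit('"', 2)
        agent = parts[1] if len(parts) == 3 else ""
        if bot_name.lower() in agent.lower():
            hits += 1
    return 100.0 * hits / total if total else 0.0

# Invented sample lines (documentation IP addresses, placeholder user agents).
sample = [
    '192.0.2.1 - - [23/Nov/2005:04:35:30 -0800] "GET / HTTP/1.0" 200 100 "-" "msnbot/1.0"',
    '192.0.2.2 - - [23/Nov/2005:04:36:00 -0800] "GET /about HTTP/1.1" 200 200 "-" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [23/Nov/2005:04:37:00 -0800] "GET / HTTP/1.1" 200 100 "-" "ExampleBrowser/1.0"',
    '192.0.2.4 - - [23/Nov/2005:04:38:00 -0800] "GET /links HTTP/1.1" 200 300 "-" "OtherBot/2.0"',
]
print(f"{bot_share(sample):.1f}% of requests came from msnbot")
```

To run it on a real log, pass an open file instead of the sample list. The file name depends on the host, so treat whatever path you use as your own assumption.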
Googlebot isn’t so bad, in part because Google lets webmasters post sitemaps that tell the bot exactly what’s on the site and how often to check back. The details are on Google.
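For a feel of what a sitemap holds, here is a rough Python sketch that writes one in the sitemaps.org 0.9 format. The URLs and change frequencies are placeholders, and build_sitemap is a name I made up. Check Google's own documentation for the exact format it accepts before relying on this.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs and return it as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, changefreq in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # changefreq is a hint to the crawler about how often to check back.
        ET.SubElement(entry, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")

# Placeholder pages for illustration only.
pages = [
    ("http://www.example.com/", "daily"),
    ("http://www.example.com/archive/", "monthly"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The loc element says what is on the site and changefreq says how often to check back, which are the two things the sitemap is there to tell the bot.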
I also like to see who reads robots.txt before crawling. Polite bots read and obey the robots.txt file. There’s not much you can do at the user level if a bot ignores robots.txt.
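To see what a polite bot does with the file, Python's standard library includes a robots.txt parser. The sketch below feeds it a made-up robots.txt (not my real one) with a crawl delay and a couple of disallowed sections for msnbot, plus a full block for a hypothetical image crawler I've called ExampleImageBot.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /archive/
Disallow: /photos/

User-agent: ExampleImageBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# msnbot may fetch the front page, but not the disallowed sections.
print(parser.can_fetch("msnbot", "/index.html"))
print(parser.can_fetch("msnbot", "/archive/2005/"))
# Seconds msnbot is asked to wait between requests.
print(parser.crawl_delay("msnbot"))
# The hypothetical image bot is blocked everywhere.
print(parser.can_fetch("ExampleImageBot", "/index.html"))
# Any other bot falls through to the "*" rules, which allow everything.
print(parser.can_fetch("SomeOtherBot", "/archive/2005/"))
```

This only shows how a well-behaved crawler reads the rules. Nothing in robots.txt forces a rude bot to follow them.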
