Source Code for Web Robot Spiders

Robots (also known as spiders, wanderers, worms, crawlers, gatherers, or intelligent agents) follow links from one web page to another. They work together with indexing code, which stores the data they gather for later searching.
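
To make "following links" concrete, here is a minimal Python sketch of the first step any robot performs: pulling the link targets out of a page and resolving them against the page's own address. It uses only the Python standard library. The names LinkExtractor and extract_links and the example.com URLs are illustrative assumptions, not taken from any of the packages listed on this page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so one page is not recorded under two names.
                absolute, _fragment = urldefrag(absolute)
                # Skip mailto:, javascript: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the list of links found in an HTML string (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/a.html">A</a> <a href="b.html#top">B</a> <a href="mailto:x@example.com">M</a>'
found = extract_links(page, "http://example.com/dir/index.html")
```

A real robot would download the page first and would also need to cope with redirects, character encodings, and malformed markup, all of which this sketch leaves out.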

Robot Source Code

There is a good deal of free open source code available -- you don't have to start from scratch. Take a look at some of the options below, in the programming language best suited for your needs. If you'd like to contract your robot out, see the Robots Consultants page.

Useful Links

* Robot Spider Coding Checklist at SearchTools.com

* Bot2001
o In Search Of Search Bots by Brian Profitt
Describes a presentation by Sundar Kadayam, CTO of Intelliseek, on the nature of sophisticated search bots that go beyond simply gathering static data. Explains how an advanced metadata agent (such as Intelliseek) works: selecting the best information sources, sending the query and receiving results, post-processing to organize the results, presenting them, and offering updates on the query in the future.

o BotSpot Feb. 14 2001 Newsletter
Conference panel suggestions for learning to program robots.

Perl

Harvest NG
The Gatherer module is the robot that follows the links.
Combine Harvesting Robot
Powerful and flexible robot control
Libwww (Perl 5) and Libwww (Perl 4)
Perl modules for accessing Web pages, including some examples of following links.
Agent Perl WebReview.com, August 29, 1997 by Ben Smith
Nice tutorial about writing a search indexing spider or robot using Libwww.
MOMspider (Multi-owner Maintenance Spider)
Designed for checking links on multiple servers.
WWW-Robot 0.021 (alternate 0.011 version)
Configurable web traversal engine

Java

Class Acme.Spider
A web-robot that performs a breadth-first crawl and returns URLConnections. Written by the inimitable Jef Poskanzer.


Writing a Web Crawler in the Java Programming Language Java Developer Connection, January 1998 by Muscle Fish developers
Describes an example program following links to get files, keeping track of those already found. Honors robots.txt. Source code available.

BDDBot
Java robot / search engine / web server

NQL (Network Query Language) Java version

SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers
Sophisticated article from the WWW7 conference about the issues involved in robot crawling. The implementation is in WebSPHINX.
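
Several entries above mention the same core behaviors: a breadth-first crawl, keeping track of pages already found, and honoring robots.txt. The sketch below shows one way these fit together. It is written in Python rather than Java to keep it short, and it is not the code from any of the packages listed. The names breadth_first_crawl, fetch_links, SITE, and the user agent "ExampleBot" are assumptions made for illustration, and the "site" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def breadth_first_crawl(start_url, fetch_links, robots, agent="ExampleBot", max_pages=100):
    """Return URLs in the order a breadth-first crawl visits them.

    fetch_links(url) is a hypothetical callback returning the URLs linked
    from a page; a real robot would download and parse the page here.
    """
    seen = {start_url}          # every URL ever queued, so nothing is queued twice
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        # Honor robots.txt: skip pages the site asks robots not to fetch.
        if not robots.can_fetch(agent, url):
            continue
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up four-page site standing in for real HTTP fetches.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

robots = RobotFileParser()
# Parse rules from a list of lines instead of downloading a robots.txt file.
robots.parse(["User-agent: *", "Disallow: /private/"])

order = breadth_first_crawl("http://example.com/", lambda u: SITE.get(u, []), robots)
```

Here the crawl visits the home page, then /a, then /b. The page under /private/ is skipped because of the Disallow rule, so the page it links to is never discovered. A production robot would also need per-host politeness delays and error handling, which are omitted here.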

C and C++

W3C Webbot - Libwww Robot
HTTP robot source code in C based on "Libwww", primarily designed to test HTTP/1.1 pipelining, but usable for other purposes.

ht://Dig
Full-featured search engine in C++ that contains a sophisticated robot.

SWISH-E
Another full search engine with a robot spider.

Pavuk
A program designed to copy entire sites by following links and gathering the pages. Implemented with an interface for Mac OS X Server as epicware WebGrabber.

Pre-emptive Multithreading Web Spider MFC Programmer's SourceBook article, June 21, 1998 by Sim Ayers
Tutorial article on making a spider in MFC with a lot of explanation.

Other

TkWWW Robot
Robot code in Tcl/Tk

Commercial Products

Tenmax Dataplex Robot
High-capacity web spider that can handle millions of pages per day, complex HTML, and even JavaScript.
