Robots (also known as spiders, wanderers, worms, crawlers, gatherers, and intelligent agents) follow links from one web page to another. They work with indexing code to store data for later searching.
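The link-following step can be illustrated with a minimal sketch in Python, using only the standard library. It is not taken from any of the packages listed on this page; the names `LinkExtractor` and `extract_links` are made up for illustration. It shows how a robot might pull the links out of one fetched page and resolve them against the page's own address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop
                # any #fragment, since it points into the same page.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A robot would call `extract_links` on each page it downloads and hand the page text to the indexing code separately.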
Robot Source Code
There is a good deal of free open source code available -- you don't have to start from scratch. Take a look at some of the options below, in the programming language best suited for your needs. If you'd like to contract your robot out, see the Robots Consultants page.
Useful Links
* Robot Spider Coding Checklist at SearchTools.com
* Bot2001
  o In Search Of Search Bots by Brian Profitt
    Describes a presentation by Sundar Kadayam, CTO of Intelliseek, on the nature of sophisticated search bots that go beyond simply gathering static data. It explains how an advanced metadata agent (such as Intelliseek) works: selecting the best information sources, sending the query and receiving results, post-processing to organize the results, presenting them, and offering updates on the query in the future.
  o BotSpot Feb. 14 2001 Newsletter
    Conference panel suggestions for learning to program robots.
Perl
Harvest NG
The Gatherer module is the robot which follows the links
Combine Harvesting Robot
Powerful and flexible robot control
Libwww (Perl 5) and Libwww (Perl 4)
Perl modules for accessing Web pages, including some examples of following links.
Agent Perl, WebReview.com, August 29, 1997, by Ben Smith
Nice tutorial about writing a search indexing spider or robot using Libwww.
MOMspider (Multi-owner Maintenance Spider)
Designed for checking links on multiple servers.
WWW-Robot 0.021 (alternate 0.011 version)
Configurable web traversal engine
Java
Class Acme.Spider
A web-robot that performs a breadth-first crawl and returns URLConnections. Written by the inimitable Jef Poskanzer.
Writing a Web Crawler in the Java Programming Language, Java Developer Connection, January 1998, by Muscle Fish developers
Describes an example program following links to get files, keeping track of those already found. Honors robots.txt. Source code available.
BDDBot
Java robot / search engine / web server
NQL (Network Query Language) Java version
SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers
Sophisticated article from the WWW7 conference about the issues involved in robot crawling. The implementation is in WebSPHINX.
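Several of the entries above describe the same basic technique: a breadth-first crawl that keeps track of pages already found and honors robots.txt. The sketch below shows that idea in Python (used here for brevity, not Java) and is not the code of any package listed. The function name `crawl`, the agent name `ExampleBot`, and the `fetch` callback are assumptions made for illustration; `fetch(url)` is expected to return the page's HTML as a string, or None if the page cannot be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class _Hrefs(HTMLParser):
    """Gathers the raw href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl(start_url, fetch, robots_txt="", agent="ExampleBot", limit=100):
    """Breadth-first crawl from start_url; returns URLs in visit order."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # robots.txt supplied as text

    seen = {start_url}          # every URL ever queued, to avoid repeats
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    visited = []

    while queue and len(visited) < limit:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        visited.append(url)

        parser = _Hrefs()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch testable without a network: a dictionary's `get` method can stand in for a real downloader. In real use the robots.txt rules would normally be loaded from the site itself, for example with `RobotFileParser.set_url()` followed by `read()`, and the crawl would be limited to the intended hosts and paced politely.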
C and C++
W3C Webbot - Libwww Robot
HTTP robot source code in C based on "Libwww", primarily designed to test HTTP/1.1 pipelining, but usable for other purposes.
ht://Dig
Full-featured search engine in C++, contains a sophisticated robot.
SWISH-E
Another full search engine with a robot spider.
Pavuk
A program designed to copy entire sites by following links and gathering the pages. Implemented with an interface for Mac OS X Server as epicware WebGrabber.
Pre-emptive Multithreading Web Spider, MFC Programmer's SourceBook article, June 21, 1998, by Sim Ayers
Tutorial article on making a spider in MFC with a lot of explanation.
Other
TkWWW Robot
Robot code in Tcl/Tk
Commercial Products
Tenmax Dataplex Robot
High-capacity web spider that can handle millions of pages per day, including complex HTML and even JavaScript.
Source Code for Web Robot Spiders
Wednesday, October 29, 2008 at 2:51 AM Posted by Vasu