This tool downloads pages from a starting URL and stores page contents (images, iframes, stylesheets, etc.) in a directory structure.
When crawling a site, the tool ignores links to ads, guestbooks, shopping, chat rooms, login pages, and Flash pages; it follows only text links. At each level, the tool follows only links that were inaccessible from the previous level, rather than relying on the (potentially misleading) directory structure of URLs.
The tool spends at most 8 minutes collecting data from each URL. Given these constraints, the tool may not be able to store the specified number of pages at each level.
When the tool finishes collecting the requested data, it archives the data in a gzipped tar file and sends a notification to the specified email address; the message contains instructions for downloading and unpacking the archive.
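As a sketch of the unpacking step, the snippet below builds a stand-in gzipped tar archive and then extracts it. The archive name, directory names, and file contents here are all illustrative; the actual name and download location come from the notification email.

```python
import os
import tarfile

# Build a stand-in archive (the real one is named in the notification
# email; this name and content are purely illustrative).
os.makedirs("site-pages", exist_ok=True)
with open("site-pages/index.html", "w") as f:
    f.write("<html>sample page</html>")
with tarfile.open("crawl-archive.tar.gz", "w:gz") as tf:
    tf.add("site-pages")

# Unpacking mirrors what the emailed instructions describe: extract the
# gzipped tar file into a directory of your choice.
with tarfile.open("crawl-archive.tar.gz", "r:gz") as tf:
    tf.extractall("unpacked")
```

On a system with command-line tar, the equivalent extraction is tar -xzf on the downloaded file.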
Please specify an email address for sending a link to the requested data.
The starting URL can be a single URL or a file containing multiple URLs (one per line). It is suggested that home page(s) be specified as the starting URL(s) for compatibility with the other TANGO tools.
The crawling depth can be set from 1 to 6 levels from the starting URL. The number of pages to download at each level can also be specified. By default the tool crawls 3 levels, downloading 15 pages at level 1 and 3 pages from each of the level 1 pages. This configuration could produce up to 61 pages (1 + 15 + 15*3) from a site. Please use this default configuration if you intend to use this data for subsequent analysis with other TANGO tools.
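As a sanity check on the arithmetic above, the upper bound on stored pages can be computed from the per-level page limits (the function name and parameter are illustrative, not part of the tool):

```python
def max_pages(pages_per_level):
    """Upper bound on stored pages: the starting page, plus the pages
    fetched at level 1, plus n children of each page at the previous
    level for each subsequent level, and so on."""
    total, current = 1, 1
    for n in pages_per_level:
        current *= n          # pages reachable at this level
        total += current
    return total

# Default configuration: 15 pages at level 1, 3 per level-1 page.
print(max_pages([15, 3]))  # → 61
```

Because of the 8-minute time limit per URL, the actual number of stored pages may fall short of this bound.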
The tool can also be run in batch mode, where the starting URL points to a file containing one URL per line.
An optional id can be supplied for each URL in the batch file by separating the id from the URL with a | character (i.e., id|url); otherwise, ids are generated automatically for each URL.
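A minimal sketch of parsing a batch-file line in the id|url form; the fallback id-generation scheme shown here is an assumption for illustration, not the tool's actual scheme:

```python
def parse_batch_line(line, auto_id):
    """Split a batch-file line into (id, url). Lines may use the
    optional 'id|url' form; bare URLs get a generated id (the
    'siteN' naming below is illustrative)."""
    line = line.strip()
    if "|" in line:
        ident, url = line.split("|", 1)
        return ident, url
    return f"site{auto_id}", line

print(parse_batch_line("acme|http://www.acme.com/", 1))
print(parse_batch_line("http://www.example.com/", 2))
```

Splitting on the first | only keeps the parse unambiguous even if a URL itself contains that character.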
The tool makes substitutions using the supplied server and path information, which indicate where you intend to store the archive. If the server URL is not specified, or changes later, you will need to edit the metrics.input.data and scent.input.sorted.data files. Refer to the README file in the generated archive for more information.