This tool downloads pages from a starting URL and stores page contents (images, iframes, stylesheets, etc.) in a directory structure.
When crawling a site, the tool ignores links to ads, guestbooks, shopping, chat rooms, login pages, and Flash pages; it follows only text links. At each level, the tool follows only links that were inaccessible from the previous level, rather than relying on the (potentially misleading) directory structure of URLs.
The tool spends at most 8 minutes collecting data from each URL. Given these constraints, the tool may not be able to store the specified number of pages at each level.
When the tool finishes collecting the requested data, it archives the data in a gzipped tar file and sends a notification to the specified email address; the message contains instructions for downloading and unpacking the archive.
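As a sketch of the unpacking step, the snippet below builds a stand-in gzipped tar archive and then extracts it. The archive name, directory names, and file contents here are all illustrative; the actual name and download location come from the notification email.

```python
import os
import tarfile

# Build a stand-in archive (the real one is named in the notification
# email; this name and content are purely illustrative).
os.makedirs("site-pages", exist_ok=True)
with open("site-pages/index.html", "w") as f:
    f.write("<html>sample page</html>")
with tarfile.open("crawl-archive.tar.gz", "w:gz") as tf:
    tf.add("site-pages")

# Unpacking mirrors what the emailed instructions describe: extract the
# gzipped tar file into a directory of your choice.
with tarfile.open("crawl-archive.tar.gz", "r:gz") as tf:
    tf.extractall("unpacked")
```

On a system with command-line tar, the equivalent extraction is tar -xzf on the downloaded file.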
Please specify an email address for sending a link to the requested data.
The starting URL can be a single URL or a file containing multiple URLs (one per line). It is suggested that home page(s) be specified as the starting URL(s) for compatibility with the other TANGO tools.
The crawling depth can be set from 1 to 6 levels from the starting URL. The number of pages to download at each level can also be specified. By default the tool crawls 3 levels, downloading 15 pages at level 1 and 3 pages from each of the level 1 pages. This configuration could produce up to 61 pages (1 + 15 + 15*3) from a site. Please use this default configuration if you intend to use this data for subsequent analysis with other TANGO tools.
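As a sanity check on the arithmetic above, the upper bound on stored pages can be computed from the per-level page limits (the function name and parameter are illustrative, not part of the tool):

```python
def max_pages(pages_per_level):
    """Upper bound on stored pages: the starting page, plus the pages
    fetched at level 1, plus n children of each page at the previous
    level for each subsequent level, and so on."""
    total, current = 1, 1
    for n in pages_per_level:
        current *= n          # pages reachable at this level
        total += current
    return total

# Default configuration: 15 pages at level 1, 3 per level-1 page.
print(max_pages([15, 3]))  # → 61
```

Because of the 8-minute time limit per URL, the actual number of stored pages may fall short of this bound.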
The tool can also be run in batch mode, where the starting URL points to a file containing one URL per line.
An optional id can be supplied for each URL in the batch file by separating the id from the URL with a | character (i.e., id|url); otherwise, ids are generated automatically for each URL.
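A minimal sketch of parsing a batch-file line in the id|url form; the fallback id-generation scheme shown here is an assumption for illustration, not the tool's actual scheme:

```python
def parse_batch_line(line, auto_id):
    """Split a batch-file line into (id, url). Lines may use the
    optional 'id|url' form; bare URLs get a generated id (the
    'siteN' naming below is illustrative)."""
    line = line.strip()
    if "|" in line:
        ident, url = line.split("|", 1)
        return ident, url
    return f"site{auto_id}", line

print(parse_batch_line("acme|http://www.acme.com/", 1))
print(parse_batch_line("http://www.example.com/", 2))
```

Splitting on the first | only keeps the parse unambiguous even if a URL itself contains that character.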
The tool makes substitutions using the supplied server and path information, which indicate where you intend to store the archive. If the server URL is not specified, or changes later, you will need to edit the metrics.input.data and scent.input.sorted.data files. Refer to the README file in the generated archive for more information.