Webmirror 2.0, Retrieval Domain

Web mirroring programs usually mirror a single site, or a certain subdirectory of a single site. WEBMIRROR PRO v2.0, on the other hand, mirrors a whole domain of web pages.

In the RDF file, the user defines several include and exclude commands that decide which URLs are downloaded and which are not.

A page is downloaded if its URL matches at least one of the include commands and does not match any of the exclude commands. All these pages form the retrieval domain.
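
The include/exclude rule can be illustrated with a short Python sketch. This is not Webmirror's own code: the function name passes_filters, the example URLs, and the use of shell-style wildcards (via Python's fnmatch module) are all assumptions made for illustration, since the actual pattern syntax is defined by the RDF file format.

```python
from fnmatch import fnmatchcase


def passes_filters(url, include_patterns, exclude_patterns):
    """Return True if url matches at least one include pattern
    and none of the exclude patterns.

    Patterns are treated as shell-style wildcards here; this is an
    assumption for illustration, not Webmirror's pattern syntax.
    """
    included = any(fnmatchcase(url, p) for p in include_patterns)
    excluded = any(fnmatchcase(url, p) for p in exclude_patterns)
    return included and not excluded


# Hypothetical patterns standing in for include and exclude commands.
includes = ["http://www.example.com/*"]
excludes = ["*.zip", "http://www.example.com/private/*"]

for url in (
    "http://www.example.com/docs/index.html",
    "http://www.example.com/files/archive.zip",
    "http://other.example.org/index.html",
):
    print(url, passes_filters(url, includes, excludes))
```

In this sketch an exclude match always wins over an include match, which mirrors the rule stated above.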

To be precise, a page is in the retrieval domain if all of the following hold:

  1. Its URL matches at least one of the patterns defined in the include commands.
  2. Its URL does not match any of the patterns defined in the exclude commands.
  3. The page can be reached from at least one of the start pages by following standard HTML links in fewer than n steps, passing only through pages that are themselves in the retrieval domain. n is defined in the command level.
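
The three conditions together can be read as a breadth-first search that starts at the start pages and never passes through a filtered-out page. The Python sketch below is one possible reading, not Webmirror's implementation. The function retrieval_domain, the in_filter callback (standing for conditions 1 and 2), and the links dictionary (standing in for fetching a page and extracting its HTML links) are hypothetical names. It also assumes that a start page counts as zero steps and must itself pass the filters.

```python
from collections import deque


def retrieval_domain(start_pages, links, in_filter, n):
    """Return the set of URLs in the retrieval domain.

    start_pages: iterable of start URLs (0 steps away from themselves).
    links:       dict mapping a URL to the URLs it links to; a stand-in
                 for downloading the page and parsing its HTML links.
    in_filter:   callable(url) -> bool, True if the URL matches an
                 include pattern and no exclude pattern.
    n:           step limit; a page must be reachable in fewer than n steps.
    """
    domain = set()
    queue = deque()
    for url in start_pages:
        # Assumption: start pages are subject to the filters as well.
        if url not in domain and in_filter(url):
            domain.add(url)
            queue.append((url, 0))

    while queue:
        url, steps = queue.popleft()
        if steps + 1 >= n:
            continue  # a linked page would need n or more steps
        for target in links.get(url, []):
            # Only pages that pass the filters are added, so every path
            # found runs entirely through retrieval-domain pages.
            if target not in domain and in_filter(target):
                domain.add(target)
                queue.append((target, steps + 1))
    return domain


# Tiny hypothetical link graph: "x" is filtered out, so "d" is unreachable.
example_links = {"a": ["b", "x"], "b": ["c"], "x": ["d"], "c": ["e"]}
print(sorted(retrieval_domain(["a"], example_links, lambda u: u != "x", 3)))
```

Because the search is breadth-first, each page is first reached along a shortest path, so the step count compared against n is the minimum number of links needed to reach it.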
