This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
SiteDig currently has the following features, some of them only partially implemented:
SiteDig is a command line tool, meaning that you start it with a set of options and let it do its job without bugging you. This way, it's easy to call from other processes and scripts.
SiteDig goes through documents found on the internet (the supported protocols depend on the Java implementation, but typically include http, https and ftp), looking for href and src attributes (and user-defined ones) and thus finding new documents to parse. It also rewrites the values of these attributes to relative paths, so that links work when the downloaded documents are read offline (this can be turned off with the -dontrefactor flag). Downloaded copies can then be stored on CDs, handheld devices, and more.
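The link discovery described above can be sketched roughly as follows. This is only an illustration, not SiteDig's actual code: the class name LinkExtractSketch, the method extractLinks and the regular expression are assumptions, the pattern only handles quoted attribute values, and the rewriting to relative paths is not shown.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: find href/src values and resolve them against the page URL.
final class LinkExtractSketch {
    // Matches href="..." or src='...'; unquoted values are ignored (simplification).
    private static final Pattern LINK = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            try {
                // group(2) is the attribute value between the quotes.
                links.add(base.resolve(m.group(2)).toString());
            } catch (IllegalArgumentException e) {
                // Value is not a valid URI (for example, it contains spaces): skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a><img src='/img/logo.png'>";
        System.out.println(extractLinks("http://example.com/docs/index.html", html));
    }
}
```

Each absolute URL returned here would then be a candidate for the download queue.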
SiteDig supports multiple download threads, so execution speed is no longer limited by the upload speed of a single server.
SiteDig lets you set a maximum recursion level and limit what is downloaded to "this dir only", "this dir and subdirs", "this host only" or "this host and subdomains", and to URLs matching certain patterns.
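As a rough illustration of the four scope limits, here is one way such a check could be written. The names ScopeSketch, Scope and inScope are assumptions made for this sketch, not SiteDig's own identifiers, and the recursion level and URL pattern filters are left out.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch of the four download scopes.
final class ScopeSketch {
    enum Scope { DIR_ONLY, DIR_AND_SUBDIRS, HOST_ONLY, HOST_AND_SUBDOMAINS }

    // Directory part of a path: everything up to and including the last '/'.
    static String dirOf(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        return path.substring(0, path.lastIndexOf('/') + 1);
    }

    static boolean inScope(URI start, URI candidate, Scope scope) {
        if (start.getHost() == null || candidate.getHost() == null) {
            return false;
        }
        String startHost = start.getHost().toLowerCase(Locale.ROOT);
        String host = candidate.getHost().toLowerCase(Locale.ROOT);
        boolean sameHost = host.equals(startHost);
        String startDir = dirOf(start.getPath());
        String dir = dirOf(candidate.getPath());
        switch (scope) {
            case DIR_ONLY:
                return sameHost && dir.equals(startDir);
            case DIR_AND_SUBDIRS:
                return sameHost && dir.startsWith(startDir);
            case HOST_ONLY:
                return sameHost;
            case HOST_AND_SUBDOMAINS:
                // The leading dot keeps "badexample.com" from matching "example.com".
                return sameHost || host.endsWith("." + startHost);
            default:
                return false;
        }
    }
}
```

With a start URL of http://example.com/docs/index.html, a page under /docs/sub/ would pass "this dir and subdirs" but fail "this dir only".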
File renaming scheme to include the http query string
Many pages today use URLs such as show.cgi?page=start, where the query string decides which page is shown. SiteDig saves pages with names like this as show_cgi_page_start.html. This way, there is no name conflict when saving files.
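A renaming rule that reproduces the example above might look like the sketch below. The exact rule SiteDig uses is not documented here, so this is an assumption: every run of non-alphanumeric characters becomes an underscore and ".html" is appended, but only when the name contains a query string. QueryNameSketch and localName are hypothetical names.

```java
// Hypothetical sketch of a query-string-aware file naming rule.
final class QueryNameSketch {
    // Takes the last path segment including any query string, e.g. "show.cgi?page=start".
    static String localName(String nameWithQuery) {
        if (nameWithQuery.indexOf('?') < 0) {
            // No query string: keep the original file name.
            return nameWithQuery;
        }
        // Assumed rule: collapse each run of non-alphanumerics into one underscore.
        return nameWithQuery.replaceAll("[^A-Za-z0-9]+", "_") + ".html";
    }

    public static void main(String[] args) {
        System.out.println(localName("show.cgi?page=start")); // show_cgi_page_start.html
    }
}
```

Because the query string becomes part of the file name, show.cgi?page=start and show.cgi?page=news end up in different files.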
File type determination
Using both MIME headers and file extensions, SiteDig determines which files are text-based and thus parseable, and which are binary and should be left intact.
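One plausible way to combine the two signals is sketched below: trust a specific Content-Type when there is one, and fall back to the file extension otherwise. The precedence, the list of text extensions and the names FileTypeSketch and isParseable are all assumptions for illustration, not SiteDig's actual rules.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: decide whether a downloaded file should be parsed for links.
final class FileTypeSketch {
    // Assumed list of text-based extensions.
    private static final Set<String> TEXT_EXTENSIONS = new HashSet<>(
            Arrays.asList("html", "htm", "xhtml", "css", "js", "txt", "xml"));

    static boolean isParseable(String contentType, String path) {
        if (contentType != null) {
            // Drop parameters such as "; charset=UTF-8".
            String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (mime.startsWith("text/") || mime.equals("application/xhtml+xml")) {
                return true;
            }
            // A specific non-text type wins; a generic one falls through to the extension.
            if (!mime.isEmpty() && !mime.equals("application/octet-stream")) {
                return false;
            }
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot <= slash) {
            return false; // no extension in the last path segment
        }
        return TEXT_EXTENSIONS.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

Files for which isParseable returns false would be written to disk unchanged.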
SiteDig lets you send HTTP headers of your choice with each HTTP request. Set your own user agent, referer page and cookies using this option.
SiteDig understands server Set-Cookie headers and sends the appropriate cookies back with later requests (use the -cookies flag).
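The basic idea of cookie handling can be sketched as a small cookie jar: remember the name=value pair from each Set-Cookie header and replay the pairs in a Cookie header. This is a deliberately simplified illustration with hypothetical names (CookieJarSketch, store, cookieHeader); it ignores attributes such as Path, Domain and Expires, which a real implementation would need to honour.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of a minimal cookie jar.
final class CookieJarSketch {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Record the name=value pair from one Set-Cookie header value; attributes are ignored.
    void store(String setCookieValue) {
        String pair = setCookieValue.split(";", 2)[0].trim();
        int eq = pair.indexOf('=');
        if (eq <= 0) {
            return; // no cookie name: ignore the header
        }
        cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
    }

    // Build the value of the Cookie request header from everything stored so far.
    String cookieHeader() {
        StringJoiner joiner = new StringJoiner("; ");
        cookies.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }
}
```

A later Set-Cookie with the same name simply replaces the stored value.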
SiteDig lets you guess URLs based on current ones. This includes trying partial paths, different hosts and IPs, and a regex-based matching procedure that finds URLs not included in src and href attributes.
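The "partial paths" part of URL guessing could work along these lines: for a known URL, also try every parent directory. The sketch below is an assumption about how that might be done, with hypothetical names (PartialPathSketch, parentPaths); guessing hosts and IPs and the regex-based matching are not shown.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive parent-directory URLs from a known URL.
final class PartialPathSketch {
    static List<String> parentPaths(String url) {
        URI uri = URI.create(url);
        List<String> guesses = new ArrayList<>();
        String path = uri.getPath();
        if (path == null || uri.getAuthority() == null) {
            return guesses; // not a hierarchical URL with a host
        }
        String prefix = uri.getScheme() + "://" + uri.getAuthority();
        // Walk the slashes from right to left, emitting one directory per slash.
        int end = path.lastIndexOf('/');
        while (end >= 0) {
            String candidate = path.substring(0, end + 1);
            if (!candidate.equals(path)) {
                guesses.add(prefix + candidate);
            }
            end = path.lastIndexOf('/', end - 1);
        }
        return guesses;
    }

    public static void main(String[] args) {
        System.out.println(parentPaths("http://example.com/a/b/c.html"));
    }
}
```

For http://example.com/a/b/c.html this yields the directories /a/b/, /a/ and /, in that order.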
Free, GPLed, open source.