Web Archiving - Harvesting, Capture, Management, Access
- Archive-It (Internet Archive)
http://www.archive-it.org/
- BnfArcTools (BAT)
http://bibnum.bnf.fr/downloads/bat/ - Perl package for processing ARC (web page content), DAT (metadata) and CDX (index) files
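The CDX index files that BAT processes are plain-text, space-delimited lines, one per archived resource. As a sketch of what such a parser does, the following Python snippet splits one line into named fields, assuming the common 11-field layout (urlkey, timestamp, original URL, MIME type, status code, digest, redirect, robot flags, length, offset, filename) used by Wayback-style tools; the field order and sample line are illustrative, since real CDX files declare their layout in a header line.

```python
# Minimal CDX line parser (sketch). The 11-field layout below is an
# assumption; actual CDX files declare their fields in a header line.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "robotflags", "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected field count: %d" % len(values))
    return dict(zip(FIELDS, values))

# Hypothetical index line for one capture of http://example.org/
record = parse_cdx_line(
    "org,example)/ 20060315123456 http://example.org/ text/html 200 "
    "AAAA1234BBBB5678CCCC9012DDDD3456EEEE7890 - - 2153 843 sample.arc.gz"
)
```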
- Combine - Lund University
http://combine.it.lth.se/ - open system for crawling and indexing Internet resources
- DataFountains
http://ivia.ucr.edu/manuals/stable/DataFountains/current/ - "tool for discovering and describing Internet resources through the use of three distinct Crawlers: Expert Guided Crawler, Targeted Link Crawler, and the Nalanda iVia Focused Crawler"
- DeDuplicator
http://deduplicator.sourceforge.net/ - add-on module for Heritrix to reduce the amount of duplicate data collected in a series of snapshot crawls
- DeepArc - National Library of France (BnF)
http://deeparc.sourceforge.net/ - graphical editor for generating XML from relational databases, in order to capture and archive deep web content
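The core idea behind DeepArc (dumping relational database content as XML so that database-backed "deep web" sites can be archived) can be sketched in a few lines of Python; this is not DeepArc's own code or schema, just an illustration of the table-to-XML mapping, using an in-memory SQLite database as a stand-in for real deep-web content.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table):
    """Serialize every row of a table as <table><row><col>...</col></row>...</table>."""
    cur = conn.execute("SELECT * FROM %s" % table)
    cols = [d[0] for d in cur.description]
    root = ET.Element(table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(cols, row):
            ET.SubElement(row_el, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample database standing in for database-backed site content
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER, title TEXT)")
conn.execute("INSERT INTO article VALUES (1, 'Harvesting the deep web')")
xml_dump = table_to_xml(conn, "article")
```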
- Domain History - DomainTools
http://domain-history.domaintools.com/ - fee-based service providing access to historical whois data (from 2000 onward) for registered domains in .com, .net, .org, .biz, .us, and .info, which can help determine who used to own a given domain name
- Find it! Keep it! - Ansemond
http://www.ansemond.com/ - for personal capture and management of web pages
- Flash Video Downloader - Apple
http://www.apple.com/downloads/macosx/internet_utilities/flashvideodownloader.html
- Furl
http://www.furl.net/ - bookmarking service that supports saving of the web resources bookmarked, which can also be exported as ZIP files (and metadata in XML), though saved copies within Furl are only available to the person who bookmarked them, and reportedly images are not saved
- Hanzo:web
http://www.hanzoweb.com (See also Hanzo Forge and Hanzo Enterprise at http://www.hanzoarchives.com/home/)
- Heritrix
http://crawler.archive.org/ - Internet Archive's open-source web crawler project
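At the heart of any crawler like Heritrix is a loop of fetching a page, extracting its links, and resolving them against the page's base URL before queueing them. Heritrix itself is written in Java; the Python sketch below illustrates only the link-extraction step, using the standard-library HTML parser, and is not taken from Heritrix's implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href> attributes."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("http://example.org/dir/")
parser.feed('<p><a href="page.html">one</a> <a href="/top.html">two</a></p>')
```

A real crawler would feed these resolved URLs back into a frontier queue, deduplicate them, and honor robots.txt and politeness delays.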
- HTTrack Web Site Copier
http://www.httrack.com/ - open-source offline browser, which attempts to reflect the site's relative link structure, can update an existing mirrored site, and supports resuming interrupted downloads (see also ProxyTrack, http://www.httrack.com/proxytrack/ - "a standalone project aimed to help web archivists to easily build caches based on websites downloaded by httrack")
- IIPC Toolkit
http://netpreserve.org/software/toolkit.php
- InfoMonitor - Cooper, Brian F., and Hector Garcia-Molina. "InfoMonitor: Unobtrusively Archiving a World Wide Web Server." International Journal on Digital Libraries 5, no. 2 (2005): 106-19
http://www.brianfrankcooper.net/pubs/fmpaper.pdf
- iPROXY - web pages are captured when the proxy requests them from the server. See: Rao, Herman Chung-Hwa, Yih-Farn Chen, and Ming-Feng Chen. "A Proxy-Based Personal Web Archiving Service." ACM SIGOPS Operating Systems Review 35, no. 1 (2001): 61-72
http://doi.acm.org/10.1145/371455.371462
- Kellogg, David. "Evaluation of Open Source Spidering Technology" 2004
http://www.diglib.org/aquifer/oct2504/spidereval.pdf
- metacafe-dl
http://www.arrakis.es/~rggi3/metacafe-dl/ - command-line application for downloading videos from MetaCafe.com
- MetaProducts
http://www.metaproducts.com/ - several commercial capture and off-line browsing tools
- mod_oai
http://www.modoai.org/ - module to add Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) support to an Apache web server, which can support web capture by identifying all existing or recently added resources from a site
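An OAI-PMH harvester talks to a server such as mod_oai through simple HTTP GET requests whose query string carries a verb (e.g. ListIdentifiers) plus arguments; the optional from argument is what enables the "recently added resources" use described above. The sketch below only constructs such a request URL (no network access); the endpoint address is hypothetical.

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **kwargs):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(kwargs)
    return base_url + "?" + urlencode(params)

# "http://www.example.org/modoai" is a hypothetical mod_oai endpoint.
# "from" is a reserved word in Python, hence the dict unpacking.
url = oai_request("http://www.example.org/modoai", "ListIdentifiers",
                  metadataPrefix="oai_dc", **{"from": "2006-01-01"})
```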
- Nalanda iVia Focused Crawler (NIFC) - Soumen Chakrabarti, Indian Institute of Technology Bombay
http://ivia.ucr.edu/projects/Nalanda/ - "identifies significant Internet resources within specific communities of shared subject interest, and represents an appropriately scaled approach for many library and academic community applications"
- NetarchiveSuite - State and University Library of Denmark and The Royal Library of Denmark
http://netarchive.dk/suite (source code at http://netarkivet.dk/kildetekster/index-en.php) [Java ARC utilities; ProxyViewer for browsing ARC files]
- NutchWAX (Nutch - Web Archive eXtensions)
http://archive-access.sourceforge.net/projects/nutch - bundling of Nutch and extensions that can be used to search Web Archive Collections (WACs)
- Off-line Browsing Bots - BotSpot
http://www.botspot.com/BOTSPOT/Windows/Download_Bots/Off-line_Browsing_Bots/index.html
- PageVault - Project Computing
http://www.projectcomputing.com/products/pageVault/ - supports the capturing and indexing of all HTTP responses generated by a web server, as well as defining types of information that will not be captured
- RafaBot - Spadix Software
http://www.spadixbd.com/rafabot/ - "can download websites from a starting URL, search engine results or web dirs and able to follow external links," supports filtering and crawling of password-protected sites (their free off-line browser is BackStreet Browser - http://www.spadixbd.com/backstreet/index.htm)
- Spurl
http://www.spurl.net/ - bookmarking service that supports saving of the web resources bookmarked (not just their URLs)
- SuperBot - Sparkleware
http://www.sparkleware.com/superbot/index.html - commercial off-line browser
- SurfSaver - askSam Systems
http://www.surfsaver.com/ - commercial off-line browsing add-on to Internet Explorer
- Teleport Webspiders - Tennyson Maxwell Information Systems
http://www.tenmax.com/teleport/home.htm - a variety of features to support multithreaded retrieval, password-protected access, filtering, batch capture and management of derived databases
- Thomasen, Bo Hovgaard. "Test of Software and Strategies for Micro-Archiving Websites." Aarhus, Denmark: Centre for Internet Research, 2004
http://www.cfi.au.dk/publikationer/archiving/test.pdf (see also "Test of Archiving Software for Micro-Archiving Websites," 2004, http://www.cfi.au.dk/publikationer/archiving/#test)
- Tools for Setting up a Web Archiving Chain
International Internet Preservation Consortium
http://www.netpreserve.org/software/downloads.php
- TTApache - "transaction-time HTTP server that supports document versioning" by indicating, at the time of an HTTP request, when the last update was made to any of the files that is part of a document on the Web
See: Dyreson, Curtis E., Hui-ling Lin, and Yingxia Wang. "Managing Versions of Web Documents in a Transaction-Time Web Server." In Proceedings of the 13th International Conference on World Wide Web, 422-32. New York, NY: Association for Computing Machinery, 2004
http://doi.acm.org/10.1145/988672.988730
- Virtual Remote Control - Cornell University Library
Web Tool Resources, http://prism.library.cornell.edu/VRC/resources.html and Tool Inventory, http://prism.library.cornell.edu/VRC/tool.php
- Warrick
http://warrick.cs.odu.edu/ - command-line utility for reconstructing or recovering a website when a back-up is not available
- WAXToolbar
http://archive-access.sourceforge.net/projects/waxtoolbar/ - Firefox extension with a search field for querying the Wayback Machine or searching a full-text NutchWAX index; requires access to a Wayback Machine installation
- Wayback (Internet Archive) - open-source Java implementation of the Internet Archive Wayback Machine
http://archive-access.sourceforge.net/projects/wayback/
- Wayfinder
http://wayfinder.webarchivist.org/
- Web Archives Workbench (WAW) - OCLC
http://webarchives.oclc.org/WAW/
- Web Archiving Resources - Harvard University Library
http://hul.harvard.edu/ois/projects/webarchive/resources.html (see especially the Tools Section, http://hul.harvard.edu/ois/projects/webarchive/resources.html#tools)
- WebArchivist Software Suite - University of Washington and the SUNY Institute of Technology
http://www.webarchivist.org/
- WebBase - Stanford
http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
- Web Capture Tools
http://www.ils.unc.edu/callee/ermlinks/cap_web.htm
- WebCopier - MaximumSoft
http://www.maximumsoft.com/index.html - supports parsing and integrity checking for various formats, crawl scheduling and extraction of links from compressed Flash (.SWF) files
- Web Curator Tool (WCT) - National Library of New Zealand and British Library, initiated by the International Internet Preservation Consortium, developed by Sytec Resources Ltd
http://webcurator.sf.net/ - tool for managing the selective web harvesting process, designed for use in libraries and other collecting organisations by "non-technical users"
- Web Page Archiver - INEXP Software
http://www.inexp.com/page_archiver/ - add-on for Microsoft Internet Explorer for saving sites in Compiled HTML Help (CHM) format
- WERA (Web ARchive Access)
http://archive-access.sourceforge.net/projects/wera/ - free software for searching and navigating archived web document collections, much like Wayback Machine but also allows for full-text search
- Wget (GNU Project)
http://www.gnu.org/software/wget/ - free command-line utility for retrieving files over HTTP, HTTPS, and FTP, with support for recursive retrieval and mirroring of entire sites
- youtube-dl
http://www.arrakis.es/~rggi3/youtube-dl/ - command-line application for downloading videos from YouTube
- Zylox - Internet Researcher and Offline Commander
http://www.zylox.com/ - commercial off-line browsing software for Windows