HTTrack Website Copier - Cache format specification

Open Source offline browser

Cache format specification

For updating purpose, HTTrack stores original (untouched) HTML data, references to downloaded files, and other meta-data (especially parts of the HTTP headers) in a cache, located in the hts-cache directory. Because local html pages are always modified to "fit" the local filesystem structure, and because meta-data such as the last-Modified date and Etag can not be stored with the associated files, the cache is absolutely mandatory for reprocessing (update/continue) phases.

The (new) cache.zip format

The 3.31 release of HTTrack introduces a new cache format, more extensible and efficient than the previous one (ndx/dat format). The main advantages of this cache are:

One single file for a complete website cache archive
Standard ZIP format, that can be easily reused on most platforms and languages
Compressed data with the efficient and opened zlib format

The cache is made of ZIP files entries ; with one ZIP file entry per fetched URL (successfully or not - errors are also stored).
For each entry:

The ZIP file name is the original URL [see notes below]
The ZIP file contents, if available, is the original (compressed, using the deflate algorythm) data
The ZIP file extra field (in the local file header) contains a list of meta-fields, very similar to the HTTP headers fields. See also RFC.

The ZIP file timestamp follows the "Last-Modified-Since" field given for this URL, if any

Example of cache file:

$ unzip -l hts-cache/new.zip
Archive:  hts-cache/new.zip
HTTrack Website Copier/3.31-ALPHA-4 mirror complete in 3 seconds : 5 links scanned, 
3 files written (16109 bytes overall) [17690 bytes received at 5896 bytes/sec]
(1 errors, 0 warnings, 0 messages)
  Length     Date   Time    Name
 --------    ----   ----    ----
       94  07-18-03 08:59   http://www.httrack.com/robots.txt
     9866  01-17-04 01:09   http://www.httrack.com/html/cache.html
        0  05-11-03 13:31   http://www.httrack.com/html/images/bg_rings.gif
      207  01-19-04 05:49   http://www.httrack.com/html/fade.gif
        0  05-11-03 13:31   http://www.httrack.com/html/images/header_title_4.gif
 --------                   -------
    10167                   5 files

Example of cache file meta-data:

HTTP/1.1 200 OK
X-In-Cache: 1
X-StatusCode: 200
X-StatusMessage: OK
X-Size: 94
Content-Type: text/plain
Last-Modified: Fri, 18 Jul 2003 08:59:11 GMT
Etag: "40ebb5-5e-3f17b6df"
X-Addr: www.httrack.com
X-Fil: /robots.txt

There are also specific issues regarding this format:

The data in the central directory (such as CD extra field, and CD comments) are not used
The ZIP archive is allowed to contains more than 2^16 files (65535) ; in such case the total number of entries in the 32-bit central directory is 65536 (0xffff), but the presence of the 64-bit central directory is not mandatory
The ZIP archive is allowed to contains more than 2^32 bytes (4GiB) ; in such case the 64-bit central directory must be present (not currently supported)

Meta-data stored in the "extra field" of the local file headers
The extra field is composed of text data, and this text data is composed of distinct lines of headers. The end of text, or a double CR/LF, mark the end of this zone. This method allows to optionally store original HTTP headers just after the "meta-data" headers for informational use.

The status line (the first headers line)
Status-Line = HTTP-Version SP Status-Code SP X-Reason-Phrase CRLF

Other lines:

Specific fields:

X-In-Cache

X-StatusCode

X-StatusMessage

X-Size

X-Charset

X-Addr

X-Fil

X-Save

Standard (RFC 2616) "useful" fields:

Content-Type
Last-Modified
Etag
Location
Content-Disposition

Specific fields in "BNF-like" grammar:

X-In-Cache          = "X-In-Cache" ":" 1*DIGIT
X-StatusCode        = "X-StatusCode" ":" 1*DIGIT
X-StatusMessage     = "X-StatusMessage" ":" *<TEXT, excluding CR, LF>
X-Size              = "X-Size" ":" 1*DIGIT
X-Charset           = "X-Charset" ":" value
X-Addr              = "X-Addr" ":" scheme ":" "//" authority
X-Fil               = "X-Fil" ":" rel_path
X-Save              = "X-Save" ":" rel_path

RFC standard fields:

Content-Type        = "Content-Type" ":" media-type
Last-Modified       = "Last-Modified" ":" HTTP-date
Etag                = "ETag" ":" entity-tag
Location            = "Location" ":" absoluteURI
Content-Disposition = "Content-Disposition" ":" disposition-type *( ";" disposition-parm )

And, for your information,

X-Reason-Phrase     = *<TEXT, with a maximum of 32 characters, and excluding CR, LF>

Note: Because the URLs may have an unexpected format, especially with double "/" inside, and other reserved characters ("?", "&" ..), various ZIP uncompressors can potentially have troubles accessing or decompressing the data. Libraries should generally handle this peculiar format, however.