HTTrack Website Copier
Open Source offline browser

Option panel: Spider




  • Accept cookies

  • Accept cookies generated by the remote server
    If you do not accept cookies, some "session-generated" pages will not be retrieved
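    A command-line sketch (assuming the standard httrack CLI, where -bN controls cookie acceptance: -b0 reject, -b1 accept; the URL and output path are placeholders):
      httrack "https://example.com/" -O ./mirror -b1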


  • Check document type

  • Define when the engine has to check the document type
    The engine must know the document type in order to rewrite file extensions. For example, if a link called /cgi-bin/gen_image.cgi generates a gif image, the generated file will not be called "gen_image.cgi" but "gen_image.gif"
    Avoid "never", because the local mirror could end up with bogus file types
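    Command-line sketch (this option should correspond to -uN / --check-type in the httrack CLI: u0 never check, u1 check if the extension is unknown, u2 always check; verify with "httrack --help"):
      httrack "https://example.com/" -O ./mirror -u2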


  • Parse java files

  • Must the engine parse .java files (Java classes) to look for included filenames?
    It is checked by default
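    Command-line sketch (assuming -jN / --parse-java, where j0 disables the parsing):
      httrack "https://example.com/" -O ./mirror -j0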


  • Spider

  • Must the engine follow remote robots.txt rules when they exist?
    The default is "follow"
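    Command-line sketch (assuming -sN / --robots, documented as s0=never, s1=sometimes, s2=always, with s2 the default; ignoring robots.txt should be done with care):
      httrack "https://example.com/" -O ./mirror -s0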


  • Update hack

  • Attempt to limit transfers by working around known bogus server responses. For example, pages with the same size will be considered "up to date", even if their timestamps seem to differ. This can be useful for many dynamically generated pages, but in rare cases it can leave pages not updated.
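    Command-line sketch (the long flag --updatehack should enable this behavior when updating an existing mirror; confirm the exact name with "httrack --help"):
      httrack "https://example.com/" -O ./mirror --updatehack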

  • Tolerant requests

  • Tolerate wrong announced file sizes, and make requests compatible with old servers
    It is unchecked by default, because this option can cause files to become bogus (incomplete or corrupted)
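    Command-line sketch (the long flag --tolerant should map to this option; this is from memory, so verify with "httrack --help"):
      httrack "https://example.com/" -O ./mirror --tolerant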


  • Force old HTTP/1.0 requests

  • This option forces the engine to use HTTP/1.0 requests and to avoid HEAD requests.
    Useful for some sites with old server versions, or with many dynamically generated pages.
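    Command-line sketch (the long flag is --http-10 if memory serves; confirm with "httrack --help"):
      httrack "https://example.com/" -O ./mirror --http-10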






Back to Home