Retrieve website with wget

Snippet
-nc                  no clobber: if a file has already been downloaded, don't overwrite it or save a numbered duplicate
-e robots=off        ignore robots.txt
-nH                  don't create host-name subdirectories
-r                   recursive
--user-agent="..."   send a browser user-agent string, since wget's default is often blocked
--random-wait        don't hammer the server: randomize each pause to between 0.5 and 1.5 times the --wait value
--wait=3             base wait time in seconds, which --random-wait varies
-o                   log file location (wget's progress messages go here; the downloaded files still go to the working directory)
http://example.com/  the site to retrieve

You will also likely want the -p (--page-requisites: fetch the images, CSS, and scripts needed to display each page) and -k (--convert-links: rewrite links in the saved files so they work locally) options.
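For example, a mirror-for-offline-viewing command combining -p and -k with the flags above might look like this (the user-agent string is abbreviated and the log path is illustrative):

```shell
# -p fetches page requisites (CSS/JS/images); -k rewrites links in the
# saved HTML to point at the local copies, so pages render offline.
wget -nc -e robots=off -nH -r -np -p -k \
     --wait=3 --random-wait \
     --user-agent="Mozilla/5.0" \
     -o mirror.log \
     http://example.com/
```

Note that -k runs after the whole retrieval finishes, since wget can only rewrite a link locally once it knows whether the target was actually downloaded.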

Consider also:

       -np
       --no-parent
           Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that
           only the files below a certain hierarchy will be downloaded.

And

  -A acclist, --accept acclist
  -R rejlist, --reject rejlist
        comma-separated lists of file-name suffixes or patterns to accept or reject, e.g. -A "*.mp3" or -A '*.mp3' (quote the pattern so the shell doesn't expand it first)
wget --random-wait -nc --wait=3 -e robots=off -nH -r --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" -o ../path/to/output.log http://example.com/
wget -A '*.png' --random-wait -nc --wait=2 -e robots=off -nH -r -np --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" -o output.log https://example.org/images/
