Retrieve website with wget
Snippet
Posted by kitt at 17:52 on 6 December 2015
-nc  no clobber: if the file has already been downloaded, don't overwrite it and don't make a second (or third...) copy
-e robots=off  ignore robots.txt
-nH  don't create host-name subdirectories
-r  recursive retrieval
--user-agent="..."  set the user-agent string, since wget is often banned
--random-wait  don't slurp; vary the pause between requests from 0.5 to 1.5 times the --wait value
--wait=3  base wait time in seconds, used by --random-wait
-o  log file location (wget writes its progress log there, not the downloaded files)
http://example.com/  the website to retrieve
Will also likely want the -p (--page-requisites: download images, CSS, and other files each page needs) and -k (--convert-links: rewrite links in downloaded pages so they work locally) options.
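Combining those two options with the flags above gives a sketch like the following (the URL and log path are placeholders):

```shell
# -p fetches each page's requisites (images, CSS); -k converts links for local viewing
wget -p -k -nc -e robots=off -nH -r --random-wait --wait=3 \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" \
     -o mirror.log http://example.com/
```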
Consider also:
-np, --no-parent  Never ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only files below a certain hierarchy will be downloaded.
And
-A acclist, --accept acclist and -R rejlist, --reject rejlist  comma-separated lists of file name suffixes or patterns to accept or reject, e.g. -A "*.mp3" or -A '*.mp3' (quote the pattern so the shell doesn't expand it before wget sees it)
wget --random-wait -nc --wait=3 -e robots=off -nH -r --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" -o ../path/to/output.log http://example.com/

wget -A '*.png' --random-wait -nc --wait=2 -e robots=off -nH -r -np --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" -o output.log https://example.org/images/