[wget] How to store an entire website offline

wget --random-wait -r -p -e robots=off -U mozilla http://yoursite.com

OR, as a bash function added to ~/.bash_profile (macOS) or ~/.bashrc (Linux):

# Download an entire website
# -p --page-requisites: get all the elements that compose the page (images, CSS and so on)
# -e robots=off: don't let wget obey the robots.txt file
# -U mozilla: identify to the server as Mozilla (a browser identity)
# --random-wait: let wget choose a random number of seconds to wait between requests, to avoid getting blacklisted
# Other useful wget parameters (combined in the sketch after the usage example below):
# -k --convert-links: convert links so that they work locally, off-line
# --limit-rate=20k: limit the download rate to 20 KB/s
# -b: run wget in the background, so the download continues after you log out
# -o $HOME/wget_log.txt: write the log to that file instead of standard error

getwebsite() {
    # Quote "$1" so URLs containing special characters (?, &) are passed intact
    wget --random-wait -r -p -e robots=off -U mozilla "$1"
}

To use the bash function:

getwebsite http://websitelink.com
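
If you also want the optional flags from the comment block (link conversion, rate limiting, background mode, logging), a variant of the function might look like the sketch below. The function name, the 20k rate limit and the log path are illustrative choices, not part of the original snippet.

# Sketch: the same base command plus the optional flags described in the comments above
getwebsite_throttled() {
    wget --random-wait -r -p -k -e robots=off -U mozilla \
         --limit-rate=20k -b -o "$HOME/wget_log.txt" "$1"
}

getwebsite_throttled http://websitelink.com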

wget options

  • --random-wait causes the time between requests to vary between 0.5 and 1.5 times the value given by --wait, so the traffic looks less like an automated crawler

  • -r for --recursive, turn on recursive retrieving

  • -p for --page-requisites, download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

  • -e for --execute; runs a command as if it were part of .wgetrc. Here, -e robots=off tells wget to ignore the robots.txt file.

  • -U for --user-agent; -U mozilla is shorthand for --user-agent=mozilla and makes wget identify itself as Mozilla to the HTTP server.
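
Because -e executes a command as if it came from .wgetrc, the robots setting (and the other options) can also be made permanent. A minimal sketch of equivalent ~/.wgetrc entries, assuming you want them applied to every wget run:

# ~/.wgetrc (a sketch; these then apply to every wget invocation)
robots = off
user_agent = mozilla
random_wait = on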