[wget] How to store an entire website offline
wget --random-wait -r -p -e robots=off -U mozilla http://yoursite.com
OR, as a bash function added to ~/.bash_profile (macOS) or ~/.bashrc (Linux):
# Download an entire website
# -p --page-requisites: get all the elements that compose the page (images, CSS and so on)
# -e robots=off: tell wget not to obey the robots.txt file
# -U mozilla: identify as Mozilla to the server
# --random-wait: wait a random interval between requests to avoid being blacklisted
# Other Useful wget Parameters:
# -k --convert-links: convert links so that they work locally, off-line.
# --limit-rate=20k limits the rate at which it downloads files.
# -b runs wget in the background, so the download continues after logging out.
# -o $HOME/wget_log.txt logs the output to a file.
getwebsite() {
    wget --random-wait -r -p -e robots=off -U mozilla "$1"
}
To use the bash function:
getwebsite http://websitelink.com
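The "other useful" parameters above can be folded into a larger variant of the function. This is a sketch, not the original author's function: the name getwebsite_full and the log path are illustrative, and -o redirects the log that -b would otherwise write to wget-log.

```shell
# Variant that also converts links for offline browsing (-k), caps the
# download rate, runs in the background (-b), and logs to a file (-o).
# The function name and log location are illustrative choices.
getwebsite_full() {
    wget --random-wait -r -p -k -e robots=off -U mozilla \
         --limit-rate=20k -b -o "$HOME/wget_log.txt" "$1"
}
```

Because -b detaches wget from the terminal, check $HOME/wget_log.txt to follow progress.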
wget options

--random-wait
This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds.

-r
Recursive retrieving.

-p
Short for --page-requisites: download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

-e
For executing commands; -e robots=off is the command being sent in this instance.

-U
Short for --user-agent; -U mozilla is equal to --user-agent=mozilla. Identify as Mozilla to the HTTP server.
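For reference, the short flags map one-to-one onto wget's long options. The equivalent command is stored in a variable and printed here rather than executed, so nothing is actually downloaded; http://yoursite.com is the same placeholder used above.

```shell
# Long-option spelling of the same command (printed, not run):
#   -r -> --recursive, -p -> --page-requisites,
#   -e -> --execute,   -U -> --user-agent
cmd='wget --random-wait --recursive --page-requisites --execute robots=off --user-agent=mozilla http://yoursite.com'
echo "$cmd"
```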