Mirroring a website with wget

Clueless clients and bad briefs can make a developer’s life very difficult at times, so it sometimes comes in handy to be able to replicate a site in its entirety for review.

Rsync could be used to mirror the files to a local location, but it doesn’t really fit this situation as it requires access to the server the site is hosted on. Wget, on the other hand, will retrieve all the pages over standard HTTP and store them locally. To get started you simply need wget installed, then use the -m (mirror) flag to replicate a site, e.g.:

wget -m http://sitetomirror.com

The mirror flag is the equivalent of executing wget with a few options set. To achieve the same thing you could alternatively execute:

wget -r -N -l inf --no-remove-listing http://sitetomirror.com

But the mirror flag is simply a lot easier to remember!

When replicating a website there are a few other flags that can come in handy. Some of these are:

-w or --wait
Waits a specified number of seconds between each request to the server. Using this flag is normally recommended as it helps reduce the load on the server, especially when replicating larger sites containing lots of files.

-k or --convert-links
Once the site is downloaded, converts any internal links so they work when viewed locally. This flag is normally a great match with the mirror flag when grabbing a copy of a website for review, as it allows you to easily browse your copy of the site in its entirety locally.

--random-wait
Some intrusion detection systems or firewalls may notice the requests from wget being made at similar intervals and block access to the site. The random wait flag varies the delay between requests so that each pause falls between 0.5 and 1.5 times the number of seconds given with the wait flag.
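
The flags above can be combined into a single command. What follows is only a minimal sketch: the function name mirror_site, the DRY_RUN switch and the 2 second default wait are all made up for illustration and are not part of wget itself. The wget options used (-m, -k, -w and --random-wait) are the ones described above.

```shell
#!/bin/sh
# mirror_site URL [WAIT_SECONDS]
# Hypothetical helper: the name, the DRY_RUN switch and the default
# of 2 seconds are assumptions made for this example.
mirror_site() {
    url="$1"
    wait_seconds="${2:-2}"

    # -m             mirror the site
    # -k             convert links so the copy can be browsed locally
    # -w N           base delay of N seconds between requests
    # --random-wait  vary each delay between 0.5 and 1.5 times N
    set -- wget -m -k -w "$wait_seconds" --random-wait "$url"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # only show the command, nothing is fetched
    else
        "$@"         # actually run wget
    fi
}

# Preview the command without contacting the server:
DRY_RUN=1 mirror_site http://sitetomirror.com
```

Calling mirror_site without DRY_RUN=1 runs the download for real. With a wait of 2 seconds, each pause should land somewhere between 1 and 3 seconds.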

Some things to keep in mind:

  • Remember that any server-side code is executed when the server gives the page to wget, so you only get a copy of what was presented at the time you mirrored the site.
  • Just because you can mirror the content from someone’s site doesn’t mean you are free to reuse it: it’s still protected by copyright and should not be used without the permission of the owner.

For more information on wget and how it can be used, visit http://www.gnu.org/software/wget/