Use wget to grab the data you need
I’m starting a series of code snippets here, which will be short bits of useful code. Let’s start with something basic. Need to grab a whole directory of files from an http or ftp server? Use wget.
Grab all of the files with suffix “.tar.gz”:
wget -l2 -r -A tar.gz http://site.goes/here/
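If the short flags are hard to read, here is a rough sketch of the same command with long option names. The function name `fetch_archives` and the `DRY_RUN` variable are made up for illustration, and `--no-parent` is an extra flag that is not in the command above: it keeps wget from wandering up into parent directories.

```shell
#!/bin/sh
# Sketch: the same download spelled out with long options.
# fetch_archives and DRY_RUN are hypothetical names, not part of wget.
fetch_archives() {
    url="$1"
    # --recursive      follow links (same as -r)
    # --level=2        descend at most two levels (same as -l2)
    # --accept tar.gz  keep only files whose names end in tar.gz (same as -A)
    # --no-parent      assumption: stay below the starting directory
    set -- wget --recursive --level=2 --accept tar.gz --no-parent "$url"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run: print the command instead of running it
    else
        "$@"         # actually run wget
    fi
}

fetch_archives http://site.goes/here/
```

By default this only prints the command so you can check it first; run it with DRY_RUN=0 to start the download.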
If there are a lot of files, or they’re huge, consider using the --wait flag to pause for a few seconds between downloads, so that you don’t slam their server.
wget -l2 -r -A txt.gz --wait=10 http://site.goes/here/
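If you want to be even gentler, wget has two more flags that pair well with --wait. This is only a sketch: `--random-wait` and `--limit-rate` are not in the command above, the function name `polite_fetch` and the `DRY_RUN` variable are made up, and the default values are arbitrary.

```shell
#!/bin/sh
# Sketch of a "polite" recursive download.
# polite_fetch and DRY_RUN are hypothetical names; the defaults are arbitrary.
polite_fetch() {
    url="$1"
    wait_secs="${2:-10}"   # base pause between requests, in seconds
    rate="${3:-200k}"      # bandwidth cap; 200k means 200 KB/s
    # --random-wait varies each pause between 0.5x and 1.5x the --wait value.
    # --limit-rate caps the download speed.
    set -- wget -l2 -r -A txt.gz \
        --wait="$wait_secs" \
        --random-wait \
        --limit-rate="$rate" \
        "$url"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run: print the command instead of running it
    else
        "$@"         # actually run wget
    fi
}

polite_fetch http://site.goes/here/
```

For example, `polite_fetch http://site.goes/here/ 5 50k` would pause about 5 seconds between files and cap the transfer at 50 KB/s. Set DRY_RUN=0 to run the download for real.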
If the site owner blocks recursive downloads with a robots.txt file (wget honors it by default), think carefully: they probably have a good reason for doing so. If you’re still convinced that wget is the right way, and you’re sure that you won’t be crippling their server (or running up their bandwidth bill), you can have wget ignore the robots.txt file:
wget -l2 -r -A txt.gz --wait=10 -e robots=off http://site.goes/here/
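Before switching robots.txt off, it can help to read what the file actually says. Here is a small sketch that works out where robots.txt lives for a given URL; the helper name `robots_url` is made up, and it assumes a plain scheme://host/path URL.

```shell
#!/bin/sh
# Sketch: derive the robots.txt location from any URL on the site.
# robots_url is a hypothetical helper; it assumes a scheme://host/path URL.
robots_url() {
    scheme="${1%%://*}"    # text before "://", e.g. http
    rest="${1#*://}"       # text after "://", e.g. site.goes/here/
    host="${rest%%/*}"     # drop the path, keep the host
    echo "$scheme://$host/robots.txt"
}

robots_url http://site.goes/here/

# To print the file to the terminal (-q is quiet, -O- writes to stdout):
# wget -qO- "$(robots_url http://site.goes/here/)"
```

The wget line is left commented out so the snippet runs without touching the network; uncomment it to see the rules the site owner has set.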
March 8th, 2010 • Tags: snippets


