Use wget to grab the data you need

March 8th, 2010

I’m starting a series of short, useful code snippets here. Let’s start with something basic. Need to grab a whole directory of files from an HTTP or FTP server? Use wget.

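Grab all of the files with suffix “.tar.gz” (-r recurses through links, -l2 limits the recursion to two levels, and -A keeps only files whose names end in the given suffix):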

wget -l2 -r -A tar.gz http://site.goes/here/

If there are a lot of files, or they’re huge, consider using the --wait flag to pause for a few seconds between downloads, so that you don’t slam their server.

wget -l2 -r -A txt.gz --wait=10 http://site.goes/here/
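
If you run this kind of download often, the flags above can be wrapped in a small shell function. What follows is only a sketch: the name polite_fetch, the DRY_RUN switch and the 200k rate cap are my own assumptions for illustration, not anything wget provides. It adds --random-wait (which varies the pause around the --wait value), --limit-rate and --no-parent to the options already shown.

```shell
#!/bin/sh
# Sketch only: polite_fetch is a hypothetical helper name, not part of wget.
# Usage: polite_fetch SUFFIX URL [WAIT_SECONDS]
# Set DRY_RUN=1 to print the wget command instead of running it.
polite_fetch() {
    suffix=$1
    url=$2
    wait_secs=${3:-10}   # default pause between downloads, as in the post

    # --recursive --level=2   same as -r -l2: follow links two levels deep
    # --no-parent             never climb above the starting directory
    # --accept=SUFFIX         same as -A: keep only files ending in SUFFIX
    # --wait / --random-wait  pause between requests and vary the pause
    # --limit-rate=200k       cap bandwidth (200k is an arbitrary example value)
    set -- --recursive --level=2 --no-parent \
           --accept="$suffix" \
           --wait="$wait_secs" --random-wait \
           --limit-rate=200k \
           "$url"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "wget $*"
    else
        wget "$@"
    fi
}
```

Setting DRY_RUN=1 before calling polite_fetch tar.gz http://site.goes/here/ prints the full command, so you can check it before anything hits the server.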

If the site owner blocks you from using wget with a robots.txt file, think carefully – they probably have good reason for doing so. If you’re still convinced that wget is the right way, and you’re sure that you won’t be crippling their server (or skyrocketing their bandwidth bill), you can have wget ignore the robots.txt file:

wget -l2 -r -A txt.gz --wait=10 -e robots=off http://site.goes/here/
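
Before switching robots off, it can help to read what the site's robots.txt actually says. The file lives at the root of the host, not in the directory you are downloading from. Here is a rough sketch of a helper that works out that location from any URL on the site; the name robots_url is hypothetical, and it assumes a plain scheme://host/path URL.

```shell
#!/bin/sh
# Sketch only: robots_url is a hypothetical helper name.
# Prints the robots.txt location for the host of the given URL.
# Assumes the URL has the simple form scheme://host[/path].
robots_url() {
    url=$1
    scheme=${url%%://*}   # text before "://", e.g. "http"
    rest=${url#*://}      # text after "://", e.g. "site.goes/here/"
    host=${rest%%/*}      # drop the path, leaving "site.goes"
    echo "$scheme://$host/robots.txt"
}

# To read the file (this does touch the network), print it to stdout:
#   wget -q -O - "$(robots_url http://site.goes/here/)"
```

If the rules only exclude paths you don't need, you may not have to override anything.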



One Response to “Use wget to grab the data you need”

  1. Thanks! Very useful!

  • March 8, 2010 at 8:37 pm Ruchira S. Datta
    If they've blocked robots, it's probably more polite to write a small script that includes short sleep delays between fetching files. Here urllib2 is your friend: http://docs.python.org/library/urllib2.html though admittedly it's a bit more work than your wget commands.
  • March 8, 2010 at 8:57 pm Chris Miller
    How is using the "--wait" option in wget any different from what you suggest?
  • March 9, 2010 at 2:43 am Ruchira S. Datta
    Ah, my eyes skipped right over that. It probably isn't (unless you wait a random interval, but I don't do that often). Thanks!
