Use ‘parallel’ for easy multi-processor execution

I just discovered the parallel utility, which is an easy way to make use of multiple processors while munging data from the shell. I especially like that I can pipe data directly in and out of it, just like the other shell utilities (sed, awk, cut, sort, etc.).
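
As a quick taste of that pipe-friendliness, here's a sketch (it assumes a GNU parallel recent enough to support --pipe, and a hypothetical big_file.txt):

# Split stdin into chunks and run one grep per chunk across the cores;
# -k keeps the output in the same order as the input.
cat big_file.txt | parallel --pipe -k grep 'some_pattern'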

From the examples on that page:
Use ImageMagick’s “convert” command to downsize many images to thumbnails:
ls *.jpg | parallel -j +0 convert -geometry 120 {} thumb_{}

“-j +0” means run one job per CPU core: the +N form adds N to the number of cores detected in the system.
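
If you want to sanity-check what parallel will run before letting it loose, it can print the generated commands instead of executing them (--dry-run is a standard GNU parallel flag, though it's worth confirming your version has it):

ls *.jpg | parallel --dry-run -j +0 convert -geometry 120 {} thumb_{}
# Prints one convert command per .jpg file; nothing is executed.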

Or do it recursively using find:
find . -name '*.jpg' | parallel -j +0 convert -geometry 120 {} {}_thumb.jpg
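
One wrinkle: the names that produces are a bit ugly (photo.jpg becomes photo.jpg_thumb.jpg). GNU parallel also supports the {.} replacement string, which strips the input's extension; a sketch using it for cleaner names:

find . -name '*.jpg' | parallel -j +0 convert -geometry 120 {} {.}_thumb.jpg
# {.} is the input minus its extension, so ./photo.jpg -> ./photo_thumb.jpg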

I suspect this will become an integral part of my pipelines soon.

sort and uniq

This is part two in my series of short primers on useful tools for bioinformatics. I’m continuing the ‘back to basics’ theme here and talking about two shell commands today. sort and uniq are two utilities that I wish I had known about a few years sooner, as I now use them pretty much daily while munging data.

Sort does exactly what it says on the box: it lets you order your data. In the simplest case, you might have a file containing a list of gene names, and you’d sort it like this:

sort infile > outfile

Or it might contain numbers, in which case you’d want to sort it numerically; without -n, sort compares text, so 10 would land before 2:

sort -n infile > outfile

Now, let’s say you’re working with tab-delimited data, which might contain genomic coordinates and scores:

chr1 100100 100200 19
chr1 100050 100099 3
chr1 100300 100600 20

To sort it numerically by the 4th column, we can use the -n and -k flags:

sort -n -k4 infile > outfile
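
Run against the three example lines above, that puts the scores in ascending order:

chr1 100050 100099 3
chr1 100100 100200 19
chr1 100300 100600 20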

If we want to reverse that order, add the -r flag:

sort -nrk4 infile > outfile

What about sorting by multiple columns? By default, -k defines a key that starts at the specified column and continues to the end of the line. To restrict a key to a single column, we give it an end position with the comma syntax (-k2,2 means “start at column 2, end at column 2”). A trailing n makes just that key numeric, so the chromosome sorts as text while start and stop sort as numbers. To sort the above file by genomic coordinates (chr, start, stop), we can use something like this:

sort -k1,1 -k2,2n -k3,3n infile > outfile
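
One caveat: by default, sort splits fields on runs of whitespace. For strictly tab-delimited data, it’s safer to pin the delimiter down explicitly (the $'\t' syntax here is a bash-ism):

sort -t $'\t' -k1,1 -k2,2n -k3,3n infile > outfile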

After sorting your data, you can then scan it for unique lines using the uniq command. Note that uniq only compares adjacent lines, which is why the sort comes first. Basic usage keeps the first instance of every value and throws away the duplicates:

sort infile | uniq > outfile

You can also output only the unique values (throwing away anything that appears more than once):

sort infile | uniq -u > outfile

Or output just the duplicates (throwing away anything that appears only once):

sort infile | uniq -d > outfile

It will even condense the lines and prepend a count of how many times each value appears in the file:

sort infile | uniq -c > outfile
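
To make the differences concrete, suppose a hypothetical genes.txt holds the four lines TP53, BRCA1, TP53, EGFR. The variants behave like this:

sort genes.txt | uniq      # BRCA1, EGFR, TP53   (duplicates collapsed)
sort genes.txt | uniq -u   # BRCA1, EGFR         (lines appearing exactly once)
sort genes.txt | uniq -d   # TP53                (lines appearing more than once)
sort genes.txt | uniq -c   # 1 BRCA1, 1 EGFR, 2 TP53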

As a simple example of real-world usage, let’s imagine that I have lists of differentially expressed genes generated by two cell line experiments, and I want to find out which genes are altered in both systems. One way to do this would be to write a little script that reads in each file, hashes all the gene names, and outputs any name that appears in both lists. With these tools, we can instead do it in one line from the shell:

cat file1 file2 | sort | uniq -d
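
One caveat: this assumes each file is internally duplicate-free; a gene listed twice in file1 alone would also be reported. Deduplicating each list first guards against that (the <( ) process substitution is a bash feature):

cat <(sort -u file1) <(sort -u file2) | sort | uniq -d
# sort -u drops within-file repeats, so only genes present in both files survive uniq -d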

Lessons for article recommendation services

Today someone proposed the creation of a sub-reddit where scientists could recommend papers to each other. While it’s a nice thought, I can almost guarantee that it’s going to be a failed effort. There are already sites like Faculty of 1000, which try to use panels of experts to recommend good papers. In my experience, they mostly fail at listing things that I want to read.

The main reason such sites are useless is that we scientists are uber-specialized, so what you think is the greatest paper ever will likely hold very little interest for me. It’s not that I don’t want to read about cool discoveries in other fields; it’s just that I don’t have time to. Until someone invents the Matrix-esque brain-jack for rapid learning, I have to prioritize my time, and my field and my work will always come first.

There are only two systems I’ve found that work well. The first is recommendation based on what you’ve read in the past and what your colleagues are reading. CiteULike, for example, points me to users who have bookmarked papers similar to mine, and perusing their libraries gives me an excellent source of material. The other quality source of recommendations is FriendFeed, where I can subscribe to the feeds of other bioinformaticians with similar interests, and we can swap links to papers and comments about them.

Both of these systems are all about building micro-communities, with a focus that you can’t achieve in larger communities like Reddit. In this way, it’s sort of like a decentralized version of departmental journal clubs, or specialized scientific conferences. Any site that ignores the value of creating this type of community is pretty much doomed to failure from the start.

(reposted from my personal blog)