sort and uniq

This is part two in my series of short primers on useful tools for bioinformatics. I’m continuing the ‘back to basics’ theme here and talking about two shell commands today. sort and uniq are two utilities that I wish I had known about a few years sooner, as I now use them pretty much daily while munging data.

Sort does exactly what it says on the box: lets you order your data. In the simple case, you might have a file containing a list of gene names, and you’d sort it like this:

sort infile > outfile

Or it might contain numbers, so you’d want to sort it numerically, like so:

sort -n infile > outfile
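To see why the numeric flag matters, here's a quick sketch with a few made-up values (numbers.txt is just a scratch file name I picked for illustration). Without -n, sort compares lines character by character, so 10 and 100 both land ahead of 9:

```shell
# Three made-up values, one per line (numbers.txt is a hypothetical scratch file)
printf '10\n9\n100\n' > numbers.txt

# Default order: lines are compared as text, character by character
lexical=$(sort numbers.txt)

# Numeric order: lines are compared by value
numeric=$(sort -n numbers.txt)

echo "text order:"
echo "$lexical"     # 10, 100, 9
echo "numeric order:"
echo "$numeric"     # 9, 10, 100
```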

Now, let’s say you’re working with tab-delimited data, which might contain genomic coordinates and scores:

chr1 100100 100200 19
chr1 100050 100099 3
chr1 100300 100600 20

To sort it numerically by the 4th column, we can use the -n and -k flags:

sort -n -k4 infile > outfile

If we want to reverse that order, add the -r flag:

sort -nrk4 infile > outfile
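As a rough check, here are both commands run against the three example rows from above. The file name scores.txt is my own placeholder, and I'm assuming the columns are separated by tabs:

```shell
# The three example rows, tab-delimited (scores.txt is a hypothetical file name)
printf 'chr1\t100100\t100200\t19\n' >  scores.txt
printf 'chr1\t100050\t100099\t3\n'  >> scores.txt
printf 'chr1\t100300\t100600\t20\n' >> scores.txt

# Ascending by the score in column 4
by_score=$(sort -n -k4 scores.txt)

# Descending by the score in column 4
by_score_desc=$(sort -nrk4 scores.txt)

echo "$by_score"        # scores run 3, 19, 20
echo "$by_score_desc"   # scores run 20, 19, 3
```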

What about sorting by multiple columns? By default, -k uses a key that starts with the specified column and continues to the end of the line. Since we want only the nth column, we specify both the start and the end of the key using the comma syntax (-k2,2 means “column 2 only”). Adding an n to the end of a key makes just that key numeric. So, to sort the above file by genomic coordinates (chr, start, stop), we can use something like this:

sort -k1,1 -k2,2n -k3,3n infile > outfile
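Here's a small sketch of that multi-key sort on some made-up intervals (regions.txt and the coordinates are invented for illustration). The rows are picked so that each key has some work to do: two chromosomes, a start of 9000 that would sort after 100050 if compared as text, and two rows that tie on chromosome and start:

```shell
# Made-up intervals, tab-delimited (regions.txt is a hypothetical file name)
printf 'chr2\t500\t600\n'          >  regions.txt
printf 'chr1\t100300\t100600\n'    >> regions.txt
printf 'chr1\t9000\t9500\n'        >> regions.txt
printf 'chr1\t100050\t100099\n'    >> regions.txt
printf 'chr1\t100050\t100080\n'    >> regions.txt

# Key 1: chromosome as text; key 2: start as a number; key 3: stop as a number
sorted=$(sort -k1,1 -k2,2n -k3,3n regions.txt)

echo "$sorted"
# chr1 rows come first, ordered by start 9000, 100050, 100050, 100300
# (the two 100050 rows are ordered by stop: 100080, then 100099),
# followed by the single chr2 row
```

One thing to watch for: the first key is compared as text, so with this command chr10 would sort ahead of chr2.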

After sorting your data, you can scan it for unique lines using the uniq command. uniq only compares adjacent lines, which is why the input needs to be sorted first. Basic usage keeps the first instance of every value and throws away the duplicates:

sort infile | uniq > outfile

You can also output only the unique values (throwing away anything that appears more than once):

sort infile | uniq -u > outfile

Or output just the duplicates (throwing away anything that appears only once):

sort infile | uniq -d > outfile

It will even condense the lines and prepend a count of how many times each value appears in the file:

sort infile | uniq -c > outfile
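To compare the four uniq variants side by side, here's a sketch on a tiny made-up gene list (genes.txt and its contents are invented for illustration):

```shell
# Made-up list: TP53 appears three times, BRCA1 twice, MYC once
printf 'TP53\nBRCA1\nTP53\nMYC\nTP53\nBRCA1\n' > genes.txt

# One copy of every value
distinct=$(sort genes.txt | uniq)        # BRCA1, MYC, TP53

# Only values that appear exactly once
singles=$(sort genes.txt | uniq -u)      # MYC

# Only values that appear more than once (one copy each)
repeats=$(sort genes.txt | uniq -d)      # BRCA1, TP53

# Every value with its count in front
counts=$(sort genes.txt | uniq -c)       # BRCA1 x2, MYC x1, TP53 x3

echo "$counts"
```

The amount of whitespace in front of each count varies between implementations, so it's safest not to rely on exact column positions when parsing the -c output.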

As a simple example of real-world usage, let’s imagine that I have lists of differentially expressed genes generated by two cell line experiments. I want to find out which genes are altered in both systems. One way to do this would be to write a little script to read in each file, then hash all the gene names and output any names that appear in both lists. With these tools, we can do it in one line from the shell (provided each gene appears at most once in each list):

cat file1 file2 | sort | uniq -d
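Here's a sketch of that on two made-up gene lists (the gene names and file contents are invented for illustration). One caveat: a gene listed twice in the same file would also be reported by uniq -d, even if it's missing from the other file. A cautious variation, which is my own addition rather than part of the one-liner above, is to de-duplicate each file with sort -u before combining them:

```shell
# Two made-up lists of differentially expressed genes
printf 'TP53\nMYC\nEGFR\nMYC\n' > file1     # note: MYC is listed twice here
printf 'BRCA1\nEGFR\nTP53\n'    > file2

# The plain one-liner: MYC is reported even though it is only in file1
naive=$(cat file1 file2 | sort | uniq -d)

# De-duplicate each list first, so only genes present in both files remain
shared=$( { sort -u file1; sort -u file2; } | sort | uniq -d )

echo "$naive"    # EGFR, MYC, TP53
echo "$shared"   # EGFR, TP53
```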