JEXL type casting

TIL: when using GATK VariantFiltration, you can set up simple queries in JEXL like
    MQ0 / DP > 0.10
but that will not work as expected if `MQ0` and `DP` are integers: JEXL performs integer division, so the ratio gets truncated to 0 before the comparison ever happens
so you have to force floating-point arithmetic:
    ((MQ0*1.0) / (DP*1.0)) > 0.10
Which is stupid, and I want my hour back.
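The root cause is plain old integer division: with two integer operands, the ratio is truncated before the comparison happens. You can see the same truncation in ordinary shell arithmetic (this is just an illustration of the behavior, not of JEXL itself):

```shell
# Integer division truncates: 3/10 evaluates to 0, so "0 > 0.10" is false.
echo $(( 3 / 10 ))

# Promoting an operand to a float (via awk here) preserves the ratio.
echo "3 10" | awk '{ print ($1 * 1.0) / $2 }'
```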

Cleaning up after yourself

Whoops! It turns out that when you sort bams with sambamba, it uses /tmp/ instead of the working directory (who knew!). So, long story short, I ended up with temp dirs filled or half-filled on a bunch of the blades in our LSF-managed cluster.

To clean up the mess, my first instinct was just to write a little bash script that removed the appropriate directories owned by me, then loop through each blade like so:

bhosts | awk '{print $1}' | while read i;do
  ssh $i bash ~/
done
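
The script being pushed to each blade is elided above; its contents were something like the following (my reconstruction, and the `sambamba*` directory name pattern is an assumption about how the temp dirs were named):

```shell
# Remove my leftover temp directories from this blade's /tmp.
# The 'sambamba*' pattern is an assumed naming convention.
find /tmp -maxdepth 1 -type d -user "$(id -un)" -name 'sambamba*' -exec rm -rf {} +
```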

This approach failed for a number of reasons. First of all, I’ve never logged into most of the hundreds of blades, so I was being prompted for input by ssh:

The authenticity of host 'xxxxxx' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

Secondly, even if the warning hadn’t popped up, ssh was gobbling up STDIN, so only the first blade in the list was being accessed before the loop exited.

I was able to figure out that adding the following to my ssh command skips the auth warning:

-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no

And thanks to some help from Stack Exchange, using a different file descriptor prevented the input from being eaten. The final command looks like this:

bhosts | awk '{print $1}' >hostlist;
while read -u10 HOST;do
  echo $HOST;
  ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no $HOST bash ~/;
done 10<hostlist
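
To see why the extra file descriptor matters, you can reproduce the problem with `cat` standing in for ssh (both will happily consume whatever is sitting on stdin):

```shell
# cat plays the role of ssh here: it eats the rest of stdin,
# so the loop only ever sees the first host.
printf 'a\nb\nc\n' | while read h; do
  echo "host: $h"
  cat > /dev/null
done

# Feeding the list through fd 10 keeps it away from cat/ssh,
# so all three hosts are processed.
printf 'a\nb\nc\n' > hostlist.demo
while read -u10 h; do
  echo "host: $h"
  cat > /dev/null
done 10< hostlist.demo < /dev/null
rm hostlist.demo
```

The first loop prints only `host: a`; the second prints all three hosts.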

Twenty minutes later, all was right with the world and no admins were cursing my name.

GitHub tip

To ignore whitespace in a git diff, use `git diff -w`.

To replicate this behavior on a GitHub diff, add `?w=1` to the URL.

Creating Screenshots in IGV

Just added a little script to github that automatically creates static screenshots for a large number of sites in IGV. The tricky part is getting it to run on a headless computer cluster (since IGV is a memory hog and we may want to look at a large number of sites across a large number of samples).

This is complicated by the fact that IGV insists on pumping its data to a display server, whether or not anyone is actually looking at it. So the first step is to find an open display number (since lots of jobs may land on the same machine):

DISPLAY_NUM=1
until [[ $DISPLAY_NUM -gt 25 ]]; do
    echo "Looking for display on $DISPLAY_NUM..."
    Xvfb :$DISPLAY_NUM -nolisten tcp > /dev/null 2>&1 &
    pid=$!   # PID of the Xvfb we just launched
    sleep 2
    lsof -a -U -p $pid > /dev/null 2>&1
. . .

Then, if we found an open display number and were able to launch a non-displayed display server using Xvfb, we can set the DISPLAY environment variable and run IGV in batch mode:

    if [ "$notfound" == "0" ];then
        export DISPLAY=:$DISPLAY_NUM
        echo $DISPLAY
        java -Xmx8000m -Dapple.laf.useScreenMenuBar=true -jar /gsc/scripts/pkg/bio/igv/installed/igv.jar --batch $1
. . .
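
For reference, the batch file passed in as `$1` is just a plain-text list of IGV batch commands, something along these lines (the genome, paths, and locus here are made up for illustration):

```
new
genome hg19
load /path/to/sample.bam
snapshotDirectory /path/to/screenshots
goto chr1:1000000-1000200
sort base
collapse
snapshot site_chr1_1000000.png
exit
```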

Rubriq, Nature Scientific Reports, and paid peer-review

Nature Scientific Reports recently announced that they would be partnering with Rubriq to offer a fast-track peer review service for an additional fee. Of course, they’re careful to say that it will not affect editorial decisions, but there’s more than a little bit of criticism of this as a ‘pay-to-be-published’ deal. This includes one editor’s public resignation via Twitter.

I signed up for Rubriq as a reviewer about a year ago, figuring I’d see what all the hype was about. At the time, their business model was to offer pre-submission peer review. In other words, people who weren’t very confident in their abilities could pay to get some feedback on their publication and experiments from someone in the field before sending it off to a journal. In this sense, it’s just like a paid editing or consulting gig. This seems totally kosher to me.

Though it’s taken some heat, I also don’t have a problem with the idea of standardization of peer review. Their checklist isn’t that dissimilar from what a lot of journals use now (rank the quality of the research from 1-10, rank the quality of the writing from 1-10, etc) and includes plenty of space for long-form text critiques of the paper. If they could get all publishers to use this, so that submitting to a new journal after a rejection didn’t necessarily mean getting all-new reviews, that would be a good thing.

I’m not sure that I’m comfortable with this next step, though. Payment for peer review that will actually be used by the journal raises some uncomfortable questions. While I have no evidence to suggest that paid reviewers would feel obligated to give good reviews, it’s certainly not a large conceptual leap. Given how much flak science takes from the popular press already, even the appearance of impropriety is something that needs to be guarded against carefully.

For what it’s worth, I’ve only been given one paper to review by Rubriq (well before this deal was announced), and I pretty well ripped it to shreds (politely) before collecting my 100 dollars. I honestly hope that those authors learned something, went back, and tried some of the experiments I suggested. If not, then I feel like their money was wasted, and my consulting fee wasn’t a useful expense for them.

The other part that rubs me the wrong way is that this allows labs with more money to get faster peer-review. That division into haves and have-nots seems counter to the fundamentally democratic ideas underlying the scientific enterprise.

New paper and open publishing

I had a paper go live last week, and you can read it here: Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors

The interesting thing about this journal is that it’s not only open-access, but the peer-review process is completely open. You can see the original article that we submitted, the comments from the peer reviewers (and their names!) and the revisions that we made in response.

For non-scientists, or early grad students who have never submitted a paper, it’s an interesting look behind the curtain.

As someone who has published a few times, it was kind of refreshing to be able to attribute the review comments to specific authors. I think it helped in understanding where they were coming from with their criticisms and how to address them.

On the whole, the experience was a positive one. I feel like signing your name to a review probably makes people less likely to be dismissive (and sometimes just plain mean), and forces reviewers to justify their comments a little better.

Is it the future of publishing? Only time will tell, but I’d certainly publish or review papers for such a journal again.

Bad Project

This Lady Gaga parody brought to you by the Zheng lab here at BCM.

PacBio Revealed

Oliver Elemento has done a pretty remarkable in-depth analysis of the first publicly available PacBio data. It’s all up on his blog, so jump over and read the whole thing, but here are a few of the highlights:

  • The machine only produces about 48k reads per run. By Oliver’s reckoning, this works out to about 6,400 runs to get 10x coverage of a human genome. Ouch.
  • Single-pass sequence accuracy is remarkably low, at just over 80%. I heard rumors that PacBio had accuracy problems, but didn’t expect the error rate to be that ugly.
  • On a more positive note, read length is very high, with several runs *averaging* 2,300 bp, and overall read length averages ~850 bp.
  • Interestingly, there is a positive correlation between read length and quality. This is quite different from other platforms, where read length is limited by the huge drops in quality near the end of the read.

The bottom line, in my mind, is that unless PacBio can solve their problems in accuracy and throughput, they’re going to be relegated to niche applications.

Using BioRuby to fetch citations from Pubmed

As I write my thesis, I’m pulling together information from multiple papers that I’ve been an author on. These papers are in different journals, with wildly different citation styles. Within my thesis, though, citations need to be presented in a consistent format.

Now, if I had all of these citations stored in a reference manager, it would be a piece of cake to just export them all in a common format. Unfortunately, I wasn’t the first author on some of these papers, so I don’t have that data. Fortunately, with Pubmed and a little BioRuby, it’s not too difficult to get it.

Assuming that the paper is archived in Pubmed Central, there is a link on the right-hand side of the page titled “References for this PMC Article”. If you click on it, then tweak the display options at the top, you can retrieve all of the PMIDs for the articles that were cited.

Save those as a list, then use the following BioRuby code to pull down the citation in BibTex format, for easy import into a citation manager:


require 'rubygems'
require 'bio'

# Read one PMID per line from the file given on the command line,
# fetch each record from PubMed, and print it in BibTeX format.[0]).each{|id|
  entry = Bio::PubMed.query(id.chomp)
  medline =
  reference = medline.reference
  puts reference.bibtex
}

I’ll spare you the story of how I started out trying to use regexen to parse through the text of the citations. I pulled together something that sort of worked, but it required a separate regex for each journal, and often returned multiple results that I had to manually disambiguate. Yuck.

Thanks goes to Martin over at FriendFeed for letting me know that PMC had citation info.

Bioinformatics in five years

Over at BioStar, Keith asked:

In five years time, how would the bioinformatics landscape be and what will probably be the main focus(es) in bioinformatics i.e the hottest areas in bioinformatics?

Perhaps you’re looking for daring predictions, but I see lots of incremental progress, especially on the following fronts:

If by “hottest” you mean “number of employees”, I think there will be a large number of openings for Masters-level (or lower) bioinformatics staff. These are the folks who will handle routine munging of huge data sets at most sequencing centers. At present, a lot of that is still handled by either PhDs or grad students. As tools and standards get entrenched, though, you’ll see more and more of it offloaded to technical staff.

There’s bound to be a lot of movement in the health informatics field, building tools that can take in your personal genome sequence and spit out useful medical advice (in a format that’s useful to both patients and clinicians). This involves not only genomics skills, but also mining of the medical literature and building useful and searchable databases.

Though systems biology has been muted a little as the hype wears off, it’s poised to undergo a huge leap forward. With high-throughput data from tens or hundreds of thousands of cells, our models of how the cell works at a network or pathway level are only going to improve.

Other things that will be in demand:

  • Database and other “big data” skills – how are you going to store and access data from millions of genomes? We’re talking petabytes of information here.
  • Visualization – the larger the data gets, the less we’re able to really wrap our heads around it. A few good pictures can often tell us more than a million lines of data.
  • Truly interdisciplinary scientists – not CS people who picked up a little bit of biology, or Bio majors who hack a little perl. We’re going to see the first generation of scientists who have really been trained to straddle the boundary between the two. They’re going to be well-poised to not only do solid research on their own, but be the lynchpins of successful collaborations.

Now, if you asked me where I saw the state of genomics in 5 years, or the state of cancer research, I think I’d have a lot bigger, bolder predictions. I just don’t see that the basic computational and statistical skillsets that bioinformaticians use today are likely to change tremendously. They’ll just get applied to bigger data, become more parallel, and be more in demand.
