PacBio Revealed

Olivier Elemento has done a pretty remarkable in-depth analysis of the first publicly available PacBio data. It’s all up on his blog, so jump over and read the whole thing, but here are a few of the highlights:

  • The machine only produces about 48k reads per run. By Olivier’s reckoning, this works out to about 6,400 runs to get 10x coverage of a human genome. Ouch.
  • Single-pass sequence accuracy is remarkably low, at just over 80%. I’d heard rumors that PacBio had accuracy problems, but didn’t expect the error rate to be that ugly.
  • On a more positive note, read length is very high, with several runs *averaging* 2,300 bp and an overall average read length of ~850 bp.
  • Interestingly, there is a positive correlation between read length and quality. This is somewhat different from what we see on other platforms, where read length is limited by the huge drops in quality near the end of the read.

The bottom line, in my mind, is that unless PacBio can solve their problems in accuracy and throughput, they’re going to be relegated to niche applications.

The problem of prioritization

Dan Koboldt just threw up a great post talking about sifting through the hundreds of mutations that we’re finding in each genome to find those that actually, y’know, mean something.

“we are facing two significant challenges. First, identifying the subset of variants that have functional significance – separating the wheat from the chaff, if you will. Second, understanding how these functional variants contribute to a phenotype. This is soon to be the frontier in genetics and genomics.”

I couldn’t agree more, and since my comment on his post got to be a little longer than I intended, I decided to reproduce it (edited slightly) over here.

I’ve been tackling similar ideas as part of my thesis work. Specifically, we’ve been developing tools that go beyond simple recurrence and look at mutational patterns that can give insight into the significance and functional role of mutations.

The easiest one to think about is mutual exclusivity. If I have part of an oncogenic pathway with two genes (A and B), then we expect that mutations in either one may be enough to disrupt the system, and there will be no selective pressure for mutation in the other. So if we assay a panel of tumors and see that half the tumors have a mutation in gene A, and the other half have a mutation in gene B, with no overlap, it’s quite likely that the mutations play similar functional roles. By detecting these patterns, we can create testable hypotheses about how genes interact, even if they’re not represented in functional databases.
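To make the mutual exclusivity idea concrete, here is a minimal sketch in Python. It is a generic illustration, not the tools mentioned above: it uses a one-sided Fisher's exact test (my choice of technique) to ask how likely it is to see so little overlap between two genes' mutations if they were mutated independently. The function name and the toy tumor panel are hypothetical.

```python
from math import comb


def mutual_exclusivity_pvalue(a_mutated, b_mutated):
    """One-sided Fisher's exact test for less overlap than expected.

    a_mutated, b_mutated: equal-length lists of booleans, one entry per tumor,
    saying whether gene A (or gene B) is mutated in that tumor.
    Returns (observed overlap, P(overlap <= observed)) under independence,
    with the number of A-mutated and B-mutated tumors held fixed.
    """
    if len(a_mutated) != len(b_mutated):
        raise ValueError("both genes must be scored on the same tumors")

    n = len(a_mutated)
    n_a = sum(a_mutated)
    n_b = sum(b_mutated)
    both = sum(1 for a, b in zip(a_mutated, b_mutated) if a and b)

    # Hypergeometric probability of k tumors carrying both mutations,
    # summed over every overlap at least as small as the observed one.
    favourable = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(both + 1)
    )
    return both, favourable / comb(n, n_b)


if __name__ == "__main__":
    # Hypothetical panel: 20 tumors, 10 with gene A hit, the other 10 with gene B.
    gene_a = [True] * 10 + [False] * 10
    gene_b = [False] * 10 + [True] * 10
    overlap, p = mutual_exclusivity_pvalue(gene_a, gene_b)
    print(overlap, p)
```

For this toy panel there is exactly one way out of 184,756 to place the ten gene B mutations with zero overlap, so the p-value is about 5.4e-6. A real analysis would also need to correct for testing many gene pairs and for differences in background mutation rate between tumors, which this sketch ignores.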

It’s also important to remember that pathways can be disrupted in multiple ways. Exome sequencing to find point mutations may not be enough, as we know that copy-number alterations may lead to altered expression levels, or aberrant methylation may cause dysregulation. An integrative approach is going to be key as we move forward.

So my point is, yeah, there are absolutely people working on improving this process and doing a better job of prioritizing these mutations for in vivo validation. It’s an exciting place to be working right now, as it’s a major bottleneck preventing us from translating ubiquitous sequencing into personalized medicine. I’m glad to be working here in the thick of it.

Third-Gen Sequencing: Pacific Biosciences

A few months ago, I meandered over to Rice to see a talk from Mark Maxham, who manages the instrument software group at PacBio. He gave a good basic overview of their technology, which works by fixing DNA polymerase molecules at the bottom of tiny wells, which they call zero-mode waveguides.


The size of these wells is key, because they’re smaller than the wavelength of the light being beamed in. Thus, the laser illuminates only the very bottom portion of the well, right where the polymerase sits. Then they feed in single-stranded DNA and fluorescently labeled nucleotides, and watch the bases being incorporated in real time.

Before I dive into the details of the sequencer, let me say that he talked a lot about the engineering challenges posed by working on such a minute scale, and it’s quite impressive that the system works at all. In addition to having super-sensitive CCDs to capture the tiny amounts of fluorescence given off as a base is incorporated, they have to hold those sensors almost perfectly still to avoid crosstalk between wells. Then, once they get the input from the sensors, they use Hidden Markov Model-based algorithms to process it and extract the signal from all the noise.

All of that is just engineering talk, though. The real question is, as a consumer of genomic data, what does this give me, and how does it stack up to the other sequencing platforms that are in development?

Read lengths on the PacBio machine are impressive – Maxham showed data from the E. coli genome where they had an average read length of 500 bp and maximum read lengths of 3.2 kbp. That’s long enough to get across almost any repeat region, meaning that doing assembly from this platform will be a breeze. In order to get even longer reads for scaffolding, they also have a “pulsed mode” where the laser turns on and off at regular intervals. This produces reads containing short gaps and doubles the maximum read length to about 6.4 kbp.

The limiting factor in the length of reads they get from the system turns out to be the polymerase. The laser light used for detection heats up the system and “burns out” the polymerase after a while. In order to increase this read length, they’re going to need to either engineer a more robust polymerase or produce more sensitive CCDs so that they can lower the intensity of the laser. I’m sure they’re working on both approaches.

Another nice feature of the PacBio system is the ability to circularize reads by placing “dumbbells” (which they call SMRTbells) on both ends of a double-stranded read.

[Image: PacBio SMRTbell]

Poor naming choice aside, it’s a clever little system. They prime inside one of the dumbbells, and as the polymerase reads this dsDNA, it’s unzipped into a circular fragment. The polymerase will happily read this fragment over and over until the aforementioned burnout occurs. This gives very deep coverage of short sequences, and drives the error rate to almost nil.
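To get a feel for why repeated passes help so much, here is a minimal sketch in Python of a toy model. It assumes each pass miscalls a given base independently with the same probability, and that the consensus is a simple majority vote. Both are my assumptions for illustration, not a description of PacBio's actual consensus algorithm, and the 20% per-pass error rate is an illustrative number rather than one from the talk. The function name is hypothetical.

```python
from math import comb


def majority_vote_error(passes, per_pass_error):
    """Chance that more than half of the passes miscall a given base.

    Assumes errors are independent between passes (an assumption for this
    toy model) and counts any erroneous majority as a wrong consensus call.
    That is pessimistic, since wrong passes may disagree with each other.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd number of passes so there are no ties")

    need = passes // 2 + 1  # smallest number of bad passes that forms a majority
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range(need, passes + 1)
    )


if __name__ == "__main__":
    # Illustrative per-pass error rate of 20%; not a figure from the talk.
    for n in (1, 3, 5, 9, 15):
        print(n, majority_vote_error(n, 0.2))
```

Under these assumptions, three passes already cut a 20% per-base error to 10.4%, and the error keeps shrinking as passes are added. The catch is that every extra pass costs read length, since the polymerase only survives for so many bases, which is why this works best on short fragments.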

I asked if there was any sequence specificity, and whether it has trouble handling any particular sequence motifs. (454, for example, has poor handling of homopolymer runs.) He replied no, and that the only thing he knew of was that it performs slightly better on GC-rich content. I don’t know this for a fact, but I’d hypothesize that because of the extra hydrogen bond, the guanines and cytosines spend a little bit more time inside the active site of the polymerase, which gives a stronger signal and makes detection easier.

In read length and accuracy, this platform is impressive, but the big question is going to be what their throughput and price are like. Mark wasn’t allowed to go into detail on those points, but I hope that we’ll be hearing more from PacBio at the AGBT conference, which is coming up at the end of the month.

Image sources:
1) Eid, et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science doi:10.1126/science.1162986
2) PacBio Website