PacBio Revealed

Oliver Elemento has done a pretty remarkable in-depth analysis of the first publicly available PacBio data. It’s all up on his blog, so jump over and read the whole thing, but here are a few of the highlights:

  • The machine only produces about 48k reads per run. By Oliver’s reckoning, this works out to about 6,400 runs to get 10x coverage of a human genome. Ouch.
  • Single-pass sequence accuracy is remarkably low, at just over 80%. I heard rumors that PacBio had accuracy problems, but didn’t expect the error rate to be that ugly.
  • On a more positive note, read length is very high, with several runs *averaging* 2,300 bp, and overall read length averages ~850 bp.
  • Interestingly, there is a positive correlation between read length and quality. This is somewhat different from what we see from other platforms, where read length is limited by the steep drops in quality near the end of the read.
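The coverage arithmetic is worth sketching out, since estimates like these are very sensitive to the inputs. Here's a back-of-the-envelope version; the genome size and average read length below are my own assumptions, not Oliver's, so don't expect it to reproduce his 6,400-run figure exactly (his estimate presumably counts usable bases differently):

```python
import math

def runs_needed(target_coverage, genome_size_bp, reads_per_run, avg_read_len_bp):
    """Runs required to reach a target fold-coverage, assuming
    every read contributes its full length (an optimistic simplification)."""
    bases_per_run = reads_per_run * avg_read_len_bp
    return math.ceil(target_coverage * genome_size_bp / bases_per_run)

# Assumed inputs: ~3.2 Gbp human genome, 48k reads/run, ~850 bp average read.
print(runs_needed(10, 3.2e9, 48_000, 850))  # -> 785
```

The gap between this optimistic number and the 6,400-run estimate shows how much the answer swings with the assumed usable read length per run.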

The bottom line, in my mind, is that unless PacBio can solve their problems in accuracy and throughput, they’re going to be relegated to niche applications.

The Velluvial Matrix

Here’s one physician’s take on increasing complexity in medicine, given as a commencement speech at Stanford. I’ll let you jump over there if you want to find out the details about the “Velluvial matrix”, but it’s really just a device that lets him get at some big-picture ideas about the future of the field:

This is a deeper, more fundamental problem than we acknowledge. The truth is that the volume and complexity of the knowledge that we need to master has grown exponentially beyond our capacity as individuals. Worse, the fear is that the knowledge has grown beyond our capacity as a society. When we talk about the uncontrollable explosion in the costs of health care in America, for instance—about the reality that we in medicine are gradually bankrupting the country—we’re not talking about a problem rooted in economics. We’re talking about a problem rooted in scientific complexity.

This reminds me a bit of the PhD comic, which shows how grad school makes you dumber (a phenomenon closely related to the Dunning–Kruger effect). The more we learn about physiology and medicine, the more specialized each branch has to become, and the more likely that we are to make mistakes when crossing disciplinary boundaries:

Smith told me that to this day he remains deeply grateful to the people who saved him. But they missed one small step. They forgot to give him the vaccines that every patient who has his spleen removed requires, vaccines against three bacteria that the spleen usually fights off. Maybe the surgeons thought the critical-care doctors were going to give the vaccines, and maybe the critical-care doctors thought the primary-care physician was going to give them, and maybe the primary-care physician thought the surgeons already had. Or maybe they all forgot. Whatever the case, two years later, Duane Smith was on a beach vacation when he picked up an ordinary strep infection. Because he hadn’t had those vaccines, the infection spread rapidly throughout his body. He survived—but it cost him all his fingers and all his toes. It was, as he summed it up in his note, the worst vacation ever.

This is absolutely relevant to my last post, about intelligent systems that can make decisions. If designed properly, a machine will never miss a piece of evidence or forget to perform a step (that’s a big “if”, but one we need to start tackling). Studies have already shown that simple checklists can dramatically reduce complications and deaths during surgery. Now we need to be designing systems that produce those checklists instantly and adapt them to the specifics of the patient on the table.
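To make the idea concrete, here's a toy sketch of a patient-adaptive checklist generator. The rules and item text are hypothetical placeholders I made up for illustration, not a clinical knowledge base; a real system would draw its rules from curated guidelines:

```python
# Toy patient-adaptive checklist generator.
# The rules below are illustrative placeholders, not clinical guidance.
RULES = [
    # (predicate over the patient record, checklist item it triggers)
    (lambda p: "splenectomy" in p["procedures"],
     "Administer post-splenectomy vaccines (encapsulated bacteria)"),
    (lambda p: p["age"] >= 65,
     "Review anesthesia dosing for elderly patient"),
    (lambda p: "penicillin" in p["allergies"],
     "Flag penicillin allergy; select alternative antibiotics"),
]

def build_checklist(patient):
    """Return every checklist item whose rule fires for this patient."""
    return [item for rule, item in RULES if rule(patient)]

patient = {"age": 34, "procedures": ["splenectomy"], "allergies": []}
for item in build_checklist(patient):
    print("[ ]", item)
```

The point is that the checklist is derived from the patient record at the moment of care, so a step like the missed splenectomy vaccines can't silently fall between specialties.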

This isn’t as sexy as the type of personalized medicine that relies on genetic screening, but it’s probably even more important in terms of its capacity to save lives in the short term.

Third-Gen Sequencing: Pacific Biosciences

A few months ago, I meandered over to Rice to see a talk from Mark Maxham, who manages the instrument software group at PacBio. He gave a good basic overview of their technology, which works by anchoring DNA polymerase molecules at the bottom of tiny wells called zero-mode waveguides.


The size of these wells is key, because they’re smaller than the wavelength of the light being beamed in. Thus, the laser illuminates only the very bottom portion of the well, right where the polymerase sits. Then they feed in single-stranded DNA and fluorescently labeled nucleotides, and watch the bases being incorporated in real time.

Before I dive into the details of the sequencer, let me say that he talked a lot about the engineering challenges posed by working on such a minute scale, and it’s quite impressive that the system works at all. In addition to having super-sensitive CCDs to capture the tiny amounts of fluorescence given off as a base is incorporated, they have to hold those sensors almost perfectly still to avoid crosstalk between wells. Then, once they get the input from the sensors, they use Hidden Markov Model-based algorithms to process it and extract the signal from all the noise.
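I don't know anything about PacBio's actual algorithms beyond "HMM-based", but to give a flavor of how an HMM can pull a base sequence out of noisy per-pulse observations, here's a minimal Viterbi decoder over a made-up four-state model. The emission and transition probabilities are invented for illustration:

```python
import math

STATES = "ACGT"
# Invented model: each hidden base mostly emits its own pulse color,
# with a small probability of a noisy misread.
EMIT = {s: {o: 0.85 if o == s else 0.05 for o in STATES} for s in STATES}
TRANS = {s: {t: 0.25 for t in STATES} for s in STATES}  # uniform transitions

def viterbi(observations):
    """Most likely hidden base sequence given observed pulse calls."""
    # Work in log-probabilities to avoid underflow on long reads.
    V = [{s: math.log(0.25) + math.log(EMIT[s][observations[0]]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda prev: V[-1][prev] + math.log(TRANS[prev][s]))
            col[s] = V[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Trace back the highest-probability path.
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("ACGGT"))  # -> ACGGT
```

With uniform transitions this toy model just echoes the observations; the real payoff comes when the transition and emission models encode pulse-width, context, and noise statistics, so the decoder can override an implausible raw call.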

All of that is just engineering talk, though. The real question is, as a consumer of genomic data, what does this give me, and how does it stack up to the other sequencing platforms that are in development?

Read lengths on the PacBio machine are impressive – Maxham showed data from the E. coli genome where they had an average read length of 500 bp and maximum read lengths of 3.2 kbp. That’s long enough to get across almost any repeat region, meaning that doing assembly from this platform will be a breeze. In order to get even longer reads for scaffolding, they also have a “pulsed mode” where the laser turns on and off at regular intervals. This produces reads containing short gaps and doubles the maximum read length to about 6.4 kbp.

The limiting factor in the length of reads they get from the system turns out to be the polymerase. The laser light used for detection heats up the system and “burns out” the polymerase after a while. In order to increase this read length, they’re going to need to either engineer a more robust polymerase or produce more sensitive CCDs so that they can lower the intensity of the laser. I’m sure they’re working on both approaches.

Another nice feature of the PacBio system is the ability to circularize reads by placing “dumbbells” (which they call SMRTbells) on both ends of a double-stranded fragment.

[Image: PacBio SMRTbell]

Poor naming choice aside, it’s a clever little system. They prime inside one of the dumbbells, and as the polymerase reads the dsDNA, it’s unzipped into a circular fragment. The polymerase will happily read this fragment over and over until the aforementioned burnout occurs. This gives very deep coverage of short sequences and drives the error rate to almost nil.
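The error-rate improvement from repeated passes is easy to sketch. Assuming errors are independent across passes and a simple per-base majority vote (both simplifications; real consensus callers are smarter), accuracy climbs fast even from a weak single-pass starting point:

```python
from math import comb

def consensus_accuracy(per_pass_acc, passes):
    """Probability that a per-base majority vote over independent passes
    is correct. Simplification: each pass is binary right/wrong, and
    ties count as wrong."""
    return sum(comb(passes, k) * per_pass_acc**k * (1 - per_pass_acc)**(passes - k)
               for k in range((passes // 2) + 1, passes + 1))

# Assumed 85% single-pass accuracy, in the ballpark of the early data.
for n in (1, 5, 15):
    print(n, round(consensus_accuracy(0.85, n), 4))
```

Under these assumptions, five passes already push per-base accuracy above 97%, and fifteen passes put it well past 99% – which is why circularizing short fragments is such an effective answer to a noisy single pass.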

I asked if there was any sequence specificity, and whether it has trouble handling particular sequence motifs (454, for example, handles homopolymer runs poorly). He replied no; the only quirk he knew of was that it performs slightly better on GC-rich content. I don’t know this for a fact, but I’d hypothesize that because of the extra hydrogen bond, guanines and cytosines spend a little more time in the polymerase’s active site, which gives a stronger signal and makes detection easier.

In read length and accuracy, this platform is impressive, but the big question is going to be what their throughput and price are like. Mark wasn’t allowed to go into detail on those points, but I hope that we’ll be hearing more from PacBio at the AGBT conference, which is coming up at the end of the month.

Image sources:
1) Eid et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science doi:10.1126/science.1162986
2) PacBio Website