Science Communication

Over at BioStar, someone asked the question: How do you explain what you do to the guy on the street or your mum?

Poor science communication is a pet peeve of mine, so I wrote a rather long answer. First of all, I totally think that science communication should be a required course in every PhD program, and that you should have to practice explaining your work until you can do it in your sleep. Scientists can be their own best advocates, but they need to work at it.

As a scientist, you need to be able to explain your work at several different depths. The most important part of this process is accurately gauging the interest and experience of your audience, so you can choose the appropriate spiel. Here are a few of the explanations that you need to have ready:

For non-scientists:

  • The layperson’s 15-second elevator pitch: For the cocktail party or the new acquaintance who asks what you do. They should walk away understanding that you do science and that your work is trying to make the world a better place. (“better cancer treatments”, “new malaria drugs”)
  • The follow-up 2-minute overview, if they ask for more details. Still very high level, abstract, focused on where you’re trying to get with your research. (“understanding XYZ part of disease ABC by looking at things through a microscope”, “figuring out how the brain stores memories by sticking people in cool scanners”)
  • The full explanation. For the non-scientists who really want to wrap their head around what you do. Keep in mind that these people may not have taken a science course since high school. Avoid jargon and acronyms, and make sure that at the end, you leave them with the big picture idea of what you’re trying to accomplish and how that will advance humanity.

For scientists:

  • The scientist’s elevator pitch. For people you’ll meet around your campus or at conferences. You may even need two or three of these, for use in different venues. (At a focused conference, you’ll be more specific and jargon-y than at a departmental retreat.)
  • The two-minute casual conversation. This one is tricky, because you need to read the person you’re talking to in a very short time. Do they know what RMA is? What about ERBB2? What’s their background, and how can I look at my problem from their angle, so as to best couch my answer in terms they’ll understand?
  • The 5, 15, and 30 minute presentations, often with slides or a poster. You should have been drilled in these during your graduate school career, and if you weren’t, there’s no time like the present to start practicing.

Once you get these down, practice the final element, which is being enthusiastic about your work. After all, you probably think that what you’re researching is one of the coolest and most important things ever. If that comes across to your audience, they’ll be engaged and interested too.

sort and uniq

This is part two in my series of short primers on useful tools for bioinformatics. I’m continuing the ‘back to basics’ theme here and talking about two shell commands today. sort and uniq are two utilities that I wish I had known about a few years sooner, as I now use them pretty much daily while munging data.

Sort does exactly what it says on the box: it lets you order your data. In the simple case, you might have a file containing a list of gene names, and you’d sort it like this:

sort infile > outfile

Or it might contain numbers, so you’d want to sort it numerically, like so:

sort -n infile > outfile
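The -n flag matters more than it might look: without it, sort compares strings character by character, so “10” lands before “2”. A quick sanity check with some made-up numbers:

```shell
# Create a small test file of numbers (one per line)
printf '10\n2\n1\n' > nums.txt

# Lexicographic sort: "1" < "10" < "2", which is rarely what you want
sort nums.txt

# Numeric sort: 1, 2, 10
sort -n nums.txt
```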

Now, let’s say you’re working with tab-delimited data, which might contain genomic coordinates and scores:

chr1 100100 100200 19
chr1 100050 100099 3
chr1 100300 100600 20

To sort it numerically by the 4th column, we can use the -n and -k flags:

sort -n -k4 infile > outfile

If we want to reverse that order, add the -r flag:

sort -nrk4 infile > outfile

What about sorting by multiple columns? By default, -k uses a key that starts with the specified column and continues to the end of the line. Since we want only the nth column, we specify that using the comma syntax. So, to sort the above file by genomic coordinates (chr, start, stop), we can use something like this:

sort -k1,1 -k2,2n -k3,3n infile > outfile
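To see this in action, here’s a sketch that reconstructs the sample file from above (assuming tab-delimited fields) and sorts it by coordinate; the chr1:100050 line should move to the top:

```shell
# Recreate the example data as a tab-delimited file
printf 'chr1\t100100\t100200\t19\nchr1\t100050\t100099\t3\nchr1\t100300\t100600\t20\n' > regions.txt

# Sort by chromosome (as text), then start and stop (as numbers)
sort -k1,1 -k2,2n -k3,3n regions.txt
```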

After sorting your data, you can then scan it for unique lines using the uniq command. Basic usage will just keep the first instance of every value, throwing away the duplicates:

sort infile | uniq > outfile

You can also output only the unique values (throwing away anything that appears more than once):

sort infile | uniq -u > outfile

Or output just the duplicates (throwing away anything that appears only once):

sort infile | uniq -d > outfile

It will even condense the lines and prepend a count of how many times each value appears in the file:

sort infile | uniq -c > outfile
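With a toy gene list (names made up for illustration), the counted output looks like this:

```shell
# uniq -c prepends each distinct value with its occurrence count
printf 'TP53\nBRCA1\nTP53\nEGFR\nTP53\n' | sort | uniq -c
#   1 BRCA1
#   1 EGFR
#   3 TP53
```

Piping that through `sort -rn` is a handy follow-up when you want the most frequent values first.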

As a simple example of real world usage, let’s imagine that I have lists of differentially expressed genes generated by two cell line experiments. I want to find out which genes are altered in both systems. One way to do this would be to write a little script to read in each file, then hash all the gene names and output any names that appear in both lists. With these tools, we can now do it in one line from the shell:

cat file1 file2 | sort | uniq -d
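One caveat: if a gene can appear more than once within a single list, it will be reported as a “duplicate” even when it’s absent from the other file. Deduplicating each file first guards against that. A sketch with made-up gene names:

```shell
# GENE1 appears twice in file1 but is absent from file2
printf 'GENE1\nGENE1\nGENE2\n' > file1
printf 'GENE2\nGENE3\n' > file2

# Naive version: reports GENE1 as well as GENE2
cat file1 file2 | sort | uniq -d

# Dedup each file first, so only genes shared between files survive
{ sort -u file1; sort -u file2; } | sort | uniq -d
```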

Use wget to grab the data you need

I’m starting a series of code snippets here, which will be short bits of useful code. Let’s start with something basic. Need to grab a whole directory of files from an http or ftp server? Use wget.

Grab all of the files with suffix “.tar.gz”:

wget -l2 -r -A tar.gz http://site.goes/here/

If there are a lot of files, or they’re huge, consider using the --wait flag to pause for a few seconds between downloads, so that you don’t slam their server.

wget -l2 -r -A tar.gz --wait=10 http://site.goes/here/

If the site owner blocks you from using wget with a robots.txt file, think carefully – they probably have good reason for doing so. If you’re still convinced that wget is the right way, and you’re sure that you won’t be crippling their server (or skyrocketing their bandwidth bill), you can have wget ignore the robots.txt file:

wget -l2 -r -A tar.gz --wait=10 -e robots=off http://site.goes/here/


It’s a cliché that too little American talent goes into science, that too many people go into banking, and that our education system is failing as a result. To some substantial extent, it must be true.

I found myself on a business trip to Europe and lucky for me I was sitting in business class. Seated not far from me in business class was a young woman who had graduated from Harvard 14 months before and who was working for a major financial institution and she was traveling to Europe and when people travel for that major financial institution they travel in business class. I like to walk when I’m on a plane, so I wandered. I walked back to coach and on that airplane in coach was a distinguished physicist who I had known when I was president of Harvard who I think probably is close to even money to win a Nobel Prize one day. He was going to a conference like professors of physics do and he was going like professors of physics like him go, which is in coach. And I didn’t say anything to either of them, but I thought to myself there was something odd about the reward structure of our society.

–Lawrence Summers

as quoted here

Advice for a potential PhD student

Over at AskMeFi, someone asks whether a PhD program is right for them:

On the one hand, I’d always sort of pictured myself as a PhD holder one day. On the other hand, I have no desire to do the level of grant-writing required of competent PIs, I don’t want to manage a bunch of other people in my research team – I want to do the research. . .

  1. Getting a PhD just because you’ve pictured yourself holding one is a horrible idea. There are lots of good reasons to get a doctorate, but they’re all about best positioning yourself for the career that you’d like to have. If you’re in a field where a master’s is all that you need, then getting a PhD will be a colossal waste of time and earning potential. In those sorts of fields, you’d be better off with 5 years of job experience than with an extra degree.
  2. The main thing that you should do is think about what kind of job you’d ultimately like to get. Check the job postings – do they require a master’s? Do they mention PhDs as a plus? Dig around websites and find the bios of people who have that job already. Do they have PhDs?
  3. Talk to some faculty and/or grad students in the type of PhD program that you’d want to go to, and see what their thoughts are. They are immersed in the field and can tell you what kinds of job opportunities the PhD program opens up that might not be available to someone with a master’s. (The most obvious, of course, being tenure-track faculty positions.)
  4. As for not wanting to be a PI and spend your days writing grants, I understand that sentiment. Keep in mind that not all universities are research universities. If you’re happy working with smaller budgets and devoting lots of time to teaching, a gig at a small or liberal arts university might be right up your alley. As I said upthread, these will almost certainly require a PhD. Also look into research assistant or research faculty positions within labs, where you’ll have a boss who writes all the grants.


Man goes into a bar: Can I have a pint of adenosine triphosphate please?
Barman: Certainly sir, that’ll be 80p


A mosquito did cry out in pain,
“A scientist’s rotting my brain!”
The cause of his sorrow
was para-dichloro-
diphenyltrichloroethane.

(source unknown)

Third-Gen Sequencing: Pacific Biosciences

A few months ago, I meandered over to Rice to see a talk from Mark Maxham, who manages the instrument software group at PacBio. He gave a good basic overview of their technology, which works by fixing DNA polymerase molecules at the bottom of tiny wells, which they call zero-mode waveguides.


The size of these wells is key, because they’re smaller than the wavelength of the light being beamed in. Thus, the laser illuminates only the very bottom portion of the well, right where the polymerase sits. Then they feed in single-stranded DNA and fluorescently labeled nucleotides, and watch the bases being incorporated in real time.

Before I dive into the details of the sequencer, let me say that he talked a lot about the engineering challenges posed by working on such a minute scale, and it’s quite impressive that the system works at all. In addition to having super-sensitive CCDs to capture the tiny amounts of fluorescence given off as a base is incorporated, they have to hold those sensors almost perfectly still to avoid crosstalk between wells. Then, once they get the input from the sensors, they use Hidden Markov Model-based algorithms to process it and extract the signal from all the noise.

All of that is just engineering talk, though. The real question is, as a consumer of genomic data, what does this give me, and how does it stack up to the other sequencing platforms that are in development?

Read lengths on the PacBio machine are impressive – Maxham showed data from the E. coli genome where they had an average read length of 500 bp, and maximum read lengths of 3.2 kbp. That’s long enough to get across almost any repeat region, meaning that doing assembly from this platform will be a breeze. In order to get even longer reads for scaffolding, they also have a “pulsed mode” where the laser turns on and off at regular intervals. This produces reads containing short gaps and doubles the maximum read length to about 6.4 kbp.

The limiting factor in the length of reads they get from the system turns out to be the polymerase. The laser light used for detection heats up the system and “burns out” the polymerase after a while. In order to increase this read length, they’re going to need to either engineer a more robust polymerase or produce more sensitive CCDs so that they can lower the intensity of the laser. I’m sure they’re working on both approaches.

Another nice feature of the PacBio system is the ability to circularize reads by placing “dumbbells” (which they call SMRTbells) on both ends of a double-stranded read.

[image: PacBio SMRTbell]

Poor naming choice aside, it’s a clever little system. They prime inside one of the dumbbells, and as the polymerase reads the dsDNA, it’s unzipped into a circular fragment. The polymerase will happily read this fragment over and over until the aforementioned burnout occurs. This gives very deep coverage of short sequences, and drives the error rate to almost nil.

I asked if there was any sequence specificity, and whether it has trouble handling any particular sequence motifs. (454, for example, has poor handling of homopolymer runs). He replied no, and that the only thing he knew of was that it performs slightly better on GC-rich content. I don’t know this for a fact, but I’d hypothesize that because of the extra hydrogen bond, the guanines and cytosines spend a little bit more time inside the active site of the polymerase, which gives a stronger signal and makes detection easier.

In read length and accuracy, this platform is impressive, but the big question is going to be what their throughput and price are like. Mark wasn’t allowed to go into detail on those points, but I hope that we’ll be hearing more from PacBio at the AGBT conference, which is coming up at the end of the month.

Image sources:
1) Eid, et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science doi:10.1126/science.1162986
2) PacBio Website

Announcing BioBits

Lately, I’ve been struggling with the fact that I’ve been writing for two different audiences on my weblog. The first group consists of my family, friends, and people that have similar tastes in politics and culture. The second group are fellow scientists and techies who may be interested in what I’ve been up to in lab, or what new bioinformatics tools I’ve been using.

In the interest of pleasing both audiences and allowing you to more easily sort the wheat from the chaff, I’m moving my science writing to this space, while maintaining my personal blog at its original address. I want to emphasize that this isn’t about separating my personal and professional lives, as that’s a near impossibility in this age of the internet. Rather, it’s about offering focused content to just the people that want to hear it.

I’m also making a resolution to post on a near-daily basis over here, to get this site off on the right foot. Some days it may just be a code snippet, or a good quotation, but I’ll try to keep the content flowing with some more meaty posts about bioinformatics, article reviews, and thoughts about scientific culture.

Lessons for article recommendation services

Today someone proposed the creation of a sub-reddit where scientists could recommend papers to each other. While it’s a nice thought, I can almost guarantee that it’s going to be a failed effort. There are already sites like Faculty of 1000, which try to use panels of experts to recommend good papers. In my experience, they mostly fail at listing things that I want to read.

The main reason such sites are useless is that we scientists are uber-specialized, so what you think is the greatest paper ever will likely hold very little interest for me. It’s not that I don’t want to read about cool discoveries in other fields, it’s just that I don’t have time to. Until they invent the matrix-esque brain-jack for rapid learning, I have to prioritize my time, and my field and my work will always come first.

There are only two systems I’ve found that work well. The first is recommendation based on what you’ve read in the past, and what your colleagues are reading. CiteULike, for example, recommends users who have bookmarked papers similar to yours, and perusing their libraries gives me an excellent source of material. The other quality source of recommendations is FriendFeed, where I can subscribe to the feeds of other bioinformaticians with similar interests, and we can swap links to papers and comments about those papers.

Both of these systems are all about building micro-communities, with a focus that you can’t achieve in larger communities like Reddit. In this way, it’s sort of like a decentralized version of departmental journal clubs, or specialized scientific conferences. Any site that ignores the value of creating this type of community is pretty much doomed to failure from the start.

(reposted from my personal blog)
