A friendly reminder

[Figure: traffic fatalities vs. lemon imports]

Correlation != Causation

Heilmeier’s Catechism

George Heilmeier (somewhat) famously proposed a list of questions that you should be able to answer before starting any new research or development project:

  • What are you trying to do? Articulate your objectives using absolutely no jargon.
  • How is it done today, and what are the limits of current practice?
  • What’s new in your approach and why do you think it will be successful?
  • Who cares?
  • If you’re successful, what difference will it make?
  • What are the risks and the payoffs?
  • How much will it cost?
  • How long will it take?
  • What are the midterm and final “exams” to check for success?

Every second-year graduate student should have a worksheet containing this list as they prepare to propose their thesis project. It’s equally applicable to grant applications, product design in industry, and damn near everything you do in science, really.

    quoted from Wikipedia, after being mentioned by Titus Brown

    Transparent overlapping histograms in R

    Today I wanted to compare the histograms from two data sets, but it’s hard to see the differences when the plots overlap. Here’s an example of the problem:
    [sourcecode]
    a = rnorm(10000,5)
    b = rnorm(10000,3)
    hist(a,xlim=c(0,10))
    hist(b,col="gray20",add=T)
    [/sourcecode]
    Output:
[Figure: normal histograms]

    I want to know what’s going on behind all that solid grey!

    So I coded up a little function to add pseudo-transparency to pairs of histograms:
    [sourcecode]
# Plot two histograms on the same axes, shading the overlapping region
# with a third color to fake transparency.
plotOverlappingHist <- function(a, b, colors=c("white","gray20","gray50"),
                                breaks=NULL, xlim=NULL, ylim=NULL){

  ahist=NULL
  bhist=NULL

  if(!(is.null(breaks))){
    # Caller supplied the breaks; use them for both data sets.
    # (Pass a vector of breakpoints, not a count, so both histograms share the same bins.)
    ahist=hist(a,breaks=breaks,plot=F)
    bhist=hist(b,breaks=breaks,plot=F)
  } else {
    # Let hist() choose breaks for each data set, then build one common set of
    # equally spaced breaks that spans both ranges.
    ahist=hist(a,plot=F)
    bhist=hist(b,plot=F)

    dist = ahist$breaks[2]-ahist$breaks[1]
    breaks = seq(min(ahist$breaks,bhist$breaks),max(ahist$breaks,bhist$breaks),dist)

    ahist=hist(a,breaks=breaks,plot=F)
    bhist=hist(b,breaks=breaks,plot=F)
  }

  # Default axis limits: span all the breaks and the tallest bar.
  if(is.null(xlim)){
    xlim = c(min(ahist$breaks,bhist$breaks),max(ahist$breaks,bhist$breaks))
  }

  if(is.null(ylim)){
    ylim = c(0,max(ahist$counts,bhist$counts))
  }

  # Build a third histogram holding the bin-by-bin overlap: the smaller of the
  # two counts wherever both bins are non-empty, zero otherwise.
  overlap = ahist
  for(i in 1:length(overlap$counts)){
    if(ahist$counts[i] > 0 & bhist$counts[i] > 0){
      overlap$counts[i] = min(ahist$counts[i],bhist$counts[i])
    } else {
      overlap$counts[i] = 0
    }
  }

  # Draw both histograms, then paint the overlap on top in a third shade.
  plot(ahist, xlim=xlim, ylim=ylim, col=colors[1])
  plot(bhist, xlim=xlim, ylim=ylim, col=colors[2], add=T)
  plot(overlap, xlim=xlim, ylim=ylim, col=colors[3], add=T)
}
    [/sourcecode]

    The results are much easier to interpret:
    [sourcecode]
    a = rnorm(10000,5)
    b = rnorm(10000,3)
    plotOverlappingHist(a,b)
    [/sourcecode]

[Figure: overlapping histograms]
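
You can also hand the function your own break points and shades if you want finer control. Here’s a minimal sketch (the bin width and greys are arbitrary choices, not anything the function requires):
[sourcecode]
a = rnorm(10000,5)
b = rnorm(10000,3)
# explicit breaks that cover both data sets, shared by both histograms
brk = seq(floor(min(a,b)), ceiling(max(a,b)), 0.25)
plotOverlappingHist(a, b, breaks=brk, colors=c("white","gray40","gray70"))
[/sourcecode]
Supplying the breaks as an explicit vector keeps both histograms on the same bins, which is what the overlap calculation assumes.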

    I’m sure this could be improved upon and generalized to multiple histograms, but this solves my problem, so I’m calling it quits for now. Let me know if you make improvements or find it useful.

    Bacteria and Bad Breath

Over at Ask Metafilter, Srudolph asks: “My son and his dad (my former husband) never have bad breath. How is this possible?”

Halitosis is mostly caused by microorganisms breaking down the food particles left in our mouths. Right now, we know that people can have fairly different microbial populations, but we don’t have a good handle on why that is. Some of it is certainly environmental, related to eating habits and such, but some of it also depends on personal factors. The immune system plays a role in culling these bacteria, and there are probably other contributors, like slight differences in the pH of your saliva, that might make all the difference.

Factors like these last two may be partially genetic, which could explain why your son and his dad have similarly fresh breath. Since we don’t yet know enough about these dynamic microbial ecosystems to control them, your best bet is to brush after meals, and be glad that you don’t have to kiss a smelly spouse.

    If you’re interested in learning more about the trillions of microorganisms that inhabit your body, read up on the Human Microbiome Project, which is trying to characterize and understand them.

    Google Scholar simplifies my life (again)

    Google Scholar has largely replaced PubMed as the literature search engine of choice for my generation. It’s more intuitive, requires less futzing around with keywords, and generally seems to produce better results with less hassle. Today, Google Scholar got a little bit better, adding the ability to search within articles that cite a specific article. Sound confusing? Then let me explain:

    When researching a particular gene or reading up on a technique, you’ll eventually stumble upon a great paper. It’s clearly a seminal paper in the field and is perfectly related to what you’re looking for, but it’s a little out of date. Surely, you think, other people have been working on this problem over the last decade!

    The old way to find out was to click on the “Cited By” link and then scroll through all the papers that appeared, looking for relevant titles, then skimming their abstracts. (Being able to get even this much information was a major breakthrough not so long ago). On really solid papers, though, that cited-by list might number into the hundreds. Now, Google has made this step easier by allowing you to do full-text search within that list, effectively narrowing your search to a specific lineage of scientific discovery.

    In the future, I’d love to see this expanded to second or third-degree neighbors. I’d also like the ability to go the opposite direction, and search within all the papers that a specific paper cites. I can’t gripe too much about lack of features, though. Only a couple of decades ago, people were still following citation trails manually, by pulling the relevant journals off the shelf and making photocopies. It’s amazing that anything got done at all back then.

    The Velluvial Matrix

    Here’s one physician’s take on increasing complexity in medicine, given as a commencement speech at Stanford. I’ll let you jump over there if you want to find out the details about the “Velluvial matrix”, but it’s really just a device that lets him get at some big-picture ideas about the future of the field:

    This is a deeper, more fundamental problem than we acknowledge. The truth is that the volume and complexity of the knowledge that we need to master has grown exponentially beyond our capacity as individuals. Worse, the fear is that the knowledge has grown beyond our capacity as a society. When we talk about the uncontrollable explosion in the costs of health care in America, for instance—about the reality that we in medicine are gradually bankrupting the country—we’re not talking about a problem rooted in economics. We’re talking about a problem rooted in scientific complexity.

    This reminds me a bit of the PhD comic, which shows how grad school makes you dumber (a phenomenon closely related to the Dunning–Kruger effect). The more we learn about physiology and medicine, the more specialized each branch has to become, and the more likely that we are to make mistakes when crossing disciplinary boundaries:

    Smith told me that to this day he remains deeply grateful to the people who saved him. But they missed one small step. They forgot to give him the vaccines that every patient who has his spleen removed requires, vaccines against three bacteria that the spleen usually fights off. Maybe the surgeons thought the critical-care doctors were going to give the vaccines, and maybe the critical-care doctors thought the primary-care physician was going to give them, and maybe the primary-care physician thought the surgeons already had. Or maybe they all forgot. Whatever the case, two years later, Duane Smith was on a beach vacation when he picked up an ordinary strep infection. Because he hadn’t had those vaccines, the infection spread rapidly throughout his body. He survived—but it cost him all his fingers and all his toes. It was, as he summed it up in his note, the worst vacation ever.

This is absolutely relevant to my last post, about intelligent systems that can make decisions. If designed properly, a machine will never miss a piece of evidence or forget to perform a step. (That’s a big “if”, but one we need to start tackling.) Studies have already shown that simple checklists can dramatically reduce complications and deaths during surgery. Now we need to be designing systems that produce those checklists instantly and adapt to the specifics of the patient on the table.

This isn’t as sexy as the type of personalized medicine that relies on genetic screening, but it’s probably even more important in terms of its capacity to save lives in the short term.

    Watson, Jeopardy, and intelligent machines

    Software firms and university scientists have produced question-answering systems for years, but these have mostly been limited to simply phrased questions. Nobody ever tackled “Jeopardy!” because experts assumed that even for the latest artificial intelligence, the game was simply too hard: the clues are too puzzling and allusive, and the breadth of trivia is too wide.

    With Watson, I.B.M. claims it has cracked the problem — and aims to prove as much on national TV. The producers of “Jeopardy!” have agreed to pit Watson against some of the game’s best former players as early as this fall.

    The New York Times profiles IBM’s Watson. It’s both a look at how far AI systems have come, and how far they still have to go. I’d be really interested in reading more about the underlying algorithms, so that I can get a better idea of where the major bottlenecks are.

It’s going to be amazing when we can apply machines like this to tasks like decision making in hospitals, or even create new, more intuitive data-mining interfaces. Hell, if we start cross-breeding Watson with something like ADAM, the robot that can form and test hypotheses, we graduate students may be made obsolete.

    Well, someday…

    Use ‘parallel’ for easy multi-processor execution

I just discovered the parallel utility, which is an easy way to make use of multiple processors while munging data from the shell. I especially like that I can pipe data directly into and out of it, just like the other shell utils (sed, awk, cut, sort, etc.).

    From the examples on that page:
Use ImageMagick’s “convert” command to downsize many images to thumbnails:
    ls *.jpg | parallel -j +0 convert -geometry 120 {} thumb_{}

“-j +0” means use as many processes as possible, based on the number of CPU cores present in the system.

    Or do it recursively using find:
    find . -name '*.jpg' | parallel -j +0 convert -geometry 120 {} {}_thumb.jpg
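
Because parallel reads its argument list from stdin and writes each job’s output to stdout, it also drops neatly into the middle of a longer pipeline. A minimal sketch (the file names here are hypothetical): count the lines in a batch of gzipped files, one decompression per core, then sort the combined counts:
ls *.fastq.gz | parallel -j +0 'zcat {} | wc -l' | sort -n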

    I suspect this will become an integral part of my pipelines soon.

    Complexity and biology

    Michael White discusses the status quo in modeling complex biological systems:

    There is a lot of biology tourism by some computational people who really are doing fact-free science. They are happy if they can successfully use their model to number-crunch some data set (typically the other half of the single dataset they used to train their model); they declare victory and say that their success means that some grand idea (typically untested, if not untestable) that motivated their model has been vindicated.

    Read the whole thing.

    Link Roundup

    Between submitting a paper, wrangling together an R package that I’m getting ready to release, and prepping for the CSHL Biology of Genomes conference this week, I’ve been neglecting to write here. To keep you entertained, here are a few links to the best of genome biology and bioinformatics over the last few weeks:
