Thursday, January 30, 2014

Microseconds matter.

Grace Hopper answers, "Why does it take so damn long to send a message by satellite?" Specifically she does a tremendous job of visualizing the difference between a nanosecond and a microsecond, and what it means to throw away microseconds of computing time.
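To put rough numbers on that difference, here is a minimal sketch of the distance light covers in a nanosecond versus a microsecond. It assumes propagation in a vacuum (signals in real wires are slower), and the function name `light_distance_m` is my own, not something from the video.

```python
# Speed of light in a vacuum, in metres per second (exact by definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def light_distance_m(seconds):
    """Distance in metres that light travels in a vacuum in the given time."""
    return SPEED_OF_LIGHT_M_PER_S * seconds


nanosecond_m = light_distance_m(1e-9)   # one billionth of a second
microsecond_m = light_distance_m(1e-6)  # one millionth of a second

print(f"1 nanosecond:  {nanosecond_m * 100:.1f} cm")
print(f"1 microsecond: {microsecond_m:.1f} m")
```

A nanosecond works out to roughly 30 cm, about the length of a short piece of wire, while a microsecond is a thousand times that, roughly 300 m.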

I am a computational biologist. Sometimes I use computers to compare differences and similarities between the nucleotides (A's, T's, G's and C's) of mammalian genomes.  Sometimes I look at evolution of the human sex chromosomes. I always write computer programs to analyze these datasets.

New technology means that the datasets we can produce are growing, and will continue to grow during my lifetime. Computational infrastructure has also been improving (see a history of computer storage). But computational power and storage are not going to scale fast enough to keep pace with the size and complexity of the datasets we will need to analyze. That's why Hopper's video above is so relevant.

I've noticed a mentality in biology circles that all we need to do to be able to analyze larger and larger datasets is to increase computing power. But, there are some big problems with this:

1. Computers are not self-sustaining.
Computers need to be maintained, managed, and cared for, and they need a dedicated space where they can be kept cool. Computers cannot upload or update their own software. Computers cannot turn themselves back on after a power outage, back themselves up (without instructions to do so), or upgrade their own hard drives.

We need good system administrators and computational lab managers to maintain computing resources, and these people need to be paid a competitive salary. High-performance computing is not a once-and-done expenditure; it is an ongoing investment.

2. Bigger is not enough.
Yes, large datasets do require lots of storage space, and analysis will speed up with faster processors, but that isn't enough. Let's think about it this way:

Imagine I have a dataset that takes one week of computing time to analyze using my fast processors, all of the storage I need, and my current code. If my dataset grows to be a hundred times larger, and the analysis time scales linearly with its size, it will now take 100 weeks of computing time to analyze. If I take no time to optimize my code, or parallelize the jobs, or figure out a new, faster method, I will be waiting nearly two years (assuming no hiccups) just to see what the new results are.

That is unacceptable. We need to code smarter. Similarly, we need to utilize efficient storage formats. There is progress in this direction, but it needs to be a constant focus.
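The arithmetic behind the one-week scenario can be sketched in a few lines. This is only a back-of-the-envelope model: it assumes run time grows linearly with dataset size and that the work splits perfectly across jobs, which real analyses rarely manage. The 20 parallel jobs and the 5x code speedup are made-up illustrative numbers, and `analysis_weeks` is a hypothetical helper.

```python
WEEKS_PER_YEAR = 52


def analysis_weeks(baseline_weeks, size_factor, speedup=1.0, workers=1):
    """Estimate wall-clock weeks for a larger dataset.

    Assumes run time scales linearly with dataset size and that the work
    divides evenly across `workers` parallel jobs (both simplifications).
    """
    return baseline_weeks * size_factor / (speedup * workers)


# One week today, dataset grows 100-fold, nothing else changes.
unoptimized = analysis_weeks(1, 100)

# Same growth, but the jobs are split across 20 parallel workers (illustrative).
parallel = analysis_weeks(1, 100, workers=20)

# Parallel jobs plus code that runs 5 times faster (illustrative).
optimized = analysis_weeks(1, 100, speedup=5, workers=20)

for label, weeks in [("unoptimized", unoptimized),
                     ("parallel", parallel),
                     ("parallel + faster code", optimized)]:
    print(f"{label}: {weeks:.0f} weeks ({weeks / WEEKS_PER_YEAR:.2f} years)")
```

Under these assumptions, parallelizing alone brings 100 weeks down to 5, and combining it with faster code gets back to the original one week, without buying a single faster processor.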

3. Open science.
Despite the wonders of the internet, I would argue that most of us do not take the time to carefully edit and annotate our code, and make it publicly available to others. This is especially true for all of the "in-house" scripts used for data processing. These are small scripts that aren't stand-alone programs for some new type of analysis, just day-to-day analyses or parsers. But without these intermediate scripts an outside person cannot replicate our analysis exactly. I'm guilty of this myself. I do try to comment my code heavily, and I locally archive all the code for each project, but I don't always go the extra step of archiving it somewhere *public*. Going forward I am going to change this.

One option is to create a pipeline of all scripts with a clear README file, and deposit it in a public repository like GitHub. Another is to incorporate tools into a web-based platform that allows workflows, like Galaxy. A third option is to maintain the code on a local website, but this seems more like a back-up to me.

Grace Hopper chastised programmers for not appreciating the value of a microsecond. Her admonition rings as true today as it did over 20 years ago.


Ruth Hufbauer said...

My husband is a programmer and I'm a biologist. Most of what he does is work on making existing code more efficient. Saves money, saves time. He's also introduced me to the joys of GitHub, which is now where we have our household to-do list! Very functional for all sorts of things. All biologists should become familiar with it.

mathbionerd said...

Excellent to hear!

A lot of people in my current lab use GitHub. In my previous lab it was Galaxy.

I concur; these public repositories are excellent resources. I hadn't thought to use them for household communications, though. Nice! :)