Monday, April 8, 2013

ENCODE: The good, the bad, and the ugly

I've recently been engaged in discussions about the ENCODE project, and brought it up to my family and friends on Facebook, asking them whether they had heard about ENCODE. Despite the splash this project has made in the scientific community, and the media coverage it got, it was my opinion that most in the general public still haven't heard about the project, let alone any of the controversy surrounding it. I thank everyone for their honest responses. I'd like to highlight my grandpa's response, which is actually the most accurate, given he's never heard of the ENCODE project. He said, verbatim:
"the word encode i think has been used since ww1 and ww2, if you mean gencode thats a new word to me. (Hi-- Melissa, papaw)" - Levi Myers
First off, he is wonderful (Hi papaw! Thank you for commenting!!). Secondly, he's totally right; the word encode is thought to have originated in approximately 1919. Third, the set of people I sampled is obviously biased toward my friends and family. I consider them to be a very informed and intelligent group of people who are generally interested in science and learning new things. And most of them had never heard of the ENCODE project, so here is my introduction to the project and why it is making waves in science. I meant for this to be short, but it has grown a bit, so please bear with me. And please let me know if anything isn't clear.

What is ENCODE
All of our cells contain a set of DNA, called our genome. The genome contains much of the information for building "us". Although we all have small differences in our own DNA (that make us unique!), a few years ago scientists identified the sequence of a reference human genome. This is like identifying all the pieces of a bicycle. Some bicycles are different colors, some have some slightly different pieces, but all bicycles have the same basic parts.

But, just knowing all the parts isn't enough. We need to know what all of the pieces do. ENCODE is a science project funded by the US government with the goal of understanding what all the pieces of the human genome do: to make an ENCyclopedia Of DNA Elements.

I know I said, "the good, the bad, and the ugly", but I'm going to go out of order here.

The good about ENCODE
The first part of the ENCODE project is identifying that a region has a function. This project funded a group of labs to develop a library of these functional elements that will be made freely available, and can be a resource used by scientists all over the world. Recently (September 2012) the labs involved with ENCODE released phase 1 of their data, in 30 papers published in Nature (6 papers), Genome Biology (18 papers), and Genome Research (6 papers). Links to these papers can be found at the bottom of this page.

Previously, we knew that 1 to 2% of the human genome codes for genes that make proteins (which do much of the work in a cell). ENCODE identified with high confidence that about 20% of the rest (noncoding) has a definable function. Many of these regions control the activity of genes. The cohort of labs working on ENCODE also identified that another 60% of the noncoding human genome may show some small signal of being functional (they call it: biochemical activity) in at least one of the tissues studied, but the function of this 60% is unknown.

They also developed (in my opinion) a funny, concise summary* of genetic research called The Story of You, which is accessible to the general public (and narrated by Tim Minchin!!!):

*Summary up to 2:45
*Note that humans are not necessarily "more complex" than peas. Humans cannot turn sunlight into energy through photosynthesis.

The ugly about ENCODE
When the huge, coordinated set of 30 ENCODE papers came out, the authors made a decision to be generous in their interpretation of the word "function". Remember up above, where I said ENCODE identified that 20% of the human genome has a known function, and another 60% has "biochemical activity" but unknown function? Well, many scientists would conservatively report that they identified 20% of the human genome as functional. But function, like any word, can have many meanings. The ENCODE group chose to define "function" to mean any biochemical activity. With this definition of function, the ENCODE authors could say that 80% (20% + 60%) is "functional".
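For readers who like to see the arithmetic spelled out, here is a minimal sketch of how the headline number depends entirely on which definition of "function" you pick. The percentages are the rounded figures quoted in this post; the function and variable names are my own, made up for illustration, and are not anything from the ENCODE papers.

```python
# Rounded figures as quoted in this post (percent of the human genome).
KNOWN_FUNCTION = 20      # high-confidence, definable function
BIOCHEMICAL_ONLY = 60    # some biochemical activity, function unknown


def percent_functional(definition):
    """Return the percent called "functional" under a given definition.

    "strict": only regions with a known, definable function count.
    "broad":  any biochemical activity counts (the definition ENCODE chose).
    """
    if definition == "strict":
        return KNOWN_FUNCTION
    if definition == "broad":
        return KNOWN_FUNCTION + BIOCHEMICAL_ONLY
    raise ValueError(f"unknown definition: {definition!r}")


print(percent_functional("strict"))  # 20
print(percent_functional("broad"))   # 80
```

The data are the same in both cases; only the definition changes, and with it the number that ends up in the headlines.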

This 80% figure was very popular among the media, and very unpopular among many scientists (see this summary by Brendan Maher, and note the comments section), with several blog responses (here is a wonderful summary, and many updates), and published responses (e.g., Graur et al, Genome Biology and Evolution, and W. Ford Doolittle, PNAS). Many of the critiques of the ENCODE project I have read focus on the hype surrounding the project, specifically how the authors chose to liberally define "function", how the journalists were uncritical of these claims, and the implications for future science and science communication. Similarly, there are, and I imagine will continue to be, discussions among scientists about whether the 80% number is "dishonest", perhaps just an "exaggeration", or a valid number to publicize. Although I'm certainly happy other people are taking this up, I have a different concern about the ENCODE project.

The bad about ENCODE
My biggest concern with the ENCODE project is that the majority of the efforts have not studied primary human tissues. The National Human Genome Research Institute website describes the cell lines, and a small bit about the rationale for choosing them for the ENCODE project here. Wait, wait, wait, what is a cell line? How is it different from primary human tissues?

Primary human tissues are basically any healthy (read, non-cancerous) tissue that we could sample from your body (for example, cheek cells, or blood cells, or brain cells, or liver cells). Why don't we just study these primary tissues, you might wonder? Well, with current technology, it is very difficult to get these cells to survive long outside of the body. There are whole labs dedicated to learning how to grow and sustain primary human tissues.

If we can't use primary human tissues because they die too quickly, what can we use? We can use cell lines. Cell lines are "immortal" cells that are derived from primary human tissues, but have some mutations (either natural, like those in the HeLa cells you may have heard of, or induced) that allow them to continue to grow and reproduce outside of a human body. These cell lines have helped scientists make many advances, including studying the effects of viruses on human cells, developing vaccines such as the polio vaccine, and studying cancer and cancer treatments.

But cell lines are not healthy primary human tissues. The aspect of these cell lines that makes them useful for research (that they continue to replicate outside of the body) also means that they likely have differences in their DNA from primary human tissues. In fact, ENCODE notes:
"Effort was also made to select at least some cell types that have a relatively normal karyotype."
The DNA in each of our cells gets wound up into chromosomes. The "karyotype" refers to the number and structure of these chromosomes.
Normal human male karyotype (22 pairs of autosomes + X + Y)
Cell lines often have karyotypes that are not the same as primary human tissues. Human cells have 46 chromosomes (23 pairs, see above), but cell lines may have more, or the chromosomes may have unusual structures.

Thus, while these cell lines are derived from human tissues, they behave in unusual ways, and their DNA has unusual structure. Given these differences, I wonder how reasonable it is to assume that the fine-scale DNA patterns and functions identified in cell lines will actually mimic healthy human tissues.

The ENCODE project is continuing, and there are plans to investigate more healthy primary tissues. It will be wonderful to see how accurately cell lines reflect the expression patterns of the healthy tissues they derive from.


Dr. Drey said...

I'm still lurking on your posts and think they are excellent! :-)

Here I'd like to half comment half question since the deep details are far outside my expertise but much more up your alley -

Namely, as you note, the question of the definition of functionality. "Any biochemical activity" to me seems like a pointless definition in light of what we (admittedly relatively recently) know about proteomics, epigenetics, siRNA, etc. It would seem to me that pretty much ALL DNA should actually have SOME biochemical activity. It is, after all, a very biologically active molecule.

The better question for me would be "what is the functionality and is it significant" rather than "does it have ANY".

I can imagine it likely that there are sequences of DNA that are functional by accident - simply because it happens to be reactive - with or without specific compensation in cellular processes for the activity. I can also see much of it being of little or no significance. After all, water in the cell is also functional and necessary in the biochemical processes.

Maybe it is just the clinician in me talking (though of course I would love to ultimately characterize every single atomic movement of a cell for as perfect a model as our friend Heisenberg will allow), but it seems a better task to focus on a more narrow definition, with a specific focus on what the functionality is, to better build our understanding of the major interactive systems within a cell and to build a model we can use to make predictions at the -omic level. This would, as I see it, allow us to actually better understand and predict the effects of pleiotropic phenotypes which are, after all, the cause of most disease and pathology (and variation).

Am I totally off base in my (albeit brief and rough) thinking?

mathbionerd said...

Hi Dr. Drey,

I think you are spot on. That has been the crux of the "ugly" portion. Many scientists agree with you that understanding function as it affects the cell is more important than the broad definition of any biochemical activity decided upon by the main ENCODE consortium.

There are many labs in the ENCODE consortium that are working on understanding this more specific question. I cannot imagine how frustrated they must be that their research is being overshadowed by this debate.

Unlike "the human genome" (which we now know varies much more across humans than originally expected), we know that "the human transcriptome" is highly variable between tissues of the same individual, let alone across individuals. That doesn't mean that we shouldn't look for consensus across individuals and populations that can help make broad statements, but it is a much more nuanced process... kind of like clinical work. :)