I was very excited to get to teach one lesson for my advisor's course this week. The course is co-taught, and although it was the other professor's turn to teach, both were out of town, so my advisor asked me and another postdoc to step in for the week.
I had a blast! The course is methods in Statistical Genomics, mainly for statistics and mathematics students. The students were alert (especially impressive for a 9:30am class at Berkeley) and engaged. They asked questions and seemed genuinely interested in what I was talking about. My main goals for the class were to get them thinking about what kinds of data are available (to apply their statistics to), to familiarize them with where and how to access the data, and to get them thinking about the diversity of the questions they can ask.
Below is my outline for the class, and some references I handed out to the students. I took about an hour to go through the first three points, and my fellow postdoc spent the remaining half hour on the fourth point.
Introduction to Bioinformatics: Finding Data
1. What kind of data is there: Overview of the Genome
   1. Central Dogma
      1. DNA --transcribed--> RNA --translated--> Protein
         1. DNA is a double helix (forward/reverse), four nucleotides (Adenine, Guanine, Cytosine, and Thymine)
         2. RNA polymerase transcribes the DNA to form single strands of RNA (Adenine, Guanine, Cytosine, and Uracil)
         3. RNA is translated into protein
            1. read in triplets
            2. 64 possible nucleotide triplets (4^3), but only 20 amino acids, plus three stop codons
            3. starting with the start codon, ATG (Methionine), and ending in one of the three stop codons, TAG, TGA, TAA
   2. Coding regions
      1. Affected by selection
      2. Genes
         1. 5’, 3’ UTR, exons, introns
         2. multiple isoforms (major and minor, mostly similar exons)
      3. Transcripts
         1. miRNA, snoRNA, lncRNA
   3. Repetitive
      1. Transposable elements (SINEs, LINEs)
      2. Simple tandem repeats (microsatellites, mini-satellites)
      3. Copy number variants
   4. Neutral regions
      1. Noncoding
      2. Far or near genes?
      3. CpG sites – mutation rate is 15-30x higher than non-CpG sites
         1. Cytosine deaminated into a Uracil --> becomes a Thymine upon repair
2. What kind of data do you want?
   1. Across species: Comparative Genomics
      1. Multiple alignments – mammals, vertebrates, worms, flies
      2. What kinds of questions?
         1. How has evolved across species
         2. Has a gene family (opsins, olfactory, brain-related) expanded in certain lineages?
         3. Which genes are highly conserved across species? (Difficult to ask the opposite, because highly diverged genes will align poorly)
         4. What is the genome structure across viruses (influenza, HIV)?
         5. Gene content evolution (e.g. yeast – bread/beer or bacteria – gut microbiome)
   2. Within species: Population Genetics
      1. Data for multiple individuals
      2. Human
         1. Complete Genomics (fewer individuals, higher coverage)
         2. 1000 Genomes (more individuals, lower coverage)
         3. HapMap, dbSNP
      3. Non-Human
         1. dbSNP
         2. FlyBase, WormBase
      4. What kinds of questions?
         1. Demographic history – out of Africa, human dispersal around the world, mating patterns
         2. Identify genes subject to natural selection (high altitude adaptation or lactose digestion in humans, response to climate change)
         3. Effects of artificial selection (rice domestication, changes in dog genome due to selective breeding)
         4. Evolution of mimicry (poisonous versus nonpoisonous species – butterflies and frogs)
3. How to get the data?
   1. UCSC Genome Browser – Example downloading gene coding positions on chrX
   2. Galaxy – Example of interface, extracting multiple alignments for all genes on chrX
4. R example for parsing and analyzing files
   1. Background of the 1000 Genomes Project, explain VCF
   2. R code to extract .vcf
   3. PCA with subset of 1000 Genomes
   4. Clustering (UPGMA, Neighbor-Joining)
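To make the translation step from part 1 of the outline concrete (triplets, 64 possible codons, 20 amino acids, three stop codons), here is a minimal Python sketch. It was not part of the class. The function name `translate` and the example sequence are made up for illustration, and the table is the standard genetic code written in DNA letters (T rather than U), matching the stop codons listed in the outline.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order, one-letter amino acid codes, '*' = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map every ordered triplet of bases to its amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG until a stop codon or the end of the sequence.

    Hypothetical helper for illustration: assumes the input contains only
    A, C, G, T and reads the forward strand only.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break  # stop codon reached
        protein.append(aa)
    return "".join(protein)


stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE))                       # 64
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20
print(stops)                                  # ['TAA', 'TAG', 'TGA']
print(translate("CCATGGCTTAAGG"))             # MA
```

The many-to-one mapping (64 codons, 21 outcomes) is the redundancy that makes some mutations synonymous, which matters for the selection questions later in the outline.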
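The class example in part 4 was written in R, and I am not reproducing that code here. As a rough stand-in, the following Python sketch shows the general idea of pulling genotypes out of VCF text and turning them into alternate-allele counts per sample. Everything in the toy input is invented for illustration (sample names `S1` to `S3`, IDs `rs1` and `rs2`, positions), and `parse_vcf` is a hypothetical helper, not a library function. Real 1000 Genomes files are large and usually compressed, so in practice a dedicated VCF library would be a better choice.

```python
def parse_vcf(lines):
    """Return (sample_names, variants) from an iterable of VCF text lines.

    Each variant is (chrom, pos, ref, alt, dosages), where a dosage is the
    number of non-reference alleles in a sample, or None if the call is missing.
    """
    samples = []
    variants = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # locate GT in FORMAT
        dosages = []
        for sample_field in fields[9:]:
            gt = sample_field.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")  # phased or unphased
            if "." in alleles:
                dosages.append(None)  # missing genotype
            else:
                dosages.append(sum(a != "0" for a in alleles))
        variants.append((chrom, int(pos), ref, alt, dosages))
    return samples, variants


# Toy input, invented for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "X\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "X\t250\trs2\tC\tT\t.\tPASS\t.\tGT\t0|1\t./.\t0|0",
]

samples, variants = parse_vcf(EXAMPLE_VCF)
print(samples)      # ['S1', 'S2', 'S3']
for variant in variants:
    print(variant)
```

A samples-by-variants table of these 0/1/2 counts is the usual starting point for PCA or for computing distances between individuals.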
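For the clustering item, here is a small Python sketch of UPGMA (average-linkage hierarchical clustering) on a toy distance matrix. It is only meant to show the merge-and-average logic, not the code used in class; the labels and distances are made up, `upgma` is a hypothetical helper, and branch lengths are not tracked.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster `names` by UPGMA and return the tree as nested tuples.

    `dist` maps (a, b) pairs of names to their distance; each pair needs to
    appear once, in either order.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a = clusters.pop(a)
        size_b = clusters.pop(b)
        merged = (a, b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Toy distances, invented for illustration only.
names = ["A", "B", "C", "D"]
pairwise = {
    ("A", "B"): 2.0, ("C", "D"): 4.0,
    ("A", "C"): 10.0, ("A", "D"): 10.0,
    ("B", "C"): 10.0, ("B", "D"): 10.0,
}
tree = upgma(names, pairwise)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

Neighbor-Joining follows the same agglomerative pattern but corrects the distances for each cluster's average divergence before choosing a pair, so it does not assume a constant rate across lineages.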
References
----------------------------------------------------
Get Data
----------------------------------------------------
a. Nucleotide b. Gene c. dbGaP
d. dbVar e. dbSNP f. PubMed
----------------------------------------------------
Tools
----------------------------------------------------