Analyzing the Genome with Statistics


This is the third in a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. Today, we talk about the challenges of using statistics to analyze phylogenomic data.

Suppose you were a door manufacturer trying to figure out the average height of a population living in a certain country. You might conduct an experiment where you ask a group of people to report their height. You would then assemble those measurements in a data set. But in order to study this data set and draw conclusions you would need to analyze it using statistics. For example, how tall should your door be in order to fit 95% of people in the country? How many people do you need to survey to accurately represent the total population? These questions can be answered with statistical analysis.
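To make the door question concrete, here is a minimal Python sketch. The survey is simulated, and the mean of 170 cm and standard deviation of 10 cm are made-up assumptions rather than real measurements; the function names are hypothetical too. The sample-size formula is the usual normal-approximation rule of thumb, not something taken from the paper discussed in this article.

```python
import math
import random


def door_height_for_coverage(heights_cm, coverage=0.95):
    """Smallest surveyed height that at least `coverage` of the sample fits under."""
    ordered = sorted(heights_cm)
    # Position of the first value with at least `coverage` of the sample at or below it.
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[index]


def sample_size_for_margin(sd_cm, margin_cm, z=1.96):
    """People to survey so the mean is known to within +/- margin_cm (about 95% confidence)."""
    return math.ceil((z * sd_cm / margin_cm) ** 2)


# Simulated survey: 1000 people, assumed mean 170 cm and sd 10 cm (illustrative only).
rng = random.Random(0)
survey = [rng.gauss(170, 10) for _ in range(1000)]

print("Door height covering 95% of the sample:", round(door_height_for_coverage(survey), 1))
print("People needed for a 1 cm margin:", sample_size_for_margin(10, 1))
```

The first function answers "how tall should the door be?" and the second answers "how many people do we need to ask?", both under the stated assumptions.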

Because acquiring data from experiments can be costly and time-consuming, we often use small data sets to represent a larger population of interest. In our height experiment, we would not be able to ask every single person in the country his or her height. We would choose a group of people under the assumption that they accurately reflect the population as a whole. However, when we are trying to map out the evolutionary history of organisms using data from sequenced genomes (phylogenomics, which we talked about last time), we need to change our method of analysis.

Let’s look at the treeshrew, for instance. It looks like a rodent but actually shares some internal similarities with primates (studied by Sir Wilfrid Le Gros Clark in the 1920s), like brain anatomy and reproductive traits. To figure out if the treeshrew is more similar to rodents or primates, we could sequence its genome and, using statistics, compare its genes to those of rodents and primates. But typical statistical models are based on subsets of populations, while by definition, genomic sequencing gives us a complete data set - all of the treeshrew’s genes. These typical models may not be suitable for interpreting genomic data.

The treeshrew. Source: Wikipedia

Before reaching a conclusion about the treeshrew, or any set of data, scientists must consider precision and accuracy. Multiple measurements of the same quantity are precise if they are similar to each other; in other words, their variance is small. Measurements are accurate, on the other hand, if they are close to the true value of the quantity being measured. For genomic data, we need better statistical tools to ensure that the accuracy of our conclusions matches the precision characteristic of these huge data sets.
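The difference between precision and accuracy can be shown with a few numbers. In this small Python sketch, the true height of 170 cm and both lists of measurements are invented for illustration, and `describe` is a hypothetical helper.

```python
import statistics


def describe(measurements, true_value):
    """Return (spread, bias): a small spread means precise, a bias near zero means accurate."""
    spread = statistics.pstdev(measurements)
    bias = statistics.mean(measurements) - true_value
    return spread, bias


TRUE_HEIGHT_CM = 170.0  # assumed true value, for illustration only

# Tightly clustered, but centered about 5 cm too high.
precise_but_inaccurate = [175.1, 174.9, 175.0, 175.2, 174.8]
# Scattered widely, but centered on the true value.
accurate_but_imprecise = [160.0, 180.0, 165.0, 175.0, 170.0]

for name, data in [("precise but inaccurate", precise_but_inaccurate),
                   ("accurate but imprecise", accurate_but_imprecise)]:
    spread, bias = describe(data, TRUE_HEIGHT_CM)
    print(f"{name}: spread = {spread:.2f} cm, bias = {bias:+.2f} cm")
```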

Larger data sets provide more precise conclusions than smaller ones. For example, when we ask more people to report their height, we are more confident that our sample represents the variability of the actual population. Similarly, we analyze more genes in the treeshrew’s genome to increase our confidence that our conclusion is precise. However, our results might not necessarily be accurate; big data sets may lead us to draw incorrect conclusions with high confidence. The treeshrew’s genome contains some genes that are more similar to rodents’ genes and some that are more similar to primates’ genes (Fan et al., Nie et al., and Xu et al.), and with so much data we could find that the treeshrew is most similar to either group with high confidence. We need analysis tools that will tell us which genes give the correct answer.

Why are conclusions from data sometimes inaccurate? Statistical biases are external factors that produce consistent error in our measurements. Biases have many sources, including faulty experimental design, violation of assumptions made in analyzing the data, and errors in the data collection process. Bias in our height experiment might arise if we unintentionally ask the height of more women than men, causing our estimate of the average height to be lower. But in the case of phylogenomics, we are likely to have biases because of our relative lack of knowledge about the genome: we don’t always know which genes to analyze or the correct way to model the data. For example, some models assume that evolution followed the same pattern throughout all time, but this most likely was not the case.
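The height example of sampling bias can be worked through directly. This Python sketch assumes hypothetical average heights of 163 cm for women and 176 cm for men, and a population that is half women; none of these figures come from real data.

```python
def mean_height(share_women, mean_women_cm, mean_men_cm):
    """Expected average height of a group with the given share of women."""
    return share_women * mean_women_cm + (1 - share_women) * mean_men_cm


# Hypothetical group averages, chosen only for illustration.
MEAN_WOMEN_CM = 163.0
MEAN_MEN_CM = 176.0

population_mean = mean_height(0.5, MEAN_WOMEN_CM, MEAN_MEN_CM)      # half women
skewed_sample_mean = mean_height(0.7, MEAN_WOMEN_CM, MEAN_MEN_CM)   # 70% women surveyed
bias = skewed_sample_mean - population_mean

print(f"Population mean:    {population_mean:.1f} cm")
print(f"Skewed sample mean: {skewed_sample_mean:.1f} cm")
print(f"Bias:               {bias:+.1f} cm")
```

Surveying more people with the same 70/30 split would not remove this error; it would only make the biased estimate look more certain.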

Furthermore, the process of genome sequencing and analysis itself may create error, especially in the reconstruction of the genome and the alignment of genes for comparison. If we are comparing the genome of the treeshrew to the genomes of primates and rodents, it is difficult for us to know which genes are correlated between species when we are looking at a data set of billions of points. We might use a probability model to determine correlated genes, but all models are at least somewhat incorrect and introduce bias. In smaller data sets, biases are offset by a low precision and relatively small confidence in reaching conclusions. However, in genomic-size data sets, even small biases can be amplified and lead to high confidence in the wrong answer and incorrect phylogenetic trees.
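One way to see why a small bias matters more in a large data set is to compare the bias with the width of a typical 95% confidence interval. The Python sketch below is a simplified illustration using the normal approximation; the 1 cm bias and 10 cm standard deviation are assumed values, and the function names are hypothetical.

```python
import math


def interval_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean of n values."""
    return z * sd / math.sqrt(n)


def typical_interval_covers_truth(bias, sd, n, z=1.96):
    """True if an interval centered on the biased expectation still reaches the true value."""
    return interval_half_width(sd, n, z) >= abs(bias)


BIAS_CM = 1.0   # assumed systematic error
SD_CM = 10.0    # assumed spread of individual measurements

for n in (25, 400, 10_000, 1_000_000):
    half_width = interval_half_width(SD_CM, n)
    covers = typical_interval_covers_truth(BIAS_CM, SD_CM, n)
    print(f"n = {n:>9}: interval = estimate +/- {half_width:.3f} cm, covers truth: {covers}")
```

With few measurements the interval is wide enough to hide the bias. As the number of measurements grows, the interval shrinks around the biased value and confidently excludes the true one.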

When analyzing phylogenomic datasets, we need to use analyses that are appropriate for large data sets. This will unlock the potential of phylogenomic research to draw unbiased conclusions, like figuring out the correct phylogenetic classification of the treeshrew (still a topic of controversy among evolutionary biologists). However, phylogenomics is such a young field that these tools do not yet exist. When they are developed, we can increase our chances of correctly classifying species’ relationships and discovering the true history of evolution.

For more detail, check out: “Statistics and Truth in Phylogenomics”, Kumar, Sudhir et al. Molecular Biology and Evolution (2011).

References:

Fan, Yu, et al. “Genome of the Chinese tree shrew.” Nature communications 4 (2013): 1426.

Nie, Wenhui, et al. “Flying lemurs – the ‘flying tree shrews’? Molecular cytogenetic evidence for a Scandentia-Dermoptera sister clade.” BMC Biology 6.1 (2008): 18.

Xu, Ling, et al. “Evaluating the Phylogenetic Position of Chinese Tree Shrew (Tupaia belangeri chinensis) Based on Complete Mitochondrial Genome: Implication for Using Tree Shrew as an Alternative Experimental Animal to Primates in Biomedical Research.” Journal of Genetics and Genomics 39.3 (2012): 131-137.

Our next installment will cover some misused terminology in phylogenomics. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

… because it (gasp!) uses the word, “abortion.” But wait – there is a glimmer of hope: The new superintendent, who was ordered to offer a plan for redacting the textbooks, says that the books comply with the law already and instead plans to hold a public discussion.

Meanwhile, as a service to the affected high-school students, Rachel Maddow has posted the offending page on a blog, ArizonaHonorsBiology.com, which her show apparently owns. If you are curious or have a prurient interest, you may also see the verso of The Page, as well as several other pages on human reproduction.

For the record, the book is Reece, et al., Biology: Concepts and Connections.

Lenticular clouds

IMG_1154Cloud_600.JPG

Interesting cloud formation, Boulder, Colorado. The camera is facing south, and the wind is coming from the west, or right.

One hour later, in Golden,

Philae craft lands on comet


Rosetta headquarters announced a few moments ago that the Philae lander is now sitting on the surface of the comet and transmitting data. Unfortunately, the European Space Agency is not exactly releasing a trove of pictures. I know this is not biology, but where did you think those hydrocarbons came from in the first place?

Phylogenomics: Deciphering a Billion-Piece Puzzle


This is the second in a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. Today, we talk about phylogenomics, the application of whole genome sequencing to understand evolutionary relationships among species.

DNA Chemical Structure. Source: Madeleine Price Ball

The haploid human genome is 3.2 billion DNA bases long, and each base can be one of four nucleotides: A, T, C, and G. Uncoiled, the DNA in a single human cell would be 2 meters long, and the DNA in a human body would stretch from the sun to Pluto multiple times. With 3.2 billion bases, each person’s genome is unique, and this plays an essential role in shaping our physical and mental individuality. However, despite being unique, human genomes are all very similar to one another, due to our shared ancestral heritage. Similarly, species that share a recent ancestral heritage also have similar genomes. Species that are distantly related are likely to demonstrate significant differences in their genomes. This is why, as we discussed last week, evolutionary biologists compare traits and genes to determine the relationships of different species.

Unfortunately, some genes give us the wrong answer about how species are related. A section of a gene can be identical in two species because of independent mutations. After all, any given base can only mutate into one of three other bases. By chance, the same mutation can happen twice, or different mutations can produce the same sequence. Consider two species that are distantly related; one contains an AGA fragment, while the corresponding fragment in the other species is TGT, i.e. they differ in 2 out of 3 positions. As these species evolve, by chance the first species may experience a change in the first position so that AGA becomes TGA, and the second species may experience a change in the third position so that TGT becomes TGA. Now these two sequences look the same, so you might think the species share a recent common ancestor; however, it is only an accident of biology that they appear closely related. Because some fragments may be identical due to independent mutations and not shared ancestry, estimating species relationships using whole genomes is better than using just a few genes. The more information we have, the more likely we are to figure out species’ relationships correctly.
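The AGA and TGT example can be checked with a few lines of Python. The sequences and the two mutations are the ones described in the paragraph above; the helper names `hamming_distance` and `mutate` are hypothetical, and counting mismatched positions is only the simplest possible measure of sequence difference.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mutate(sequence, position, new_base):
    """Return a copy of `sequence` with the base at `position` (0-based) replaced."""
    return sequence[:position] + new_base + sequence[position + 1:]


species_1 = "AGA"
species_2 = "TGT"
print("Differences before:", hamming_distance(species_1, species_2))

# Independent mutations: first position in species 1, third position in species 2.
species_1_later = mutate(species_1, 0, "T")
species_2_later = mutate(species_2, 2, "A")
print(species_1_later, species_2_later)
print("Differences after:", hamming_distance(species_1_later, species_2_later))
```

The two fragments end up identical even though neither mutation was inherited from a shared ancestor, which is why a single short fragment can mislead.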

The cost to sequence whole genomes has fallen from $100 million to $1000 in just the past twelve years. It now takes days to sequence a genome compared to the 13 years it took for the first human genome. The challenge now is not to obtain the data but to compare all the billions of base pairs in one genome to those in another. Current sequencing methods, while fast, can only read the genome by dividing it into millions of short fragments, which must be reassembled like an enormous puzzle. Researchers then have to figure out which genes correspond to one another in different species’ genomes. These comparisons are challenging because genes in one genome might be in a different order, on different chromosomes, or missing completely in another species’ genome.

Biologists are beginning to use genomic information to understand how species are related and to measure how fast or slowly different genes evolve. This in turn allows us to understand how evolution happens. For example, using genomic information we can figure out how genes mutate, characterize and diagnose genetic diseases, and track harmful pathogens. But before that can happen, we need to address the difficulties of analyzing these large genomic datasets. You might think that more data is always better, but having a lot of data can lead us to have very high confidence in the wrong answer. In a pool of thousands of genes, we need to find the ones that tell us the right answer.

Next week, we’ll discuss statistical challenges associated with big data analysis, especially as it relates to phylogenomics. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

Or, as Right-Wing Watch puts it, Neo-Confederate Republican Michael Peroutka Wins Maryland Election. Mr. Peroutka operates the family foundation that donated the allosaurus fossil to the Creation “Museum,” as we reported here. I will not synopsize the Right-Wing Watch article, but I think that you will find that being a neo-Confederate is the least of Mr. Peroutka’s problems; if he is not completely crackers, he is giving a convincing imitation.

I started this post thinking I’d write a review of Andreas Wagner’s recent book “Arrival of the Fittest: Solving Evolution’s Greatest Puzzle” (links below), an engrossing book about how biological innovation arises from the structure of metabolic, genotype, and protein networks, and how robustness–the stability of phenotypes in the face of underlying genetic variability–is critical in evolutionary innovations. But there are several excellent reviews already out there, so another would be redundant. I’ll mention only a couple of points I think worth emphasizing below the fold.

Delicious sex chromosomes


Plants have sex? Yes, they totally do.

A brief overview: 
Plants have female reproductive organs (carpels) and male reproductive organs (stamens), but several different ways of determining sex. There are two main groups of seed-producing plants.

Gymnosperms are plants without covered seeds, and include those that produce cones. They are split, with about 75% exhibiting monoecy (having male and female sex organs on the same plant) and 25% exhibiting dioecy (having separate male plants and female plants).

Photo by Muhammad Mahdi Karim, via Wikimedia Commons

Phelsuma laticauda


Photograph by Tony Gamble.

Photography contest, Honorable Mention.

Gamble.Phelsuma_laticauda_dorsum.jpg

Phelsuma laticauda – gold dust day gecko.

Kent Hovind in trouble again


I haven’t got time to investigate further, but Hovind watchers might be interested that Mr. Hovind (Dr. Dino) has been charged with filing a lien on property that had already been forfeited. Or something. A Forbes columnist, Peter Reilly, suggests that the government is piling on, and I suspect he is right; you may read about it here.

Acknowledgement. Link provided by the truly indefatigable Dan Phelps.

The Family Tree of Life


In the next few weeks, we’ll be posting a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. We start with a background on phylogenetic trees.

Imagine you could go back in time and meet your great grandmother or even your great-great-great-great-great grandmother, when they were your age. Would they look like you? Or would they look more like your siblings or cousins? Maybe you would all look a little different. Scientists try to figure out how the distant ancestors of apes, other animals, plants, and all organisms living today looked and behaved, much in the same way that people use a family tree to trace their ancestry.

primate-family-tree-780x520_0.gif

The common ancestor of great apes lived about 18 million years ago. Source: Smithsonian National Museum of Natural History http://humanorigins.si.edu/evidence/genetics

In evolutionary biology scientists use a type of tree called a “phylogenetic tree” to organize the history of how species descended from common ancestors. The closer two species are to a common ancestor on the phylogenetic tree, the more closely the two are related.

Take the phylogenetic tree of primates, for example. The common ancestor of apes lived about 18 million years ago. But over time, this one group branched off to form many different species, including humans, which have their own separate branch on this tree.

How did so many unique species develop from one ancestor? New branches formed by a process known as divergence. When groups of ancient organisms became geographically isolated from one another, either through migration or geologic events like earthquakes, each group began to develop its own unique set of physical attributes. Sometimes, by chance, a change in a characteristic enabled an individual to survive better in its environment and produce more offspring.

Perhaps individuals in one group with larger arms were better able to break open the hard-shelled fruits that were common in one region, while some individuals in another group had the ability to travel more easily through tall trees that offered protection from predators. Whatever the reason may have been, selection favored genetic differences that improved survival. Over time, this gradual process of isolation and selection produced distinct species, which in turn branched into more species.

The end result of divergence is many species, related in a tree-like fashion, and we display these relationships using phylogenetic trees. Scientists now use increasingly sophisticated methods to determine how species were related and build phylogenetic trees. In the past, scientists built these trees simply by comparing physical traits, like how many limbs an organism has or whether it has a tail. But with the recent surge in fast and affordable gene sequencing technologies, researchers today can directly compare species’ DNA to determine how they are related.

But analyzing entire genomes, with billions of DNA base pairs, presents its own unique set of challenges, and researchers often struggle to determine if the DNA differences they find between species are truly significant or are simply due to common variability. As computer software and statistical analysis become more adept at handling these challenges, our understanding of species’ relationships could change — providing exciting new insights into our family tree of life.

Check back next week when we discuss the differences between studying small and large datasets, and the challenges associated with big data analysis. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

According to an article in Science today, a creationist group has booked a room for a conference at Michigan State University. Science is more discreet than I have to be, but it appears that they duped a student group into booking a room for them, and they are scheming to hold another conference at the University of Texas at Arlington.

Science writes that the conference, scheduled for November 1 and

called the Origins Summit, is sponsored by Creation Summit, an Oklahoma-based nonprofit Christian group that believes in a literal interpretation of the Bible and was founded to “challenge evolution and all such theories predicated on chance.” The one-day conference will include eight workshops, according to the event’s website, including discussion of how evolutionary theory influenced Adolf Hitler’s worldview, why “the Big Bang is fake,” and why “natural selection is NOT evolution.” Another talk targets the work of MSU biologist Richard Lenski, who has conducted an influential, decades-long study of evolution in bacterial populations.

All that old familiar nonsense.

Acknowledgment. Thanks to the indefatigable Dan Phelps for the tip.

IMG_4248Eclipse_600.JPG

Pinhole-camera images of solar eclipse formed by spaces between leaves in canopy. According to Jon Grepstad, this phenomenon was explained by Aristotle. The eclipse is just ending; the picture was as close to total as it got here (Boulder, Colorado).

Aeshna cyanea


Photograph by Marilyn Susek.

Photography contest, Honorable Mention.

Susek.Dragon_Fly.jpg

Aeshna cyanea – southern hawker.

Beginning this week, we will run photographs every other Monday, so no picture next week; we no longer have enough honorable mentions and other miscellaneous photographs to continue posting a photograph every week. But polish your lenses (very carefully) and keep an eye out for the contest in the summer.

Case Western steps up, rejects House Bill 597


As I noted a few weeks ago (see here and here), Ohio House Bill 597 cuts the guts out of science education in the public schools by emphasizing “scientific knowledge” and eliminating the teaching of “science processes”. As I argued, the process of science is central to how one justifies claims about the world in science, and eliminating reference to those processes eviscerates science education.

The Faculty Senate of Case Western Reserve University agrees. It has adopted a resolution that speaks directly to that issue. The resolution is below the fold.

Cupido comyntas


Photograph by Robin Lee-Thorp.

Photography contest, Honorable Mention.

Lee-Thorp.Eastern Blue.JPG

Cupido comyntas – eastern tailed-blue butterfly.

According to reports by Linda B. Blackford in the Lexington Herald-Leader and Tom Loftus in the Louisville Courier-Journal, here and here, Kentucky authorities have noticed the apparently deceptive hiring practices of AIG and Ark Encounter, and sent a letter informing the proprietors of the Ark Park,

Therefore we are not prepared to move forward with consideration of the application for final approval [of a tax incentive] without the assurance of Ark Encounter LLC that it will not discriminate in any way on the basis of religion in hiring for the project and will revise its postings accordingly.

Update, October 9, 2014, noonish. According to a Reuters dispatch, AIG has said that it will fight for its “religious rights after state officials warned he could lose millions in potential tax credits if he hires only people who believe in the biblical flood.” In a not entirely veiled threat, Mike Zovath told Reuters, “We’re hoping the state takes a hard look at their position, and changes their position so it doesn’t go further than this,” and argued that the state had added a requirement by prohibiting religious discrimination. The state has responded by saying, “We expect all of the companies that get tax incentives to obey the law.”

Larus delawarensis

IMG_1104Gull_600.JPG

Larus delawarensis – ring-billed gull, Boulder, Colorado. There is right now a fairly large flock at Walden Ponds east of Boulder. They are too far away to get a picture, unless you like snapshots of an array of gray-and-white ellipses. But this one very kindly landed in a parking lot and posed long enough to enable this portrait.

Freshwater: It is finished


The Supreme Court of the U.S. today denied John Freshwater’s request (PDF) for a writ of certiorari. In other words, the Court declined to hear his case. After a legal saga that spanned more than six years and involved a two-year-long administrative hearing, a Court of Common Pleas review, an appeal to the state Court of Appeals, and an appeal to the Ohio Supreme Court, we are finally done. After an administrative hearing that generated over 6,000 pages of transcript, and after costing the school district on the order of $1m in direct costs (not counting the indirect costs of teacher and administrator time), the case is finally at an end.

One of these days I may write a retrospective piece on the case, but for now I’m simply glad that the damned thing is over.

On August 14, William Dembski spoke at the Computations in Science Seminar at the University of Chicago. Was this a sign that Dembski’s arguments for intelligent design were being taken seriously by computational scientists? Did he present new evidence? There was no new evidence, and the invitation seems to have come from Dembski’s Ph.D. advisor Leo Kadanoff. I wasn’t present, and you probably weren’t either, but fortunately we can all view the seminar, as a video of it has been posted here on Youtube.

It turns out that Dembski’s current argument is based on two of his previous papers with Robert Marks (available here and here) so the arguments are not new. They involve considering a simple model of evolution in which we have all possible genotypes, each of which has a fitness. It’s a simple model of evolution moving uphill on a fitness surface. Dembski and Marks argue that substantial evolutionary progress can only be made if the fitness surface is smooth enough, and that setting up a smooth enough fitness surface requires a Designer.

Briefly, here’s why I find their argument unconvincing:

  1. They consider all possible ways that the set of fitnesses can be assigned to the set of genotypes. Almost all of these look like random assignments of fitnesses to genotypes.
  2. Given that there is a random association of genotypes and fitnesses, Dembski is right to assert that it is very hard to make much progress in evolution. The fitness surface is a “white noise” surface that has a vast number of very sharp peaks. Evolution will make progress only until it climbs the nearest peak, and then it will stall. But …
  3. That is a very bad model for real biology, because in that case one mutation is as bad for you as changing all sites in your genome at the same time!
  4. Also, in such a model all parts of the genome interact extremely strongly, much more than they do in real organisms.
  5. Dembski and Marks acknowledge that if the fitness surface is smoother than that, progress can be made.
  6. They then argue that choosing a smooth enough fitness surface out of all possible ways of associating the fitnesses with the genotypes requires a Designer.
  7. But I argue that the ordinary laws of physics actually imply a surface a lot smoother than a random map of sequences to fitnesses. In particular if gene expression is separated in time and space, the genes are much less likely to interact strongly, and the fitness surface will be much smoother than the “white noise” surface.
  8. Dembski and Marks implicitly acknowledge, though perhaps just for the sake of argument, that natural selection can create adaptation. Their argument does not require design to occur once the fitness surface is chosen. It is thus a Theistic Evolution argument rather than one that argues for Design Intervention.
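The contrast between a “white noise” surface and a smoother one can be illustrated with a toy hill climber in Python. This is only a sketch under stated assumptions, not Dembski and Marks’s model or any calculation from their papers: genotypes are 10-site binary strings stored as integers, the smooth surface is a purely additive one (fitness is the number of 1s), the rough surface assigns an independent random fitness to every genotype, and evolution is caricatured as steepest-ascent moves between single-site neighbors. All names are hypothetical.

```python
import random

N_SITES = 10  # assumed genome length for this toy example


def neighbors(genotype, n_sites):
    """All genotypes that differ from `genotype` at exactly one site."""
    return [genotype ^ (1 << site) for site in range(n_sites)]


def hill_climb(start, fitness, n_sites):
    """Move to the fittest single-site neighbor until no neighbor is fitter."""
    current, steps = start, 0
    while True:
        best = max(neighbors(current, n_sites), key=fitness)
        if fitness(best) <= fitness(current):
            return current, steps  # local peak reached
        current, steps = best, steps + 1


def additive_fitness(genotype):
    """Smooth surface: each site contributes independently to fitness."""
    return bin(genotype).count("1")


def make_white_noise_fitness(n_sites, seed=0):
    """Rough surface: every genotype gets an unrelated random fitness."""
    rng = random.Random(seed)
    table = [rng.random() for _ in range(2 ** n_sites)]
    return lambda genotype: table[genotype]


smooth_peak, smooth_steps = hill_climb(0, additive_fitness, N_SITES)
noisy = make_white_noise_fitness(N_SITES, seed=1)
noisy_peak, noisy_steps = hill_climb(0, noisy, N_SITES)

print("Additive surface: steps taken =", smooth_steps)
print("White-noise surface: steps taken =", noisy_steps)
```

On the additive surface the climber keeps improving until it reaches the best genotype. On the white-noise surface it stops at whatever local peak happens to be nearby, which is the stalling behavior described in point 2.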

That’s a lot of argument to bite off in one chew. Let’s go into more detail below the fold …
