A Human Genome Project
Why HapMap is only half the story

by Natasha Keith

Even before the first full human genome was completely sequenced, debate raged about how to mix genomics and medicine. The scientific community’s access to such a vast amount of genetic information could have staggering consequences.

Imagine this: your doctor takes a blood test, and two weeks later hands over your entire genome sequence on a single DVD. You know that 13 generations separate you from your next-door neighbor. When you’re ill, your doctor prescribes you a “miracle drug” effective for 10 percent of the population but lethal for the rest, comfortably assured that you will not experience unfortunate side effects due to your genetic makeup. From birth, your kids are prescribed additional medical checkups for those illnesses they have a higher probability of contracting. It will be an era of personalized and preventive medicine, but it won’t necessarily be easy.

Although the process of geneticizing medicine has already started, the scenarios above haven’t quite arrived because they require the full sequence of your DNA, and the know-how to interpret it. A strand of DNA resembles a twisted ladder, with rungs that consist of four molecules known as DNA bases: adenine, thymine, guanine, and cytosine, often shortened to A, T, G, and C for simplicity. The ordering of these DNA bases creates a genetic code that carries the recipe for all tissues and instructions for the dynamic processes that make up a human being.

But before interpreting the code, finding the secret to life, and revolutionizing the medical establishment, the full sequence of DNA bases needs to be determined (“sequenced”) in a large number of people, and that is not a simple task. Currently, sequencing is too slow and expensive for wide-scale use, but this is changing rapidly. The first human genome was sequenced in 2003 by the Human Genome Project, after 13 years of work and an estimated three billion dollars.
Building on this original genome project, researchers recently sequenced the full genome of Nobel laureate James Watson. The majority of the work was completed in two months, for just under a million dollars. The scientific community figures that when the cost of sequencing a genome falls to $10,000, it will be worthwhile to sequence patients’ genomes as a general practice.

But while it is known that genetic variations can lead to different susceptibilities to disease, the work that defines which genetic variations cause susceptibility is still in its early phases. Correlating an individual’s genetic sequence with a particular disease is only useful if the relationship is well characterized in a large, diverse group of people. Such large-scale questions reach into many different fields, as is exemplified at UC Berkeley and its affiliated institutions, where researchers from computer science, mathematics, computational biology, genomics, public health, and sociology are involved. There’s something here for everyone, but they all have one thing in common: they need an extensive, high-quality dataset.

Enter the HapMap

In spite of the intimidating size of the human genome and the limitations of full sequencing technologies, there are some simplifying features in human genetics that make it possible to compare portions of human chromosomes without full sequencing. First, all humans are identical across 99.5 percent of the genome. This leaves only 0.5 percent to account for genetically determined differences in appearance, behavior, and health. Many of these differences in the sequence of our DNA come in the form of single deviant A’s, G’s, T’s, and C’s, called Single-Nucleotide Polymorphisms (or SNPs, pronounced “snips”), which occur about once every 1,200 DNA bases when comparing two individuals. Fortunately for scientists, these varying bases of DNA are not random, but rather tend to occur together in “blocks”, known as haplotypes.
These blocks enable researchers to look at just a few significant “marker” SNPs in a person’s genome, guess the identity of hundreds of hidden SNPs, and determine the haplotype, much as a reader can guess the identity of words that are mssng vwls. As a consequence, researchers need much less sequencing data to determine the haplotype, making the process faster and financially feasible; the number of marker SNPs is estimated at 300,000–600,000, compared to the total 10 million common SNPs.

A second simplifying feature is that the number of common haplotypes is small: for a given gene there are often two or three predominant haplotypes that describe about 80 percent of the population. As a result, if one could systematically sequence the marker SNPs across the human population, it would render a “map” of common haplotypes in humans that can be analyzed for disease links and compared across the globe.

The interest in making such a database drove the formation of the International HapMap (“Haplotype Mapping”) project in 2002 (see sidebar). This multinational SNP dataset is made publicly available to the scientific community, which can then do computational, biological, or medical research using the data. By the end of the project’s first phase, the number of known SNPs in scientific databases had increased by a factor of 10. Access to such a database provides a new opportunity for scientists to ask more quantitative questions about the genetics behind common diseases.

Finding the connections: mapping haplotypes to disease

Len Pennacchio, head of the Genetic Analysis Program of the Joint Genome Institute (JGI) in Walnut Creek and an LBL senior research scientist, uses direct sequencing to try to understand relationships between haplotypes and disease. In collaboration with medical centers in Copenhagen, Ottawa, Texas, and Minnesota, he published a paper in the June 8 issue of Science describing the correlation between a certain haplotype and heart disease.
The researchers used data from HapMap to detect these exciting associations, but this approach is unusual for Pennacchio, who usually relies on his own sequencing technologies to fine-tune the SNP dataset to his needs. “There are two camps, really,” Pennacchio explains, “those that are satisfied with knowing an association between a gene and a disease exists, and those that also want to find all the details about why the association exists [at the biochemical level]. I’ve heard extreme and persuasive arguments on both sides.”

Pennacchio falls into the second camp of researchers. His work tends to delve into the more subtle details of biochemical signaling and transcription and relies less on association studies. But he points out that his and other recent papers owe their success to the fact that association studies give researchers a “foothold” on the genetics: Now that they’ve found an association between a set of SNPs and heart disease, they can start to ask more detailed questions about the biochemistry underlying the association. And since their approach did not bias the search for markers toward previously suggested heart disease genes, it has the potential to open up new relationships for exploration.

For the majority of Pennacchio’s work, HapMap hasn’t been specific enough, or had enough uncommon SNPs, to really probe biochemical questions at the molecular level. He looks forward to the wide-scale sequencing era, which would provide the ideal database for such detailed work; for detail-oriented researchers, HapMap is just a stepping stone.
“It was the most logical thing to do at the time,” Pennacchio says, “and what we’ve learned about haplotype structure from HapMap will be useful in the next generation of sequencing.”

But the most lasting result of the HapMap project, which is just starting to come to light now, will be substantial support for a grand hypothesis: Common differences in genetics do not appear to explain the incidence of the majority of common diseases. Pennacchio says that the resulting associations between genetics and disease may be a little too subtle and too complex to be conclusive. Although a few diseases have found successful “hits” in the HapMap database (such as macular degeneration, where individuals with the risk-producing sequence are about seven times more likely to have the disease), present research shows that the majority of common diseases don’t appear to associate strongly with a single haplotype. Instead, diseases are more likely to be associated with a large number of haplotypes, making their individual effects small.

“We have a lot of little hitters,” Pennacchio says, “but we want the big hitters. And a lot of that may be in the uncommon SNPs and other unusual features of the human genome. We’ll see those when full sequencing becomes cheaper and faster, and that will be sooner than we think.”

The computational side of HapMap

The intensive data analysis done by researchers like Pennacchio requires advanced computational resources. Researchers at the International Computer Science Institute (ICSI), a non-profit research institute affiliated with UC Berkeley, write programs to identify haplotypes within the human genome, catalog genome rearrangements, and characterize marker SNPs. Dr. Eran Halperin, a senior researcher at ICSI, writes computational algorithms to enhance the ability to identify marker SNPs accurately and sensitively.
The ability to use sequenced marker SNPs to predict a given unsequenced haplotype makes the HapMap project possible and financially feasible, but connecting markers to their derivative haplotypes is not a simple process. “With a dataset so large,” Halperin explains, “we need extremely accurate methodology to determine the correct haplotypes. False correlations hurt everyone in the field.”

To prevent such errors, Halperin’s research focuses on “multimarker methods”, which use a collection of SNPs, instead of single SNPs, to characterize a haplotype. His paper in the April issue of American Journal of Human Genetics describes a new multimarker method called WHAP (“weighted sum of haplotype frequency differences”), which evaluates large numbers of SNPs and determines which sets of SNPs correlate to a given haplotype. It’s rather like a library: It is easier to find a book in a library by searching the titles of the books (marker SNPs) than by searching the chapter titles of the full library (all SNPs). But two books with the same title may not reliably contain the same chapters. If one searches for the author, publisher, and title simultaneously, however, the probability of finding a unique book is increased. In other words, by using multiple marker SNPs to identify a particular haplotype, false identifications are avoided.

The WHAP method has been shown to improve the accuracy of finding marker SNPs in all the HapMap populations sampled, with datasets from both of the competing sequencing technologies. To measure the success of marker methods, researchers report the coverage: the percentage of SNPs that are known or whose identity can be inferred from markers. For a hypothetical dataset in which all SNPs are characterized, the coverage would be 100 percent. In one of the datasets, WHAP is able to raise the correct identification of SNPs from an average of 75 percent to 83 percent in European and Asian populations, and from 61 percent to 74 percent in the Yoruba population.
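To make the coverage measure described above concrete, here is a toy Python sketch. It is not the WHAP method; the reference panel, the function names `predictable` and `coverage`, and the all-or-nothing prediction rule are assumptions chosen purely for illustration (real tagging methods typically rely on a statistical correlation threshold rather than perfect prediction). A SNP counts as covered if it is a marker itself or if its allele is fully determined by some marker set in the panel.

```python
from collections import defaultdict

# Hypothetical reference panel: each row is one haplotype, each column one SNP
# (alleles coded 0/1). Columns 0 and 1 play the role of genotyped marker SNPs;
# columns 2-4 are "hidden" SNPs we would like to infer.
PANEL = [
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 1, 0),
    (1, 1, 1, 1, 0),
]


def predictable(panel, markers, target):
    """Return True if the marker alleles always imply a single target allele."""
    seen = defaultdict(set)
    for hap in panel:
        key = tuple(hap[m] for m in markers)
        seen[key].add(hap[target])
    return all(len(alleles) == 1 for alleles in seen.values())


def coverage(panel, marker_sets):
    """Fraction of SNPs that are markers or predictable from some marker set."""
    n_snps = len(panel[0])
    typed = {m for markers in marker_sets for m in markers}
    covered = sum(
        1
        for snp in range(n_snps)
        if snp in typed
        or any(predictable(panel, markers, snp) for markers in marker_sets)
    )
    return covered / n_snps


# Each marker used on its own, versus both markers used jointly.
print(coverage(PANEL, [(0,), (1,)]))  # 0.8
print(coverage(PANEL, [(0, 1)]))      # 1.0
```

In this toy panel, SNP 4 depends on the combination of both markers, so single markers cover 4 of the 5 SNPs while the marker pair covers all 5. That mirrors the intuition behind multimarker methods, though the real gains reported for WHAP come from a statistical procedure that this sketch does not attempt to reproduce.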
The Yoruba population is harder to cover because African populations have a greater genetic diversity, meaning that more markers are needed to address a more complex set of possible haplotypes. This curious detail is evidence of evolutionary history: Humans originated in Africa, and small offshoots of the population later left the continent, starting new populations with less diverse gene pools in Europe and Asia. We still carry the genetic footprints of this winnowing process.

Integrative biology graduate student Hua Chen studies population genetics, and how genes get positively selected and distributed among a population. He is working on characterizing the “hitchhiking effect”, in which a section of DNA is host to a gene that is favored by natural selection. Because this “positively selected” gene is so beneficial, it sweeps through the population. Other genes in the vicinity, whether beneficial or not, get dragged along, effectively “hitchhiking” along with their VIP neighbor. Many hitchhiking genes have a particular SNP signature that can be identified with computational algorithms, and their histories traced back generations. In addition, the identification of sections of the genome that bear the signature of hitchhiking can help researchers pinpoint the positively selected genes themselves. One example of this phenomenon can be seen in the region of DNA that surrounds the Duffy red cell antigen gene. The hitchhiking DNA, which is present at high frequency in Africans, is thought to have been inherited along with a mutant copy of the Duffy gene that confers resistance to malaria.

Chen recognizes that HapMap, as a dataset, has some disadvantages. “Although the HapMap project has already succeeded in making a new generation of markers for genome-wide association studies of human diseases...unfortunately, the data is generated in a way [that] is not ideal for population genetic studies,” he says.
However, Chen also notes that HapMap has provided “the largest polymorphism dataset from multiple populations up to now [and is] encouraging further population genetic studies and the production of more large-scale population genetic data.”

The ever-evolving questions of genetics and race

The idea of tracking genetic patterns in living populations naturally brings up the genetics of race. Troy Duster, a UC Berkeley professor emeritus of sociology, is directly involved in the HapMap database discussion of the use of race in research science. From the start, the HapMap project actively incorporated bioethicists and sociologists to help deal with the social side of what some people worry will be a system of genetic determinism.

“It’s a hard problem,” Duster says. “We’ve been stuck with thinking of things in terms of race since the 18th century…and that’s not only bad for ethics, it’s bad for science.” Duster argues that when data are structured based on race, researchers have a natural tendency to interpret the data with skewed assumptions. A number of scientific excuses for racist attitudes have been documented through history, from the association of cranium shape with intelligence to the doctrines of Nazism, and the scientific community generally sees these studies as fraudulent relics of the past.

But Duster warns that racial, social, and genetic issues will never be completely separable, especially in a project like HapMap, for which the populations were selected based on political considerations and for which genetic population comparisons are a natural goal. He also sympathetically points out that the choice of populations was based on political and financial concerns that were unavoidable. “If you were to sample populations at equally spaced points across the globe, it would be a scientific selection, but that would take funding we didn’t have and politics we couldn’t control,” he explains.
Duster has studied the issues surrounding genetics and race throughout his career; he also served on the advisory board overseeing legal, ethical, and social issues for the Human Genome Project. He describes how the committee was faced with a constant stream of issues that developed after the first human genome was published. For example, when numerous research groups found a link between the BRCA1 and BRCA2 genes and breast cancer, the theory arose that women of Jewish descent would be more susceptible to the disease. The very suggestion of a racially determined higher risk made waves in the community: Jewish women didn’t want to volunteer for the research, believing that they would be forced to pay higher health insurance rates and experience differential treatment in the medical establishment. And if genetic correlations were found to correspond to less quantifiable traits like intelligence or athleticism, the politics of the correlations may overwhelm the science.

“Why don’t these issues go away? Because ethical issues are emergent,” Duster says. “They don’t get solved once and for all. I believe that in the next 10–15 years, the more we know about the genetic differences between groups, the more social issues will emerge.”

A new era?

Whether HapMap was worth the expense is a subject that is hotly debated, especially because its utility may decline as personalized sequencing becomes more affordable. But many agree that the project has enabled the scientific community to be more involved in disease correlations, and has built infrastructure that will be useful in the next stage of full genomic databases. “This kind of work is not about individual breakthroughs any more,” says Halperin. “It’s about a large community, all pushing for the same result, from different angles, and with different techniques.”

But Pennacchio disagrees. “How do you deal with and interpret millions of common and rare variants in thousands of people and figure out which does what?
What is the genetic basis of each disease? There’s plenty of space for breakthroughs!”

From advancing computational algorithms, to understanding the structure of haplotype blocks, to preparation for the social implications of the era of personalized medicine, HapMap has caused bafflement, interest, passion, and progress in a wide, diverse scientific community. And until the new sequencing era arrives, we can probably use all the social and scientific practice we can get.

Natasha Keith is a graduate student in chemistry.

Want to know more? Check out: hapmap.org
© 2009 Berkeley Science Review





