In major breakthrough, scientists complete first gapless sequence of a human genome, reveal hidden regions

Karen Miga, assistant professor of biomolecular engineering at UC Santa Cruz, co-led the Telomere-to-Telomere (T2T) Consortium, which has released the first complete, gapless assembly of a human genome sequence. (Photo by Carolyn Lagattuta)

 

The first truly complete sequence of a human genome, covering each chromosome from end to end with no gaps and unprecedented accuracy, is now accessible through the UCSC Genome Browser and is described in six papers published March 31 in Science.

Since the first working draft of a human genome sequence was assembled at UC Santa Cruz in 2000, genomics research has led to enormous advances in our understanding of human biology and disease. Nevertheless, crucial regions accounting for some 8% of the human genome have remained hidden from scientists for over 20 years due to the limitations of DNA sequencing technologies.

“Ever since we had the first draft human genome sequence, determining the exact sequence of complex genomic regions has been challenging,” said Evan Eichler, Ph.D., researcher at the University of Washington School of Medicine and T2T consortium co-chair. “I am thrilled that we got the job done. The complete blueprint is going to revolutionize the way we think about human genomic variation, disease and evolution.”

Telomere-to-Telomere Consortium

The sequencing and analysis were performed by a team of more than 100 people, the so-called Telomere-to-Telomere Consortium, or T2T, named for the telomeres that cap the ends of all chromosomes. T2T was initially set up in 2019 by Karen Miga, assistant professor of biomolecular engineering at UC Santa Cruz, and Adam Phillippy at the National Human Genome Research Institute (NHGRI).

The consortium’s gapless version of all 22 autosomes and the X sex chromosome is composed of 3.055 billion base pairs, the units from which chromosomes and our genes are built, and 19,969 protein-coding genes.

The new reference genome, called T2T-CHM13, adds nearly 200 million base pairs of novel DNA sequences, including 99 genes likely to code for proteins and nearly 2,000 candidate genes that need further study. It also corrects thousands of structural errors in the current reference sequence.

Complete sequence of a Y chromosome

The researchers also released this week the complete sequence of a Y chromosome from a different source, which took nearly as long to assemble as the rest of the genome combined, said Nicolas Altemose, a postdoctoral fellow at the University of California, Berkeley, and a co-author of four new papers about the completed genome. The analysis of this new Y chromosome sequence will appear in a future publication.

“In the future, when someone has their genome sequenced, we will be able to identify all of the variants in their DNA and use that information to better guide their health care,” said Phillippy, one of the leaders of T2T and a senior investigator at NHGRI. “Truly finishing the human genome sequence was like putting on a new pair of glasses. Now that we can clearly see everything, we are one step closer to understanding what it all means.”

The gaps now filled by the new sequence include the entire short arms of five human chromosomes and cover some of the most complex regions of the genome. These include highly repetitive DNA sequences found in and around important chromosomal structures such as the telomeres at the ends of chromosomes and the centromeres that coordinate the separation of replicated chromosomes during cell division.

New discoveries

The new DNA sequences reveal never-before-seen detail about the region around the centromere. Variability within this region may also provide new evidence of how our human ancestors evolved in Africa.

“Uncovering the complete sequence of these formerly missing regions of the genome told us so much about how they’re organized, which was totally unknown for many chromosomes,” said Altemose. “Before, we just had the blurriest picture of what was there, and now it’s crystal clear down to single base pair resolution.”

The new sequence also reveals previously undetected segmental duplications, long stretches of DNA that are duplicated in the genome and are known to play important roles in evolution and disease.

“These parts of the human genome that we haven’t been able to study for 20-plus years are important to our understanding of how the genome works, genetic diseases, and human diversity and evolution,” Miga said.

Many of the newly revealed regions have important functions in the genome even if they do not include active genes.

What the researchers found in and around the centromeres were layers of new sequences overlaying layers of older sequences, as if through evolution new centromere regions have been laid down repeatedly to bind to the kinetochore, the protein complex that attaches chromosomes to the spindle during cell division. The older regions are characterized by more random mutations and deletions, indicating they’re no longer used by the cell. The newer sequences where the kinetochore binds are much less variable, and also less methylated. The addition of a methyl group is an epigenetic tag that tends to silence genes.

All of the layers in and around the centromere are composed of repetitive lengths of DNA, based on a unit about 171 base pairs long, which is roughly the length of DNA that wraps around a group of proteins to form a nucleosome, keeping the DNA packaged and compact. These 171 base pair units form even larger repeat structures that are duplicated many times in tandem, building up a large region of repetitive sequences around the centromere.
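
The tandem structure described above can be illustrated with a small sketch. It is not the consortium's analysis method: the 171-base unit here is random synthetic sequence, the copy number and mutation rate are arbitrary, and the function names are hypothetical. The idea is only that a tandem array matches itself well when shifted by one repeat unit, so the best-matching shift estimates the unit length.

```python
import random


def build_tandem_array(unit: str, copies: int, mutation_rate: float,
                       rng: random.Random) -> str:
    """Repeat `unit` `copies` times; with probability `mutation_rate`,
    replace each base with a random base (possibly the same one)."""
    out = []
    for base in unit * copies:
        if rng.random() < mutation_rate:
            out.append(rng.choice("ACGT"))
        else:
            out.append(base)
    return "".join(out)


def self_match_fraction(seq: str, offset: int) -> float:
    """Fraction of positions i where seq[i] equals seq[i + offset]."""
    n = len(seq) - offset
    matches = sum(1 for i in range(n) if seq[i] == seq[i + offset])
    return matches / n


def estimate_period(seq: str, min_period: int, max_period: int) -> int:
    """Return the offset in [min_period, max_period] with the best self-match."""
    return max(range(min_period, max_period + 1),
               key=lambda k: self_match_fraction(seq, k))


rng = random.Random(0)
# Synthetic stand-in for a 171-base repeat unit (not a real centromere sequence).
unit = "".join(rng.choice("ACGT") for _ in range(171))
array = build_tandem_array(unit, copies=20, mutation_rate=0.02, rng=rng)

period = estimate_period(array, min_period=100, max_period=250)
print(period)
```

Raising the mutation rate in this toy lowers the self-match score at the true period, loosely mirroring the article's point that older, more degraded layers are harder to recognize as repeats.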

DNA sequences around the centromere could also be used to trace human lineages back to our common ape ancestors, Altemose noted.

“As you move away from the site of the active centromere, you get more and more degraded sequence, to the point where if you go out to the furthest shores of this sea of repetitive sequences, you start to see the ancient centromere that, perhaps, our distant primate ancestors used to bind to the kinetochore,” Altemose said. “It’s almost like layers of fossils.”

Seeing the whole genome as a complete system for the first time

“There is a profound advantage to seeing the whole genome as a complete system. It puts us in a position to unravel how that system works,” said David Haussler, director of the UC Santa Cruz Genomics Institute. “We’ve gotten an enormous understanding of human biology and disease from having roughly 90 percent of the human genome, but there were many important aspects that lay hidden, out of view of science, because we did not have the technology to read those portions of the genome. Now we can stand at the top of the mountain and see all of the landscape below and get a complete picture of our human genetic heritage.”

The T2T genome sequence, representing the finished CHM13 genome plus the recently finished T2T Y chromosome (CHM13 includes an X but not a Y chromosome), is now a new reference genome in the UCSC Genome Browser. The T2T sequence is fully annotated in the browser, providing an efficient way for scientists to access and visualize a wealth of information associated with genes and other elements of the genome.

“We wanted to put the information out in a way that is accessible and familiar to researchers so they can begin to build on it and use all the tools and resources the browser provides,” Miga explained.

Genome Reference Consortium

The new T2T reference genome will complement the standard human reference genome, known as Genome Reference Consortium build 38 (GRCh38), which had its origins in the publicly funded Human Genome Project and has been continually updated since the first draft in 2000.

“We’re adding a second complete genome, and then there will be more,” explained Haussler. “The next phase is to think about the reference for humanity’s genome as not being a single genome sequence. This is a profound transition, the harbinger of a new era in which we will eventually capture human diversity in an unbiased way.”

Human Pangenome Reference Consortium

The T2T Consortium has now joined with the Human Pangenome Reference Consortium, which aims to create a new “human pangenome reference” based on the complete genome sequences of 350 individuals.

“Pangenomics is about capturing the diversity of the human population, and it’s also about ensuring we’ve captured the whole genome properly,” said Benedict Paten, associate professor of biomolecular engineering at UCSC’s Baskin School of Engineering, a coauthor of the T2T papers, and a leader of the pangenomics effort. “Without having a map of these difficult-to-sequence regions of the genome across multiple individuals, then we’re missing a huge amount of the variation present in our population. T2T sets us up to look across hundreds of genomes from telomere to telomere. It’s going to be great!”

The standard reference genome (GRCh38) does not represent any one individual but was assembled from multiple donors. Merging them into one linear sequence created artificial structures in the sequence. The Human Pangenome Project will make it possible to compare newly sequenced genomes to multiple complete genomes representing a range of human ancestries.

More accurate assessments of genetic variants

An important outcome of the new T2T sequence is enabling more accurate assessments of genetic variants. When human genomes are sequenced for clinical studies to understand the role of genetic variants in disease or to study genetic diversity within and between human populations, they are nearly always analyzed by aligning the sequencing results with the reference genome for comparison. The T2T variant team documented major improvements in identifying and interpreting genetic variants using the new T2T sequence compared to the standard human reference genome.

“The new human genome is incredibly accurate at the base level, allowing us to flag hundreds of thousands of variants that had been misinterpreted by mapping them to the standard reference. Many of these new variants are in genes known to contribute to disease. We can now spot those because we have a more complete and accurate reference genome,” Miga said.
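
A toy sketch can show why the accuracy of the reference matters when variants are identified by comparison. It assumes sequences that are already aligned and equal in length, and it only looks for single-base substitutions; real analyses rely on dedicated alignment and variant-calling software. The sequences, the "flawed" reference, and the function name are all hypothetical.

```python
from typing import List, Tuple

# (0-based position, reference base, sample base)
Variant = Tuple[int, str, str]


def call_substitutions(reference: str, sample: str) -> List[Variant]:
    """List single-base differences between two pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("toy comparison assumes pre-aligned, equal-length sequences")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


accurate_reference = "ACGTACGTACGT"
flawed_reference = "ACGTACCTACGT"  # hypothetical error at position 6 (G recorded as C)
sample = "ACGTACGTACTT"            # one genuine difference at position 10 (G -> T)

calls_accurate = call_substitutions(accurate_reference, sample)
calls_flawed = call_substitutions(flawed_reference, sample)

print(calls_accurate)  # [(10, 'G', 'T')]
print(calls_flawed)    # [(6, 'C', 'G'), (10, 'G', 'T')]
```

Against the flawed reference, the sample appears to carry an extra variant at position 6 that is really an error in the reference itself, which is the kind of misinterpretation a more accurate reference helps avoid.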

Miga’s research has focused on satellite DNA, the long stretches of repetitive DNA sequences found mostly in and around telomeres and centromeres. The centromeres separate each chromosome into a short arm and a long arm and hold duplicated chromosomes together prior to cell division.

“The centromeres play a critical role in how chromosomes segregate properly during cell division, and we’ve known for some time now that they are misregulated in all kinds of human diseases. But we’ve never been able to study them at the sequence level,” Miga said. “By far the largest portion of new sequences added to the reference are centromere satellite DNAs. For the first time, we can study ‘base-by-base’ the sequences that define the centromere and can start to understand how it works.”

Long-read sequencing a game changer

The T2T’s success is due to improved techniques for sequencing long stretches of DNA at once, which helps when determining the order of highly repetitive stretches of DNA. Among these are PacBio’s HiFi sequencing, which can read lengths of more than 20,000 base pairs with high accuracy. Technology developed by Oxford Nanopore Technologies, on the other hand, can read up to several million base pairs in sequence, though with less fidelity. For comparison, so-called next-generation sequencing by Illumina is limited to hundreds of base pairs.
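
The following sketch illustrates, at toy scale, why read length helps with repetitive DNA. It assumes error-free reads and exact matching, and all lengths are scaled down for speed (a 600-base repeat array, 50-base "short" reads, 700-base "long" reads); the sequences are random and the function names are hypothetical. A read that lies entirely inside the repeat array matches in several places, while a read longer than the array must reach into the unique flanking sequence and can be placed unambiguously.

```python
import random


def count_occurrences(genome: str, read: str) -> int:
    """Count exact occurrences of `read` in `genome`, allowing overlaps."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


def fraction_uniquely_placed(genome: str, read_length: int) -> float:
    """Fraction of all error-free reads of `read_length` that occur exactly once."""
    total = len(genome) - read_length + 1
    unique = sum(
        1
        for i in range(total)
        if count_occurrences(genome, genome[i:i + read_length]) == 1
    )
    return unique / total


rng = random.Random(1)


def random_dna(length: int) -> str:
    """Random synthetic sequence, used here as a stand-in for real DNA."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# Unique flanks on either side of a 600-base array of a repeated 20-base unit.
genome = random_dna(300) + random_dna(20) * 30 + random_dna(300)

short_read_fraction = fraction_uniquely_placed(genome, 50)
long_read_fraction = fraction_uniquely_placed(genome, 700)

print(f"50-base reads placed uniquely:  {short_read_fraction:.2f}")
print(f"700-base reads placed uniquely: {long_read_fraction:.2f}")
```

In this toy, roughly half of the short reads fall wholly inside the repeat array and cannot be placed, whereas every long read spans the array into a flank. Real reads contain errors and real repeat arrays are far larger, so this shows only the basic intuition.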

“These new long-read DNA sequencing technologies are just incredible; they’re such game changers, not only for this repetitive DNA world, but because they allow you to sequence single long molecules of DNA,” Altemose said. “You can begin to ask questions at a level of resolution that just wasn’t possible before, not even with short-read sequencing methods.”

———

Miga is a co-corresponding author of the main Science paper along with Adam Phillippy at NHGRI and Evan Eichler at the University of Washington.

She is also a co-corresponding author of the papers on:

and a coauthor of the papers on:

 

—————-

The T2T working group

https://sites.google.com/ucsc.edu/t2tworkinggroup