Reading more than a 1000 genomes: a long-read map of human diversity

23 Jul 2025

Structural variants—large genomic changes like insertions, deletions, and duplications—are one of the main sources of human genetic diversity and can have profound effects on health and disease. Despite their importance, they have been difficult to study at scale due to the limitations of short-read sequencing, which often fails to detect or fully resolve these larger, more complex sequences. In a study now published in Nature, Siegfried Schloissnig, computer scientist at the IMP, led the sequencing and analysis of over 1,000 human genomes using long-read technology. Building on the legacy of the 1000 Genomes Project, this work delivers the most detailed map of structural variation to date and establishes a foundational resource for future research—from understanding human diversity to improving genetic diagnoses. In this interview, Schloissnig shares how the project came together, what he and his collaborators discovered, and why this dataset will shape the future of genomics.

What’s the study about?

This is a follow-up to the 1000 Genomes Project, which was the first large-scale effort to catalogue human genetic variation by sequencing individuals from diverse populations around the world. But this time, we used long-read sequencing, which gives us a much clearer picture of structural variation in the human genome. These are big changes in the DNA that can have real effects on how genes work. By sequencing over a thousand healthy people from around the world, we built a reference—what we call a “panel of normals”. That’s important, because when you're looking for disease-causing mutations, you need to know what “normal” looks like. It’s also a huge resource that people can keep analysing for years, to better understand human genetic diversity.

Why focus on structural variants, and why use long-read sequencing to study them?

Structural variants—like insertions, deletions, and duplications—can have a big impact on the genome, but they’ve been difficult to study with older, short-read sequencing. Short reads are pretty good at spotting deletions, but they struggle with insertions—you can tell something’s been added in the DNA, but not exactly what sequence. Long-read sequencing solves that. It allows us to see the full sequence of what’s been inserted, not just that it happened. So if you want to really understand the structure and content of these variants, especially the more complex ones, long reads are essential.
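To make the point about insertions concrete, here is a minimal toy sketch in Python. It is not from the study: the sequences, read lengths, and the helper name `spanning_reads` are all invented for illustration. It simply counts how many error-free reads of a given length would contain an entire insertion plus a few anchoring bases on each side, which is roughly what is needed to read out the inserted sequence directly.

```python
# Illustrative toy example only: made-up sequences and read lengths,
# not real sequencing data or the study's method.

def spanning_reads(sample: str, ins_start: int, ins_end: int,
                   read_len: int, flank: int = 3) -> int:
    """Count reads of length `read_len` that cover the whole insertion
    sample[ins_start:ins_end] plus at least `flank` bases on both sides."""
    count = 0
    for start in range(len(sample) - read_len + 1):
        end = start + read_len
        if start <= ins_start - flank and end >= ins_end + flank:
            count += 1
    return count

reference = "ACGTACGTAC" + "GGATCCGGAT"   # 20-base toy reference
insertion = "T" * 20                      # 20-base inserted sequence
sample = reference[:10] + insertion + reference[10:]  # insertion at 10..30

short_hits = spanning_reads(sample, 10, 30, read_len=12)  # "short" reads
long_hits = spanning_reads(sample, 10, 30, read_len=30)   # "long" reads

print(short_hits, long_hits)
```

In this toy setup, no 12-base read can span the 20-base insertion, while several 30-base reads do. Real short-read callers can still infer that an insertion exists from indirect signals, but recovering its full content is where reads longer than the variant help.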

What did you discover?

One of the big things we found is that structural variation doesn’t come from a single process—it’s actually driven by a range of different recombination mechanisms that use varying degrees of sequence homology. That’s interesting because it suggests there’s not just one way these changes happen in the genome, but a whole spectrum of underlying molecular pathways. We also saw that some structural variants tend to pop up again and again in specific populations, likely because of the surrounding genomic context. That kind of recurrence hints that these variants can emerge independently in different groups, not just through shared ancestry. And beyond the biology, the dataset itself is a huge resource. It gives researchers a detailed map of variation to explore for years to come.

Photo: Siegfried Schloissnig, joint first author of the paper.

What was so difficult about getting this data?

It was one of those projects that started small and just kept growing. Originally, it was pitched as a six-month idea—but six years later, here we are. We realised early on that to build good tools for structural variant analysis, we first needed a proper dataset—and there just wasn’t one. So we ended up sequencing 1,000 genomes ourselves. We chose samples from the 1000 Genomes Project because they’re diverse and publicly accessible, which made things a lot easier.

Then came the difficult part: analysing the data. Instead of using a single reference genome—which is how most studies compare DNA—we wanted to use something called a pangenome graph. That’s a newer approach that combines many genomes into a kind of map that better captures variation across different people. It’s much more powerful, but the tools for working with it are still in their early stages, pretty hard to use or missing altogether. So there was a lot of trial and error, a lot of development along the way. The analysis alone took about three years, and it often felt like the project would never end. It was slow, frustrating, and at times overwhelming—but we pushed through. In the end, it’s a massive resource that I think will be worth all the effort.
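As a rough illustration of the pangenome graph idea described above, the following Python sketch stores short sequence segments as nodes and represents each genome as a path through them. It is a deliberately simplified assumption-laden toy: the node contents, the haplotype names `hap_A` and `hap_B`, and the helper functions are hypothetical and do not reflect the tools or data structures used in the study.

```python
# Toy pangenome graph: nodes hold sequence segments, and each haplotype
# is a path (an ordered list of node IDs). All values are invented.

nodes = {
    1: "ACGT",
    2: "CCCCCC",  # segment carried by only some haplotypes (an insertion)
    3: "GGCA",
}

paths = {
    "hap_A": [1, 3],      # haplotype without the insertion
    "hap_B": [1, 2, 3],   # haplotype with the insertion
}

def spell(path):
    """Reconstruct a haplotype's sequence by walking its path."""
    return "".join(nodes[n] for n in path)

def edges(all_paths):
    """Collect the set of node-to-node links used by any haplotype."""
    links = set()
    for p in all_paths.values():
        links.update(zip(p, p[1:]))
    return links

def variable_nodes(all_paths):
    """Return nodes that are not traversed by every haplotype."""
    shared = set.intersection(*(set(p) for p in all_paths.values()))
    return sorted(set(nodes) - shared)

for name, path in paths.items():
    print(name, spell(path))
print("edges:", sorted(edges(paths)))
print("variable nodes:", variable_nodes(paths))
```

The design point is that variation shows up as alternative routes through the graph: both haplotypes share nodes 1 and 3, and the insertion appears as an extra node that only one path visits, rather than as a difference measured against a single linear reference.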

How did you manage to bring such a big project to the finish line?

Honestly, it was just a lot of persistence. You keep chipping away, and eventually you reach a point where you say, “Okay, this is enough—we can send it off.” I come from a genome assembly background, where you have a clear endpoint: build the genome, analyse it, wrap it up. But with this project, it was more like wrangling a massive pile of data. You explore different angles, poke at it from various sides, and gradually build up enough results to tell a coherent story. The freedom was great, but it also made it harder to know when to stop. There’s no obvious finish line—you just have to decide you’re done.

A real turning point was when I called Jan Korbel—I remembered him from my EMBL days. I told him, “Hey, we’re close to finishing long-read sequencing for a thousand samples.” I think at first he thought I was joking, because it just came out of nowhere. But he was immediately on board. He’d led the analysis team for the original 1000 Genomes Project, so he was the perfect person to bring in. That helped solidify things and push the project into its final phase.

What can researchers do with this data now that it’s available?

A lot, actually—and people are already using it. One key application is using the data to fill in missing information in older datasets. That’s especially valuable for pharmaceutical and biomedical research teams working with large collections of short-read data from clinical trials. This way, they can effectively enrich those datasets with structural variant information without having to resequence everything. And because we’ve made all the raw data public, researchers can also study things like DNA methylation across populations, for example. More broadly, this dataset allows scientists to explore all kinds of new questions about structural variation—how it’s distributed, how it arises, and what roles it might play in health and disease. It’s a foundation for a lot of future work.

I’d like to use the end of this interview to focus on what matters most: the people. I’m deeply grateful to all my co-authors; this work would not have been possible without them. I am also grateful to the folks at the IMP for hosting the project, and to Boehringer Ingelheim and BI X, their digital innovation lab, for the generous financial support. My friends also proved invaluable, patiently listening to my stressed-out rants throughout this journey.

Original publication

Siegfried Schloissnig, Samarendra Pani, Jana Ebler, Carsten Hain, Vasiliki Tsapalou, Arda Söylev, Patrick Hüther, Hufsah Ashraf, Timofey Prodanov, Mila Asparuhova, Hugo Magalhães, Wolfram Höps, Jesus Emiliano Sotelo-Fonseca, Tomas Fitzgerald, Walter Santana-Garcia, Ricardo Moreira-Pinhal, Sarah Hunt, Francy J. Pérez-Llanos, Tassilo Erik Wollenweber, Sugirthan Sivalingam, Dagmar Wieczorek, Mario Cáceres, Christian Gilissen, Ewan Birney, Zhihao Ding, Jan Nygaard Jensen, Nikhil Podduturi, Jan Stutzki, Bernardo Rodriguez-Martin, Tobias Rausch, Tobias Marschall, and Jan O. Korbel^#. “Structural variation in 1,019 diverse humans based on long-read sequencing.” Nature, DOI: 10.1038/s41586-025-09290-7

^#Corresponding author.

About the IMP at the Vienna BioCenter

The Research Institute of Molecular Pathology (IMP) in Vienna is a basic life science research institute largely sponsored by Boehringer Ingelheim. With over 220 scientists from 40 countries, the IMP is committed to scientific discovery of fundamental molecular and cellular mechanisms underlying complex biological phenomena. The IMP is part of the Vienna BioCenter, one of Europe’s most dynamic life science hubs with 2,800 people from over 80 countries in six research institutions, two universities, and 35 biotech companies. www.imp.ac.at, www.viennabiocenter.org
