That database, called GenBank, is run by the National Center for Biotechnology Information at the National Institutes of Health, and it's a repository of all publicly available genetic sequences, from the human genome to ever-evolving influenza genes.
Scientists are still scrambling to learn more about the origins and behavior of the new virus strain, which seems to include genes from four different flus -- a human flu strain, an avian flu strain, and two previously known swine flu strains, one common in North America and one in Eurasia.
NCBI director David Lipman explains how scientists use GenBank to trace the origins of and learn more about a new virus.
Can you explain what GenBank is, and how it is being used by scientists who are studying the H1N1 virus?
GenBank has been around for 25 years now. It's a database with all of the publicly available DNA sequences. The human genome is in there, and many other thousands of organisms that have had either all of their genome or many of their genes sequenced are in there as well. There are also a lot of pathogens in there -- bacterial pathogens and viral pathogens like influenza.
And essentially, when someone has sequenced a new gene or a new virus, they will send the sequence to GenBank and we'll get it into a standard form and we'll make it available to everybody.
How many sequences are in there right now?
Many billions [...] We have had exponential growth in the number of sequences in GenBank for 25 years, so there are billions and billions.
Is the process for putting in H1N1 swine flu sequences the same as it is for anything else?
Well, we sped up the process and really tried to do what we could to make sure those sequences are as available as possible. Whenever there's been any kind of an epidemic or scare like that, we really accelerate the process. We're getting [H1N1 swine flu] sequences and making them available within almost minutes after working on them.
Why do you choose to make the data publicly available?
In many areas, since you don't know who can most benefit from having access to the data, it's better to make it available to everybody. For example, it's turned out that a sequence that's involved with colon cancer in humans may actually be evolutionarily related to a bacterial gene for which we already know a lot about how it functions. So we can move more quickly, in terms of experiments, by learning from the tree of life.
Or in the case of the flu [...] the field of flu researchers isn't that big; there are plenty of other people who are knowledgeable and bring other skills in.
To give you an idea about how this works: On the flu page where we have the [H1N1] sequences, we are having -- across all our databases -- 14,000 searches a day having to do with the flu, and at least 7,000 individuals doing the searching per day.
It's a fortuitous thing that in the evolution of medical research, all these genomics and high-throughput sequencing methods came about at the same time as the revolution in Internet communication. Because I think that's really benefited medical discovery.
One of the success stories of the whole genomics and molecular biology revolution has been, not only are people generating a lot of data, but actually it's being used. And that's really important. If you generate a lot of data but it's just hanging out there, nobody's touching it, or it's being hidden away in a balkanized way, you just don't get this kind of usage.
Has much of the knowledge that scientists have gained so far about the origin and makeup of the H1N1 swine flu come from GenBank?
We're seeing things anecdotally on web sites about that, but we won't know for sure until people begin publishing papers and citing things from GenBank. But since I'm personally interested in flu research and have done some work in that, I know informally that there are people taking those sequences and doing all kinds of analyses and communicating back and forth and collaborations are beginning, and so forth.
From the very first sequences that [scientists] had for [H1N1 swine flu], you could see [...] its ancestry was pretty clear cut. You find that by comparing it to sequences in GenBank, of course. In other words, when the very first sequences were done, even before they were submitted to GenBank, people were using GenBank to do these analyses.
Basically, the process is they sequence the virus, then while they're still holding on to it [...] they can take that sequence and compare it to all the other flu sequences. They may have downloaded those sequences from GenBank at some point. But since all the researchers are putting sequences into GenBank, if someone's doing flu research and they get a new isolate, they'll sequence it, then they'll compare it to everything in GenBank -- even though they would not have yet deposited it in GenBank. So they're using GenBank for their own research, but then contributing to it so others can use it.
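As a rough illustration of the comparison step Lipman describes, the sketch below ranks a handful of reference sequences by how similar each is to a newly sequenced isolate. It is only a toy: the sequences and their names are invented for the example and are not real GenBank records, and the similarity measure (shared 4-letter substrings, scored with the Jaccard index) is a simple stand-in, not the alignment tools researchers actually use.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a nucleotide sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_references(query, references, k=4):
    """Score every reference against the query and sort best match first."""
    query_kmers = kmers(query, k)
    scored = [
        (name, jaccard(query_kmers, kmers(seq, k)))
        for name, seq in references.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, made-up sequences standing in for previously downloaded records.
references = {
    "toy_swine_NA": "ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCA",
    "toy_avian": "ATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTT",
    "toy_human": "ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTG",
}

# Hypothetical new isolate: identical to toy_swine_NA except for the last base.
new_isolate = "ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCT"

ranking = rank_references(new_isolate, references)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The closest reference comes first in the ranking, which mirrors the idea of checking a new isolate against everything already on file before the isolate itself has been deposited.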