The Never-Ending Quest to Rewrite the Tree of Life

The bottom of the ocean is one of the most mysterious places on the planet, but microbiologist Karen Lloyd of the University of Tennessee, Knoxville, wanted to go deeper than that. In 2010, as a postdoc at Aarhus University in Denmark, Lloyd set out to see what microbes were living more than 400 feet beneath the sea floor.

Like nearly all microbiologists doing this type of census, she relied on 16S rRNA sequencing to determine who was there. Developed by microbiologist Carl Woese in the late 1970s, the technique looks for variation in the 16S rRNA gene, one that’s common to all organisms (it’s key to turning DNA into protein, one of life’s most fundamental processes). When Lloyd compared what she had seen under the microscope to what her sequencing data said, however, she knew her DNA results were missing a huge portion of the life hidden underneath the ocean.

“I had two problems with just 16S sequencing. One, I knew it would miss organisms, and two, it’s not good for understanding small differences between microbes,” Lloyd says.

Scientists use heat maps like these to visualize the diversity of bacteria in various environments.

Technology had made gene sequencing much quicker and easier than it was when Woese started his work in the 1970s, but the principle remained the same. The 16S rRNA gene codes for a portion of the machinery used by prokaryotes to make protein, which is a central activity in the cell. All microbes have a copy of this gene, but different species have slightly different copies. If two species are closely related, their 16S rRNA sequences will be nearly identical; more distantly related organisms will have a greater number of differences. Woese’s work not only gave researchers a way to quantify evolutionary relationships between species, it also revealed an entirely new branch on the tree of life: the archaea, a group of microscopic organisms distinct from bacteria.
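The idea that closer relatives have more similar 16S sequences can be made concrete with a small Python sketch. It simply counts matching positions between sequences that are assumed to be already aligned and of equal length. The three 20-letter fragments are invented for illustration and are not real 16S data, and the function name is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Invented fragments, for illustration only (not real 16S rRNA sequences).
reference = "ACGTTGCAAGCTTAGGCCTA"
close_relative = "ACGTTGCAAGCTTAGGCCTT"    # differs at 1 of 20 positions
distant_relative = "ACGATGCTAGGTTACGCATA"  # differs at 5 of 20 positions

print(percent_identity(reference, close_relative))    # 95.0
print(percent_identity(reference, distant_relative))  # 75.0
```

Real comparisons work on the full gene, which is roughly 1,500 letters long, and must first align the sequences to account for insertions and deletions, a step this sketch leaves out.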

Woese’s success in using 16S rRNA to rewrite the tree of life no doubt encouraged its widespread use. But as Lloyd and other scientists came to realize, some microbes carry a version of the gene that is significantly different from that seen in other bacteria or archaea. Since biologists depended on sequence similarity to identify an organism, they were leaving potentially significant portions of life out of their investigations.

These concerns came to a head approximately ten years ago, as sequencing technologies were rapidly advancing. During this time, researchers figured out how to prepare DNA for sequencing without needing to know anything about the organism being studied. At the same time, scientists invented a strategy to isolate single cells. At her lab at the Joint Genome Institute outside San Francisco, microbiologist Tanja Woyke put these two strategies together to sequence the genomes of individual microbial cells. Meanwhile, Jill Banfield, across the bay at the University of California, Berkeley, used a different approach called metagenomics that sequenced genes from multiple species at once, and used computer algorithms to reconstruct each organism’s genome. Over the past several years, their work has helped illuminate the massive amount of microbial dark matter that makes up much of life on Earth.

“These two strategies really complement each other. They have opened up our ability to see the true diversity of microbial life,” says Roger Lasken, a microbial geneticist at the J. Craig Venter Institute.

Microbial Dark Matter

When Woese sequenced the 16S genes of the microbes that would come to be known as archaea, they were completely different from most of the bacterial sequences he had accumulated. Like bacteria, these microbes lacked a true nucleus, but their metabolisms were completely different. They also tended to favor extreme environments, such as those at high temperatures (hot springs and hydrothermal vents), high salt concentrations, or high acidity. Sensing their ancient origins, Woese named these microbes the archaea, and gave them their own branch on the tree of life.

Woese did all of his original sequencing by hand, a laborious process that took years. Later, DNA sequencing machines greatly simplified the work, although it still required amplifying the small amount of DNA present using a technique known as polymerase chain reaction, or PCR, before sequencing. The utility of 16S sequencing soon made the technique one of the mainstays of the microbiology lab, along with the Petri dish and the microscope.

The method uses a set of what are known as universal primers (short strands of RNA or DNA that help jump-start the duplication of DNA) to make lots of copies of the 16S gene so it can be sequenced. The primers bind to a set of DNA sequences flanking the 16S gene that were thought to be common to all organisms. These sequences act like a set of bookends, marking the region to be copied by PCR. As DNA sequencing technology improved, researchers began amplifying and sequencing 16S genes in environmental samples as a way of identifying the microbes present without the need to grow them in the lab. Since scientists have only been able to culture about one in 100 microbial species, this method opened broad swaths of biodiversity that would otherwise have remained invisible.

“We didn’t know that these deep branches existed. Trying to study life from just 16S rRNA sequences is like trying to understand all animals by visiting a zoo,” says Lionel Guy, a microbiologist from Uppsala University in Sweden.


It didn’t take long, however, for scientists to realize the universal primers weren’t nearly as universal as researchers had hoped. The use of the primers rested on the assumption that all organisms, even unknown ones, would have similar DNA sequences surrounding the 16S rRNA gene. But that meant that any true oddballs probably wouldn’t have 16S rRNA sequences that matched the universal primers—they would remain invisible. These uncultured, unsequenced species were nicknamed “microbial dark matter” by Stanford University bioengineer and physicist Stephen Quake in a 2007 PNAS paper.
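Why a mismatched primer site makes an organism invisible can be shown with a toy "in silico PCR" in Python. This is a minimal sketch under strong assumptions: primer binding is modeled as an exact text match, and the two 8-letter primers and both templates are invented for illustration (real universal primers are longer and tolerate some mismatches).

```python
def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplify(template: str, forward: str, reverse: str):
    """Return the region bracketed by both primers, or None if either fails to bind.

    Binding is modeled as an exact substring match, a deliberate simplification.
    """
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on this strand
    # we look for its reverse complement downstream of the forward primer.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Invented primers and templates, for illustration only.
FORWARD = "GATTACAG"
REVERSE = "CCTTAGGA"  # its reverse complement is TCCTAAGG
GENE = "CCCGGGAAATTT"

typical = "TT" + "GATTACAG" + GENE + "TCCTAAGG" + "AA"
oddball = "TT" + "GATCACAG" + GENE + "TCCTAAGG" + "AA"  # one changed letter in the forward site

print(amplify(typical, FORWARD, REVERSE))  # GATTACAGCCCGGGAAATTTTCCTAAGG
print(amplify(oddball, FORWARD, REVERSE))  # None
```

The oddball carries exactly the same gene, but a single changed letter at the forward primer site means nothing is copied, so nothing is sequenced, and the organism never shows up in the census.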

The name, he says, is analogous to dark matter in physics, which is invisible but thought to make up the bulk of the universe. “It took DNA technology to realize the depth of the problem. I mean, holy crap, there’s a lot more out there than we can discover,” Quake says.

Quake’s snappy coinage lent its name to the Microbial Dark Matter project, an ongoing quest in microbiology, led by Woyke, that isolates DNA from single bacterial and archaeal cells to understand the branches on the tree of life that remain shrouded in mystery. These microbial misfits intrigued Lloyd as well, and she believed the subsurface had many more of them than anyone thought. Her task was to find them.

“We had no idea what was really there, but we knew it was something,” Lloyd says.

To solve her Rumsfeldian dilemma of identifying both her known and unknown unknowns, Lloyd needed a DNA sequencing method that would allow her to sequence the genomes of the microbes in her sample without any preconceived notions of what they looked like. As it turns out, a scientist in New Haven, Connecticut was doing just that.

Search for Primers

In the 1990s, Roger Lasken had recognized the problems with traditional 16S rRNA and other forms of sequencing. Not only did you need to know something about the DNA sequence ahead of time in order to make enough genetic material to be sequenced, you also needed a fairly large sample. The result was a significant limitation in the types of material that could be sequenced. Lasken wanted to be able to sequence the genome of a single cell without needing to know anything about it.

Then employed at the biotech firm Molecular Staging, Lasken began work on what he called multiple displacement amplification (MDA). He built on a recently discovered DNA polymerase (the enzyme that adds nucleotides, one by one, to a growing piece of DNA) called φ29 DNA polymerase. Compared to the more commonly used Taq polymerase, the φ29 polymerase created much longer strands of DNA and could operate at much cooler temperatures. Scientists had also developed random primers, small pieces of randomly generated DNA. Unlike the universal primers, which were designed to match specific DNA sequences 20–30 nucleotides in length, random primers were only six nucleotides long. This meant they were small enough to match pieces of DNA on any genome. With enough random primers to act as starting points for the MDA process, scientists could confidently amplify and sequence all the genetic material in a sample. The bonus inherent in the random primers was that scientists didn’t need to know anything about the sample they were sequencing in order to begin work.
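A rough back-of-the-envelope calculation in Python shows why a six-letter primer finds a foothold on almost any genome without being designed for it. The sketch assumes an idealized genome in which every letter is equally likely and independent, which real genomes are not, and uses a genome length of 4.6 million letters, roughly the size of E. coli's, as an illustrative figure. The function name is hypothetical.

```python
def expected_sites(genome_length: int, primer_length: int) -> float:
    """Expected number of exact matches for one specific primer on one strand
    of a uniformly random genome (an idealization, not a real genome model)."""
    positions = genome_length - primer_length + 1
    return positions / 4 ** primer_length


GENOME_LENGTH = 4_600_000  # roughly E. coli-sized; assumed for illustration

print(4 ** 6)                             # 4096 possible six-letter primers
print(expected_sites(GENOME_LENGTH, 6))   # about 1,100 matching sites per hexamer
print(expected_sites(GENOME_LENGTH, 20))  # far less than one for a 20-letter sequence
```

Under these assumptions, each of the 4,096 possible hexamers lands on the genome by chance about a thousand times, while a specific 20-letter sequence almost never appears by chance. That is why longer primers must be designed to match a known site, and why random hexamers need no such prior knowledge.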

“For the first time, you didn’t need to culture an organism or amplify its DNA to sequence it,” he says.

The method had only been tested on relatively small pieces of DNA. Lasken’s major breakthrough, published in 2002 in PNAS, was making the system work for larger chromosomes, including those in humans. Lasken was halfway to his goal. His next step was figuring out how to do this in a single bacterium, which would enable researchers to sequence any microbial cell they found. In 2005, Lasken and colleagues managed to isolate a single E. coli cell and sequence its 16S rRNA gene using MDA. It was a good proof of principle that the system worked, but to understand the range and depth of microbial biodiversity, researchers like Tanja Woyke, the microbiologist at the Joint Genome Institute, needed to look at the entire genome of a single cell. In theory, the system should work neatly: grab a single cell, amplify its DNA, and then sequence it. But putting all of the steps together and working out the kinks in the system would require years of work.

Woyke had spent her postdoc at the Joint Genome Institute sequencing DNA from samples not grown in the lab, but drawn directly from the environment, like a scoop of soil. At the time, she was using metagenomics, which amplified and sequenced DNA directly from environmental samples, yielding millions of As, Ts, Gs, and Cs from even a thimble of dirt. Woyke’s problem was determining which genes belonged to which microbe, a key step in assembling a complete genome. Nor was she able to study different strains of the same microbe that were present in a sample because their genomes were just too similar to tell apart using the available sequencing technology. What’s more, the sequences from common species often completely drowned out the data from rarer ones.

“I kept thinking to myself, wouldn’t it be nice to get the entire genome from just a single cell,” Woyke says. Single-cell genomics would enable her to match a genome and a microbe with near 100% certainty, and it would also allow her to identify species with only a few individuals in any sample. Woyke saw a chance to make her mark with these rare but environmentally important species.

Soon after that, she read Lasken’s paper and decided to try his technique on microbes she had isolated from the grass sharpshooter Draeculacephala minerva, an important plant pest. One of her biggest challenges was contamination. Pieces of DNA are everywhere—on our hands, on tables and lab benches, and in the water. The short, random primers upon which single-cell sequencing was built could help amplify these fragments of DNA just as easily as they could the microbial genomes Woyke was studying. “If someone in the lab had a cat, it could pick up cat DNA,” Woyke says of the technique.

In 2010, after more than a year of work, Woyke had her first genome, that of Sulcia bacteria, which had a small genome and could only live inside the grass sharpshooter. Each cell also carried two copies of the genome, which helped make Woyke’s work easier. It was a test case that proved the method, but to shine a spotlight on the world’s hidden microbial biodiversity, Woyke would need to figure out how to sequence the genomes from multiple individual microbes.

Work with Jonathan Eisen, a microbiologist at UC Davis, on the Genomic Encyclopedia of Bacteria and Archaea Project, known as GEBA, enabled her lab to set up a pipeline to perform single cell sequencing on multiple organisms at once. GEBA, which seeks to sequence thousands of bacterial and archaeal genomes, provided a perfect entry to her Microbial Dark Matter sequencing project. More than half of all known bacterial phyla—the taxonomic rank just below kingdom—were only represented by a single 16S rRNA sequence.

“We knew that there were far more microbes and a far greater diversity of life than just those organisms being studied in the lab,” says Matthew Kane, a program director at the National Science Foundation and a former microbiologist. Studying the select few organisms that scientists could grow in pure culture was “useful for picking apart how cells work, but not for understanding life on Earth.”

GEBA was a start, but even the best encyclopedia is no match for even the smallest public library. Woyke’s Microbial Dark Matter project would lay the foundation for the first of those libraries. She didn’t want to fill it with just any sequences, however. Common bacteria like E. coli, Salmonella, and Clostridium were the Dr. Seuss books and Shakespeare plays of the microbial world—every library had copies, though they represented only a tiny slice of all published works. Woyke was after the bacterial and archaeal equivalents of rare, single-edition books. So she began searching in extreme environments including boiling hot springs of caustic acid, volcanic vents at the bottom of the ocean, and deep inside abandoned mines.

Using the single-cell sequencing techniques that she had perfected at the Joint Genome Institute, Woyke and her colleagues ended up with exactly 201 genomes from these candidate phyla, representing 29 branches on the tree of life that scientists knew nothing about. “For many phyla, this was the first genomic data anyone had seen,” she says.

The results, published in Nature in 2013, identified some unusual species for which even Woyke wasn’t prepared. Up until that study, all organisms used the same sequence of three DNA nucleotides to signal the stop of a protein, one of the most fundamental components of any organism’s genome. Several of the species of archaea identified by Woyke and her colleagues, however, used a completely different stop signal. The discovery was not unlike traveling to a different country and having the familiar red stop sign replaced by a purple square, she says. Their work also identified other rare and bizarre features of the organisms’ metabolisms that make them unique among Earth’s biodiversity. Other microbial dark matter sequencing projects, both under Woyke’s Microbial Dark Matter project umbrella and other independent ventures, identified microbes from unusual phyla living in our mouths.

Some of the extremophile archaea that Woyke and her colleagues identified were so unlike other forms of life that the researchers grouped them into their own superset of phyla, known as DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanohaloarchaeota, and Nanoarchaeota). The only thing that scientists knew about these organisms was their genomes, which Woyke had sequenced from individual cells. These single-cell sequencing projects are key not just for filling in the foliage on the tree of life, but also for demonstrating just how much remains unknown, and Woyke and her team have been at the forefront of these discoveries, Kane says.

Sequencing microbes cell by cell, however, isn’t the only method for uncovering Earth’s hidden biodiversity. Just a few miles from Woyke’s lab, microbiologist Jill Banfield at UC Berkeley is taking a different approach that has also produced promising results.

Studying the Uncultured

Typically, to study microbes, scientists have grown them in a pure culture from a single individual. That approach is useful for studying organisms in the laboratory, but most microbes live in complex communities of many individuals from different species. By the early 2000s, genetic sequencing technologies had advanced to the point where researchers could study the complex array of microbial genomes without necessarily needing to culture each individual organism. Known as metagenomics, the field began with scientists focused on which genes were found in the wild, which would hint at how each species or strain of microbe could survive in different environments.

Just as Woyke was doubling down on single-cell sequencing, Banfield began using metagenomics to obtain a more nuanced and detailed picture of microbial ecology. The problems she faced, though very different from Woyke’s, were no less vexing. Like Woyke, Banfield focused on extreme environments: acrid hydrothermal vents at the bottom of the ocean that belched a vile mixture of sulfuric acid and smoke; an aquifer flowing through toxic mine tailings in Rifle, Colorado; a salt flat in Chile’s perpetually parched Atacama Desert; and water found in the Iron Mountain Mine in Northern California that is some of the most acidic found anywhere on Earth. Also like Woyke, Banfield knew that identifying the full range of microbes living in these hellish environments would mean moving away from using the standard set of 16S rRNA primers. The main issue Banfield and colleagues faced was figuring out how to assemble the mixture of genetic material they isolated from their samples into discrete genomes.

A web of connectivity calculated by Banfield and her collaborators shows how different proteins illustrate relationships between different microbes.

The solution wasn’t a new laboratory technique, but a different way of processing the data. Researchers obtain their metagenomic information by drawing a sample from a particular environment, isolating the DNA, and sequencing it. The process of sequencing breaks each genome down into smaller chunks of DNA that computers then reassemble. Reassembling a single genome isn’t unlike assembling a jigsaw puzzle, says Laura Hug, a microbiologist at the University of Waterloo in Ontario, Canada, and a former postdoc in Banfield’s lab.

When faced with just one puzzle, people generally work out a strategy, like assembling all the corners and edges, grouping the remaining pieces into different colors, and slowly putting it all together. It’s a challenging task with a single genome, but it’s even more difficult in metagenomics. “In metagenomics, you can have hundreds or even thousands of puzzles, many of them might be all blue, and you have no idea what the final picture looks like. The computers have to figure out which blue pieces go together and try to extract a full, accurate puzzle from this jumble,” Hug says. Not surprisingly, the early days of metagenomics were filled with incomplete and misassembled genomes.
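The single-puzzle case can be sketched in a few lines of Python. The toy assembler below repeatedly merges the two fragments with the longest overlap, a textbook "greedy" strategy chosen here for illustration rather than a description of the software any lab in this story used. The reads are invented, error-free pieces of one 15-letter sequence, and the function names and `min_len` threshold are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if the best overlap is shorter than `min_len`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left overlaps; return the separate pieces
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Invented, error-free fragments of one short sequence, given out of order.
reads = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

With one small genome and clean reads, the pieces snap together. With hundreds of genomes that share similar stretches, the "all blue" pieces in Hug's analogy, the same overlap can join fragments from different organisms, which is one way misassembled genomes arise.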

Banfield’s breakthrough helped tame the task. She and her team developed a better method for binning, the formal name for the computer process that sorts through the pile of DNA jigsaw pieces and arranges them into a final product. As her lab made improvements, they were able to survey an increasing range of environments looking for rare and bizarre microbes. Progress was rapid. In the 1980s, most of the bacteria and archaea that scientists knew about fit into 12 major phyla. By 2014, scientists had increased that number to more than 50. But in a single 2015 Nature paper, Banfield and her colleagues added an additional 35 phyla of bacteria to the tree of life.
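To give a feel for what binning involves, here is a deliberately simple Python sketch that sorts assembled fragments (contigs) by one property of their sequence: the fraction of letters that are G or C. This is not the method Banfield's team developed, which the article does not describe; practical binning tools generally combine richer signals, such as the frequencies of short letter patterns and how abundantly each fragment was sequenced. The contigs, names, and `tolerance` value are invented for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of letters in the sequence that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def bin_by_gc(contigs, tolerance=0.10):
    """Greedy one-pass grouping: place each contig in the first bin whose
    founding contig has a GC fraction within `tolerance`, else open a new bin."""
    bins = []  # each entry: (GC fraction of the founding contig, member names)
    for name, seq in contigs.items():
        gc = gc_fraction(seq)
        for founder_gc, members in bins:
            if abs(gc - founder_gc) <= tolerance:
                members.append(name)
                break
        else:
            bins.append((gc, [name]))
    return [members for _, members in bins]


# Invented contigs, for illustration only.
contigs = {
    "contig_1": "GCGCGGCCGCAT",  # GC-rich
    "contig_2": "ATATTAAATGCA",  # AT-rich
    "contig_3": "GGCCGCGCTAGC",  # GC-rich
    "contig_4": "TTAATATACGAT",  # AT-rich
}
print(bin_by_gc(contigs))  # [['contig_1', 'contig_3'], ['contig_2', 'contig_4']]
```

Each resulting bin is a candidate genome: a pile of pieces that plausibly came from the same organism. Two unrelated microbes can share a similar GC fraction, which is why a single signal like this one is too crude for real samples.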

The latest tree of life was produced when Banfield and her colleagues added another 35 major groups, known as phyla.

Because researchers knew essentially nothing about these bacteria, they dubbed them the “candidate phyla radiation”—or CPR—the bacterial equivalent of Woyke’s DPANN. Like the archaea, these bacteria were grouped together because of their similarities to each other and their stark differences to other bacteria. Banfield and colleagues estimated that the CPR organisms may encompass more than 15% of all bacterial species.

“This wasn’t like discovering a new species of mammal,” Hug says. “It was like discovering that mammals existed at all, and that they’re all around us and we didn’t know it.”

Nine months later, in April 2016, Hug, Banfield, and their colleagues used past studies to construct a new tree of life. Their result reaffirmed Woese’s original 1978 tree, showing humans and, indeed, most plants and animals, as mere twigs. This new tree, however, was much fuller, with far more branches and twigs and a richer array of foliage. Thanks in no small part to the efforts of Banfield and Woyke, our understanding of life is, perhaps, no longer a newborn sapling, but a rapidly maturing young tree on its way to becoming a fully rooted adult.