On August 25th 2012, the spacecraft Voyager 1 exited our Solar System and entered interstellar space, set for eternal solitude among the stars. Its twin, Voyager 2, isn’t far behind. Since their launch from Cape Canaveral in Florida, in 1977, their detailed reconnaissance of the Jovian planets—Jupiter, Saturn, Uranus, Neptune—and over 60 moons extended the human senses beyond Galileo’s wildest dreams.
Once the probe had traveled beyond Neptune, the late astrophysicist Carl Sagan proposed that Voyager 1 should turn around and capture the first portrait of our planetary family. As he wrote in his 1994 book, Pale Blue Dot, “It had been well understood by the scientists and philosophers of classical antiquity that the Earth was a mere point in a vast encompassing Cosmos, but no one had ever seen it as such. Here was our first chance (and perhaps our last for decades to come).”
Indeed, our planet can be seen as a fraction of a pixel against a backdrop of darkness that’s broken only by a few scattered beams of sunlight reflected off the probe’s camera. The precious series of images was radioed back to Earth at the speed of light, taking five and a half hours to reach the huge conical receivers in California, Spain, and Australia more than 4 billion miles away. Over such astronomical distances, one pixel out of 640,000 can easily be replaced by another or lost entirely in transmission. It wasn’t, in part due to a single mathematical breakthrough published decades before.
In 1960, Irving Reed and Gustave Solomon published a paper in the Journal of the Society for Industrial and Applied Mathematics, entitled, “Polynomial Codes Over Certain Finite Fields,” a string of words that neatly conveys the arcane nature of their work. “Almost all of Reed and Solomon’s original paper doesn’t mean anything to most people,” says Robert McEliece, a mathematician and information theorist at California Institute of Technology. But within those five pages was the basic recipe for the most efficacious error-correction codes yet created. By adding just the right levels of redundancy to data files, this family of algorithms can correct the errors that often occur during transmission or storage without taking up too much precious space.
Today, Reed-Solomon codes go largely unnoticed, but they are everywhere, reducing errors in everything from mobile phone calls to QR codes, computer hard drives, and data beamed from the New Horizons spacecraft as it zoomed by Pluto. As demand for digital bandwidth and storage has soared, Reed-Solomon codes have followed. Yet curiously, they’ve been absent in one of the most compact, longest-lasting, and most promising of storage mediums—DNA.
Several labs have investigated nature’s storage device to archive our ever-increasing mountain of digital information, encoding small amounts of data in DNA and, more importantly, reading it back. But those trials lacked sophisticated error correction, which DNA data systems will need if they are to become our storage medium of choice. Fortunately, a team of scientists, led by Robert Grass, a lecturer at ETH Zurich, rectified that omission earlier this year when they stored a duo of files in DNA using Reed-Solomon codes. It’s a mashup that could help us reliably store our fragile digital data for generations to come.
DNA is best known as the information storage device for life on Earth. Only four molecules—adenine, cytosine, thymine, and guanine, commonly referred to by their first letters—make up the rungs on the famous double helix of DNA. These sequences are the basis of every animal, plant, fungus, archaeon, and bacterium that has ever lived in the 4 billion or so years that life has existed on Earth.
“It’s not a form of information that’s likely to be outdated very quickly,” says Sriram Kosuri, a geneticist from the University of California, Los Angeles. “There’s always going to be a reason for studying DNA as long as we’re still around.”
It is also incredibly compact. Since it folds in three dimensions, we could store all of the world’s current data—everyone’s photos, every Facebook status update, all of Wikipedia, everything—using less than an ounce of DNA. And, with its propensity to replicate given the right conditions, millions of copies of DNA can be made in the lab in just a few hours. Such favorable traits make DNA an ideal candidate for storing lots of information, for a long time, in a small space.
A Soviet scientist named Mikael Nieman recognized DNA’s potential back in 1964, when he first proposed the idea of storing data in natural biopolymers. In 1988, his theory was finally put into practice when the first messages were stored in DNA. Those strings were relatively simple. Only in recent years have laboratories around the world started to convert large amounts of the binary code that’s spoken by computers into genetic code.
In 2012, by converting the ones of binary code into As or Cs, and the zeros into Ts or Gs, Kosuri along with George Church and Yuan Gao stored an entire book called Regenesis, totaling 643 kilobytes, in genetic code. A year later, Ewan Birney, Nick Goldman, and their colleagues from the European Bioinformatics Institute added a slightly more sophisticated way of translating binary to nucleic acid that reduced the number of repeated bases.
Such repeats are a common problem when writing and reading DNA, or synthesizing and sequencing, as they’re called. Although Birney, Goldman, and team stored a similar amount of information as Kosuri, Church, and Gao—739 kilobytes—it was spread over a range of media types: 154 Shakespearean sonnets, Watson and Crick’s famous 1953 paper that described DNA’s molecular structure, an audio file of Martin Luther King Jr.’s “I Have a Dream” speech, and a photograph of the building they were working in near Cambridge, UK.
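The one-bit-per-base translation described for the 2012 experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping follows the description given here (ones become A or C, zeros become T or G), while the rule for choosing between the two candidate bases, which simply avoids repeating the previous base, and the function names are hypothetical and not taken from either team's published method.

```python
ONE_BASES = "AC"   # per the description above: ones become As or Cs
ZERO_BASES = "TG"  # and zeros become Ts or Gs

def text_to_bits(text):
    """Turn text into a string of 0s and 1s, eight bits per UTF-8 byte."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def bits_to_dna(bits):
    """Write one base per bit, never repeating the previous base.

    The repeat-avoidance rule is an assumption made for illustration.
    """
    dna = []
    for bit in bits:
        first, second = ONE_BASES if bit == "1" else ZERO_BASES
        # Fall back to the alternative base if the first choice would
        # create a run of two identical bases.
        dna.append(second if dna and dna[-1] == first else first)
    return "".join(dna)

def dna_to_bits(dna):
    """Read the bits back: A or C means 1, T or G means 0."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)
```

Because each bit has two possible bases, the writer is always free to pick the one that differs from its neighbor, and the reader does not need to know which choice was made.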
The European team also integrated a deliberate error-correction system: distributing their data over more than 153,000 short, overlapping sequences of DNA. Like shouting a drink order multiple times in a noisy bar, the regions of overlap increased the likelihood that the message would be understood at the other end. Indeed, after a Californian company called Agilent Technologies manufactured the team’s DNA sequences, packaged them, and sent them to the U.K. via Germany, the team was able to remove any errors that had occurred “by hand” using their overlapping regions. In the end, they recovered their files with complete fidelity. The text had no spelling mistakes, the photo was high-res, and the speech was clear and eloquent.
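The drink-order analogy amounts to a repetition scheme: if the same stretch of data is read several times, the copies can outvote an occasional mistake. The sketch below shows that idea as a per-position majority vote. The team is described as removing errors "by hand", so the automatic vote and the `consensus` function name are assumptions for illustration, not their procedure.

```python
from collections import Counter

def consensus(reads):
    """Majority vote at each position across equal-length reads of one region."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*reads)
    )

# Three hypothetical reads of the same region, two of them with one error each.
reads = ["ACGT", "ACGA", "TCGT"]
repaired = consensus(reads)
```

The cost of this approach is that every base has to be stored several times over.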
“But that’s not what we do,” says Grass, the lecturer at the Swiss Federal Institute of Technology. After seeing Church and colleagues’ publication in the news in 2012, he wanted to compare how competent different storage media were over long periods of time.
“The original idea was to do a set of tests with various storage formats,” he says, “and torture them with various conditions.” Hot and cold, wet and dry, at high pressure, and in an oxygen-rich environment, for example. He contacted Reinhard Heckel, a friend he had met at Belvoir Rowing Club in Zurich, for advice. Heckel, who was a PhD student in communication theory at the time, voiced concern that such an experiment would be unfair since DNA didn’t have the same error-correction systems as other storage devices such as CDs and computer hard drives.
To make it a fair fight, they implemented Reed-Solomon codes into their DNA storage method. “We quickly found out that we could ‘beat’ traditional storage formats in terms of long term reliability by far,” Grass says. When stored on most conventional storage devices—USB pens, DVDs, or magnetic tapes—data starts to degrade after 50 years or so. But, early on in their work, Grass and his colleagues estimated that DNA could hold data error-free for millennia, thanks to the inherent stability of its double helix and that breakthrough in mathematical theory from the mid-20th century.
Out from Obscurity
When storing and sending information from one place to another, you almost always run the risk of introducing errors. Like in the “telephone” game, key parts may be modified or lost entirely. There has been a rich history of reducing such errors, and few things have propelled the field more than the development of information theory. In 1948, Claude Shannon, an ardent blackjack player and mathematician, proposed that by simplifying files or transmissions into numerous smaller components—yes or no questions—combined with error-correcting codes, the relative risk of error becomes very low. Using the 1s and 0s of binary, he hushed the noise of telephone switching circuits.
Using this binary foundation, Reed and Solomon attempted to shush these whispers even further. But their error-correction codes weren’t put into use straight away. They couldn’t be, in fact—the algorithms needed to decode them weren’t invented until 1968. Plus, there wasn’t anything to use them on; the technology that could utilize them hadn’t been invented. “They are very clever theoretical objects, but no one ever imagined they were going to be practical until the digital electronics became so sophisticated,” says McEliece, the Caltech information theorist.
Once technology did catch up, one of the codes’ first uses was in transmitting data back from Voyager 1 and 2. Since the redundancy provided by these codes (together with another type, known as convolutional codes) cleaned up mistakes—the loss or alteration of pixels, for example—the space probes didn’t have to send the same image again and again. That meant more high-resolution images could be radioed back to Earth as Voyager passed the outer planets of our solar system.
Reed-Solomon codes weren’t widely used until October 1982, when compact discs were commercialized by the music industry. To manufacture discs in huge quantities, factories used a master version of the CD to stamp out new copies, but subtle imperfections in the process along with inevitable scratches when the discs were handled all but guaranteed errors would creep into the data. But, by adding redundancy to compensate for errors and minor scratches, Reed-Solomon codes made sure that every disc, when played, was as flawless as the next. “This and the hard disk was the absolute distribution of Reed-Solomon codes all over the world,” says Martin Bossert, director of the Institute of Telecommunications and Applied Information Theory at the University of Ulm, Germany.
At a basic level, here’s how Reed-Solomon codes work. Suppose you wanted to send a simple piece of information like the equation for a parabola (a symmetrical curved line). Such an equation is defined by three numbers, the coefficients in 4 + 5x + 7x². By adding incomplete redundancy in the form of two extra numbers—a 4 and a 7, for example—a total of five numbers is sent in the transmission. As a result, any transposition or loss of information can be corrected for by feeding the additional numbers through the Reed-Solomon algorithm. “You still have an overrepresentation of your system,” Grass says. “It doesn’t matter which one you lose, you can still get back to the original information.”
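One common way to picture that overrepresentation is to treat the three numbers as a polynomial, send its value at five points, and rebuild the curve from whichever three arrive. The Python sketch below does this for 4 + 5x + 7x², working modulo the small prime 257 as a stand-in for the finite fields real codes use. It is a simplification built on assumptions: the field size, the evaluation points 0 to 4, and the function names are all illustrative, and it only handles values known to be missing. Locating and fixing values that are wrong in unknown positions takes the fuller decoding machinery.

```python
from itertools import combinations

P = 257  # a small prime; integers modulo P form a finite field (illustrative)

def encode(coeffs, n):
    """Evaluate the message polynomial (lowest degree first) at x = 0..n-1."""
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(n)]

def poly_mul_linear(poly, root):
    """Multiply poly (lowest degree first) by (x - root), modulo P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def recover(points, k):
    """Rebuild the k coefficients from any k surviving (x, y) points
    using Lagrange interpolation modulo P."""
    points = points[:k]
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points):
        basis, denom = [1], 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                basis = poly_mul_linear(basis, xm)
                denom = denom * (xj - xm) % P
        scale = yj * pow(denom, -1, P) % P  # modular inverse (Python 3.8+)
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * b) % P
    return coeffs

MESSAGE = [4, 5, 7]            # the parabola 4 + 5x + 7x^2
codeword = encode(MESSAGE, 5)  # five numbers sent instead of three

# Drop any two of the five values: the remaining three still give the message.
all_recovered = all(recover(list(kept), 3) == MESSAGE
                    for kept in combinations(codeword, 3))
```

Three points pin down a parabola uniquely, which is why it does not matter which two of the five values go missing.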
Using similar formulae, Grass and his colleagues converted two files—the Swiss Federal Charter from 1291 and an English translation of The Methods of Mechanical Theorems by Archimedes—into DNA. The redundant information, in the form of extra bases placed over 4,991 short sequences according to the Reed-Solomon algorithm, provided the basis for error-correction when the DNA was read and the data retrieved later on.
That is, instead of wastefully overlapping large chunks of sequences as the EBI researchers did, “you just add a small amount of redundancy and still you can correct errors at any position, which seemed very strange at the beginning because it’s somehow illogical,” Grass says. As well as using fewer base pairs per kilobyte of data, this tack has the added bonus of automated, algorithmic error-correction.
Indeed, with a low error rate—fewer than three base changes per 117-base sequence—the overrepresentation in their sequences meant that the Reed-Solomon codes could still get back to the original information.
The same basic principle is used in written language. In fact, you are doing something very similar right now. Even when text contains spelling errors or even when whole words are missing, you can still perfectly read the message and reconstruct the sentence accordingly. The reason? Language is inherently redundant. Not all combinations of letters—including spaces as a 27th option—give a meaningful word, sentence, or paragraph.
On top of this “inner” redundancy, Grass and colleagues installed another genetic safety net. On the ends of the original sequences, they added large chunks of redundancy. “So if we lose whole sequences or if one is completely screwed and it can’t be corrected with the inner [redundancy], we still have the outer codes,” Grass says. It’s similar to how CDs safeguard against scratches.
It may sound like overkill, but so much redundancy is warranted, at least for now. There simply isn’t enough information on the rate and types of errors that occur during DNA synthesis and sequencing. “We have an inkling of the error-rate, but all of this is very crude at this point,” Kosuri says. “We just don’t have a good feeling for that, so everyone just overdoes the corrections.” Further, given that the field of genomics is moving so fast, with new ways to write and read DNA, errors might differ depending on what technologies are being used. The same was true for other storage devices while still in their infancy. After further testing, the error-correction codes could be more attuned to the expected error rates and the redundancy reduced, paving the way for higher bandwidth and greater storage capacity.
Into the Future
Compared with the previous studies, storing two files totaling 83 kilobytes in DNA isn’t groundbreaking. The image below is roughly five times larger. But Grass and his colleagues really wanted to know just how much better DNA was at long-term storage. With their Reed-Solomon coding in place, Grass and colleagues mimicked nature to find out.
“The idea was always to make an artificial fossil, chemically,” Grass says. They tried impregnating their DNA sequences in filter paper, they used a biopolymer to simulate the dry conditions within spores and seeds of plants, and they encapsulated them in microscopic beads of glass. Compared with DNA that hasn’t been modified chemically, all three trials led to markedly lower rates of DNA decomposition.
The glass beads were the best option, however. Water, when unimpeded, destroys DNA. If there are too many breaks and errors in the sequences, no error-correction system can help. The beads, however, protected the DNA from the damaging effects of humidity.
With their layers of error-correction and protective coats in place, Grass and his colleagues then exposed the glass beads to three heat treatments—140˚, 149˚, and 158˚ F—for up to a month “to simulate what would happen if you store it for a long time,” he says. Indeed, after unwrapping their DNA from the beads using a fluoride solution and then re-reading the sequences, they found that slight errors had been introduced similar to those which appear over long timescales in nature. But, at such low levels, the Reed-Solomon codes healed the wounds.
Using the rate at which errors arose, the researchers were able to extrapolate how long the data could remain intact at lower temperatures. If kept in the clement European air outside their laboratory in Zurich, for example, they estimate a ballpark figure of around 2,000 years. But place these glass beads in the dark at –0.4˚ F, the conditions of the Svalbard Global Seed Vault on the Norwegian island of Spitsbergen, and you could save your photos, music, and eBooks for two million years. That’s roughly ten times as long as our species has been on Earth.
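Extrapolations of this kind are often done with an Arrhenius model, in which the logarithm of a decay rate falls along a straight line when plotted against the inverse of absolute temperature. The text here does not say which model the researchers used, so treat the sketch below as an assumption. The decay rates in it are synthetic, generated from made-up constants purely to show the arithmetic, and none of the numbers are the team's measurements.

```python
import math

R = 8.314  # gas constant in J/(mol*K)

def f_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvin."""
    return (deg_f - 32) * 5 / 9 + 273.15

def rate_at(ln_a, ea, temp_k):
    """Arrhenius law: rate = A * exp(-Ea / (R * T))."""
    return math.exp(ln_a - ea / (R * temp_k))

def fit_arrhenius(temps_k, rates):
    """Least-squares line through (1/T, ln rate); returns (ln A, Ea)."""
    xs = [1 / t for t in temps_k]
    ys = [math.log(k) for k in rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, -slope * R

# Made-up constants, used only to generate synthetic "oven" rates.
TRUE_LN_A, TRUE_EA = 40.0, 150e3

oven_k = [f_to_kelvin(f) for f in (140, 149, 158)]  # the three heat treatments
oven_rates = [rate_at(TRUE_LN_A, TRUE_EA, t) for t in oven_k]

ln_a, ea = fit_arrhenius(oven_k, oven_rates)

# How many times slower decay would be at -0.4 F than at 158 F under this model.
cold_k = f_to_kelvin(-0.4)
slowdown = rate_at(ln_a, ea, oven_k[-1]) / rate_at(ln_a, ea, cold_k)
```

With real measurements in place of the synthetic rates, the same fit would turn a month of oven data into an estimate for centuries in the cold, which is also why the result is sensitive to small errors in the measured slope.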
Using heat treatments to mimic the effects of age isn’t foolproof, Grass admits; a month at 158˚ F certainly isn’t the same as millennia in the freezer. But his conclusions aren’t unsupported. In recent years, palaeogenetic research into long-dead animals has revealed that DNA can persist long after death. And when conditions are just right—cold, dark, and dry—these molecular strands can endure long after the extinction of an entire species. In 2012, for instance, the genome of an extinct human relative that died around 80,000 years ago was reconstructed from a finger bone. A year later, that record was shattered when scientists sequenced the genome of an extinct horse that died in Canadian permafrost around 700,000 years ago. “We already have long-term data,” Grass says. “Real long-term data.”
But despite its inherent advantages, there are still some major hurdles to surmount before DNA becomes a viable storage option. For one, synthesis and sequencing are still too costly. “We’re still on the order of a million-fold too expensive on both fronts,” Kosuri says. Plus, it’s still slow to read and write, and it’s neither rewritable nor random access. Today’s DNA data storage techniques are similar to magnetic tape—the whole memory has to be read to retrieve a piece of information.
Such caveats limit DNA to archival data storage, at least for the time being. “The question is if it’s going to drop fast enough and low enough to really compete in terms of dollars per gigabyte,” Grass says. It’s likely that DNA will continue to be of interest to medical and biological laboratories, which will help to speed up synthesis and sequencing and drive down prices.
Whatever new technologies are on the horizon, history has taught us that Reed-Solomon-based coding will probably still be there, behind the scenes, safeguarding our data against errors. Like the genes within an organism, the codes have been passed down to subsequent generations, slightly adjusted and optimized for their new environment. They have a proven track record that starts on Earth and extends ever further into the Milky Way. “There cannot be a code that can correct more errors than Reed-Solomon codes…It’s mathematical proof,” Bossert says. “It’s beautiful.”