Is DNA the Future of Data Storage?

Season 8 Episode 18 | 18m 1s | Video has Closed Captions

There are a few obstacles to overcome before we get there.

Could the future of data storage be DNA? It’s the original format after all, storing the information needed to build every living thing, and it has a handful of qualities that would make it perfect to store all the digital information in our world. With recent advances in DNA sequencing and DNA printing, it’s technically possible.

10/27/2022


Genre

Science and Nature


Storing data in DNA could be world changing, really.

This pile of mismatched hard drives that I keep in this basket of shame is a nightmare.

It is only about 20 terabytes of video files, but it is bulky and annoying and just a mess of cables. Scale my hard drives up to even more data, and you're going to need a much bigger basket.

Google search alone, for example, processes over 200 petabytes of data every single day.

That is 200,000 of these single terabyte drives or 200 million single gigabyte drives.

I don't know anyone who uses a single gigabyte drive, but you get the idea.

No one has that kind of storage space.

Well, maybe somebody does.

Now imagine storing all the data in the world, all of it, every book, every movie, every document, everything.

That is about 10 trillion gigabytes of digital data.

So how much space do you think that would take?

Well, you would need one billion 10-terabyte hard drives.

If each one of those is about five inches long, end to end, they would reach nearly 79,000 miles, which is about three times around the equator.
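As a rough check on that arithmetic, here is a minimal Python sketch. It assumes decimal units (1 terabyte = 1,000 gigabytes) and an equator of about 24,901 miles, a figure the episode doesn't give. The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the hard drive numbers, using decimal units.
WORLD_DATA_GB = 10 * 10**12      # about 10 trillion gigabytes
DRIVE_TB = 10                    # one 10-terabyte drive
DRIVE_LENGTH_IN = 5              # each drive is about five inches long
INCHES_PER_MILE = 63_360
EQUATOR_MILES = 24_901           # assumption: approximate length of the equator

drives_needed = WORLD_DATA_GB // (DRIVE_TB * 1_000)
miles = drives_needed * DRIVE_LENGTH_IN / INCHES_PER_MILE
laps = miles / EQUATOR_MILES

print(f"{drives_needed:,} drives, {miles:,.0f} miles, {laps:.1f} laps of the equator")
```

Under those assumptions the drive count comes out to one billion and the line of drives to a little under 79,000 miles.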

But using DNA to store data could keep all of that information in a space as small as a coffee cup.

So what is it about DNA that's making scientists look into using it as data storage?

And what has to happen to get all of our zeros and ones into tiny little DNA strands?

Gosh, this is great footage.

Scientists began playing around with storing data in DNA back in the nineties, and the idea has been around since the 1960s.

But why?

It seems like kind of a weird leap to go from tapes and hard drives to a biological molecule as a storage method.

But DNA is the original data storage format.

I'm a geneticist, so I'm biased.

But every living thing on earth stores the information it needs to build its cells and tissues and respond to its environment in its DNA.

And DNA has some properties that would lend themselves well to a digital data storage solution.

One is that it is super stable over time.

Keep DNA in a dark cool space, not even a freezer, but just something like a wine cellar, and it'll last for hundreds of years.

DNA in permafrost has been decoded after as much as 30,000 years.

It's a long time.

Two, DNA has a really dense storage capacity.

All of the information needed to make you is stored in almost every single one of your cells.

And, tiny spoiler, a single gram of DNA, about half the weight of a sugar cube, can store at least 215 million gigabytes of data.

So DNA naturally has the capability to store a ton of information for a long, long time.

And trust me when I say that I will absolutely need this video footage to be on a drive somewhere in 500 years.

Your great, great, great, great grandchildren need to know ACS Reactions information.

Okay, so while theoretically all the data in the world could fit in a coffee cup, it would probably be more easily accessible and a bit more useful if we stored it in a space about the size of a household refrigerator, but that's still pretty impressive.

The third thing that DNA has that makes it perfect for storing data is a code.

So just like the zeros and ones of binary computer code, DNA stores information in the very specific order of four nucleotides, A, T, C and G. And just like rearranging the zeros and ones in a binary code differently can give you everything from Jurassic Park to Chicken Soup for the Chemist's Soul, rearranging DNA letters can give you everything from dinosaurs to chickens. The two strands of DNA are also complementary.

And this means that an A on one strand of DNA will always bind to a T on the other, and a C will always bind to a G. So if you have one strand of DNA, you can always rebuild the other.

And this makes copying information really easy, and it is absolutely critical to how everything else in this episode is going to work.

It's also phenomenally important to all of life here.

But, you know, priorities.
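The pairing rule (A with T, C with G) is simple enough to sketch in a few lines of Python. This is only an illustration: `partner_strand` is a made-up name, and it pairs the strands position by position, ignoring the fact that the two real strands run in opposite directions.

```python
# Base pairing: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def partner_strand(strand: str) -> str:
    """Return the base that pairs with each position of `strand`."""
    return "".join(PAIRS[base] for base in strand)

one_strand = "ATCGGA"
other_strand = partner_strand(one_strand)

print(other_strand)
# Rebuilding the partner of the partner gives back the original strand.
print(partner_strand(other_strand) == one_strand)
```

Because each letter has exactly one partner, either strand is enough to rebuild the other.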

Another plus of storing data in DNA is that it's not going to become obsolete.

We're going to be sequencing DNA a lot longer than we ever used floppy drives.

And if it's implemented carefully and thoughtfully, DNA data storage has the potential to be way more environmentally friendly than continuing to produce hard drives out of metal and plastic.

The funny thing is that in 2018 when I graduated from grad school, our gel imager was still using Windows 98 and it only had a floppy drive.

So in fact, I used floppy drives until 2018.

There are going to be five big steps for storing and accessing data in DNA.

First, you're going to need to turn the zeros and ones of computer file language into the As, Ts, Cs, and Gs of DNA language.

That will happen in your computer.

Second, you need to physically print out a strand of DNA that encodes those specific As, Ts, Cs and Gs.

This is going to happen on a DNA printer.

Then you can go store your DNA in a tube somewhere.

Third, when you're ready to retrieve your data, you need to access the file or set of files of interest from your DNA storage tube.

Fourth, when you want to read the information you've accessed, you have to take that DNA molecule and then read out the sequence of As, Ts, Cs and Gs that are on it.

That happens in a machine called a sequencer.

Fifth and finally, you can then turn that text file of As, Ts, Cs and Gs into zeros and ones again, and that happens once again in your computer.

Okay, so now let's go from theory to reality.

I'm going to convert our very first ACS Reactions video into DNA.

So to do that, I first have to scroll way back to the bottom of the ACS Reactions page to here, in 2014.

It was a long time ago.

So I took that video file and I wanted to turn it into binary code.

So remember that at their core, computer files are strings of zeros and ones.

And when I did this for just the very first frame of ... Oh, my computer is so dirty.

So I converted just the first frame of that video into zeros and ones, and it was thousands of pages of zeros and ones, thousands of pages, truly.

I thought I was going to print it all out, but then I didn't want to kill all the trees.

So it's a lot of zeros and ones is what I'm telling you.

So for fear of crashing my computer, I didn't do the whole video because it's not necessary.

But now we need a method to turn all of these zeros and ones into As, Ts, Cs, Gs, and one of the easiest ways to do that is to take pairs and say maybe 00 would equal A.

01 would equal T. 10 would equal C, and 11 would equal G. So you could take the information you have and it would actually be compressed by half because you'd go from two characters down to one character, but that's actually not very efficient or reliable.
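Here is a minimal Python sketch of that simple pairs-of-bits scheme, using the same 00 = A, 01 = T, 10 = C, 11 = G table. The function names are invented for this example, and it is the naive mapping, not what any real DNA storage system uses.

```python
# Naive two-bits-per-base mapping described above.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    """Write each byte as 8 bits, then swap every pair of bits for a base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(dna: str) -> bytes:
    """Reverse the mapping: bases back to bits, bits back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = b"Hi"
dna = bytes_to_dna(message)
print(dna)
print(dna_to_bytes(dna))
```

Every byte becomes four bases, and the round trip gives back the original bytes. One known weakness of a fixed table like this is that a long run of zeros turns into a long run of As, and repeated letters are harder to print and read accurately.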

Instead, a team in 2017 used information theory principles to compress and error correct their data.

Now look, it's complicated, but it's kind of like zipping files on your computer to make them smaller.

It's better to zip your files before you store them as DNA.
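To see why zipping first helps, here is a small Python sketch. It is not the 2017 team's method: `zlib` from the standard library stands in for the compression step, there is no error correction at all, and `to_bases` is a made-up helper that uses the simple 00 = A, 01 = T, 10 = C, 11 = G pairing.

```python
import zlib

def to_bases(data: bytes) -> str:
    """Two bits per base: 00 = A, 01 = T, 10 = C, 11 = G."""
    return "".join(
        "ATCG"[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

raw = b"zeros and ones, " * 64      # a repetitive toy "file" of 1,024 bytes
zipped = zlib.compress(raw)         # stand-in for the compression step

print(len(to_bases(raw)), "bases without zipping")
print(len(to_bases(zipped)), "bases after zipping")
print(zlib.decompress(zipped) == raw)
```

Fewer bytes going in means fewer bases to print, and unzipping after sequencing gets the original file back.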

This isn't going to happen, right?

So much DNA.

The team was able to store a full computer operating system, an 1895 French film, Arrival of a Train at La Ciotat, a $50 Amazon gift card, a computer virus, a Pioneer plaque, and a 1948 study by information theorist Claude Shannon in a little over 14 million DNA bases.

To put that into context, your human genome is 3.2 billion bases long.

14 million bases could fit onto a speck way smaller than the period at the end of a sentence.

With this method, the team suggested that they could pack 215 petabytes of data into a single gram of DNA, which was way more dense than any other effort at the time.

And to give that a little bit of context, 250 petabytes is like 1.5 billion Facebook photos.

It's a lot of data.

That actually doesn't sound like enough.

The theoretical upper limit of DNA data storage is a zettabyte in a gram of DNA.

That is 1 trillion gigabytes.

Even Elon Musk doesn't have that much storage space, probably.
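The unit conversions here are easy to lose track of, so here is a rough Python sketch. It assumes decimal units and combines the 10 trillion gigabytes of world data, the 215 petabytes per gram, and the one zettabyte per gram mentioned in this episode. The variable names are just for illustration.

```python
# Decimal storage units, in bytes.
GB = 10**9
PB = 10**15
ZB = 10**21

world_data = 10 * 10**12 * GB       # about 10 trillion gigabytes
team_density = 215 * PB             # bytes per gram suggested by the 2017 team
upper_limit = 1 * ZB                # bytes per gram, theoretical upper limit

gb_per_zb = ZB // GB
kg_at_team_density = world_data / team_density / 1000
grams_at_limit = world_data / upper_limit

print(f"1 zettabyte = {gb_per_zb:,} gigabytes")
print(f"at 215 PB per gram: about {kg_at_team_density:.1f} kg of DNA")
print(f"at 1 ZB per gram: about {grams_at_limit:.0f} g of DNA")
```

So with these numbers, all the world's data would take tens of kilograms of DNA at the 2017 density, and only about ten grams at the theoretical limit.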

So let's say we use this method to turn our zeros and ones of our video file into As, Ts, Cs and Gs.

Now we have a file of As, Ts, Cs and Gs.

So how do we turn that text file into a strand of DNA?

Well, I'll tell you that this is where our production budget runs out.

I can code stuff.

I can't afford to print the DNA.

So you have to take the nucleotides that are in a text file on your computer and print them out as strands of DNA.

You can think of it like an inkjet printer but for DNA.

Now, you can do this a couple of different ways, but the most common and standard for a long time has been phosphoramidite chemistry.

To get how this works, we have to take a little bit of a closer look at the As, Ts, Cs and Gs that make up the DNA code.

They're each different types of nucleosides strung together on a backbone of sugars and phosphates.

A nucleoside plus one of these phosphates is called a nucleotide.

And these phosphates are key.

To build a strand of DNA, cells connect the phosphate group of one nucleotide to a free hydroxyl group on a carbon of the next nucleotide's sugar ring.

So you can attach phosphate to sugar, phosphate to sugar, phosphate to sugar over and over to build a big long chain.

But unlike a normal nucleotide, a phosphoramidite molecule has a phosphite in the place of the phosphate.

That phosphite contains one fewer oxygen atom than the normal phosphate, and it can't connect to the next nucleotide.

This means it's considered protected.

If it's at the end of a chain of nucleotides, you can't add another, you can't keep growing that chain.

However, those phosphites can be easily removed when necessary.

So to print out DNA, you can start with your first letter as a phosphoramidite.

Remember, in our DNA data storage scenario, this will stand for zeros and ones.

But in the actual thing that you're printing out, it is a molecule called adenine or A for short.

Now, imagine that you have a phosphoramidite A and you want to attach a G as the next base.

Well, that phosphoramidite A is protected so you can't add a G. Nothing will stick to it.

So in order to add that G, you have to de-protect the phosphite group on the A and then wash the Gs over it.

But this means that only one G will bind because the Gs themselves are protected.

So this way you don't accidentally get a string of like G, G, G, G, G, you just get an A and then your one G. So now you have a two letter long strand of DNA, A and G. To add the next letter, maybe a T, you can de-protect the G and then wash a bunch of Ts over it so that you get A, G, T and so on and so on.

So slowly, base by base, you can turn your text file of As, Ts, Cs and Gs into a real strand of DNA.
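The de-protect, wash, repeat cycle can be mimicked with a toy Python model. This is a sketch of the logic only: `GrowingStrand` and `print_dna` are made-up names, "protected" is reduced to a true/false flag, and all of the real chemistry is left out.

```python
class GrowingStrand:
    """Toy model of a strand being printed one protected base at a time."""

    def __init__(self) -> None:
        self.bases: list[str] = []
        self.protected = False          # the empty starting site is open

    def deprotect(self) -> None:
        self.protected = False          # open up the end of the strand

    def wash(self, base: str, copies: int = 1000) -> None:
        """Flood the strand with many copies of one base."""
        for _ in range(copies):
            if self.protected:
                break                   # the end is capped, nothing else sticks
            self.bases.append(base)
            self.protected = True       # the base that just bound is protected


def print_dna(sequence: str) -> str:
    strand = GrowingStrand()
    for base in sequence:
        strand.deprotect()
        strand.wash(base)
    return "".join(strand.bases)


printed = print_dna("AGT")
print(printed)

# Skipping the de-protect step: the G has nowhere to attach.
stuck = GrowingStrand()
stuck.wash("A")
stuck.wash("G")
print("".join(stuck.bases))
```

Even though each wash offers a thousand copies, only one base is added per cycle, which is the whole point of the protecting group.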

So now your files are strands of DNA.

You can place them into a tube just like this and store them for hundreds of years.

But what about when you need to read that data back out?

So first you need to find the piece of DNA that you're looking for.

The whole point of this storage method is that it's really dense.

So you're not going to give each file its own tube.

You're going to put a bunch of files together all in one tube.

But finding what you're looking for in that tube of DNA can be really difficult.

Imagine that you've taken every book in the world, torn out all of the individual pages, and then tossed them together into an empty swimming pool.

Finding the book you want to read is going to be really difficult and you don't want to have to look at every single page to do so.

One way to find just what you're looking for is using a process called PCR, or polymerase chain reaction.

Basically it's a method that allows you to make lots and lots of copies of a very specific piece of DNA.

So you take your mish-mashed soup of DNA files and you add two small pieces of DNA to it that are going to specify the one piece that you actually want to read.

Then you copy that piece billions of times, so it overwhelms everything else in your soup.

It's like taking a bowl of chicken noodle soup and duplicating the carrots so many times that you have a five ton pile of carrots with a tiny bowl of soup underneath.

If you just want to eat carrots, it's a lot easier to do now.

So same with your DNA.

If you pick a piece of DNA from the tube to sequence, you're more likely going to be looking at the DNA you want than the DNA you don't want just because there's more of it around.
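Here is a toy Python version of that carrot-multiplying trick. It is a heavy simplification: the two small pieces of DNA are modeled as a matching start and end on each strand, the sequences are invented, and every matching strand is assumed to double perfectly on each of 30 cycles.

```python
from collections import Counter

def pcr(pool: Counter, forward: str, reverse: str, cycles: int = 30) -> Counter:
    """Toy PCR: strands flanked by both short sequences double every cycle."""
    amplified = Counter(pool)
    for strand, copies in pool.items():
        if strand.startswith(forward) and strand.endswith(reverse):
            amplified[strand] = copies * 2**cycles
    return amplified

# Hypothetical soup of three "files", one copy each.
pool = Counter({
    "AACC" + "GATTACA" + "GGTT": 1,    # the file we want
    "ACAC" + "TTTTGGG" + "GTGT": 1,
    "AGAG" + "CCCATAT" + "TCTC": 1,
})

after = pcr(pool, forward="AACC", reverse="GGTT")
wanted = after["AACCGATTACAGGTT"]
fraction = wanted / sum(after.values())

print(f"{wanted:,} copies of the wanted file")
print(f"{fraction:.7%} of the soup is now the wanted file")
```

After 30 doublings there are over a billion copies of the wanted strand and still just one of each of the others, so almost anything you pull out of the tube is the file you were after.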

But this process can be kind of annoying since you could use up your soup just to retrieve one file.

The chicken and noodles and broth under the carrot pile are basically useless now.

So one group is embedding the DNA in little spheres of silica to make the retrieval easier.

As a test, they encoded 20 different images into DNA.

They then put little DNA barcodes on the outside of all of those spheres.

These were little tags hanging off that would allow you to pick out the images that you wanted.

So these barcodes corresponded to keywords about the image, things like orange and cat for an image of a tiger.

Then they could use fluorescent tagging to sort out just the spheres that they wanted to sequence.

And I know this sounds a little bit like magic, so let me explain.

Because the tags they put on the outside of the spheres are DNA, they can use the complementary nature of DNA to make tiny matching tags that will bind to only the tags that they're looking for.

Things like CAT.

You then stick a fluorescent molecule onto those matching tags and then mix them in with the sample of spheres.

They'll bind to the tags on the images they're looking for.

So now the images you want will have a fluorescent tag on them and the images that you don't will not have a fluorescent tag on them.

You can use a super special sorting machine to separate the fluorescing spheres out from the ones that aren't fluorescing.

So now you have a group of just the files you were looking for.
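In code, the matching-tag idea looks something like the Python sketch below. The barcode sequences, the example images, and the function names are all hypothetical, since the episode doesn't give the real ones, and "fluorescing" is reduced to a simple yes or no.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def matching_tag(tag: str) -> str:
    """Complementary strand, written in the reversed direction by convention."""
    return "".join(PAIRS[base] for base in reversed(tag))

# Hypothetical keyword barcodes, invented for this example.
BARCODES = {
    "orange": "ACGGTCAT",
    "cat": "CATTGGCA",
    "blue": "TTGACCGA",
    "bird": "GGATCTAC",
}

spheres = [
    {"image": "tiger", "tags": {BARCODES["orange"], BARCODES["cat"]}},
    {"image": "bluebird", "tags": {BARCODES["blue"], BARCODES["bird"]}},
    {"image": "goldfish", "tags": {BARCODES["orange"]}},
]

def sort_fluorescent(spheres: list, keyword: str) -> list:
    """Return the images whose spheres the fluorescent probe sticks to."""
    probe = matching_tag(BARCODES[keyword])     # the fluorescent matching tag
    return [s["image"] for s in spheres if matching_tag(probe) in s["tags"]]

print(sort_fluorescent(spheres, "cat"))
print(sort_fluorescent(spheres, "orange"))
```

A probe for "cat" lights up only the tiger sphere, while a probe for "orange" lights up both the tiger and the goldfish.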

The team was able to accurately pick out the image they wanted from 20 images stored in DNA, and they say that this could work for up to 10 to the 20th files.

That's a lot of files.

So now you found your file.

How do you read it?

Well, you can sequence the DNA, which just means reading the order of the As, Ts, Cs and Gs in that strand of DNA.

And there are two main methods of doing this.

The first is by building a copy of the DNA and the second is by pulling that DNA through a tiny pore.

So the first method takes advantage of the complementary nature of DNA that we've talked about before in this episode where an A always binds to a T and a C always binds to a G. The sequencing machine looks at a strand of DNA of unknown sequence.

In order to determine its sequence, it slowly builds a copy of it base by base.

So each new base you add gives off a different color of fluorescent light, say red for C or yellow for A.

A very sensitive camera can pick up this light and turn it into a DNA sequence that you can then read on a text file.
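As a sketch of how colors become letters, here is a toy Python version. Red for C and yellow for A come from the example above. Green for G and blue for T are assumptions made up to complete the table, and a real sequencer's signal is much noisier than this.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Red and yellow follow the example above; green and blue are assumed.
COLOR_OF = {"C": "red", "A": "yellow", "G": "green", "T": "blue"}
BASE_OF = {color: base for base, color in COLOR_OF.items()}

def flashes(template: str) -> list:
    """Colors the camera sees as the copy is built, one base at a time."""
    return [COLOR_OF[PAIRS[base]] for base in template]

def call_bases(colors: list) -> str:
    """Turn colors into the copied bases, then infer the original strand."""
    copy = "".join(BASE_OF[color] for color in colors)
    return "".join(PAIRS[base] for base in copy)

unknown = "GATTACA"
seen = flashes(unknown)
print(seen)
print(call_bases(seen))
```

The machine never reads the unknown strand directly. It reads the copy it is building, and the pairing rule tells it what the original must have been.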

The other common method of DNA sequencing uses a pore embedded in a membrane.

The sequencer slowly pulls the DNA through that pore, and as it does so, the DNA strand disrupts an electrical signal running through that membrane.

Each nucleotide is shaped a little bit differently, so it disrupts the signal a little bit differently.

And you can read those disruptions back out as A, T, C or G. Because this kind of sequencing gives you your information back in real time, it can also be used for the data access problem that we talked about earlier.

If you start reading a piece of DNA and it doesn't look like the file you want, you can have the sequencer kick it out of the pore and keep looking for the file that you do want.
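That kick-it-out behavior can be sketched in Python like this. It is a toy model: letters stand in for the electrical signal, the strands are invented, and "looks like the file you want" is assumed to mean the strand starts with a known short sequence.

```python
def read_through_pore(strand: str, wanted_start: str):
    """Read base by base; eject the strand as soon as it stops matching."""
    read = ""
    for base in strand:
        read += base
        still_checking = len(read) <= len(wanted_start)
        if still_checking and wanted_start[:len(read)] != read:
            return None, len(read)      # kicked out of the pore early
    return read, len(read)

# Hypothetical strands; only the first one starts with the wanted sequence.
kept, bases_read_kept = read_through_pore("AACCGATTACA", "AACC")
rejected, bases_read_rejected = read_through_pore("ACACTTTTGGG", "AACC")

print(kept, bases_read_kept)
print(rejected, bases_read_rejected)
```

The unwanted strand is rejected after only two bases, so the sequencer wastes very little time on it.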

But no matter which way you sequence it, you can then take whatever computer algorithm you used to turn zeros and ones into As, Ts, Cs and Gs and run it in reverse to get back the zeros and ones of your computer file.

So digital data in DNA sounds awesome, but we're all still lugging around hard drives.

So what needs to happen for us to store all of our information in just this?

I mean, at least theoretical DNA tastes good.

Well, as you might have noticed already, all of these methods are a little bit bulky and time consuming.

At the moment it takes about a week to order printed DNA from most of the companies offering that service.

Then it takes about a half hour of prep and a couple hours of sequencing to get that info back out when you want it.

And there's also some computational time in there too.

So this is all good for files that you don't need often, but it's not great if you just want to show someone the cute picture you took of your dog yesterday.

Look at how cute she is.

DNA data storage would be great for things you don't have to access often replacing stuff like tape backups.

It's already primed to be a great archival method.

But the other thing we need to overcome is cost.

Nobody's got enough money to print all of the digital data that we have.

Well, maybe someone does.

Currently it would cost $1 trillion to write one petabyte of data into DNA.

Again, that's like 6 million Facebook photos.

That's way too expensive to make it accessible.

One way to make it affordable might be to try new methods of writing out that DNA.

EDS or enzymatic DNA synthesis is one promising new method.

Like it sounds, this uses an enzyme rather than a typical chemical reaction to add a base to a growing nucleotide chain.

Like the phosphoramidite synthesis, you start out with your first nucleotide and then slowly build your strand one by one.

But EDS uses an enzyme that's carrying the next nucleotide that you want to add to the growing chain.

Once it adds it, it then protects the end so more nucleotides can't bind, then it does this with the next nucleotide and the next nucleotide and the next nucleotide.

So this could be useful for making longer and longer strands of DNA since the previous phosphoramidite method starts to peter out after about a few hundred bases.

The combination of new DNA printing methods, the ability to write longer strands of code and decreasing sequencing costs means that by 2030, DNA synthesis could reach affordable rates for long term storage, something around a dollar a terabyte each for reading and writing.
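To put that goal next to today's price, here is a quick Python calculation. It uses only the figures quoted in this episode, $1 trillion per petabyte now and about a dollar per terabyte as the target, and assumes 1 petabyte = 1,000 terabytes.

```python
COST_PER_PB_NOW = 1_000_000_000_000     # dollars to write one petabyte today
TB_PER_PB = 1_000                       # decimal units
TARGET_PER_TB = 1                       # dollars per terabyte, the 2030 goal

cost_per_tb_now = COST_PER_PB_NOW / TB_PER_PB
drop_needed = cost_per_tb_now / TARGET_PER_TB

print(f"today: about ${cost_per_tb_now:,.0f} per terabyte written")
print(f"needed: roughly a {drop_needed:,.0f}-fold drop in cost")
```

By these numbers, writing costs would have to fall by about a billionfold to hit that target.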

Okay, so real talk: in my professional geneticist opinion, I really do believe that this is going to be a reality for archival data storage in about the next decade.

Sequencing costs are coming way down.

DNA printing costs are coming down, and it's a stable, compact, efficient storage solution for the massive amount of data that we are all creating every single day.

And the potential for doing extra cool things like storing data in DNA inside living bacteria or sending stable DNA messages out into the solar system just makes the sci-fi loving part of my brain sing.

So bring it on, Elon.

I am ready to put your computational data storage capacity to shame with DNA.

It's snack time.
