Several news outlets this week have reported on the idea of storing the world's data on DNA strands. Neil Lamb, HudsonAlpha's director of educational outreach, gives a bit more insight into exactly how the process would work...
Scientists have developed a method for storing documents,
images and sound files inside the strands of the DNA double helix. The
technology could open new avenues to keep copies of your favorite photos, that short
story you wrote in fifth grade or those home movies of Christmas and birthday
parties. Best of all, the stored data would remain intact for thousands of years and
would take up less space than a tube of lipstick.
Let’s back up for a moment and discuss storing data. Information,
whether from text, image or sound, is digitally encoded as long strings of 0s
and 1s. Eight of these digits make up a “byte” of information. A typed page is
made up of about 2,000 bytes, while a movie download contains about a billion bytes.
It’s been estimated that all of the world’s digital data takes up roughly three
zettabytes (a zettabyte is a billion trillion bytes).
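To make the idea of bytes as strings of 0s and 1s concrete, here is a minimal Python sketch. The function name `to_bits` and the sample message are illustrative choices, not anything taken from the research.

```python
def to_bits(data: bytes) -> str:
    """Return the data as a string of 0s and 1s, eight digits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


# Three ASCII characters take up three bytes, or 24 binary digits.
message = "DNA".encode("ascii")
bits = to_bits(message)
print(len(message), "bytes ->", len(bits), "bits")
print(bits)
```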
DNA also uses a code to store information. In this case the code is four chemical “bases”: adenine (A), thymine (T), guanine (G) and cytosine (C). Several years ago, scientists began to look at how the digital code of 1s and 0s could be stored inside DNA. The digital string of 0s and 1s is rewritten as a series of A, T, C and G. (Keep in mind, the DNA fragments used for storage have no biological function and are kept inside a vial rather than inside a cell.) When stored under particular conditions, the DNA is stable for tens of thousands of years. When it’s time to recover the information, the DNA is sequenced and the order of the bases is converted back to the corresponding bytes.
Early attempts to store information as DNA code directly
mapped 0s and 1s onto the bases: for example, a 0 was represented by A or C and a 1 by T or G. Unfortunately, this
approach is problematic when the string of 0s and 1s leads to a repeat in the
DNA sequence – like CCCCC. Current DNA sequencing technology struggles to
correctly identify these repeat regions, miscalculating how many “Cs” are
present and introducing errors into the numerical data.
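The repeat problem is easy to see in a small Python sketch. It assumes, purely for illustration, that a 0 is always written as C and a 1 always as G (the early schemes allowed A or C for 0 and T or G for 1). The function names are hypothetical.

```python
from itertools import groupby

# Assumed fixed choice for illustration only: 0 -> C, 1 -> G.
BIT_TO_BASE = {"0": "C", "1": "G"}
# Decoding accepts either base allowed for each digit.
BASE_TO_BIT = {"A": "0", "C": "0", "T": "1", "G": "1"}


def encode_direct(bits: str) -> str:
    """Map each binary digit straight onto a DNA base."""
    return "".join(BIT_TO_BASE[bit] for bit in bits)


def decode_direct(dna: str) -> str:
    """Map each DNA base back onto a binary digit."""
    return "".join(BASE_TO_BIT[base] for base in dna)


def longest_run(dna: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max((len(list(group)) for _, group in groupby(dna)), default=0)


bits = "0100000100000000"  # the letter "A" followed by a byte of zeros
dna = encode_direct(bits)
print(dna)               # CGCCCCCGCCCCCCCC
print(longest_run(dna))  # 8
```

A run of zeros in the data becomes a run of identical bases in the DNA, which is exactly the kind of region sequencers tend to miscount.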
Here’s where the recent media attention comes into play. Nick Goldman and colleagues at the European Bioinformatics Institute in the UK have
devised a method to minimize the likelihood of these sequencing errors. Rather than use
a direct link between 0s and 1s and DNA bases, they devised an intermediate
code that prevents repeating bases. To further reduce errors, the code
is split into fragments four different ways, with the breakpoints occurring at
different locations each time. This way, if an error does occur, other copies
of the same region can be used for comparison.
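One way a repeat-free code with redundant fragments could work is sketched below in Python. This is an illustration under stated assumptions, not the team's published scheme: each byte is rewritten as six base-3 digits, each digit picks one of the three bases that differ from the previous base, and fragments are cut as windows of 100 bases starting every 25 bases. All names and sizes are hypothetical.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: 3**6 = 729 values, enough for one byte


def byte_to_trits(byte: int) -> list[int]:
    """Write one byte (0-255) as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        byte, digit = divmod(byte, 3)
        trits.append(digit)
    return trits[::-1]


def encode_no_repeats(data: bytes, start: str = "A") -> str:
    """Encode so that no base is ever followed by the same base."""
    prev = start
    out = []
    for byte in data:
        for trit in byte_to_trits(byte):
            # The three bases that differ from the previous one.
            choices = [base for base in BASES if base != prev]
            prev = choices[trit]
            out.append(prev)
    return "".join(out)


def decode_no_repeats(dna: str, start: str = "A") -> bytes:
    """Reverse encode_no_repeats."""
    prev = start
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def overlapping_fragments(dna: str, length: int = 100, step: int = 25) -> list[str]:
    """Cut windows of `length` bases starting every `step` bases.

    With step = length / 4, most bases land in four different fragments.
    Windows near the end are shorter, and bases near the start are
    covered fewer times in this simplified version.
    """
    return [dna[i:i + length] for i in range(0, len(dna), step)]


message = b"DNA"
dna = encode_no_repeats(message)
print(dna)
print(decode_no_repeats(dna) == message)
```

Because every base is chosen from the three that differ from its neighbor, a long run of zeros in the data can no longer turn into a long run of one base in the DNA.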
The scientific team encoded multiple files, including part
of an MP3 recording of Martin Luther King’s “I have a dream” speech, a text
file of all the sonnets of William Shakespeare and a PDF of the 1953 paper by
Francis Crick and James Watson describing the structure of DNA. All told,
757,000 bytes of information were encoded on over 153,000 DNA fragments. The
scientists estimate their approach, which is described online in the journal Nature, can store over two petabytes (or
two million billion bytes) of information on a single gram of DNA. That’s a
mind-boggling amount of information contained in something about the size of 15
grains of sugar.
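A quick back-of-the-envelope check of these figures, written as a short Python sketch. It uses only the numbers quoted in this article, treats "over two petabytes" and "over 153,000" as exact for simplicity, and the variable names are illustrative.

```python
BYTES_ENCODED = 757_000          # size of the encoded files
FRAGMENTS = 153_000              # number of DNA fragments (a lower bound)
BYTES_PER_GRAM = 2 * 10**15      # two petabytes per gram (a lower bound)
WORLD_DATA_BYTES = 3 * 10**21    # roughly three zettabytes

# Average file data per fragment. Each region is stored in several
# fragments, so a single fragment physically holds more than this.
bytes_per_fragment = BYTES_ENCODED / FRAGMENTS

# Grams of DNA needed to hold the world's digital data at that density.
grams_for_world_data = WORLD_DATA_BYTES / BYTES_PER_GRAM

print(f"{bytes_per_fragment:.1f} bytes of file data per fragment on average")
print(f"{grams_for_world_data:,.0f} grams for three zettabytes")
```

At the quoted density, the world's three zettabytes would fit in about 1.5 million grams of DNA, or roughly a ton and a half.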
Speed and cost are the two biggest drawbacks to DNA-based
storage. It took four days to synthesize the code into
DNA, and the process of sequencing and decoding the fragments required two
weeks. The synthesis and decoding process cost $12,620 per megabyte of
information, millions of times more expensive than storing data on magnetic
tape. However, as technology continues to improve, both the price and timeframe
are expected to drop dramatically. If current trends continue, the researchers
estimate that in less than a decade DNA-based storage will be cost-effective
for information stored for 50 years or more. This could be especially useful for
long-term archiving of governmental, historical or scientific data that would
only rarely be accessed.
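To put the price in perspective, here is a rough Python calculation based on the per-megabyte figure above. It assumes a megabyte is one million bytes, and the result is an estimate implied by that figure rather than a cost reported by the researchers.

```python
COST_PER_MEGABYTE = 12_620        # dollars, synthesis plus decoding
BYTES_PER_MEGABYTE = 1_000_000    # assumption: decimal megabytes
BYTES_ENCODED = 757_000           # size of the files in the experiment

# Implied cost of storing and reading back the experiment's files.
experiment_cost = BYTES_ENCODED / BYTES_PER_MEGABYTE * COST_PER_MEGABYTE

# The same rate scaled up to a gigabyte (1,000 megabytes).
cost_per_gigabyte = COST_PER_MEGABYTE * 1_000

print(f"${experiment_cost:,.2f} for the experiment's files")
print(f"${cost_per_gigabyte:,} per gigabyte")
```

At this rate, a single gigabyte would cost more than $12 million, which is why the approach only makes sense today for small, valuable archives.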
If you’ve ever had to search for a way to pull data from an
old floppy disk, zip drive or VHS tape, you know how quickly digital storage
technologies change. The researchers note DNA has been storing biological information
for more than 3 billion years, meaning the odds are high it will be around in the
future, available for conversion into whatever new technology civilizations are
using to share data. Hang on to your CDs, DVDs and thumb drives a little bit
longer, but this technology is certainly worth watching.
Dr. Neil Lamb is HudsonAlpha's director of
educational outreach. Trained as a human geneticist,
he now focuses his energy on creating programs and
activities that help Alabama's teachers, students and the public understand genetics and biotechnology.


