DNA, the next storage medium for Big Data?
Present in all cells of living species, the DNA “macromolecule” stores all genetic data relating to life. Its existence was discovered in 1869 by Swiss scientist Friedrich Miescher. The double helix structure was first described in 1953 in a landmark article in the journal Nature. Since then, many research studies have revealed the complexity of the information stored in DNA. For example, the human genome alone has over 3 billion base pairs, and was completely decoded less than twenty years ago.
Many potential applications
It was a small step for researchers to take an interest in DNA for computer applications, not only in terms of storage capacity, but also of data structure. Bioinformatics is the study of this mass of data and its organization, as well as the storage processes that make information accessible to all living cells.
The replication of this structure could have numerous concrete applications, among others for databases, search engines, and for the cloud, whose “organic” organization has many similarities with the biology of living beings: impressive quantities of data, complex interconnections, etc.
Several research teams around the world are specifically interested in DNA storage capacity. For example, teams at Microsoft see various applications for the cloud. In July 2016, Microsoft made headlines when it announced it had stored 200 megabytes of data in DNA, including a music video!
(Virtually) infinite storage
Research has shown that it is theoretically possible to store one quintillion bytes (10^18, or 1,000,000,000,000,000,000) of data per cubic millimetre of DNA. The magnitude and scale of this number are hard to grasp!
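To make that figure a little easier to grasp, here is a minimal arithmetic sketch. It takes the theoretical density quoted above (one quintillion, or 10^18, bytes per cubic millimetre) at face value; the function name and the one-zettabyte example are assumptions chosen for illustration, and decimal units are assumed throughout.

```python
# Theoretical density quoted in the text: 10^18 bytes per cubic millimetre.
BYTES_PER_MM3 = 10**18

# Decimal unit, assumed here for illustration.
ZETTABYTE = 10**21


def volume_mm3(size_bytes: int) -> float:
    """Volume of DNA (in cubic millimetres) needed at the theoretical density."""
    return size_bytes / BYTES_PER_MM3


# One zettabyte would need 1,000 cubic millimetres, i.e. one cubic centimetre.
one_zettabyte_volume = volume_mm3(ZETTABYTE)
print(f"{one_zettabyte_volume:,.0f} mm^3")
```

In other words, at this theoretical limit a full zettabyte would occupy about one cubic centimetre.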
Last year, a joint team from the New York Genome Center and Columbia University achieved an actual storage density of 215 petabytes per gram of DNA. By way of comparison, all films produced since the beginning of film history, if stored digitally in DNA, would fit in a little less than the size of a sugar cube.
“DNA is the densest known storage medium in the universe, according to the laws of physics. That’s why researchers are interested,” says Victor Zhirnov, chief scientist for the Semiconductor Research Corporation, an American research institute. No wonder that DNA intrigues companies like Microsoft or Intel: this research seems to be the solution to humanity’s storage capacity woes, as current media reach their limits just as “big data” explodes.
Making DNA — How does it work?
Microsoft has partnered up with Twist Bioscience, a San Francisco-based biotechnology company. The United States is home to many such start-ups that fabricate DNA or try to improve its production.
For nearly 40 years, it has been possible to create DNA through a chemical synthesis process that links individual nucleotides into longer strands. However, companies in this field would like to improve this tedious and error-prone process. The main breakthrough could come from enzyme-based biotechnology, as we’re seeing with the genetic code of humans.
Forget the 0’s and 1’s
The problem is the encoding process. The data must be converted from binary into a DNA-specific code: chains of nucleotides, the famous A, G, C and T. This encoding process is long, complex and costly, which is the main hindrance at the moment.
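The idea of replacing 0’s and 1’s with A, C, G and T can be sketched in a few lines. The mapping below (two bits per base) is a deliberately naive assumption for illustration, not the scheme used by Microsoft or any other team mentioned here; practical encodings typically also add error correction and avoid problematic sequences such as long runs of the same base. The names `encode` and `decode` are hypothetical.

```python
# Hypothetical 2-bits-per-base mapping, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a nucleotide string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

With this mapping, every byte becomes exactly four bases, so the two-byte message above turns into an eight-base strand.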
The encoding of DNA will have to be automated and accelerated. According to Doug Carmean, an architect at Microsoft Research, the company is currently able to do this at a speed of about 400 bytes per second. But for the process to be viable, it should be at 100 megabytes per second.
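A quick calculation shows how large the gap between those two speeds is. This sketch uses only the two rates quoted above and, as an assumption for illustration, applies them to a 200-megabyte payload like the one mentioned earlier; decimal megabytes are assumed and the function names are hypothetical.

```python
CURRENT_BYTES_PER_SEC = 400          # speed quoted by Doug Carmean
TARGET_BYTES_PER_SEC = 100 * 10**6   # 100 megabytes per second (decimal, assumed)
PAYLOAD_BYTES = 200 * 10**6          # 200 megabytes, used here as an example size


def speedup_needed(current: float, target: float) -> float:
    """Factor by which the write speed must increase."""
    return target / current


def seconds_to_write(size_bytes: float, bytes_per_sec: float) -> float:
    """Time needed to write a payload at a given speed."""
    return size_bytes / bytes_per_sec


factor = speedup_needed(CURRENT_BYTES_PER_SEC, TARGET_BYTES_PER_SEC)
slow = seconds_to_write(PAYLOAD_BYTES, CURRENT_BYTES_PER_SEC)
fast = seconds_to_write(PAYLOAD_BYTES, TARGET_BYTES_PER_SEC)

print(f"Speed-up needed: {factor:,.0f}x")
print(f"200 MB at current speed: {slow / 86400:.1f} days")
print(f"200 MB at target speed: {fast:.0f} seconds")
```

Under these assumptions the required speed-up is 250,000-fold: a 200-megabyte payload would take nearly six days at the current rate, against two seconds at the target rate.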
Microsoft also estimates that the current cost of DNA storage must be reduced 10,000-fold before it becomes competitive enough to be popular. In other words, this strange new technology based on the same molecules as those in our genes is not about to find its way onto our computers.
Nevertheless, the tech giant wants to have an operational DNA-based storage system in one of its data centres by the end of the decade. This system, according to Carmean, could look like “a big Xerox copier from the 1970s”.
An ultra-durable material
Various semiconductor experts thought that DNA would be “too soft” to be considered as a storage medium, when in fact, DNA can last between 100 and 1,000 times longer than current storage devices! And the information is so durable and stable that, having withstood ice ages and other natural disasters, it can still be retrieved and read from the remains of organisms tens of thousands of years old. Can the same be said about our current magnetic media, including tapes still used by companies to perform computer backups?
Synthetic DNA is durable and can encode digital data very densely, making it an attractive medium for long-term data storage. However, recovering stored data on a large scale has required sequencing all the DNA in a given pool, even if only a subset of the information is needed. Microsoft announced last February that it had encoded and stored 35 separate files (over 200 MB of data) in more than 13 million DNA oligonucleotides, then recovered each file individually and without any errors using a random-access approach.
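The difference between reading the whole pool and random access can be pictured with a toy in-memory model. The article does not describe the mechanism, so everything below is an assumption made for illustration: each strand is imagined to carry a file-specific tag and a position index, and retrieval picks out only the strands with the requested tag. The names `Oligo`, `store` and `retrieve`, and the tags themselves, are hypothetical.

```python
import random
from collections import namedtuple

# Toy model of one strand: which file it belongs to, its position, its data.
Oligo = namedtuple("Oligo", ["file_tag", "index", "payload"])


def store(pool: list, file_tag: str, data: bytes, chunk_size: int = 4) -> None:
    """Split a file into small chunks and add them to the shared pool."""
    for index, start in enumerate(range(0, len(data), chunk_size)):
        pool.append(Oligo(file_tag, index, data[start:start + chunk_size]))


def retrieve(pool: list, file_tag: str) -> bytes:
    """Select only the strands tagged for one file and put them back in order."""
    selected = [oligo for oligo in pool if oligo.file_tag == file_tag]
    selected.sort(key=lambda oligo: oligo.index)
    return b"".join(oligo.payload for oligo in selected)


pool = []
store(pool, "TAG_A", b"first file contents")
store(pool, "TAG_B", b"second file")
random.shuffle(pool)  # strands in a pool have no physical order

print(retrieve(pool, "TAG_A"))
```

The point of the sketch is that only the strands belonging to the requested file are selected and reassembled, rather than reading the whole pool in order to rebuild one file.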
Time will tell whether, by the end of the decade, we will be storing our files encoded in nucleobases. In the age of quantum computing, nothing should surprise us anymore.