Methylated DNA storage goes parallel

A new DNA-based storage approach, faster than coding data in DNA’s four nucleic bases, works by setting DNA sections on or off, with the presence or absence of methyl groups, to provide binary coding – but the speed is millions of times slower than LTO-9 tape.

The concept is described in a Nature paper titled “Parallel molecular data storage by printing epigenetic bits on DNA,” written by a team of Chinese researchers.

DNA has an informational density a million times that of SSDs, and endures for thousands of years. But it is currently written in a serial fashion with expensive, bulky and slow equipment. The Chinese scientists may have found a way to speed things up.

The four DNA nucleic bases, nucleotides, are adenine (A), guanine (G), cytosine (C) and thymine (T) and they are found in the double helix formation of the DNA biopolymer molecule, located in the cells of all living organisms. A DNA molecule contains many genes arranged in a set order. A mammalian gene can be more than 10,000 nucleotides long, and some surpass 100,000 nucleotides.

A methyl group is formed from a carbon atom linked to three hydrogen atoms – CH₃ – and such groups can be added to DNA gene sequences to make them inactive. Inactive and active gene sequences can stand for binary 1s and 0s – epigenetic bits. The addition is a type of chemical change, DNA methylation, called an epigenetic modification. Data to be stored – text or images for example – is encoded in binary numbers and these are written to DNA in what has been likened to a chemical equivalent of a monotype printing process. We should envisage the “printing” of epigenetic bits and understand that it is done in parallel – in a single DNA reaction.
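To make the binary mapping concrete, here is a minimal sketch of how a short text could be turned into a sequence of epi-bits and back. It is an illustration only, not the paper's encoding scheme: the function names are hypothetical, and the convention that 1 means "methylated" and 0 means "unmethylated" is an assumption made for this example.

```python
# Illustrative only: maps ASCII text to a list of epi-bits and back.
# Assumption for this sketch: 1 = methyl group present, 0 = methyl group absent.

def text_to_epibits(text: str) -> list[int]:
    """Encode ASCII text as a flat list of bits, most significant bit first."""
    bits = []
    for byte in text.encode("ascii"):
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits


def epibits_to_text(bits: list[int]) -> str:
    """Decode a flat list of bits (length a multiple of 8) back to ASCII text."""
    if len(bits) % 8 != 0:
        raise ValueError("bit count must be a multiple of 8")
    data = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        data.append(value)
    return data.decode("ascii")


epibits = text_to_epibits("DNA")
# Show the methylation pattern that would need to be "printed".
print("".join("M" if b else "-" for b in epibits))
print(epibits_to_text(epibits))
```

The three letters "DNA" come to 24 epi-bits under this mapping, each of which corresponds to one site being left methylated or unmethylated.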

Epigenetic printing and retrieval flowcharts. Note: A barcoded DNA carrier is a DNA molecule with a unique “barcode” sequence embedded within it, acting as a molecular identifier. These barcodes can be added to short, specific DNA sequences and function as unique tags, allowing researchers to track, identify, and differentiate them.
Monochrome rubbing image and panda’s head.

It is simpler, and faster, to do this than to synthesize new DNA sequences with data encoded using the four nucleotides, as existing DNA can have sections switched on or off using templates – prefabricated DNA “bricks.” The researchers used a set of 700 DNA “movable types” and five templates, and wrote approximately 275,000 bits on an automated platform with 350 bits written per reaction.

They say their “framework, that programs arbitrary epigenetic information on universal DNA, is desirable for the purpose of synthesis-free DNA data storage.”

Data reading is accomplished by nanopore sequencing that can detect methyl group absence or presence and so reconstitute the binary data. This data contains data group indexing information so that the binary sequences are regenerated in the correct order.
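The role of the indexing information can be sketched in a few lines. This is a simplified illustration, not the paper's actual indexing format: the group size, the `(index, payload)` tuple layout and the function names are all assumptions made for the example. The point is only that groups read back in arbitrary order can be put back in sequence if each carries its own index.

```python
import random

# Illustrative only: each data group carries an index so that groups read
# back in any order can be reassembled. The tuple layout is hypothetical.

def split_with_index(bits: list[int], group_size: int) -> list[tuple[int, list[int]]]:
    """Cut a bit sequence into (index, payload) groups of at most group_size bits."""
    return [
        (index, bits[start:start + group_size])
        for index, start in enumerate(range(0, len(bits), group_size))
    ]


def reassemble(groups: list[tuple[int, list[int]]]) -> list[int]:
    """Sort groups by index and concatenate their payloads."""
    ordered = sorted(groups, key=lambda group: group[0])
    bits = []
    for _, payload in ordered:
        bits.extend(payload)
    return bits


original = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
groups = split_with_index(original, group_size=4)
random.shuffle(groups)  # sequencing reads arrive in no particular order
print(reassemble(groups) == original)
```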

The researchers demonstrated the technique’s feasibility by storing the letters “DNA,” then images with, first, a monochrome brass rubbing and, second, a colored panda’s head, comprising >250,000 epi-bits, and then retrieving them. They leveraged “the parallel nature of the epi-bit writing mechanism” with the images, and defined “the bit parallelism … as the number of bits written in a single minimal reaction per data-writing cycle.” Traditional synthesis-based DNA storage “has a bit parallelism of around 1 … whereas the enlarged storage experiment had a bit parallelism of 32.” But they achieved 350-bit parallelism per writing reaction by using 175 DNA bricks each bearing two epi-bits. The reading speed can be increased by having multiple readers operate in parallel.
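The parallelism figures quoted above lend themselves to a quick back-of-the-envelope check. The sketch below uses only numbers from the article (175 bricks, two epi-bits per brick, roughly 275,000 bits in total); the function names are made up for illustration, and the reaction counts are a rough estimate that ignores any indexing or redundancy overhead in the real experiment.

```python
import math

# Figures quoted in the article; the total bit count is approximate.
BRICKS_PER_REACTION = 175
EPIBITS_PER_BRICK = 2
TOTAL_BITS = 275_000


def bit_parallelism(bricks: int, bits_per_brick: int) -> int:
    """Bits written in a single reaction."""
    return bricks * bits_per_brick


def reactions_needed(total_bits: int, parallelism: int) -> int:
    """Number of writing reactions, rounding up for a partial final reaction."""
    return math.ceil(total_bits / parallelism)


parallelism = bit_parallelism(BRICKS_PER_REACTION, EPIBITS_PER_BRICK)
print(parallelism)                                # bits per reaction
print(reactions_needed(TOTAL_BITS, parallelism))  # reactions at 350-bit parallelism
print(reactions_needed(TOTAL_BITS, 1))            # reactions at a parallelism of 1
```

On these numbers, about 786 reactions would cover the data set at 350 bits per reaction, against 275,000 at a bit parallelism of 1.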

Rubbing image retrieval with progressive improvement.

The researchers reveal that: “In this work, we employed a self-made primitive liquid dispensing device with four nozzles working collectively at 0.08kHz sampling frequency for the large-scale storage experiment, yielding a write speed of 40 bits/sec.” A write speed of 40 bits/sec is, frankly, ludicrously slow when compared to the 400 MB/sec native (uncompressed) write speed of LTO-9 tape, which is 3,200 megabits per second, or 3,200,000,000 bits/sec to make it clear. That is 80 million times faster than this 40 bit/sec DNA epi-bit storage.
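The speed comparison is simple arithmetic and can be checked directly. The sketch below takes the 40 bits/sec and 400 MB/sec figures as quoted, and assumes decimal megabytes (1 MB = 1,000,000 bytes), which is the usual convention for tape drive specifications. The 275,000-bit total is the approximate figure from the experiment.

```python
# Figures quoted in the article.
EPIBIT_WRITE_BITS_PER_SEC = 40
LTO9_MEGABYTES_PER_SEC = 400
TOTAL_BITS = 275_000  # approximate size of the stored data

# Assumption: decimal megabytes, as is usual for tape drive specifications.
BYTES_PER_MEGABYTE = 1_000_000
BITS_PER_BYTE = 8

lto9_bits_per_sec = LTO9_MEGABYTES_PER_SEC * BYTES_PER_MEGABYTE * BITS_PER_BYTE
speed_ratio = lto9_bits_per_sec // EPIBIT_WRITE_BITS_PER_SEC
epibit_write_seconds = TOTAL_BITS / EPIBIT_WRITE_BITS_PER_SEC

print(f"LTO-9: {lto9_bits_per_sec:,} bits/sec")
print(f"Ratio: {speed_ratio:,}x")
print(f"Epi-bit write time for {TOTAL_BITS:,} bits: {epibit_write_seconds:,.0f} sec")
```

At 40 bits/sec, writing 275,000 bits would take 6,875 seconds, or roughly 1.9 hours, of dispensing time alone; that estimate ignores every other step in the workflow.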

Panda’s head image retrieved epi-bits and printed image.

Although the write speed could be increased by using an inkjet printer approach to dispense the fluids involved, with multiple printers, there is no feasible way this method could increase the speed by 80 million times, matching LTO-9 tape bandwidth, and LTO-10’s 1,100 MB/sec write bandwidth will make this difficulty even worse.

The paper’s authors argue: “Our framework presents a new modality of DNA data storage that is parallel, programmable, stable and scalable. Such an unconventional modality opens up avenues towards practical data storage and dual-mode data functions in biomolecular systems.”

Let’s step back for a moment. DNA storage works by using chemical reactions at the molecular level in fluids. When writing data, the fluids have to be prepared, mixed and stored, requiring bio-chemical lab equipment and time. When reading data, the fluid has to be retrieved, sampled, placed in a vessel of some kind and sequenced, with the sequencing results analyzed and presented by a computer system, which takes time. DNA storage, using current technology, is inherently slow, making its only possible application the very long-term storage of cold data. Because the DNA data storage writing and reading equipment has not been commercialized, it tends to be both expensive and fairly bulky compared to digital storage equipment, and, as we see, slow: 80 million times slower than LTO-9 tape.

Unless there is a practical prospect of DNA storage’s data I/O speed approaching that of tape or optical disk archives then there will be little incentive to invest in it and productize the equipment needed. On this basis, DNA storage is a fantasy.

Bootnote

The paper’s Nature citation is: Zhang, C., Wu, R., Sun, F. et al. Parallel molecular data storage by printing epigenetic bits on DNA. Nature 634, 824–832 (2024). https://doi.org/10.1038/s41586-024-08040-5.

It is a complex paper with many diagrams and 126 pages of supplementary notes.

As a side note, Catalog Technologies demonstrated massively parallel search of DNA stored data at the end of 2022. It was based on encoding data in sections of synthetically produced DNA molecule groups rather than the slower method of encoding it in DNA molecules directly. As we wrote then: “Catalog encoded approximately 17,000 words (from Shakespeare’s Hamlet) into DNA in a few minutes on its Shannon DNA writing system. No pre-processing or DNA-based indexing of this data was carried out. It then ran a keyword search on this stored data and retrieved all occurrences of the query word.”