Storage pioneer Catalog Technologies says it has made a historic breakthrough by demonstrating massively parallel search of data stored in DNA.
Update: Catalog explained how computing capabilities can increase the efficiency and cost-effectiveness of reading data back from DNA by orders of magnitude. December 19, 2022.
Catalog is developing DNA storage technology based on encoding data in sections of synthetically produced groups of DNA molecules, rather than the slower method of encoding it in DNA molecules directly. Writing and reading the data could potentially be done using lab-on-a-chip sequencing technology, and Catalog is partnering with Seagate to develop this capability.
Hyunjun Park, Catalog founder and CEO, said in a statement: “This historic and transformational achievement is based on years of work with partners and collaborators that helped make DNA-based computation a reality.”
Catalog encoded approximately 17,000 words (from Shakespeare’s Hamlet) into DNA in a few minutes on its Shannon DNA writing system. No pre-processing or DNA-based indexing of this data was carried out. It then ran a keyword search on this stored data and retrieved all occurrences of the query word.
It says the number of steps required in this search of the DNA-stored data would be approximately the same if the dataset had 170,000 or 170 million words instead of 17,000. This is because the chemical processes involved (DNA storage sample rehydration and sequencing) are inherently massively parallel.
Catalog says it is on track to demonstrate this search scalability on data sets containing over 100 million words by mid-2023.
Park said: “With the advantages of DNA-based data storage and computation demonstrated, we now turn our attention to addressing more sophisticated applications from signal processing to machine learning over massive datasets. In parallel, we are working closely with partners and collaborators to reduce the size and complexity of our platform and to identify specific workloads to target commercial offerings.”
But has Catalog actually demonstrated DNA-based computation? We could say that calling this computation is like saying a semiconductor chip that can only add is a processor. Catalog has demonstrated one specific aspect of a computation use case: keyword search. It is an amazing scientific achievement, but do not start building end-of-life plans for tape archives or Ocient hyperscale data analysis systems just yet.
Catalog also says it has demonstrated how computing capabilities can increase the efficiency and cost-effectiveness of reading data back from DNA by orders of magnitude. Catalog explained how in a mailed message to us: “By computing chemically we were able to reduce the amount of data to be read by a sequencer to just the targeted search term. That is, the only DNA presented to the sequencer was, more or less, just the resultant DNA file from the chemical search. This netted a two orders of magnitude speed up in “reading” courtesy of avoiding having to read 99 percent of the DNA encoded data.”
“This contrasts with using a sequencer to read all the data and then using conventional computing to decode and ascertain the content of the search. Tests have shown that we would expect this result generally, without regard to the amount of data being searched. Thus, any amount of data being searched will come down to a “cost” of about 1 percent of the total; this is about two orders of magnitude improvement in a data file of arbitrary size.”
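Catalog's arithmetic here can be checked with a minimal sketch. It assumes, as the quote states, that only about 1 percent of the encoded data reaches the sequencer after the chemical search. The function names are illustrative and not part of any Catalog software.

```python
import math

def read_speedup(fraction_read: float) -> float:
    """Speed-up from sequencing only a fraction of the data instead of all of it."""
    return 1.0 / fraction_read

def orders_of_magnitude(speedup: float) -> float:
    """Express a speed-up factor as a power of ten."""
    return math.log10(speedup)

# Assumption taken from the quote: about 1 percent of the DNA is presented to the sequencer.
fraction_read = 0.01
speedup = read_speedup(fraction_read)
magnitude = orders_of_magnitude(speedup)

print(f"Speed-up: {speedup:.0f}x, or {magnitude:.0f} orders of magnitude")
```

Because the speed-up depends only on the fraction read, it stays at roughly 100x for any dataset size, provided that fraction holds at about 1 percent as Catalog claims.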
The data storage search area is receiving attention from AI-based semantic searchers like Nuclia. Catalog’s DNA storage and search technology relies, at the moment, on the massive potential capacity of DNA storage giving it physical space and cost advantages that tape, disk and SSD cannot match. If it can search exabytes, even zettabytes, of data in a massively parallel fashion then the technology has legs.
The DNA write speed is quite slow in IT storage terms: writing 17,000 five-letter words in three minutes, at one byte per letter, equates to a write rate of about 472 bytes/sec.
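The write-rate estimate can be reproduced with a short calculation. This is a rough sketch that assumes an average of five letters per word, one byte per letter, and a three-minute write time. The function name is illustrative.

```python
def write_rate_bytes_per_sec(words: int, letters_per_word: int,
                             seconds: float, bytes_per_letter: int = 1) -> float:
    """Average write rate for a body of text written in a given time."""
    total_bytes = words * letters_per_word * bytes_per_letter
    return total_bytes / seconds

# Assumptions: 17,000 words, 5 letters each, 1 byte per letter, 3 minutes.
rate = write_rate_bytes_per_sec(words=17_000, letters_per_word=5, seconds=3 * 60)

print(f"{rate:.0f} bytes/sec")
```

The inputs work out to 85,000 bytes written in 180 seconds.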