Academics break down the evolution of database architectures

Two US academics have written a paper explaining how database architectures, data models, and query languages have evolved.

“What Goes Around Comes Around… And Around…” is a 17-page opus that repays reading and study with a terrific understanding of how NoSQL came about, what a document database consists of, how text search engines from the 1960s prefigured today’s vector databases, and many other valuable insights.

The authors, Michael Stonebraker from MIT and Andrew Pavlo from Carnegie Mellon, look at MapReduce systems, key-value stores, document databases, columnar and wide-column models, text search engines, array databases, vector databases, and graph databases. They also examine several DBMS architectures, concluding that “what goes around with databases will continue to come around in upcoming decades. Another wave of developers will claim that SQL and the RM (Relational Model) are insufficient for emerging application domains. People will then propose new query languages and data models to overcome these problems … However, we do not expect these new data models to supplant the RM.”

Michael Stonebraker (left) and Andy Pavlo

To give you a flavor of their approach, we’ve summarized what the two say about MapReduce and key-value (KV) stores.

MapReduce Systems

Google constructed its MapReduce (MR) framework in 2003 as a “point solution” for processing its periodic crawl of the internet. … Map is a user-defined function (UDF) that performs computation and/or filtering while Reduce is a GROUP BY operation. Yahoo! developed an open-source version of MR in 2005, called Hadoop. It ran on top of HDFS, a distributed file system that was a clone of the Google File System.
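To make the Map-as-UDF and Reduce-as-GROUP-BY description concrete, here is a minimal single-process sketch in Python. It is our own illustration, not code from the paper or from Google or Hadoop: the names map_reduce, emit_links, and count, and the sample pages, are all hypothetical, and a real MR system would distribute the map and reduce phases across many machines.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, List, Tuple

def map_reduce(
    records: Iterable[Any],
    map_fn: Callable[[Any], Iterable[Tuple[Hashable, Any]]],
    reduce_fn: Callable[[Hashable, List[Any]], Any],
) -> Dict[Hashable, Any]:
    """Run map_fn over every record, group emitted values by key, then reduce each group."""
    groups: Dict[Hashable, List[Any]] = defaultdict(list)
    for record in records:
        # Map phase: the UDF may compute, filter, or emit several pairs per record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key, like an aggregate over a GROUP BY.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical crawl output: each page lists its outbound links.
pages = [
    {"url": "a.example", "links": ["b.example", "c.example"]},
    {"url": "b.example", "links": ["c.example", "mailto:x"]},
    {"url": "c.example", "links": []},
]

def emit_links(page: dict) -> Iterable[Tuple[str, int]]:
    """Map UDF: emit (target, 1) per outbound link, filtering out mailto links."""
    for target in page["links"]:
        if not target.startswith("mailto:"):
            yield target, 1

def count(key: str, values: List[int]) -> int:
    """Reduce: sum the values for one key, i.e. COUNT(*) ... GROUP BY target."""
    return sum(values)

inbound = map_reduce(pages, emit_links, count)
print(inbound)
```

The reduce step here does roughly what a SQL engine would do for SELECT target, COUNT(*) ... GROUP BY target, which is the comparison the authors draw.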

There was a controversy about the value of Hadoop compared to RDBMSs designed for OLAP workloads. A 2009 study showed that data warehouse DBMSs outperformed Hadoop. The Hadoop technology and services market cratered in the 2010s. Many enterprises spent a lot of money on Hadoop clusters, only to find there was little interest in this functionality.

Google announced that it was moving its crawl processing from MR to BigTable. The reason was that Google needed to interactively update its crawl database in real time but MR was a batch system. Google killed off MR in 2014. The BigTable replacement of MR left the three leading Hadoop vendors (Cloudera, Hortonworks, MapR) without a viable product to sell. Cloudera rebranded Hadoop to mean the whole stack (application, Hadoop, HDFS) and built an RDBMS, Impala, on top of HDFS but not using Hadoop. MapR built Drill directly on HDFS.

HDFS has lost its luster, as enterprises realize that there are better distributed storage alternatives. Meanwhile, distributed RDBMSs are thriving, especially in the cloud. MR brought about the revival of shared-disk architectures with disaggregated storage, subsequently giving rise to open-source file formats and data lakes.

KV Stores

A KV DBMS represents a collection of data as an associative array that maps a key to a value. In the 2000s, several new Internet companies built their own shared-nothing, distributed KV stores, such as Memcached, Redis, and Amazon’s Dynamo KV store, for narrowly focused applications like caching and storing session data.
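As a rough illustration of the associative-array model, the Python sketch below wraps a dictionary behind put, get, and delete calls and uses it to hold session data. The KVStore class and the session key format are our own hypothetical names, not the API of Memcached, Redis, or Dynamo, and the sketch ignores distribution, persistence, and expiry entirely.

```python
import json
from typing import Dict, Optional

class KVStore:
    """Toy in-memory key-value store: keys and values are opaque byte strings."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # Returns None when the key is absent, like a cache miss.
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

store = KVStore()

# The application serializes the session itself; the store never looks inside the value.
session = {"user_id": 42, "cart": ["sku-1", "sku-2"]}
store.put(b"session:abc123", json.dumps(session).encode())

raw = store.get(b"session:abc123")
restored = json.loads(raw) if raw is not None else None
print(restored)

store.delete(b"session:abc123")
print(store.get(b"session:abc123"))
```

Because the value is just bytes to the store, any lookup by a field inside it (for example, all sessions for one user) has to be done by the application, not the store.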

A second KV DBMS category consists of embedded storage managers designed to run in the same address space as a higher-level application [such as] BerkeleyDB from the early 1990s. Recent notable entries include Google’s LevelDB, which Meta later forked as RocksDB. Key-value stores provide a quick “out-of-the-box” way for developers to store data, compared to the more laborious effort required to set up a table in an RDBMS. If an application requires multiple fields in a record, then KV stores are probably a bad idea.

Several systems began as a KV store and then morphed into a more feature-rich record store. Such systems replace the opaque value with a semi-structured value, such as a JSON document. Examples are Amazon’s DynamoDB and Aerospike. One new architecture trend from the last 20 years is using embedded KV stores as the underlying storage manager for full-featured DBMSs.

MySQL was the first DBMS to expose an API that allowed developers to replace its default KV storage manager. This API enabled Meta to build RocksDB to replace InnoDB for its massive fleet of MySQL databases. Using an existing KV store allows developers to write a new DBMS in less time.
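The idea of layering a DBMS on a replaceable KV storage manager can be sketched in a few lines of Python. This is only our illustration of the layering: StorageEngine, InMemoryEngine, and Table are hypothetical names, and the interface bears no relation to MySQL's actual storage engine API or to RocksDB. Rows are stored as JSON under a key built from the table name and primary key.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional, Protocol, Tuple

class StorageEngine(Protocol):
    """Hypothetical minimal interface a table layer needs from a KV storage manager."""
    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...
    def scan(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]: ...

class InMemoryEngine:
    """Stand-in engine; any object with the same three methods could be swapped in."""
    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def scan(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        # Yield matching pairs in key order, as an ordered KV store would.
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]

class Table:
    """Record layer: maps rows onto KV pairs as <table>/<primary key> -> JSON row."""
    def __init__(self, name: str, engine: StorageEngine) -> None:
        self._prefix = name.encode() + b"/"
        self._engine = engine

    def insert(self, pk: str, row: dict) -> None:
        self._engine.put(self._prefix + pk.encode(), json.dumps(row).encode())

    def lookup(self, pk: str) -> Optional[dict]:
        raw = self._engine.get(self._prefix + pk.encode())
        return None if raw is None else json.loads(raw)

    def select(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # Full scan plus filter; a real DBMS would add indexes and a query planner.
        rows = []
        for _, raw in self._engine.scan(self._prefix):
            row = json.loads(raw)
            if predicate(row):
                rows.append(row)
        return rows

users = Table("users", InMemoryEngine())
users.insert("1", {"name": "Ada", "country": "UK"})
users.insert("2", {"name": "Grace", "country": "US"})
uk_users = users.select(lambda row: row["country"] == "UK")
print(uk_users)
```

The table layer only talks to the three-method interface, so the engine underneath can be replaced without touching it, which is the kind of separation the paragraph above describes.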

Authors and access

There is much, much more and it is all worth your attention. Stonebraker is an Adjunct Professor Emeritus in the Department of Electrical Engineering & Computer Science at MIT. Pavlo is Associate Professor with Indefinite Tenure of Databaseology in the Computer Science Department at Carnegie Mellon University.

The paper can be accessed here [PDF]. We recommend it for anyone connected with databases, data models, and query languages to understand how and why different approaches came about.