Introduction
Two of the most commonly used data storage architectures today are the data warehouse and the data lake. Data warehouses store structured data that is extracted and transformed to fit standardized schemas, and they are usually used to power business intelligence and data analytics. Data lakes store both structured and unstructured data without enforcing schemas, and they are usually used to power machine learning use cases.
A data lakehouse is a relatively new data storage architecture that combines the benefits of both a data warehouse and a data lake. It offers the cost advantage and flexibility of a data lake while also providing the schema enforcement and data governance features of a data warehouse.
Data Lakehouse Architecture
The main idea of the data lakehouse architecture is to have a structured transactional layer on top of a data lake. This layer enables data to be stored and accessed in a transactional manner, and also adds schema enforcement, data validation, data versioning, and indexing capabilities. This layer is built using the open source Delta Lake format.
Delta Lake
Delta Lake stores each table as a set of data objects together with a log that serves as both the table of contents and a record of transactions. The log lives inside the underlying data lake, alongside the data objects that it catalogs.
Each log record is a JSON file whose name is an increasing numeric ID that indicates the order in which transactions happened, such as “000001.json”. Inside each log record is an array of actions (for example, adding or removing data object files) that are applied on top of the previous version of the table.
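As a rough illustration, the sketch below writes out a simplified log record in Python. The directory name `_delta_log` matches Delta Lake's convention, but the action fields are only loosely modeled on the real on-disk schema.

```python
import json
import os

# A simplified, hypothetical log record: a list of actions applied on top of
# the previous table version. Field names only loosely follow Delta Lake's
# actual action schema.
actions = [
    {"add": {"path": "part-00000-a1b2.parquet", "size": 1048576, "dataChange": True}},
    {"remove": {"path": "part-00000-9z8y.parquet", "dataChange": True}},
]

# Log records are named by an increasing numeric ID, e.g. 000002.json.
os.makedirs("_delta_log", exist_ok=True)
with open(os.path.join("_delta_log", "000002.json"), "w") as f:
    for action in actions:
        f.write(json.dumps(action) + "\n")
```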
To reduce the number of log records that need to be read and improve the performance of reading from a data lakehouse, checkpoints are used to compact multiple log records together. A checkpoint stores all of the non-redundant actions in the table up to a certain log record ID, and is stored as a Parquet file with the same numeric ID as the last log record it summarizes (e.g. 000003.parquet). There is also a pointer file that points to the last checkpoint to further improve read efficiency.
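In Delta Lake this pointer is a small file named `_last_checkpoint` inside the log directory. The sketch below reads it to decide where to start replaying the log; treat the exact field names as an assumption.

```python
import json
import os
from typing import Optional

LOG_DIR = "_delta_log"  # Delta Lake's log directory inside the table path

def last_checkpoint_version(log_dir: str = LOG_DIR) -> Optional[int]:
    """Return the numeric ID of the latest checkpoint, or None if there is none.

    Assumes a pointer file named _last_checkpoint containing a JSON object
    with a "version" field (an assumption based on Delta Lake's layout).
    """
    pointer = os.path.join(log_dir, "_last_checkpoint")
    if not os.path.exists(pointer):
        return None
    with open(pointer) as f:
        return json.load(f)["version"]

print(last_checkpoint_version())
```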
Reading from Tables
Reading from a table stored inside a data lakehouse follows these steps:
- Read the last checkpoint pointer and go to the last checkpoint file.
- List all of the log records newer than the last checkpoint.
- Apply the changes in the checkpoint and all subsequent log records to reconstruct the state of the table.
- Use the metadata to identify which data object files are relevant to this read query.
- Query all of these data object files, possibly in parallel, and return the results.
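Put together, a minimal sketch of this read path could look like the following. It reuses the simplified log layout from the earlier example, so it glosses over checkpoint parsing and the full action schema.

```python
import json
import os

LOG_DIR = "_delta_log"

def reconstruct_table_state(log_dir: str = LOG_DIR, checkpoint_version: int = -1) -> set:
    """Replay log records newer than the checkpoint to find the live data files.

    Simplified sketch: real checkpoints are Parquet files whose contents would
    seed `live_files`, and the real action schema is richer than add/remove.
    """
    live_files = set()

    # List all log records newer than the last checkpoint, in increasing order.
    records = sorted(
        name for name in os.listdir(log_dir)
        if name.endswith(".json") and int(name.split(".")[0]) > checkpoint_version
    )

    # Apply each action to reconstruct the current set of data object files.
    for record in records:
        with open(os.path.join(log_dir, record)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live_files.add(action["add"]["path"])
                elif "remove" in action:
                    live_files.discard(action["remove"]["path"])
    return live_files

# The returned paths can then be narrowed down using metadata (e.g. per-column
# statistics) and queried, possibly in parallel.
print(reconstruct_table_state())
```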
Writing to Tables
Writing to a table stored inside a data lakehouse follows these steps:
- Read the data at the most recent log record ID (using the steps above).
- Write any new data objects from this transaction into new files in the data lake.
- Write this transaction into a new log record file after incrementing the log record ID.
- Optionally, write a new checkpoint.
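A correspondingly simplified sketch of the write path is shown below. The key detail, hedged here but central to how a lakehouse stays transactional, is that creating the new log record must be atomic, so that two concurrent writers cannot both claim the same record ID.

```python
import json
import os

LOG_DIR = "_delta_log"

def commit_transaction(actions: list, read_version: int, log_dir: str = LOG_DIR) -> int:
    """Try to commit a list of actions as log record read_version + 1.

    Simplified sketch: open(..., "x") fails if the record already exists,
    standing in for the atomic "put if absent" that a real implementation
    needs from the underlying object store.
    """
    os.makedirs(log_dir, exist_ok=True)
    new_version = read_version + 1
    record_path = os.path.join(log_dir, f"{new_version:06d}.json")
    try:
        with open(record_path, "x") as f:  # "x" = create, fail if it exists
            for action in actions:
                f.write(json.dumps(action) + "\n")
    except FileExistsError:
        # Another writer committed this version first; re-read the table
        # state and retry the transaction.
        raise RuntimeError(f"Concurrent write detected at version {new_version}")
    return new_version

# New data objects for the transaction are written first, then the commit:
actions = [{"add": {"path": "part-00001-c3d4.parquet", "dataChange": True}}]
print(commit_transaction(actions, read_version=2))
```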
Data Lakehouse Features
Time Travel and Rollbacks
In a data lake, rolling back updates is difficult because data object files lack the metadata needed to identify updates and versions. Yet being able to get a snapshot of the data as it existed at a given point in the past is very important for reproducibility in machine learning training.
The data lakehouse architecture, on the other hand, makes time traveling and rollbacks very easy. Because the data objects and logs are immutable, we don’t have to worry about corrupting historical data with dirty writes. Time traveling in a data lakehouse is simply reading a table based on an older log record ID.
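For example, with Delta Lake's Spark connector a past version of a table can be read by pinning a version or timestamp at read time. The sketch below assumes a Spark session with the Delta Lake package configured, and the table path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-example").getOrCreate()

# Read the table as it was at an older log record ID (version)...
df_v3 = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/data/events")  # placeholder path to an existing Delta table
)

# ...or as it was at a given point in time.
df_new_year = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-01-01")
    .load("/data/events")
)
```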
Efficient UPSERT, DELETE, and MERGE
In a data lake, UPSERT, DELETE, and MERGE operations are difficult because they need to change the underlying data objects, so read operations have to be paused to ensure consistency. We also need to be careful not to leave the data in a partially updated state if these operations fail.
In a data lakehouse, these operations can be performed transactionally: instead of editing the underlying data objects in place, new objects are written and the change is simply recorded in the log.
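As an example, Delta Lake exposes a MERGE API through its Python bindings. The sketch below upserts a DataFrame of updates into an existing table; the table path, DataFrame, and column names are placeholders.

```python
from delta.tables import DeltaTable

# Assumes `spark` is a SparkSession with Delta Lake configured, /data/events
# is an existing Delta table, and `updates_df` holds the incoming rows.
target = DeltaTable.forPath(spark, "/data/events")

(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()      # update rows that already exist in the table
    .whenNotMatchedInsertAll()   # insert rows that are new
    .execute()
)
```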
Streaming Ingest and Consumption
To ingest streaming data into a data lake, a separate streaming message bus such as Apache Kafka or Amazon Kinesis is typically used, which adds complexity and maintenance burden to data management.
In a data lakehouse, a table’s log can serve as a message queue, which removes the need to introduce a new message bus component to a company’s data management ecosystem.
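With Delta Lake, for instance, a table can be used directly as a streaming source and sink through Spark Structured Streaming. The paths below are placeholders, and `spark` is assumed to be a configured SparkSession.

```python
# Consume a Delta table as a stream: new log records become new micro-batches.
events_stream = spark.readStream.format("delta").load("/data/events")

# Write the stream into another Delta table. The checkpointLocation here is a
# Spark streaming checkpoint, unrelated to the Delta log checkpoints above.
query = (
    events_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_copy")
    .start("/data/events_copy")
)
```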
Data Layout Optimization
Because data is written to a lakehouse one transaction at a time, the underlying data objects tend to end up laid out inefficiently, such as a large number of small files, with related data scattered across objects in an order that is inefficient to read.
Delta Lake provides an OPTIMIZE operation that compacts small objects without affecting ongoing transactions. It can also order the data to optimize read queries that filter on a few frequently used columns, a technique known as Z-ordering. Z-ordering increases locality in these frequently used columns, which allows queries that filter on them to read less data.
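Recent Delta Lake releases expose this through the Python API; the sketch below compacts a table and Z-orders it by a frequently filtered column. Availability of these calls depends on the Delta Lake version, and the path and column name are placeholders.

```python
from delta.tables import DeltaTable

# Assumes `spark` is a SparkSession with Delta Lake configured.
table = DeltaTable.forPath(spark, "/data/events")

# Compact many small data objects into fewer, larger ones.
table.optimize().executeCompaction()

# Or compact and cluster the data by a frequently filtered column.
table.optimize().executeZOrderBy("event_date")
```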
Schema Evolution and Enforcement
Datasets maintained over a long period of time often have schema updates, and in a traditional data lake, older objects often have outdated schemas.
Delta Lake enables schema updates to be stored transactionally. This allows older data object files to be read without rewriting them to reflect changes in the schema. Delta Lake also ensures that newly written data follows the table’s latest schema. This type of schema enforcement gives a data lakehouse similar benefits to that of a data warehouse.
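With Delta Lake, for example, a write whose schema does not match the table is rejected by default, and schema evolution has to be requested explicitly. The DataFrame and path below are placeholders.

```python
# By default, Delta Lake rejects writes whose schema does not match the table,
# so this append fails if new_df has an unexpected or incompatible column:
new_df.write.format("delta").mode("append").save("/data/events")

# To evolve the table's schema transactionally instead, opt in explicitly:
(
    new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/events")
)
```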
Data Lakehouse Use Cases
Data lakehouses combine the benefits of data warehouses and data lakes, so many companies choose to adopt them to unify their data storage architecture. They are used across many industries and use cases, and are especially common at companies that process and store large amounts of data and use that data to train machine learning models.