By: Ken Adams
Introduction
Delta Lake is an open source storage layer that sits on top of cloud storage such as Azure Data Lake Storage or Amazon S3. The technology was introduced by Databricks in 2019, and all of the code is available here. Delta Lake builds on Apache Spark, and its headline feature is that it brings ACID transactions to data lakes. This is a huge step forward for data engineers.
Data integrity on data lakes is a challenge today because a failed ETL job can leave partial or corrupted data behind. With Delta Lake, a write operation is guaranteed to either complete fully or not happen at all, which prevents corrupt data.
7 Key Functionalities of Delta Lake
- ACID Transactions
- Schema Enforcement
- Schema Evolution
- Time Travel
- Parquet
- Data Manipulation (DML)
- Metadata Management
As mentioned earlier, ACID transaction capability is a huge advancement for data lakes. Typically, a data lake has numerous data streams, with users reading data from and writing data to the lake at the same time. Delta Lake stores a transaction log (DeltaLog) which tracks every commit made to a table's directory since the table was created. Because readers only see changes that have been committed to the log, this prevents “dirty reads” by the users.
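To make the idea of a commit log concrete, here is a toy model in plain Python. It is a sketch of the concept only, not Delta Lake's code: the action format is simplified, the file names are hypothetical, and a real implementation also has to make sure a commit file's contents appear all at once. Each commit is a numbered JSON file, a writer that loses the race for a version number fails instead of overwriting, and a reader works out which data files are live by replaying the commits in order.

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Publish one commit as <version>.json; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" is exclusive creation: a second writer racing for the same
    # version gets FileExistsError instead of silently overwriting.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def live_files(log_dir):
    """Replay every commit in order to find the data files a reader should see."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}])
commit(log_dir, 1, [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])

try:
    commit(log_dir, 1, [{"add": "part-3.parquet"}])  # conflicting writer
    conflict = False
except FileExistsError:
    conflict = True

print(sorted(live_files(log_dir)), conflict)
```

A data file that was written but never recorded in a commit (for example, by a job that crashed halfway) is simply invisible to readers, which is the intuition behind "all or nothing" writes.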
Delta Lake allows you to specify your schema and enforce it. This prevents the insertion of bad records during data integration.
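The rule can be illustrated with a small plain-Python sketch. This is a simplified stand-in, not the Delta Lake API: the table, column names, and helper functions are all hypothetical. It assumes the commonly documented behavior that extra columns and mismatched types are rejected while missing columns are filled with nulls, and it checks the whole batch before appending anything.

```python
# Hypothetical table schema: column name -> Python type.
TABLE_SCHEMA = {"flight_id": int, "delay": int}

def enforce(rows, schema):
    """Return rows conformed to the schema, or raise before anything is written."""
    conformed = []
    for row in rows:
        extra = set(row) - set(schema)
        if extra:
            raise ValueError(f"unexpected columns: {sorted(extra)}")
        for col, value in row.items():
            if value is not None and not isinstance(value, schema[col]):
                raise ValueError(f"column {col!r} expects {schema[col].__name__}")
        # Columns missing from the incoming row are filled with None (null).
        conformed.append({col: row.get(col) for col in schema})
    return conformed

def append(table, rows, schema=TABLE_SCHEMA):
    """Validate the whole batch first so a bad record cannot cause a partial append."""
    table.extend(enforce(rows, schema))

table = []
append(table, [{"flight_id": 1, "delay": 7}, {"flight_id": 2}])

try:
    # The second record has a string where an integer is expected.
    append(table, [{"flight_id": 3, "delay": 4}, {"flight_id": 4, "delay": "late"}])
    rejected = False
except ValueError:
    rejected = True
```

Note that the good record in the rejected batch (flight 3) is not written either, so the table never ends up with half of a batch.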
Data changes over time, so data engineers need to be able to adjust a table schema. Delta Lake can apply those changes automatically, without unwieldy DDL.
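In Spark, schema evolution is typically switched on per write with the mergeSchema option. The plain-Python sketch below only illustrates the idea behind it and is not how Delta Lake implements it; the column names and helpers are hypothetical. New columns from an incoming batch are added to the table schema, and rows written before the column existed read back as null in that column.

```python
def merge_schema(table_schema, batch_schema):
    """Add columns that are new in the batch; refuse to change an existing type."""
    merged = dict(table_schema)
    for col, typ in batch_schema.items():
        if col in merged and merged[col] is not typ:
            raise ValueError(f"column {col!r} changes type")
        merged[col] = typ
    return merged

def read(rows, schema):
    """Rows written before a column existed read back with None in that column."""
    return [{col: row.get(col) for col in schema} for row in rows]

schema = {"flight_id": int, "delay": int}  # hypothetical starting schema
rows = [{"flight_id": 1, "delay": 7}]

# A new batch arrives with an extra "carrier" column.
batch = [{"flight_id": 2, "delay": 0, "carrier": "AA"}]
schema = merge_schema(schema, {"flight_id": int, "delay": int, "carrier": str})
rows.extend(batch)

result = read(rows, schema)
```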
The DeltaLog allows users to return to an older version of the data. This rollback functionality means bad updates can be reverted, data can be audited, and older copies of the data can be accessed for experimentation. Users have the ability to query the data using an “as of” timestamp. See below for an example SQL query:
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
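Conceptually, an "as of" query replays the log only up to the requested point in time. The sketch below shows that idea in plain Python with a hypothetical, in-memory commit history; it is a toy model rather than Delta Lake's implementation. (In Spark, the DataFrame reader offers the same feature through the timestampAsOf and versionAsOf options.)

```python
from datetime import datetime

# Hypothetical history: (commit time, list of (operation, data file)).
LOG = [
    (datetime(2018, 12, 30), [("add", "part-0.parquet")]),
    (datetime(2019, 1, 1), [("add", "part-1.parquet")]),
    (datetime(2019, 1, 5), [("remove", "part-0.parquet"), ("add", "part-2.parquet")]),
]

def files_as_of(log, timestamp):
    """Replay commits made at or before `timestamp` and return the live data files."""
    files = set()
    for committed_at, actions in log:
        if committed_at > timestamp:
            break  # the log is ordered, so every later commit is ignored
        for op, path in actions:
            if op == "add":
                files.add(path)
            else:
                files.discard(path)
    return files

snapshot = files_as_of(LOG, datetime(2019, 1, 1))
latest = files_as_of(LOG, datetime(2019, 1, 10))
```

Here snapshot still contains part-0.parquet even though a later commit removed it. This only works while the old data files still physically exist, so how far back you can travel depends on how long old files are retained.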
Data in the Delta Lake is stored in Apache Parquet format. This provides for valuable compression and encoding schemes that are native to Parquet.
Delta Lake provides the ability to merge, update, and delete data sets. This helps data engineers comply with the most stringent of data regulations, such as GDPR and CCPA. These operations are available through the Scala, Java, and Python APIs. The syntax can be fairly simple, for example deltaTable.delete("delay < 0").
Delta Lake treats metadata no differently from data, using Spark’s distributed engine to process it, which makes metadata management efficient. This means petabyte-scale tables with millions of partitions and files can be handled with ease. The DeltaLog also allows users to review metadata about all of the changes that have been applied to the data.
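Because Parquet files are immutable, a delete cannot edit a file in place. The plain-Python sketch below illustrates the general copy-on-write idea under that assumption: files containing matching rows are rewritten without them, and the change is expressed as remove/add actions for the log. It is a toy model with hypothetical file names and row data, not Delta Lake's actual code.

```python
def delete(files, predicate):
    """Return (new_files, actions) for a delete, without mutating the input."""
    new_files, actions = {}, []
    for name, rows in files.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            new_files[name] = rows  # no matching rows: file is carried over as-is
            continue
        actions.append(("remove", name))
        if kept:
            new_name = name + ".rewritten"  # hypothetical naming scheme
            new_files[new_name] = kept
            actions.append(("add", new_name))
    return new_files, actions

# Hypothetical data files and their rows.
files = {
    "part-0": [{"delay": -5}, {"delay": 12}],
    "part-1": [{"delay": 3}],
    "part-2": [{"delay": -1}],
}

# Same condition as the "delay < 0" example in the text.
new_files, actions = delete(files, lambda row: row["delay"] < 0)
```

One practical consequence for privacy regulations: the removed files still hold the deleted rows until they are physically cleaned up, so a logical delete on its own may not be enough to satisfy an erasure request.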
Delta Lake Pricing
The Delta Lake technology is open source and free to use. The cost therefore depends on the underlying data lake storage. In Azure, this comes down to the cost of blob storage versus Azure Data Lake Storage (ADLS). Blob storage is typically cheaper but might not perform as well as ADLS with larger volumes of data. ADLS also supports file- and folder-level access control lists tied to Azure Active Directory (AD) identities, while blob storage does not. Pricing for both can be found here and here.
In Closing
Anyone starting a data lake or running an existing one should certainly consider Delta Lake. It is easily layered on top of a data lake and provides ACID transactions, metadata handling, time travel, and simple data manipulation. Organizations that must update or remove data to comply with privacy laws such as GDPR, HIPAA, FERPA, and others should seriously consider Delta Lake.