Corrections in data lakehouse table format comparisons

Vinoth Chandar
Published in bytearray
Apr 20, 2022

A living document that serves as a point of reference for correcting inaccuracies in comparative studies of Hudi, Delta Lake, and Iceberg.

System comparisons are hard to write. I have put out a bakeoff or two myself and know how deep, complex, and nuanced they can get. So it is quite understandable that some inaccuracies creep in by mistake when folks publish comparisons. We are not going to delve into the important aspects that some of these comparisons miss, even where those omissions change the essence of the comparison. We will simply make a best effort to keep the stories straight, remove subjectivity, and keep things factual, much like old-school newspaper corrections.

Food for thought: when people say “Eventually, one of these table formats will become the industry standard”, remember that all three projects have co-existed for 4–5 years now. Think of the Lindy effect: the “phenomenon by which the future life expectancy of some non-perishable things, like a technology or an idea, is proportional to their current age”.

Subsurface blog from Dremio

Link: https://www.dremio.com/subsurface/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake/

  1. The blog does not mention the change capture/incremental query capabilities of Hudi or of Databricks Delta (paid version). The “I” in Hudi stands for “incrementals”, and it's a game-changer for batch data pipelines.
  2. Engine read compatibility for Hudi: Misses Redshift Spectrum and good ol' Impala. The Hudi 0.11 release also supports BigQuery external tables, with built-in syncing support.
  3. Engine read compatibility for Delta: Misses Redshift Spectrum
  4. Engine read compatibility for Iceberg (+Hudi): While Databricks SQL syntax is tied to Delta Lake, Databricks Spark does work on both Iceberg & Hudi.
  5. Engine Write Compatibility for Iceberg (+Hudi): Both Iceberg + Hudi support a Java writer interface. In Hudi’s case, this is used to build the Kafka Connect support. Databricks Spark should be able to write Hudi and Iceberg as well.
  6. File Format Support for Hudi: Hudi has ORC support (Hive/Spark as of 0.10.1) and also supports the Avro data format (not .avro files) through the Merge-on-read storage type.
  7. Delete Support in Hudi via SQL: The blog says “But for Deletes you’ll have to rely on the engine’s (Spark/Flink) API”, yet right out of the Hudi Quickstart for 0.10.1 you can use SQL to delete in Hudi.
  8. Hidden Partitioning/Partition Evolution in Hudi/Delta Lake: The article uses very Iceberg-specific terminology. But Databricks Delta (non-OSS version) supports generated columns, which can achieve effects similar to hidden partitioning. Hudi 0.11 has data skipping via a column_stats index, which can provide similar support for select built-in Spark UDFs.
  9. Hidden Partitioning/Partition Evolution in Hudi/Delta Lake: As of 0.11, Hudi added support for full (or backward-incompatible) schema evolution for Spark.
  10. Read/Write section under Tale of Two Sparks: For Snowflake & Iceberg Read & Write? External tables are read-only.
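
To make items 1 and 7 above a little more concrete, here is a minimal sketch of what an incremental query and a SQL delete can look like from Spark. The helper names (hudi_incremental_read_options, hudi_delete_sql, read_changes) are hypothetical and exist only for illustration. The option keys are the ones I know from the Hudi 0.x Spark datasource documentation; please verify them against the release you actually run.

```python
# Sketch only: helper names are hypothetical; option keys follow the
# Hudi 0.x Spark datasource docs and should be checked for your version.


def hudi_incremental_read_options(begin_instant, end_instant=None):
    """Build Spark reader options for a Hudi incremental query.

    begin_instant / end_instant are Hudi commit timestamps given as strings.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range to read.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def hudi_delete_sql(table, predicate):
    """Render a Spark SQL DELETE statement in the style of the Hudi quickstart."""
    return f"DELETE FROM {table} WHERE {predicate}"


def read_changes(spark, base_path, begin_instant):
    """Load only the records committed after begin_instant.

    Assumes `spark` is a SparkSession started with the Hudi Spark bundle
    on its classpath; nothing here runs Spark on its own.
    """
    return (
        spark.read.format("hudi")
        .options(**hudi_incremental_read_options(begin_instant))
        .load(base_path)
    )
```

With a live session, usage would be along the lines of read_changes(spark, base_path, "20220401000000") for the incremental pull, and spark.sql(hudi_delete_sql("hudi_tbl", "uuid = 1")) for the delete, where the path, table name, and predicate are placeholders.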

History:

  • [April 19, 2022] Initial corrections.
  • [June 28, 2022] Some corrections from here have been applied to the original blog post.

Data Lake Table Formats Interest and Adoption Rate

Link: https://garystafford.medium.com/data-lake-table-formats-interest-and-adoption-rate-40817b87be9e

  1. Missed Redshift Spectrum for both Hudi and Delta Lake.
  2. Hudi’s issue statistics discount the ASF JIRA, where all developer issues are tracked. GitHub is merely used for community support (flip side: it shows how much OSS community support Hudi contributors are doing).

History:

  • [Feb 12, 2022] Initial corrections suggested to author.
  • [Feb 12, 2022] Corrections have been made on the blog.

Oracle’s DeltaLake vs Hudi on OCI Comparison

Link: https://blogs.oracle.com/developers/post/deltalake-vs-hudi-on-oracle-cloud-infrastructure-part-1

https://blogs.oracle.com/developers/post/deltalake-vs-hudi-on-oracle-cloud-infrastructure-part-2

  1. Hudi writes were done with GZip (higher CPU cost for higher compression), while Delta Lake uses Snappy (speed of compression). The test also does not attribute the extra file size to the meta fields written out by Hudi, nor does it turn them off.
  2. Hudi writes use the default upsert operation, which incurs additional costs of memory caching and index lookups, probably explaining why the writes OOM at the higher end. bulk_insert is the documented, recommended way for loading data.
  3. The blog claims the insert operation in Hudi leads to duplicates, while completely ignoring that Delta Lake does not provide any such guarantees (as of June 29, 2022) to begin with.
  4. The blog claims Hudi has “true streaming”, while Delta Lake does “micro batching”. This is a characteristic of the engine (Spark) and both do the same micro-batching, in fact.
  5. For append-only workloads, even Hudi’s MOR table type writes out parquet files, which explains the performance being the same as COW in the test.
  6. Benchmark size is 100MB-1GB, which calls into question how broadly the results apply.
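
For reference, items 1 and 2 above translate into a handful of writer options. The sketch below shows one way a more comparable Hudi load could be configured: bulk_insert instead of the default upsert, and Snappy to match Delta Lake's codec. The function name hudi_bulk_load_options and its parameters are hypothetical. The option keys are the ones I know from the Hudi 0.x configuration reference, so treat them as assumptions to verify for your release. Note that turning off meta fields has trade-offs of its own, so the sketch leaves them on by default.

```python
# Sketch only: function name and parameters are hypothetical; option keys
# follow the Hudi 0.x configuration reference and should be verified.


def hudi_bulk_load_options(table_name, record_key, precombine_field,
                           codec="snappy", populate_meta_fields=True):
    """Build Spark writer options for an initial bulk load into Hudi."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # bulk_insert skips the index lookups and caching that upsert performs.
        "hoodie.datasource.write.operation": "bulk_insert",
        # Match the codec of the system being compared against.
        "hoodie.parquet.compression.codec": codec,
        # Hudi writes per-record meta fields unless this is set to false.
        "hoodie.populate.meta.fields": str(populate_meta_fields).lower(),
    }
```

A write would then look roughly like df.write.format("hudi").options(**hudi_bulk_load_options("trips", "uuid", "ts")).mode("overwrite").save(base_path), where the table name, field names, and path are placeholders.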

History:

  • [May 21, 2022] Initial corrections suggested to author; benchmarks have not been re-run since.

A Thorough Comparison of Delta Lake, Iceberg, and Hudi

Link: https://databricks.com/session_na20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi

  1. Hudi fully supports deletes; it’s even in the name “huDi”, and examples are in the quickstart. There are similar basic misses on PySpark support.
  2. Hudi has Flink support already, at the time of writing. Engine pluggability should be YES.
  3. Hudi log format is fully open and documented, so “Yes (data) + No (log)” is untrue.
  4. Hudi and Delta Lake both support filter pushdowns, just not through separate metadata at the time of writing. For example, Spark queries would read Parquet metadata (if not Delta/Hudi table metadata, as they do today) to perform this.
  5. Due to the author’s association with Iceberg, unsupported features are marked “Ongoing” for Iceberg, while they are marked “no” for Hudi/Delta.

History:

  • [Dec 2, 2020] Corrections posted on Twitter due to a report from a Hudi user.
  • [Sep 8, 2021] Author responded to the thread, offering to correct, but it’s an already-delivered conference talk.
