The Tip of the Iceberg: What Happens Now That the Table Format War Is Won?

May 6, 2026
Alex Merced

Something important happened over the past two years: the Apache Iceberg lakehouse became the default data architecture for artificial intelligence and analytics workloads. No vendor declared this shift; it was driven by the unforgiving requirements of modern data teams.

Artificial intelligence workloads need fresh data, and models and agents cannot wait for overnight batch loads. They also need open data: training pipelines, feature stores, and inference engines all require direct access without routing through a single vendor's compute engine. And the data itself must be structured for machines: semi-structured event streams, high-precision timestamps, and schemas that evolve without breaking downstream consumers are the new necessity.

Apache Iceberg delivers all these capabilities on open standards, which may be why it was adopted by Databricks, Snowflake, and Microsoft Fabric. So while thousands of enterprises run their data platforms on Iceberg lakehouses today, nobody warned these teams about the steep operational cost and complexity, leading many to ask, “now what?”

Fielding the (Hidden) Management Tax

Data teams chose Iceberg to gain open access, but what they actually received was another job to manage. Iceberg tables fragment over time: every insert, update, and delete operation creates new data files. Small files pile up quickly, metadata bloats at an alarming rate, and snapshots accumulate with every transaction. Query performance degrades gradually, rarely sharply enough to trigger an engineering alert, yet dashboards load noticeably slower, and a report that used to render in two seconds now takes twelve.
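
The first symptoms are visible in Iceberg's own metadata tables. The sketch below, assuming a Spark session already wired to an Iceberg catalog, shows one way a team might check how many data files and snapshots a table has accumulated; the catalog and table names ("lake", "sales.orders") are hypothetical.

```python
# A minimal health-check sketch using Iceberg's built-in metadata tables.
# Assumes an existing Spark session configured with an Iceberg catalog named
# "lake"; the table "sales.orders" is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iceberg-health-check").getOrCreate()

# Many tiny data files are the classic symptom of write-heavy Iceberg tables.
files = spark.sql("SELECT file_size_in_bytes FROM lake.sales.orders.files")
files.agg(
    F.count("*").alias("data_files"),
    (F.avg("file_size_in_bytes") / 1024 / 1024).alias("avg_file_mb"),
).show()

# Each commit adds a snapshot, and they accumulate until someone expires them.
spark.sql(
    "SELECT COUNT(*) AS snapshots FROM lake.sales.orders.snapshots"
).show()
```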

To overcome this, engineers start writing custom compaction jobs, scheduling manual cleanup runs, and tuning complex partitioning strategies to group files more effectively. They build custom monitoring tools to catch runaway file counts, debate snapshot expiration schedules, and worry about rewriting manifests during busy business hours.
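
The sketch below illustrates the kind of hand-rolled maintenance job described above, built from Iceberg's stock Spark procedures (rewrite_data_files, expire_snapshots, remove_orphan_files). The catalog and table names are placeholders, and the exact arguments vary by Iceberg version.

```python
# A sketch of a scheduled Iceberg maintenance job using the standard Spark
# procedures. "lake" and "sales.orders" are placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small files into larger ones so scans touch fewer objects.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so metadata and storage stop growing without bound.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2026-04-01 00:00:00',
        retain_last => 10
    )
""")

# Remove data files no longer referenced by any live snapshot.
spark.sql("CALL lake.system.remove_orphan_files(table => 'sales.orders')")
```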

But we all know that engineers were not brought on for this work. They were hired to build data products, train machine learning models, and answer business questions. Instead, they are left babysitting table maintenance while Iceberg queries run slower than the team remembers from the legacy data warehouse. This happens because nobody is organizing the physical data layout around the actual query patterns. In a traditional warehouse, the vendor handled file management. In a raw lakehouse, that burden falls on an organization's engineers: they essentially traded the data warehouse tax for a new management tax. For most teams, this was a bad deal.

What If the Lakehouse Managed Itself, and Iceberg V3 Were Added?

The right platform removes the tax entirely. It is built natively on Apache Iceberg, not layered on top or translated behind the scenes. Every component understands Iceberg, enabling autonomous management. Tables organize themselves, and intelligent clustering optimizes layout based on query patterns, rewriting only degraded regions for zero downtime.

Maintenance runs continuously without manual effort, and compaction, snapshot expiration, manifest rewriting, and cleanup happen automatically. The system targets hot partitions and schedules heavy work during low-traffic periods. Engineers do not manage jobs, and performance stays consistently high.

Apache Iceberg V3 represents a major evolution of the table format, unlocking new categories of use cases. Its most important addition is binary deletion vectors, which introduce a new architecture for handling updates and deletes. Instead of relying on position delete files, V3 uses compact bitmaps that enable much faster incremental updates. This is especially valuable for change data capture pipelines, where data changes frequently and needs to stay current with minimal compute overhead. AI agents and machine learning models depend on fresh data, and V3 improves how quickly that data becomes available. These deletion vectors integrate with the autonomous engine of a modern lakehouse, where they accumulate over time and are automatically compacted into rewritten data files during low-traffic periods, preserving both fast writes and efficient reads without manual intervention.
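
As a rough sketch of that change data capture pattern, the snippet below configures an Iceberg table for merge-on-read writes and applies a batch of change events in a single MERGE. Whether the resulting deletes are encoded as V2 position-delete files or V3 deletion vectors depends on the table's format version and on engine and library support; the table and column names are hypothetical.

```python
# Illustrative only: a CDC-style upsert into an Iceberg table configured for
# merge-on-read writes. How the deletes are physically encoded (position
# deletes vs. deletion vectors) depends on format version and engine support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-cdc-upsert").getOrCreate()

# Prefer fast incremental writes; compaction reconciles the deletes later.
spark.sql("""
    ALTER TABLE lake.sales.orders SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Apply a batch of change events (inserts, updates, deletes) in one commit.
spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING lake.staging.order_changes AS s
      ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```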

Iceberg V3 also introduces true row-level lineage tracking by assigning a unique identifier to every row along with a sequence number that records when it was last modified. This builds provenance directly into the table format, making it easier for organizations in regulated industries such as financial services, healthcare, and government to meet strict audit requirements without relying on external lineage systems. In addition, the new VARIANT data type allows semi-structured data to be stored as a first-class column with richer primitives than raw JSON strings, including support for dates, decimals, and binary data. This reduces friction for teams working with event streams, logs, and telemetry data, while also simplifying AI pipelines by minimizing the need for heavy transformation before querying.
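
The DDL below is illustrative only: it sketches what an event table with a first-class VARIANT column might look like, assuming an engine and Iceberg library version that already support format version 3 and the VARIANT type; the catalog, table, and column names are hypothetical.

```python
# Illustrative DDL: storing semi-structured events in a VARIANT column rather
# than a raw JSON string. Support for VARIANT and 'format-version' = '3'
# depends on the engine and Iceberg library version in use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-variant-demo").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.telemetry.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    VARIANT  -- dates, decimals, binary, nested fields
    )
    USING iceberg
    TBLPROPERTIES ('format-version' = '3')
""")

# Queries can then filter on nested payload fields without a separate
# transformation step; the exact accessor syntax varies by engine.
```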

The Semantic Layer Advantage

Fast queries and automated maintenance solve core engineering problems, but they do not address the challenges business users and AI agents face when working with data. The real issue is interpretation. Raw tables often include obscure column names and confusing schemas that lack business context, which creates friction and slows decision-making. A unified semantic layer addresses this gap by adding a business-friendly abstraction on top of technical structures. It translates complex schemas into clear, human-readable terms and aligns them with how the organization thinks about its data. For example, a table named cust_txn_fct_2026 can be exposed as Daily_Customer_Transactions, making it immediately understandable and usable across teams.
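
One lightweight way to picture that abstraction is a plain SQL view over the physical table, as in the sketch below. It assumes a catalog that supports SQL views; the column names are hypothetical, and a full semantic layer would also carry metric definitions, relationships, and lineage rather than simple renames.

```python
# A minimal sketch of exposing cust_txn_fct_2026 under business-friendly
# names via a view. All column names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semantic-view-demo").getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW lake.semantics.Daily_Customer_Transactions AS
    SELECT
        cust_id     AS customer_id,
        txn_dt      AS transaction_date,
        txn_amt_usd AS transaction_amount_usd,
        txn_cnt     AS transaction_count
    FROM lake.warehouse.cust_txn_fct_2026
""")
```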

This semantic layer bridges the divide between raw storage and business readiness by establishing a consistent source of truth for metrics. Every dashboard, machine learning model, and AI agent operates from the same definitions, eliminating discrepancies across departments.

This consistency is essential for modern AI workloads, where agents require a shared business vocabulary to operate reliably. Without it, large language models pointed at raw datasets often generate incorrect results. By explicitly defining relationships, adding metadata, and enabling semantic search, the layer ensures both humans and AI systems can discover and understand data using natural language. It also includes built-in cataloging and lineage, removing the need for manual coordination just to interpret datasets.

The table format war may be over, but the real opportunity is just beginning. Organizations that embrace Apache Iceberg as the foundation must now focus on eliminating the operational burden, accelerating performance, and making data truly usable for both humans and AI. An intelligent, autonomous lakehouse paired with a strong semantic layer transforms Iceberg from a storage standard into a complete data platform. This is how teams move beyond managing infrastructure and start delivering real value, enabling faster decisions, more reliable AI, and a system that continuously improves without constant human intervention.

Alex Merced

Head of Developer Relations, Dremio

Alex Merced is Head of Developer Relations at Dremio, providers of the leading, unified lakehouse platform for self-service analytics and AI, and co-author of “Apache Iceberg: The Definitive Guide,” published by O’Reilly. With experience as a developer and instructor, his professional journey includes roles at GenEd Systems, Crossfield Digital, CampusGuard, and General Assembly. He has spoken at notable events such as Data Day Texas and Data Council.