Databricks Introduces Open-Source Unity Catalog, Turning Up the Heat on Snowflake’s Data Workload Interoperability

Today, Databricks kicked off its annual Data and AI Summit with a highly anticipated announcement: the open-sourcing of its Unity Catalog platform. The three-year-old platform, previously a proprietary product, helps customers manage their data governance needs.

Now available under the Apache 2.0 license, Unity Catalog lets other companies use its architecture and code for free to set up and customize their own catalogs. The release also includes an OpenAPI specification, a server, and clients.
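To give a concrete sense of what an OpenAPI-described catalog interface enables, here is a minimal sketch of listing catalogs from a locally running open-source Unity Catalog server in Python. The base URL and endpoint path are assumptions for a default local deployment, not confirmed details from the announcement.

```python
# Minimal sketch: listing catalogs from an open-source Unity Catalog server
# over its REST API. The base URL and endpoint path are assumptions for a
# default local deployment; check the project's OpenAPI spec for real values.
import requests

BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"  # assumed default


def list_catalogs() -> None:
    """Fetch the catalogs visible to the caller and print their names."""
    resp = requests.get(f"{BASE_URL}/catalogs", timeout=10)
    resp.raise_for_status()
    for catalog in resp.json().get("catalogs", []):
        print(catalog.get("name"))


if __name__ == "__main__":
    list_catalogs()
```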

This move gives businesses the flexibility to access the data and AI assets managed in the catalog without being tied to a single vendor. They can now use this information with their preferred tools, including query engines compatible with Delta Lake and Apache Iceberg.

This announcement comes just days after Snowflake, Databricks’ key competitor, revealed its own open catalog implementation, Polaris Catalog, for enterprises. However, while Databricks has immediately open-sourced Unity Catalog, Snowflake plans to do so over the next 90 days.

Unity Catalog originally launched as a proprietary, closed-source solution for data and AI asset management within Databricks’ platform. It featured centralized data access management, auditing, data discovery, lineage tracking, and secure data sharing. However, its closed nature limited users’ ability to integrate it with other technologies.

To address this, Databricks introduced the Delta Lake Universal Format (UniForm), which recently reached general availability. The feature automatically generates the metadata required by Apache Iceberg or Apache Hudi, so a single copy of the table can be queried from any engine that supports those formats.
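As a rough illustration, enabling UniForm on a Delta table comes down to setting table properties at creation time. The snippet below is a sketch using PySpark SQL; the property names follow Delta Lake's published UniForm settings, but exact properties and supported versions vary, and the session is assumed to already be configured with Delta Lake.

```python
# Sketch: enabling UniForm (automatic Iceberg metadata generation) on a Delta
# table. Assumes a SparkSession already configured with Delta Lake; property
# names follow Delta Lake's UniForm documentation but may vary by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-example").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_uniform (id BIGINT, amount DOUBLE)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```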

By open-sourcing Unity Catalog with open APIs and an Apache 2.0 licensed server, Databricks aims to create a universal interface supporting various open data formats via UniForm, compatible with multiple query engines, tools, and cloud platforms.

Joel Minnick, VP of product marketing at Databricks, explained that the open-sourced Unity Catalog lets existing customers tap into the wide ecosystem of Delta Lake- and Apache Iceberg-compatible tools, giving them the flexibility to access data and AI assets managed in Unity Catalog with the tools of their choice. Because existing deployments implement the same open APIs, external clients can read all tables, volumes, and functions in Unity Catalog while respecting existing access controls.

Unity Catalog ensures interoperability with the major cloud platforms (Microsoft Azure, AWS, and GCP) as well as Salesforce, and with compute engines such as Apache Spark, Presto, Trino, DuckDB, Daft, PuppyGraph, and StarRocks. It also supports data and AI platforms such as dbt Labs, Confluent, Eventual, Fivetran, Granica, Immuta, Informatica, LanceDB, LangChain, Tecton, and Unstructured.

Additionally, it supports various open formats and engines, including the Iceberg REST Catalog and Hive Metastore (HMS) interface standards. The platform provides unified governance for both tabular and non-tabular data and AI assets, simplifying management at scale.
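In practice, supporting the Iceberg REST Catalog standard means any engine with an Iceberg REST client can be pointed at the catalog. Below is a sketch of wiring a PySpark session's Iceberg catalog to a REST endpoint; the configuration keys are standard Apache Iceberg Spark settings, while the endpoint URI is a placeholder and the actual Unity Catalog path and authentication details are deployment-specific.

```python
# Sketch: pointing Apache Iceberg's REST catalog client (via Spark) at an
# external catalog service. The URI is a placeholder; the real Unity Catalog
# Iceberg REST endpoint and auth settings depend on the deployment. Requires
# the Iceberg Spark runtime on the classpath (e.g. via --packages).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-rest-example")
    .config("spark.sql.catalog.uc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.uri", "http://localhost:8080/iceberg")  # placeholder
    .getOrCreate()
)

# Once configured, tables registered in the catalog are queryable like any
# other Spark table, e.g.:
spark.sql("SHOW NAMESPACES IN uc").show()
```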

Snowflake’s Polaris Catalog similarly focuses on interoperability without vendor lock-in, but it is limited to data conforming to the Apache Iceberg table format. Unity Catalog OSS, by contrast, handles data in multiple formats, including Iceberg, Delta Lake, Hudi, Parquet, CSV, and JSON. Databricks’ solution also supports unstructured datasets and AI tools, allowing organizations to manage the various files used in generative AI applications, a feature not available in Polaris.

Over 10,000 organizations globally, including NASDAQ, Rivian, and AT&T, currently use Unity Catalog within the Databricks Data Intelligence Platform. It will be interesting to see how its adoption evolves with the shift to open source.

The Databricks Data and AI Summit runs from June 10 to June 13, 2024.