Understanding Managed Tables vs. External Tables in Databricks
In Databricks, it is crucial to understand how data and metadata are stored and managed, particularly when choosing between managed tables and external tables.
📊 Managed Tables
When you create a managed table in Databricks using a command like:
df.write.saveAsTable("table_name")
- Data & Metadata Location: Both the data and the metadata are managed and stored by Databricks.
- Data: Stored in Databricks-managed storage; for Hive metastore tables, the files land in the DBFS root under /user/hive/warehouse by default.
- Metadata: Managed in Databricks' Hive metastore (or in Unity Catalog, where enabled).
This setup is ideal if you want Databricks to handle both the data storage and the lifecycle of the table, including data management and cleanup.
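As a minimal sketch, assuming a Databricks notebook where spark is already defined (the table name demo_managed is hypothetical):

# Build a small DataFrame and save it as a managed table.
# No path is supplied, so Databricks chooses where the files live.
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],
    ["id", "name"],
)
df.write.mode("overwrite").saveAsTable("demo_managed")

# DESCRIBE DETAIL reveals the storage location Databricks chose,
# e.g. dbfs:/user/hive/warehouse/demo_managed for Hive metastore tables.
spark.sql("DESCRIBE DETAIL demo_managed").select("location").show(truncate=False)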
🌐 External Tables
When creating an external table:
df.write.option("path", "/mnt/storageaccount/containername").saveAsTable("table_name")
- Data & Metadata Location: The data resides in an external storage location (such as ADLS, Azure Data Lake Storage), while the metadata is still managed within Databricks (a full sketch follows this list).
- Data: Stored in an external location, such as a mounted Azure Data Lake Storage (ADLS) container.
- Metadata: Managed within Databricks.
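A minimal sketch of this flow, reusing df from the sketch above and assuming a container is already mounted at /mnt/storageaccount/containername (the demo_external table name and subfolder are hypothetical):

# The explicit path option is what makes the table external:
# the data files are written to the mounted ADLS location.
df.write.mode("overwrite") \
    .option("path", "/mnt/storageaccount/containername/demo_external") \
    .saveAsTable("demo_external")

# The table is queryable like any managed table, but its files live in ADLS.
spark.table("demo_external").show()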
📝 Key Takeaways:
- Managed Tables are convenient when you want Databricks to take care of both data and metadata. When you drop a managed table, both the metadata and the underlying data are deleted.
- External Tables give you flexibility by keeping data in external storage, which is ideal when you need to control where the data lives. When you drop an external Delta table, only the table metadata (the information about the table) is deleted; the actual data remains intact in the external storage layer (such as Azure Data Lake Storage or an Amazon S3 bucket) where it was originally stored. The sketch after this list demonstrates the difference.
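A minimal sketch contrasting the two drop behaviors, continuing with the hypothetical tables and paths from the sketches above:

# Dropping the managed table removes its metadata AND deletes its data files.
spark.sql("DROP TABLE demo_managed")

# Dropping the external table removes only the metastore entry.
spark.sql("DROP TABLE demo_external")

# The external data files are still in ADLS and can be read directly
# (saveAsTable defaults to Delta on Databricks) or re-registered later.
remaining = spark.read.format("delta").load(
    "/mnt/storageaccount/containername/demo_external"
)
remaining.show()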
References:
https://docs.databricks.com/en/tables/managed.html
https://docs.databricks.com/en/sql/language-manual/sql-ref-external-tables.html