Skip to main content

Command Palette

Search for a command to run...

Delta Tables

Published
11 min read
Delta Tables

If you follow me on LinkedIn or Twitter, you’ve probably come across my FabricSeries, where I break down complex Fabric concepts into simple, practical steps. Most of those posts focus on helping new users understand Lakehouses, Warehouses, Notebooks, Pipelines, and Power BI inside Fabric.

For today’s episode, we’ll be tackling Delta tables: what they are, why they matter, and how to use them effectively in Fabric.

Delta table is stored with Parquet files and a _delta_log folder that holds all the log entries in JSON format. This dual structure is what enables key features like ACID compliance (Atomicity, Consistency, Isolation, Durability). Every change (insert, update, delete) to the Parquet data is recorded as a new JSON entry in the \_delta\_log. This makes data management easier and more effective.

In simple terms, alongside the Parquet data files is a crucial folder named \_delta\_log. This folder acts as the transaction manager.

  • Transaction Logs: It holds a series of JSON files that record every modification (insert, update, delete, schema change) ever performed on the table.

  • ACID Compliance: This log is the key enabler of ACID properties, guaranteeing that data integrity is maintained, even when multiple users or processes access the table concurrently.

Let’s dive into each of these parts to understand how they work.

The Parquet file is the storage layer of a Delta table. It holds the table’s data in a column-oriented, open-source format designed for speed and efficiency.

Column-Oriented Storage Traditional file formats like CSV store data row by row. Parquet stores each column separately.

CSV (Row-Oriented):
1,John Smith,Engineering,95000
2,Jane Doe,Sales,75000
3,Bob Lee,Marketing,80000

To get all salaries, you must read:
- All employee IDs
- All names  
- All departments
- All salaries
Result: Read 100% of the file

Parquet (Column-Oriented):
Column 1 (IDs): [1, 2, 3]
Column 2 (Names): [John Smith, Jane Doe, Bob Lee]
Column 3 (Departments): [Engineering, Sales, Marketing]
Column 4 (Salaries): [95000, 75000, 80000]

To get all salaries, you only read:
- Column 4 (Salaries)
Result: Read only 25% of the file

This means queries run 4-10x faster because Spark only reads the columns you actually need.

Parquet files include:

  • They are immutable, meaning the files themselves never change.
    When you update or delete data, new files are written instead of modifying existing ones.

  • They support partitioning, allowing the data to be split into multiple files to improve read performance.

  • They are optimized for analytics, especially for large datasets, because columnar storage reduces how much data has to be scanned.

💡
In simple terms, while Parquet files store the data, the Delta logs store the history.

Inside every Delta table is a folder called _delta_logs. This folder contains a sequence of JSON files, each representing a transaction or change made to the table.

These logs are essential because they:

  • record every write, update, delete, or schema change as a new transaction

  • make ACID transactions possible in a data lake

  • enable features like time travel, allowing you to query the table as it existed at any point in time

  • ensure data consistency and reliability across distributed systems

Still complex? Let’s break it down

_delta_log/
├── 00000000000000000000.json  (Version 0 - Table created)
├── 00000000000000000001.json  (Version 1 - First insert)
├── 00000000000000000002.json  (Version 2 - Second insert)
├── 00000000000000000003.json  (Version 3 - Update operation)
└── 00000000000000000004.json  (Version 4 - Delete operation)

Each JSON file records:

  • Which Parquet files were added

  • Which Parquet files were removed

  • When the operation occurred

  • How many rows were affected

  • Who made the change

💡
Delta logs are the reason a Delta table behaves like a database table even though it's stored in a lake.

In Microsoft Fabric, Delta tables come in two forms. The difference is based on where the data is stored and how it is managed.

1. Internal Delta Tables

These are the tables created inside Fabric.
Fabric manages the storage, keeps track of the files, and automatically maintains the logs. These tables integrate tightly with Lakehouses and Warehouses.

If you create a table using Fabric Notebooks, Lakehouse Explorer, or shortcuts inside a Lakehouse, you're usually creating an internal Delta table.

2. External Delta Tables

External Delta tables reference data that lives outside Fabric, for example, in external ADLS Gen2 storage or another data lake. Even though the data is external, Fabric still reads the Delta logs to understand the table’s structure, schema, and history.

External tables are useful when you want Fabric to work with existing Delta Lake datasets without moving the data.

ACID Transactions

Delta Lake supports atomic, consistent, isolated, and durable transactions, similar to relational databases, ensuring reliable and secure data modifications.

Atomicity

Every operation either completes fully or doesn't happen at all, there's no partial state.

What does this means? Using a business scenario: You're loading yesterday's sales data, 500,000 records from 5 CSV files. During the load, the network fails after processing 3 files.

Without ACID (Traditional Data Lake):

  • 300,000 records loaded (files 1, 2, 3)

  • 200,000 records missing (files 4, 5)

  • Your reports show incomplete data

  • You don't know which records are missing

With Delta Tables:

  • The entire transaction rolls back automatically

  • Zero records are committed

  • Your table remains in its previous consistent state

  • You simply retry the load

  • No cleanup, no data integrity issues

Consistency

Your data always follows the rules you've defined, no exceptions allowed.

Business Scenario: Your sales table requires that every record must have:

  • A positive amount

  • A valid customer ID

  • An order date not in the future

Someone tries to insert a record with amount = -500 or customer_id = NULL.

Without ACID:

  • Invalid record gets written to storage

  • Your analytics include corrupt data

  • Power BI reports show impossible values

  • Machine learning models train on bad data

  • Nobody notices until quarterly audit

With Delta Tables:

  • Schema enforcement blocks the write

  • Transaction fails with clear error message

  • Your table remains clean

  • Data quality is protected automatically

Isolation

Multiple users and processes can work simultaneously without seeing each other's incomplete work.

Business Scenario: At 9:00 AM, your finance team starts running the monthly revenue report, it takes 30 minutes to complete.

At 9:10 AM, your data engineering team starts updating yesterday's sales transactions.

Without ACID:

  • Report sees partial updates

  • Some of yesterday's sales appear in report

  • Some don't (depending on timing)

  • Revenue numbers are wrong

  • CFO presents incorrect data to the board

With Delta Tables:

  • Finance team's report sees a consistent snapshot from 9:00 AM

  • Data team's updates happen in complete isolation

  • Report completes with accurate data

  • After update commits, future reports see the new data

  • No incorrect numbers

Business Impact: Your reports always show consistent data, even during active updates. Multiple teams can work on the same tables simultaneously. No more "please don't run reports during ETL windows."

Durability

Once a transaction completes successfully, those changes are permanent, even if the system crashes immediately.

Business Scenario: Your overnight ETL pipeline loads 10 million customer records. At 3:00 AM, it completes successfully and confirms "Load complete." At 3:05 AM, the entire server cluster crashes.

Without ACID:

  • Some writes might not have reached disk

  • Morning reports show 9.2 million records

  • 800,000 records vanished

  • You spend hours investigating

With Delta Tables:

  • Delta log confirms all 10 million records committed

  • Data survived the crash

  • Morning reports show all 10 million records

  • Zero data loss

Creating Delta Tables in Fabric

For all the operations of create, read, update, and delete, we will use Spark SQL in the Fabric Notebook. Let's explore three methods to create Delta tables.

Method 1: Creating a Table

Note: USING DELTA is optional because Delta format is applied automatically when you create tables in Fabric.

What Happens:

  1. Fabric creates the table definition in the metastore

  2. Creates the Delta table structure in the Tables folder

  3. Initializes the _delta_log folder

  4. Table is ready to receive data

As demonstrated in this screenshot, the Delta format is the default table type in Fabric. Therefore, the explicit USING DELTA keyword is not required.

Method 2: Creating Partitioned Tables

Parquet files can be partitioned, and multiple files will be generated as a result. Partitioning improves query performance by organizing data into separate folders.

Partitioned Table Creation:

In partitioning, we need to ensure that the column selected for partition does not have very high cardinality. It must correctly divide the data.

What This Creates:

Tables/
└── sales/
    ├── _delta_log/
    ├── order_date=2024-12-01/
    │   ├── part-00000.parquet
    │   └── part-00001.parquet
    ├── order_date=2024-12-02/
    │   ├── part-00000.parquet
    │   └── part-00001.parquet
    └── order_date=2024-12-03/
        └── part-00000.parquet

Query Performance:

Partitioning Best Practices:

Good Partitioning: ✅ Partition by date/time for time-series data ✅ Partition by region/country for geo-distributed data ✅ Use columns frequently in WHERE clauses ✅ Target 1 GB - 10 GB per partition

Bad Partitioning: ❌ Don't partition by customer_id (too many unique values) ❌ Don't partition by transaction_id (creates millions of folders) ❌ Don't partition by columns rarely used in queries

INSERT Operations

Now let's insert data into our Delta tables.

What Happens Internally:

  1. Spark creates a new Parquet file with this record

  2. Adds entry to _delta_log recording the new file

  3. Transaction commits atomically

  4. Table now contains 1 record

Inserting Multiple Records

Inserting from Another Table

%%sql
INSERT INTO employees
SELECT * FROM new_hires_staging
WHERE hire_date = CURRENT_DATE()

Use Case: Your HR system exports new hires to a staging table each morning. This INSERT moves them to the permanent employees table.

Example: Partitioned Table Inserts

What Fabric Does:

See code below

Tables/sales/
├── order_date=2024-12-01/
│   └── part-00000.parquet  (contains order 1001, 1002)
└── order_date=2024-12-02/
    └── part-00000.parquet  (contains order 1003)

On execution, multiple files are generated based on the partition key.

Reading Data from Delta Tables

Filtering query

Aggregation

Unified View: On execution of SELECT, we get a unified view even though data is stored across multiple Parquet files and partitions. Delta Lake combines everything automatically.

Leveraging Partition Pruning

Performance: If your sales table has 3 years of data (1,095 partitions), this query only reads 7 partitions—skipping 99.4% of the data.

Creating Visualizations

Fabric notebooks allow you to create charts directly from query results without needing any additional libraries. After running a SELECT query, click the chart icon to visualize your data.

You can also personalize your chart by customizing its appearance and metrics to align with your specific specialty

Updating Records in Delta Tables

Remember: Parquet files are immutable, so to update a record, the entire file is duplicated with the record update and the delta logs register this update.

What Happens Behind the Scenes:

  1. Delta identifies which Parquet files contain employee_id = 1

  2. Reads the entire file (could be 1 GB)

  3. Updates the one record

  4. Writes a new Parquet file with the change

  5. Adds entry to _delta_log marking old file as removed

  6. Adds entry to _delta_log marking new file as added

Performance Note: If you updated 4 records and each was in a different Parquet file, all the Parquet files were duplicated (one for each). This is why frequent updates can impact performance.

UPDATE with Calculations

%%sql
-- Give 10% raise to high performers
UPDATE employees
SET 
    salary = salary * 1.10,
    last_review_date = CURRENT_DATE()
WHERE performance_rating >= 4.5

Deleting Records from Delta Tables

%%sql
-- Delete terminated employees
DELETE FROM employees
WHERE termination_date < '2024-01-01'

-- Delete old records
DELETE FROM sales
WHERE order_date < '2020-01-01'

What Happens:

  1. Delta identifies Parquet files containing records to delete

  2. Reads those files

  3. Filters out the deleted records

  4. Writes new files with remaining records

  5. Marks old files as removed in Delta log

Dropping Tables

%%sql
DROP TABLE employees

This command will delete both the table definition and all data files if it's a managed table. For external tables in Fabric, only the table definition is deleted and the actual data files remain in the external location.

Viewing Transaction History

Transaction history for Delta tables is stored in JSON files in the _delta_log folder, and you can use this transaction log to manage data versioning.

Output Shows:

  • version: Sequential version number (0, 1, 2, ...)

  • timestamp: When operation occurred

  • operation: CREATE, INSERT, UPDATE, DELETE, MERGE

  • operationMetrics: Details like numOutputRows, numFiles

  • userMetadata: Who made the change

Time Travel

Data versioning and time travel allow every transaction to be logged, enabling access to previous versions through time travel queries.

Query by Timestamp:

%%sql
-- See table as it existed on December 1
SELECT * FROM employees 
TIMESTAMP AS OF '2024-12-01 00:00:00'

-- Compare monthly snapshots
SELECT 
    'December 2024' as month,
    AVG(salary) as avg_salary
FROM employees TIMESTAMP AS OF '2024-12-01'
UNION ALL
SELECT 
    'November 2024' as month,
    AVG(salary) as avg_salary
FROM employees TIMESTAMP AS OF '2024-11-01'

Restoring Previous Versions

%%sql
-- Rollback to version 5
RESTORE TABLE employees TO VERSION AS OF 5

-- Rollback to specific date
RESTORE TABLE employees TO TIMESTAMP AS OF '2024-12-01'

Business Use Cases: Accidental Data Deletion

2:00 AM: Someone runs DELETE FROM sales WHERE 1=1 (accidentally deletes everything)
8:00 AM: Data team discovers the mistake

Solution:
1. Check history: DESCRIBE HISTORY sales
2. Find last good version (version before delete)
3. Restore: RESTORE TABLE sales TO VERSION AS OF 45
4. Data recovered in 5 minutes

Delta tables automatically providing enterprise-grade reliability without complex configuration. When we create tables in Fabric, they are defined as Delta tables consisting of Parquet files along with delta log files, making them efficient and ensuring ACID transactions.

Key Takeaways:

  1. Two Components: Parquet files (data) + Delta logs (transaction history)

  2. ACID Transactions: Atomicity, Consistency, Isolation, Durability

  3. Two Types: Internal (managed by Fabric) vs External (data stays external)

  4. Immutable Parquet: Updates create new files, not modify existing ones

  5. Time Travel: Query any historical version using version numbers or timestamps

Enjoyed this article? Share with your dev/data friends. Want to learn more? Check out other articles on Medium

More from this blog

F

FabricSeries

18 posts