Fabric Notebook: A Deep Dive into Apache Spark and Runtime Architecture

Apache Spark in Microsoft Fabric:

Processing massive datasets shouldn't require a complex setup or a PhD in distributed systems. At the heart of Microsoft Fabric’s data capabilities lies Apache Spark, the open-source engine known for its speed and scale. Say goodbye to complexity.

In this article, we'll cover:

Notebook Fundamentals: What notebooks are, cells, and supported languages
Notebook Features: Lakehouse Explorer, Resources, Data Wrangler, and Copilot
Microsoft Fabric Runtime: Core components, versions, and optimizations
Spark Compute Options: Starter Pools, Custom Pools, and resource management
Working with Data: Reading, transforming, batch processing, and streaming
Data Integration: File formats, external sources, and connectivity

Let’s walk through this slowly and clearly.

What is a Notebook?

A notebook is a web-based, interactive environment where data professionals write and run code for data engineering, data science, and analytics. Fabric notebooks support PySpark (Python), Spark SQL, Spark (Scala), and SparkR (R), and use Apache Spark for big data processing. You can explore data with built-in charts, work with a Lakehouse or Warehouse, and even run T-SQL for warehouse management, all within a collaborative workspace.

Instead of switching between multiple tools, notebooks let you:

Write code
See results
Add explanations
Create charts
Share with others

Notebooks are divided into two kinds of cells: code cells and markdown cells.

1. Code Cells

Where you write your code
Can be run individually
Show results below

2. Markdown Cells

Where you write explanations
Use regular text, not code
Help document your work

Native Integration with Lakehouses

As soon as you open a notebook, you can use the Lakehouse Explorer panel to attach a new or existing lakehouse. Once it’s attached, the notebook automatically recognizes every table and file inside that lakehouse. You can browse folders, preview files, and load data directly into your Spark session without typing long file paths or setting up mounts manually.

Instead of worrying about storage paths, permissions, or connectors, you focus on what matters: reading data, transforming it, exploring it, and analyzing it.

Built-in File System: Resources

Fabric notebooks also come with a built-in file system called ‘Resources’, where you can upload small files such as Python modules, CSVs, JSON samples, or reference images. The Resources Explorer works like a miniature file manager: you can create folders, rename items, or delete them just as you would on your desktop.

The files stored in this file system are tied to the notebook itself and are separate from OneLake. This is useful when you want to store files temporarily for quick experiments or ad hoc analysis of data and scripts, or when you simply want to keep notebook-specific assets with the notebook.

Data Wrangler

Data Wrangler is an interactive, code-generating tool within Fabric notebooks designed to streamline data preparation and exploration for data science workflows. It provides a visual interface for transforming and cleaning data, accelerating tasks like feature engineering and exploratory data analysis.

After you drop a file into the notebook, Data Wrangler autogenerates the code needed to query and load the data. This low-code experience simplifies data loading and lowers the barrier to entry for data exploration. You don't need any coding experience to load your data into a Fabric notebook.

The screenshot below shows Data Wrangler working on our sample DataFrame, wrangler_sample_df, with the interface focused on a Drop columns operation. This operation removes unnecessary or irrelevant columns from your dataset. On the left-hand side, under “Target columns,” you can see that I have selected three columns to drop: 'Survived', 'Pclass', and 'Ticket'. These columns will be removed from the DataFrame in this cleaning step. Below that, under “Cleaning steps,” you can see the list of steps applied to the dataset. Step 1 loads the data, and step 2 is this drop operation. On the right-hand side, the interface provides a code preview showing the equivalent Python code that performs the drop: wrangler_sample_df = wrangler_sample_df.drop(columns=['Survived', 'Pclass', 'Ticket']). This code lets you reproduce the same transformation programmatically if needed. There’s also a small pop-up indicating that I am editing a previous step, a reminder that Data Wrangler lets you go back and adjust earlier operations without starting over.

Copilot

Accessing Copilot via the chat panel lets you use natural language prompts to ask for insights ("Show me the top 10 products by sales"), code ("Generate code to remove duplicates from this dataframe"), or visualizations ("Show me a bar chart of sales by product"). Copilot responds with either plain English explanations or runnable code snippets that can be inserted directly into your notebook.

Beyond code generation, Copilot is an excellent tool for notebook documentation and learning. Users can ask the chat to explain existing notebook cells or automatically add markdown comments. This conversational, low-code approach simplifies the data exploration process, significantly lowering the barrier to entry for users who may have limited coding experience.

Now that we have covered notebooks, cells, Data Wrangler, and Copilot, let’s dive into Apache Spark.

What is Apache Spark?

Apache Spark is an open-source engine for big data processing. It works in clusters, meaning it breaks your job into smaller pieces and processes them at the same time. This makes it very fast and reliable.

In simple terms, Apache Spark is software that helps computers work together to process huge amounts of data quickly.

Language support: Spark supports several languages, including Scala, SparkR, Spark SQL, and PySpark. The latter two are the most commonly used for data engineering and analytics tasks.

PySpark (Most Popular)

Easy to learn
Great for data analysis
Perfect for beginners
Most tutorials use this

Spark SQL (For Database People)

Write SQL queries
Works like traditional databases
Familiar if you know SQL

Scala

Spark's native language
Faster performance
Steeper learning curve

SparkR (For Statisticians)

Uses R programming language
Good for statistical analysis
Popular in research

For most people getting started, PySpark (Python) is the natural choice because it's the easiest to learn and most widely used.

What is PySpark?

PySpark is Python code that controls Apache Spark.
It is a Python library that lets you write Spark commands using Python instead of the native Scala language.

Breaking it Down:

Python = A programming language that's easy to learn and read

Spark = Software for processing big data across multiple computers

PySpark = Python + Spark = Writing Python code to control Spark

In Microsoft Fabric:

When you select a PySpark notebook, you write Python code, but Spark still runs the job using its distributed engine.
When you select an Apache Spark (Scala) notebook, you write Scala code that runs directly on Spark.

Both options use the same Spark engine, but the programming language is different.

Working with Data in Notebooks

Reading data using Spark, Python, and SQL

Transformations (lazy, they only describe the computation):

map(): Apply function to each element
filter(): Select elements matching condition
flatMap(): Map each element to multiple elements
reduceByKey(): Aggregate values by key
join(): Combine two RDDs by key

Actions (trigger computation):

collect(): Return all elements to driver
count(): Count number of elements
first(): Return first element
saveAsTextFile(): Write to file system
reduce(): Aggregate elements using function
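
The operations listed above fall into two groups: lazy transformations such as map() and filter(), and actions such as collect() and count() that actually run the job. As a rough illustration, here is a plain-Python sketch that mimics those semantics with generators so it runs without a Spark cluster. The `MiniRDD` class is hypothetical and only models lazy evaluation on a single machine; a real PySpark RDD distributes the same steps across executors.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, for illustration)."""

    def __init__(self, source):
        # `source` is a zero-argument function returning a fresh iterator,
        # so nothing is computed until an action asks for the data.
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # --- Transformations: return a new MiniRDD, do no work yet ---
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    def flatMap(self, fn):
        return MiniRDD(lambda: (y for x in self._source() for y in fn(x)))

    def reduceByKey(self, fn):
        def run():
            acc = {}
            for key, value in self._source():
                acc[key] = fn(acc[key], value) if key in acc else value
            return iter(acc.items())
        return MiniRDD(run)

    # --- Actions: trigger the computation and return a result ---
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def first(self):
        return next(self._source())

    def reduce(self, fn):
        return fold(fn, self._source())


lines = MiniRDD.from_list(["spark is fast", "fabric runs spark"])

# Word count: only transformations so far, so nothing has executed.
word_counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Actions run the pipeline.
result = dict(word_counts.collect())
total_words = lines.flatMap(str.split).count()
```

The point of the design is that chaining transformations is cheap: each call only wraps the previous step, and the data is touched once an action such as collect() or count() is called.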

Data Manipulation: Filtering and Grouping

Spark dataframes allow for various data manipulations like filtering, sorting, and grouping.

Partitioning data when saving it can significantly improve performance. This approach organizes data into folders based on specific column values, so later queries that filter on those columns only need to read the matching folders.
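
The folder layout that a partitioned write produces can be sketched without Spark. The snippet below is a plain-Python illustration, assuming a hypothetical sales dataset and a `year` partition column (neither comes from this article). It writes one `column=value` folder per distinct value, which is the layout that lets an engine skip folders that cannot match a filter. In PySpark, the equivalent write goes through `DataFrameWriter.partitionBy`.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample rows; column names are assumptions for illustration.
rows = [
    {"year": 2023, "product": "bike", "amount": 120.0},
    {"year": 2023, "product": "helmet", "amount": 35.5},
    {"year": 2024, "product": "bike", "amount": 130.0},
]


def write_partitioned(rows, base_dir, partition_col):
    """Write rows as CSV into one '<col>=<value>' folder per partition value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    for value, group in groups.items():
        folder = Path(base_dir) / f"{partition_col}={value}"
        folder.mkdir(parents=True, exist_ok=True)
        # The partition column lives in the folder name, not in the file.
        fields = [name for name in group[0] if name != partition_col]
        with open(folder / "part-0000.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partition(base_dir, partition_col, value):
    """Read a single partition folder; other folders are never opened."""
    path = Path(base_dir) / f"{partition_col}={value}" / "part-0000.csv"
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


base_dir = tempfile.mkdtemp()
write_partitioned(rows, base_dir, "year")

folders = sorted(p.name for p in Path(base_dir).iterdir())
rows_2023 = read_partition(base_dir, "year", 2023)
```

Reading the 2023 data touches only the `year=2023` folder, which is the saving that partition pruning gives you on much larger datasets.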


Using SQL in Notebooks

Within notebooks, the %%sql magic command can be used to directly run SQL code.

Runtime Architecture

Microsoft Fabric Runtime is an Azure-integrated platform based on Apache Spark that enables the execution and management of data engineering and data science experiences. It combines Microsoft-internal and open-source components.

Apache Spark provides a powerful open-source distributed computing library that enables large-scale data processing and analytics tasks, offering a versatile platform for data engineering and science experiences.

Delta Lake is an open-source storage layer that brings ACID transactions and data reliability features to Apache Spark. It is integrated within Fabric Runtime to enhance data processing capabilities and ensure data consistency across concurrent operations.

The Native Execution Engine uses a columnar format and vectorized processing to boost query execution performance, with TPC-DS 1TB benchmark results showing up to a 4x speedup over OSS Spark. This engine:

Translates SparkSQL code into optimized C++ code
Builds on Velox (introduced by Meta) and Apache Gluten (introduced by Intel)
Supports both Parquet and Delta formats
Requires no code changes
Avoids vendor lock-in

Optimization Enhancements

Microsoft has enhanced Apache Spark with Split Block Bloom Filters to reduce false positives when verifying element existence, Parquet Footer Caching to reduce I/O operations by caching metadata, Smart Shuffle Optimizations to improve data distribution across nodes, and Optimized Sorting for Window Functions to accelerate data sorting within partitions.
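
Bloom filters are probably the least familiar item in that list, so a small sketch may help. The class below is a textbook Bloom filter in plain Python. It is not the split-block variant and not Fabric's implementation, and the bit-array size and hash count are arbitrary assumptions. It shows the key property: a "no" answer is always correct, while a "yes" answer can occasionally be a false positive.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter (illustrative; sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several positions by salting one hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


customer_ids = BloomFilter()
for customer in ["C001", "C002", "C003"]:
    customer_ids.add(customer)

present = customer_ids.might_contain("C002")  # always True for added items
```

A reader can use a filter like this to skip a file or row group whenever the answer is False, and only pay the I/O cost when the answer is True.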

Runtime Versions

Microsoft Fabric supports multiple runtime versions with version-specific features:

Runtime 1.3 (Current GA)

Apache Spark 3.5
Latest performance optimizations
Enhanced Delta Lake capabilities
Recommended for production workloads

Runtime 1.2

Apache Spark 3.4
Native Execution Engine in preview
Query optimization techniques

Runtime 1.1

Apache Spark 3.3
Foundational Fabric integration

Version Selection: Fabric defaults new workspaces to the latest runtime version. Runtime selection can be configured at the workspace level through Workspace Settings > Data Engineering/Science > Spark settings.

Delta Lake V-Order Optimization

Fabric Runtime includes native writer capabilities with V-Order optimization for Delta Parquet files. V-Order:

Applies columnar storage optimizations
Improves read performance across all Fabric engines
Defaults to enabled but can be disabled
Enhances query execution for analytical workloads

Spark Compute

Starter Pools

Starter pools provide rapid Apache Spark session initialization, typically within 5 to 10 seconds, with no manual setup required, using always-on clusters ready for requests.

Characteristics:

Pre-warmed Clusters: Clusters are already provisioned and running, eliminating cold start delays. Sessions start within seconds of a request.

Dynamic Scaling: Starter pools use medium nodes that dynamically scale up based on Spark job needs. Scaling occurs automatically without user intervention.

Session Startup Time: When using starter pools without library dependencies or custom Spark properties, sessions typically start in 5-10 seconds because clusters are already running.

Limitations

Only the Medium node size is supported
Selecting other node sizes results in on-demand session start (2-5 minutes)
Custom compute configurations trigger on-demand provisioning

Billing Model: You're charged for capacity consumption while a notebook or Spark job definition is executing, not for idle cluster time or session personalization time.

Custom Spark Pools

A Spark pool allows you to specify resource requirements for data analysis tasks, including node count, node size, and auto-scaling behavior.

Configuration Options:

Node Sizing: Available node sizes depend on Fabric capacity. One capacity unit equals two Apache Spark VCores, with a 3x burst multiplier applied. For example:

F64 SKU: 64 capacity units
64 × 2 = 128 Spark VCores
128 × 3 (burst) = 384 total Spark VCores
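
The same arithmetic can be wrapped in a small helper so you can check other SKU sizes. This is a minimal sketch that assumes the two ratios stated above (2 Spark VCores per capacity unit and a 3x burst multiplier); the function and constant names are made up for illustration.

```python
SPARK_VCORES_PER_CAPACITY_UNIT = 2  # ratio stated in the text above
BURST_MULTIPLIER = 3                # burst factor stated in the text above


def max_spark_vcores(capacity_units):
    """Return (base, burst) Spark VCores for a given number of capacity units."""
    base = capacity_units * SPARK_VCORES_PER_CAPACITY_UNIT
    return base, base * BURST_MULTIPLIER


base_vcores, burst_vcores = max_spark_vcores(64)  # F64 SKU
print(base_vcores, burst_vcores)  # 128 384
```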

Auto-scaling: Custom pools support automatic scaling based on workload demands:

Define minimum and maximum node counts
Cluster scales within defined boundaries
Scaling decisions based on pending tasks and resource utilization
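
To picture what "scaling within defined boundaries" means, here is a deliberately simplified toy policy. It is not Fabric's actual autoscaling algorithm, and the `tasks_per_node` parameter is a hypothetical knob invented for this sketch. It estimates how many nodes the pending tasks need, then clamps the answer to the pool's minimum and maximum.

```python
import math


def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Toy policy: enough nodes for the pending tasks, clamped to the pool limits."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# A quiet pool never drops below the minimum; a busy one never exceeds the maximum.
quiet = target_node_count(pending_tasks=4, tasks_per_node=8, min_nodes=2, max_nodes=10)
busy = target_node_count(pending_tasks=200, tasks_per_node=8, min_nodes=2, max_nodes=10)
```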

Administrative Requirements

Workspace admin permissions required to create custom pools
Fabric capacity admin must grant sizing permissions
Permissions control resource allocation across organization

Capacity and Resource Management

Capacity Units: Fabric uses capacity units to measure computing power. Spark compute consumption is measured in VCore-hours.

Resource Allocation: When executing notebooks or Spark jobs:

System allocates nodes from available capacity
Driver node is provisioned first
Executor nodes are provisioned based on configuration
Resources are released upon session completion

Monitoring and Optimization

View resource consumption in notebook Resource tabs
Monitor executor utilization and task distribution
Analyze execution logs for bottlenecks
Adjust pool configuration based on workload patterns

Spark Job Processing

Batch Processing

Batch processing works on big data at rest, allowing you to filter, aggregate, and prepare very large datasets using long-running jobs that execute in parallel.

Use Cases:

ETL workflows processing historical data
Daily/hourly data aggregation pipelines
Large-scale data transformations
Report generation from data warehouses

Streaming Data Processing

Streaming, or real-time, data is data in motion. Examples include telemetry from IoT devices, weblogs, and clickstreams. Processing it as it arrives supports use cases such as geospatial analysis, remote monitoring, and anomaly detection. Apache Spark supports real-time data stream processing through Spark Structured Streaming.

Structured Streaming treats streaming data as an unbounded table with new rows continuously appended. It uses the same DataFrame/Dataset API as batch processing.

Input Sources: Kafka, Event Hubs, file sources
Output Sinks: Files, Kafka, foreach sinks, memory sinks
Windowing: Time-based aggregations on streaming data
Watermarking: Handling late-arriving data

Example
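
As a rough illustration of windowing and watermarking, the following plain-Python sketch simulates the idea on a handful of events so it runs without a streaming source. The window length, lateness threshold, and sample events are all assumptions, and the watermark rule is simplified compared with what Spark does per micro-batch. In PySpark, the corresponding pieces are `DataFrame.withWatermark` and the `window` function from `pyspark.sql.functions`, applied to a streaming DataFrame.

```python
from collections import defaultdict

WINDOW_SECONDS = 10   # tumbling window length (assumption)
ALLOWED_LATENESS = 5  # watermark delay in seconds (assumption)

# (event_time_seconds, device_id) in arrival order; some events arrive out of order.
events = [(1, "a"), (4, "b"), (12, "a"), (9, "c"), (3, "c"), (25, "b"), (14, "a")]


def count_per_window(events):
    """Count events per tumbling window, dropping events behind the watermark."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, device in events:
        # The watermark trails the newest event time seen so far.
        watermark = max_event_time - ALLOWED_LATENESS
        if event_time < watermark:
            dropped.append((event_time, device))  # too late, ignored
            continue
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, window_start + WINDOW_SECONDS)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


window_counts, late_events = count_per_window(events)
```

In this run, the event at time 9 arrives after the event at time 12 but is still within the allowed lateness, so it is counted in the 0-10 window. The events at times 3 and 14 arrive too far behind the watermark and are dropped.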

Machine Learning Workloads

Apache Spark's machine learning library, MLlib, is built on Spark and provides machine learning algorithms and utilities that run on Spark clusters.

MLlib Components:

Classification: Logistic regression, decision trees, random forests
Regression: Linear regression, generalized linear models
Clustering: K-means, Gaussian mixture models
Collaborative Filtering: Alternating least squares (ALS)
Feature Engineering: Transformers and estimators
Pipeline API: ML workflow construction

ML Pipeline
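
The idea behind a pipeline of transformers and estimators can be shown with a toy model in plain Python. Everything below is hypothetical and only mirrors the fit/transform contract; it is not MLlib code. One difference worth noting is that in MLlib, fitting a Pipeline returns a separate fitted model object, whereas this sketch simply returns itself.

```python
class StandardScaler:
    """Estimator: learns mean and spread in fit(), then acts as a transformer."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        variance = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = variance ** 0.5 or 1.0  # fall back to 1.0 to avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


class Threshold:
    """Transformer with nothing to learn: labels values above a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, values):
        return self

    def transform(self, values):
        return [1 if v > self.cutoff else 0 for v in values]


class Pipeline:
    """Runs stages in order: fit each stage, then feed its output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


training = [10.0, 20.0, 30.0, 40.0]
model = Pipeline([StandardScaler(), Threshold(cutoff=0.0)]).fit(training)
labels = model.transform([5.0, 25.0, 45.0])
```

The scaler learns its mean (25.0) from the training data only, and the same learned values are reused when new data is transformed. That separation is what the Pipeline API is designed to enforce.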

Data Sources and Integration

Lakehouse Integration: At least one lakehouse reference must be added to a Spark job definition, and it serves as the default lakehouse context. Multiple lakehouse references are supported.

Supported File Formats:

Parquet (optimized columnar format)
Delta Lake (ACID transactions)
CSV (comma-separated values)
JSON (JavaScript Object Notation)
Avro (binary format)
ORC (Optimized Row Columnar)

External Data Sources:

Azure Data Lake Storage Gen2
Azure Blob Storage
JDBC databases
Hive tables
Kafka streams

Conclusion

Apache Spark in Microsoft Fabric provides a comprehensive platform for large-scale data processing, combining the power of distributed computing with the convenience of fully managed infrastructure. Understanding Spark's architecture, data abstractions, and optimization techniques enables data engineers and scientists to build efficient, scalable analytics solutions.

The platform's integration of Spark with Delta Lake, automated cluster management, and native performance optimizations delivers enterprise-grade capabilities while reducing operational complexity. As Fabric continues evolving, Spark remains the foundational compute engine powering data engineering and data science workloads at scale.
