Back to Home →

Databricks Data Engineer Associate Certification

Preparation Plan
Knowledge Base

Preparation Plan

Foundations & Databricks Lakehouse Platform

Databricks Fundamentals
Advanced Databricks Features
Assessment Quiz
Hands-On Exercises
Key Takeaways

ELT with Apache Spark

Data Extraction and Basic Transformations
Hands-on Exercise: Building an ETL Pipeline
Advanced Transformations
Hands-on Exercise: Advanced Transformations
Assessment Quiz

Incremental Data Processing

Delta Lake and ACID Transactions
Delta Lake Operations and Optimizations
Incremental Data Loading
Assessment Quiz
Key Takeaways

Production Pipelines & Data Governance

Production Pipelines
Hands-on Exercise: Creating a Production Pipeline
Data Governance
Hands-on Exercise: Implementing Data Governance
Assessment Quiz
Key Takeaways

Knowledge Base

Foundations & Databricks Lakehouse Platform

Databricks Fundamentals

Introduction to Databricks

Databricks is a unified data analytics platform that combines data engineering, data science, and business analytics into a single, integrated environment. Founded by the creators of Apache Spark, Databricks provides an enterprise-grade, cloud-based platform that enables organizations to process and analyze massive datasets efficiently.

Key Components of the Databricks Platform:

The Databricks platform consists of several integrated components:

Workspace - The web-based interface where users interact with Databricks
Notebooks - Interactive documents for writing and executing code
Clusters - Compute resources that execute the code written in notebooks
Jobs - Scheduled or triggered execution of notebooks or other code
Data - Storage and management of datasets in various formats

Lakehouse Architecture Fundamentals

The data lakehouse architecture represents a modern approach to data management that combines the best features of data warehouses and data lakes.

Data Lakehouse vs. Data Warehouse vs. Data Lake:

Feature	Data Lake	Data Warehouse	Data Lakehouse
Data Structure	Unstructured/semi-structured	Highly structured	Supports all data types
Schema	Schema-on-read	Schema-on-write	Schema enforcement with flexibility
Data Quality	Limited	High	High
Performance	Variable	Optimized	Optimized
Cost	Lower storage costs	Higher costs	Balance of cost and performance
Use Cases	Data science, ML	BI reporting	Unified analytics

The Medallion Architecture:

The medallion architecture is a data organization framework used in the lakehouse that organizes data into three tiers:

Bronze (Raw) - Contains raw data ingested from various sources with minimal processing
Silver (Validated) - Data that has been cleansed, conformed, and validated
Gold (Enriched) - Business-level aggregates and enriched data ready for consumption

Benefits of the Lakehouse Approach:

Eliminates data silos and reduces data duplication
Provides a single source of truth for all analytics workloads
Enables both batch and streaming data processing
Supports diverse workloads including BI, ML, and data science
Offers ACID transactions through Delta Lake
Reduces total cost of ownership through unified architecture

Databricks Clusters and Runtimes

Clusters are the computational resources that execute your code in Databricks.

Types of Clusters:

All-Purpose Clusters - Interactive clusters used for development, exploration, and ad-hoc analysis
Jobs Clusters - Created automatically when jobs are triggered and terminated upon completion

Databricks Runtime (DBR):

The Databricks Runtime is a set of software components optimized to run on the Databricks platform
Includes Apache Spark plus performance improvements and integrations with other systems
Available in different versions with varying components and features
Specialized runtime options include:
- Standard Runtime (includes Spark)
- Machine Learning Runtime (includes ML libraries)
- Genomics Runtime (optimized for genomic data)
- Light Runtime (minimal dependencies)

Cluster Termination:

All-purpose clusters can be configured to automatically terminate after a period of inactivity
Jobs clusters terminate automatically upon job completion
Manual termination stops all running operations and releases compute resources
Termination does not delete any data or notebooks

Advanced Databricks Features

Version Control with Databricks Repos

Databricks Repos enables Git-based version control directly within the Databricks workspace.

Key Features:

Direct integration with Git repositories (GitHub, GitLab, Bitbucket, etc.)
Support for standard Git operations:
- Clone repositories
- Create and switch branches
- Commit changes
- Pull latest changes
- Resolve merge conflicts

CI/CD Workflows with Repos:

Enable automated testing of notebooks
Facilitate collaborative development
Support deployment pipeline integration
Maintain versioning history
Implement proper change management processes

Limitations Compared to Traditional Git:

Some complex Git operations might require using the Git CLI or external tools
Large file handling differs from standard Git
Merge conflict resolution has a different interface than traditional Git tools

Multi-Language Notebook Development

Databricks notebooks support multiple programming languages, allowing different cell types within the same notebook.

Supported Languages:

Python (default)
SQL
Scala
R

Language Switching in Notebooks:

Use magic commands to switch between languages:
- %python for Python code
- %sql for SQL queries
- %scala for Scala code
- %r for R code

Notebook Workflows:

Notebooks can call other notebooks using the %run command
Example: %run /path/to/notebook
Variables and functions defined in the called notebook become available in the calling notebook
This enables modular development and code reuse

Hands-On Practice: Creating and Managing Databricks Resources

Exercise 1: Creating a Cluster

Navigate to the “Compute” section in the Databricks workspace
Click “Create Cluster”
Configure the cluster with these settings:
- Cluster name: “DE-Training-Cluster”
- Cluster mode: “Single Node” (for training purposes)
- Databricks Runtime: Latest ML version
- Enable autoscaling: No
- Terminate after: 120 minutes of inactivity
Click “Create Cluster”

Exercise 2: Creating and Using Notebooks

Navigate to “Workspace” in the sidebar
Create a new folder named “DE-Training”
Within the folder, create a new notebook:
- Name: “Day1-Exercises”
- Language: Python
- Cluster: Connect to your “DE-Training-Cluster”
Execute the following commands in separate cells:

# Python cell
print("Hello from Python!")

-- SQL cell (use %sql magic command)
%sql
SELECT "Hello from SQL!" AS greeting

// Scala cell (use %scala magic command)
%scala
println("Hello from Scala!")

Exercise 3: Exploring the Medallion Architecture

Create a new notebook and implement a simple medallion architecture:

Bronze layer: Read sample data
Silver layer: Clean and validate data
Gold layer: Aggregate for analytics

Assessment Quiz

What is the primary advantage of a data lakehouse over a traditional data warehouse?
- A) Lower cost
- B) Support for both structured and unstructured data processing
- C) Proprietary data formats
- D) Coupled storage and compute
In the medallion architecture, which layer contains raw data with minimal processing?
- A) Gold
- B) Silver
- C) Bronze
- D) Platinum
Which type of Databricks cluster automatically terminates after the workload completes?
- A) All-purpose cluster
- B) Jobs cluster
- C) High-concurrency cluster
- D) Standard cluster
How can you switch to writing SQL code in a Python notebook?
- A) Use the %sql magic command
- B) Create a new notebook with SQL language
- C) Import SQL library in Python
- D) Change notebook settings to SQL
What does DBR stand for in the Databricks context?
- A) Databricks Resource
- B) Database Runtime
- C) Databricks Runtime
- D) Distributed Batch Running
Which feature enables Git version control in Databricks?
- A) Workspace
- B) Repos
- C) Clusters
- D) Jobs
How can one notebook call another notebook in Databricks?
- A) Using the import statement
- B) Using the %run command
- C) Using notebook workflows
- D) Using the execute command
Which statement about Databricks cluster termination is correct?
- A) Termination deletes all data stored in the cluster
- B) Termination stops all running operations and releases compute resources
- C) Clusters can only be terminated manually
- D) Terminated clusters cannot be restarted
What is a key improvement in data quality that the data lakehouse provides over a traditional data lake?
- A) Support for ACID transactions
- B) Lower storage costs
- C) Faster processing
- D) More storage capacity
Which of the following is a valid magic command in Databricks notebooks?
- A) #python
- B) @sql
- C) %scala
- D) !r

Assessment Quiz Answers

B) Support for both structured and unstructured data processing
C) Bronze
B) Jobs cluster
A) Use the %sql magic command
C) Databricks Runtime
B) Repos
B) Using the %run command
B) Termination stops all running operations and releases compute resources
A) Support for ACID transactions
C) %scala

Hands-On Exercises

Let’s proceed with detailed hands-on exercises to reinforce your understanding of the Databricks Lakehouse Platform. These exercises will give you practical experience with the concepts we’ve covered.

Exercise 1: Setting Up Your Databricks Environment

Creating and Configuring a Cluster

Navigate to the Compute section in the left sidebar of your Databricks workspace
Click the “Create Cluster” button
Configure your cluster with these settings:
- Cluster name: DE-Certification-Cluster
- Cluster mode: Single Node
- Databricks Runtime Version: Select the latest version (non-ML)
- Node type: Select a small instance type (e.g., Standard_DS3_v2 on Azure)
- Terminate after: 120 minutes of inactivity
- Under Advanced Options > Spark > Configuration, add:
```
spark.sql.shuffle.partitions 8
spark.sql.adaptive.enabled true
```
Click “Create Cluster” and wait for it to start (this may take 3-5 minutes)

Exploring the Databricks Workspace

While your cluster starts, explore the workspace:

Navigate through the left sidebar menus:
- Data: Browse available data sources and tables
- Workflows: View job scheduling capabilities
- Compute: Return to cluster management
- Catalog: Explore the data catalog (if available)

Exercise 2: Working with Notebooks

Creating Your First Notebook

Click Workspace in the left sidebar
Create a new folder called “DE-Certification”
In that folder, click “Create” > “Notebook”
Name your notebook “Day1-Foundations” and select “Python” as the default language
Select your DE-Certification-Cluster

Multi-Language Notebook Operations

In the first cell, enter and run the following Python code:

# Python cell
print("Exploring Databricks Runtime Environment")
   
# Display Spark version
print(f"Spark Version: {spark.version}")
   
# Display Databricks Runtime version
dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()

In a new cell, switch to SQL and run a simple query:

%sql
-- SQL cell
SELECT current_timestamp() AS current_time, 
       current_user() AS username, 
       'Databricks Lakehouse Platform' AS platform

In another cell, try Scala:

%scala
// Scala cell
println("Scala is the native language of Apache Spark")
val sparkVersion = spark.version
println(s"Current Spark version: $sparkVersion")

In another cell, switch back to Python and use a display command:

# Create a simple DataFrame
data = [("Bronze", "Raw data"), 
        ("Silver", "Cleaned data"), 
        ("Gold", "Business-ready data")]
   
medallic_df = spark.createDataFrame(data, ["Layer", "Description"])
display(medallic_df)

Exercise 3: Implementing a Simple Medallion Architecture

Let’s create a new notebook to implement a basic medallion architecture:

Create a new notebook called “Medallion-Architecture-Demo”
Run the following cells in sequence:

Step 1: Generate Sample Data (Bronze Layer)

# Generate sample sales data
from pyspark.sql import functions as F
import random
from datetime import datetime, timedelta

# Create a sample dataset with some inconsistencies and issues
data = []
start_date = datetime(2023, 1, 1)

# Generate 100 sample records
for i in range(100):
    # Create some data quality issues randomly
    if random.random() < 0.1:
        # Some records have null values
        customer_id = None
    else:
        customer_id = f"CUST-{random.randint(1, 20):04d}"
    
    # Some dates are missing
    if random.random() < 0.05:
        date = None
    else:
        date = (start_date + timedelta(days=random.randint(0, 60))).strftime("%Y-%m-%d")
    
    # Some product IDs have different formats
    if random.random() < 0.15:
        product_id = f"prod-{random.randint(1, 50)}"
    else:
        product_id = f"PROD-{random.randint(1, 50):03d}"
    
    # Some quantities are invalid (negative)
    if random.random() < 0.08:
        quantity = -random.randint(1, 10)
    else:
        quantity = random.randint(1, 10)
    
    # Some prices have incorrect decimal places or are zero
    if random.random() < 0.07:
        price = round(random.random() * 100, random.randint(0, 4))
    else:
        price = round(random.random() * 100, 2)
    
    data.append((customer_id, date, product_id, quantity, price))

# Create a DataFrame
bronze_df = spark.createDataFrame(data, ["customer_id", "date", "product_id", "quantity", "price"])

# Save as a Delta table
bronze_df.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")

# Display the bronze data
print("Bronze Layer Data (Raw):")
display(spark.table("bronze_sales"))

Step 2: Clean and Validate Data (Silver Layer)

# Read from the bronze layer
bronze_data = spark.table("bronze_sales")

# Clean and validate the data
silver_data = (bronze_data
    # Filter out records with null customer_id or date
    .filter(F.col("customer_id").isNotNull() & F.col("date").isNotNull())
    
    # Standardize product_id format
    .withColumn("product_id", 
        F.when(F.col("product_id").startswith("prod-"), 
               F.concat(F.lit("PROD-"), F.lpad(F.regexp_extract(F.col("product_id"), "prod-(\d+)", 1), 3, "0")))
        .otherwise(F.col("product_id")))
    
    # Ensure quantity is positive
    .withColumn("quantity", 
        F.when(F.col("quantity") <= 0, None)
        .otherwise(F.col("quantity")))
    
    # Ensure price has exactly 2 decimal places and is positive
    .withColumn("price", 
        F.when(F.col("price") <= 0, None)
        .otherwise(F.round(F.col("price"), 2)))
    
    # Convert date string to date type
    .withColumn("date", F.to_date(F.col("date")))
    
    # Add a month column for aggregation in gold layer
    .withColumn("month", F.date_format(F.col("date"), "yyyy-MM"))
)

# Save as a Delta table
silver_data.write.format("delta").mode("overwrite").saveAsTable("silver_sales")

# Display the silver data
print("Silver Layer Data (Cleaned and Validated):")
display(spark.table("silver_sales"))

Step 3: Create Aggregated Business Views (Gold Layer)

# Read from the silver layer
silver_data = spark.table("silver_sales")

# Create monthly sales summary by product
gold_monthly_product_sales = (silver_data
    .groupBy("month", "product_id")
    .agg(
        F.count("*").alias("transaction_count"),
        F.sum("quantity").alias("total_quantity"),
        F.sum(F.col("quantity") * F.col("price")).alias("total_sales_amount"),
        F.avg("price").alias("average_price")
    )
    .orderBy("month", "product_id")
)

# Save as a Delta table
gold_monthly_product_sales.write.format("delta").mode("overwrite").saveAsTable("gold_monthly_product_sales")

# Create customer spending summary
gold_customer_summary = (silver_data
    .groupBy("customer_id")
    .agg(
        F.countDistinct("date").alias("active_days"),
        F.count("*").alias("transaction_count"),
        F.sum(F.col("quantity") * F.col("price")).alias("total_spend"),
        F.max("date").alias("last_purchase_date")
    )
    .orderBy(F.col("total_spend").desc())
)

# Save as a Delta table
gold_customer_summary.write.format("delta").mode("overwrite").saveAsTable("gold_customer_summary")

# Display the gold layer tables
print("Gold Layer Data - Monthly Product Sales:")
display(spark.table("gold_monthly_product_sales"))

print("Gold Layer Data - Customer Summary:")
display(spark.table("gold_customer_summary"))

Exercise 4: Running Notebooks from Another Notebook

Create a new notebook called “Master-Notebook”
Use the %run command to execute our previous notebooks:

# Run the Medallion Architecture notebook
%run /DE-Certification/Medallion-Architecture-Demo

# Now we can work with the tables created in that notebook
print("Accessing Gold Layer tables from the master notebook:")
display(spark.sql("SELECT * FROM gold_monthly_product_sales LIMIT 5"))

Exercise 5: Working with Databricks Repos (Version Control)

If your Databricks workspace has Repos enabled:

Click on “Repos” in the left sidebar
Click “Add Repo”
For this exercise, you can:
- Clone a sample repository from GitHub (e.g., https://github.com/databricks-industry-solutions/media-data-lakehouse)
- Or create a new repo connected to your GitHub (if you have an account)
Explore the repository structure
Make a small change to a file, commit it, and push (if connected to your personal repo)

Key Takeaways

The Databricks Lakehouse Platform combines the best features of data warehouses and data lakes
The medallion architecture (Bronze, Silver, Gold) provides a structured approach to data processing
Databricks supports multiple programming languages within the same notebook
Clusters can be configured for different workloads (all-purpose vs. jobs)
Delta Lake tables provide ACID transactions, improving data reliability
Databricks Repos enables Git-based version control for your code

ELT with Apache Spark

Data Extraction and Basic Transformations

Data Extraction Techniques

Apache Spark provides versatile capabilities for extracting data from various sources. The Databricks platform enhances these capabilities with optimized readers and connectors.

Extracting Data from Files

Spark supports multiple file formats, each with specific readers:

# Reading CSV files
csv_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/file.csv")

# Reading Parquet files
parquet_df = spark.read.parquet("/path/to/directory")

# Reading JSON files
json_df = spark.read.json("/path/to/file.json")

# Reading from a directory of files
directory_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/directory/*.csv")

Extracting Data from JDBC Sources

Connecting to relational databases:

jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://server:port/database") \
    .option("dbtable", "schema.table") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

Temporary Tables, Views, and CTEs

Spark SQL offers different ways to reference datasets:

Temporary Views

# Creating a temporary view
df.createOrReplaceTempView("temp_view_name")

# Using the temporary view
result = spark.sql("SELECT * FROM temp_view_name WHERE column > 10")

Temporary views exist only within the current Spark session.

Global Temporary Views

# Creating a global temporary view
df.createOrReplaceGlobalTempView("global_view_name")

# Accessing the global temporary view
result = spark.sql("SELECT * FROM global_temp.global_view_name")

Global temporary views exist across all sessions within the same Spark application.

Common Table Expressions (CTEs)

CTEs provide a way to write auxiliary statements for use in a larger query:

-- Using a CTE
WITH revenue_data AS (
  SELECT product_id, SUM(amount) as total_revenue
  FROM sales
  GROUP BY product_id
)
SELECT p.name, r.total_revenue
FROM products p
JOIN revenue_data r ON p.id = r.product_id
ORDER BY r.total_revenue DESC

Transformation Fundamentals with Spark SQL

Spark SQL provides a comprehensive set of functions for data manipulation:

Basic Transformations

# Filtering data
filtered_df = df.filter(df.age > 25)
# SQL equivalent
spark.sql("SELECT * FROM people WHERE age > 25")

# Selecting columns
selected_df = df.select("name", "age", "department")
# SQL equivalent
spark.sql("SELECT name, age, department FROM people")

# Adding new columns
enhanced_df = df.withColumn("age_group", 
                          when(df.age < 18, "minor")
                         .when(df.age < 65, "adult")
                         .otherwise("senior"))
# SQL equivalent
spark.sql("""
  SELECT *, 
    CASE 
      WHEN age < 18 THEN 'minor'
      WHEN age < 65 THEN 'adult'
      ELSE 'senior'
    END as age_group
  FROM people
""")

Aggregations

# Grouping and aggregating
agg_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("employee_count"),
    sum("salary").alias("total_salary")
)

# SQL equivalent
spark.sql("""
  SELECT 
    department, 
    AVG(salary) as avg_salary,
    COUNT(*) as employee_count,
    SUM(salary) as total_salary
  FROM people
  GROUP BY department
""")

Working with DataFrames and Tables

Converting DataFrames to Delta Tables

# Saving as a managed Delta table
df.write.format("delta").saveAsTable("database_name.table_name")

# Saving as an external Delta table
df.write.format("delta").option("path", "/path/to/data").saveAsTable("database_name.table_name")

# Overwriting data
df.write.format("delta").mode("overwrite").saveAsTable("database_name.table_name")

# Appending data
df.write.format("delta").mode("append").saveAsTable("database_name.table_name")

Reading from Delta Tables

# Reading a Delta table
table_df = spark.read.table("database_name.table_name")

# Alternative syntax
table_df = spark.table("database_name.table_name")

# SQL equivalent
table_df = spark.sql("SELECT * FROM database_name.table_name")

Hands-on Exercise: Building an ETL Pipeline

Let’s create a basic ETL pipeline that extracts data from a CSV file, transforms it, and loads it into a Delta table:

First, we’ll create sample data:

# Create a sample dataset
from pyspark.sql import functions as F

# Generate sample customer data
data = [
    (1, "John Smith", "1980-05-15", "New York", 35000),
    (2, "Mary Johnson", "1992-07-22", "Los Angeles", 42000),
    (3, "James Brown", "1975-11-03", "Chicago", 55000),
    (4, "Patricia Davis", "1988-03-29", "Houston", 67000),
    (5, "Robert Miller", None, "Philadelphia", 48000),
    (6, "Linda Wilson", "1990-09-12", "Phoenix", None),
    (7, "Michael Moore", "1982-04-08", "San Antonio", 51000),
    (8, "Elizabeth Taylor", "1985-12-25", "San Diego", 44000),
    (9, "William Anderson", "1978-02-17", "Dallas", 39000),
    (10, "Jennifer Thomas", "1995-06-10", None, 61000)
]

# Create DataFrame and save as CSV
columns = ["customer_id", "name", "birthdate", "city", "annual_income"]
customer_df = spark.createDataFrame(data, columns)
customer_df.write.mode("overwrite").csv("/tmp/customer_data", header=True)

print("Sample data created successfully")

Now, let’s build an ETL pipeline:

# EXTRACT: Read data from CSV
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/tmp/customer_data")

print("Extracted data:")
display(raw_df)

# TRANSFORM: Clean and enhance the data
transformed_df = raw_df \
    .withColumn("birthdate", F.to_date(F.col("birthdate"))) \
    .withColumn("age", F.floor(F.months_between(F.current_date(), F.col("birthdate")) / 12)) \
    .withColumn("city", F.when(F.col("city").isNull(), "Unknown").otherwise(F.col("city"))) \
    .withColumn("annual_income", F.when(F.col("annual_income").isNull(), 0).otherwise(F.col("annual_income"))) \
    .withColumn("income_bracket", 
               F.when(F.col("annual_income") < 40000, "Low")
                .when(F.col("annual_income") < 60000, "Medium")
                .otherwise("High"))

print("Transformed data:")
display(transformed_df)

# LOAD: Save as a Delta table
transformed_df.write.format("delta").mode("overwrite").saveAsTable("customer_data")

print("Data loaded into Delta table 'customer_data'")

# Verify the data
print("Data in Delta table:")
display(spark.table("customer_data"))

Advanced Transformations

Data Deduplication Techniques

Duplicated data is a common issue in data engineering. Spark provides several methods for handling duplicates:

Identifying Duplicates

# Count duplicates
duplicate_counts = df.groupBy("id").count().filter("count > 1")

# SQL equivalent
spark.sql("""
  SELECT id, COUNT(*) as count
  FROM table_name
  GROUP BY id
  HAVING COUNT(*) > 1
""")

Removing Duplicates

# Remove complete duplicates
deduplicated_df = df.distinct()

# SQL equivalent
spark.sql("SELECT DISTINCT * FROM table_name")

# Remove duplicates based on specific columns
deduplicated_df = df.dropDuplicates(["id", "transaction_date"])

# SQL equivalent
spark.sql("""
  WITH ranked_data AS (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY id, transaction_date ORDER BY id) as rn
    FROM table_name
  )
  SELECT * FROM ranked_data WHERE rn = 1
""")

Data Validation Techniques

Validating data quality is crucial in ETL processes:

Primary Key Validation

# Check if id column contains unique values
id_counts = df.groupBy("id").count()
duplicate_ids = id_counts.filter("count > 1")

if duplicate_ids.count() > 0:
    print("Primary key constraint violated")
    display(duplicate_ids)
else:
    print("Primary key constraint satisfied")

Foreign Key Validation

# Check if all product_ids exist in the products table
product_ids_in_orders = orders_df.select("product_id").distinct()
valid_product_ids = products_df.select("id").distinct()

invalid_product_ids = product_ids_in_orders.join(
    valid_product_ids,
    product_ids_in_orders["product_id"] == valid_product_ids["id"],
    "left_anti"
)

if invalid_product_ids.count() > 0:
    print("Foreign key constraint violated")
    display(invalid_product_ids)
else:
    print("Foreign key constraint satisfied")

Working with Complex Data Types

Handling Timestamps

# Converting string to timestamp
df = df.withColumn("event_time", F.to_timestamp(F.col("event_time_string")))

# Extracting components from timestamps
df = df.withColumn("year", F.year("event_time")) \
       .withColumn("month", F.month("event_time")) \
       .withColumn("day", F.dayofmonth("event_time")) \
       .withColumn("hour", F.hour("event_time"))

# Calculating date differences
df = df.withColumn("days_since_purchase", 
                  F.datediff(F.current_date(), F.col("purchase_date")))

Working with JSON and Nested Structures

# Parsing JSON string into struct
df = df.withColumn("json_data", F.from_json(F.col("json_string"), schema))

# Accessing nested fields
df = df.withColumn("customer_name", F.col("json_data.customer.name"))

# Exploding arrays into multiple rows
df = df.withColumn("item", F.explode("items"))

# JSON schema definition example
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

schema = StructType([
    StructField("customer", StructType([
        StructField("name", StringType()),
        StructField("email", StringType())
    ])),
    StructField("items", ArrayType(StructType([
        StructField("item_id", StringType()),
        StructField("quantity", IntegerType())
    ])))
])

SQL User-Defined Functions (UDFs)

SQL UDFs allow for custom logic within SQL queries:

-- Creating a SQL UDF
CREATE OR REPLACE FUNCTION calculate_tax(amount DOUBLE)
RETURNS DOUBLE
RETURN amount * 0.07;

-- Using the UDF
SELECT item_name, price, calculate_tax(price) as tax_amount
FROM items;

Control Flow in Spark SQL

The CASE WHEN statement provides conditional logic in Spark SQL:

-- Using CASE WHEN for conditional logic
SELECT 
  customer_id,
  total_purchase,
  CASE 
    WHEN total_purchase < 100 THEN 'Low Value'
    WHEN total_purchase < 1000 THEN 'Medium Value'
    ELSE 'High Value'
  END as customer_segment
FROM customer_purchases;

Hands-on Exercise: Advanced Transformations

Let’s work with a more complex scenario involving nested data and deduplication:

# Create sample order data with duplicates and nested structures
from pyspark.sql.types import *
import json

# Sample order data with JSON
order_data = [
    (1, "2023-01-15", '{"customer": {"id": 101, "name": "John Doe"}, "items": [{"id": "A1", "qty": 2}, {"id": "B3", "qty": 1}]}'),
    (2, "2023-01-16", '{"customer": {"id": 102, "name": "Jane Smith"}, "items": [{"id": "C2", "qty": 3}]}'),
    (1, "2023-01-15", '{"customer": {"id": 101, "name": "John Doe"}, "items": [{"id": "A1", "qty": 2}, {"id": "B3", "qty": 1}]}'),  # Duplicate
    (3, "2023-01-17", '{"customer": {"id": 103, "name": "Bob Johnson"}, "items": [{"id": "A1", "qty": 1}, {"id": "D4", "qty": 4}]}'),
    (4, "2023-01-18", '{"customer": {"id": 101, "name": "John Doe"}, "items": [{"id": "E5", "qty": 2}]}')
]

# Define the schema for the JSON data
json_schema = StructType([
    StructField("customer", StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType())
    ])),
    StructField("items", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("qty", IntegerType())
    ])))
])

# Create DataFrame
order_df = spark.createDataFrame(order_data, ["order_id", "order_date", "order_details"])

# Save as table
order_df.write.format("delta").mode("overwrite").saveAsTable("raw_orders")

print("Sample order data created:")
display(spark.table("raw_orders"))

# Now process the data with various transformations
processed_df = spark.table("raw_orders")

# 1. Remove duplicates
deduplicated_df = processed_df.dropDuplicates(["order_id", "order_date", "order_details"])
print(f"Removed {processed_df.count() - deduplicated_df.count()} duplicate orders")

# 2. Parse the JSON data
parsed_df = deduplicated_df.withColumn("order_details_parsed", 
                                    F.from_json(F.col("order_details"), json_schema))

# 3. Extract nested fields
extracted_df = parsed_df \
    .withColumn("customer_id", F.col("order_details_parsed.customer.id")) \
    .withColumn("customer_name", F.col("order_details_parsed.customer.name")) \
    .withColumn("items", F.col("order_details_parsed.items"))

# 4. Convert order_date to date type
date_df = extracted_df \
    .withColumn("order_date", F.to_date(F.col("order_date"))) \
    .withColumn("order_day", F.dayofweek(F.col("order_date"))) \
    .withColumn("order_month", F.month(F.col("order_date")))

print("After parsing JSON and extracting fields:")
display(date_df.select("order_id", "order_date", "customer_id", "customer_name", "items"))

# 5. Explode the items array into separate rows
exploded_df = date_df \
    .withColumn("item", F.explode("items")) \
    .withColumn("item_id", F.col("item.id")) \
    .withColumn("quantity", F.col("item.qty")) \
    .drop("items", "item", "order_details", "order_details_parsed")

print("After exploding items array:")
display(exploded_df)

# 6. Save the processed data
exploded_df.write.format("delta").mode("overwrite").saveAsTable("processed_orders")

print("Data saved to 'processed_orders' table")

Let’s also demonstrate working with pivots and user-defined functions:

# Create sample sales data
sales_data = [
    ("2023-01", "Electronics", 12500),
    ("2023-01", "Clothing", 8300),
    ("2023-01", "Home Goods", 5600),
    ("2023-02", "Electronics", 14200),
    ("2023-02", "Clothing", 9100),
    ("2023-02", "Home Goods", 6200),
    ("2023-03", "Electronics", 13800),
    ("2023-03", "Clothing", 8900),
    ("2023-03", "Home Goods", 7100)
]

sales_df = spark.createDataFrame(sales_data, ["month", "category", "revenue"])
sales_df.write.format("delta").mode("overwrite").saveAsTable("monthly_sales")

print("Monthly sales data:")
display(spark.table("monthly_sales"))

# Create a SQL UDF for calculating tax
spark.sql("""
CREATE OR REPLACE FUNCTION calculate_tax(amount DOUBLE)
RETURNS DOUBLE
RETURN amount * 0.08;
""")

# Use the UDF in a query
spark.sql("""
SELECT month, category, revenue, calculate_tax(revenue) AS tax
FROM monthly_sales
ORDER BY month, category
""").show()

# Create a pivot table
pivoted_sales = spark.sql("""
SELECT *
FROM monthly_sales
PIVOT (
    SUM(revenue) FOR category IN ('Electronics', 'Clothing', 'Home Goods')
)
ORDER BY month
""")

print("Pivoted sales data:")
display(pivoted_sales)

Assessment Quiz

When reading a CSV file in Spark, which option is used to treat the first row as column headers?
- A) firstRowHeader
- B) header
- C) hasHeader
- D) includeHeader
What is the difference between a temporary view and a global temporary view in Spark?
- A) Temporary views persist across Spark sessions, global temporary views don’t
- B) Global temporary views are accessible across all sessions in the same Spark application
- C) Temporary views can only be used in SQL, global temporary views can be used in both SQL and DataFrame API
- D) There is no difference; they are synonyms
Which Spark function can you use to handle duplicate records in a dataset?
- A) removeDuplicates()
- B) dropDuplicates()
- C) deduplicate()
- D) distinctRows()
How can you extract the year from a date column in Spark SQL?
- A) EXTRACT(YEAR FROM date_column)
- B) year(date_column)
- C) date_column.getYear()
- D) date_part('year', date_column)
Which function would you use to convert a JSON string into a structured column in Spark?
- A) parse_json
- B) json_parse
- C) from_json
- D) to_struct
When would you use the explode function in Spark?
- A) To split a string column into multiple parts
- B) To convert array elements into separate rows
- C) To expand a DataFrame by duplicating records
- D) To decompress compressed data

What does the following SQL query do?

WITH ranked_data AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
  FROM orders
)
SELECT * FROM ranked_data WHERE rn = 1

A) Selects all orders
B) Selects the most recent order for each customer
C) Selects the first order for each customer
D) Selects orders with duplicate customer IDs

How would you create a SQL UDF to calculate a 5% discount on a price?
- A) CREATE FUNCTION discount(price DOUBLE) RETURN price * 0.95
- B) CREATE OR REPLACE FUNCTION discount(price DOUBLE) RETURNS DOUBLE RETURN price * 0.95
- C) CREATE UDF discount(price DOUBLE) AS price * 0.95
- D) DEFINE FUNCTION discount(price DOUBLE) AS price * 0.95
What is the correct way to check if a column contains any null values in Spark?
- A) df.filter(df.column.isNull()).count() > 0
- B) df.select(df.column).isNull().any()
- C) df.where("column IS NULL").count() > 0
- D) Both A and C are correct
Which statement creates a pivot table from a DataFrame in Spark SQL?
- A) SELECT * FROM table ROTATE (sum(value) FOR category IN ('A', 'B', 'C'))
- B) SELECT * FROM table PIVOT (sum(value) FOR category IN ('A', 'B', 'C'))
- C) SELECT * FROM table CROSS TAB (sum(value) FOR category IN ('A', 'B', 'C'))
- D) SELECT * FROM table TRANSFORM (sum(value) FOR category IN ('A', 'B', 'C'))

Answers to Assessment Quiz:

B) header
B) Global temporary views are accessible across all sessions in the same Spark application
B) dropDuplicates()
B) year(date_column)
C) from_json
B) To convert array elements into separate rows
B) Selects the most recent order for each customer
B) CREATE OR REPLACE FUNCTION discount(price DOUBLE) RETURNS DOUBLE RETURN price * 0.95
D) Both A and C are correct
B) SELECT * FROM table PIVOT (sum(value) FOR category IN ('A', 'B', 'C'))

Key Takeaways

Spark provides versatile capabilities for extracting data from various sources, including files and databases.
Transformations in Spark can be performed using both the DataFrame API and Spark SQL.
Temporary views and CTEs help create modular and readable data transformation pipelines.
Data deduplication and validation are essential steps in ensuring data quality.
Spark offers powerful functions for handling complex data types, including timestamps and nested structures.
User-defined functions extend SQL’s capabilities for custom transformations.
The PIVOT operation transforms row data into columnar format for reporting.

Incremental Data Processing

Delta Lake and ACID Transactions

Delta Lake Fundamentals

Delta Lake is an open-source storage layer that brings reliability to data lakes. As a key technology in the Databricks Lakehouse platform, Delta Lake provides critical enterprise features for data management.

What is Delta Lake?

Delta Lake is a storage layer that sits on top of your existing data lake, providing:

ACID transactions for reliable data operations
Schema enforcement and evolution capabilities
Time travel (data versioning)
Audit history of all changes
Unified batch and streaming data processing

Delta Lake Architecture

Delta Lake stores data as Parquet files but adds a transaction log that tracks all changes to the table. This architecture consists of:

Data Files: Parquet-formatted files containing the actual data
Delta Log: A transaction log (stored in the _delta_log directory) that records all operations performed on the table
Checkpoint Files: Periodic snapshots of the table state for faster access

When a Delta table is queried, the Delta Lake engine consults the transaction log to determine which data files to read, ensuring a consistent view of the data.

ACID Transactions in Delta Lake

ACID transactions are a set of properties that guarantee reliability in database operations:

Atomicity: Operations either complete entirely or not at all
Consistency: Transactions bring the database from one valid state to another
Isolation: Concurrent transactions produce the same results as if executed sequentially
Durability: Committed transactions remain saved even during system failures

Delta Lake implements these properties through its transaction log mechanism:

# Example of an atomic operation
# This entire operation either succeeds or fails as a unit
df.write.format("delta").mode("overwrite").save("/path/to/delta-table")

Benefits of ACID Transactions

Data Consistency: Ensures data is always in a valid state even during concurrent operations
Failure Recovery: Maintains data integrity even if operations fail midway
Concurrency Control: Allows multiple users to access and modify data simultaneously
Reliable Streaming: Enables exactly-once processing semantics for streaming data

Data and Metadata Management

Understanding Data vs. Metadata

Data: The actual values stored in your tables (rows and columns)
Metadata: Information about the data, including:
- Schema definitions
- Partitioning information
- Table properties
- File locations
- Statistics for query optimization

In Databricks, metadata is stored in the metastore, which can be:

The Hive metastore (traditional)
Unity Catalog (enterprise grade with enhanced governance)

Managed vs. External Tables

Databricks supports two types of Delta tables:

Managed Tables:
- Both data and metadata are managed by Databricks
- When you drop a managed table, both the data and metadata are deleted
- The data is stored in the default Databricks storage location
External Tables:
- Databricks manages only the metadata
- The data is stored in a location you specify
- When you drop an external table, only the metadata is deleted; the data remains intact

-- Creating a managed table
CREATE TABLE managed_table (id INT, name STRING);

-- Creating an external table
CREATE TABLE external_table (id INT, name STRING)
LOCATION '/path/to/external/storage';

To identify if a table is managed or external:

-- Check if a table is managed or external
DESCRIBE EXTENDED table_name;

Look for the Type property (managed or external) and the Location property.

Hands-On Exercise: Delta Lake Operations

Let’s build a practical exercise to explore Delta Lake fundamentals:

# Create a sample dataset
data = [(1, "Product A", 10.5, "2023-01-01"),
        (2, "Product B", 20.75, "2023-01-02"),
        (3, "Product C", 15.0, "2023-01-03"),
        (4, "Product D", 8.25, "2023-01-04"),
        (5, "Product E", 12.99, "2023-01-05")]

columns = ["id", "product_name", "price", "created_date"]
df = spark.createDataFrame(data, columns)

# Save as a managed Delta table
df.write.format("delta").mode("overwrite").saveAsTable("products_managed")

# Save as an external Delta table
external_path = "/tmp/delta/products_external"
df.write.format("delta").mode("overwrite").option("path", external_path).saveAsTable("products_external")

print("Created both managed and external Delta tables")

# Examine table information
print("\nManaged Table Information:")
display(spark.sql("DESCRIBE EXTENDED products_managed"))

print("\nExternal Table Information:")
display(spark.sql("DESCRIBE EXTENDED products_external"))

# Inspect Delta log structure
print("\nDelta Log Structure:")
display(dbutils.fs.ls(external_path + "/_delta_log"))

# Make modifications to track history
# Add a new product
new_product = [(6, "Product F", 22.5, "2023-01-06")]
new_df = spark.createDataFrame(new_product, columns)
new_df.write.format("delta").mode("append").saveAsTable("products_managed")

# Update a product price
spark.sql("UPDATE products_managed SET price = 11.99 WHERE id = 1")

# View table history
print("\nTable History:")
display(spark.sql("DESCRIBE HISTORY products_managed"))

# Time travel query
print("\nData as of Version 0:")
display(spark.sql("SELECT * FROM products_managed VERSION AS OF 0"))

print("\nCurrent data (latest version):")
display(spark.sql("SELECT * FROM products_managed"))

Delta Lake Operations and Optimizations

Table History and Time Travel

Delta Lake’s transaction log enables powerful versioning capabilities:

-- View the history of a Delta table
DESCRIBE HISTORY delta_table;

-- Query a specific version of a table
SELECT * FROM delta_table VERSION AS OF 3;

-- Query a table as of a specific timestamp
SELECT * FROM delta_table TIMESTAMP AS OF '2023-01-15T00:00:00.000Z';

-- Restore a table to a previous version
RESTORE TABLE delta_table TO VERSION AS OF 3;

Optimizing Delta Tables

Delta Lake provides several operations to optimize table performance:

Z-Ordering

Z-ordering is a technique that co-locates related data in the same files, improving query performance by reducing the amount of data that needs to be read:

-- Z-order by one or more columns
OPTIMIZE delta_table
ZORDER BY (date_column, region_column);

Benefits of Z-Ordering:

Improves filtering and join performance
Particularly useful for high-cardinality columns frequently used in query predicates
Makes data skipping more effective

VACUUM and Data Retention

The VACUUM command permanently removes files no longer needed by the Delta table:

-- Remove files not needed for versions older than 7 days
VACUUM delta_table RETAIN 7 DAYS;

By default, Delta Lake retains 30 days of history. This can be configured through table properties:

-- Set retention period to 90 days
ALTER TABLE delta_table 
SET TBLPROPERTIES ('delta.logRetentionDuration' = '90 days');

OPTIMIZE for File Compaction

Small files can degrade query performance. The OPTIMIZE command compacts small files into larger ones:

-- Compact small files without reordering data
OPTIMIZE delta_table;

Advanced Table Features

Generated Columns

Generated columns are automatically calculated from other columns:

-- Create a table with a generated column
CREATE TABLE sales (
  id INT,
  amount DOUBLE,
  tax DOUBLE GENERATED ALWAYS AS (amount * 0.07)
);

Table Comments and Properties

Add documentation and configure behavior through comments and properties:

-- Add a comment to a table
COMMENT ON TABLE sales IS 'Daily sales transactions';

-- Add a comment to a column
COMMENT ON COLUMN sales.amount IS 'Sale amount in USD';

-- Set table properties
ALTER TABLE sales 
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

Incremental Data Loading

Overwrite Operations

CREATE OR REPLACE TABLE

This operation replaces a table with a new definition:

-- Create or replace a table
CREATE OR REPLACE TABLE customer_data
AS SELECT * FROM customer_source WHERE region = 'Europe';

INSERT OVERWRITE

This operation replaces existing data while preserving the table schema and properties:

-- Overwrite all data in the table
INSERT OVERWRITE TABLE customer_data
SELECT * FROM customer_source WHERE region = 'North America';

Differences between these approaches:

CREATE OR REPLACE TABLE recreates the entire table, potentially changing the schema
INSERT OVERWRITE preserves the table structure and only replaces the data

MERGE Operations

The MERGE statement is a powerful tool for upserting data (insert, update, or delete in a single atomic operation):

-- Basic MERGE operation
MERGE INTO target_table
USING source_table
ON target_table.id = source_table.id
WHEN MATCHED THEN
  UPDATE SET 
    target_table.column1 = source_table.column1,
    target_table.column2 = source_table.column2
WHEN NOT MATCHED THEN
  INSERT (id, column1, column2)
  VALUES (source_table.id, source_table.column1, source_table.column2);

Benefits of MERGE:

Atomic updates (all-or-nothing)
Efficient handling of inserts and updates in a single operation
Support for conditional logic
Eliminates the need for staging tables

COPY INTO

The COPY INTO command is a simple, idempotent way to load data incrementally:

-- Load data incrementally from a directory
COPY INTO target_table
FROM '/path/to/source/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Key features:

Loads only new files not previously ingested
Tracks loaded files to avoid duplication
Simpler than MERGE for basic ingestion patterns
Supports various file formats (CSV, JSON, Parquet, etc.)

Delta Live Tables (DLT)

Delta Live Tables provides a declarative framework for building reliable data pipelines:

Components of a DLT Pipeline

Target: The database where tables will be created
Notebook Libraries: Notebooks containing the transformation logic
Configuration: Settings for the pipeline execution

Pipeline Types

Triggered Pipelines: Run on demand or on a schedule
Continuous Pipelines: Process data as it arrives (near real-time)

Auto Loader

Auto Loader simplifies streaming ingestion from file sources:

# Using Auto Loader in Python
spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/path/to/schema")
  .load("/path/to/source/files")

Features:

Efficiently processes new files as they arrive
Handles schema inference and evolution
Scales to millions of files without listing directories

Data Quality Constraints

DLT supports data quality validation through constraints:

# Table with constraints
@dlt.table(
  comment="Validated customer data",
  table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
@dlt.expect_or_fail("valid_email", "email LIKE '%@%.%'")
def validated_customers():
  return spark.table("raw_customers").filter("age > 0")

Constraint handling options:

@dlt.expect_or_drop: Drop records that violate the constraint
@dlt.expect_or_fail: Fail the pipeline if any record violates the constraint
@dlt.expect: Track violations without affecting the pipeline

Change Data Capture (CDC)

CDC processes data changes from source systems:

# Apply CDC changes
@dlt.table
def customers_target():
  return (
    dlt.apply_changes(
      target = "customers_target",
      source = "customers_raw_stream",
      keys = ["customer_id"],
      sequence_by = "operation_timestamp",
      ignore_null_updates = False,
      apply_as_deletes = "operation = 'DELETE'"
    )
  )

Benefits:

Efficiently processes change data from databases
Supports insert, update, and delete operations
Maintains a consistent target state

Hands-On Exercise: Incremental Data Processing

Let’s implement a comprehensive incremental processing example:

# Set up sample data
from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Create initial customer data
base_customers = [
    (1, "John Doe", "john@example.com", "New York"),
    (2, "Jane Smith", "jane@example.com", "Los Angeles"),
    (3, "Mike Johnson", "mike@example.com", "Chicago"),
    (4, "Lisa Brown", "lisa@example.com", "Houston"),
    (5, "David Wilson", "david@example.com", "Phoenix")
]

# Save as initial table
customers_df = spark.createDataFrame(base_customers, ["id", "name", "email", "city"])
customers_df.write.format("delta").mode("overwrite").saveAsTable("customers")

print("Initial customer data created:")
display(spark.table("customers"))

# Create updates (some new, some modified records)
customer_updates = [
    (3, "Michael Johnson", "michael@example.com", "Chicago"),  # Modified
    (4, "Lisa Brown", "lisa@example.com", "Dallas"),  # Modified
    (6, "Sarah Lee", "sarah@example.com", "Miami"),  # New
    (7, "Robert Chen", "robert@example.com", "Seattle")  # New
]

updates_df = spark.createDataFrame(customer_updates, ["id", "name", "email", "city"])
updates_df.createOrReplaceTempView("customer_updates")

print("Customer updates:")
display(updates_df)

# Approach 1: Using MERGE
spark.sql("""
MERGE INTO customers target
USING customer_updates source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET 
    target.name = source.name,
    target.email = source.email,
    target.city = source.city
WHEN NOT MATCHED THEN
  INSERT (id, name, email, city)
  VALUES (source.id, source.name, source.email, source.city)
""")

print("After MERGE operation:")
display(spark.table("customers"))

# Approach 2: CREATE OR REPLACE TABLE with partitioning
# Create a new dataset with dates for partitioning
customers_with_dates = []
start_date = datetime(2023, 1, 1)

for i in range(1, 8):
    if i <= 5:
        date = (start_date + timedelta(days=i)).strftime("%Y-%m-%d")
        customers_with_dates.append((i, f"Customer {i}", f"customer{i}@example.com", f"City {i}", date))

# Create partitioned table
partitioned_df = spark.createDataFrame(
    customers_with_dates, 
    ["id", "name", "email", "city", "registration_date"]
)

# Save as partitioned table
partitioned_df.write.format("delta") \
    .partitionBy("registration_date") \
    .mode("overwrite") \
    .saveAsTable("customers_partitioned")

print("Partitioned customer table:")
display(spark.table("customers_partitioned"))

# Show partitions
print("Table partitions:")
display(spark.sql("SHOW PARTITIONS customers_partitioned"))

# Create updates for specific partition
partition_updates = [
    (3, "Customer 3 Updated", "customer3new@example.com", "New City 3", "2023-01-03"),
    (8, "Customer 8 New", "customer8@example.com", "City 8", "2023-01-03")
]

partition_updates_df = spark.createDataFrame(
    partition_updates,
    ["id", "name", "email", "city", "registration_date"]
)

# Overwrite specific partition
spark.sql("""
INSERT OVERWRITE TABLE customers_partitioned
PARTITION (registration_date = '2023-01-03')
SELECT id, name, email, city, registration_date
FROM partition_updates_df
""")

print("After partition overwrite:")
display(spark.table("customers_partitioned"))

# Approach 3: COPY INTO for idempotent loads
# Create files to be loaded
incremental_data = [
    (9, "New Customer 9", "customer9@example.com", "Seattle"),
    (10, "New Customer 10", "customer10@example.com", "Portland")
]

incremental_df = spark.createDataFrame(incremental_data, ["id", "name", "email", "city"])
incremental_df.write.format("csv").option("header", "true").mode("overwrite").save("/tmp/incremental_loads/batch1")

# Second batch with some overlap
incremental_data2 = [
    (10, "New Customer 10", "customer10@example.com", "Portland"),  # Duplicate
    (11, "New Customer 11", "customer11@example.com", "Boston")
]

incremental_df2 = spark.createDataFrame(incremental_data2, ["id", "name", "email", "city"])
incremental_df2.write.format("csv").option("header", "true").mode("overwrite").save("/tmp/incremental_loads/batch2")

# Create target table
spark.sql("CREATE TABLE IF NOT EXISTS copy_into_target (id INT, name STRING, email STRING, city STRING)")

# Load data using COPY INTO
spark.sql("""
COPY INTO copy_into_target
FROM '/tmp/incremental_loads'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
PATTERN = '*/batch*'
""")

print("After COPY INTO operation:")
display(spark.table("copy_into_target"))

# Run second time to demonstrate idempotency
spark.sql("""
COPY INTO copy_into_target
FROM '/tmp/incremental_loads'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
PATTERN = '*/batch*'
""")

print("After second COPY INTO operation (should be idempotent):")
display(spark.table("copy_into_target"))

Assessment Quiz

Which of the following is NOT a property of ACID transactions in Delta Lake?
- A) Atomicity
- B) Concurrency
- C) Isolation
- D) Durability
What is the key difference between managed and external Delta tables?
- A) External tables support time travel while managed tables don’t
- B) When a managed table is dropped, both data and metadata are deleted
- C) External tables cannot be updated with MERGE operations
- D) Managed tables don’t support partitioning
Which command would you use to view the history of operations on a Delta table?
- A) SHOW HISTORY delta_table
- B) DESCRIBE HISTORY delta_table
- C) SELECT HISTORY FROM delta_table
- D) DISPLAY OPERATIONS delta_table
What is the purpose of Z-ordering in Delta Lake?
- A) To encrypt sensitive data in Delta tables
- B) To sort data lexicographically for faster access
- C) To co-locate related data in the same files for query optimization
- D) To compress data files for storage efficiency
Which operation permanently removes files that are no longer needed by a Delta table?
- A) CLEAN
- B) PURGE
- C) VACUUM
- D) REMOVE
What is the difference between CREATE OR REPLACE TABLE and INSERT OVERWRITE?
- A) CREATE OR REPLACE TABLE can change the schema, INSERT OVERWRITE preserves it
- B) INSERT OVERWRITE supports condition-based updates, CREATE OR REPLACE TABLE doesn’t
- C) CREATE OR REPLACE TABLE is only for managed tables, INSERT OVERWRITE works with both
- D) There is no difference; they are synonyms
Which statement is true about the MERGE operation in Delta Lake?
- A) It can only handle inserts, not updates
- B) It requires creating a staging table first
- C) It combines insert, update, and delete operations in a single atomic transaction
- D) It can only be used with external tables
What is the advantage of using COPY INTO for data loading?
- A) It automatically partitions data based on content
- B) It loads only new files not previously ingested
- C) It transforms data during the loading process
- D) It’s faster than MERGE for updating existing records
In Delta Live Tables, what does the @dlt.expect_or_drop annotation do?
- A) Drops the entire table if any records violate the constraint
- B) Drops records that violate the constraint
- C) Drops the column if it contains invalid values
- D) Marks records for later review
Which component of Delta Live Tables is used for efficiently processing new files as they arrive?
- A) Change Data Capture
- B) Auto Loader
- C) File Tracker
- D) Stream Processor

Answers to Assessment Quiz:

B) Concurrency (ACID stands for Atomicity, Consistency, Isolation, Durability)
B) When a managed table is dropped, both data and metadata are deleted
B) DESCRIBE HISTORY delta_table
C) To co-locate related data in the same files for query optimization
C) VACUUM
A) CREATE OR REPLACE TABLE can change the schema, INSERT OVERWRITE preserves it
C) It combines insert, update, and delete operations in a single atomic transaction
B) It loads only new files not previously ingested
B) Drops records that violate the constraint
B) Auto Loader

Key Takeaways

Delta Lake enhances data lakes with ACID transactions, ensuring data reliability.
Understanding the difference between managed and external tables is crucial for proper data management.
Time travel capabilities allow for auditing and error recovery.
Optimization operations like Z-ordering and VACUUM improve query performance and manage storage.
Incremental data loading can be accomplished through multiple patterns (MERGE, COPY INTO, etc.).
Delta Live Tables provides a declarative framework for building reliable data pipelines.
Change Data Capture enables efficient processing of data changes from source systems.

Production Pipelines & Data Governance

Production Pipelines

Introduction to Databricks Jobs

Databricks Jobs provide a way to orchestrate and schedule your data processing workflows in a production environment. They allow you to automate notebook execution, specify dependencies between tasks, and set up monitoring and alerting.

Key Concepts of Databricks Jobs

Jobs in Databricks consist of several components:

Jobs: The top-level container for workloads.
Tasks: Individual units of work within a job (e.g., a notebook, JAR file, or Python script).
Runs: Specific executions of jobs, including all task runs.
Task Runs: Individual executions of tasks within a job run.

Benefits of Using Jobs

Automation: Schedule workloads to run on a specific cadence.
Orchestration: Define complex workflows with dependencies between tasks.
Resource Isolation: Jobs run on dedicated job clusters that terminate after completion.
Monitoring: Track job progress, execution time, and failures.
Notifications: Set up alerts for job failures or completion.
Cost Optimization: Jobs clusters automatically terminate after completion.

Multi-Task Workflows

Modern data pipelines often consist of multiple interconnected steps that need to be executed in a specific order. Databricks Jobs support this through multi-task workflows.

Creating Task Dependencies

# Pseudocode representation of a multi-task workflow
job = {
    "name": "Sales Processing Pipeline",
    "tasks": [
        {
            "task_key": "extract_data",
            "notebook_task": {
                "notebook_path": "/path/to/extract_notebook"
            },
            "job_cluster_key": "job_cluster"
        },
        {
            "task_key": "transform_data",
            "notebook_task": {
                "notebook_path": "/path/to/transform_notebook"
            },
            "job_cluster_key": "job_cluster",
            "depends_on": [
                {"task_key": "extract_data"}
            ]
        },
        {
            "task_key": "load_data",
            "notebook_task": {
                "notebook_path": "/path/to/load_notebook"
            },
            "job_cluster_key": "job_cluster",
            "depends_on": [
                {"task_key": "transform_data"}
            ]
        }
    ]
}

The depends_on property specifies that a task should only run after its dependencies have completed successfully.

Task Execution Flow

When a multi-task job runs:

Tasks with no dependencies start immediately.
Tasks with dependencies wait until all dependencies complete successfully.
If a task fails, its dependent tasks will not run.
The job completes when all tasks either succeed or fail.

Job Scheduling with CRON

Databricks jobs can be scheduled to run automatically using CRON expressions.

CRON Expression Format

A CRON expression consists of 5-6 fields that define when a job should run:

┌────────── minute (0 - 59)
│ ┌──────── hour (0 - 23)
│ │ ┌────── day of month (1 - 31)
│ │ │ ┌──── month (1 - 12)
│ │ │ │ ┌── day of week (0 - 6) (Sunday = 0)
│ │ │ │ │
* * * * *

Common CRON patterns:

0 0 * * * - Run at midnight every day
0 */6 * * * - Run every 6 hours
0 8 * * 1-5 - Run at 8 AM on weekdays
0 0 1 * * - Run at midnight on the first day of each month

Setting Up a Schedule

When creating a job, you can specify a schedule using either a CRON expression or a simpler UI-based approach that generates the CRON expression for you.

Job Monitoring and Alerting

Monitoring job execution and setting up alerts for failures are crucial for maintaining reliable data pipelines.

Monitoring Job Execution

Databricks provides several ways to monitor jobs:

Jobs UI: View all jobs, their status, and execution history.
Run Details: Examine logs, task outputs, and execution time for a specific run.
Metrics: Track execution time, success rates, and resource utilization.

Setting Up Alerts

Databricks supports various notification mechanisms:

Email Notifications: Send emails on job failure or completion.
Webhook Notifications: Integrate with external systems like Slack or PagerDuty.
REST API: Programmatically check job status and take action.

Retry Policies

For transient failures, you can set up retry policies:

{
  "task_key": "extract_data",
  "notebook_task": {
    "notebook_path": "/path/to/extract_notebook"
  },
  "retry_on_failure": {
    "max_retries": 3,
    "min_duration_between_retries_seconds": 60
  }
}

This configuration will retry the task up to 3 times with a 60-second delay between retries if it fails.

Hands-on Exercise: Creating a Production Pipeline

Let’s build a complete production pipeline that performs the following steps:

Extracts and processes daily sales data
Computes key metrics
Updates a reporting dashboard
Sends an email notification upon completion

For this exercise, we’ll create three notebooks and link them in a job:

Notebook 1: Extract and Process Sales Data

# Notebook: extract_sales_data
# This notebook extracts daily sales data and performs initial processing

# Log the start of the process
import datetime
current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Starting sales data extraction at {current_time}")

# Create sample sales data (in a real scenario, you would extract from a source)
from pyspark.sql import functions as F
import random
from datetime import datetime, timedelta

# Generate sales data for yesterday
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
products = ["Product A", "Product B", "Product C", "Product D", "Product E"]
regions = ["North", "South", "East", "West", "Central"]

# Generate 100 random sales records
data = []
for i in range(100):
    product = random.choice(products)
    region = random.choice(regions)
    quantity = random.randint(1, 10)
    unit_price = round(random.uniform(10, 100), 2)
    total = round(quantity * unit_price, 2)
    
    data.append((yesterday, product, region, quantity, unit_price, total))

# Create DataFrame
columns = ["date", "product", "region", "quantity", "unit_price", "total_amount"]
sales_df = spark.createDataFrame(data, columns)

# Save as a Delta table, partitioned by date
sales_df.write.format("delta").partitionBy("date").mode("append").saveAsTable("daily_sales")

# Verify the data was written
count = spark.table("daily_sales").filter(F.col("date") == yesterday).count()
print(f"Successfully extracted {count} sales records for {yesterday}")

# Set a notebook parameter to pass to downstream tasks
dbutils.jobs.taskValues.set(key = "sales_date", value = yesterday)
dbutils.jobs.taskValues.set(key = "record_count", value = count)

print("Data extraction complete")

Notebook 2: Compute Sales Metrics

# Notebook: compute_sales_metrics
# This notebook computes key sales metrics based on the extracted data

# Get the sales date from the upstream task
sales_date = dbutils.jobs.taskValues.get(taskKey = "extract_sales_data", key = "sales_date")
record_count = dbutils.jobs.taskValues.get(taskKey = "extract_sales_data", key = "record_count")

print(f"Computing sales metrics for {sales_date} with {record_count} records")

# Read the daily sales data
daily_sales = spark.table("daily_sales").filter(f"date = '{sales_date}'")

# Compute product-level metrics
product_metrics = daily_sales.groupBy("product").agg(
    F.sum("quantity").alias("total_quantity"),
    F.sum("total_amount").alias("total_revenue"),
    F.avg("unit_price").alias("average_price")
).orderBy(F.col("total_revenue").desc())

# Save product metrics
product_metrics.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("product_metrics_daily")

# Compute region-level metrics
region_metrics = daily_sales.groupBy("region").agg(
    F.countDistinct("product").alias("product_count"),
    F.sum("quantity").alias("total_quantity"),
    F.sum("total_amount").alias("total_revenue")
).orderBy(F.col("total_revenue").desc())

# Save region metrics
region_metrics.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("region_metrics_daily")

# Compute overall metrics for dashboard
overall_metrics = {
    "date": sales_date,
    "total_sales": daily_sales.select(F.sum("total_amount")).first()[0],
    "total_units": daily_sales.select(F.sum("quantity")).first()[0],
    "avg_order_value": daily_sales.select(F.avg("total_amount")).first()[0],
    "top_product": product_metrics.first()["product"],
    "top_region": region_metrics.first()["region"]
}

# Convert to DataFrame and save
metrics_df = spark.createDataFrame([overall_metrics])
metrics_df.write.format("delta").mode("append").saveAsTable("sales_overall_metrics")

# Pass metrics to the next task
dbutils.jobs.taskValues.set(key = "total_sales", value = overall_metrics["total_sales"])
dbutils.jobs.taskValues.set(key = "top_product", value = overall_metrics["top_product"])
dbutils.jobs.taskValues.set(key = "top_region", value = overall_metrics["top_region"])

print("Metrics computation complete")

Notebook 3: Update Dashboard and Notify

# Notebook: update_dashboard
# This notebook updates a dashboard and sends a notification

# Get values from upstream tasks
sales_date = dbutils.jobs.taskValues.get(taskKey = "extract_sales_data", key = "sales_date")
total_sales = dbutils.jobs.taskValues.get(taskKey = "compute_sales_metrics", key = "total_sales")
top_product = dbutils.jobs.taskValues.get(taskKey = "compute_sales_metrics", key = "top_product")
top_region = dbutils.jobs.taskValues.get(taskKey = "compute_sales_metrics", key = "top_region")

print(f"Updating dashboard for {sales_date}")
print(f"Total sales: ${total_sales:.2f}")
print(f"Top product: {top_product}")
print(f"Top region: {top_region}")

# In a real scenario, you would use the Databricks API to refresh a dashboard
# or update an external BI tool

# Create a summary for notification
summary = f"""
Sales Report for {sales_date}
---------------------------
Total Sales: ${float(total_sales):.2f}
Top Product: {top_product}
Top Region: {top_region}

The complete report is available in the Databricks dashboard.
"""

# In a real scenario, you would send an email notification
# For this exercise, we'll just print the notification
print("Notification would be sent with the following content:")
print(summary)

print("Dashboard update and notification complete")

Creating a Multi-Task Job

To set up this workflow as a job:

Navigate to the Workflows section in Databricks
Click “Create Job”
Set up the first task:
- Task name: extract_sales_data
- Type: Notebook
- Path: /path/to/extract_sales_data
- Cluster: New job cluster (select appropriate configuration)
Add the second task:
- Task name: compute_sales_metrics
- Type: Notebook
- Path: /path/to/compute_sales_metrics
- Cluster: Same as task 1
- Dependencies: extract_sales_data
Add the third task:
- Task name: update_dashboard
- Type: Notebook
- Path: /path/to/update_dashboard
- Cluster: Same as task 1
- Dependencies: compute_sales_metrics
Set up a schedule:
- Frequency: Daily
- Start time: 01:00 AM
Configure notifications:
- On failure: Email to team members
Save the job

This creates a complete ETL pipeline that runs daily, processes sales data, and updates dashboards with the results.

Data Governance

Introduction to Data Governance

Data governance encompasses the policies, procedures, and standards that ensure data is managed as a valuable organizational asset. It focuses on data availability, usability, integrity, and security.

Four Areas of Data Governance

Data Discovery: Finding and understanding available data assets
Access Control: Managing who can access what data
Audit and Compliance: Tracking data usage and ensuring regulatory compliance
Data Quality: Ensuring data accuracy, completeness, and reliability

Unity Catalog Architecture

Unity Catalog is Databricks’ solution for unified data governance across clouds and regions. It provides a centralized way to manage access to data, ML models, and analytics objects.

Metastores vs. Catalogs

Metastore:

The top-level container in Unity Catalog
Typically, one metastore per organization
Contains catalogs, schemas (databases), and securable objects
Linked to one or more Databricks workspaces

Catalog:

A collection of schemas (databases)
Used to organize data by business unit or project
Examples: marketing_catalog, finance_catalog, hr_catalog

Three-Level Namespace

Unity Catalog uses a three-level namespace to address objects:

catalog.schema.object

For example:

-- Querying a table using the three-level namespace
SELECT * FROM marketing.customer_data.transactions;

Where:

marketing is the catalog
customer_data is the schema (database)
transactions is the table

Security Models and Access Control

Unity Catalog implements fine-grained access control to data and analytical assets.

Securables in Unity Catalog

Securables are objects to which permissions can be granted:

Metastore: The top-level container
Catalog: Collection of schemas
Schema: Collection of tables and views
Table/View: Data objects
Function: User-defined functions
Volume: Locations for files
Model: Machine learning models

Access Control Principles

Unity Catalog follows these principles:

Inheritance: Permissions flow from higher-level objects to lower-level ones
Least Privilege: Users get only the permissions they need
Separation of Duties: Different roles handle different responsibilities

Common Permissions

-- Grant permission to read data
GRANT SELECT ON TABLE catalog.schema.table TO user@example.com;

-- Grant permission to modify data
GRANT MODIFY ON TABLE catalog.schema.table TO group sales_analysts;

-- Grant permission to manage a schema
GRANT USAGE, CREATE ON SCHEMA catalog.schema TO group data_engineers;

-- Grant permission to access all tables in a schema
GRANT SELECT ON SCHEMA catalog.schema TO group analysts;

Service Principals

A service principal is an identity used by automated processes, applications, or services, rather than people.

Benefits of Service Principals

No password expiration or rotation issues
Clear separation between human and non-human access
Easier to audit automated processes
Can be restricted to specific operations

Creating and Using Service Principals

Service principals are created in the Databricks account console and can be assigned permissions like human users:

-- Grant a service principal access to a table
GRANT SELECT ON TABLE catalog.schema.table TO `service-principal://etl-process`;

-- Use service principal for a connection
CREATE CONNECTION my_connection
TYPE MYSQL
PROPERTIES (
  host = 'myhost.example.com',
  port = '3306',
  user = 'myuser',
  password = 'supersecret'
)
WITH USER SERVICE PRINCIPAL `service-principal://etl-process`;

Cluster Security Modes

Unity Catalog works with specific cluster security modes:

Unity Catalog-enabled Clusters:
- Support the three-level namespace
- Enforce access controls defined in Unity Catalog
- Required for accessing data governed by Unity Catalog
User Isolation Mode:
- Each user’s code runs as their own user identity
- Users cannot access each other’s data or resources
- Increased security for multi-tenant use cases

Best Practices for Data Governance

Colocate Metastores with Workspaces:
- Assign each workspace to a specific metastore
- Reduces network latency and improves performance
Use Service Principals for Connections:
- Avoid hardcoding credentials in notebooks
- Better security and auditability
Segregate Business Units Across Catalogs:
- Create separate catalogs for different departments
- Simplifies access control and improves organization
Implement Row-Level Security When Needed:
- For highly sensitive data that requires filtering based on user
- Can be implemented using dynamic views

Hands-on Exercise: Implementing Data Governance

Let’s implement a comprehensive data governance structure for a fictional organization:

# This is a demonstration of Unity Catalog concepts
# Note: Some commands require appropriate permissions to execute

# 1. Create a catalog structure for different departments
spark.sql("CREATE CATALOG IF NOT EXISTS marketing")
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE CATALOG IF NOT EXISTS hr")

# 2. Create schemas within each catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS marketing.customer_data")
spark.sql("CREATE SCHEMA IF NOT EXISTS marketing.campaign_analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.transactions")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.reporting")
spark.sql("CREATE SCHEMA IF NOT EXISTS hr.employee_records")

# 3. Create sample tables with different security requirements
# Marketing customer data
spark.sql("""
CREATE TABLE IF NOT EXISTS marketing.customer_data.profiles (
  customer_id INT,
  name STRING,
  email STRING,
  signup_date DATE,
  last_purchase_date DATE,
  total_spend DOUBLE
)
COMMENT 'Customer profile information'
""")

# Finance transaction data (sensitive)
spark.sql("""
CREATE TABLE IF NOT EXISTS finance.transactions.orders (
  order_id INT,
  customer_id INT,
  order_date DATE,
  amount DOUBLE,
  payment_method STRING,
  status STRING
)
COMMENT 'Financial transaction records - Sensitive data'
""")

# HR employee data (highly sensitive)
spark.sql("""
CREATE TABLE IF NOT EXISTS hr.employee_records.employees (
  employee_id INT,
  name STRING,
  hire_date DATE,
  department STRING,
  salary DOUBLE,
  manager_id INT
)
COMMENT 'Employee records - Highly sensitive data'
""")

# 4. Insert some sample data
spark.sql("""
INSERT INTO marketing.customer_data.profiles VALUES
  (1, 'John Doe', 'john@example.com', '2022-01-15', '2023-02-20', 1250.50),
  (2, 'Jane Smith', 'jane@example.com', '2022-03-10', '2023-03-05', 876.25),
  (3, 'Bob Johnson', 'bob@example.com', '2022-05-22', '2023-01-30', 543.75)
""")

spark.sql("""
INSERT INTO finance.transactions.orders VALUES
  (101, 1, '2023-02-20', 250.50, 'CREDIT_CARD', 'COMPLETED'),
  (102, 2, '2023-03-05', 876.25, 'PAYPAL', 'COMPLETED'),
  (103, 3, '2023-01-30', 543.75, 'BANK_TRANSFER', 'COMPLETED'),
  (104, 1, '2023-04-10', 125.00, 'CREDIT_CARD', 'PENDING')
""")

spark.sql("""
INSERT INTO hr.employee_records.employees VALUES
  (1001, 'Sarah Wilson', '2020-05-10', 'Marketing', 85000.00, 1005),
  (1002, 'Mark Davis', '2021-02-15', 'Finance', 92000.00, 1006),
  (1003, 'Lisa Brown', '2019-11-20', 'HR', 78000.00, 1007)
""")

# 5. Create user groups (in real scenario, these would be created in account console)
# For demonstration purposes only
print("In a real deployment, you would create these groups in the account console:")
print("- marketing_team")
print("- finance_team")
print("- hr_team")
print("- data_analysts")
print("- data_engineers")
print("- executives")

# 6. Set up access control permissions
# Marketing team access
print("""
-- Grant marketing team access
GRANT USAGE ON CATALOG marketing TO GROUP marketing_team;
GRANT SELECT ON SCHEMA marketing.customer_data TO GROUP marketing_team;
GRANT SELECT, MODIFY ON SCHEMA marketing.campaign_analytics TO GROUP marketing_team;
""")

# Finance team access
print("""
-- Grant finance team access
GRANT USAGE ON CATALOG finance TO GROUP finance_team;
GRANT SELECT, MODIFY ON SCHEMA finance.transactions TO GROUP finance_team;
GRANT SELECT, MODIFY ON SCHEMA finance.reporting TO GROUP finance_team;
""")

# HR team access
print("""
-- Grant HR team access
GRANT USAGE ON CATALOG hr TO GROUP hr_team;
GRANT SELECT, MODIFY ON SCHEMA hr.employee_records TO GROUP hr_team;
""")

# Data analysts - limited access across departments
print("""
-- Grant data analysts limited access
GRANT USAGE ON CATALOG marketing TO GROUP data_analysts;
GRANT USAGE ON CATALOG finance TO GROUP data_analysts;
GRANT SELECT ON SCHEMA marketing.customer_data TO GROUP data_analysts;
GRANT SELECT ON TABLE finance.reporting.financial_metrics TO GROUP data_analysts;
-- Note: Explicitly deny access to sensitive HR data
DENY SELECT ON CATALOG hr TO GROUP data_analysts;
""")

# Data engineers - broader access
print("""
-- Grant data engineers broader access
GRANT USAGE ON CATALOG marketing, finance TO GROUP data_engineers;
GRANT SELECT, MODIFY ON SCHEMA marketing.customer_data TO GROUP data_engineers;
GRANT SELECT, MODIFY ON SCHEMA finance.reporting TO GROUP data_engineers;
GRANT SELECT ON SCHEMA finance.transactions TO GROUP data_engineers;
-- Limited HR access
GRANT USAGE ON CATALOG hr TO GROUP data_engineers;
GRANT SELECT ON SCHEMA hr.employee_records TO GROUP data_engineers;
""")

# Executives - read-only access to specific dashboards and reports
print("""
-- Grant executives read-only access
GRANT USAGE ON CATALOG marketing, finance, hr TO GROUP executives;
GRANT SELECT ON TABLE marketing.campaign_analytics.campaign_performance TO GROUP executives;
GRANT SELECT ON TABLE finance.reporting.financial_metrics TO GROUP executives;
GRANT SELECT ON TABLE hr.employee_records.headcount_summary TO GROUP executives;
""")

# 7. Create a service principal for automated ETL processes
print("""
-- In a real scenario, create a service principal in the account console
-- service-principal://etl-process

-- Grant appropriate permissions
GRANT USAGE ON CATALOG marketing, finance, hr TO `service-principal://etl-process`;
GRANT SELECT, MODIFY ON SCHEMA marketing.customer_data TO `service-principal://etl-process`;
GRANT SELECT, MODIFY ON SCHEMA finance.transactions TO `service-principal://etl-process`;
GRANT SELECT, MODIFY ON SCHEMA hr.employee_records TO `service-principal://etl-process`;
""")

# 8. Demonstrate querying with the three-level namespace
# Query marketing data
display(spark.sql("SELECT * FROM marketing.customer_data.profiles"))

# Query finance data
display(spark.sql("SELECT * FROM finance.transactions.orders"))

# Query HR data
display(spark.sql("SELECT * FROM hr.employee_records.employees"))

# 9. Create a row-level security policy on sensitive data
# This is a simplified example of how you might implement RLS
print("""
-- Create a view with row-level security for salary data
CREATE OR REPLACE VIEW hr.employee_records.employee_details AS
SELECT
  employee_id,
  name,
  hire_date,
  department,
  -- Only show salary to HR team and the employee's manager
  CASE 
    WHEN is_member('hr_team') OR current_user() IN (
      SELECT DISTINCT manager_email 
      FROM hr.employee_records.manager_directory 
      WHERE employee_id = e.employee_id
    )
    THEN salary
    ELSE NULL
  END as salary,
  manager_id
FROM hr.employee_records.employees e
""")

Assessment Quiz

What is the primary benefit of using Jobs in Databricks for production workflows?
- A) They provide access to more powerful compute resources
- B) They can integrate with version control systems
- C) They enable automation, scheduling, and orchestration of workflows
- D) They bypass Unity Catalog security restrictions
In a multi-task workflow, what happens if a task with dependencies fails?
- A) All dependent tasks will still run
- B) All dependent tasks will not run
- C) The entire job is immediately terminated
- D) The job pauses and waits for manual intervention
Which CRON expression would schedule a job to run at 2:30 PM every weekday?
- A) 30 14 * * 1-5
- B) 30 2 * * 1-5
- C) 14 30 * * 1-5
- D) 30 2 * * 0-4
How can a task pass values to downstream tasks in a Databricks job?
- A) By writing to a shared Delta table
- B) Using the dbutils.jobs.taskValues API
- C) Through environment variables
- D) By updating job parameters
What is the purpose of setting up a retry policy for a task?
- A) To automatically restart the entire job if any task fails
- B) To attempt the task again if it fails due to transient issues
- C) To skip failing tasks and continue with the rest of the job
- D) To resubmit the job with different parameters
What is the correct representation of the three-level namespace in Unity Catalog?
- A) workspace.catalog.table
- B) catalog.schema.table
- C) workspace.database.table
- D) metastore.catalog.table
What is the primary difference between a metastore and a catalog in Unity Catalog?
- A) A metastore contains catalogs, while a catalog contains schemas
- B) A metastore is cloud-specific, while catalogs work across clouds
- C) A metastore is for structured data, while catalogs are for unstructured data
- D) A metastore is read-only, while catalogs can be modified
When should you use a service principal instead of a user account?
- A) For interactive data exploration
- B) For automated processes and scheduled jobs
- C) For administrative tasks in the account console
- D) For query development in Databricks SQL
Which cluster security mode is required to use Unity Catalog?
- A) Single user mode
- B) Standard mode
- C) High concurrency mode
- D) Unity Catalog-enabled mode
What is a best practice for organizing data across catalogs in Unity Catalog?
- A) Put all data in a single catalog for simplicity
- B) Segregate data by business unit across different catalogs
- C) Create a new catalog for each table
- D) Use only the default catalog

Answers to Assessment Quiz:

C) They enable automation, scheduling, and orchestration of workflows
B) All dependent tasks will not run
A) 30 14 * * 1-5
B) Using the dbutils.jobs.taskValues API
B) To attempt the task again if it fails due to transient issues
B) catalog.schema.table
A) A metastore contains catalogs, while a catalog contains schemas
B) For automated processes and scheduled jobs
D) Unity Catalog-enabled mode
B) Segregate data by business unit across different catalogs

Key Takeaways

Databricks Jobs provide a robust framework for automating, scheduling, and orchestrating data workflows in production environments.
Multi-task workflows with dependencies enable complex data pipelines with clearly defined execution sequences.
CRON expressions allow flexible scheduling of jobs based on time, day, month, and other parameters.
Monitoring, alerting, and retry policies are essential components of production pipelines to ensure reliability.
Unity Catalog provides centralized data governance across cloud environments using a three-level namespace (catalog.schema.object).
Service principals should be used for automated processes rather than user accounts, improving security and auditability.
Access control should follow the principle of least privilege, granting users only the permissions they need to perform their roles.
Organizing data by business unit across different catalogs simplifies access control and improves manageability.

Review and Exam Preparation

Comprehensive Topic Review

Databricks Lakehouse Platform

Key Concepts:

The Databricks Lakehouse Platform combines the best features of data warehouses and data lakes, providing a unified platform for all data workloads.

Lakehouse Architecture:

Combines data lake storage flexibility with data warehouse reliability and performance
Supports both structured and unstructured data
Enables batch and streaming analytics in a single platform
Implements ACID transactions through Delta Lake

Medallion Architecture:

Bronze layer: Raw data ingested with minimal processing
Silver layer: Cleansed, conformed, and validated data
Gold layer: Business-level aggregates ready for consumption

Databricks Environment Components:

Workspace: Web-based interface for collaboration
Notebooks: Interactive documents for code development
Clusters: Compute resources that execute code
Jobs: Scheduled or triggered notebook executions
Databricks Runtime (DBR): Software environment with Apache Spark and optimizations

Databricks Repos and Version Control:

Git integration within Databricks
Support for common Git operations (clone, commit, push/pull)
Enables CI/CD workflows and collaborative development

ELT with Apache Spark

Data Extraction:

Reading from various file formats (CSV, JSON, Parquet)
Reading from databases using JDBC
Extracting from directories of files

Data References:

Temporary views: df.createOrReplaceTempView("view_name")
Global temporary views: df.createOrReplaceGlobalTempView("name")
Common Table Expressions (CTEs): WITH clause in SQL

Data Transformation Techniques:

Filtering: df.filter() or WHERE clause in SQL
Selection: df.select() or SELECT in SQL
Column creation: df.withColumn() or expressions in SQL
Aggregations: df.groupBy().agg() or GROUP BY in SQL

Data Deduplication:

Complete duplicates: df.distinct() or SELECT DISTINCT in SQL
Based on specific columns: df.dropDuplicates(["col1", "col2"])
Using window functions: ROW_NUMBER() with PARTITION BY

Complex Data Types:

Working with timestamps and dates: to_timestamp(), date_format()
JSON parsing: from_json(), dot notation for accessing nested fields
Array operations: explode(), array_contains(), flatten()

SQL User-Defined Functions:

Creating UDFs: CREATE OR REPLACE FUNCTION function_name(param_type) RETURNS return_type RETURN expression
Using UDFs in queries: SELECT function_name(column) FROM table

Control Flow:

Using CASE WHEN statements for conditional logic
Using custom control flow with UDFs

Incremental Data Processing

Delta Lake Fundamentals:

ACID transaction support for data reliability
Schema enforcement and evolution
Time travel capabilities for data versioning
Audit history of all operations

Table Management:

Managed tables: Databricks manages both data and metadata
External tables: Databricks manages only metadata, data is stored in a specified location

Delta Operations:

Table history: DESCRIBE HISTORY table_name
Time travel: SELECT * FROM table_name VERSION AS OF version_number
Optimization: OPTIMIZE table ZORDER BY (column_name)
File management: VACUUM table_name RETAIN num_hours HOURS

Incremental Data Loading Patterns:

Complete overwrite: CREATE OR REPLACE TABLE or INSERT OVERWRITE
Merge operations: MERGE INTO target USING source ON condition
Idempotent loads: COPY INTO target FROM source

Delta Live Tables (DLT):

Declarative data pipelines: Define what rather than how
Pipeline types: Triggered vs. continuous
Auto Loader: Efficient incremental file processing
Data quality enforcement: Constraints and expectations
Change data capture: APPLY CHANGES INTO

Production Pipelines & Data Governance

Databricks Jobs:

Multi-task workflows with dependencies
Scheduling with CRON expressions
Monitoring and alerting mechanisms
Retry policies for failure handling

Data Governance with Unity Catalog:

Three-level namespace: catalog.schema.object
Metastores vs. catalogs hierarchy
Access control and permissions
Service principals for automated processes
Cluster security modes

Best Practices:

Organizing data by business unit
Using service principals for connections
Implementing appropriate security controls
Following the principle of least privilege

Common Exam Pitfalls and Misconceptions

Confusion between Managed and External Tables:
- Remember that dropping a managed table deletes both data and metadata
- Dropping an external table deletes only the metadata, not the data
Delta Lake Operations:
- VACUUM permanently removes files, not just marks them for deletion
- OPTIMIZE compacts small files but doesn’t affect data organization without ZORDER BY
- Time travel queries are read-only; use RESTORE to revert a table
Incremental Loading Patterns:
- MERGE is for upserts (update + insert)
- COPY INTO is for idempotent new file ingestion
- INSERT OVERWRITE preserves schema but replaces all data
Unity Catalog Concepts:
- The correct namespace order is catalog.schema.object
- Permissions inherit from parent objects to child objects
- Service principals are for automated processes, not interactive usage
Production Pipeline Design:
- Tasks with failed dependencies don’t run
- CRON expressions use the format: minute hour day-of-month month day-of-week
- Parameters can be passed between tasks using the dbutils.jobs.taskValues API

First Full-Length Mock Exam (90 minutes)

Now let’s start our first full-length practice exam to assess your knowledge. This exam contains 45 questions covering all sections of the certification. You have 90 minutes to complete it.

Section 1: Databricks Lakehouse Platform (9 questions)

What is a key difference between data lakes and data lakehouses?
- A) Data lakes store unstructured data, while data lakehouses only store structured data
- B) Data lakehouses provide ACID transactions, while data lakes typically don’t
- C) Data lakes use columnar formats, while data lakehouses use row-based formats
- D) Data lakehouses require schema-on-write, while data lakes use schema-on-read
In the medallion architecture, which layer contains cleansed and validated data?
- A) Bronze
- B) Silver
- C) Gold
- D) Platinum
What is the primary difference between all-purpose clusters and jobs clusters?
- A) All-purpose clusters support multiple languages, while jobs clusters only support one language
- B) Jobs clusters terminate automatically after job completion, while all-purpose clusters persist until manually terminated
- C) All-purpose clusters run on a fixed schedule, while jobs clusters run on-demand
- D) Jobs clusters support Delta Lake, while all-purpose clusters don’t
Which statement about Databricks Runtime (DBR) is true?
- A) DBR only contains Apache Spark components
- B) DBR is a collection of software components optimized for the Databricks platform
- C) DBR must be manually installed on each cluster
- D) DBR only supports the Standard edition, not the ML edition
How can notebooks be shared with other users in Databricks?
- A) By exporting them as JAR files
- B) By attaching them to emails
- C) By setting permissions in the notebook’s sharing settings
- D) By converting them to Delta tables
Which statement about cluster termination is correct?
- A) Terminating a cluster deletes all data stored in the cluster
- B) Terminating a cluster deletes notebooks attached to it
- C) Terminating a cluster stops all running operations and releases compute resources
- D) Terminated clusters cannot be restarted
Which feature allows you to run one notebook from within another notebook?
- A) %include command
- B) %run command
- C) %import command
- D) %execute command
What is a primary benefit of using Databricks Repos over the built-in notebook version control?
- A) It provides unlimited storage for notebook revisions
- B) It enables proper Git operations such as branching and merging
- C) It automatically backs up notebooks hourly
- D) It allows notebooks to be shared with external users
Which of the following languages is NOT supported natively in Databricks notebooks?
- A) Python
- B) SQL
- C) C++
- D) Scala

Section 2: ELT with Apache Spark (12 questions)

Which option is required when reading a CSV file with a header row in Spark?
- A) option("header", "true")
- B) option("hasHeaders", "true")
- C) option("firstRowHeader", "true")
- D) option("includeHeader", "true")
What is the scope of a temporary view created with createOrReplaceTempView()?
- A) Available only within the current notebook
- B) Available across all notebooks in the current workspace
- C) Available only within the current Spark session
- D) Available until the cluster is terminated
Which SQL construct allows you to define auxiliary statements for use within a larger query?
- A) Temporary View
- B) Subquery
- C) Common Table Expression (CTE)
- D) Stored Procedure
How would you extract the year from a timestamp column in Spark SQL?
- A) EXTRACT(YEAR FROM timestamp_column)
- B) year(timestamp_column)
- C) get_year(timestamp_column)
- D) timestamp_column.getYear()
Which function would you use to remove duplicate records based on specific columns?
- A) distinct(["column1", "column2"])
- B) dropDuplicates(["column1", "column2"])
- C) removeDuplicates(["column1", "column2"])
- D) uniqueRows(["column1", "column2"])
What does the following SQL code check for?
```
SELECT column_name, COUNT(*) as count
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1
```
- A) Records with null values in column_name
- B) Records with duplicate values in column_name
- C) The total count of distinct values in column_name
- D) Records with empty strings in column_name
Which function would you use to split an array column into multiple rows?
- A) split()
- B) explode()
- C) flatten()
- D) array_to_rows()
How would you parse a JSON string into a structured column in Spark?
- A) parse_json()
- B) from_json()
- C) cast_json()
- D) json_to_struct()
Which keyword would you use to convert data from a tall format to a wide format in Spark SQL?
- A) TRANSPOSE
- B) ROTATE
- C) PIVOT
- D) CROSS TAB
What is the correct syntax to create a SQL user-defined function?
- A) CREATE FUNCTION name(param type) RETURNS type AS expression
- B) CREATE OR REPLACE FUNCTION name(param type) RETURNS type RETURN expression
- C) DEFINE FUNCTION name(param type) RETURNS type AS expression
- D) CREATE UDF name(param type) RETURN expression
Which expression would you use in a SQL CASE statement to handle null values?
- A) CASE WHEN column IS NULL THEN 'Unknown' ELSE column END
- B) CASE WHEN column = NULL THEN 'Unknown' ELSE column END
- C) CASE NULL THEN 'Unknown' ELSE column END
- D) CASE column WHEN NULL THEN 'Unknown' ELSE column END
How would you validate that a product_id column contains only unique values?
- A) SELECT COUNT(DISTINCT product_id) = COUNT(*) FROM table
- B) SELECT COUNT(*) FROM (SELECT product_id, COUNT(*) FROM table GROUP BY product_id HAVING COUNT(*) > 1)
- C) SELECT product_id FROM table GROUP BY product_id HAVING COUNT(*) = 1
- D) SELECT DISTINCT product_id FROM table WHERE product_id IS NOT NULL

Section 3: Incremental Data Processing (12 questions)

Which of the following is NOT one of the ACID properties provided by Delta Lake?
- A) Atomicity
- B) Concurrency
- C) Isolation
- D) Durability
What is the main difference between a managed table and an external table?
- A) External tables support ACID transactions, while managed tables don’t
- B) Managed tables can be queried using SQL, while external tables require the DataFrame API
- C) When a managed table is dropped, both data and metadata are deleted
- D) External tables are stored in Delta format, while managed tables use Parquet
Which command would you use to see the history of operations performed on a Delta table?
- A) SHOW HISTORY table_name
- B) DESCRIBE HISTORY table_name
- C) SELECT HISTORY FROM table_name
- D) DISPLAY OPERATIONS table_name
What is the primary purpose of Z-ordering in Delta Lake?
- A) To compress data for efficient storage
- B) To encrypt sensitive data fields
- C) To co-locate related data in the same files
- D) To partition data across storage devices
What does the VACUUM command do in Delta Lake?
- A) Optimizes the layout of Delta files
- B) Permanently removes files no longer needed by the table
- C) Validates the integrity of the Delta log
- D) Cleans up temporary files created during queries
Which operation would you use to add a generated column that automatically calculates tax as 7% of the price?
- A) ALTER TABLE table_name ADD COLUMN tax DOUBLE DEFAULT price * 0.07
- B) ALTER TABLE table_name ADD COLUMN tax DOUBLE GENERATED ALWAYS AS (price * 0.07)
- C) CREATE FUNCTION calculate_tax(price DOUBLE) RETURNS DOUBLE RETURN price * 0.07
- D) UPDATE table_name SET tax = price * 0.07
What is the key difference between CREATE OR REPLACE TABLE and INSERT OVERWRITE?
- A) CREATE OR REPLACE TABLE can change the schema, while INSERT OVERWRITE preserves it
- B) INSERT OVERWRITE supports partitioned tables, while CREATE OR REPLACE TABLE doesn’t
- C) CREATE OR REPLACE TABLE is atomic, while INSERT OVERWRITE isn’t
- D) INSERT OVERWRITE can target specific partitions, while CREATE OR REPLACE TABLE affects the entire table
When would you use the MERGE command in Delta Lake?
- A) Only for inserting new records
- B) Only for updating existing records
- C) For upserting data (update + insert in a single operation)
- D) For deleting records only
What makes the COPY INTO command idempotent?
- A) It automatically detects and skips previously loaded files
- B) It always replaces all data in the target table
- C) It uses a transaction ID to track changes
- D) It requires manual tracking of loaded files
Which component of Delta Live Tables (DLT) is designed for efficiently processing new files as they arrive?
- A) Change Data Capture
- B) Auto Loader
- C) Structured Streaming
- D) File Tracker
What does the @dlt.expect_or_drop constraint do in Delta Live Tables?
- A) Drops the entire pipeline if the constraint is violated
- B) Drops records that violate the constraint
- C) Drops the column with invalid values
- D) Drops the constraint if too many records violate it
What is the purpose of the APPLY CHANGES INTO operation in Delta Live Tables?
- A) To apply schema changes to an existing table
- B) To process CDC data and apply changes to a target table
- C) To enforce data quality rules across tables
- D) To merge two Delta tables together

Section 4: Production Pipelines (6 questions)

What is the primary advantage of using multi-task workflows in Databricks Jobs?
- A) They enable access to more powerful compute resources
- B) They allow tasks to run in parallel for faster execution
- C) They organize complex workflows with dependencies between tasks
- D) They provide better security isolation between tasks
What happens if a task with dependencies fails in a multi-task job?
- A) The dependent tasks will still run
- B) The dependent tasks will not run
- C) The entire job is terminated immediately
- D) The job pauses until the failed task is manually restarted
Which CRON expression would schedule a job to run at 3:30 AM every Monday?
- A) 30 3 * * 1
- B) 30 3 * * 0
- C) 3 30 * * 1
- D) 3 30 * * MON
How can one task pass parameters to downstream tasks in a multi-task workflow?
- A) By writing to a shared Delta table
- B) Using the dbutils.jobs.taskValues API
- C) Through environment variables
- D) By modifying the job definition
What configuration would you use to automatically retry a task if it fails?
- A) retry_on_failure: { max_retries: 3, min_duration_between_retries_seconds: 60 }
- B) auto_retry: { attempts: 3, delay_seconds: 60 }
- C) failure_handling: { retry_attempts: 3, retry_delay: 60 }
- D) error_recovery: { retries: 3, delay: 60 }
How would you configure email notifications for a job?
- A) By setting up an alert in the Jobs UI
- B) By adding email addresses to the job’s notification settings
- C) By creating a separate notification task in the workflow
- D) By writing custom notification code in a notebook task

Section 5: Data Governance (6 questions)

What is the correct representation of the three-level namespace in Unity Catalog?
- A) workspace.catalog.table
- B) catalog.schema.table
- C) metastore.database.table
- D) organization.catalog.table
What is a metastore in Unity Catalog?
- A) A database containing metadata about tables and views
- B) The top-level container that holds catalogs
- C) A collection of schemas within a catalog
- D) A service that synchronizes metadata between workspaces
When should you use a service principal instead of a user account?
- A) For data exploration and analysis
- B) For automated processes and scheduled jobs
- C) For accessing the Databricks UI
- D) For making one-time changes to tables
Which cluster security mode is required to use Unity Catalog?
- A) User isolation mode
- B) Single user mode
- C) Unity Catalog-enabled mode
- D) Token-based security mode
What SQL command would you use to grant a user permission to query a table?
- A) GRANT READ ON TABLE catalog.schema.table TO user@example.com
- B) GRANT SELECT ON TABLE catalog.schema.table TO user@example.com
- C) GRANT ACCESS ON TABLE catalog.schema.table TO user@example.com
- D) GRANT QUERY ON TABLE catalog.schema.table TO user@example.com
What is a best practice for organizing data in Unity Catalog?
- A) Storing all data in a single catalog for simplicity
- B) Creating a new catalog for each table
- C) Segregating data by business unit across different catalogs
- D) Duplicating tables across multiple catalogs for redundancy

Review of First Mock Exam and Explanations

Now let’s review the answers to the first mock exam and provide explanations for each question:

Section 1: Databricks Lakehouse Platform

B) Data lakehouses provide ACID transactions, while data lakes typically don’t
- Data lakehouses bring ACID transaction support to data lakes, which traditionally lack this capability. This is a fundamental difference between the two architectures.
B) Silver
- In the medallion architecture, the Silver layer contains data that has been cleansed, conformed, and validated. Bronze contains raw data, and Gold contains business-level aggregates.
B) Jobs clusters terminate automatically after job completion, while all-purpose clusters persist until manually terminated
- Jobs clusters are ephemeral and automatically terminate after completing their assigned job, making them cost-effective for scheduled workloads. All-purpose clusters remain running until manually terminated or until they reach their auto-termination period.
B) DBR is a collection of software components optimized for the Databricks platform
- Databricks Runtime includes Apache Spark plus additional optimizations, libraries, and integrations specific to the Databricks platform.
C) By setting permissions in the notebook’s sharing settings
- Notebooks can be shared with specific users or groups by configuring permissions in the notebook’s sharing settings.
C) Terminating a cluster stops all running operations and releases compute resources
- When you terminate a cluster, it stops all running operations and releases the compute resources. It does not delete any data or notebooks.
B) %run command
- The %run command allows you to execute one notebook from within another notebook, making the variables and functions from the called notebook available in the calling notebook.
B) It enables proper Git operations such as branching and merging
- Databricks Repos provides full Git functionality, including branching, merging, and collaborative development workflows, which aren’t available in the built-in notebook version control.
C) C++
- Databricks notebooks natively support Python, SQL, Scala, and R, but not C++.

Section 2: ELT with Apache Spark

A) option(“header”, “true”)
- When reading a CSV file with a header row, you need to specify option("header", "true") to tell Spark to treat the first row as column names.
C) Available only within the current Spark session
- Temporary views created with createOrReplaceTempView() are only available within the current Spark session and are deleted when the session ends.
C) Common Table Expression (CTE)
- Common Table Expressions (CTEs) defined with the WITH clause allow you to define auxiliary statements that can be referenced within a larger query.
B) year(timestamp_column)
- The year() function extracts the year component from a timestamp column in Spark SQL.
B) dropDuplicates([“column1”, “column2”])
- The dropDuplicates() function removes duplicate records based on the specified columns.
B) Records with duplicate values in column_name
- This query groups by column_name and finds groups with more than one record, which identifies duplicate values in that column.
B) explode()
- The explode() function transforms each element in an array column into a separate row, maintaining all other columns.
B) from_json()
- The from_json() function parses a JSON string into a structured column based on a provided schema.
C) PIVOT
- The PIVOT keyword in Spark SQL converts data from a tall (normalized) format to a wide format.
B) CREATE OR REPLACE FUNCTION name(param type) RETURNS type RETURN expression
- This is the correct syntax for creating a SQL user-defined function in Databricks.
A) CASE WHEN column IS NULL THEN ‘Unknown’ ELSE column END
- To check for NULL values in SQL, you must use the IS NULL operator, not = NULL.
B) SELECT COUNT() FROM (SELECT product_id, COUNT() FROM table GROUP BY product_id HAVING COUNT(*) > 1)
- This query counts how many product_id values appear more than once. If the result is 0, then all product_id values are unique.

Section 3: Incremental Data Processing

B) Concurrency
- The ACID properties are Atomicity, Consistency, Isolation, and Durability. Concurrency is a related concept but not one of the core ACID properties.
C) When a managed table is dropped, both data and metadata are deleted
- The key difference is that dropping a managed table deletes both the data and metadata, while dropping an external table only deletes the metadata, leaving the data intact.
B) DESCRIBE HISTORY table_name
- The DESCRIBE HISTORY command shows the history of operations performed on a Delta table.
C) To co-locate related data in the same files
- Z-ordering co-locates related data within files, improving query performance by reducing the amount of data that needs to be read for certain queries.
B) Permanently removes files no longer needed by the table
- The VACUUM command permanently removes files that are no longer needed by the table based on the retention period.
B) ALTER TABLE table_name ADD COLUMN tax DOUBLE GENERATED ALWAYS AS (price * 0.07)
- This command adds a generated column that automatically calculates tax as 7% of the price.
A) CREATE OR REPLACE TABLE can change the schema, while INSERT OVERWRITE preserves it
- CREATE OR REPLACE TABLE can create a table with a different schema, while INSERT OVERWRITE preserves the existing schema and only replaces the data.
C) For upserting data (update + insert in a single operation)
- The MERGE command allows you to perform both update and insert operations in a single atomic transaction, commonly known as an upsert.
A) It automatically detects and skips previously loaded files
- The COPY INTO command tracks which files have been loaded and automatically skips them in subsequent runs, making it idempotent.
B) Auto Loader
- Auto Loader is designed to efficiently process new files as they arrive without having to list all files in a directory.
B) Drops records that violate the constraint
- The @dlt.expect_or_drop annotation drops individual records that violate the specified constraint.
B) To process CDC data and apply changes to a target table
- The APPLY CHANGES INTO operation processes Change Data Capture (CDC) data and applies the changes (inserts, updates, deletes) to a target table.

Section 4: Production Pipelines

C) They organize complex workflows with dependencies between tasks
- Multi-task workflows allow you to organize complex data pipelines with clearly defined dependencies between tasks.
B) The dependent tasks will not run
- If a task fails in a multi-task job, any tasks that depend on it will not run.
A) 30 3 * * 1
- In CRON expressions, the format is minute hour day-of-month month day-of-week. 30 3 * * 1 means 3:30 AM every Monday (1 represents Monday).
B) Using the dbutils.jobs.taskValues API
- The dbutils.jobs.taskValues API allows tasks to share values with downstream tasks in a multi-task workflow.
A) retry_on_failure: { max_retries: 3, min_duration_between_retries_seconds: 60 }
- This is the correct configuration for automatically retrying a task if it fails, with up to 3 retries and a 60-second delay between attempts.
B) By adding email addresses to the job’s notification settings
- Email notifications for jobs are configured in the job’s notification settings, where you can specify email addresses to be notified on success or failure.

Section 5: Data Governance

B) catalog.schema.table
- The three-level namespace in Unity Catalog is represented as catalog.schema.table.
B) The top-level container that holds catalogs
- In Unity Catalog, a metastore is the top-level container that holds catalogs, which in turn hold schemas (databases).
B) For automated processes and scheduled jobs
- Service principals should be used for automated processes and scheduled jobs rather than interactive user actions.
C) Unity Catalog-enabled mode
- Clusters must be in Unity Catalog-enabled mode to access data governed by Unity Catalog.
B) GRANT SELECT ON TABLE catalog.schema.table TO user@example.com
- The GRANT SELECT command gives a user permission to query (read) a table.
C) Segregating data by business unit across different catalogs
- A best practice for organizing data in Unity Catalog is to segregate data by business unit or department across different catalogs.

Second Full-Length Mock Exam (90 minutes)

Now let’s take a second full-length practice exam to further assess your knowledge and readiness for the certification.

Section 1: Databricks Lakehouse Platform

Which statement accurately describes the relationship between a data lakehouse and a data warehouse?
- A) A data lakehouse is always less expensive than a data warehouse
- B) A data lakehouse combines the low-cost storage of data lakes with the data management features of data warehouses
- C) A data lakehouse requires proprietary hardware, while a data warehouse runs on commodity hardware
- D) A data warehouse supports structured and unstructured data, while a data lakehouse only supports structured data
In the medallion architecture, which layer would contain aggregated data ready for business intelligence tools?
- A) Bronze
- B) Silver
- C) Gold
- D) Raw
Which component of the Databricks platform architecture resides in the customer’s cloud account?
- A) Control plane
- B) Data plane
- C) Metastore
- D) Web application
How are clusters software versions managed in Databricks?
- A) Through manual installation of packages on each node
- B) Using the Databricks Runtime (DBR)
- C) By specifying package versions in notebook cells
- D) Through external package repositories
What happens to running commands when a cluster is restarted?
- A) They continue running without interruption
- B) They are paused and resume after restart
- C) They are terminated and need to be executed again
- D) They are automatically queued for execution after restart
Which of the following is TRUE about using multiple languages within the same notebook?
- A) Each notebook can only use one primary language
- B) You can switch languages using magic commands like %sql or %python
- C) You must create separate notebooks for each language
- D) Language switching requires cluster reconfiguration
How can you execute one notebook from within another notebook?
- A) Using the import statement
- B) Using the %run command
- C) Using the %include command
- D) Using the source() function
What is a limitation of Databricks Notebooks version control compared to Databricks Repos?
- A) Notebooks version control doesn’t track changes
- B) Notebooks version control can’t revert to previous versions
- C) Notebooks version control lacks branching and merging capabilities
- D) Notebooks version control only works for Python notebooks
How can notebooks be shared with others in Databricks?
- A) Only through email attachments
- B) By configuring permissions in the notebook’s sharing settings
- C) By converting them to PDF files
- D) Through external file sharing services only

Section 2: ELT with Apache Spark

What is the correct syntax to extract data from a CSV file with a header row?
- A) spark.read.csv("path/to/file.csv", header=True)
- B) spark.read.format("csv").option("header", "true").load("path/to/file.csv")
- C) spark.read.csv("path/to/file.csv", firstRowAsHeader=True)
- D) spark.read.format("csv").option("firstRow", "header").load("path/to/file.csv")
What is the difference between a temporary view and a global temporary view?
- A) Temporary views persist across Spark sessions, global temporary views don’t
- B) Global temporary views are accessible across all sessions within the same Spark application
- C) Temporary views support SQL queries, global temporary views support only DataFrame operations
- D) Global temporary views persist after cluster termination, temporary views don’t
Which SQL feature allows you to define named queries that can be referenced multiple times within the same statement?
- A) Subqueries
- B) Temporary Views
- C) Common Table Expressions (CTEs)
- D) User-Defined Functions (UDFs)
How would you validate that a customer_id is always associated with the same email address?
- A) SELECT customer_id, COUNT(DISTINCT email) FROM customers GROUP BY customer_id HAVING COUNT(DISTINCT email) > 1
- B) SELECT customer_id, email FROM customers WHERE email IS NOT NULL
- C) SELECT DISTINCT customer_id, email FROM customers
- D) SELECT customer_id FROM customers GROUP BY customer_id HAVING COUNT(*) > 1
Which function is used to cast a string column to a timestamp in Spark SQL?
- A) CAST(column AS TIMESTAMP)
- B) to_timestamp(column)
- C) timestamp(column)
- D) convert_to_timestamp(column)
How would you extract the day of the week from a date column?
- A) day_of_week(date_column)
- B) dayofweek(date_column)
- C) extract(DOW FROM date_column)
- D) date_column.day
Which function would you use to extract a specific pattern from a string using regular expressions?
- A) extract_pattern()
- B) regex_extract()
- C) regexp_extract()
- D) pattern_match()
How do you access nested data using dot notation in Spark SQL?
- A) SELECT customer.name FROM customers
- B) SELECT name FROM customers.customer
- C) SELECT customer->name FROM customers
- D) SELECT customers.customer.name FROM customers
Which array function allows you to check if an array contains a specific value?
- A) contains(array, value)
- B) array_contains(array, value)
- C) has_value(array, value)
- D) exists(array, value)
How would you parse a JSON string into a struct in Spark?
- A) json_parse(json_column, schema)
- B) from_json(json_column, schema)
- C) parse_json(json_column, schema)
- D) to_struct(json_column, schema)
What is the purpose of the PIVOT clause in SQL?
- A) To transpose rows and columns
- B) To filter data based on aggregated values
- C) To convert data from a long format to a wide format
- D) To join tables on multiple columns
How do you define a SQL UDF (User-Defined Function)?
- A) CREATE FUNCTION function_name(param type) RETURNS type AS expression
- B) CREATE OR REPLACE FUNCTION function_name(param type) RETURNS type RETURN expression
- C) DEFINE FUNCTION function_name(param type) AS expression
- D) CREATE UDF function_name(param type) RETURNS type USING expression

Section 3: Incremental Data Processing

Which of the following is a key benefit of ACID transactions in Delta Lake?
- A) Faster query performance compared to Parquet
- B) Ability to handle larger datasets than traditional formats
- C) Reliable data operations even during concurrent modifications
- D) Support for unstructured data formats
How can you identify whether a table is managed or external?
- A) By checking if the table name starts with “external_”
- B) By running DESCRIBE FORMATTED table_name and checking the Type property
- C) By checking if the table is stored in Delta format
- D) By running SELECT is_managed FROM table_name
How would you query a specific version of a Delta table?
- A) SELECT * FROM table_name.history(3)
- B) SELECT * FROM table_name VERSION AS OF 3
- C) SELECT * FROM table_name@v3
- D) SELECT * FROM table_name WHERE version = 3
What is the benefit of Z-ordering in Delta Lake?
- A) It encrypts sensitive data
- B) It compresses data to save storage space
- C) It improves query performance by co-locating related data
- D) It splits data across multiple storage locations for redundancy
What does the OPTIMIZE command do in Delta Lake?
- A) It reduces the size of Delta log files
- B) It compacts small files into larger ones
- C) It encrypts data for security
- D) It creates statistics for the query optimizer
What is the retention period for Delta table history by default?
- A) 7 days
- B) 14 days
- C) 30 days
- D) 90 days
Which statement is TRUE about generated columns in Delta tables?
- A) Generated columns are computed at query time
- B) Generated columns are stored physically in the table
- C) Generated columns can only use built-in functions
- D) Generated columns can be updated directly
When would you use INSERT OVERWRITE instead of MERGE?
- A) When you need to perform conditional updates
- B) When you want to completely replace the data with a new dataset
- C) When you need to insert and update in the same operation
- D) When you need to process CDC data
What makes the COPY INTO command useful for data ingestion?
- A) It’s faster than MERGE for large datasets
- B) It supports transformations during ingestion
- C) It loads only new files that haven’t been processed yet
- D) It allows selective column loading
Which component is necessary to create a new Delta Live Tables pipeline?
- A) A target database and notebook libraries
- B) A streaming source and sink
- C) An Auto Loader configuration
- D) A predefined schema
What is the difference between triggered and continuous pipelines in Delta Live Tables?
- A) Triggered pipelines support streaming sources, continuous pipelines don’t
- B) Continuous pipelines have lower latency but higher cost
- C) Triggered pipelines support constraints, continuous pipelines don’t
- D) Continuous pipelines can only read from Delta tables
What happens by default when a constraint is violated in Delta Live Tables?
- A) The entire pipeline fails
- B) The violating record is logged but still processed
- C) The violating record is automatically fixed
- D) The violating record is sent to a quarantine table

Section 4: Production Pipelines

What is the main benefit of using multi-task workflows in Databricks Jobs?
- A) They enable parallel processing across multiple clusters
- B) They allow defining dependencies between tasks
- C) They provide better security isolation
- D) They reduce the cost of job execution
If Task B depends on Task A in a multi-task workflow, and Task A fails, what happens to Task B?
- A) Task B runs anyway
- B) Task B does not run
- C) Task B runs with limited functionality
- D) Task B runs only if manually triggered
Which CRON expression would schedule a job to run at 15 minutes past every hour?
- A) 15 * * * *
- B) * 15 * * *
- C) */15 * * * *
- D) 0 */15 * * *
How can you review a task’s execution history in Databricks Jobs?
- A) Through the task’s properties panel
- B) In the job’s run history section
- C) Using the DESCRIBE HISTORY command
- D) By querying the jobs API directly
What is the purpose of setting up a retry policy for a job task?
- A) To automatically retry the task if it fails due to transient issues
- B) To retry the task with different parameters
- C) To skip the task after multiple failures
- D) To pause the job for manual intervention
How can you create an alert for a failed task in Databricks Jobs?
- A) By configuring email notifications in the job settings
- B) By writing custom alerting code in the task
- C) By setting up a monitoring dashboard
- D) By creating a separate alerting task

Section 5: Data Governance

What is Unity Catalog in Databricks?
- A) A tool for cataloging all available datasets
- B) A unified governance solution for data, analytics, and AI
- C) A visual interface for exploring table schemas
- D) A system for cataloging machine learning models
What is the relationship between metastores and catalogs in Unity Catalog?
- A) Metastores and catalogs are synonyms
- B) A metastore contains catalogs, which contain schemas
- C) A catalog contains metastores, which contain schemas
- D) Metastores and catalogs exist at the same level of hierarchy
What is a service principal in the context of Databricks?
- A) A human user with administrative privileges
- B) A non-human identity used for automated processes
- C) A type of compute cluster optimized for services
- D) A security protocol for API access
Which cluster security mode is compatible with Unity Catalog?
- A) Standard security mode
- B) High concurrency mode
- C) Unity Catalog-enabled mode
- D) Isolation security mode
How would you implement data object access control in Unity Catalog?
- A) By setting permissions at the workspace level
- B) Using the GRANT command to assign permissions to users and groups
- C) By configuring access control lists in the data files
- D) Through the cluster configuration
What is considered a best practice for Unity Catalog deployment?
- A) Creating a single catalog for all data
- B) Storing sensitive data in external locations
- C) Segregating business units across different catalogs
- D) Duplicating data across multiple catalogs for redundancy

Answers to Second Full-Length Mock Exam

Section 1: Databricks Lakehouse Platform

B) A data lakehouse combines the low-cost storage of data lakes with the data management features of data warehouses
- The data lakehouse architecture brings together the cost-efficient storage of data lakes with the reliability, governance, and performance features of data warehouses.
C) Gold
- In the medallion architecture, the Gold layer contains business-level aggregated data that’s optimized for consumption by analysts and BI tools.
B) Data plane
- In the Databricks architecture, the data plane (which includes compute clusters and storage) resides in the customer’s cloud account, while the control plane is managed by Databricks.
B) Using the Databricks Runtime (DBR)
- The Databricks Runtime (DBR) is a packaged distribution that includes Apache Spark and other optimized components for different workloads.
C) They are terminated and need to be executed again
- When a cluster is restarted, all running commands are terminated and will need to be executed again.
B) You can switch languages using magic commands like %sql or %python
- Magic commands like %sql, %python, %scala, and %r enable language switching within a single notebook.
B) Using the %run command
- The %run command allows you to execute one notebook from within another, making variables and functions defined in the called notebook available in the calling notebook.
C) Notebooks version control lacks branching and merging capabilities
- While the built-in notebook version control tracks changes, it doesn’t support Git operations like branching and merging that are available in Databricks Repos.
B) By configuring permissions in the notebook’s sharing settings
- Notebooks can be shared with specific users or groups by configuring permissions in the notebook’s sharing settings.

Section 2: ELT with Apache Spark

B) spark.read.format("csv").option("header", "true").load("path/to/file.csv")
- This is the correct syntax for reading a CSV file with a header row in Spark.
B) Global temporary views are accessible across all sessions within the same Spark application
- While temporary views are only available in the current session, global temporary views can be accessed across all sessions within the same Spark application.
C) Common Table Expressions (CTEs)
- CTEs, defined with the WITH clause, allow you to define named query blocks that can be referenced multiple times within a query.
A) SELECT customer_id, COUNT(DISTINCT email) FROM customers GROUP BY customer_id HAVING COUNT(DISTINCT email) > 1
- This query finds customer IDs associated with multiple distinct email addresses, helping validate the one-to-one relationship.
B) to_timestamp(column)
- The to_timestamp() function is used to convert a string column to a timestamp datatype.
B) dayofweek(date_column)
- The dayofweek() function extracts the day of the week from a date column.
C) regexp_extract()
- regexp_extract() is the correct function for extracting patterns from strings using regular expressions.
A) SELECT customer.name FROM customers
- When accessing nested data, dot notation is used to navigate the structure hierarchy.
B) array_contains(array, value)
- The array_contains() function checks whether an array column contains a specific value.
B) from_json(json_column, schema)
- The from_json() function parses a JSON string into a structured column based on the provided schema.
C) To convert data from a long format to a wide format
- The PIVOT clause transforms data from a normalized (long) format to a denormalized (wide) format, typically for reporting purposes.
B) CREATE OR REPLACE FUNCTION function_name(param type) RETURNS type RETURN expression
- This is the correct syntax for creating a SQL user-defined function in Databricks.

Section 3: Incremental Data Processing

C) Reliable data operations even during concurrent modifications
- ACID transactions ensure that data operations are reliable and consistent, even when multiple users or processes are modifying the data concurrently.
B) By running DESCRIBE FORMATTED table_name and checking the Type property
- The DESCRIBE FORMATTED or DESCRIBE EXTENDED commands show detailed table information, including whether it’s managed or external.
B) SELECT * FROM table_name VERSION AS OF 3
- This is the correct syntax for querying a specific historical version of a Delta table.
C) It improves query performance by co-locating related data
- Z-ordering co-locates related data in the same files, improving query performance by reducing the amount of data that needs to be read.
B) It compacts small files into larger ones
- The OPTIMIZE command combines small files into larger ones to improve query performance and reduce file system overhead.
C) 30 days
- By default, Delta Lake retains 30 days of transaction history, which can be adjusted using table properties.
B) Generated columns are stored physically in the table
- Generated columns are computed at write time and stored physically in the table, not calculated at query time.
B) When you want to completely replace the data with a new dataset
- INSERT OVERWRITE is best used when you want to replace all data in a table or partition, while preserving the schema.
C) It loads only new files that haven’t been processed yet
- COPY INTO tracks which files have been loaded and automatically skips previously processed files, making it idempotent.
A) A target database and notebook libraries
- To create a Delta Live Tables pipeline, you need to specify a target database and notebook libraries containing the transformation logic.
B) Continuous pipelines have lower latency but higher cost
- Continuous pipelines process data as it arrives (lower latency) but consume more resources (higher cost) than triggered pipelines.
A) The entire pipeline fails
- By default, a constraint violation causes the entire pipeline to fail unless explicitly configured otherwise.

Section 4: Production Pipelines

B) They allow defining dependencies between tasks
- Multi-task workflows enable you to define complex data pipelines with clear dependencies between tasks.
B) Task B does not run
- If a task fails, all dependent tasks will not run, maintaining the integrity of the workflow.
A) 15 * * * *
- In CRON syntax, this expression means “at minute 15 of every hour,” following the pattern: minute hour day-of-month month day-of-week.
B) In the job’s run history section
- You can review a task’s execution details, logs, and performance in the job’s run history section.
A) To automatically retry the task if it fails due to transient issues
- Retry policies allow tasks to automatically retry after failure, which is useful for handling transient issues like network problems.
A) By configuring email notifications in the job settings
- Email notifications for job failures can be configured in the job settings, specifying recipients and conditions.

Section 5: Data Governance

B) A unified governance solution for data, analytics, and AI
- Unity Catalog provides a centralized governance solution across clouds, workspaces, and the full range of data assets.
B) A metastore contains catalogs, which contain schemas
- In Unity Catalog, the hierarchy is: metastore → catalogs → schemas (databases) → objects.
B) A non-human identity used for automated processes
- Service principals are non-human identities used for automated processes and applications rather than individual users.
C) Unity Catalog-enabled mode
- Clusters must be in Unity Catalog-enabled mode to access data governed by Unity Catalog.
B) Using the GRANT command to assign permissions to users and groups
- Access control in Unity Catalog is implemented using GRANT commands to assign permissions to specific securable objects.
C) Segregating business units across different catalogs
- A best practice for Unity Catalog deployment is to organize data by business unit or department across different catalogs.

Comprehensive Review and Preparation Strategy

Key Exam Strategies

Time Management:
- You have 90 minutes to answer 45 questions, which gives you approximately 2 minutes per question
- Don’t spend too much time on difficult questions; mark them and come back later
- Complete a first pass through all questions, then review any marked questions
Question Analysis:
- Read each question carefully to identify what is being asked
- Pay attention to key words like “NOT”, “EXCEPT”, “PRIMARY” that can change the meaning of a question
- Eliminate obviously incorrect answers first to narrow down your choices
Technical Focus Areas:
- Delta Lake operations and when to use them
- Incremental data loading patterns (MERGE, COPY INTO, etc.)
- Unity Catalog structure and permissions
- Databricks Jobs configuration and dependencies
Common Pitfalls to Avoid:
- Confusing temporary views with persistent tables
- Mixing up the syntax for different SQL operations
- Forgetting the three-level namespace structure in Unity Catalog
- Misunderstanding when to use different incremental loading patterns

Final Preparation Checklist

✅ Review all exam sections, especially those where you scored lower on practice exams

✅ Memorize key commands and syntax for common operations

✅ Understand the core concepts behind each technology rather than just memorizing facts

✅ Practice time management to ensure you can complete all questions

✅ Get a good night’s sleep before the exam

✅ Arrive early if taking the exam at a testing center, or ensure your environment is ready for online proctoring

‹ Databricks Data Analyst Associate Certification

Databricks Data Engineer Professional Certification ›

Databricks Data Engineer Associate Certification

Table of contents

Preparation Plan

Foundations & Databricks Lakehouse Platform

ELT with Apache Spark

Incremental Data Processing

Production Pipelines & Data Governance

Knowledge Base

Foundations & Databricks Lakehouse Platform

Databricks Fundamentals

Introduction to Databricks

Lakehouse Architecture Fundamentals

Databricks Clusters and Runtimes

Advanced Databricks Features

Version Control with Databricks Repos

Multi-Language Notebook Development

Hands-On Practice: Creating and Managing Databricks Resources

Assessment Quiz

Assessment Quiz Answers

Hands-On Exercises

Exercise 1: Setting Up Your Databricks Environment

Creating and Configuring a Cluster

Exploring the Databricks Workspace

Exercise 2: Working with Notebooks

Creating Your First Notebook

Multi-Language Notebook Operations

Exercise 3: Implementing a Simple Medallion Architecture

Step 1: Generate Sample Data (Bronze Layer)

Step 2: Clean and Validate Data (Silver Layer)

Step 3: Create Aggregated Business Views (Gold Layer)

Exercise 4: Running Notebooks from Another Notebook

Exercise 5: Working with Databricks Repos (Version Control)

Key Takeaways

ELT with Apache Spark

Data Extraction and Basic Transformations

Data Extraction Techniques

Extracting Data from Files

Extracting Data from JDBC Sources

Temporary Tables, Views, and CTEs

Temporary Views

Global Temporary Views

Common Table Expressions (CTEs)

Transformation Fundamentals with Spark SQL

Basic Transformations

Aggregations

Working with DataFrames and Tables

Converting DataFrames to Delta Tables

Reading from Delta Tables

Hands-on Exercise: Building an ETL Pipeline

Advanced Transformations

Data Deduplication Techniques

Identifying Duplicates

Removing Duplicates

Data Validation Techniques

Primary Key Validation

Foreign Key Validation

Working with Complex Data Types

Handling Timestamps

Working with JSON and Nested Structures

SQL User-Defined Functions (UDFs)

Control Flow in Spark SQL

Hands-on Exercise: Advanced Transformations

Assessment Quiz

Answers to Assessment Quiz:

Key Takeaways

Incremental Data Processing

Delta Lake and ACID Transactions

Delta Lake Fundamentals

What is Delta Lake?

Delta Lake Architecture

ACID Transactions in Delta Lake

Benefits of ACID Transactions

Data and Metadata Management

Understanding Data vs. Metadata

Managed vs. External Tables

Hands-On Exercise: Delta Lake Operations

Delta Lake Operations and Optimizations

Table History and Time Travel

Optimizing Delta Tables

Z-Ordering