Databricks SQL is a service within the Databricks platform designed specifically for data analysis using SQL. It provides a familiar SQL interface while leveraging the power of Apache Spark in the background. As a data analyst, this tool allows you to query large datasets efficiently without needing to understand the complexities of distributed computing.
Key Audience and Purpose:
Databricks SQL primarily serves data analysts who need to query large datasets, build visualizations and dashboards, and share insights with business stakeholders.
The service bridges the gap between traditional SQL analytics and modern big data processing, allowing analysts to work with massive datasets that wouldn’t be feasible in traditional database systems.
Components of Databricks SQL include the SQL editor, SQL warehouses (compute), Data Explorer and the schema browser, dashboards, and alerts, all of which are covered in the days ahead.
The Lakehouse architecture combines the best aspects of data lakes and data warehouses, providing:
Traditional Data Architecture Challenges:
Lakehouse Benefits:
The Medallion Architecture:
This is a data organization approach within the Lakehouse that creates progressively refined data through three layers: Bronze (raw ingested data), Silver (cleaned and validated data), and Gold (business-ready, aggregated data).
As a data analyst, you’ll most frequently work with Gold layer tables, but understanding the entire architecture helps you navigate the data landscape and troubleshoot when needed.
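To make the flow concrete, here is a minimal sketch of a Silver-to-Gold refinement step; the silver and gold schema names and the table columns are hypothetical:
-- Aggregate cleaned Silver data into a business-ready Gold table
CREATE OR REPLACE TABLE gold.daily_sales AS
SELECT
order_date,
region,
SUM(order_amount) AS total_sales,
COUNT(DISTINCT customer_id) AS unique_customers
FROM silver.orders
GROUP BY order_date, region;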
Databricks SQL uses a dialect that’s compatible with ANSI SQL, with extensions that leverage Apache Spark’s capabilities. Here’s what you need to know:
Basic Query Structure:
SELECT column1, column2, function(column3) AS derived_value
FROM database_name.table_name
WHERE condition
GROUP BY column1
HAVING group_condition
ORDER BY column1 [ASC|DESC]
LIMIT number;
Databricks SQL Differentiators:
Example Basic Query:
SELECT
customer_id,
SUM(order_amount) AS total_spend,
COUNT(order_id) AS order_count,
AVG(order_amount) AS average_order
FROM sales.orders
WHERE order_date >= '2022-01-01'
GROUP BY customer_id
HAVING COUNT(order_id) > 5
ORDER BY total_spend DESC
LIMIT 100;
SQL Warehouses (also called endpoints) are the compute resources that execute your SQL queries in Databricks. Understanding their configuration is essential for both performance and cost management.
Key Concepts:
Serverless vs. Classic: Serverless warehouses start quickly and don’t require management, making them ideal for most analytical workloads.
Sizing Options: Warehouses range from 2X-Small to 4X-Large, with larger sizes providing more cores and memory.
Auto-Stopping: Warehouses can be configured to automatically stop after a period of inactivity (reducing costs).
Scaling: Warehouses can scale up and down based on query load.
Practical Exercise 1: Configuring a SQL Warehouse
Steps to configure your first SQL warehouse:
Considerations for Sizing:
As a data analyst, you’ll need to understand how data enters the Databricks environment. There are several methods available:
Small-File Upload: upload small CSV or similar files directly through the UI; best suited for reference tables and lookup data.
Object Storage Import: create tables over, or load data from, files that already live in cloud object storage such as S3 or ADLS (see the sketch after this list).
Partner Connect: use pre-built integrations with third-party tools to ingest data from external systems.
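For the object storage import path, one option is to stay entirely in SQL by defining an external table over files that already sit in cloud storage. A sketch, assuming CSV files at a hypothetical S3 location:
-- Create an external table over CSV files in object storage
CREATE TABLE staging_sales
USING CSV
OPTIONS ('header' = 'true', 'inferSchema' = 'true')
LOCATION 's3://your-bucket/raw-sales/';
-- Preview the imported data
SELECT * FROM staging_sales LIMIT 10;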
Practical Exercise 2: Importing a Small CSV File
Partner Connect allows you to easily integrate with third-party tools for data ingestion, visualization, and other services.
Key Integration Categories:
Benefits of Partner Connect:
Practical Exercise 3: Exploring Available Partner Integrations
Practical Exercise 4: Using the Query Editor and Schema Browser
SELECT * FROM samples.nyctaxi.trips LIMIT 10;
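If you prefer to explore the same sample data with SQL instead of the schema browser, these commands list the available tables and columns (the samples.nyctaxi schema is available in most Databricks workspaces):
-- List tables in the sample schema
SHOW TABLES IN samples.nyctaxi;
-- Inspect column names and types
DESCRIBE TABLE samples.nyctaxi.trips;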
Answer the following questions to test your understanding of today’s material:
Which layer of the medallion architecture would a data analyst most commonly work with? a) Bronze b) Silver c) Gold d) Platinum
What is the primary benefit of using Serverless SQL warehouses in Databricks? a) They support more complex queries b) They start quickly and require no management c) They cost significantly less than classic warehouses d) They can process larger datasets
Which statement about Databricks SQL is FALSE? a) It allows analysts to query data using standard SQL syntax b) It can only work with structured data in table format c) It can create visualizations directly from query results d) It uses Apache Spark for distributed processing
When would small-file upload be the most appropriate data ingestion method? a) For streaming data sources b) For multi-terabyte datasets c) For reference tables and lookup data d) For real-time sensor data
The Lakehouse architecture combines elements of: a) Data lakes and data marts b) Data warehouses and databases c) Data lakes and data warehouses d) Data marts and data warehouses
Which layer of the medallion architecture would a data analyst most commonly work with? Answer: c) Gold
The Gold layer contains business-ready data that has been refined, aggregated, and optimized specifically for analytics use cases. Data analysts typically work with Gold layer tables because they provide the most accessible, clean, and performance-optimized data for business intelligence and reporting.
What is the primary benefit of using Serverless SQL warehouses in Databricks? Answer: b) They start quickly and require no management
Serverless SQL warehouses eliminate the need for infrastructure management while providing fast start-up times. This makes them ideal for analytical workloads that require immediate availability without the overhead of cluster management.
Which statement about Databricks SQL is FALSE? Answer: b) It can only work with structured data in table format
This statement is false because Databricks SQL can work with various data types including complex and semi-structured data (arrays, maps, structs) across different sources. It’s not limited to just structured tabular data.
When would small-file upload be the most appropriate data ingestion method? Answer: c) For reference tables and lookup data
Small-file upload is best suited for smaller datasets like reference tables or lookup data that complement your main analytical datasets. These typically include dimension tables, configuration data, or other supplementary information.
The Lakehouse architecture combines elements of: Answer: c) Data lakes and data warehouses
The Lakehouse architecture specifically merges the flexibility and scalability of data lakes with the reliability, governance, and performance capabilities of traditional data warehouses into a unified platform.
To reinforce learning:
Delta Lake is an open-source storage layer that brings reliability to data lakes. As a critical component of the Databricks Lakehouse Platform, Delta Lake provides essential capabilities that make your data management more robust and efficient.
Core Delta Lake Benefits:
ACID Transactions: Delta Lake ensures atomicity, consistency, isolation, and durability for all operations, preventing data corruption during concurrent operations.
Metadata Management: Delta Lake maintains comprehensive metadata about your tables, including schema information, partitioning details, and transaction history.
Time Travel: Delta Lake preserves the history of data changes, allowing you to access and query previous versions of your data using timestamps or version numbers (see the example after this list).
Schema Enforcement and Evolution: Delta Lake validates that incoming data adheres to the table’s schema and supports controlled schema changes over time.
Unified Batch and Streaming: Delta Lake provides a consistent way to work with both batch and streaming data sources.
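As an example of the Time Travel capability noted above, earlier versions of a Delta table can be queried directly; a short sketch, reusing the sales.orders table from the earlier examples:
-- Query the table as of an earlier version number
SELECT * FROM sales.orders VERSION AS OF 3;
-- Query the table as of a point in time
SELECT * FROM sales.orders TIMESTAMP AS OF '2023-01-01';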
Delta Lake Architecture:
Delta Lake tables consist of three main components:
How Delta Lake Works Behind the Scenes:
table_directory/
├── _delta_log/ # Transaction log directory
│ ├── 00000000000000000000.json # First transaction
│ ├── 00000000000000000001.json # Second transaction
│ ├── ...
│ └── 00000000000000000010.checkpoint.parquet # Checkpoint file
├── part-00000-123abc.snappy.parquet # Data file 1
├── part-00001-456def.snappy.parquet # Data file 2
└── ... # More data files
When you perform operations like INSERT, UPDATE, DELETE, or MERGE on a Delta table, Delta Lake writes new Parquet data files for the affected rows and then atomically commits a new JSON entry to the transaction log that adds those files and marks any superseded files as removed. Readers always see the table state described by the latest committed log entry.
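You can observe this behavior from SQL by inspecting the transaction log and, if needed, rolling back; a short sketch with a hypothetical table name:
-- List the commits recorded in the transaction log
DESCRIBE HISTORY sales.orders;
-- Roll the table back to an earlier version
RESTORE TABLE sales.orders TO VERSION AS OF 5;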
In Databricks, understanding the distinction between managed and unmanaged tables is crucial for effective data management.
Managed Tables:
-- Creating a managed table
CREATE TABLE managed_sales (
id INT,
product STRING,
amount DOUBLE
);
Unmanaged (External) Tables:
-- Creating an unmanaged table
CREATE TABLE unmanaged_sales (
id INT,
product STRING,
amount DOUBLE
)
LOCATION 's3://your-bucket/sales-data/';
Key Differences:
| Aspect | Managed Tables | Unmanaged Tables |
|---|---|---|
| Data Files | Managed by Databricks | Managed by you |
| Location | Default Hive location | User-specified location |
| DROP TABLE behavior | Deletes data and metadata | Deletes only metadata |
| Use Case | Transient or temporary data | Persistent data shared across tools |
| Data Lifecycle | Tied to table lifecycle | Independent of table lifecycle |
Identifying Table Type:
You can determine whether a table is managed or unmanaged by checking the Type row (MANAGED or EXTERNAL) in the output of DESCRIBE EXTENDED, or by viewing the table details in Data Explorer:
DESCRIBE EXTENDED table_name;
Effective data management requires familiarity with key database and table operations in Databricks SQL.
Database Operations:
1. Creating Databases:
-- With location specified
CREATE DATABASE marketing_db
LOCATION 's3://your-bucket/marketing-data/';
-- Creating if not exists
CREATE DATABASE IF NOT EXISTS finance_db;
2. Using Databases:
-- Set current database
USE sales_db;
-- Qualify table names with database
SELECT * FROM marketing_db.campaigns;
3. Dropping Databases:
-- Force drop even if it contains tables
DROP DATABASE marketing_db CASCADE;
Table Operations:
1. Creating Tables:
-- Create new empty table
CREATE TABLE customers (
customer_id INT,
name STRING,
email STRING,
signup_date DATE
);
-- Create table from query results
CREATE TABLE active_customers AS
SELECT * FROM customers
WHERE last_activity_date > current_date() - INTERVAL 90 DAYS;
-- Create table with partitioning
CREATE TABLE orders (
order_id INT,
customer_id INT,
amount DOUBLE,
order_date DATE
)
PARTITIONED BY (order_date);
2. Altering Tables:
-- Add a column
ALTER TABLE customers ADD COLUMN phone STRING;
-- Drop a column
ALTER TABLE customers DROP COLUMN phone;
-- Rename a table
ALTER TABLE customers RENAME TO customers_renamed;
3. Dropping Tables:
-- Drop table
DROP TABLE customers;
-- Drop if exists
DROP TABLE IF EXISTS temporary_customers;
Understanding the persistence and scope of tables is essential for managing your data environment effectively.
Persistence Levels:
-- Create a temporary view (session-scoped; Databricks SQL uses temporary views
-- rather than temporary tables for session-level objects)
CREATE TEMPORARY VIEW session_metrics AS
SELECT user_id, COUNT(*) AS page_views
FROM page_events
GROUP BY user_id;
-- Create a global temporary view (visible to other sessions on the same cluster)
CREATE GLOBAL TEMPORARY VIEW app_metrics AS
SELECT app_id, AVG(response_time) AS avg_response
FROM app_events
GROUP BY app_id;
-- Access global temporary view via the global_temp schema
SELECT * FROM global_temp.app_metrics;
Database Scope Control:
When creating databases, you can use the LOCATION parameter to control where the database stores its tables:
-- Create database with custom location
CREATE DATABASE sales_db
LOCATION 's3://your-bucket/sales-data/';
This affects all managed tables created in this database - they will be stored at the specified location rather than the default Hive warehouse directory.
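To confirm where a database will place its managed tables, you can inspect its properties; for example:
-- The output includes the database's storage location
DESCRIBE DATABASE EXTENDED sales_db;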
Views are virtual tables based on the result set of a SQL statement. They don’t store data themselves but provide a way to simplify complex queries and control access to data.
Types of Views in Databricks:
-- Create a permanent view
CREATE VIEW customer_summary AS
SELECT
region,
COUNT(DISTINCT customer_id) AS customer_count,
SUM(total_spend) AS region_revenue
FROM customers
GROUP BY region;
-- Create a temporary view
CREATE TEMPORARY VIEW active_users AS
SELECT user_id, last_activity_date
FROM users
WHERE last_activity_date > current_date() - INTERVAL 30 DAYS;
-- Create a global temporary view
CREATE GLOBAL TEMPORARY VIEW quarterly_sales AS
SELECT
QUARTER(order_date) AS quarter,
SUM(amount) AS total_sales
FROM orders
WHERE YEAR(order_date) = YEAR(current_date())
GROUP BY QUARTER(order_date);
-- Access global temporary view
SELECT * FROM global_temp.quarterly_sales;
Comparing Views and Tables:
| Aspect | Views | Tables |
|---|---|---|
| Data Storage | No data storage (query definition only) | Stores actual data |
| Update Frequency | Reflects current data when queried | Reflects data at time of creation/update |
| Performance | Computed each time when queried | Direct data access (faster) |
| Storage Space | Minimal (only metadata) | Requires space for full dataset |
| Use Case | Simplifying complex queries, access control | Storing persistent datasets |
Practical Exercise 1: Creating and Using Views
CREATE VIEW sales_by_region AS
SELECT
region,
product_category,
SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY region, product_category;
CREATE TEMPORARY VIEW high_value_transactions AS
SELECT *
FROM transactions
WHERE amount > 10000;
SELECT
hvt.transaction_id,
hvt.amount,
sbr.region,
sbr.total_sales AS region_total
FROM high_value_transactions hvt
JOIN sales_by_region sbr
ON hvt.region = sbr.region
WHERE hvt.transaction_date > '2023-01-01';
Data Explorer in Databricks SQL provides a graphical interface for browsing and managing databases, tables, and views.
Key Features of Data Explorer:
Practical Exercise 2: Exploring Data Objects
Proper security management is crucial for protecting sensitive data and ensuring appropriate access.
Key Security Concepts:
-- Grant select access
GRANT SELECT ON TABLE customer_data TO user@example.com;
-- Grant all privileges
GRANT ALL PRIVILEGES ON DATABASE marketing_db TO group_marketing;
-- Revoke access
REVOKE SELECT ON TABLE financial_data FROM user@example.com;
-- Show grants
SHOW GRANTS ON TABLE customer_data;
Handling PII Data:
Personally Identifiable Information (PII) requires special handling:
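One common pattern is a dynamic view that masks PII for users outside an authorized group. A minimal sketch, assuming a customer_data table and a pii_readers group (both hypothetical); is_member() checks the current user's group membership:
-- Expose a masked view instead of the base table
CREATE OR REPLACE VIEW customer_data_masked AS
SELECT
customer_id,
-- Show the raw email only to members of the authorized group
CASE WHEN is_member('pii_readers') THEN email ELSE sha2(email, 256) END AS email,
region
FROM customer_data;
-- Grant SELECT on this view (rather than the base table) using the GRANT syntax shown above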
Practical Exercise 3: Managing Table Security
Test your understanding of today’s material by answering these questions:
What happens to the data files when you drop a managed Delta table? a) Nothing, the files remain in storage b) The files are deleted c) The files are archived for 30 days then deleted d) Only the oldest versions are deleted
Which SQL command would you use to create an unmanaged table? a) CREATE EXTERNAL TABLE customers… b) CREATE TABLE customers… LOCATION ‘…’ c) CREATE UNMANAGED TABLE customers… d) CREATE TABLE customers… EXTERNAL=true
What is the key difference between a temporary view and a permanent view? a) Temporary views cannot be queried by other users b) Temporary views exist only for the current session c) Temporary views do not support complex queries d) Temporary views cannot join with permanent tables
In Delta Lake, what is the purpose of the transaction log? a) To store table statistics b) To record all changes made to the table c) To improve query performance d) To compress data files
Which statement about views is TRUE? a) Views store a copy of the data from the source table b) Views execute their defining query each time they are accessed c) Views cannot be created on top of other views d) Views always perform better than direct table queries
What happens to the data files when you drop a managed Delta table? Answer: b) The files are deleted
With managed tables, Databricks controls both the metadata and the underlying data files. When you drop a managed table, both the metadata in the metastore and all associated data files are deleted permanently. This is an important distinction from unmanaged tables, where dropping the table only removes the metadata reference.
Which SQL command would you use to create an unmanaged table? Answer: b) CREATE TABLE customers… LOCATION ‘…‘
Adding the LOCATION parameter to your CREATE TABLE statement is what designates a table as unmanaged (external). This tells Databricks that you’re managing the storage location of the data files independently, rather than allowing Databricks to determine the storage location.
What is the key difference between a temporary view and a permanent view? Answer: b) Temporary views exist only for the current session
Temporary views are session-scoped objects that automatically disappear when your current session ends. In contrast, permanent views persist in the metastore and remain available across sessions, clusters, and to other users with appropriate permissions.
In Delta Lake, what is the purpose of the transaction log? Answer: b) To record all changes made to the table
The transaction log is the cornerstone of Delta Lake’s ACID compliance. It maintains a sequential record of all operations performed on the table, including inserts, updates, deletes, schema changes, and optimizations. This log enables time travel capabilities and ensures data consistency during concurrent operations.
Which statement about views is TRUE? Answer: b) Views execute their defining query each time they are accessed
Unlike tables, views don’t store data themselves but rather store the query definition. When you query a view, it executes its underlying SQL statement against the source tables each time, always returning the current data from those source tables.
To reinforce learning:
Building on your foundation, today we dive into more sophisticated SQL capabilities in Databricks that allow you to perform complex data transformations and analysis.
Beyond Basic SELECT Statements
While basic SELECT statements form the foundation of SQL, Databricks SQL supports advanced clauses and operations for complex data manipulation:
SELECT
customer_id,
CONCAT(first_name, ' ', last_name) AS full_name,
CASE
WHEN total_purchases > 10000 THEN 'Premium'
WHEN total_purchases > 5000 THEN 'Gold'
WHEN total_purchases > 1000 THEN 'Silver'
ELSE 'Bronze'
END AS customer_tier,
FLOOR(MONTHS_BETWEEN(CURRENT_DATE(), first_purchase_date)) AS customer_tenure_months
FROM customers
WHERE status = 'Active'
AND (region = 'North America' OR region = 'Europe')
AND NOT EXISTS (SELECT 1 FROM complaints WHERE complaints.customer_id = customers.customer_id)
ORDER BY total_purchases DESC
LIMIT 100;
This query demonstrates several advanced techniques: string concatenation with CONCAT, conditional logic with CASE, date arithmetic for customer tenure, compound WHERE conditions, and an anti-join via NOT EXISTS.
Data Modification Operations
Databricks SQL supports several methods for modifying data in tables:
-- Basic insert
INSERT INTO customers (customer_id, first_name, last_name, email)
VALUES (1001, 'John', 'Smith', 'john.smith@example.com');
-- Insert multiple rows
INSERT INTO customers
VALUES
(1002, 'Jane', 'Doe', 'jane.doe@example.com'),
(1003, 'Robert', 'Johnson', 'robert.j@example.com');
-- Insert from query results
INSERT INTO active_customers
SELECT customer_id, first_name, last_name, email
FROM customers
WHERE last_activity_date > CURRENT_DATE() - INTERVAL 90 DAYS;
MERGE INTO customers AS target
USING customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.status = 'Inactive' THEN
DELETE
WHEN MATCHED THEN
UPDATE SET
target.email = source.email,
target.last_updated = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (customer_id, first_name, last_name, email, status, last_updated)
VALUES (
source.customer_id,
source.first_name,
source.last_name,
source.email,
source.status,
CURRENT_TIMESTAMP()
);
COPY INTO sales
FROM 's3://data-bucket/new-sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Comparison of Data Modification Methods
| Method | Primary Use Case | Advantages | Limitations |
|---|---|---|---|
| INSERT INTO | Adding new data | Simple syntax, familiar | Cannot update existing data |
| MERGE INTO | Upserting data (update + insert) | Handles multiple operations in one statement | More complex syntax |
| COPY INTO | Bulk loading from files | Efficient for large data loads | Works with files, not query results |
Joins are fundamental for combining data from multiple tables. Understanding the different types is crucial for accurate analysis.
INNER JOIN
Returns only matching rows from both tables. This is the most common join type.
SELECT
o.order_id,
c.customer_name,
o.order_date,
o.total_amount
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id;
LEFT JOIN (LEFT OUTER JOIN)
Returns all rows from the left table and matching rows from the right table. If there’s no match, NULL values are returned for right table columns.
SELECT
c.customer_id,
c.customer_name,
o.order_id,
o.order_date
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
Use case: Finding all customers and their orders, including customers who haven’t placed any orders.
RIGHT JOIN (RIGHT OUTER JOIN)
Returns all rows from the right table and matching rows from the left table. If there’s no match, NULL values are returned for left table columns.
SELECT
o.order_id,
p.product_name,
o.quantity,
o.unit_price
FROM order_items o
RIGHT JOIN products p
ON o.product_id = p.product_id;
Use case: Finding all products and their orders, including products that haven’t been ordered.
FULL JOIN (FULL OUTER JOIN)
Returns all rows when there’s a match in either the left or right table. NULL values are returned for non-matching sides.
SELECT
e.employee_id,
e.employee_name,
d.department_name
FROM employees e
FULL JOIN departments d
ON e.department_id = d.department_id;
Use case: Finding all employees and departments, including employees not assigned to departments and departments without employees.
CROSS JOIN
Returns the Cartesian product of both tables (every row from the first table combined with every row from the second table).
SELECT
p.product_name,
c.category_name
FROM products p
CROSS JOIN categories c;
Use case: Creating combinations of all possible values, such as a product matrix or date dimension table.
SELF JOIN
Joining a table to itself, typically using different aliases.
SELECT
e.employee_name AS employee,
m.employee_name AS manager
FROM employees e
JOIN employees m
ON e.manager_id = m.employee_id;
Use case: Handling hierarchical data like organizational structures or category hierarchies.
Subqueries (queries nested within other queries) are powerful tools for complex data analysis but require careful optimization.
Types of Subqueries
SELECT
product_name,
price,
(SELECT AVG(price) FROM products) AS average_price,
price - (SELECT AVG(price) FROM products) AS price_difference
FROM products;
SELECT
department_name,
(SELECT COUNT(*) FROM employees WHERE department_id = d.department_id) AS employee_count
FROM departments d;
SELECT
c.customer_name,
o.order_count,
o.total_spent
FROM customers c
JOIN (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id;
SELECT
product_name,
category,
price,
(SELECT AVG(price) FROM products p2 WHERE p2.category = p1.category) AS category_avg_price
FROM products p1
WHERE price > (SELECT AVG(price) FROM products p2 WHERE p2.category = p1.category);
Optimizing Subqueries
Subqueries can impact performance if not used carefully. Here are optimization strategies:
Use JOINs instead of correlated subqueries when possible
Instead of:
SELECT
c.customer_name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c;
Use:
SELECT
c.customer_name,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
Use Common Table Expressions (CTEs) for readability and reuse
WITH customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_name,
co.order_count,
co.total_spent
FROM customers c
JOIN customer_orders co ON c.customer_id = co.customer_id;
Materialize frequently used subqueries as temporary tables or views
CREATE OR REPLACE TEMPORARY VIEW high_value_customers AS
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;
-- Now use this view in multiple queries
SELECT * FROM high_value_customers;
Push predicates into subqueries to filter early
Instead of:
SELECT *
FROM (
SELECT * FROM orders
) o
WHERE o.order_date > '2023-01-01';
Use:
SELECT *
FROM (
SELECT * FROM orders WHERE order_date > '2023-01-01'
) o;
Aggregation functions and window functions are essential tools for data analysis that allow you to summarize and compare data across different dimensions.
Basic Aggregation Functions
These functions operate on a set of rows to return a single value:
SELECT
category,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
MIN(price) AS min_price,
MAX(price) AS max_price,
SUM(inventory) AS total_inventory
FROM products
GROUP BY category;
Advanced Aggregation Techniques
SELECT
COALESCE(region, 'All Regions') AS region,
COALESCE(category, 'All Categories') AS category,
SUM(sales_amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
(region, category),
(region),
(category),
()
);
SELECT
COALESCE(region, 'All Regions') AS region,
COALESCE(category, 'All Categories') AS category,
COALESCE(CAST(year AS STRING), 'All Years') AS year,
SUM(sales_amount) AS total_sales
FROM sales
GROUP BY CUBE(region, category, year);
SELECT
COALESCE(region, 'All Regions') AS region,
COALESCE(state, 'All States') AS state,
COALESCE(city, 'All Cities') AS city,
SUM(sales_amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, state, city);
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing the result into a single output row.
SELECT
category,
product_name,
price,
RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank,
DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_dense_rank,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS price_row_number
FROM products;
SELECT
product_name,
category,
price,
AVG(price) OVER (PARTITION BY category) AS category_avg_price,
price - AVG(price) OVER (PARTITION BY category) AS price_diff_from_avg,
price / SUM(price) OVER (PARTITION BY category) * 100 AS pct_of_category_price
FROM products;
SELECT
order_date,
order_id,
total_amount,
SUM(total_amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
AVG(total_amount) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
) AS moving_avg_5day
FROM orders;
Practical Exercise 1: Working with Aggregations and Window Functions
SELECT
year,
quarter,
total_sales,
LAG(total_sales) OVER (PARTITION BY quarter ORDER BY year) AS prev_year_sales,
(total_sales - LAG(total_sales) OVER (PARTITION BY quarter ORDER BY year)) /
LAG(total_sales) OVER (PARTITION BY quarter ORDER BY year) * 100 AS yoy_growth_pct
FROM (
SELECT
YEAR(order_date) AS year,
QUARTER(order_date) AS quarter,
SUM(total_amount) AS total_sales
FROM orders
GROUP BY YEAR(order_date), QUARTER(order_date)
) quarterly_sales
ORDER BY year, quarter;
WITH product_sales AS (
SELECT
p.product_id,
p.product_name,
p.category,
SUM(oi.quantity) AS total_quantity_sold,
RANK() OVER (PARTITION BY p.category ORDER BY SUM(oi.quantity) DESC) AS sales_rank
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name, p.category
)
SELECT
product_id,
product_name,
category,
total_quantity_sold
FROM product_sales
WHERE sales_rank <= 3
ORDER BY category, sales_rank;
Databricks SQL excels at handling complex, nested data structures that are common in modern data lakes.
Types of Nested Data Structures
-- Creating a table with array columns
CREATE TABLE users (
user_id INT,
name STRING,
interests ARRAY<STRING>,
purchase_amounts ARRAY<DOUBLE>
);
-- Inserting data with arrays
INSERT INTO users VALUES
(1, 'Alice', ARRAY('hiking', 'reading', 'cooking'), ARRAY(120.50, 25.00, 75.99)),
(2, 'Bob', ARRAY('gaming', 'photography'), ARRAY(249.99, 525.50));
-- Creating a table with map columns
CREATE TABLE product_attributes (
product_id INT,
name STRING,
specifications MAP<STRING, STRING>
);
-- Inserting data with maps
INSERT INTO product_attributes VALUES
(101, 'Smartphone', MAP('color', 'black', 'storage', '128GB', 'ram', '8GB')),
(102, 'Laptop', MAP('color', 'silver', 'storage', '512GB', 'ram', '16GB'));
-- Creating a table with struct columns
CREATE TABLE orders (
order_id INT,
customer_id INT,
shipping_address STRUCT<street: STRING, city: STRING, zip: STRING>,
billing_address STRUCT<street: STRING, city: STRING, zip: STRING>
);
-- Inserting data with structs
INSERT INTO orders VALUES
(1001, 5001,
STRUCT('123 Main St', 'Portland', '97201'),
STRUCT('123 Main St', 'Portland', '97201')),
(1002, 5002,
STRUCT('456 Oak Ave', 'Seattle', '98101'),
STRUCT('789 Pine Dr', 'New York', '10001'));
Querying Nested Data
-- Accessing array elements
SELECT
user_id,
name,
interests[0] AS primary_interest,
size(interests) AS interest_count
FROM users;
-- Exploding arrays
SELECT
user_id,
name,
exploded_interest
FROM users
LATERAL VIEW explode(interests) AS exploded_interest;
-- Filtering with array containment
SELECT *
FROM users
WHERE array_contains(interests, 'hiking');
-- Aggregating arrays
SELECT
user_id,
name,
array_join(interests, ', ') AS interest_list,
array_min(purchase_amounts) AS min_purchase,
array_max(purchase_amounts) AS max_purchase,
aggregate(purchase_amounts, CAST(0 AS DOUBLE), (acc, x) -> acc + x) AS total_spent
FROM users;
-- Accessing map values
SELECT
product_id,
name,
specifications['color'] AS color,
specifications['storage'] AS storage
FROM product_attributes;
-- Exploding maps
SELECT
product_id,
name,
spec_name,
spec_value
FROM product_attributes
LATERAL VIEW explode(specifications) AS spec_name, spec_value;
-- Filtering with map keys
SELECT *
FROM product_attributes
WHERE array_contains(map_keys(specifications), 'ram');
-- Accessing struct fields
SELECT
order_id,
customer_id,
shipping_address.street AS shipping_street,
shipping_address.city AS shipping_city,
billing_address.zip AS billing_zip
FROM orders;
-- Comparing struct fields
SELECT *
FROM orders
WHERE shipping_address = billing_address;
Practical Exercise 2: Working with Nested Data
CREATE TABLE customer_transactions (
customer_id INT,
name STRING,
transactions ARRAY<STRUCT<
transaction_id: STRING,
date: DATE,
amount: DOUBLE,
items: ARRAY<STRUCT<
item_id: INT,
item_name: STRING,
quantity: INT,
price: DOUBLE
>>
>>
);
INSERT INTO customer_transactions VALUES
(1, 'Alice', ARRAY(
STRUCT('T1001', DATE '2023-01-15', 125.40, ARRAY(
STRUCT(101, 'Product A', 2, 45.50),
STRUCT(102, 'Product B', 1, 34.40)
)),
STRUCT('T1002', DATE '2023-02-20', 85.00, ARRAY(
STRUCT(103, 'Product C', 1, 85.00)
))
)),
(2, 'Bob', ARRAY(
STRUCT('T2001', DATE '2023-01-05', 200.00, ARRAY(
STRUCT(101, 'Product A', 1, 45.50),
STRUCT(104, 'Product D', 1, 154.50)
))
));
SELECT
ct.customer_id,
ct.name,
t.transaction_id,
t.date,
i.item_name,
i.quantity,
i.price,
i.quantity * i.price AS line_total
FROM customer_transactions ct
LATERAL VIEW explode(transactions) AS t
LATERAL VIEW explode(t.items) AS i
ORDER BY ct.customer_id, t.date, i.item_name;
Optimizing query performance is crucial for efficient data analysis, especially with large datasets.
Key Optimization Techniques
-- Instead of this
SELECT
c.customer_name,
o.order_date,
o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date > '2023-01-01';
-- Do this
SELECT
c.customer_name,
o.order_date,
o.total_amount
FROM customers c
JOIN (
SELECT * FROM orders WHERE order_date > '2023-01-01'
) o ON c.customer_id = o.customer_id;
-- Instead of this
SELECT * FROM customers JOIN orders ON customers.customer_id = orders.customer_id;
-- Do this
SELECT
customers.customer_name,
orders.order_id,
orders.order_date,
orders.total_amount
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id;
-- For a table partitioned by date
SELECT *
FROM sales_data
WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31';
Avoiding Cartesian Products: Be cautious with CROSS JOINs and ensure proper join conditions
Using Appropriate Join Types: Choose the right join type based on your data and requirements
Leveraging Query Caching: Databricks SQL caches query results for improved performance
-- Check if query results are cached
SELECT * FROM samples.nyctaxi.trips LIMIT 10;
-- Run again to use cached results
Higher-Order Functions
Spark SQL provides higher-order functions for working efficiently with complex data types:
SELECT
user_id,
name,
interests,
transform(interests, i -> upper(i)) AS uppercase_interests
FROM users;
SELECT
user_id,
name,
purchase_amounts,
filter(purchase_amounts, a -> a > 100) AS large_purchases
FROM users;
SELECT *
FROM users
WHERE exists(purchase_amounts, a -> a > 500);
SELECT
user_id,
name,
purchase_amounts,
aggregate(purchase_amounts, CAST(0 AS DOUBLE), (acc, x) -> acc + x) AS total_spent
FROM users;
User-Defined Functions (UDFs)
UDFs allow you to extend SQL functionality with custom logic:
-- Creating a simple UDF
CREATE OR REPLACE FUNCTION celsius_to_fahrenheit(celsius DOUBLE)
RETURNS DOUBLE
RETURN (celsius * 9/5) + 32;
-- Using the UDF
SELECT
city,
temperature_celsius,
celsius_to_fahrenheit(temperature_celsius) AS temperature_fahrenheit
FROM weather_data;
Practical Exercise 3: Performance Optimization
Original query:
SELECT *
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id
WHERE s.sale_date BETWEEN '2022-01-01' AND '2023-12-31';
Optimized query:
SELECT
s.sale_id,
s.sale_date,
s.quantity,
s.total_amount,
c.customer_name,
c.customer_email,
p.product_name,
p.category
FROM (
SELECT * FROM sales
WHERE sale_date BETWEEN '2022-01-01' AND '2023-12-31'
) s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;
-- Create a UDF to calculate discount price
CREATE OR REPLACE FUNCTION calculate_discount(price DOUBLE, discount_pct DOUBLE)
RETURNS DOUBLE
RETURN price * (1 - discount_pct/100);
-- Use the UDF in a query
SELECT
product_id,
product_name,
price AS original_price,
discount_percentage,
calculate_discount(price, discount_percentage) AS discounted_price
FROM products;
Test your understanding of today’s material with these questions:
Which SQL operation would you use to simultaneously insert, update, and delete rows in a target table based on values from a source table? a) INSERT b) UPDATE c) MERGE d) COPY
When using a window function, what does PARTITION BY do? a) Splits the result set into partitions for distributed processing b) Divides the result set into groups for which the window function is applied separately c) Creates physical partitions in the underlying table d) Filters the result set based on partition values
Which join type returns all rows from both tables, with NULL values for non-matching rows? a) INNER JOIN b) LEFT JOIN c) RIGHT JOIN d) FULL JOIN
Which higher-order function would you use to filter elements from an array based on a condition? a) transform() b) filter() c) exists() d) aggregate()
What is the primary advantage of using Common Table Expressions (CTEs) over subqueries? a) CTEs always perform better b) CTEs can be reused multiple times in the same query c) CTEs support recursion d) CTEs are easier to update
Which SQL operation would you use to simultaneously insert, update, and delete rows in a target table based on values from a source table? Answer: c) MERGE
The MERGE operation is designed specifically for this purpose, allowing you to perform multiple data manipulation actions in a single statement. It matches rows between source and target tables, then applies different actions (INSERT, UPDATE, DELETE) based on whether matches are found and any additional conditions you specify.
When using a window function, what does PARTITION BY do? Answer: b) Divides the result set into groups for which the window function is applied separately
PARTITION BY creates logical divisions in your data. The window function calculations are performed independently within each partition, similar to how GROUP BY works for aggregations, but without collapsing the rows. This allows for more sophisticated analytical calculations while maintaining the detail rows.
Which join type returns all rows from both tables, with NULL values for non-matching rows? Answer: d) FULL JOIN
A FULL JOIN (or FULL OUTER JOIN) returns all rows from both tables. When there’s no match for a row in either table, NULL values are returned for columns from the non-matching table. This join type is useful when you need to see all data from both tables regardless of whether relationships exist.
Which higher-order function would you use to filter elements from an array based on a condition? Answer: b) filter()
The filter() function takes an array and a lambda function that specifies a filtering condition. It returns a new array containing only the elements that satisfy that condition. This is especially useful when working with array columns in nested data structures.
What is the primary advantage of using Common Table Expressions (CTEs) over subqueries? Answer: b) CTEs can be reused multiple times in the same query
While CTEs offer several benefits (including improved readability and support for recursion), their ability to be referenced multiple times within the same query is a key advantage. With a CTE, you define a complex intermediate result set once and can then reference it in multiple places, eliminating the need to repeat the same subquery logic.
To reinforce learning:
Data visualization is a critical component of analytics, transforming raw data into meaningful visual representations that highlight patterns, trends, and insights. Databricks SQL provides integrated visualization capabilities that allow you to create compelling visuals directly from your query results.
The Visualization Workflow in Databricks SQL:
Benefits of Integrated Visualizations:
Databricks SQL supports various visualization types, each suited for specific analytical purposes. Understanding when to use each type is fundamental to effective data storytelling.
Table Visualizations
Tables display raw data in rows and columns, ideal when precise values are important.
Key features:
Use cases:
Counter Visualizations
Counters highlight single values, often KPIs or important metrics that need immediate attention.
Key features:
Use cases:
Bar and Column Charts
Bar charts (horizontal) and column charts (vertical) compare values across categories.
Key features:
Use cases:
Line Charts
Line charts show trends over time or continuous dimensions, highlighting patterns and changes.
Key features:
Use cases:
Pie and Donut Charts
These charts show part-to-whole relationships, displaying each category as a slice of the whole.
Key features:
Use cases:
Scatter Plots
Scatter plots reveal relationships between two numerical variables, showing correlation patterns.
Key features:
Use cases:
Maps
Geographical visualizations display data in relation to physical locations.
Key features:
Use cases:
Pivot Tables
Dynamic tables that reorganize and summarize data, allowing for interactive analysis.
Key features:
Use cases:
Box Plots
Box plots display distribution characteristics for numerical data.
Key features:
Use cases:
Effective visualization goes beyond selecting the right chart type—proper formatting enhances clarity and impact.
Color Selection Principles:
Text and Label Formatting:
Layout and Composition:
Practical Examples of Formatting Impact:
Example 1: Sales Dashboard KPI Counter
Example 2: Regional Performance Map
Effective data visualization is about communicating insights, not just displaying data. Data storytelling combines visualizations into a coherent narrative.
Key Principles of Data Storytelling:
Visualization Sequence for Effective Storytelling:
Example Storytelling Sequence for Sales Analysis:
Let’s practice creating visualizations tailored to specific data schemas and analytical needs.
Practical Exercise 1: Sales Performance Dashboard
Start with a query that retrieves sales data:
SELECT
date_trunc('month', order_date) AS month,
region,
product_category,
SUM(sales_amount) AS total_sales,
COUNT(DISTINCT customer_id) AS customer_count,
SUM(sales_amount) / COUNT(DISTINCT customer_id) AS avg_customer_spend
FROM sales
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY date_trunc('month', order_date), region, product_category
ORDER BY month, region, product_category;
Using this query result, create the following visualizations:
Practical Exercise 2: Customer Analysis Dashboard
Start with a query that retrieves customer metrics:
SELECT
customer_segment,
acquisition_channel,
COUNT(DISTINCT customer_id) AS customer_count,
AVG(lifetime_value) AS avg_ltv,
SUM(total_orders) AS order_count,
SUM(total_orders) / COUNT(DISTINCT customer_id) AS avg_orders_per_customer,
SUM(total_spend) / SUM(total_orders) AS avg_order_value
FROM customers
GROUP BY customer_segment, acquisition_channel
ORDER BY customer_segment, acquisition_channel;
Using this query result, create the following visualizations:
Dashboards combine multiple visualizations into a cohesive interface, allowing users to monitor key metrics and explore data relationships.
Dashboard Creation Process:
Dashboard Layout Best Practices:
Practical Exercise 3: Creating a Sales Performance Dashboard
Query parameters enable interactive filtering and dynamic content in dashboards, allowing users to explore data without writing SQL.
Types of Dashboard Parameters:
Creating and Using Parameters:
1. Create a basic parameter and reference it in your query:
-- Using a parameter in the query ({{ start_date }} is a hypothetical dashboard parameter name)
SELECT
date_trunc('month', order_date) AS month,
SUM(sales_amount) AS total_sales
FROM sales
WHERE order_date >= '{{ start_date }}'
GROUP BY date_trunc('month', order_date)
ORDER BY month;
2. Create query-based dropdown parameters:
-- Query to populate dropdown options
SELECT DISTINCT region FROM sales ORDER BY region;
-- Main query using the parameter
SELECT
product_category,
SUM(sales_amount) AS total_sales
FROM sales
WHERE region = '{{ region }}'
GROUP BY product_category
ORDER BY total_sales DESC;
Practical Exercise 4: Adding Parameters to Sales Dashboard
SELECT
date_trunc('month', order_date) AS month,
region,
product_category,
SUM(sales_amount) AS total_sales,
COUNT(DISTINCT customer_id) AS customer_count,
SUM(sales_amount) / COUNT(DISTINCT customer_id) AS avg_customer_spend
FROM sales
-- The {{ ... }} tokens are dashboard parameters
WHERE order_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
AND region = '{{ region }}'
AND product_category IN ({{ product_categories }})
GROUP BY date_trunc('month', order_date), region, product_category
ORDER BY month, region, product_category;
Effective dashboards deliver insights to stakeholders through appropriate sharing and automated updates.
Dashboard Sharing Options:
Considerations for Sharing:
Setting Up Refresh Schedules:
Important Refresh Considerations:
Practical Exercise 5: Sharing and Scheduling the Sales Dashboard
Alerts automatically monitor your data and notify stakeholders when metrics reach specific thresholds.
Alert Components:
Setting Up Basic Alerts:
Alert Best Practices:
Practical Exercise 6: Creating Alerts for Key Metrics
SELECT
product_id,
product_name,
current_stock,
reorder_level
FROM inventory
WHERE current_stock < reorder_level;
SELECT
SUM(sales_amount) AS daily_sales,
(SELECT AVG(daily_total) FROM daily_sales_targets) AS daily_target,
SUM(sales_amount) / (SELECT AVG(daily_total) FROM daily_sales_targets) * 100 AS target_percentage
FROM sales
WHERE sale_date = CURRENT_DATE() - INTERVAL 1 DAY;
Test your understanding of today’s material with these questions:
Which visualization type is most appropriate for showing part-to-whole relationships across categories? a) Line chart b) Bar chart c) Pie or donut chart d) Scatter plot
What is the primary purpose of using dashboard parameters? a) To improve dashboard loading speed b) To allow users to interactively filter and explore data c) To restrict access to sensitive data d) To enable automated dashboard refreshes
When setting up a dashboard refresh schedule, which consideration is most important to prevent refresh failures? a) Time zone settings b) Query complexity c) SQL warehouse availability d) Number of visualizations
Which statement about visualizations in Databricks SQL is FALSE? a) Visualizations can be created directly from query results b) Multiple visualizations can be created from a single query c) Visualizations automatically update when the underlying data changes d) Customizable tables can be used as visualizations
When creating a query-based dropdown parameter, what type of query should you write? a) A query that returns exactly one column with distinct values b) A query that returns multiple columns with distinct combinations c) A query that returns a single row with multiple columns d) A query that returns aggregated summary data
Which visualization type is most appropriate for showing part-to-whole relationships across categories? Answer: c) Pie or donut chart
Pie and donut charts are specifically designed to represent proportional parts of a whole. Each slice visually represents a category’s proportion of the total, making these chart types ideal for showing percentage distributions or composition across categories.
What is the primary purpose of using dashboard parameters? Answer: b) To allow users to interactively filter and explore data
Dashboard parameters enable non-technical users to interact with data visualizations by changing filter values without writing SQL. This interactive capability allows stakeholders to explore different segments of data, change time periods, or focus on specific categories, all while maintaining the dashboard’s overall structure and relationships.
When setting up a dashboard refresh schedule, which consideration is most important to prevent refresh failures? Answer: c) SQL warehouse availability
SQL warehouse availability is critical because if the warehouse has auto-stopped due to inactivity when a scheduled refresh occurs, the refresh will fail. When configuring refresh schedules, you must ensure that either the warehouse will be running at the scheduled time or that auto-start functionality is properly configured.
Which statement about visualizations in Databricks SQL is FALSE? Answer: c) Visualizations automatically update when the underlying data changes
This statement is false. Visualizations reflect the data as of the last query execution. They only update when the query is manually run or when triggered by a scheduled refresh. Changes to the underlying data are not automatically reflected until the next query execution.
When creating a query-based dropdown parameter, what type of query should you write? Answer: a) A query that returns exactly one column with distinct values
A query-based dropdown parameter requires a query that returns a single column containing the distinct values that will populate the dropdown options. This query defines the available choices for the parameter, such as product categories, regions, or time periods that users can select from.
To reinforce learning:
Statistical analysis forms the foundation of data-driven decision making. As a data analyst using Databricks, understanding how to calculate and interpret descriptive statistics is essential for deriving meaningful insights from your data.
Discrete vs. Continuous Statistics
Understanding the distinction between discrete and continuous data is critical for selecting appropriate analytical methods:
Discrete Statistics deal with countable data that takes specific, separate values, such as the number of orders per customer or the count of products sold.
Continuous Statistics involve measurements that can take any value within a range, such as order amounts, durations, or temperatures.
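To make the distinction concrete, here is a small sketch against the orders table used throughout this section (same column names as the later examples):
-- Discrete: counts take separate integer values
SELECT customer_id, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id;
-- Continuous: amounts can take any value within a range
SELECT MIN(order_amount) AS min_amount, AVG(order_amount) AS avg_amount, MAX(order_amount) AS max_amount
FROM orders;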
Key Statistical Measures
Descriptive statistics are grouped into several categories:
Measures of Central Tendency describe the center or typical value of a distribution:
SELECT AVG(order_amount) AS mean_order_value
FROM orders;
SELECT percentile_approx(order_amount, 0.5) AS median_order_value
FROM orders;
SELECT product_id, COUNT(*) AS frequency
FROM orders
GROUP BY product_id
ORDER BY frequency DESC
LIMIT 1;
Measures of Dispersion describe how spread out the data is:
SELECT
MAX(order_amount) - MIN(order_amount) AS order_amount_range
FROM orders;
SELECT VARIANCE(order_amount) AS order_amount_variance
FROM orders;
SELECT STDDEV(order_amount) AS order_amount_stddev
FROM orders;
SELECT
percentile_approx(order_amount, 0.75) - percentile_approx(order_amount, 0.25) AS order_amount_iqr
FROM orders;
Distribution Moments describe the shape of the data distribution, most commonly skewness (how asymmetric it is) and kurtosis (how heavy its tails are).
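Spark SQL provides aggregate functions for these moments; a short example against the same orders table:
SELECT
skewness(order_amount) AS order_amount_skewness, -- asymmetry of the distribution
kurtosis(order_amount) AS order_amount_kurtosis -- heaviness of the tails
FROM orders;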
Practical Statistics Applications in Data Analysis
WITH stats AS (
SELECT
AVG(order_amount) AS mean,
STDDEV(order_amount) AS stddev
FROM orders
)
SELECT
order_id,
order_amount
FROM orders, stats
WHERE order_amount > mean + 3 * stddev -- 3 standard deviations above mean
OR order_amount < mean - 3 * stddev; -- 3 standard deviations below mean
WITH stats AS (
SELECT
AVG(order_amount) AS mean,
STDDEV(order_amount) AS stddev
FROM orders
)
SELECT
order_id,
order_amount,
(order_amount - mean) / stddev AS z_score
FROM orders, stats;
SELECT
percentile_approx(order_amount, array(0.1, 0.25, 0.5, 0.75, 0.9)) AS order_amount_percentiles
FROM orders;
Data enhancement involves enriching your datasets with additional information to provide deeper insights. Data blending combines information from multiple sources into a cohesive analytical view.
Data Enhancement Approaches
SELECT
customer_id,
first_purchase_date,
last_purchase_date,
datediff(last_purchase_date, first_purchase_date) AS customer_lifespan_days,
COUNT(order_id) AS order_count,
SUM(order_amount) AS total_spend,
SUM(order_amount) / COUNT(order_id) AS average_order_value
FROM orders
GROUP BY customer_id, first_purchase_date, last_purchase_date;
SELECT
customer_id,
total_spend,
CASE
WHEN total_spend >= 10000 THEN 'High Value'
WHEN total_spend >= 5000 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_segment
FROM customer_summary;
SELECT
order_id,
order_date,
DAYOFWEEK(order_date) AS day_of_week,
WEEKOFYEAR(order_date) AS week_of_year,
MONTH(order_date) AS month,
QUARTER(order_date) AS quarter,
YEAR(order_date) AS year,
CASE
WHEN MONTH(order_date) IN (12, 1, 2) THEN 'Winter'
WHEN MONTH(order_date) IN (3, 4, 5) THEN 'Spring'
WHEN MONTH(order_date) IN (6, 7, 8) THEN 'Summer'
ELSE 'Fall'
END AS season
FROM orders;
SELECT
store_id,
city,
state,
region,
CASE
WHEN region = 'Northeast' OR region = 'Midwest' THEN 'Eastern Division'
ELSE 'Western Division'
END AS division,
CASE
WHEN state IN ('CA', 'OR', 'WA') THEN 'West Coast'
WHEN state IN ('NY', 'NJ', 'CT') THEN 'Tri-State Area'
ELSE 'Other'
END AS market_zone
FROM store_locations;
Data Blending Techniques
-- Matching customers across two systems
SELECT
a.customer_id AS system_a_id,
b.customer_id AS system_b_id,
a.email,
a.name AS system_a_name,
b.name AS system_b_name
FROM system_a_customers a
JOIN system_b_customers b
ON LOWER(a.email) = LOWER(b.email)
OR (a.phone = b.phone AND a.phone IS NOT NULL);
-- Enriching sales transactions with product and customer dimensions
SELECT
s.transaction_id,
s.transaction_date,
s.quantity,
s.unit_price,
s.quantity * s.unit_price AS total_amount,
p.product_name,
p.category,
p.brand,
c.customer_name,
c.segment,
c.acquisition_channel
FROM sales_transactions s
JOIN products p ON s.product_id = p.product_id
JOIN customers c ON s.customer_id = c.customer_id;
-- Aligning weekly sales with marketing campaigns
SELECT
w.week_start_date,
w.week_end_date,
w.weekly_sales,
w.store_id,
COALESCE(c.campaign_name, 'No Campaign') AS campaign_name,
COALESCE(c.campaign_type, 'None') AS campaign_type,
COALESCE(c.campaign_budget, 0) AS campaign_budget
FROM weekly_sales w
LEFT JOIN marketing_campaigns c
ON w.week_start_date <= c.end_date
AND w.week_end_date >= c.start_date
AND w.region = c.target_region;
Last-Mile ETL
Last-mile ETL refers to the final transformations performed by analysts to prepare data for specific analytical needs. This typically occurs after data has already been processed through formal ETL pipelines and landed in the gold layer of your data lake.
Key characteristics of last-mile ETL:
Common last-mile ETL operations:
-- Combining data for a marketing analysis
CREATE OR REPLACE TEMPORARY VIEW marketing_analysis AS
SELECT
c.customer_id,
c.acquisition_date,
c.acquisition_channel,
COUNT(o.order_id) AS order_count,
SUM(o.order_amount) AS total_spend,
AVG(s.satisfaction_score) AS avg_satisfaction,
MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN satisfaction_surveys s ON o.order_id = s.order_id
GROUP BY c.customer_id, c.acquisition_date, c.acquisition_channel;
-- Creating customer lifetime value metrics
CREATE OR REPLACE TEMPORARY VIEW customer_ltv AS
SELECT
customer_id,
SUM(order_amount) AS total_spend,
COUNT(DISTINCT order_id) AS order_count,
DATEDIFF(MAX(order_date), MIN(order_date)) / 30 AS customer_tenure_months,
SUM(order_amount) / NULLIF(DATEDIFF(MAX(order_date), MIN(order_date)) / 30, 0) AS monthly_value
FROM orders
GROUP BY customer_id;
-- Aligning data to fiscal year for financial reporting
CREATE OR REPLACE TEMPORARY VIEW fiscal_year_performance AS
SELECT
CASE
WHEN MONTH(transaction_date) >= 7 THEN YEAR(transaction_date)
ELSE YEAR(transaction_date) - 1
END AS fiscal_year,
CASE
WHEN MONTH(transaction_date) >= 7 THEN MONTH(transaction_date) - 6
ELSE MONTH(transaction_date) + 6
END AS fiscal_month,
SUM(amount) AS total_amount
FROM transactions
GROUP BY fiscal_year, fiscal_month
ORDER BY fiscal_year, fiscal_month;
Understanding common analytics application patterns allows you to quickly implement solutions for frequent business needs.
Customer Analytics Applications
-- Monthly cohort retention analysis
WITH cohorts AS (
SELECT
customer_id,
DATE_FORMAT(first_purchase_date, 'yyyy-MM') AS cohort_month,
first_purchase_date
FROM (
SELECT
customer_id,
MIN(order_date) AS first_purchase_date
FROM orders
GROUP BY customer_id
)
),
customer_activity AS (
SELECT
c.customer_id,
c.cohort_month,
DATE_FORMAT(o.order_date, 'yyyy-MM') AS activity_month,
MONTHS_BETWEEN(DATE_FORMAT(o.order_date, 'yyyy-MM-01'),
DATE_FORMAT(c.first_purchase_date, 'yyyy-MM-01')) AS month_number
FROM cohorts c
JOIN orders o ON c.customer_id = o.customer_id
)
SELECT
cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN month_number = 0 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_0_retention,
COUNT(DISTINCT CASE WHEN month_number = 1 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_1_retention,
COUNT(DISTINCT CASE WHEN month_number = 2 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_2_retention,
COUNT(DISTINCT CASE WHEN month_number = 3 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_3_retention
FROM customer_activity
GROUP BY cohort_month
ORDER BY cohort_month;
WITH customer_rfm AS (
SELECT
customer_id,
DATEDIFF(CURRENT_DATE(), MAX(order_date)) AS recency,
COUNT(order_id) AS frequency,
SUM(order_amount) AS monetary,
NTILE(5) OVER (ORDER BY DATEDIFF(CURRENT_DATE(), MAX(order_date)) DESC) AS r_score,
NTILE(5) OVER (ORDER BY COUNT(order_id)) AS f_score,
NTILE(5) OVER (ORDER BY SUM(order_amount)) AS m_score
FROM orders
GROUP BY customer_id
)
SELECT
customer_id,
recency,
frequency,
monetary,
r_score,
f_score,
m_score,
CONCAT(r_score, f_score, m_score) AS rfm_score,
CASE
WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
WHEN r_score >= 3 AND f_score >= 3 AND m_score >= 3 THEN 'Loyal Customers'
WHEN r_score >= 3 AND f_score >= 1 AND m_score >= 2 THEN 'Potential Loyalists'
WHEN r_score >= 4 AND f_score <= 2 AND m_score <= 2 THEN 'New Customers'
WHEN r_score <= 2 AND f_score >= 3 AND m_score >= 3 THEN 'At Risk'
WHEN r_score <= 2 AND f_score >= 2 AND m_score >= 2 THEN 'Needs Attention'
WHEN r_score <= 1 AND f_score >= 4 AND m_score >= 4 THEN 'Can\'t Lose Them'
WHEN r_score <= 2 AND f_score <= 2 AND m_score <= 2 THEN 'Hibernating'
ELSE 'Others'
END AS segment
FROM customer_rfm;
Product Analytics Applications
-- Market basket analysis
WITH order_pairs AS (
SELECT
o1.order_id,
p1.product_id AS product_1,
p1.product_name AS product_1_name,
p2.product_id AS product_2,
p2.product_name AS product_2_name
FROM order_items o1
JOIN order_items o2
ON o1.order_id = o2.order_id
AND o1.product_id < o2.product_id
JOIN products p1 ON o1.product_id = p1.product_id
JOIN products p2 ON o2.product_id = p2.product_id
)
SELECT
product_1,
product_1_name,
product_2,
product_2_name,
COUNT(*) AS co_occurrence_count
FROM order_pairs
GROUP BY product_1, product_1_name, product_2, product_2_name
ORDER BY co_occurrence_count DESC;
SELECT
p.product_id,
p.product_name,
p.category,
DATE_FORMAT(o.order_date, 'yyyy-MM') AS month,
COUNT(DISTINCT o.order_id) AS order_count,
SUM(oi.quantity) AS units_sold,
SUM(oi.quantity * oi.unit_price) AS revenue,
SUM(oi.quantity * oi.unit_price) / SUM(oi.quantity) AS average_selling_price,
SUM(oi.quantity * (oi.unit_price - p.cost)) AS gross_profit,
SUM(oi.quantity * (oi.unit_price - p.cost)) / SUM(oi.quantity * oi.unit_price) * 100 AS profit_margin
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
GROUP BY p.product_id, p.product_name, p.category, DATE_FORMAT(o.order_date, 'yyyy-MM')
ORDER BY month, revenue DESC;
Operational Analytics Applications
WITH user_stages AS (
SELECT
user_id,
MAX(CASE WHEN event_type = 'page_view' AND page = 'product_listing' THEN 1 ELSE 0 END) AS reached_listing,
MAX(CASE WHEN event_type = 'page_view' AND page = 'product_detail' THEN 1 ELSE 0 END) AS reached_detail,
MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS reached_cart,
MAX(CASE WHEN event_type = 'begin_checkout' THEN 1 ELSE 0 END) AS reached_checkout,
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS completed_purchase
FROM user_events
WHERE session_date = CURRENT_DATE() - INTERVAL 1 DAY
GROUP BY user_id
)
SELECT
COUNT(*) AS total_users,
SUM(reached_listing) AS listing_views,
SUM(reached_detail) AS detail_views,
SUM(reached_cart) AS cart_additions,
SUM(reached_checkout) AS checkouts,
SUM(completed_purchase) AS purchases,
SUM(reached_detail) / SUM(reached_listing) * 100 AS listing_to_detail_rate,
SUM(reached_cart) / SUM(reached_detail) * 100 AS detail_to_cart_rate,
SUM(reached_checkout) / SUM(reached_cart) * 100 AS cart_to_checkout_rate,
SUM(completed_purchase) / SUM(reached_checkout) * 100 AS checkout_to_purchase_rate,
SUM(completed_purchase) / SUM(reached_listing) * 100 AS overall_conversion_rate
FROM user_stages;
-- Detecting unusual sales patterns using z-scores
WITH daily_sales AS (
SELECT
transaction_date,
SUM(amount) AS daily_total
FROM transactions
GROUP BY transaction_date
),
sales_stats AS (
SELECT
AVG(daily_total) AS mean_daily_sales,
STDDEV(daily_total) AS stddev_daily_sales
FROM daily_sales
)
SELECT
ds.transaction_date,
ds.daily_total,
ss.mean_daily_sales,
(ds.daily_total - ss.mean_daily_sales) / ss.stddev_daily_sales AS z_score,
CASE
WHEN ABS((ds.daily_total - ss.mean_daily_sales) / ss.stddev_daily_sales) > 2 THEN 'Anomaly'
ELSE 'Normal'
END AS status
FROM daily_sales ds CROSS JOIN sales_stats ss
ORDER BY ABS((ds.daily_total - ss.mean_daily_sales) / ss.stddev_daily_sales) DESC;
Let’s practice implementing some of the analytics applications we’ve discussed.
Practical Exercise 1: Customer Segmentation and Analysis
Implement a comprehensive customer segmentation analysis using RFM:
-- Step 1: Calculate base RFM metrics
CREATE OR REPLACE TEMPORARY VIEW customer_rfm_base AS
SELECT
customer_id,
DATEDIFF(CURRENT_DATE(), MAX(order_date)) AS recency_days,
COUNT(DISTINCT order_id) AS frequency,
SUM(order_amount) AS monetary
FROM orders
GROUP BY customer_id;
-- Step 2: Create RFM scores
CREATE OR REPLACE TEMPORARY VIEW customer_rfm_scores AS
SELECT
customer_id,
recency_days,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency_days DESC) AS recency_score,  -- 5 = most recently active
NTILE(5) OVER (ORDER BY frequency ASC) AS frequency_score,    -- 5 = most frequent
NTILE(5) OVER (ORDER BY monetary ASC) AS monetary_score       -- 5 = highest spend
FROM customer_rfm_base;
-- Step 3: Create segments
CREATE OR REPLACE TEMPORARY VIEW customer_segments AS
SELECT
customer_id,
recency_days,
frequency,
monetary,
recency_score,
frequency_score,
monetary_score,
(recency_score + frequency_score + monetary_score) / 3.0 AS average_score,
CASE
WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Champions'
WHEN recency_score >= 4 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'Loyal Customers'
WHEN recency_score >= 3 AND frequency_score >= 1 AND monetary_score >= 2 THEN 'Potential Loyalists'
WHEN recency_score >= 4 AND frequency_score <= 2 AND monetary_score <= 2 THEN 'New Customers'
WHEN recency_score <= 1 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Can''t Lose Them'
WHEN recency_score <= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'At Risk'
WHEN recency_score <= 2 AND frequency_score >= 2 AND monetary_score >= 2 THEN 'Needs Attention'
WHEN recency_score <= 1 AND frequency_score <= 1 AND monetary_score <= 1 THEN 'Lost'
WHEN recency_score <= 2 AND frequency_score <= 2 AND monetary_score <= 2 THEN 'Hibernating'
ELSE 'Others'
END AS segment
FROM customer_rfm_scores;
-- Step 4: Analyze segment characteristics
SELECT
segment,
COUNT(*) AS customer_count,
ROUND(AVG(recency_days), 1) AS avg_recency,
ROUND(AVG(frequency), 1) AS avg_frequency,
ROUND(AVG(monetary), 2) AS avg_monetary,
ROUND(SUM(monetary) / SUM(SUM(monetary)) OVER () * 100, 2) AS pct_total_revenue
FROM customer_segments
GROUP BY segment
ORDER BY avg_recency, avg_monetary DESC;
Practical Exercise 2: Cohort Retention Analysis
Implement a monthly cohort retention analysis:
-- Step 1: Identify first purchase month for each customer
CREATE OR REPLACE TEMPORARY VIEW customer_cohorts AS
SELECT
customer_id,
DATE_FORMAT(MIN(order_date), 'yyyy-MM') AS cohort_month
FROM orders
GROUP BY customer_id;
-- Step 2: Calculate activity for each customer by month
CREATE OR REPLACE TEMPORARY VIEW customer_monthly_activity AS
SELECT
c.customer_id,
c.cohort_month,
DATE_FORMAT(o.order_date, 'yyyy-MM') AS activity_month,
CAST(MONTHS_BETWEEN(TO_DATE(DATE_FORMAT(o.order_date, 'yyyy-MM-01')),
TO_DATE(DATE_FORMAT(MIN(o.order_date) OVER (PARTITION BY c.customer_id), 'yyyy-MM-01')))
AS INT) AS month_number
FROM customer_cohorts c
JOIN orders o ON c.customer_id = o.customer_id;
-- Step 3: Build cohort retention grid
SELECT
cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN month_number = 0 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_0_retention,
COUNT(DISTINCT CASE WHEN month_number = 1 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_1_retention,
COUNT(DISTINCT CASE WHEN month_number = 2 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_2_retention,
COUNT(DISTINCT CASE WHEN month_number = 3 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_3_retention,
COUNT(DISTINCT CASE WHEN month_number = 4 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_4_retention,
COUNT(DISTINCT CASE WHEN month_number = 5 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_5_retention,
COUNT(DISTINCT CASE WHEN month_number = 6 THEN customer_id END) / COUNT(DISTINCT customer_id) * 100 AS month_6_retention
FROM customer_monthly_activity
GROUP BY cohort_month
ORDER BY cohort_month;
Let’s review the key concepts from each day of our preparation to ensure you have a solid grasp of all exam topics.
Day 1: Databricks SQL Fundamentals & Lakehouse Architecture
Day 2: Data Management with Delta Lake
Day 3: Advanced SQL in the Lakehouse
Day 4: Data Visualization and Dashboarding
Time Management
The exam consists of 45 multiple-choice questions in 90 minutes, giving you an average of 2 minutes per question.
Question Analysis Approach
Common Question Patterns
Final Preparation Checklist
Now let’s complete a full practice exam covering all the key topics from the Databricks Certified Data Analyst Associate exam. This will simulate the actual exam experience with 45 multiple-choice questions to be completed in 90 minutes.
Practice Exam Instructions:
The following practice questions cover all exam domains, with an emphasis on the practical application of key concepts.
Databricks SQL Section
A data analyst needs to create a dashboard that will be viewed by executives who do not have direct access to Databricks. What is the best way to share this dashboard? a) Export the dashboard as a PDF b) Share a link that uses the dashboard owner’s credentials c) Create a scheduled refresh that emails the dashboard d) Grant the executives permission to the dashboard and underlying queries
When setting up a Databricks SQL warehouse, which configuration would be most cost-effective for a dashboard that refreshes hourly during business hours? a) 2X-Small with auto-stop set to 10 minutes b) 4X-Large with auto-stop set to 60 minutes c) Serverless endpoint with no auto-stop d) Medium size with auto-stop disabled
Which statement about the medallion architecture is FALSE? a) The bronze layer contains raw, unprocessed data b) The silver layer contains cleansed and validated data c) The gold layer contains business-level aggregates d) Data analysts typically work primarily with bronze layer data
Data Management Section
What happens to the data files when you drop an unmanaged Delta table? a) The data files are automatically deleted b) The data files remain but are marked for deletion c) The data files remain untouched d) The data files are moved to an archive location
You need to create a view that will be available only during your current session. Which SQL statement should you use? a) CREATE VIEW my_view AS SELECT… b) CREATE TEMPORARY VIEW my_view AS SELECT… c) CREATE SESSION VIEW my_view AS SELECT… d) CREATE EPHEMERAL VIEW my_view AS SELECT…
Which statement about Delta Lake is TRUE? a) Delta Lake tables cannot be modified once created b) Delta Lake stores all table history indefinitely c) Delta Lake provides ACID transaction guarantees d) Delta Lake requires data to be in a specific format
SQL in the Lakehouse Section
Which join type will return all rows from both tables, including unmatched rows? a) INNER JOIN b) LEFT JOIN c) RIGHT JOIN d) FULL JOIN
What is the purpose of the MERGE statement in Databricks SQL? a) To combine two tables into a new table b) To update, insert, or delete rows based on matching conditions c) To merge table schemas d) To combine partitioned tables
You need to find the rolling 7-day average of daily sales. Which SQL feature would be most appropriate? a) GROUP BY b) Window functions c) Common Table Expressions d) Subqueries
Which higher-order function would you use to apply a transformation to each element in an array column? a) filter() b) transform() c) aggregate() d) exists()
Data Visualization and Dashboarding Section
Which visualization type is most appropriate for showing the proportion of sales by product category? a) Line chart b) Bar chart c) Pie chart d) Scatter plot
What is the primary benefit of using query parameters in dashboards? a) They improve query performance b) They reduce dashboard loading time c) They allow users to interact with the dashboard without writing SQL d) They enable automated dashboard refreshes
When configuring a dashboard refresh schedule, what should you consider to avoid refresh failures? a) The SQL warehouse auto-stop settings b) The number of visualizations c) The dashboard sharing permissions d) The time zone of the users
Analytics Applications Section
Which statistical measure represents the middle value in a sorted dataset? a) Mean b) Median c) Mode d) Range
In a cohort analysis, what defines a cohort? a) Customers who purchased the same product b) Customers who share demographic characteristics c) Customers who joined during the same time period d) Customers with similar spending patterns
What is the purpose of data enhancement in analytics applications? a) To clean and validate data b) To combine data from multiple sources c) To enrich datasets with additional attributes or derived information d) To optimize query performance
A data analyst needs to create a dashboard that will be viewed by executives who do not have direct access to Databricks. What is the best way to share this dashboard? Answer: b) Share a link that uses the dashboard owner’s credentials
Explanation: This option allows executives without Databricks access to view the dashboard through a shared link that authenticates as the dashboard creator, enabling access without requiring Databricks accounts for the viewers.
When setting up a Databricks SQL warehouse, which configuration would be most cost-effective for a dashboard that refreshes hourly during business hours? Answer: a) 2X-Small with auto-stop set to 10 minutes
Explanation: This configuration uses the smallest warehouse size suitable for the workload while automatically stopping after 10 minutes of inactivity, minimizing costs between hourly refreshes.
Which statement about the medallion architecture is FALSE? Answer: d) Data analysts typically work primarily with bronze layer data
Explanation: This statement is false because data analysts primarily work with gold layer data, which contains business-ready, transformed metrics. Bronze layer contains raw, unprocessed data that typically requires further transformation before analysis.
What happens to the data files when you drop an unmanaged Delta table? Answer: c) The data files remain untouched
Explanation: For unmanaged (external) tables, Databricks only manages the metadata. When dropping an unmanaged table, only the metadata definition is removed while the underlying data files remain in their specified location.
You need to create a view that will be available only during your current session. Which SQL statement should you use? Answer: b) CREATE TEMPORARY VIEW my_view AS SELECT…
Explanation: Temporary views exist only for the duration of the current session and are automatically dropped when the session ends, making them appropriate for session-specific data analysis.
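For reference, here is a minimal sketch of a session-scoped view over the sales.orders table used in earlier examples; the filter date is illustrative.
-- Temporary view: exists only for the current session; no metastore object is created
CREATE TEMPORARY VIEW recent_orders AS
SELECT order_id, customer_id, order_amount, order_date
FROM sales.orders
WHERE order_date >= '2023-01-01';

SELECT COUNT(*) AS recent_order_count FROM recent_orders;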
Which statement about Delta Lake is TRUE? Answer: c) Delta Lake provides ACID transaction guarantees
Explanation: Delta Lake ensures atomicity, consistency, isolation, and durability (ACID) for all operations, which is one of its key benefits over traditional data lake storage formats.
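As a quick illustration, each committed transaction on a Delta table is recorded as a new table version that you can inspect and query; the table name and version number below are illustrative.
-- Inspect the transaction history of a Delta table
DESCRIBE HISTORY sales.orders;

-- Query an earlier version (time travel); version 3 is illustrative
SELECT * FROM sales.orders VERSION AS OF 3 LIMIT 10;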
Which join type will return all rows from both tables, including unmatched rows? Answer: d) FULL JOIN
Explanation: A FULL JOIN (or FULL OUTER JOIN) returns all rows from both tables, with NULL values for columns from the non-matching table when there is no match between the tables.
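A minimal sketch, assuming a customers table alongside the orders table used throughout this guide:
-- FULL JOIN keeps customers with no orders and orders with no matching customer
SELECT
c.customer_id,
o.order_id,
o.order_amount
FROM customers c
FULL JOIN orders o
ON c.customer_id = o.customer_id;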
What is the purpose of the MERGE statement in Databricks SQL? Answer: b) To update, insert, or delete rows based on matching conditions
Explanation: MERGE allows for simultaneous insert, update, and delete operations based on matching conditions between source and target tables, enabling efficient upsert operations in a single statement.
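A minimal upsert sketch, assuming a staging_orders source table with the same schema as the target (both names are illustrative):
-- Upsert: update matching orders and insert new ones in a single atomic statement
MERGE INTO sales.orders AS target
USING staging_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *;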
You need to find the rolling 7-day average of daily sales. Which SQL feature would be most appropriate? Answer: b) Window functions
Explanation: Window functions allow calculations across a specified range of rows related to the current row, making them ideal for computing rolling averages over time periods.
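A minimal sketch, reusing the daily_sales shape from the anomaly-detection example earlier in this section:
-- Rolling 7-day average: the current day plus the 6 preceding days
WITH daily_sales AS (
SELECT transaction_date, SUM(amount) AS daily_total
FROM transactions
GROUP BY transaction_date
)
SELECT
transaction_date,
daily_total,
AVG(daily_total) OVER (
ORDER BY transaction_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7_day_avg
FROM daily_sales;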
Which higher-order function would you use to apply a transformation to each element in an array column? Answer: b) transform()
Explanation: The transform() function applies a lambda function to each element in an array, returning a new array with the transformed values, perfect for element-wise transformations.
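A minimal sketch; the first query is self-contained, while the second assumes a hypothetical table with an array column of item prices:
-- Self-contained example: double each element of an array literal
SELECT transform(array(1, 2, 3), x -> x * 2) AS doubled;

-- Applied to a hypothetical array column
SELECT
order_id,
transform(item_prices, price -> ROUND(price * 1.1, 2)) AS prices_with_tax
FROM order_item_arrays;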
Which visualization type is most appropriate for showing the proportion of sales by product category? Answer: c) Pie chart
Explanation: Pie charts are specifically designed to show part-to-whole relationships, making them suitable for displaying proportional distribution across categories.
What is the primary benefit of using query parameters in dashboards? Answer: c) They allow users to interact with the dashboard without writing SQL
Explanation: Query parameters enable non-technical users to filter and customize dashboard data through interactive controls without needing to modify or understand the underlying SQL code.
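A minimal sketch of a parameterized query; in the SQL editor, the {{ start_date }} token renders as a widget that dashboard viewers can change without editing the SQL (the parameter name is illustrative):
-- {{ start_date }} becomes an interactive parameter widget on the dashboard
SELECT
DATE_FORMAT(order_date, 'yyyy-MM') AS month,
SUM(order_amount) AS revenue
FROM sales.orders
WHERE order_date >= '{{ start_date }}'
GROUP BY DATE_FORMAT(order_date, 'yyyy-MM')
ORDER BY month;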
When configuring a dashboard refresh schedule, what should you consider to avoid refresh failures? Answer: a) The SQL warehouse auto-stop settings
Explanation: If the SQL warehouse has auto-stopped when a scheduled refresh occurs, the refresh will fail. Warehouse auto-stop settings must align with scheduled refresh times to ensure the warehouse is available.
Which statistical measure represents the middle value in a sorted dataset? Answer: b) Median
Explanation: The median is the middle value in a sorted dataset, with an equal number of values above and below it, making it resistant to outliers compared to the mean.
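A minimal sketch comparing the mean and median of order values; median() is available in recent Databricks SQL runtimes, and percentile_approx() is a widely available alternative:
-- The median resists outliers that can skew the mean
SELECT
AVG(order_amount) AS mean_order_value,
MEDIAN(order_amount) AS median_order_value,
PERCENTILE_APPROX(order_amount, 0.5) AS approx_median_order_value
FROM sales.orders;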
In a cohort analysis, what defines a cohort? Answer: c) Customers who joined during the same time period
Explanation: Cohorts are typically defined by when customers joined (acquisition date/month/quarter), allowing analysis of behavior over time for groups that started their journey at the same point.
What is the purpose of data enhancement in analytics applications? Answer: c) To enrich datasets with additional attributes or derived information
Explanation: Data enhancement involves adding new calculated fields, classifications, or attributes that weren’t in the original dataset to provide additional analytical context and insights.
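A minimal sketch that enhances raw orders with a derived value tier and a weekend flag; the thresholds are illustrative:
-- Derived attributes added on top of the raw order data
SELECT
order_id,
customer_id,
order_amount,
CASE
WHEN order_amount >= 500 THEN 'High'
WHEN order_amount >= 100 THEN 'Medium'
ELSE 'Low'
END AS order_value_tier,
CASE WHEN DAYOFWEEK(order_date) IN (1, 7) THEN 1 ELSE 0 END AS is_weekend_order
FROM sales.orders;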
After completing the practice exam, we’ll review your answers and clarify any remaining questions or concepts. We’ll identify areas that may need additional focus before your actual exam.
Final Exam-Taking Tips: