Databricks Machine Learning Professional Certification

Table of contents

Preparation Plan

Experimentation & Data Management

  • Delta Lake Fundamentals
  • Feature Store Concepts
  • MLflow Experiment Tracking
  • Exercise 1: Delta Lake Operations
  • Exercise 2: Feature Store Operations
  • Exercise 3: MLflow Experiment Tracking
  • Quiz: Experimentation & Data Management
  • Key Takeaways

Model Lifecycle Management

  • MLflow Flavors and Custom Models
  • Model Registry Fundamentals
  • Model Lifecycle Automation
  • Exercise 1: MLflow Flavors and Custom Models
  • Exercise 2: Model Registry Operations
  • Exercise 3: Automating the Model Lifecycle
  • Quiz: Model Lifecycle Management
  • Key Takeaways

Model Deployment Strategies

  • Batch Deployment Patterns
  • Streaming Deployment Concepts
  • Real-time Serving Fundamentals
  • Exercise 1: Batch Deployment with Spark UDFs
  • Exercise 2: Streaming Deployment
  • Exercise 3: Real-time Model Serving
  • Quiz: Model Deployment Strategies
  • Key Takeaways

Solution & Data Monitoring

  • Drift Types
  • Drift Tests and Monitoring
  • Comprehensive Drift Solutions
  • Exercise 1: Detecting Different Types of Drift
  • Exercise 2: Statistical Tests for Drift Detection
  • Exercise 3: Comprehensive Drift Monitoring Solution
  • Quiz: Solution & Data Monitoring
  • Key Takeaways

Knowledge Base

Experimentation & Data Management

1. Delta Lake Fundamentals

Delta Lake is a storage layer that brings reliability to data lakes. As an ML engineer, you need to understand its capabilities for managing ML data.

Key Concepts:

  • ACID Transactions: Delta Lake provides atomicity, consistency, isolation, and durability
  • Table History: Ability to access and restore previous versions of data
  • Schema Enforcement: Prevents data corruption by enforcing schema on write
  • Time Travel: Query previous versions of your data using timestamps or version numbers

Essential Operations:

  • Reading and writing Delta tables
  • Accessing table history
  • Version restoration
  • Delta optimizations (Z-ordering, vacuuming) — see the sketch below
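
The exercise further below covers creating, appending, history, version-based time travel, and OPTIMIZE. Vacuuming and timestamp-based time travel are not shown there, so here is a minimal sketch; the table path and timestamp are placeholders.

# Remove data files no longer referenced by the table (default retention: 7 days / 168 hours)
spark.sql("VACUUM delta.`/path/to/delta-table` RETAIN 168 HOURS")

# Time travel by timestamp instead of version number
previous_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load("/path/to/delta-table")
)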

2. Feature Store Concepts

The Databricks Feature Store is a centralized repository for managing and sharing ML features.

Key Concepts:

  • Feature Tables: Collections of features stored with metadata
  • Online vs. Offline Store: Offline for training, online for low-latency serving
  • Feature Sharing: Reuse features across teams and models
  • Feature Lineage: Track where features are used in models

Essential Operations:

  • Creating feature tables
  • Writing and reading from feature tables
  • Using feature tables in ML workflows
  • Managing feature metadata (see the sketch below)
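
Exercise 2 further below covers creating, writing, and reading feature tables. Two related operations, inspecting table metadata and publishing a table to an online store for low-latency serving, are sketched here; the online store spec class and its arguments are assumptions and vary by cloud, so treat this as illustrative only.

from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs = FeatureStoreClient()

# Inspect feature table metadata (description, primary keys, features)
table_info = fs.get_table("customer_features")
print(table_info.description, table_info.primary_keys)

# Publish the offline table to an online store for real-time feature lookups
online_spec = AmazonDynamoDBSpec(region="us-west-2")  # assumed spec; depends on your cloud
fs.publish_table(name="customer_features", online_store=online_spec, mode="merge")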

3. MLflow Experiment Tracking

MLflow provides tools for experiment tracking, reproducibility, and model management.

Key Concepts:

  • Runs: Individual executions of ML code
  • Experiments: Collections of runs
  • Artifacts: Files associated with runs (models, plots, data)
  • Parameters & Metrics: Track and compare model configurations and performance

Essential Operations:

  • Manual logging of parameters, metrics, and models
  • Accessing experiment data programmatically (see the sketch below)
  • Advanced tracking with model signatures and input examples
  • Nested runs for complex workflows
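
The exercises below show logging; reading experiment data back programmatically (which quiz question 10 also touches on) is sketched here. The experiment name is a placeholder, and run_id is assumed to come from a previously logged run.

import mlflow

# Fetch one run by ID and read its logged metrics and parameters
run = mlflow.get_run(run_id)
print(run.data.metrics)   # e.g. {"accuracy": 0.91}
print(run.data.params)

# Search runs in an experiment and get them back as a pandas DataFrame
experiment = mlflow.get_experiment_by_name("/Shared/my-experiment")
runs_df = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=5
)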

Exercise 1: Delta Lake Operations

# 1. Create a Delta table
from pyspark.sql.functions import col

data = spark.range(0, 1000).withColumn("square", col("id") * col("id"))
data.write.format("delta").save("/path/to/delta-table")

# 2. Read from Delta table
df = spark.read.format("delta").load("/path/to/delta-table")

# 3. Update Delta table (append new data)
new_data = spark.range(1000, 2000).withColumn("square", col("id") * col("id"))
new_data.write.format("delta").mode("append").save("/path/to/delta-table")

# 4. View table history
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
history = delta_table.history()
display(history)

# 5. Time travel (load previous version)
previous_df = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta-table")

# 6. Optimize table (Z-ordering)
spark.sql("OPTIMIZE delta.`/path/to/delta-table` ZORDER BY (id)")

Exercise 2: Feature Store Operations

# 1. Initialize Feature Store client
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()

# 2. Create a feature table
features_df = spark.read.format("delta").load("/path/to/data")

fs.create_table(
    name="customer_features",
    primary_keys=["customer_id"],
    df=features_df,
    description="Customer features for churn prediction"
)

# 3. Write to an existing feature table
fs.write_table(
    name="customer_features",
    df=updated_features_df,
    mode="merge"  # Supports "overwrite" and "merge"
)

# 4. Read from a feature table
features = fs.read_table(
    name="customer_features"
)

# 5. Use features in model training
from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="customer_features",
        feature_names=["feature1", "feature2", "feature3"],
        lookup_key="customer_id"
    )
]

training_data = fs.create_training_set(
    df=training_labels_df,
    feature_lookups=feature_lookups,
    label="churn"
)

# Get the training DataFrame
training_df = training_data.load_df()

Exercise 3: MLflow Experiment Tracking

# 1. Basic MLflow tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Start a run
with mlflow.start_run(run_name="rf-classifier") as run:
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    
    # Log metrics
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Get run ID for later reference
    run_id = run.info.run_id

# 2. Advanced MLflow tracking with signatures and input examples
import pandas as pd
from mlflow.models.signature import infer_signature

# Create an input example
input_example = X_train.iloc[0:5]

# Infer the model signature
signature = infer_signature(X_train, y_pred)

# Log model with signature and input example
with mlflow.start_run(run_name="rf-with-signature") as run:
    mlflow.sklearn.log_model(
        model, 
        "model", 
        signature=signature,
        input_example=input_example
    )

# 3. Working with nested runs
with mlflow.start_run(run_name="parent-run") as parent_run:
    mlflow.log_param("parent_param", "parent_value")
    
    # Create child runs for different model variations
    for n_estimators in [50, 100, 150]:
        with mlflow.start_run(run_name=f"child-run-{n_estimators}", nested=True) as child_run:
            mlflow.log_param("n_estimators", n_estimators)
            
            # Train and log model details
            model = RandomForestClassifier(n_estimators=n_estimators)
            model.fit(X_train, y_train)
            
            accuracy = accuracy_score(y_test, model.predict(X_test))
            mlflow.log_metric("accuracy", accuracy)
            
            mlflow.sklearn.log_model(model, "model")

Quiz: Experimentation & Data Management

  1. What Delta Lake operation would you use to eliminate small files and optimize performance? A) VACUUM B) OPTIMIZE C) COMPACT D) CLEAN

  2. In the Feature Store, what mode should you use when writing updates to an existing feature table for specific keys? A) “overwrite” B) “append” C) “merge” D) “update”

  3. How would you access a Delta table’s version from 3 days ago? A) Using option(“versionAsOf”, X) B) Using option(“timestampAsOf”, timestamp) C) Using timeTravel(days=3) D) Using history().filter(days=3)

  4. When tracking experiments with MLflow, what does setting nested=True allow you to do? A) Create hierarchical runs for different model configurations B) Nest models inside each other C) Create hierarchical storage of artifacts D) Track nested parameters in dictionaries

  5. Which of the following is NOT a benefit of using the Databricks Feature Store? A) Feature sharing across teams B) Automatic feature selection C) Feature discovery D) Point-in-time lookups

  6. What information is provided by a model signature in MLflow? A) The author of the model B) The input and output schema of the model C) Digital signature verifying model authenticity D) The model’s architecture details

  7. Which MLflow tracking function would you use to save metadata about a trained model? A) log_artifact() B) log_metric() C) log_param() D) log_tags()

  8. What operation can you perform to go back to a previous version of a Delta table? A) table.rollback(version=N) B) RESTORE TABLE to version N C) Read with option(“versionAsOf”, N) and rewrite D) History and select previous version

  9. When using the FeatureStoreClient to create a training set, what does the FeatureLookup parameter do? A) Searches for the best features automatically B) Specifies which features to retrieve and how to join them C) Looks up feature importance scores D) Creates a lookup table for the features

  10. How can you programmatically retrieve the metrics from a previous MLflow run? A) mlflow.get_run(run_id).data.metrics B) mlflow.search_runs(experiment_id) C) mlflow.get_metrics(run_id) D) mlflow.runs.get_metrics(run_id)

Quiz Answers

  1. What Delta Lake operation would you use to eliminate small files and optimize performance? Answer: B) OPTIMIZE

    Explanation: The OPTIMIZE command rewrites small files into larger ones to improve read performance. VACUUM removes old file versions but doesn’t consolidate small files. COMPACT isn’t a standard Delta operation, and CLEAN doesn’t exist in Delta Lake.

  2. In the Feature Store, what mode should you use when writing updates to an existing feature table for specific keys? Answer: C) “merge”

    Explanation: The “merge” mode updates existing records based on primary keys and inserts new ones. “Overwrite” would replace the entire table, “append” would add duplicate records, and “update” isn’t a standard mode in Feature Store.

  3. How would you access a Delta table’s version from 3 days ago? Answer: B) Using option(“timestampAsOf”, timestamp)

    Explanation: To access a version from a specific time point, you use the “timestampAsOf” option with a timestamp value. “versionAsOf” is used for specific version numbers, not time periods. The other options don’t exist in Delta Lake.

  4. When tracking experiments with MLflow, what does setting nested=True allow you to do? Answer: A) Create hierarchical runs for different model configurations

    Explanation: The nested=True parameter allows you to create child runs within a parent run, which is useful for organizing related experiments (like testing different hyperparameters of the same model type).

  5. Which of the following is NOT a benefit of using the Databricks Feature Store? Answer: B) Automatic feature selection

    Explanation: Feature Store doesn’t automatically select optimal features for your models. It does provide feature sharing, discovery (finding features in the organization), and point-in-time lookups (accessing feature values as they were at a specific time).

  6. What information is provided by a model signature in MLflow? Answer: B) The input and output schema of the model

    Explanation: A model signature defines the expected data types and shapes for model inputs and outputs, enabling validation when the model is used for inference.

  7. Which MLflow tracking function would you use to save metadata about a trained model? Answer: C) log_param()

    Explanation: log_param() saves named parameters about your model (like hyperparameters). log_metric() is for performance metrics, log_artifact() is for files, and log_tags() isn’t a standard MLflow function.

  8. What operation can you perform to go back to a previous version of a Delta table? Answer: C) Read with option(“versionAsOf”, N) and rewrite

    Explanation: Reading the table at the desired version with “versionAsOf” and writing it back restores that version on any Delta Lake release. Note that recent Databricks Runtimes also provide RESTORE TABLE ... TO VERSION AS OF N as a direct rollback command.

  9. When using the FeatureStoreClient to create a training set, what does the FeatureLookup parameter do? Answer: B) Specifies which features to retrieve and how to join them

    Explanation: FeatureLookup specifies which features to retrieve from which feature tables and how to join them with the training dataset based on lookup keys.

  10. How can you programmatically retrieve the metrics from a previous MLflow run? Answer: A) mlflow.get_run(run_id).data.metrics

    Explanation: You can get a run’s details with mlflow.get_run() and then access its metrics via the .data.metrics attribute. This returns a dictionary of all logged metrics from that run.

Score Assessment

  • 9-10 correct: Excellent! You have a strong grasp of the fundamentals.
  • 7-8 correct: Good understanding, but review the topics you missed.
  • 5-6 correct: You’re on the right track, but need more study in several areas.
  • Below 5 correct: More intensive review needed on the Day 1 topics.

Areas to Focus On

Here are the concepts to review:

  • Delta Lake operations: Review OPTIMIZE and VACUUM commands and their purposes
  • Feature Store operations: Understand the different write modes (merge, overwrite, append)
  • Time travel in Delta: Practice using both version-based and timestamp-based time travel
  • MLflow nested runs: Implement example workflows with parent/child runs
  • MLflow model signatures: Practice creating and using model signatures

Key Takeaways

  1. Delta Lake provides ACID transactions and time travel capabilities
  2. Feature Store centralizes feature engineering and sharing
  3. MLflow helps track experiments and model artifacts

Model Lifecycle Management

1. MLflow Flavors and Custom Models

MLflow flavors provide a standardized way to package models for different frameworks.

Key Concepts:

  • MLflow Flavors: Framework-specific formats for saving and loading models (sklearn, pytorch, tensorflow, etc.)
  • PyFunc Flavor: Universal model format that can wrap any Python model
  • Custom Model Classes: Extending MLflow with custom preprocessing logic
  • Model Signatures: Defining input/output schemas for models

Essential Operations:

  • Creating custom PyFunc models with preprocessing
  • Saving models with different flavors
  • Loading models using different flavors (see the sketch below)
  • Defining model signatures and input examples
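
A model logged with a framework flavor can be loaded back either natively or through the generic pyfunc flavor; Exercise 1 below builds on the same idea. A minimal sketch, assuming a model already logged under the artifact path "sklearn_model" in run run_id, with input_df a pandas DataFrame of features.

import mlflow
import mlflow.sklearn

model_uri = f"runs:/{run_id}/sklearn_model"

# Framework-specific flavor: returns the native sklearn estimator
sk_model = mlflow.sklearn.load_model(model_uri)

# Generic pyfunc flavor: returns a uniform wrapper with a predict(DataFrame) method
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
predictions = pyfunc_model.predict(input_df)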

2. Model Registry Fundamentals

The Model Registry provides a centralized repository for managing the full lifecycle of your ML models.

Key Concepts:

  • Registered Models: Named entities that contain different versions of a model
  • Model Versions: Distinct iterations of a model under the same name
  • Model Stages: Organizational states (None, Staging, Production, Archived)
  • Model Metadata: Tags and descriptions to document models

Essential Operations:

  • Creating and registering models
  • Transitioning models between stages
  • Adding metadata to models
  • Retrieving models from the registry (see the sketch below)
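
Exercise 2 below walks through registering, tagging, and transitioning models; retrieving them back by stage or by explicit version number is sketched here, using the same model name as the exercise.

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Latest version currently in each requested stage
for mv in client.get_latest_versions("RandomForestRegressor", stages=["Staging", "Production"]):
    print(mv.name, mv.version, mv.current_stage)

# Load by explicit version number or by stage
model_v2 = mlflow.pyfunc.load_model("models:/RandomForestRegressor/2")
prod_model = mlflow.pyfunc.load_model("models:/RandomForestRegressor/Production")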

3. Model Lifecycle Automation

Automating the model lifecycle enables CI/CD workflows for ML models.

Key Concepts:

  • Webhooks: Event-triggered callbacks when model states change
  • Databricks Jobs: Scheduled or triggered batch processing
  • CI/CD for ML: Testing, deployment, and monitoring automation
  • Job Clusters: Purpose-specific compute for model tasks

Essential Operations:

  • Creating and managing webhooks
  • Setting up automated testing jobs
  • Building model deployment pipelines
  • Connecting webhooks to jobs
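
Webhooks (Exercise 3 below) trigger jobs automatically; the same job can also be started explicitly from a deployment script via the Jobs REST API, which is useful when wiring up CI/CD. A minimal sketch, assuming a pre-existing job ID, workspace URL, and personal access token (all placeholders).

import requests

host = "https://your-workspace.cloud.databricks.com"
token = "<personal-access-token>"

# Trigger an existing job run, passing notebook parameters
response = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123456,
        "notebook_params": {"model_name": "RandomForestRegressor", "model_version": "2"}
    }
)
print(response.json())  # contains the run_id of the triggered run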

Exercise 1: MLflow Flavors and Custom Models

# 1. Basic model with sklearn flavor
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

# Train a model
X = np.random.rand(100, 4)
y = X[:, 0] + 2 * X[:, 1] + np.random.rand(100)
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)

# Log with sklearn flavor
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "sklearn_model")
    model_uri = f"runs:/{run.info.run_id}/sklearn_model"
    
# Load the model
loaded_model = mlflow.sklearn.load_model(model_uri)
predictions = loaded_model.predict(X)

# 2. Custom PyFunc model with preprocessing
import mlflow.pyfunc

# Define a custom model class with preprocessing
class CustomRFModel(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
        
    def predict(self, context, model_input):
        # Add preprocessing logic
        if isinstance(model_input, pd.DataFrame):
            # Work on a copy so the caller's DataFrame is not mutated
            model_input = model_input.copy()
            # Standardize numeric features
            numeric_cols = model_input.select_dtypes(include=[np.number]).columns
            model_input[numeric_cols] = model_input[numeric_cols] - model_input[numeric_cols].mean()
            model_input[numeric_cols] = model_input[numeric_cols] / model_input[numeric_cols].std()
            
        # Return predictions
        return self.model.predict(model_input)

# Create and log the custom model
custom_model = CustomRFModel(model)
with mlflow.start_run() as run:
    # Define the model signature
    from mlflow.models.signature import infer_signature
    signature = infer_signature(X, model.predict(X))
    
    # Provide an input example
    input_example = pd.DataFrame(X[0:5])
    
    # Log the model with all metadata
    mlflow.pyfunc.log_model(
        "custom_model",
        python_model=custom_model,
        signature=signature,
        input_example=input_example
    )
    custom_model_uri = f"runs:/{run.info.run_id}/custom_model"

# Load and use the custom model
loaded_custom_model = mlflow.pyfunc.load_model(custom_model_uri)
custom_predictions = loaded_custom_model.predict(X)

Exercise 2: Model Registry Operations

# 1. Register a model directly from a run
import mlflow.sklearn
from mlflow.tracking import MlflowClient

client = MlflowClient()

# First, log a model with MLflow
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "sk_model")
    run_id = run.info.run_id
    model_uri = f"runs:/{run_id}/sk_model"

# Register the model
model_name = "RandomForestRegressor"
mv = mlflow.register_model(model_uri, model_name)
print(f"Name: {mv.name}")
print(f"Version: {mv.version}")

# 2. Add description and tags to the registered model
client.update_registered_model(
    name=model_name,
    description="Random Forest Regressor for predicting target values"
)

client.set_registered_model_tag(
    name=model_name,
    key="team",
    value="data_science"
)

# 3. Add description and tags to a specific model version
client.update_model_version(
    name=model_name,
    version=mv.version,
    description="Model trained with 100 trees and default parameters"
)

client.set_model_version_tag(
    name=model_name,
    version=mv.version,
    key="train_data",
    value="synthetic_data"
)

# 4. Transition model to staging
client.transition_model_version_stage(
    name=model_name,
    version=mv.version,
    stage="Staging"
)

# 5. Create a new version and transition to production
with mlflow.start_run() as new_run:
    mlflow.sklearn.log_model(
        model,
        "sk_model",
        registered_model_name=model_name
    )
    new_run_id = new_run.info.run_id

# Find the latest version (model versions are strings, so compare them numerically)
latest_version = max(int(mv.version) for mv in client.search_model_versions(f"name='{model_name}'"))

# Transition to production
client.transition_model_version_stage(
    name=model_name,
    version=latest_version,
    stage="Production"
)

# 6. Load a specific model version by stage
prod_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
staging_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")

# 7. Archive older versions
client.transition_model_version_stage(
    name=model_name,
    version=1,  # Assuming this is the older version
    stage="Archived"
)

Exercise 3: Automating the Model Lifecycle

# 1. Creating webhooks for the Model Registry
# Registry webhooks are managed with the databricks-registry-webhooks package
# (pip install databricks-registry-webhooks), not with MlflowClient
from databricks_registry_webhooks import RegistryWebhooksClient, JobSpec, HttpUrlSpec

webhooks_client = RegistryWebhooksClient()

# Create a job-triggered webhook that fires when a model version moves to Staging
job_spec = JobSpec(
    job_id="123456",  # Replace with actual job ID
    workspace_url="https://your-workspace.cloud.databricks.com",
    access_token="<personal-access-token>"
)
staging_webhook = webhooks_client.create_webhook(
    model_name=model_name,
    events=["MODEL_VERSION_TRANSITIONED_TO_STAGING"],
    job_spec=job_spec,
    description="Trigger test job when a version moves to Staging"
)

# Create an HTTP webhook that notifies an external service on transition to Production
http_spec = HttpUrlSpec(
    url="https://your-service.example.com/webhook",
    authorization="Bearer your-token-here"
)
production_webhook = webhooks_client.create_webhook(
    model_name=model_name,
    events=["MODEL_VERSION_TRANSITIONED_TO_PRODUCTION"],
    http_url_spec=http_spec,
    description="Notify external service when a version moves to Production"
)

# 2. List all webhooks for the model
all_webhooks = webhooks_client.list_webhooks(model_name=model_name)
for webhook in all_webhooks:
    print(f"ID: {webhook.id}, Events: {webhook.events}")

# 3. Delete a webhook
webhooks_client.delete_webhook(id=staging_webhook.id)

# 4. Set up Databricks Jobs for model automation

# Note: This would typically be done through the Databricks UI or API
# Here's a conceptual example of what the jobs would look like:

# Job 1: Train Model (scheduled or triggered)
"""
{
    "name": "Train-RF-Model",
    "tasks": [
        {
            "task_key": "train_model",
            "notebook_task": {
                "notebook_path": "/Path/To/Training/Notebook",
                "source": "WORKSPACE"
            },
            "job_cluster_key": "training_cluster"
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "training_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-cpu-ml-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2
            }
        }
    ]
}
"""

# Job 2: Test Model (triggered by webhook)
"""
{
    "name": "Test-RF-Model",
    "tasks": [
        {
            "task_key": "test_model",
            "notebook_task": {
                "notebook_path": "/Path/To/Testing/Notebook",
                "base_parameters": {
                    "model_name": "",
                    "model_version": ""
                }
            },
            "job_cluster_key": "testing_cluster"
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "testing_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-cpu-ml-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1
            }
        }
    ]
}
"""

# Job 3: Deploy Model (triggered by webhook)
"""
{
    "name": "Deploy-RF-Model",
    "tasks": [
        {
            "task_key": "deploy_model",
            "notebook_task": {
                "notebook_path": "/Path/To/Deployment/Notebook",
                "base_parameters": {
                    "model_name": "",
                    "model_version": ""
                }
            },
            "job_cluster_key": "deployment_cluster"
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "deployment_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-cpu-ml-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1
            }
        }
    ]
}
"""

Quiz: Model Lifecycle Management

  1. Which MLflow flavor would you use to create a model with custom preprocessing logic? A) mlflow.sklearn B) mlflow.custom C) mlflow.pyfunc D) mlflow.generic

  2. What happens when you register a model that already exists in the Model Registry? A) It overwrites the existing model B) It creates a new version of the model C) It returns an error that the model already exists D) It creates a copy with a different name

  3. Which of the following is NOT a standard stage in the MLflow Model Registry? A) Development B) Staging C) Production D) Archived

  4. When creating a webhook for model registry events, which of these is NOT a valid event type? A) MODEL_VERSION_CREATED B) MODEL_VERSION_TRANSITIONED_STAGE C) REGISTERED_MODEL_CREATED D) MODEL_VERSION_DEPLOYED

  5. What is the correct way to load a production model from the registry? A) mlflow.pyfunc.load_model(“models:/model_name/production”) B) mlflow.sklearn.load_production_model(“model_name”) C) mlflow.load_model(“models:/model_name/Production”) D) mlflow.pyfunc.load_model(“models:/model_name/Production”)

  6. What is the purpose of adding a model signature to an MLflow model? A) Digitally sign the model to verify its creator B) Define the expected input and output schema C) Increase the model’s security in the registry D) Document who approved the model for production

  7. Which client method is used to move a model from Staging to Production? A) client.promote_model_version() B) client.update_model_stage() C) client.transition_model_version_stage() D) client.set_model_version_status()

  8. What would you include in a pyfunc model’s predict() method to implement custom preprocessing? A) Data cleaning and feature transformation logic before passing to the model B) Hyperparameter optimization logic C) Model retraining logic if performance decreases D) Post-processing of model outputs only

  9. What is the correct API to add a description to a model version? A) client.set_model_version_description() B) client.update_model_version() C) client.add_model_description() D) client.update_registered_model_version()

  10. What type of compute is recommended for production Databricks Jobs? A) All-purpose clusters B) Job clusters C) Single-node clusters D) Interactive clusters

Quiz Answers

  1. Which MLflow flavor would you use to create a model with custom preprocessing logic? Answer: C) mlflow.pyfunc

    Explanation: The mlflow.pyfunc flavor allows you to create custom Python models by extending the PythonModel class, enabling you to implement custom preprocessing logic in the predict() method. The other flavors are for specific frameworks (sklearn) or don’t exist (custom, generic).

  2. What happens when you register a model that already exists in the Model Registry? Answer: B) It creates a new version of the model

    Explanation: When registering a model with a name that already exists in the registry, MLflow creates a new version under that name rather than overwriting the existing model or returning an error.

  3. Which of the following is NOT a standard stage in the MLflow Model Registry? Answer: A) Development

    Explanation: The standard stages in the MLflow Model Registry are None (default), Staging, Production, and Archived. “Development” is not a standard stage.

  4. When creating a webhook for model registry events, which of these is NOT a valid event type? Answer: D) MODEL_VERSION_DEPLOYED

    Explanation: Valid event types include MODEL_VERSION_CREATED, MODEL_VERSION_TRANSITIONED_STAGE, and REGISTERED_MODEL_CREATED. MODEL_VERSION_DEPLOYED is not a standard event type in MLflow webhooks.

  5. What is the correct way to load a production model from the registry? Answer: D) mlflow.pyfunc.load_model(“models:/model_name/Production”)

    Explanation: The correct URI format is “models:/model_name/stage” and the stage name is case-sensitive, with “Production” being the proper casing. MLflow’s pyfunc loader is the universal way to load any model.

  6. What is the purpose of adding a model signature to an MLflow model? Answer: B) Define the expected input and output schema

    Explanation: A model signature in MLflow defines the expected data types and shapes for inputs and outputs, allowing for validation when the model is used for inference.

  7. Which client method is used to move a model from Staging to Production? Answer: C) client.transition_model_version_stage()

    Explanation: The transition_model_version_stage() method of the MLflowClient is used to change a model version’s stage (e.g., from Staging to Production). The other methods don’t exist in the MLflow API.

  8. What would you include in a pyfunc model’s predict() method to implement custom preprocessing? Answer: A) Data cleaning and feature transformation logic before passing to the model

    Explanation: The predict() method in a custom pyfunc model is where you implement preprocessing logic like data cleaning and feature transformations before passing the data to the underlying model.

  9. What is the correct API to add a description to a model version? Answer: B) client.update_model_version()

    Explanation: The update_model_version() method is used to update metadata for a specific model version, including its description. The other options are not valid MLflow API methods.

  10. What type of compute is recommended for production Databricks Jobs? Answer: B) Job clusters

    Explanation: Job clusters are purpose-built for production workloads in Databricks. They start when a job begins and terminate when it completes, optimizing costs. All-purpose clusters are for interactive work, not production jobs.

Score Assessment

  • 9-10 correct: Excellent! You have a strong grasp of Model Lifecycle Management.
  • 7-8 correct: Good understanding, but review the topics you missed.
  • 5-6 correct: You’re on the right track, but need more study in several areas.
  • Below 5 correct: More intensive review needed on the Day 2 topics.

Areas to Focus On

Here are the concepts to review:

  • MLflow flavors: Make sure you understand the different flavors, especially pyfunc for custom models
  • Model Registry stages: Review the standard stages and their purpose in the workflow
  • Webhook events: Practice setting up webhooks with different event triggers
  • Client API methods: Familiarize yourself with the correct MLflowClient methods for different operations
  • Model URI formats: Practice loading models using the correct URI format and stage names

Key Takeaways

  1. PyFunc is powerful for creating custom models with preprocessing logic
  2. The Model Registry provides stages to organize model development
  3. Webhooks enable event-driven automation for model transitions

Model Deployment Strategies

1. Batch Deployment Patterns

Batch deployment is the most common pattern for model inference in production environments.

Key Concepts:

  • Batch Inference: Running predictions on large datasets at scheduled intervals
  • Parallel Processing: Distributing model inference across a cluster
  • Spark UDFs: User-Defined Functions for applying models in Spark
  • Load Optimization: Strategies for efficient batch processing

Essential Operations:

  • Loading registered models with MLflow
  • Creating Spark UDFs for parallel inference
  • Optimizing batch inference with partitioning and Z-ordering
  • Implementing score_batch for efficient processing

2. Streaming Deployment Concepts

Streaming deployment enables continuous inference on real-time data streams.

Key Concepts:

  • Structured Streaming: Spark’s API for stream processing
  • Continuous Inference: Processing data as it arrives
  • Stateful Processing: Maintaining state across micro-batches
  • Output Modes: Handling different output requirements

Essential Operations:

  • Converting batch pipelines to streaming
  • Handling late-arriving or out-of-order data
  • Implementing streaming model inference
  • Writing predictions to output sinks

3. Real-time Serving Fundamentals

Real-time serving provides low-latency inference for individual requests.

Key Concepts:

  • Model Serving: Hosting models as REST endpoints
  • Latency Requirements: Meeting low-latency service-level agreements (SLAs)
  • Scaling: Managing throughput and concurrency
  • Just-in-Time Features: Retrieving features for real-time inference

Essential Operations:

  • Deploying models to Databricks Model Serving
  • Creating and managing serving endpoints
  • Querying models via REST API
  • Integrating with online feature stores

Exercise 1: Batch Deployment with Spark UDFs

# 1. Load a registered model for batch inference
import mlflow
import pandas as pd
from pyspark.sql.functions import struct, col

# Load the model from the registry
model_uri = "models:/RandomForestRegressor/Production"
model = mlflow.pyfunc.load_model(model_uri)

# 2. Create a Pandas UDF for batch inference
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Define the UDF using the loaded model; a struct column arrives as a pandas DataFrame
@pandas_udf(DoubleType())
def predict_udf(features: pd.DataFrame) -> pd.Series:
    # Make predictions on the whole batch at once
    return pd.Series(model.predict(features))

# 3. Apply the UDF to a DataFrame
# Assuming df has feature columns feature1, feature2, etc.
feature_cols = ["feature1", "feature2", "feature3"]
predictions_df = df.withColumn(
    "prediction",
    predict_udf(struct(*[col(c) for c in feature_cols]))
)

# 4. Create a Spark UDF for distributed inference
# This approach is more efficient for large-scale inference
spark_udf = mlflow.pyfunc.spark_udf(
    spark=spark, 
    model_uri=model_uri,
    result_type=DoubleType()
)

# Apply the Spark UDF
predictions_df = df.withColumn(
    "prediction",
    spark_udf(struct(*[col(c) for c in feature_cols]))
)

# 5. Optimize batch inference with partitioning
# First, save predictions to a Delta table
predictions_df.write.format("delta") \
    .partitionBy("date_col") \
    .mode("append") \
    .save("/path/to/predictions")

# Optimize with Z-ordering for faster queries
spark.sql("""
OPTIMIZE delta.`/path/to/predictions`
ZORDER BY (customer_id)
""")

# 6. Use Feature Store batch scoring
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# score_batch works with models that were logged via fs.log_model together with a
# training set, so the feature lookups are packaged with the model; the input
# DataFrame only needs to contain the lookup keys (here, customer_id)
batch_df = fs.score_batch(
    model_uri=model_uri,
    df=inference_df
)

Exercise 2: Streaming Deployment

# 1. Set up a streaming data source
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Define the schema of the streaming data
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("feature1", DoubleType(), True),
    StructField("feature2", DoubleType(), True),
    StructField("feature3", DoubleType(), True),
    StructField("timestamp", StringType(), True)
])

# Read from a streaming source (e.g., Kafka)
streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:port") \
    .option("subscribe", "input_topic") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# 2. Load the model for streaming inference
import mlflow

model_uri = "models:/RandomForestRegressor/Production"
model = mlflow.pyfunc.load_model(model_uri)

# 3. Apply the model using a UDF
from pyspark.sql.functions import pandas_udf, struct
from pyspark.sql.types import DoubleType
import pandas as pd

# Define the UDF; the struct column arrives as a pandas DataFrame per micro-batch
@pandas_udf(DoubleType())
def predict_udf(features: pd.DataFrame) -> pd.Series:
    # Make predictions on the micro-batch
    return pd.Series(model.predict(features))

# Apply the UDF to the streaming DataFrame
feature_cols = ["feature1", "feature2", "feature3"]
predictions_streaming_df = streaming_df.withColumn(
    "prediction", 
    predict_udf(struct(*[col(c) for c in feature_cols]))
)

# 4. Write predictions to a streaming sink
query = predictions_streaming_df \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/streaming_predictions")

# 5. Alternative: Use a foreachBatch function for more control
from pyspark.sql.functions import lit

def process_batch(batch_df, batch_id):
    # Additional processing specific to each batch
    processed_df = batch_df.withColumn("batch_id", lit(batch_id))
    
    # Write to Delta table
    processed_df.write.format("delta").mode("append").save("/path/to/predictions")

streaming_query = predictions_streaming_df \
    .writeStream \
    .foreachBatch(process_batch) \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .trigger(processingTime="1 minute") \
    .start()

# 6. Handle watermarking for late data
from pyspark.sql.functions import window, avg, to_timestamp

# Watermarking requires a timestamp column, so cast the string field first
windowed_df = streaming_df \
    .withColumn("timestamp", to_timestamp(col("timestamp"))) \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("customer_id")
    ) \
    .agg(avg("feature1").alias("avg_feature1"))

# Then apply model to windowed data...

Exercise 3: Real-time Model Serving

# Note: Much of the real-time serving setup is done through the Databricks UI or REST API
# Here's a conceptual guide for the key operations:

# 1. Enable Model Serving for a registered model
# Through UI: Go to Model Registry -> Select model -> Serving tab -> Enable serving

# Or programmatically via REST API:
"""
curl -X POST https://<databricks-instance>/api/2.0/serving-endpoints \
     -H "Authorization: Bearer <token>" \
     -d '{
           "name": "rf-prediction-endpoint",
           "config": {
             "served_models": [{
               "model_name": "RandomForestRegressor",
               "model_version": "1",
               "workload_size": "Small",
               "scale_to_zero_enabled": true
             }]
           }
         }'
"""

# 2. Query the model endpoint programmatically
import requests
import json

def query_endpoint(endpoint_name, input_data):
    url = f"https://<databricks-instance>/serving-endpoints/{endpoint_name}/invocations"
    headers = {
        "Authorization": f"Bearer <token>",
        "Content-Type": "application/json"
    }
    data_json = json.dumps({
        "dataframe_records": input_data
    })
    
    response = requests.post(url, headers=headers, data=data_json)
    return response.json()

# Example input data
input_data = [
    {"feature1": 0.5, "feature2": 0.2, "feature3": 0.8}
]

# Get predictions
predictions = query_endpoint("rf-prediction-endpoint", input_data)
print(predictions)

# 3. Integration with Feature Store for online serving
# First, ensure your Feature Store table is published to an online store
# (the online store spec class depends on your cloud)
"""
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs.publish_table(
    name="customer_features",
    online_store=AmazonDynamoDBSpec(region="us-west-2")
)
"""

# Then, when making real-time predictions with a model that was logged via
# fs.log_model, the serving endpoint looks up the remaining features from the
# online store, so the request only needs to contain the lookup keys
def query_with_features(endpoint_name, lookup_keys):
    url = f"https://<databricks-instance>/serving-endpoints/{endpoint_name}/invocations"
    headers = {
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json"
    }
    
    data_json = json.dumps({
        "dataframe_records": [{"customer_id": key} for key in lookup_keys]
    })
    
    response = requests.post(url, headers=headers, data=data_json)
    return response.json()

# Example lookup keys
lookup_keys = ["customer_123", "customer_456"]

# Get predictions with feature lookups
predictions = query_with_features("rf-prediction-endpoint", lookup_keys)
print(predictions)

Quiz: Model Deployment Strategies

  1. Which deployment pattern is most appropriate for generating predictions for millions of customers once per day? A) Real-time serving B) Batch inference C) Streaming inference D) Online serving

  2. What is the primary advantage of using mlflow.pyfunc.spark_udf() over a regular pandas UDF? A) It allows for more complex preprocessing B) It distributes the model across the cluster efficiently C) It enables automatic model retraining D) It provides lower latency for small datasets

  3. When implementing a streaming inference pipeline, what Spark feature helps deal with late-arriving data? A) Checkpointing B) Watermarking C) Trigger once D) Output mode

  4. What is NOT a typical use case for real-time model serving? A) Fraud detection during a transaction B) Real-time product recommendations on a website C) Daily customer churn prediction reports D) Instant credit approval decisions

  5. Which method allows the most efficient batch scoring using features from the Feature Store? A) fs.read_table() followed by model.predict() B) fs.score_batch() C) mlflow.pyfunc.spark_udf() with feature lookup D) model.predict() with feature joiner

  6. What is a benefit of Z-ordering a Delta table containing model predictions? A) It compresses the data to save storage space B) It enables faster queries on specific columns C) It enforces schema validation D) It guarantees ACID transactions

  7. When converting a batch inference pipeline to streaming, what component must be changed? A) The model itself B) The input data source and potentially the output sink C) The MLflow tracking system D) The cluster configuration only

  8. What is the recommended way to handle state information in a streaming model inference pipeline? A) Store state in a separate database B) Use Spark’s stateful processing with watermarking C) Avoid stateful operations in model inference D) Write custom state handlers

  9. Which Databricks feature enables scaling real-time model serving to zero when not in use? A) Auto Scaling B) Scale-to-Zero C) Serverless Endpoints D) Cluster Autotermination

  10. What’s the primary benefit of partitioning a Delta table containing model predictions? A) It improves write performance by distributing data B) It enables query pruning for faster reads on partition columns C) It guarantees data consistency D) It allows for better compression

Quiz Answers

  1. Which deployment pattern is most appropriate for generating predictions for millions of customers once per day? Answer: B) Batch inference

    Explanation: Batch inference is ideal for high-volume, scheduled processing where predictions don’t need to be generated in real-time. Running predictions for millions of customers once daily is a classic batch inference scenario.

  2. What is the primary advantage of using mlflow.pyfunc.spark_udf() over a regular pandas UDF? Answer: B) It distributes the model across the cluster efficiently

    Explanation: The spark_udf function from MLflow efficiently distributes model inference across a Spark cluster, allowing for better parallelization and throughput compared to a standard pandas UDF implementation.

  3. When implementing a streaming inference pipeline, what Spark feature helps deal with late-arriving data? Answer: B) Watermarking

    Explanation: Watermarking in Spark Structured Streaming allows you to specify how late data can arrive and still be processed, which is essential for handling out-of-order data in streaming pipelines.

  4. What is NOT a typical use case for real-time model serving? Answer: C) Daily customer churn prediction reports

    Explanation: Daily churn prediction reports are a batch processing use case, not requiring real-time serving. The other options (fraud detection, real-time recommendations, and instant credit decisions) all require immediate responses and are suitable for real-time serving.

  5. Which method allows the most efficient batch scoring using features from the Feature Store? Answer: B) fs.score_batch()

    Explanation: The score_batch() method from the FeatureStoreClient is specifically designed for efficient batch scoring with features from the Feature Store, handling the feature lookups and model prediction in an optimized way.

  6. What is a benefit of Z-ordering a Delta table containing model predictions? Answer: B) It enables faster queries on specific columns

    Explanation: Z-ordering co-locates related data in the same files, allowing for more efficient data skipping during queries, which significantly speeds up queries that filter on the Z-ordered columns.

  7. When converting a batch inference pipeline to streaming, what component must be changed? Answer: B) The input data source and potentially the output sink

    Explanation: To convert a batch pipeline to streaming, you must change how data is read (using readStream instead of read) and how results are written (using writeStream instead of write). The model itself doesn’t necessarily change.

  8. What is the recommended way to handle state information in a streaming model inference pipeline? Answer: B) Use Spark’s stateful processing with watermarking

    Explanation: Spark Structured Streaming provides built-in stateful processing capabilities, which, when combined with watermarking, allow for proper handling of state across streaming micro-batches.

  9. Which Databricks feature enables scaling real-time model serving to zero when not in use? Answer: B) Scale-to-Zero

    Explanation: Databricks Model Serving includes a Scale-to-Zero feature that allows endpoints to automatically scale down to zero compute when they’re not receiving requests, reducing costs.

  10. What’s the primary benefit of partitioning a Delta table containing model predictions? Answer: B) It enables query pruning for faster reads on partition columns

    Explanation: Partitioning a Delta table allows the query optimizer to skip irrelevant partitions (data files) during read operations, significantly improving query performance when filtering on the partition columns.

Score Assessment

  • 9-10 correct: Excellent! You have a strong grasp of Model Deployment Strategies.
  • 7-8 correct: Good understanding, but review the topics you missed.
  • 5-6 correct: You’re on the right track, but need more study in several areas.
  • Below 5 correct: More intensive review needed on the Day 3 topics.

Areas to Focus On

Here are the concepts to review:

  • Deployment pattern selection: Understand when to use batch, streaming, or real-time serving
  • Spark UDFs for model deployment: Practice implementing spark_udf for distributed inference
  • Streaming concepts: Focus on watermarking, stateful processing, and output modes
  • Feature Store integration: Practice using fs.score_batch() for optimized batch scoring
  • Delta Lake optimizations: Review partitioning and Z-ordering for performance improvement

Key Takeaways

  1. Batch processing is the most common and cost-effective approach for most use cases
  2. Streaming enables continuous processing of real-time data
  3. Real-time serving provides low-latency predictions but requires more resources

Solution & Data Monitoring

1. Drift Types

Understanding different types of drift is essential for monitoring ML systems effectively.

Key Concepts:

  • Feature Drift: Changes in the distribution of input features
  • Label Drift: Changes in the distribution of target variables
  • Concept Drift: Changes in the relationship between features and target
  • Data Quality Drift: Degradation in data quality (missing values, outliers)

Essential Operations:

  • Detecting different types of drift
  • Understanding causes and impacts of drift
  • Identifying scenarios prone to drift
  • Responding appropriately to different drift types
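
Exercise 1 below focuses on feature drift; label and concept drift are usually checked by comparing the target distribution and by tracking model performance on recently labeled data. A minimal sketch, assuming pandas DataFrames reference_pd and current_pd with a numeric or binary "label" column, a loaded model, and an assumed accuracy tolerance of 0.05.

from scipy import stats
from sklearn.metrics import accuracy_score

# Label drift: compare the target distribution between reference and current data
ks_stat, p_value = stats.ks_2samp(reference_pd["label"], current_pd["label"])
label_drift = p_value < 0.05

# Concept drift: performance degradation on recent labeled data,
# even when feature and label distributions look stable
feature_cols = ["feature1", "feature2", "feature3"]
baseline_accuracy = accuracy_score(reference_pd["label"], model.predict(reference_pd[feature_cols]))
recent_accuracy = accuracy_score(current_pd["label"], model.predict(current_pd[feature_cols]))
concept_drift = (baseline_accuracy - recent_accuracy) > 0.05

print(f"Label drift: {label_drift}, concept drift: {concept_drift}")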

2. Drift Tests and Monitoring

Statistical methods help quantify and detect drift in production ML systems.

Key Concepts:

  • Summary Statistics: Basic metrics for monitoring numeric features
  • Distribution Comparisons: Methods to compare data distributions
  • Statistical Tests: Rigorous approaches to detect significant drift
  • Monitoring Frameworks: Systematic approaches to track model health

Essential Operations:

  • Calculating summary statistics for feature monitoring
  • Implementing distribution-based drift detection
  • Applying statistical tests and divergence measures (Kolmogorov-Smirnov, Chi-square, Jensen-Shannon)
  • Setting up alerting systems for drift detection

3. Comprehensive Drift Solutions

End-to-end systems for monitoring and responding to drift in production.

Key Concepts:

  • Production Monitoring: Continuous tracking of model and data health
  • Retraining Triggers: Events that initiate model updates
  • Performance Evaluation: Comparing model versions on current data
  • Feedback Loops: Systems for continuous improvement

Essential Operations:

  • Building comprehensive drift detection workflows
  • Implementing automated retraining pipelines
  • Evaluating new models against current data
  • Deploying updated models to production
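
Exercise 3 below builds the detection half of this workflow. The response half, kicking off retraining when drift is found and checking the candidate against the current Production model, can be sketched as follows; the job ID, metric function, and evaluation data are placeholders.

import mlflow
import requests

def trigger_retraining(host, token, retrain_job_id, model_name):
    """Start the retraining job via the Jobs REST API and return the new run_id."""
    response = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": retrain_job_id, "notebook_params": {"model_name": model_name}}
    )
    return response.json().get("run_id")

def candidate_beats_production(model_name, candidate_version, X_eval, y_eval, metric_fn):
    """Compare a newly trained candidate version against the current Production model
    on the same recent evaluation set; promote only if it is at least as good."""
    prod = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
    cand = mlflow.pyfunc.load_model(f"models:/{model_name}/{candidate_version}")
    return metric_fn(y_eval, cand.predict(X_eval)) >= metric_fn(y_eval, prod.predict(X_eval))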

Exercise 1: Detecting Different Types of Drift

# 1. Set up data for drift analysis
import pandas as pd
import numpy as np
from scipy import stats
import mlflow
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Load reference data (training distribution)
reference_df = spark.table("feature_store.training_features")

# Load current data (production distribution)
current_df = spark.table("feature_store.current_features")

# 2. Basic feature drift detection with summary statistics
# Convert to pandas for easier statistical analysis
reference_pd = reference_df.toPandas()
current_pd = current_df.toPandas()

# Compare summary statistics for numeric features
numeric_features = ["feature1", "feature2", "feature3"]

print("Reference vs Current Summary Statistics:")
for feature in numeric_features:
    ref_mean = reference_pd[feature].mean()
    ref_std = reference_pd[feature].std()
    
    curr_mean = current_pd[feature].mean()
    curr_std = current_pd[feature].std()
    
    mean_shift = abs(ref_mean - curr_mean) / ref_std
    
    print(f"Feature: {feature}")
    print(f"  Reference: mean={ref_mean:.4f}, std={ref_std:.4f}")
    print(f"  Current: mean={curr_mean:.4f}, std={curr_std:.4f}")
    print(f"  Mean shift (in std): {mean_shift:.4f}")
    print(f"  Significant drift: {mean_shift > 0.5}")
    print("")

# 3. Categorical feature drift detection
categorical_features = ["category1", "category2"]

for feature in categorical_features:
    # Calculate value frequencies
    ref_counts = reference_pd[feature].value_counts(normalize=True)
    curr_counts = current_pd[feature].value_counts(normalize=True)
    
    # Combine and fill missing categories with 0
    all_categories = set(ref_counts.index) | set(curr_counts.index)
    ref_dist = {cat: ref_counts.get(cat, 0) for cat in all_categories}
    curr_dist = {cat: curr_counts.get(cat, 0) for cat in all_categories}
    
    print(f"Feature: {feature}")
    print(f"  Reference distribution: {ref_dist}")
    print(f"  Current distribution: {curr_dist}")
    
    # Check for new or missing categories
    missing_cats = set(ref_counts.index) - set(curr_counts.index)
    new_cats = set(curr_counts.index) - set(ref_counts.index)
    
    if missing_cats:
        print(f"  Missing categories in current data: {missing_cats}")
    if new_cats:
        print(f"  New categories in current data: {new_cats}")
    print("")

# 4. Visualize distributions for key features
for feature in numeric_features:
    plt.figure(figsize=(10, 6))
    
    plt.hist(reference_pd[feature], alpha=0.5, label="Reference")
    plt.hist(current_pd[feature], alpha=0.5, label="Current")
    
    plt.title(f"Distribution Comparison: {feature}")
    plt.legend()
    plt.savefig(f"/tmp/{feature}_drift.png")
    
    # Log the plot to MLflow
    with mlflow.start_run(run_name=f"drift_detection_{feature}"):
        mlflow.log_artifact(f"/tmp/{feature}_drift.png")

Exercise 2: Statistical Tests for Drift Detection

# 1. Kolmogorov-Smirnov test for numerical features
from scipy import stats
import numpy as np

def ks_test_drift(reference, current, feature, alpha=0.05):
    """
    Perform Kolmogorov-Smirnov test for distribution drift
    """
    # Remove nulls for statistical testing
    ref_data = reference[feature].dropna().values
    curr_data = current[feature].dropna().values
    
    # Perform KS test
    ks_stat, p_value = stats.ks_2samp(ref_data, curr_data)
    
    # Determine if drift detected
    drift_detected = p_value < alpha
    
    return {
        "feature": feature,
        "ks_statistic": ks_stat,
        "p_value": p_value,
        "drift_detected": drift_detected,
        "test": "Kolmogorov-Smirnov"
    }

# Run KS test for each numeric feature
ks_results = []
for feature in numeric_features:
    result = ks_test_drift(reference_pd, current_pd, feature)
    ks_results.append(result)
    print(f"KS Test - {feature}: Drift Detected = {result['drift_detected']} (p-value: {result['p_value']:.6f})")

# 2. Jensen-Shannon divergence for distribution comparison
def jensen_shannon_divergence(p, q):
    """
    Calculate Jensen-Shannon divergence between two distributions
    """
    # Ensure valid probability distributions (sum to 1)
    p = np.asarray(p) / np.sum(p)
    q = np.asarray(q) / np.sum(q)
    
    # Calculate the average distribution
    m = (p + q) / 2
    
    # Calculate JS divergence as the average KL divergence (scipy.stats.entropy) to the mixture
    divergence = (stats.entropy(p, m) + stats.entropy(q, m)) / 2
    
    return divergence

def js_test_drift(reference, current, feature, bins=20, threshold=0.1):
    """
    Use Jensen-Shannon divergence to detect drift in numeric features
    """
    # Remove nulls
    ref_data = reference[feature].dropna().values
    curr_data = current[feature].dropna().values
    
    # Create histogram distributions
    min_val = min(ref_data.min(), curr_data.min())
    max_val = max(ref_data.max(), curr_data.max())
    
    ref_hist, _ = np.histogram(ref_data, bins=bins, range=(min_val, max_val), density=True)
    curr_hist, _ = np.histogram(curr_data, bins=bins, range=(min_val, max_val), density=True)
    
    # Add small epsilon to avoid zero probabilities
    epsilon = 1e-10
    ref_hist = ref_hist + epsilon
    curr_hist = curr_hist + epsilon
    
    # Calculate JS divergence
    js_div = jensen_shannon_divergence(ref_hist, curr_hist)
    
    # Determine if drift detected
    drift_detected = js_div > threshold
    
    return {
        "feature": feature,
        "js_divergence": js_div,
        "drift_detected": drift_detected,
        "threshold": threshold,
        "test": "Jensen-Shannon"
    }

# Run JS divergence test for each numeric feature
js_results = []
for feature in numeric_features:
    result = js_test_drift(reference_pd, current_pd, feature)
    js_results.append(result)
    print(f"JS Divergence - {feature}: Drift Detected = {result['drift_detected']} (divergence: {result['js_divergence']:.6f})")

# 3. Chi-square test for categorical features
def chi_square_test_drift(reference, current, feature, alpha=0.05):
    """
    Perform Chi-square test for categorical distribution drift
    """
    # Get value counts
    ref_counts = reference[feature].value_counts()
    curr_counts = current[feature].value_counts()
    
    # Get union of all categories
    all_categories = set(ref_counts.index) | set(curr_counts.index)
    
    # Create arrays with counts for each category
    ref_array = np.array([ref_counts.get(cat, 0) for cat in all_categories])
    curr_array = np.array([curr_counts.get(cat, 0) for cat in all_categories])
    
    # Chi-square test requires expected counts >= 5, so filter categories
    valid_mask = (ref_array >= 5) & (curr_array >= 5)
    
    if np.sum(valid_mask) < 2:
        return {
            "feature": feature,
            "chi2_statistic": np.nan,
            "p_value": np.nan,
            "drift_detected": False,
            "test": "Chi-square",
            "note": "Not enough valid categories for Chi-square test"
        }
    
    # Filter arrays
    ref_array = ref_array[valid_mask]
    curr_array = curr_array[valid_mask]
    
    # Scale expected counts so observed and expected totals match, as chisquare requires
    expected = ref_array * curr_array.sum() / ref_array.sum()
    
    # Perform Chi-square test
    chi2_stat, p_value = stats.chisquare(f_obs=curr_array, f_exp=expected)
    
    # Determine if drift detected
    drift_detected = p_value < alpha
    
    return {
        "feature": feature,
        "chi2_statistic": chi2_stat,
        "p_value": p_value,
        "drift_detected": drift_detected,
        "test": "Chi-square"
    }

# Run Chi-square test for each categorical feature
chi2_results = []
for feature in categorical_features:
    result = chi_square_test_drift(reference_pd, current_pd, feature)
    chi2_results.append(result)
    
    if np.isnan(result['p_value']):
        print(f"Chi-square Test - {feature}: {result['note']}")
    else:
        print(f"Chi-square Test - {feature}: Drift Detected = {result['drift_detected']} (p-value: {result['p_value']:.6f})")

Exercise 3: Comprehensive Drift Monitoring Solution

# 1. Set up a workflow to detect and respond to drift
from datetime import datetime
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

def detect_and_respond_to_drift(reference_table, current_table, model_name, version, drift_threshold=0.05):
    """
    End-to-end workflow to detect drift and trigger retraining if needed
    """
    # Load data
    reference_df = spark.table(reference_table)
    current_df = spark.table(current_table)
    
    # Convert to pandas for analysis
    reference_pd = reference_df.toPandas()
    current_pd = current_df.toPandas()
    
    # Get feature names
    numeric_features = [col for col in current_pd.columns if current_pd[col].dtype in ['int64', 'float64']]
    categorical_features = [col for col in current_pd.columns if current_pd[col].dtype == 'object']
    
    # Start an MLflow run to track drift analysis
    with mlflow.start_run(run_name=f"drift_analysis_{model_name}_v{version}") as run:
        # Log basic info
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("model_version", version)
        mlflow.log_param("reference_table", reference_table)
        mlflow.log_param("current_table", current_table)
        mlflow.log_param("analysis_time", datetime.now().isoformat())
        
        # Initialize drift flags
        feature_drift_detected = False
        data_quality_drift_detected = False
        
        # 1. Check data quality
        for feature in numeric_features + categorical_features:
            ref_missing = reference_pd[feature].isna().mean()
            curr_missing = current_pd[feature].isna().mean()
            
            missing_diff = abs(ref_missing - curr_missing)
            
            mlflow.log_metric(f"missing_diff_{feature}", missing_diff)
            
            if missing_diff > 0.05:  # 5% threshold for missing value difference
                data_quality_drift_detected = True
                mlflow.log_metric(f"data_quality_drift_{feature}", 1)
            else:
                mlflow.log_metric(f"data_quality_drift_{feature}", 0)
        
        # 2. Statistical tests for feature drift
        drift_features = []
        
        # For numeric features
        for feature in numeric_features:
            # Run KS test
            ks_result = ks_test_drift(reference_pd, current_pd, feature)
            
            # Log results
            mlflow.log_metric(f"ks_stat_{feature}", ks_result["ks_statistic"])
            mlflow.log_metric(f"ks_pvalue_{feature}", ks_result["p_value"])
            mlflow.log_metric(f"ks_drift_{feature}", int(ks_result["drift_detected"]))
            
            # Run JS divergence test
            js_result = js_test_drift(reference_pd, current_pd, feature)
            
            # Log results
            mlflow.log_metric(f"js_div_{feature}", js_result["js_divergence"])
            mlflow.log_metric(f"js_drift_{feature}", int(js_result["drift_detected"]))
            
            # If either test detects drift, mark feature as drifted
            if ks_result["drift_detected"] or js_result["drift_detected"]:
                feature_drift_detected = True
                drift_features.append(feature)
        
        # For categorical features
        for feature in categorical_features:
            # Run Chi-square test
            chi2_result = chi_square_test_drift(reference_pd, current_pd, feature)
            
            # Log results if test was valid
            if not np.isnan(chi2_result["p_value"]):
                mlflow.log_metric(f"chi2_stat_{feature}", chi2_result["chi2_statistic"])
                mlflow.log_metric(f"chi2_pvalue_{feature}", chi2_result["p_value"])
                mlflow.log_metric(f"chi2_drift_{feature}", int(chi2_result["drift_detected"]))
                
                if chi2_result["drift_detected"]:
                    feature_drift_detected = True
                    drift_features.append(feature)
        
        # Log overall drift status
        mlflow.log_metric("feature_drift_detected", int(feature_drift_detected))
        mlflow.log_metric("data_quality_drift_detected", int(data_quality_drift_detected))
        
        # Log list of drifted features
        if drift_features:
            mlflow.log_param("drift_features", ",".join(drift_features))
        
        # 3. Determine if retraining is needed
        retraining_needed = feature_drift_detected or data_quality_drift_detected
        mlflow.log_metric("retraining_needed", int(retraining_needed))
        
        # 4. If retraining needed, trigger retraining job
        if retraining_needed:
            print(f"Drift detected for model {model_name} (v{version}). Triggering retraining job.")
            
            # In a real implementation, you would trigger a Databricks job here
            # For example, using the Databricks Jobs API
            """
            import requests
            
            response = requests.post(
                f"{workspace_url}/api/2.0/jobs/run-now",
                headers={"Authorization": f"Bearer {token}"},
                json={
                    "job_id": retraining_job_id,
                    "notebook_params": {
                        "model_name": model_name,
                        "drift_features": ",".join(drift_features)
                    }
                }
            )
            """
            
            # For this exercise, just log that we would trigger retraining
            mlflow.log_param("retraining_job_triggered", "True")
        
        return {
            "run_id": run.info.run_id,
            "feature_drift_detected": feature_drift_detected,
            "data_quality_drift_detected": data_quality_drift_detected,
            "retraining_needed": retraining_needed,
            "drift_features": drift_features
        }

# 2. Evaluate if a retrained model performs better on current data
def evaluate_model_improvement(current_data, original_model_uri, new_model_uri):
    """
    Compare performance of original and new models on current data
    """
    # Load the original and new models
    original_model = mlflow.pyfunc.load_model(original_model_uri)
    new_model = mlflow.pyfunc.load_model(new_model_uri)
    
    # Prepare evaluation data
    X = current_data.drop("target", axis=1)
    y = current_data["target"]
    
    # Make predictions with both models
    original_preds = original_model.predict(X)
    new_preds = new_model.predict(X)
    
    # Calculate performance metrics
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    
    # For regression
    original_mse = mean_squared_error(y, original_preds)
    new_mse = mean_squared_error(y, new_preds)
    
    original_mae = mean_absolute_error(y, original_preds)
    new_mae = mean_absolute_error(y, new_preds)
    
    original_r2 = r2_score(y, original_preds)
    new_r2 = r2_score(y, new_preds)
    
    # Log comparison to MLflow
    with mlflow.start_run(run_name="model_comparison") as run:
        mlflow.log_param("original_model_uri", original_model_uri)
        mlflow.log_param("new_model_uri", new_model_uri)
        
        mlflow.log_metric("original_mse", original_mse)
        mlflow.log_metric("new_mse", new_mse)
        mlflow.log_metric("mse_improvement", original_mse - new_mse)
        
        mlflow.log_metric("original_mae", original_mae)
        mlflow.log_metric("new_mae", new_mae)
        mlflow.log_metric("mae_improvement", original_mae - new_mae)
        
        mlflow.log_metric("original_r2", original_r2)
        mlflow.log_metric("new_r2", new_r2)
        mlflow.log_metric("r2_improvement", new_r2 - original_r2)
        
        # Determine if new model is better
        is_improved = (new_mse < original_mse)
        mlflow.log_metric("is_improved", int(is_improved))
        
        result = {
            "run_id": run.info.run_id,
            "original_mse": original_mse,
            "new_mse": new_mse,
            "mse_improvement": original_mse - new_mse,
            "is_improved": is_improved
        }
        
        return result

# 3. Set up scheduled monitoring workflow
"""
# This would be implemented as a Databricks job
# The following is pseudocode for a job notebook

# Get the model info
model_name = "regression_model"
model_version = mlflow.tracking.MlflowClient().get_latest_versions(model_name, stages=["Production"])[0].version

# Detect drift
drift_result = detect_and_respond_to_drift(
    reference_table="feature_store.training_features",
    current_table="feature_store.current_features",
    model_name=model_name,
    version=model_version
)

# If drift detected and retraining was triggered, evaluate improvement after retraining
if drift_result["retraining_needed"]:
    # Wait for retraining job to complete (in real workflow)
    # ...
    
    # Get new model URI
    new_model_version = mlflow.tracking.MlflowClient().get_latest_versions(model_name, stages=["None"])[0].version
    
    # Evaluate improvement
    improvement_result = evaluate_model_improvement(
        current_data=spark.table("feature_store.current_features").toPandas(),
        original_model_uri=f"models:/{model_name}/{model_version}",
        new_model_uri=f"models:/{model_name}/{new_model_version}"
    )
    
    # If improved, promote new model to production
    if improvement_result["is_improved"]:
        mlflow.tracking.MlflowClient().transition_model_version_stage(
            name=model_name,
            version=new_model_version,
            stage="Production"
        )
"""

Quiz: Solution & Data Monitoring

  1. What type of drift occurs when the statistical properties of the target variable change over time? A) Feature drift B) Label drift C) Concept drift D) Data quality drift

  2. Which of the following statistical tests is most appropriate for detecting drift in categorical features? A) Kolmogorov-Smirnov test B) Jensen-Shannon divergence C) Chi-square test D) t-test

  3. When using the Kolmogorov-Smirnov test to detect drift, what does a p-value < 0.05 indicate? A) The distributions are similar B) The distributions are significantly different C) There is not enough data to make a determination D) The test is inconclusive

  4. What is concept drift in the context of machine learning? A) When the model’s code changes over time B) When the statistical properties of features change C) When the relationship between input features and target variable changes D) When users’ understanding of the model changes

  5. What is NOT a typical response to detected feature drift? A) Retraining the model with recent data B) Adjusting feature engineering steps C) Reverting to a previous model version D) Deleting the features that show drift

  6. Which metric is NOT commonly used for monitoring numeric feature drift? A) Mean and standard deviation B) Quantiles (median, percentiles) C) Kolmogorov-Smirnov statistic D) Gini coefficient

  7. What is an advantage of using Jensen-Shannon divergence over Kolmogorov-Smirnov test for drift detection? A) Jensen-Shannon works with categorical features B) Jensen-Shannon is always more accurate C) Jensen-Shannon is symmetric and bounded between 0 and 1 D) Jensen-Shannon requires less data

  8. In a comprehensive drift detection system, why is it important to monitor both individual features and overall model performance? A) Feature drift might not always impact model performance B) It’s redundant but serves as a backup check C) Individual features are too noisy to monitor alone D) Model performance is always more important than feature drift

  9. What is a valid approach to evaluate if a retrained model is better than the current production model? A) Compare their performance on the original training data B) Compare their performance on the most recent production data C) Check if the new model has more features D) Check if the new model is more complex

  10. Which approach is NOT recommended for handling missing categories in categorical feature drift detection? A) Ignore new categories that weren’t in the training data B) Treat missing categories as a significant sign of drift C) Add a small epsilon to zero counts to avoid division by zero D) Include new categories in the distribution comparison

Quiz Answers

  1. What type of drift occurs when the statistical properties of the target variable change over time? Answer: B) Label drift

    Explanation: Label drift specifically refers to changes in the distribution of the target variable (labels) over time. Feature drift refers to changes in input features, while concept drift refers to changes in the relationship between features and the target.

  2. Which of the following statistical tests is most appropriate for detecting drift in categorical features? Answer: C) Chi-square test

    Explanation: The Chi-square test is designed for comparing categorical distributions. Kolmogorov-Smirnov and Jensen-Shannon are better suited for continuous distributions, while t-tests compare means of continuous variables.

  3. When using the Kolmogorov-Smirnov test to detect drift, what does a p-value < 0.05 indicate? Answer: B) The distributions are significantly different

    Explanation: In statistical hypothesis testing, a p-value below the significance level (commonly 0.05) means we reject the null hypothesis, which in the case of the KS test is that the distributions are the same. Therefore, p < 0.05 indicates the distributions are significantly different, suggesting drift.

  4. What is concept drift in the context of machine learning? Answer: C) When the relationship between input features and target variable changes

    Explanation: Concept drift occurs when the underlying relationship between input features and the target variable changes over time, making the learned patterns less valid. This can happen even when the feature distributions themselves remain stable.

  5. What is NOT a typical response to detected feature drift? Answer: D) Deleting the features that show drift

    Explanation: Simply deleting drifted features is rarely the right approach, as those features may still contain valuable information. Typical responses include retraining the model, adjusting feature engineering, or reverting to a previous version while investigating.

  6. Which metric is NOT commonly used for monitoring numeric feature drift? Answer: D) Gini coefficient

    Explanation: The Gini coefficient is primarily used to measure inequality in distributions, particularly in economics. It’s not commonly used for feature drift detection. Mean, standard deviation, quantiles, and KS statistics are standard metrics for monitoring numeric features.

  7. What is an advantage of using Jensen-Shannon divergence over Kolmogorov-Smirnov test for drift detection? Answer: C) Jensen-Shannon is symmetric and bounded between 0 and 1

    Explanation: The Jensen-Shannon divergence has the advantage of being symmetric (the order of distributions doesn’t matter) and bounded between 0 and 1, making it easier to interpret and set thresholds. The KS test doesn’t have these properties.

  8. In a comprehensive drift detection system, why is it important to monitor both individual features and overall model performance? Answer: A) Feature drift might not always impact model performance

    Explanation: Feature drift doesn’t always translate to degraded model performance, as the model might be robust to certain changes or the drift might occur in less important features. Monitoring both gives a more complete picture of system health.

  9. What is a valid approach to evaluate if a retrained model is better than the current production model? Answer: B) Compare their performance on the most recent production data

    Explanation: The most valid approach is to compare both models on the most recent production data, as this reflects the current reality the model will face. Using the original training data would be inappropriate as it doesn’t represent current conditions.

  10. Which approach is NOT recommended for handling missing categories in categorical feature drift detection? Answer: A) Ignore new categories that weren’t in the training data

    Explanation: Ignoring new categories that appear in production but weren’t in training data is not recommended, as these new categories often represent significant drift and should be accounted for in drift detection. The other approaches are valid techniques for handling categorical comparisons.

Score Assessment

  • 9-10 correct: Excellent! You have a strong grasp of Solution & Data Monitoring.
  • 7-8 correct: Good understanding, but review the topics you missed.
  • 5-6 correct: You’re on the right track, but need more study in several areas.
  • Below 5 correct: More intensive review needed on the Solution & Data Monitoring (Day 4) topics.

Areas to Focus On

Here are the concepts to review:

  • Types of drift: Understand the differences between feature, label, and concept drift
  • Statistical tests: Know which tests apply to which types of data (categorical vs. numerical)
  • Interpretation of test results: Understand p-values and thresholds for significance
  • Drift response strategies: Recognize appropriate responses to different drift scenarios
  • Model evaluation: Learn best practices for comparing retrained models against production models

Key Takeaways

  1. Different types of drift require different detection methods
  2. Statistical tests provide rigorous measures of distribution changes
  3. Comprehensive monitoring systems combine feature monitoring, model performance tracking, and automated retraining

Comprehensive Review and Practice Exams

1. Experimentation & Data Management

Delta Lake Operations:

  • Key Operations: Reading, writing, accessing history, and time travel
  • Optimizations: VACUUM for housekeeping, OPTIMIZE with Z-ORDER for query performance
  • Best Practices: Use partitioning for frequently filtered columns, Z-ORDER for analytical queries
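
A minimal sketch of the operations above, assuming a Delta table at the illustrative path /path/to/delta-table with a user_id column:

# Query a previous version with time travel
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/delta-table")

# Inspect the table history (versions, operations, timestamps)
spark.sql("DESCRIBE HISTORY delta.`/path/to/delta-table`").show()

# Co-locate related data for faster filtered queries
spark.sql("OPTIMIZE delta.`/path/to/delta-table` ZORDER BY (user_id)")

# Remove data files no longer referenced by the table (default retention is 7 days)
spark.sql("VACUUM delta.`/path/to/delta-table`")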

Feature Store:

  • Core Functionality: Central repository for feature management
  • Key Operations: Creating feature tables, writing data (merge, overwrite), feature lookups
  • Best Practices: Use primary keys for efficient lookups, consider publishing tables for online serving
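
A hedged sketch of the core Feature Store calls, using illustrative table and column names; customer_features_df, new_features_df, and labels_df are placeholder DataFrames:

from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Create a feature table keyed on customer_id
fs.create_table(
    name="feature_store.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Aggregated customer features"
)

# Upsert new feature values by primary key
fs.write_table(name="feature_store.customer_features", df=new_features_df, mode="merge")

# Build a training set by joining selected features onto a label DataFrame
lookups = [
    FeatureLookup(
        table_name="feature_store.customer_features",
        feature_names=["total_spend", "visit_count"],
        lookup_key="customer_id"
    )
]
training_set = fs.create_training_set(df=labels_df, feature_lookups=lookups, label="churn")
training_df = training_set.load_df()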

MLflow Experiment Tracking:

  • Key Operations: Logging parameters, metrics, artifacts, and models
  • Advanced Features: Model signatures, input examples, nested runs
  • Best Practices: Structure experiments for reproducibility, use autologging where appropriate
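
A short sketch of manual logging with a signature and input example; X_train, y_train, and the computed rmse are assumed to exist:

import mlflow
from mlflow.models.signature import infer_signature
from sklearn.ensemble import RandomForestRegressor

with mlflow.start_run(run_name="baseline_rf") as run:
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)

    # Log configuration and performance
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("val_rmse", rmse)

    # Signature and input example enable schema validation at inference time
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature, input_example=X_train.head(5))

# Access the run's metrics programmatically
metrics = mlflow.get_run(run.info.run_id).data.metrics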

2. Model Lifecycle Management

MLflow Flavors and Custom Models:

  • Built-in Flavors: sklearn, tensorflow, pytorch, etc.
  • PyFunc Flavor: Custom models with preprocessing logic
  • Best Practices: Use model signatures for schema validation, include input examples
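
A minimal custom PyFunc sketch that bundles preprocessing with a fitted model; fitted_scaler and fitted_model are assumed to be trained already:

import mlflow
import mlflow.pyfunc

class PreprocessingModel(mlflow.pyfunc.PythonModel):
    def __init__(self, scaler, model):
        self.scaler = scaler
        self.model = model

    def predict(self, context, model_input):
        # Apply the same preprocessing used during training, then predict
        scaled = self.scaler.transform(model_input)
        return self.model.predict(scaled)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=PreprocessingModel(fitted_scaler, fitted_model)
    )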

Model Registry:

  • Stages: None, Staging, Production, Archived
  • Key Operations: Registering models, transitioning stages, adding metadata
  • Best Practices: Use descriptive model names, add tags for categorization
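
A sketch of these registry operations using illustrative names; run_id and the model name churn_model are placeholders:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model logged in an earlier run; re-registering an existing name creates a new version
mv = mlflow.register_model(f"runs:/{run_id}/model", "churn_model")

# Add metadata, tag the model, and promote the new version to Staging
client.update_model_version("churn_model", mv.version, description="Baseline churn model")
client.set_registered_model_tag("churn_model", "team", "growth")
client.transition_model_version_stage(name="churn_model", version=mv.version, stage="Staging")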

Model Lifecycle Automation:

  • Webhooks: Event-triggered actions (MODEL_VERSION_CREATED, MODEL_VERSION_TRANSITIONED_STAGE)
  • Databricks Jobs: Scheduled or triggered batch processing
  • Best Practices: Use job clusters for production workloads, include validation steps
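
A hedged sketch of wiring a stage-transition webhook to an existing Databricks job; it assumes the databricks-registry-webhooks package, and the job ID, workspace URL, and token are placeholders:

from databricks_registry_webhooks import RegistryWebhooksClient, JobSpec

# Run a validation job whenever a version of this model changes stage
job_spec = JobSpec(
    job_id="123",
    workspace_url="https://<workspace-url>",
    access_token="<personal-access-token>"
)

webhook = RegistryWebhooksClient().create_webhook(
    model_name="churn_model",
    events=["MODEL_VERSION_TRANSITIONED_STAGE"],
    job_spec=job_spec,
    description="Validate model versions on stage transitions"
)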

3. Model Deployment Strategies

Batch Deployment:

  • Key Operations: Using spark_udf for distributed inference, score_batch for Feature Store integration
  • Optimizations: Partitioning output tables, Z-ORDER for frequently queried columns
  • Best Practices: Balance parallelism and memory usage for large-scale inference
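
A batch-scoring sketch with a Spark UDF; the registered model name, source table, and feature column list are illustrative:

import mlflow.pyfunc
from pyspark.sql.functions import struct

# Load the production model as a distributed UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

# Score a batch of records and persist the results to a Delta table
feature_cols = ["total_spend", "visit_count", "tenure_days"]
scored = (spark.table("feature_store.current_features")
               .withColumn("prediction", predict_udf(struct(*feature_cols))))

scored.write.format("delta").mode("overwrite").saveAsTable("predictions.churn_scores")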

Streaming Deployment:

  • Key Concepts: Structured Streaming, watermarking, stateful processing
  • Conversion: Changing batch pipelines to streaming (readStream, writeStream)
  • Best Practices: Handle late data with watermarking, use checkpointing for fault tolerance
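
A sketch converting the batch pattern to streaming, assuming the same illustrative model and tables plus a checkpoint path (readStream.table/toTable require Spark 3.1+):

import mlflow.pyfunc
from pyspark.sql.functions import struct

predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")
feature_cols = ["total_spend", "visit_count", "tenure_days"]

# Read the feature table as a stream instead of a static DataFrame
stream_df = spark.readStream.table("feature_store.current_features")
scored_stream = stream_df.withColumn("prediction", predict_udf(struct(*feature_cols)))

# Append only new predictions; checkpointing gives fault tolerance across micro-batches
query = (scored_stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/churn_scores")
         .toTable("predictions.churn_scores_streaming"))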

Real-time Serving:

  • Options: Model Serving endpoints, custom REST APIs
  • Key Features: Scale-to-Zero, endpoint management
  • Best Practices: Use online Feature Store for low-latency features, balance performance and cost
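
A hedged sketch of querying a Model Serving endpoint over REST; the endpoint name, workspace URL, and token are placeholders, and dataframe_split is one of the JSON payload formats serving endpoints accept:

import requests
import pandas as pd

workspace_url = "https://<workspace-url>"   # placeholder
token = "<personal-access-token>"           # placeholder

sample = pd.DataFrame([{"total_spend": 120.5, "visit_count": 4, "tenure_days": 300}])

response = requests.post(
    f"{workspace_url}/serving-endpoints/churn-endpoint/invocations",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={"dataframe_split": sample.to_dict(orient="split")},
)
print(response.json())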

4. Solution & Data Monitoring

Drift Types:

  • Feature Drift: Changes in input distribution
  • Label Drift: Changes in target distribution
  • Concept Drift: Changes in the relationship between inputs and target

Drift Detection:

  • Statistical Tests: Kolmogorov-Smirnov (numerical), Chi-square (categorical), Jensen-Shannon (distributions)
  • Implementation: Comparing training data distributions vs. production data
  • Best Practices: Set appropriate thresholds, consider feature importance

Comprehensive Monitoring:

  • End-to-End Workflows: Detect drift, evaluate performance, trigger retraining
  • Model Evaluation: Compare new models against current data
  • Best Practices: Implement automated testing before deployment

Practice Exam 1 (60 questions, 120 minutes)

Section 1: Experimentation & Data Management

  1. Which command is used to optimize a Delta table by co-locating related data? A) OPTIMIZE with VACUUM B) OPTIMIZE with ZORDER BY C) COMPACT with ZORDER BY D) VACUUM with ZORDER BY

  2. When using Feature Store to create a training set, what does the FeatureLookup parameter specify? A) The data source for feature values B) Which features to retrieve and how to join them C) How to compute new features on the fly D) Which features to exclude from the dataset

  3. How can you include preprocessing logic in a model logged with MLflow? A) Create a custom PyFunc model with preprocessing in the predict method B) Add preprocessing as a separate step in the MLflow pipeline C) Use mlflow.sklearn.autolog() with preprocess=True D) Log the preprocessor as a separate artifact

  4. What is the correct way to read a specific version of a Delta table? A) spark.read.format(“delta”).option(“versionAsOf”, 5).load(“/path/to/table”) B) spark.read.format(“delta”).version(5).load(“/path/to/table”) C) spark.read.format(“delta”).option(“version”, 5).load(“/path/to/table”) D) spark.read.format(“delta”).history(5).load(“/path/to/table”)

  5. Which method should you use to track nested runs in MLflow? A) mlflow.start_run(nested=True) B) mlflow.log_nested_run() C) mlflow.create_child_run() D) mlflow.track_nested()

  6. When writing to a Feature Store table with existing data, which mode ensures only specific records are updated? A) “overwrite” B) “append” C) “merge” D) “update”

  7. What is a model signature in MLflow used for? A) Digitally signing the model to verify its creator B) Defining the expected input and output schema C) Ensuring the model can be used in production D) Providing a unique identifier for the model

  8. How do you access the metrics from a specific MLflow run programmatically? A) mlflow.get_run(run_id).data.metrics B) mlflow.client.get_metrics(run_id) C) mlflow.runs.get_metrics(run_id) D) mlflow.tracking.get_run_metrics(run_id)

Section 2: Model Lifecycle Management

  1. Which MLflow Client method is used to change a model’s stage from Staging to Production? A) client.update_model_version() B) client.change_stage() C) client.transition_model_version_stage() D) client.set_model_version_stage()

  2. What happens when you register a model with a name that already exists in the Model Registry? A) It overwrites the existing model B) It creates a new version of the model C) It returns an error D) It creates a copy with a suffix

  3. Which webhook event is triggered when a model version is moved to the Production stage? A) MODEL_VERSION_CREATED B) MODEL_VERSION_TRANSITIONED_STAGE C) MODEL_MOVED_TO_PRODUCTION D) REGISTERED_MODEL_STAGE_CHANGED

  4. How can you add a description to a specific version of a registered model? A) client.add_description(model_name, version, description) B) client.update_model_version(model_name, version, description=description) C) client.set_model_version_description(model_name, version, description) D) client.add_model_version_description(model_name, version, description)

  5. Which environment is recommended for running production-grade automated ML pipelines? A) All-purpose clusters B) Job clusters C) Interactive clusters D) Development clusters

  6. How do you add a tag to a registered model? A) client.add_model_tag(name, key, value) B) client.set_registered_model_tag(name, key, value) C) client.update_registered_model(name, tags={key: value}) D) client.tag_model(name, key, value)

  7. What is the purpose of the ModelVersion stage “Archived” in the Model Registry? A) To indicate the model is in storage but not actively used B) To keep old versions for reference without making them available for production C) To indicate the model has been backed up to cloud storage D) To delete the model from the registry

  8. How would you programmatically get the latest production version of a model? A) client.get_latest_versions(name, stages=[“Production”])[0] B) client.get_model_version(name, “Production”) C) client.get_production_version(name) D) client.get_registered_model(name).production_version

Section 3: Model Deployment Strategies

  1. Which deployment pattern is most appropriate for generating predictions on new data every hour? A) Real-time serving B) Batch inference C) Streaming inference D) Online prediction

  2. What is the primary advantage of using mlflow.pyfunc.spark_udf() for model deployment? A) It allows model serving directly from Spark B) It distributes model inference across a Spark cluster C) It enables real-time predictions D) It automatically handles feature engineering

  3. When implementing a streaming inference pipeline, what feature helps manage state across micro-batches? A) foreachBatch B) Watermarking C) checkpointLocation D) outputMode

  4. Which option in Databricks Model Serving allows an endpoint to use no resources when not receiving requests? A) Auto Scaling B) Scale-to-Zero C) Serverless Endpoints D) Dynamic Allocation

  5. What method from the Feature Store client is optimized for batch scoring with feature lookups? A) fs.batch_score() B) fs.score_batch() C) fs.lookup_and_score() D) fs.predict_batch()

  6. When using a watermark in a Structured Streaming application, what does it help with? A) Data encryption B) Authentication of the data source C) Managing late-arriving data D) Reducing memory usage

  7. What is a typical reason to choose batch deployment over real-time deployment? A) When predictions must be generated in milliseconds B) When high throughput is more important than low latency C) When user interaction requires immediate feedback D) When the model is small enough to fit in memory

  8. What Spark Structured Streaming output mode should you use when you want to write only new predictions? A) “complete” B) “update” C) “append” D) “insert-only”

Section 4: Solution & Data Monitoring

  1. What type of drift occurs when the relationship between features and the target variable changes? A) Feature drift B) Label drift C) Concept drift D) Data drift

  2. Which statistical test is appropriate for detecting drift in numerical features? A) Chi-square test B) Kolmogorov-Smirnov test C) Fisher’s exact test D) Log-likelihood ratio test

  3. What does a low p-value in the Kolmogorov-Smirnov test indicate? A) The distributions are similar B) The distributions are significantly different C) The test is inconclusive D) There is not enough data for the test

  4. Which approach is NOT typically part of a comprehensive drift monitoring solution? A) Monitoring feature distributions B) Tracking model performance metrics C) Retraining models on a fixed schedule regardless of drift D) Evaluating new models against current data

  5. What is an advantage of using Jensen-Shannon divergence for drift detection? A) It works well with categorical data B) It’s faster to compute than other tests C) It’s symmetric and bounded between 0 and 1 D) It requires fewer data points

  6. What is the most appropriate way to compare a retrained model against the production model? A) Compare their performance on the original training data B) Compare their complexity (number of parameters) C) Compare their performance on recent production data D) Compare their training time and resource usage

  7. When monitoring categorical features for drift, how should you handle new categories that weren’t in the training data? A) Ignore them as they weren’t part of the original distribution B) Consider them as a significant indicator of drift C) Group them all into an “Other” category D) Remove records with new categories from the analysis

  8. What is the primary purpose of monitoring data quality alongside drift detection? A) To validate the data pipeline is working correctly B) To identify changes that might impact model performance before drift occurs C) To comply with regulations D) To ensure data storage is optimized

Additional Questions (covering all domains)

  1. Which MLflow flavor would you use for a custom model with complex preprocessing logic? A) mlflow.sklearn B) mlflow.tensorflow C) mlflow.custom D) mlflow.pyfunc

  2. How would you optimize a Delta table for queries that frequently filter on a date column and then need to access sorted user IDs? A) OPTIMIZE table ZORDER BY (date_col, user_id) B) OPTIMIZE table PARTITION BY (date_col) ZORDER BY (user_id) C) OPTIMIZE table ZORDER BY (user_id) D) OPTIMIZE table PARTITION BY (date_col, user_id)

  3. What is the benefit of using Feature Store for model inference in production? A) It automatically retrains models when features drift B) It ensures consistent feature transformation between training and inference C) It reduces the number of features needed for a model D) It improves model accuracy by adding new features

  4. In MLflow, what does setting nested=True in a run allow you to do? A) Nest models inside each other B) Create parent-child relationships between runs C) Use nested cross-validation D) Track nested parameters in dictionaries

  5. What is the primary advantage of job clusters over all-purpose clusters for ML workflows? A) They are more secure B) They automatically scale based on workload C) They are optimized for cost by terminating when jobs complete D) They allow for more concurrent users

  6. When implementing a feature store table that will be used for both training and online inference, what should you consider? A) Using a NoSQL database for storage B) Publishing the table for online serving C) Converting all features to numeric type D) Limiting the number of features to improve performance

  7. Which approach would you use to deploy a model that needs to process a continuous stream of IoT sensor data? A) Real-time serving with RESTful endpoints B) Batch inference scheduled every minute C) Streaming inference with Structured Streaming D) On-device inference without cloud deployment

  8. What does the MLflow model registry enable that experiment tracking alone does not? A) Logging model parameters and metrics B) Organizing model versions and transitions between stages C) Creating model artifacts D) Running hyperparameter tuning

  9. Which method would you use to retrieve an artifact from a specific MLflow run? A) mlflow.get_artifact(run_id, path) B) mlflow.artifacts.download_artifacts(run_id, path) C) mlflow.tracking.download_artifacts(run_id, path) D) mlflow.tracking.MlflowClient().download_artifacts(run_id, path)

  10. What is the purpose of a webhook in the MLflow Model Registry? A) To securely authenticate API calls B) To trigger automated actions when model events occur C) To monitor model performance in production D) To enable web-based model serving

  11. Which deployment pattern is best for a use case requiring predictions on millions of records with no immediate time constraint? A) Real-time serving B) Batch inference C) Streaming inference D) Edge deployment

  12. What is a suitable method to detect drift in a categorical feature with many possible values? A) Kolmogorov-Smirnov test B) Jensen-Shannon divergence C) Chi-square test D) t-test

  13. How would you load a specific version of a model from the Model Registry? A) mlflow.pyfunc.load_model(“models:/model_name/version”) B) mlflow.pyfunc.load_model(“models:/model_name/3”) C) mlflow.load_model(“models:/model_name/3”) D) mlflow.load_version(“model_name”, 3)

  14. Which Delta Lake operation helps maintain storage costs by removing old file versions? A) OPTIMIZE B) VACUUM C) COMPACT D) CLEAN

  15. What is the best way to ensure a model’s preprocessing steps are consistently applied in production? A) Document the steps for manual implementation B) Include preprocessing in the model’s pipeline or custom PyFunc class C) Apply preprocessing in the application code D) Create a separate preprocessing service

  16. What would you use to automatically track parameters, metrics, and artifacts for Spark ML models? A) mlflow.start_run() B) mlflow.spark.autolog() C) mlflow.log_model() D) spark.track_with_mlflow()

  17. Which approach allows for the most efficient querying of model predictions stored in a Delta table? A) Saving predictions in JSON format B) Partitioning by frequently filtered columns and Z-ordering C) Using Parquet instead of Delta format D) Denormalizing the prediction data

  18. What is the recommended way to deploy a model that needs to process real-time payments for fraud detection? A) Batch inference every hour B) Streaming inference with latency of minutes C) Real-time model serving with millisecond latency D) Edge deployment on payment terminals

  19. When setting up a webhook for model registry events, which parameter indicates the webhook should trigger on model transitions to Production? A) events=[“PRODUCTION_TRANSITION”] B) stage=”Production” C) target_stage=”Production” D) production_only=True

  20. What is the best practice for organizing experiments in MLflow? A) Create a new experiment for each model type B) Use the default experiment for all runs C) Create a new experiment for each business problem D) Create a new experiment for each modeler

  21. Which of the following is NOT a standard stage in the MLflow Model Registry? A) Development B) Staging C) Production D) Archived

  22. What is the recommended approach when concept drift is detected in a production model? A) Roll back to a previous model version B) Retrain the model with recent data and evaluate performance C) Add more features to the model D) Increase the model complexity

  23. Which method would you use to register a model directly from an MLflow run? A) mlflow.register_model_from_run(run_id, name) B) mlflow.register_model(“runs:/run_id/model”, name) C) mlflow.models.register(run_id, name) D) mlflow.client.register_model(run_id, name)

  24. What does the score_batch method in the Feature Store client do? A) Scores models for feature importance B) Retrieves features and applies a model in one operation C) Batches multiple scoring requests for efficiency D) Calculates feature statistics in batches

  25. Which approach is LEAST appropriate for detecting data quality issues? A) Monitoring the percentage of missing values B) Checking for out-of-range values based on training data C) Comparing average prediction values over time D) Verifying schema consistency

  26. What is a recommended practice when deploying models via Databricks Jobs? A) Use all-purpose clusters for cost efficiency B) Include validation steps before promoting to production C) Deploy all models as a single job D) Use the same runtime for all jobs

  27. How should you implement incremental feature computation in a streaming context? A) Recompute all features on each batch B) Use window functions and stateful processing C) Store feature values in a database and update manually D) Use batch processing instead of streaming

  28. Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? A) Real-time serving endpoints B) Webhooks and scheduled jobs C) Delta Lake time travel D) Feature importance calculation

Practice Exam 1: Answers and Explanations

  1. Which command is used to optimize a Delta table by co-locating related data? Answer: B) OPTIMIZE with ZORDER BY

    Explanation: The OPTIMIZE command with ZORDER BY co-locates related data in the same files, which improves query performance by reducing the amount of data that needs to be read.

  2. When using Feature Store to create a training set, what does the FeatureLookup parameter specify? Answer: B) Which features to retrieve and how to join them

    Explanation: FeatureLookup specifies which features to retrieve from which feature tables and how to join them with the training dataset using lookup keys.

  3. How can you include preprocessing logic in a model logged with MLflow? Answer: A) Create a custom PyFunc model with preprocessing in the predict method

    Explanation: Creating a custom PyFunc model allows you to implement preprocessing logic in the predict method, ensuring the same preprocessing is applied consistently during inference.

  4. What is the correct way to read a specific version of a Delta table? Answer: A) spark.read.format(“delta”).option(“versionAsOf”, 5).load(“/path/to/table”)

    Explanation: The option “versionAsOf” is the correct parameter for specifying a particular version of a Delta table when reading.

  5. Which method should you use to track nested runs in MLflow? Answer: A) mlflow.start_run(nested=True)

    Explanation: Setting nested=True in mlflow.start_run() creates a child run under the current parent run, enabling hierarchical organization of experiment runs.

  6. When writing to a Feature Store table with existing data, which mode ensures only specific records are updated? Answer: C) “merge”

    Explanation: The “merge” mode updates existing records based on matching primary keys and inserts new records, while “overwrite” would replace the entire table.

  7. What is a model signature in MLflow used for? Answer: B) Defining the expected input and output schema

    Explanation: A model signature defines the expected data types and shapes for model inputs and outputs, enabling validation when the model is used for inference.

  8. How do you access the metrics from a specific MLflow run programmatically? Answer: A) mlflow.get_run(run_id).data.metrics

    Explanation: The correct way to access metrics from an MLflow run is by using the get_run() method and accessing the metrics through the .data.metrics attribute.

  9. Which MLflow Client method is used to change a model’s stage from Staging to Production? Answer: C) client.transition_model_version_stage()

    Explanation: The transition_model_version_stage() method is used to change a model version’s stage in the Model Registry.

  10. What happens when you register a model with a name that already exists in the Model Registry? Answer: B) It creates a new version of the model

    Explanation: Registering a model with an existing name creates a new version under that name rather than overwriting the existing model or returning an error.

  11. Which webhook event is triggered when a model version is moved to the Production stage? Answer: B) MODEL_VERSION_TRANSITIONED_STAGE

    Explanation: The MODEL_VERSION_TRANSITIONED_STAGE event is triggered when a model version’s stage changes, including transitions to Production.

  12. How can you add a description to a specific version of a registered model? Answer: B) client.update_model_version(model_name, version, description=description)

    Explanation: The update_model_version() method is used to update metadata for a specific model version, including its description.

  13. Which environment is recommended for running production-grade automated ML pipelines? Answer: B) Job clusters

    Explanation: Job clusters are purpose-built for production workloads. They start when a job begins and terminate when it completes, optimizing costs.

  14. How do you add a tag to a registered model? Answer: B) client.set_registered_model_tag(name, key, value)

    Explanation: The set_registered_model_tag() method is the correct API for adding or updating a tag on a registered model.

  15. What is the purpose of the ModelVersion stage “Archived” in the Model Registry? Answer: B) To keep old versions for reference without making them available for production

    Explanation: The “Archived” stage is used for model versions that are no longer actively used but should be kept for reference or compliance purposes.

  16. How would you programmatically get the latest production version of a model? Answer: A) client.get_latest_versions(name, stages=[“Production”])[0]

    Explanation: The get_latest_versions() method with stages=[“Production”] returns the latest model version in the Production stage.

  17. Which deployment pattern is most appropriate for generating predictions on new data every hour? Answer: B) Batch inference

    Explanation: Batch inference is ideal for scheduled, periodic processing of data when real-time predictions aren’t required.

  18. What is the primary advantage of using mlflow.pyfunc.spark_udf() for model deployment? Answer: B) It distributes model inference across a Spark cluster

    Explanation: The spark_udf() function efficiently distributes model inference across a Spark cluster, enabling parallel processing for better performance.

  19. When implementing a streaming inference pipeline, what feature helps manage state across micro-batches? Answer: C) checkpointLocation

    Explanation: The checkpointLocation option in Structured Streaming enables fault tolerance by saving state information between micro-batches.

  20. Which option in Databricks Model Serving allows an endpoint to use no resources when not receiving requests? Answer: B) Scale-to-Zero

    Explanation: Scale-to-Zero allows model serving endpoints to automatically scale down to zero compute when they’re not receiving requests, reducing costs.

  21. What method from the Feature Store client is optimized for batch scoring with feature lookups? Answer: B) fs.score_batch()

    Explanation: The score_batch() method is specifically designed for efficient batch scoring with features from the Feature Store.

  22. When using a watermark in a Structured Streaming application, what does it help with? Answer: C) Managing late-arriving data

    Explanation: Watermarking in Spark Structured Streaming allows you to specify how late data can arrive and still be processed, which is essential for handling out-of-order data.

  23. What is a typical reason to choose batch deployment over real-time deployment? Answer: B) When high throughput is more important than low latency

    Explanation: Batch deployment is typically chosen when processing large volumes of data (high throughput) is more important than immediate responses (low latency).

  24. What Spark Structured Streaming output mode should you use when you want to write only new predictions? Answer: C) “append”

    Explanation: The “append” output mode only writes new output records to the sink, which is appropriate for writing only new predictions.

  25. What type of drift occurs when the relationship between features and the target variable changes? Answer: C) Concept drift

    Explanation: Concept drift refers specifically to changes in the relationship between input features and the target variable, which affects model performance even if feature distributions remain stable.

  26. Which statistical test is appropriate for detecting drift in numerical features? Answer: B) Kolmogorov-Smirnov test

    Explanation: The Kolmogorov-Smirnov test is commonly used to detect drift in numerical features by comparing the distributions of two samples.

  27. What does a low p-value in the Kolmogorov-Smirnov test indicate? Answer: B) The distributions are significantly different

    Explanation: In statistical hypothesis testing, a low p-value (typically <0.05) means we reject the null hypothesis, which in the KS test is that the distributions are the same.

  28. Which approach is NOT typically part of a comprehensive drift monitoring solution? Answer: C) Retraining models on a fixed schedule regardless of drift

    Explanation: A comprehensive drift monitoring solution typically triggers retraining based on detected drift, not on a fixed schedule regardless of whether drift has occurred.

  29. What is an advantage of using Jensen-Shannon divergence for drift detection? Answer: C) It’s symmetric and bounded between 0 and 1

    Explanation: Jensen-Shannon divergence has the advantage of being symmetric and bounded between 0 and 1, making it easier to interpret and set thresholds.

  30. What is the most appropriate way to compare a retrained model against the production model? Answer: C) Compare their performance on recent production data

    Explanation: The most appropriate comparison is on recent production data, as this reflects the current conditions the model will face.

  31. When monitoring categorical features for drift, how should you handle new categories that weren’t in the training data? Answer: B) Consider them as a significant indicator of drift

    Explanation: New categories that weren’t present in training data are often significant indicators of drift and should be flagged as such.

  32. What is the primary purpose of monitoring data quality alongside drift detection? Answer: B) To identify changes that might impact model performance before drift occurs

    Explanation: Data quality monitoring helps identify issues (like missing values or outliers) that might impact model performance before they manifest as drift.

  33. Which MLflow flavor would you use for a custom model with complex preprocessing logic? Answer: D) mlflow.pyfunc

    Explanation: The MLflow pyfunc flavor allows you to create custom Python models by extending the PythonModel class, enabling complex preprocessing logic.

  34. How would you optimize a Delta table for queries that frequently filter on a date column and then need to access sorted user IDs? Answer: A) OPTIMIZE table ZORDER BY (date_col, user_id)

    Explanation: Z-ordering by both columns will co-locate data with similar dates and user IDs, optimizing queries that filter or sort by these columns.

  35. What is the benefit of using Feature Store for model inference in production? Answer: B) It ensures consistent feature transformation between training and inference

    Explanation: Feature Store ensures that the same feature definitions and transformations are used consistently between training and inference.

  36. In MLflow, what does setting nested=True in a run allow you to do? Answer: B) Create parent-child relationships between runs

    Explanation: The nested=True parameter creates a hierarchical relationship between runs, with the current run becoming a child of the active parent run.

  37. What is the primary advantage of job clusters over all-purpose clusters for ML workflows? Answer: C) They are optimized for cost by terminating when jobs complete

    Explanation: Job clusters automatically terminate when a job completes, optimizing costs by not consuming resources when not needed.

  38. When implementing a feature store table that will be used for both training and online inference, what should you consider? Answer: B) Publishing the table for online serving

    Explanation: For a feature table to be available for online (real-time) inference, it needs to be explicitly published for online serving.

  39. Which approach would you use to deploy a model that needs to process a continuous stream of IoT sensor data? Answer: C) Streaming inference with Structured Streaming

    Explanation: Streaming inference with Structured Streaming is ideal for continuously processing real-time data streams like IoT sensor data.

  40. What does the MLflow model registry enable that experiment tracking alone does not? Answer: B) Organizing model versions and transitions between stages

    Explanation: The Model Registry adds versioning and stage transitions (None, Staging, Production, Archived) that aren’t available with experiment tracking alone.

  41. Which method would you use to retrieve an artifact from a specific MLflow run? Answer: D) mlflow.tracking.MlflowClient().download_artifacts(run_id, path)

    Explanation: The download_artifacts method of the MlflowClient is used to retrieve artifacts from a specific run.

  42. What is the purpose of a webhook in the MLflow Model Registry? Answer: B) To trigger automated actions when model events occur

    Explanation: Webhooks enable automated actions (like triggering a job) in response to model events such as version creation or stage transitions.

  43. Which deployment pattern is best for a use case requiring predictions on millions of records with no immediate time constraint? Answer: B) Batch inference

    Explanation: Batch inference is the most efficient pattern for high-volume predictions when there’s no immediate time constraint.

  44. What is a suitable method to detect drift in a categorical feature with many possible values? Answer: C) Chi-square test

    Explanation: The Chi-square test is appropriate for comparing categorical distributions, including those with many possible values.

  45. How would you load a specific version of a model from the Model Registry? Answer: B) mlflow.pyfunc.load_model(“models:/model_name/3”)

    Explanation: To load a specific version, you use the URI format “models:/model_name/version_number” with the version number.

  46. Which Delta Lake operation helps maintain storage costs by removing old file versions? Answer: B) VACUUM

    Explanation: The VACUUM command permanently removes files that are no longer needed by the table and have been marked for deletion by previous operations.

  47. What is the best way to ensure a model’s preprocessing steps are consistently applied in production? Answer: B) Include preprocessing in the model’s pipeline or custom PyFunc class

    Explanation: Including preprocessing in the model itself (via pipeline or custom class) ensures consistent application in all environments.

  48. What would you use to automatically track parameters, metrics, and artifacts for Spark ML models? Answer: B) mlflow.spark.autolog()

    Explanation: The mlflow.spark.autolog() function automatically logs parameters, metrics, and artifacts for Spark ML models.

  49. Which approach allows for the most efficient querying of model predictions stored in a Delta table? Answer: B) Partitioning by frequently filtered columns and Z-ordering

    Explanation: Partitioning by frequently filtered columns combined with Z-ordering provides the most efficient query performance on Delta tables.

  50. What is the recommended way to deploy a model that needs to process real-time payments for fraud detection? Answer: C) Real-time model serving with millisecond latency

    Explanation: Fraud detection during payment processing requires immediate responses, making real-time model serving with low latency the appropriate choice.

  51. When setting up a webhook for model registry events, which parameter indicates the webhook should trigger on model transitions to Production? Answer: C) target_stage=”Production”

    Explanation: The target_stage parameter specifies that the webhook should trigger when a model is transitioned to a specific stage, in this case, Production.

  52. What is the best practice for organizing experiments in MLflow? Answer: C) Create a new experiment for each business problem

    Explanation: Creating experiments around business problems or use cases helps organize related model development efforts.

  53. Which of the following is NOT a standard stage in the MLflow Model Registry? Answer: A) Development

    Explanation: The standard stages in the MLflow Model Registry are None (default), Staging, Production, and Archived. “Development” is not a standard stage.

  54. What is the recommended approach when concept drift is detected in a production model? Answer: B) Retrain the model with recent data and evaluate performance

    Explanation: When concept drift is detected, retraining the model with recent data that reflects the new relationship between features and target is recommended.

  55. Which method would you use to register a model directly from an MLflow run? Answer: B) mlflow.register_model(“runs:/run_id/model”, name)

    Explanation: The register_model function with a runs URI is used to register a model directly from a run.

  56. What does the score_batch method in the Feature Store client do? Answer: B) Retrieves features and applies a model in one operation

    Explanation: The score_batch method efficiently retrieves features from the Feature Store and applies a model in a single operation.

  57. Which approach is LEAST appropriate for detecting data quality issues? Answer: C) Comparing average prediction values over time

    Explanation: While useful for monitoring model behavior, comparing average predictions is not a direct method for detecting data quality issues like missing values or schema changes.

  58. What is a recommended practice when deploying models via Databricks Jobs? Answer: B) Include validation steps before promoting to production

    Explanation: Including validation steps in deployment jobs ensures that models meet quality standards before being promoted to production.

  59. How should you implement incremental feature computation in a streaming context? Answer: B) Use window functions and stateful processing

    Explanation: Window functions and stateful processing in Structured Streaming enable efficient incremental feature computation as new data arrives.

  60. Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? Answer: B) Webhooks and scheduled jobs

    Explanation: Webhooks and scheduled jobs provide the automation backbone needed to trigger retraining and deployment when drift is detected.

Score Assessment

  • 54-60 correct: Excellent! You’re very well prepared for the exam.
  • 48-53 correct: Good understanding with a few areas to review.
  • 42-47 correct: Solid foundation but need targeted review in several areas.
  • Below 42 correct: More intensive review needed before taking the exam.

Practice Exam 2 (60 questions, 120 minutes)

  1. What is the correct way to create a Feature Store table? A) fs.create_table(name, primary_keys, df, description) B) fs.create_feature_table(name, keys, dataframe, description) C) fs.register_table(name, keys, dataframe, description) D) fs.create_new_table(name, primary_keys, dataframe, description)

  2. How do you restore a Delta table to a specific timestamp? A) spark.read.format(“delta”).option(“timestampAsOf”, timestamp).load(path).write.format(“delta”).mode(“overwrite”).save(path) B) spark.restore.format(“delta”).timestamp(timestamp).save(path) C) DeltaTable.forPath(spark, path).restoreToTimestamp(timestamp) D) spark.sql(f”RESTORE TABLE delta.{path} TO TIMESTAMP AS OF ‘{timestamp}’”)

  3. Which MLflow method can automatically log parameters, metrics, and artifacts without explicit logging statements? A) mlflow.auto_track() B) mlflow.start_tracking() C) mlflow.autolog() D) mlflow.enable_logging()

  4. What does the “version” field in the Delta table history tell you? A) The schema version of the Delta protocol B) The version number of each transaction in sequence C) The version of Spark used to write the table D) The version of the table schema

  5. How would you share features from a feature table with a specific model during training? A) fs.get_features(table_name, feature_names) B) fs.create_training_set(df, feature_lookups, label) C) fs.lookup_features(table_name, feature_names, lookup_key) D) fs.select_features(table_name, feature_names, lookup_key)

  6. What information does MLflow NOT automatically capture when autologging is enabled? A) Model parameters B) Performance metrics C) Feature importance D) Business context for the model

  7. Which operation is NOT possible with Delta Lake? A) Time travel to query a previous version B) Schema evolution to add new columns C) Rollback to a previous version D) Real-time streaming without micro-batching

  8. What is the purpose of specifying input_example when logging a model in MLflow? A) To validate the model’s input schema B) To provide an example for documentation and inference testing C) To set default values for the model D) To optimize model storage

  9. What is the purpose of a custom PyFunc model class in MLflow? A) To implement models in programming languages other than Python B) To include custom preprocessing and postprocessing with the model C) To optimize model inference speed D) To create ensemble models from multiple base models

  10. Which MLflow Model Registry stage is a model automatically assigned when first registered? A) Development B) None C) Staging D) Production

  11. What method would you use to transition a model version from Staging to Production programmatically? A) client.set_model_version_stage(name, version, "Production") B) client.transition_model_version_stage(name, version, "Production") C) client.update_model_version(name, version, stage="Production") D) client.promote_model_version(name, version, "Staging", "Production")

  12. How would you archive a specific version of a registered model using the MLflow client? A) client.archive_model_version(name, version) B) client.set_model_version_archived(name, version) C) client.update_model_version(name, version, status="ARCHIVED") D) client.transition_model_version_stage(name, version, "Archived")

  13. What is the purpose of model aliases in MLflow? A) To provide user-friendly names for complex models B) To create named references to specific model versions C) To specify which flavor to use when loading a model D) To categorize models by their intended use

  14. Which method would you use to add a tag to a specific model version? A) client.add_model_version_tag(name, version, key, value) B) client.tag_model_version(name, version, key, value) C) client.set_model_version_tag(name, version, key, value) D) client.update_model_version(name, version, tags={key: value})

  15. What happens when you create a webhook for MODEL_VERSION_CREATED events? A) It triggers when any model version is created in the workspace B) It triggers only when versions of the specified model are created C) It triggers when a model version is created or updated D) It triggers only when a model version is created by the webhook owner

  16. Which type of compute is most appropriate for a CI/CD pipeline that automatically tests new model versions? A) All-purpose cluster B) Single-node cluster C) Job cluster D) Interactive cluster

  17. What is the primary advantage of deploying a model with spark_udf() over a pandas UDF? A) It enables GPU acceleration B) It distributes model inference across the cluster C) It allows for more complex preprocessing D) It reduces model serving latency

  18. Which deployment pattern is most appropriate for a model that needs to provide product recommendations when a user visits a webpage? A) Batch inference B) Streaming inference C) Real-time serving D) Embedded inference

  19. What is the purpose of a checkpointLocation in a Structured Streaming application? A) To cache model predictions for faster retrieval B) To save the streaming state for fault tolerance C) To record model metrics for monitoring D) To store a copy of the model for rollback

  20. When deploying a model for batch inference, which optimization technique would MOST improve query performance on specific customer segments? A) OPTIMIZE with bin-packing B) VACUUM to remove stale files C) Z-ORDER by customer segment columns D) Partitioning by random IDs

  21. What is the benefit of using Feature Store with real-time model serving? A) It automatically updates feature values in real-time B) It provides low-latency access to pre-computed features C) It eliminates the need for feature transformations D) It improves model accuracy for real-time predictions

  22. Which deployment pattern uses the least amount of resources for a model that needs to generate predictions once per week? A) Continuous streaming deployment B) Always-on real-time serving C) Scheduled batch inference with job clusters D) On-demand batch inference with all-purpose clusters

  23. What is watermarking used for in a Structured Streaming application? A) To validate data authenticity B) To manage state for late-arriving data C) To encrypt sensitive data D) To compress streaming output

  24. Which approach would you use to efficiently apply a model to a streaming source with millions of events per second? A) foreachBatch with spark_udf B) Real-time model serving endpoints C) Process each event individually with pandas UDFs D) Queue events and process in micro-batches

  25. What type of drift occurs when the overall statistical distribution of input features changes? A) Concept drift B) Label drift C) Feature drift D) Model drift

  26. Which statistical test is most appropriate for detecting drift in a categorical feature with many unique values? A) t-test B) Chi-square test C) Kolmogorov-Smirnov test D) Mann-Whitney U test

  27. What is a limitation of using basic summary statistics (mean, median, etc.) for drift detection? A) They require too much computation B) They can miss changes in distribution shape while means remain similar C) They don’t work with large datasets D) They can only be applied to numerical features

  28. When monitoring a production model, which metric should trigger immediate investigation if it suddenly increases? A) Model inference latency B) Number of unique users C) Prediction error rate compared to baseline D) Model version number

  29. What is the Jensen-Shannon divergence used for in model monitoring? A) Measuring the computational efficiency of a model B) Quantifying the difference between probability distributions C) Evaluating model convergence during training D) Calculating feature importance scores

  30. Which of the following is NOT typically a sign of concept drift? A) Stable error metrics but changing feature distributions B) Degrading error metrics with stable feature distributions C) Changes in both feature distributions and error metrics D) Increasing inference latency with no changes in data

  31. What is a recommended approach for handling missing values when monitoring categorical features for drift? A) Treat missing values as a separate category B) Ignore records with missing values C) Impute missing values with the mode D) Replace missing values with a fixed string

  32. Which metric would be LEAST useful for detecting label drift? A) Jensen-Shannon divergence of label distributions B) Chi-square test on label frequencies C) Percentage of missing labels D) Model training time

  33. What is the primary benefit of implementing feature monitoring in the Feature Store? A) It automatically prevents drift B) It detects changes in feature distributions over time C) It optimizes feature computation D) It improves feature selection

  34. How would you implement a model that requires both batch inference for daily reports and real-time inference for user interactions? A) Create two separate models with different code B) Deploy the same model artifact through both batch and real-time patterns C) Use only streaming inference as a compromise D) Convert all use cases to batch processing

  35. What is the correct way to log a confusion matrix visualization to MLflow? A) mlflow.log_confusion_matrix(cm) B) mlflow.log_figure(fig) C) mlflow.log_metrics({"confusion_matrix": cm}) D) mlflow.log_artifact("/tmp/confusion_matrix.png")

  36. Which approach is most efficient for deploying a model that needs to process streaming data and update aggregated metrics in real-time? A) Real-time serving with REST endpoints B) Structured Streaming with stateful processing C) Batch inference run every minute D) Lambda architecture with separate batch and speed layers

  37. What is a key benefit of using Delta tables to store model predictions? A) They automatically optimize model accuracy B) They provide ACID transactions and time travel capabilities C) They eliminate the need for feature engineering D) They automatically detect data drift

  38. What would you use to track the lineage of a feature in the Feature Store? A) Delta table history B) MLflow tags C) Feature Store metadata and descriptions D) Webhook event logs

  39. When evaluating a model for concept drift, what approach provides the most direct evidence? A) Comparing feature distributions between training and production B) Monitoring model prediction distributions C) Measuring model error on recent labeled data D) Tracking inference latency

  40. Which deployment strategy requires the least code modification when converting from batch to streaming inference? A) Using foreachBatch with the same processing logic B) Reimplementing with Kafka consumers C) Creating a new REST API endpoint D) Implementing a custom streaming sink

  41. What does a model signature in MLflow help prevent? A) Unauthorized model access B) Model versioning conflicts C) Input schema mismatches during inference D) Overwriting production models

  42. Which approach would you use to handle a model with different preprocessing requirements for different feature types? A) Create separate models for each feature type B) Implement a custom PyFunc model with conditional preprocessing C) Use only features that share the same preprocessing D) Convert all features to the same type first

  43. What is a best practice for organizing experiments in MLflow for a data science team? A) Use a single experiment for all team members B) Create separate experiments for each model iteration C) Organize experiments by business problem or use case D) Create a new experiment for each data scientist

  44. How would you implement a comprehensive monitoring solution that detects both data drift and model performance issues? A) Monitor only prediction distributions B) Track feature distributions and model error metrics separately C) Compare model outputs between versions D) Run periodic A/B tests in production

  45. Which strategy should you use when promoting a model to production if it performs better on some metrics but worse on others? A) Always prioritize the model with better accuracy B) Evaluate metrics based on business impact and prioritize accordingly C) Create an ensemble of both models D) Keep the current model to avoid any regression

  46. What is the correct approach to handle evolving schemas in feature tables? A) Create a new feature table for each schema version B) Use Delta Lake schema evolution capabilities C) Maintain fixed schemas and reject new columns D) Convert all data to string type to avoid schema issues

  47. Which method would you use to log multiple evaluation metrics for a model at once in MLflow? A) Call mlflow.log_metric() for each metric B) mlflow.log_metrics(metrics_dict) C) mlflow.log_dict(metrics_dict, “metrics.json”) D) mlflow.sklearn.log_metrics(metrics_dict)

  48. What is a key benefit of using job clusters over all-purpose clusters for production ML workflows? A) They support more concurrent users B) They provide better interactive development experiences C) They are optimized for cost by terminating when jobs complete D) They allow for GPU acceleration

  49. Which feature in Delta Lake is most important for implementing reproducible ML pipelines? A) Schema enforcement B) Time travel C) ACID transactions D) Z-ordering

  50. How would you ensure that a model’s feature preprocessing is consistent between training and inference? A) Document the preprocessing steps carefully B) Include preprocessing in the model pipeline or custom PyFunc class C) Apply preprocessing in the application code D) Use separate preprocessing services

  51. What is the recommended way to handle missing values in categorical features for a model in production? A) Drop records with missing values B) Apply the same imputation strategy used during training C) Return an error for records with missing values D) Use a default value not seen in training

  52. What is the most appropriate response when detecting significant feature drift in a production model? A) Immediately roll back to a previous model version B) Assess impact on model performance and retrain if needed C) Add more features to compensate for the drift D) Switch to a more complex model architecture

  53. Which approach allows for the most efficient querying of model predictions stored in a Delta table? A) Using the latest version of the table B) Partitioning by frequently filtered columns and Z-ordering C) Converting to Parquet format D) Using Databricks SQL endpoints

  54. When implementing a feature table that will be used for both batch and real-time inference, what should you consider? A) Using only numerical features for consistency B) Limiting the number of features to improve latency C) Publishing the table for online serving D) Creating separate tables for batch and real-time use

  55. What is the primary purpose of the Model Registry in an MLOps workflow? A) To track experiments and hyperparameters B) To organize model versions and transitions between stages C) To store model artifacts D) To monitor model performance in production

  56. What does the score_batch method in the Feature Store client do? A) Calculates feature importance scores B) Retrieves features and applies a model in one operation C) Evaluates model performance on batches of data D) Computes batch statistics for monitoring

  57. Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? A) Delta Lake time travel B) Webhooks and scheduled jobs C) Feature Store table partitioning D) Custom model flavors

  58. What should you do before promoting a new model version to production? A) Always retrain on the most recent data B) Validate performance against a holdout dataset C) Increase the model complexity D) Add more features to improve accuracy

  59. What is a key consideration when designing a feature computation pipeline that needs to support both batch and streaming inference? A) Use only features that can be computed in real-time B) Implement feature logic that works in both paradigms C) Optimize exclusively for batch performance D) Create separate implementations for each pattern

  60. Which approach is most effective for detecting concept drift in a model where ground truth labels are available with a delay? A) Monitor feature distributions only B) Calculate error metrics once labels become available C) Use unsupervised drift detection methods D) Compare model prediction distributions over time

Practice Exam 2: Answers and Explanations

Here are the answers to the second practice exam, along with explanations for each question:

  1. What is the correct way to create a Feature Store table? Answer: A) fs.create_table(name, primary_keys, df, description)

    Explanation: The correct method for creating a feature table is create_table() with parameters for the table name, primary keys, DataFrame, and optional description.
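
    A minimal sketch of this call, assuming a Databricks notebook with the Feature Store client installed and a hypothetical customer_features_df keyed by customer_id:

    from databricks.feature_store import FeatureStoreClient

    fs = FeatureStoreClient()

    # Register the DataFrame as a feature table with its primary key and a description
    fs.create_table(
        name="ml.customer_features",        # hypothetical schema.table name
        primary_keys=["customer_id"],
        df=customer_features_df,
        description="Aggregated customer behavior features",
    )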

  2. How do you restore a Delta table to a specific timestamp? Answer: A) spark.read.format("delta").option("timestampAsOf", timestamp).load(path).write.format("delta").mode("overwrite").save(path)

    Explanation: Reading the table with the timestampAsOf option and then overwriting the current table restores it to that point in time; this read-and-overwrite pattern works on any Delta Lake version, even where a dedicated RESTORE command is unavailable.
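
    A minimal sketch of this read-and-overwrite pattern; the table path and timestamp are hypothetical placeholders:

    # Read the table as it existed at the target timestamp
    snapshot = (spark.read.format("delta")
                .option("timestampAsOf", "2024-01-15 00:00:00")
                .load("/mnt/delta/predictions"))

    # Overwrite the current table with that snapshot to roll it back
    (snapshot.write.format("delta")
     .mode("overwrite")
     .save("/mnt/delta/predictions"))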

  3. Which MLflow method can automatically log parameters, metrics, and artifacts without explicit logging statements? Answer: C) mlflow.autolog()

    Explanation: mlflow.autolog() enables automatic logging of parameters, metrics, and artifacts for supported libraries without requiring explicit logging statements.
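
    A minimal, self-contained sketch of autologging with scikit-learn (synthetic data used only for illustration):

    import mlflow
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=500, n_features=5, random_state=42)

    mlflow.autolog()  # no explicit log_param / log_metric / log_model calls needed

    with mlflow.start_run():
        RandomForestRegressor(n_estimators=100).fit(X, y)  # params, metrics, and the model are logged automatically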

  4. What does the “version” field in the Delta table history tell you? Answer: B) The version number of each transaction in sequence

    Explanation: The version field in Delta table history represents the sequential version number of each transaction on the table, starting from 0 for the initial creation.

  5. How would you share features from a feature table with a specific model during training? Answer: B) fs.create_training_set(df, feature_lookups, label)

    Explanation: The create_training_set method creates a training dataset by joining features from feature tables with a DataFrame containing the entity keys and labels.
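
    A minimal sketch, assuming a hypothetical labels_df with customer_id and churn columns and the feature table from question 1:

    from databricks.feature_store import FeatureLookup, FeatureStoreClient

    fs = FeatureStoreClient()

    lookups = [
        FeatureLookup(
            table_name="ml.customer_features",
            feature_names=["total_purchases", "days_since_last_login"],  # hypothetical features
            lookup_key="customer_id",
        )
    ]

    # Join the labels with the looked-up features to build the training DataFrame
    training_set = fs.create_training_set(df=labels_df, feature_lookups=lookups, label="churn")
    training_df = training_set.load_df()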

  6. What information does MLflow NOT automatically capture when autologging is enabled? Answer: D) Business context for the model

    Explanation: Autologging captures technical information like parameters, metrics, and artifacts, but cannot capture business context, which must be added manually.

  7. Which operation is NOT possible with Delta Lake? Answer: D) Real-time streaming without micro-batching

    Explanation: Delta Lake supports streaming through Spark Structured Streaming, which uses a micro-batch architecture. True “real-time” streaming without micro-batching is not supported.

  8. What is the purpose of specifying input_example when logging a model in MLflow? Answer: B) To provide an example for documentation and inference testing

    Explanation: An input example provides a sample input that can be used for documentation, inference testing, and as a reference for what the model expects.

  9. What is the purpose of a custom PyFunc model class in MLflow? Answer: B) To include custom preprocessing and postprocessing with the model

    Explanation: A custom PyFunc model allows you to include preprocessing and postprocessing logic with the model, ensuring consistent application during inference.
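
    A minimal sketch of a custom PyFunc wrapper that bundles preprocessing with the model; the standardization logic is illustrative only:

    import mlflow.pyfunc

    class PreprocessedModel(mlflow.pyfunc.PythonModel):
        def __init__(self, model, feature_means, feature_stds):
            self.model = model
            self.feature_means = feature_means
            self.feature_stds = feature_stds

        def predict(self, context, model_input):
            # Apply the same standardization used at training time, then score
            scaled = (model_input - self.feature_means) / self.feature_stds
            return self.model.predict(scaled)

    # Logged like any other model, e.g.:
    # mlflow.pyfunc.log_model("model", python_model=PreprocessedModel(model, means, stds))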

  10. Which MLflow Model Registry stage is a model automatically assigned when first registered? Answer: B) None

    Explanation: Newly registered models start in the “None” stage by default, and must be explicitly transitioned to other stages like Staging or Production.

  11. What method would you use to transition a model version from Staging to Production programmatically? Answer: B) client.transition_model_version_stage(name, version, "Production")

    Explanation: The transition_model_version_stage method is used to move a model version between registry stages.

  12. How would you archive a specific version of a registered model using the MLflow client? Answer: D) client.transition_model_version_stage(name, version, "Archived")

    Explanation: To archive a model version, you transition it to the “Archived” stage using the same method used for other stage transitions.
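
    A minimal sketch covering both transitions, assuming a hypothetical registered model named "churn_model":

    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # Promote version 3 from Staging to Production
    client.transition_model_version_stage(name="churn_model", version="3", stage="Production")

    # Archive the previously deployed version 2
    client.transition_model_version_stage(name="churn_model", version="2", stage="Archived")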

  13. What is the purpose of model aliases in MLflow? Answer: B) To create named references to specific model versions

    Explanation: Model aliases provide named references to specific model versions, allowing for more flexible development and deployment workflows.

  14. Which method would you use to add a tag to a specific model version? Answer: C) client.set_model_version_tag(name, version, key, value)

    Explanation: The set_model_version_tag method is used to add or update a tag on a specific version of a registered model.

  15. What happens when you create a webhook for MODEL_VERSION_CREATED events? Answer: B) It triggers only when versions of the specified model are created

    Explanation: Webhooks for MODEL_VERSION_CREATED events trigger only when new versions of the specified model are created, not for all models.

  16. Which type of compute is most appropriate for a CI/CD pipeline that automatically tests new model versions? Answer: C) Job cluster

    Explanation: Job clusters are designed for automated workloads like CI/CD pipelines. They start when a job begins and terminate when it completes, optimizing costs.

  17. What is the primary advantage of deploying a model with spark_udf() over a pandas UDF? Answer: B) It distributes model inference across the cluster

    Explanation: spark_udf() efficiently distributes model inference across a Spark cluster, allowing for better parallelization and throughput.
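
    A minimal sketch of distributed batch scoring, assuming a hypothetical registered model URI and an input_df whose columns match the model's features:

    import mlflow.pyfunc
    from pyspark.sql.functions import struct

    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

    # Each executor loads the model once and scores its partitions in parallel
    scored_df = input_df.withColumn("prediction", predict_udf(struct(*input_df.columns)))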

  18. Which deployment pattern is most appropriate for a model that needs to provide product recommendations when a user visits a webpage? Answer: C) Real-time serving

    Explanation: Real-time serving is appropriate for use cases requiring immediate predictions with low latency, such as providing recommendations when a user visits a webpage.

  19. What is the purpose of a checkpointLocation in a Structured Streaming application? Answer: B) To save the streaming state for fault tolerance

    Explanation: The checkpointLocation saves the state of a streaming query, enabling fault tolerance and exactly-once processing semantics.

  20. When deploying a model for batch inference, which optimization technique would MOST improve query performance on specific customer segments? Answer: C) Z-ORDER by customer segment columns

    Explanation: Z-ordering by customer segment columns co-locates related data, improving query performance when filtering or aggregating by those segments.
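
    A minimal sketch, assuming a hypothetical predictions table stored at a Delta path with a customer_segment column:

    # Compact small files and co-locate rows that share the same segment values
    spark.sql("""
        OPTIMIZE delta.`/mnt/delta/predictions`
        ZORDER BY (customer_segment)
    """)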

  21. What is the benefit of using Feature Store with real-time model serving? Answer: B) It provides low-latency access to pre-computed features

    Explanation: Feature Store provides low-latency access to pre-computed features via its online store, which is critical for real-time serving.

  22. Which deployment pattern uses the least amount of resources for a model that needs to generate predictions once per week? Answer: C) Scheduled batch inference with job clusters

    Explanation: Scheduled batch inference with job clusters that automatically terminate after completion is the most resource-efficient for weekly predictions.

  23. What is watermarking used for in a Structured Streaming application? Answer: B) To manage state for late-arriving data

    Explanation: Watermarking defines how late data can arrive and still be processed, helping manage state for stateful operations in streaming applications.
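
    A minimal sketch of a watermarked windowed aggregation, assuming a hypothetical streaming events_df with event_time and user_id columns:

    from pyspark.sql.functions import window, count

    hourly_counts = (events_df
        .withWatermark("event_time", "30 minutes")   # state for data arriving more than 30 minutes late is dropped
        .groupBy(window("event_time", "1 hour"), "user_id")
        .agg(count("*").alias("events")))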

  24. Which approach would you use to efficiently apply a model to a streaming source with millions of events per second? Answer: A) foreachBatch with spark_udf

    Explanation: Using foreachBatch with spark_udf allows for efficient batch processing of events within a streaming context, which is ideal for high-throughput scenarios.
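
    A minimal sketch combining foreachBatch with a spark_udf, assuming a hypothetical streaming events_df, the predict_udf from question 17's sketch, and placeholder output and checkpoint paths:

    from pyspark.sql.functions import struct

    def score_batch(batch_df, batch_id):
        # Score each micro-batch with the distributed UDF and append the results to Delta
        (batch_df.withColumn("prediction", predict_udf(struct(*batch_df.columns)))
         .write.format("delta").mode("append").save("/mnt/delta/scored_events"))

    query = (events_df.writeStream
             .foreachBatch(score_batch)
             .option("checkpointLocation", "/mnt/checkpoints/scoring")  # fault tolerance for the stream
             .start())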

  25. What type of drift occurs when the overall statistical distribution of input features changes? Answer: C) Feature drift

    Explanation: Feature drift specifically refers to changes in the statistical distribution of input features over time.

  26. Which statistical test is most appropriate for detecting drift in a categorical feature with many unique values? Answer: B) Chi-square test

    Explanation: The Chi-square test is appropriate for comparing categorical distributions, including those with many unique values.

  27. What is a limitation of using basic summary statistics (mean, median, etc.) for drift detection? Answer: B) They can miss changes in distribution shape while means remain similar

    Explanation: Summary statistics can miss important changes in distribution shape or variance if central tendencies like the mean remain similar.

  28. When monitoring a production model, which metric should trigger immediate investigation if it suddenly increases? Answer: C) Prediction error rate compared to baseline

    Explanation: A sudden increase in prediction error compared to baseline indicates a potential model degradation that requires immediate investigation.

  29. What is the Jensen-Shannon divergence used for in model monitoring? Answer: B) Quantifying the difference between probability distributions

    Explanation: Jensen-Shannon divergence is a method to measure the similarity between two probability distributions, making it useful for detecting drift.
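
    A minimal sketch with SciPy, comparing two hypothetical samples binned over a shared range (note that scipy returns the Jensen-Shannon distance, the square root of the divergence):

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    reference = np.random.normal(loc=0.0, scale=1.0, size=10_000)   # e.g., training-time feature values
    current = np.random.normal(loc=0.3, scale=1.2, size=10_000)     # e.g., recent production values

    # Bin both samples over the same edges so the histograms are comparable
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=50)
    p, _ = np.histogram(reference, bins=edges, density=True)
    q, _ = np.histogram(current, bins=edges, density=True)

    js_distance = jensenshannon(p, q)   # 0 means identical distributions; larger values suggest drift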

  30. Which of the following is NOT typically a sign of concept drift? Answer: D) Increasing inference latency with no changes in data

    Explanation: Increasing inference latency without data changes is typically an infrastructure or resource issue, not concept drift, which involves changes in the relationship between features and targets.

  31. What is a recommended approach for handling missing values when monitoring categorical features for drift? Answer: A) Treat missing values as a separate category

    Explanation: Treating missing values as a separate category allows you to monitor changes in the rate of missingness, which can be an important indicator of drift.

  32. Which metric would be LEAST useful for detecting label drift? Answer: D) Model training time

    Explanation: Model training time has no direct relationship to label drift, which concerns changes in the distribution of target variables.

  33. What is the primary benefit of implementing feature monitoring in the Feature Store? Answer: B) It detects changes in feature distributions over time

    Explanation: Feature monitoring in the Feature Store helps detect changes in feature distributions over time, which is critical for maintaining model performance.

  34. How would you implement a model that requires both batch inference for daily reports and real-time inference for user interactions? Answer: B) Deploy the same model artifact through both batch and real-time patterns

    Explanation: The best practice is to use the same model artifact deployed through different patterns (batch and real-time) to ensure consistency.

  35. What is the correct way to log a confusion matrix visualization to MLflow? Answer: D) mlflow.log_artifact("/tmp/confusion_matrix.png")

    Explanation: To log a visualization, you typically save it as an image file and then use log_artifact to add it to the MLflow run.
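
    A minimal sketch: render the matrix with matplotlib, save it to a temporary file, and attach it to the run; y_true and y_pred are hypothetical label arrays:

    import mlflow
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    fig, ax = plt.subplots()
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, ax=ax)
    fig.savefig("/tmp/confusion_matrix.png")

    with mlflow.start_run():
        mlflow.log_artifact("/tmp/confusion_matrix.png")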

  36. Which approach is most efficient for deploying a model that needs to process streaming data and update aggregated metrics in real-time? Answer: B) Structured Streaming with stateful processing

    Explanation: Structured Streaming with stateful processing is designed for efficiently processing streaming data and maintaining real-time aggregations.

  37. What is a key benefit of using Delta tables to store model predictions? Answer: B) They provide ACID transactions and time travel capabilities

    Explanation: Delta tables provide ACID transactions and time travel capabilities, ensuring data reliability and enabling historical analysis.

  38. What would you use to track the lineage of a feature in the Feature Store? Answer: C) Feature Store metadata and descriptions

    Explanation: Feature Store metadata and descriptions are used to document and track the lineage, sources, and transformations of features.

  39. When evaluating a model for concept drift, what approach provides the most direct evidence? Answer: C) Measuring model error on recent labeled data

    Explanation: Measuring model error on recent labeled data directly evaluates whether the relationship between features and target has changed.

  40. Which deployment strategy requires the least code modification when converting from batch to streaming inference? Answer: A) Using foreachBatch with the same processing logic

    Explanation: The foreachBatch method allows you to reuse batch processing logic within a streaming context with minimal code changes.

  41. What does a model signature in MLflow help prevent? Answer: C) Input schema mismatches during inference

    Explanation: A model signature defines the expected input and output schemas, helping prevent mismatches during inference.
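
    A minimal sketch of inferring and attaching a signature when logging a scikit-learn model; model and X_train are hypothetical:

    import mlflow
    import mlflow.sklearn
    from mlflow.models.signature import infer_signature

    signature = infer_signature(X_train, model.predict(X_train))

    with mlflow.start_run():
        mlflow.sklearn.log_model(model, "model",
                                 signature=signature,
                                 input_example=X_train.head(5))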

  42. Which approach would you use to handle a model with different preprocessing requirements for different feature types? Answer: B) Implement a custom PyFunc model with conditional preprocessing

    Explanation: A custom PyFunc model allows you to implement conditional preprocessing logic based on feature types.

  43. What is a best practice for organizing experiments in MLflow for a data science team? Answer: C) Organize experiments by business problem or use case

    Explanation: Organizing experiments by business problem or use case helps maintain clarity and focus in the team’s workflow.

  44. How would you implement a comprehensive monitoring solution that detects both data drift and model performance issues? Answer: B) Track feature distributions and model error metrics separately

    Explanation: A comprehensive solution tracks both feature distributions (to detect data drift) and model error metrics (to detect performance issues).

  45. Which strategy should you use when promoting a model to production if it performs better on some metrics but worse on others? Answer: B) Evaluate metrics based on business impact and prioritize accordingly

    Explanation: The best approach is to evaluate the business impact of each metric and prioritize those that align with the business goals.

  46. What is the correct approach to handle evolving schemas in feature tables? Answer: B) Use Delta Lake schema evolution capabilities

    Explanation: Delta Lake’s schema evolution capabilities allow you to add, modify, or remove columns while maintaining backward compatibility.

  47. Which method would you use to log multiple evaluation metrics for a model at once in MLflow? Answer: B) mlflow.log_metrics(metrics_dict)

    Explanation: The log_metrics method allows you to log multiple metrics at once by passing a dictionary of metric names and values.
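
    A minimal sketch with a hypothetical metrics dictionary:

    import mlflow

    with mlflow.start_run():
        mlflow.log_metrics({"rmse": 0.82, "mae": 0.61, "r2": 0.74})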

  48. What is a key benefit of using job clusters over all-purpose clusters for production ML workflows? Answer: C) They are optimized for cost by terminating when jobs complete

    Explanation: Job clusters automatically terminate when a job completes, optimizing costs by not consuming resources when not needed.

  49. Which feature in Delta Lake is most important for implementing reproducible ML pipelines? Answer: B) Time travel

    Explanation: Time travel allows you to access specific versions of data, ensuring reproducibility in ML pipelines by using consistent data snapshots.
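
    A minimal sketch of pinning a training run to a fixed table version and recording that version, assuming a hypothetical feature table path:

    import mlflow

    data_version = 42   # hypothetical Delta table version used for this run
    train_df = (spark.read.format("delta")
                .option("versionAsOf", data_version)
                .load("/mnt/delta/customer_features"))

    with mlflow.start_run():
        mlflow.log_param("training_data_version", data_version)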

  50. How would you ensure that a model’s feature preprocessing is consistent between training and inference? Answer: B) Include preprocessing in the model pipeline or custom PyFunc class

    Explanation: Including preprocessing in the model pipeline or custom class ensures the same steps are applied consistently during both training and inference.

  51. What is the recommended way to handle missing values in categorical features for a model in production? Answer: B) Apply the same imputation strategy used during training

    Explanation: Consistency is key: you should apply the same imputation strategy used during training to ensure the model receives similarly prepared data in production.

  52. What is the most appropriate response when detecting significant feature drift in a production model? Answer: B) Assess impact on model performance and retrain if needed

    Explanation: The appropriate response is to first assess whether the drift impacts model performance, and then retrain if necessary.

  53. Which approach allows for the most efficient querying of model predictions stored in a Delta table? Answer: B) Partitioning by frequently filtered columns and Z-ordering

    Explanation: Partitioning by frequently filtered columns combined with Z-ordering provides the most efficient query performance.

  54. When implementing a feature table that will be used for both batch and real-time inference, what should you consider? Answer: C) Publishing the table for online serving

    Explanation: For real-time inference, the feature table needs to be published for online serving to enable low-latency access.

  55. What is the primary purpose of the Model Registry in an MLOps workflow? Answer: B) To organize model versions and transitions between stages

    Explanation: The Model Registry’s primary purpose is to organize model versions and manage transitions between stages (None, Staging, Production, Archived).

  56. What does the score_batch method in the Feature Store client do? Answer: B) Retrieves features and applies a model in one operation

    Explanation: The score_batch method efficiently retrieves features from the Feature Store and applies a model in a single operation.
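
    A minimal sketch, assuming a hypothetical batch_df containing only the lookup keys and a model that was logged with Feature Store metadata:

    from databricks.feature_store import FeatureStoreClient

    fs = FeatureStoreClient()

    # Features are looked up from the feature tables recorded with the model, then scored in one call
    predictions_df = fs.score_batch("models:/churn_model/Production", batch_df)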

  57. Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? Answer: B) Webhooks and scheduled jobs

    Explanation: Webhooks and scheduled jobs provide the automation backbone needed to trigger retraining and deployment when drift is detected.

  58. What should you do before promoting a new model version to production? Answer: B) Validate performance against a holdout dataset

    Explanation: Validating performance against a holdout dataset is a critical step to ensure the new model performs as expected before promotion.

  59. What is a key consideration when designing a feature computation pipeline that needs to support both batch and streaming inference? Answer: B) Implement feature logic that works in both paradigms

    Explanation: The feature logic should be implemented so that it produces the same results in both the batch and streaming paradigms, keeping features consistent across deployment patterns.

  60. Which approach is most effective for detecting concept drift in a model where ground truth labels are available with a delay? Answer: B) Calculate error metrics once labels become available

    Explanation: When ground truth labels are available (even with a delay), calculating error metrics provides the most direct measure of concept drift.

Score Assessment

  • 54-60 correct: Excellent! You’re very well prepared for the certification exam.
  • 48-53 correct: Good understanding with a few areas to review.
  • 42-47 correct: Solid foundation, but targeted review is needed in several areas.
  • Below 42 correct: More intensive review needed before taking the certification exam.

Exam Day Tips

As you prepare for the actual exam, keep these strategies in mind:

  1. Time Management: You have about 2 minutes per question. Don’t spend too long on any single question.
  2. First Pass Strategy: Answer the questions you’re confident about first, then return to the more difficult ones.
  3. Process of Elimination: For difficult questions, eliminate obviously wrong answers to improve your chances.
  4. Read Carefully: Pay attention to qualifiers like “NOT,” “BEST,” “LEAST,” etc., which can completely change the meaning of a question.
  5. Contextual Clues: Look for hints in the question that point to specific domains or technologies.
  6. Trust Your Preparation: By this point, you’ve covered all the material comprehensively.