Delta Lake is a storage layer that brings reliability to data lakes. As an ML engineer, you need to understand its capabilities for managing ML data.
Key Concepts:
Essential Operations:
The Databricks Feature Store is a centralized repository for managing and sharing ML features.
Key Concepts:
Essential Operations:
MLflow provides tools for experiment tracking, reproducibility, and model management.
Key Concepts:
Essential Operations:
# 1. Create a Delta table
from pyspark.sql.functions import col
data = spark.range(0, 1000).withColumn("square", col("id") * col("id"))
data.write.format("delta").save("/path/to/delta-table")
# 2. Read from Delta table
df = spark.read.format("delta").load("/path/to/delta-table")
# 3. Update Delta table (append new data)
new_data = spark.range(1000, 2000).withColumn("square", col("id") * col("id"))
new_data.write.format("delta").mode("append").save("/path/to/delta-table")
# 4. View table history
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
history = delta_table.history()
display(history)
# 5. Time travel (load previous version)
previous_df = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta-table")
# 6. Optimize table (Z-ordering)
spark.sql("OPTIMIZE delta.`/path/to/delta-table` ZORDER BY (id)")
# 1. Initialize Feature Store client
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# 2. Create a feature table
from databricks.feature_store import feature_table
features_df = spark.read.format("delta").load("/path/to/data")
fs.create_table(
name="customer_features",
primary_keys=["customer_id"],
df=features_df,
description="Customer features for churn prediction"
)
# 3. Write to an existing feature table
fs.write_table(
name="customer_features",
df=updated_features_df,
mode="merge" # Supports "overwrite" and "merge"
)
# 4. Read from a feature table
features = fs.read_table(
name="customer_features"
)
# 5. Use features in model training
from databricks.feature_store import FeatureLookup
feature_lookups = [
FeatureLookup(
table_name="customer_features",
feature_names=["feature1", "feature2", "feature3"],
lookup_key="customer_id"
)
]
training_data = fs.create_training_set(
df=training_labels_df,
feature_lookups=feature_lookups,
label="churn"
)
# Get the training DataFrame
training_df = training_data.load_df()
# 1. Basic MLflow tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Start a run
with mlflow.start_run(run_name="rf-classifier") as run:
# Log parameters
mlflow.log_param("n_estimators", 100)
mlflow.log_param("max_depth", 5)
# Train model
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
# Log metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
mlflow.log_metric("accuracy", accuracy)
# Log model
mlflow.sklearn.log_model(model, "model")
# Get run ID for later reference
run_id = run.info.run_id
# 2. Advanced MLflow tracking with signatures and input examples
import pandas as pd
from mlflow.models.signature import infer_signature
# Create an input example
input_example = X_train.iloc[0:5]
# Infer the model signature from training inputs and the model's outputs on them
signature = infer_signature(X_train, model.predict(X_train))
# Log model with signature and input example
with mlflow.start_run(run_name="rf-with-signature") as run:
mlflow.sklearn.log_model(
model,
"model",
signature=signature,
input_example=input_example
)
# 3. Working with nested runs
with mlflow.start_run(run_name="parent-run") as parent_run:
mlflow.log_param("parent_param", "parent_value")
# Create child runs for different model variations
for n_estimators in [50, 100, 150]:
with mlflow.start_run(run_name=f"child-run-{n_estimators}", nested=True) as child_run:
mlflow.log_param("n_estimators", n_estimators)
# Train and log model details
model = RandomForestClassifier(n_estimators=n_estimators)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
mlflow.log_metric("accuracy", accuracy)
mlflow.sklearn.log_model(model, "model")
What Delta Lake operation would you use to eliminate small files and optimize performance? A) VACUUM B) OPTIMIZE C) COMPACT D) CLEAN
In the Feature Store, what mode should you use when writing updates to an existing feature table for specific keys? A) “overwrite” B) “append” C) “merge” D) “update”
How would you access a Delta table’s version from 3 days ago? A) Using option(“versionAsOf”, X) B) Using option(“timestampAsOf”, timestamp) C) Using timeTravel(days=3) D) Using history().filter(days=3)
When tracking experiments with MLflow, what does setting nested=True allow you to do? A) Create hierarchical runs for different model configurations B) Nest models inside each other C) Create hierarchical storage of artifacts D) Track nested parameters in dictionaries
Which of the following is NOT a benefit of using the Databricks Feature Store? A) Feature sharing across teams B) Automatic feature selection C) Feature discovery D) Point-in-time lookups
What information is provided by a model signature in MLflow? A) The author of the model B) The input and output schema of the model C) Digital signature verifying model authenticity D) The model’s architecture details
Which MLflow tracking function would you use to save metadata about a trained model? A) log_artifact() B) log_metric() C) log_param() D) log_tags()
What operation can you perform to go back to a previous version of a Delta table? A) table.rollback(version=N) B) RESTORE TABLE to version N C) Read with option(“versionAsOf”, N) and rewrite D) History and select previous version
When using the FeatureStoreClient to create a training set, what does the FeatureLookup parameter do? A) Searches for the best features automatically B) Specifies which features to retrieve and how to join them C) Looks up feature importance scores D) Creates a lookup table for the features
How can you programmatically retrieve the metrics from a previous MLflow run? A) mlflow.get_run(run_id).data.metrics B) mlflow.search_runs(experiment_id) C) mlflow.get_metrics(run_id) D) mlflow.runs.get_metrics(run_id)
What Delta Lake operation would you use to eliminate small files and optimize performance? Answer: B) OPTIMIZE
Explanation: The OPTIMIZE command rewrites small files into larger ones to improve read performance. VACUUM removes old file versions but doesn’t consolidate small files. COMPACT isn’t a standard Delta operation, and CLEAN doesn’t exist in Delta Lake.
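For reference, a minimal sketch of the two commands side by side (the table path is a placeholder):
# Compact small files into larger ones to speed up reads
spark.sql("OPTIMIZE delta.`/path/to/delta-table`")
# Remove data files no longer referenced by the current table version (default retention is 7 days)
spark.sql("VACUUM delta.`/path/to/delta-table` RETAIN 168 HOURS")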
In the Feature Store, what mode should you use when writing updates to an existing feature table for specific keys? Answer: C) “merge”
Explanation: The “merge” mode updates existing records based on primary keys and inserts new ones. “Overwrite” would replace the entire table, “append” would add duplicate records, and “update” isn’t a standard mode in Feature Store.
How would you access a Delta table’s version from 3 days ago? Answer: B) Using option(“timestampAsOf”, timestamp)
Explanation: To access a version from a specific time point, you use the “timestampAsOf” option with a timestamp value. “versionAsOf” is used for specific version numbers, not time periods. The other options don’t exist in Delta Lake.
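As a quick sketch (path and timestamp are placeholders), time-based time travel looks like this:
df_three_days_ago = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01 00:00:00") \
    .load("/path/to/delta-table")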
When tracking experiments with MLflow, what does setting nested=True allow you to do? Answer: A) Create hierarchical runs for different model configurations
Explanation: The nested=True parameter allows you to create child runs within a parent run, which is useful for organizing related experiments (like testing different hyperparameters of the same model type).
Which of the following is NOT a benefit of using the Databricks Feature Store? Answer: B) Automatic feature selection
Explanation: Feature Store doesn’t automatically select optimal features for your models. It does provide feature sharing, discovery (finding features in the organization), and point-in-time lookups (accessing feature values as they were at a specific time).
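For point-in-time lookups specifically, here is a minimal sketch (table, key, and timestamp column names are illustrative, and it assumes the feature table was defined with a timestamp key) that passes a timestamp lookup key so each training row receives feature values as of its event time:
from databricks.feature_store import FeatureStoreClient, FeatureLookup
fs = FeatureStoreClient()
point_in_time_lookup = FeatureLookup(
    table_name="customer_features",
    feature_names=["feature1", "feature2"],
    lookup_key="customer_id",
    timestamp_lookup_key="event_time"  # join feature values as of this timestamp
)
training_set = fs.create_training_set(
    df=training_labels_df,
    feature_lookups=[point_in_time_lookup],
    label="churn"
)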
What information is provided by a model signature in MLflow? Answer: B) The input and output schema of the model
Explanation: A model signature defines the expected data types and shapes for model inputs and outputs, enabling validation when the model is used for inference.
Which MLflow tracking function would you use to save metadata about a trained model? Answer: C) log_param()
Explanation: log_param() saves named parameters about your model (like hyperparameters). log_metric() is for performance metrics, log_artifact() is for files, and log_tags() isn’t a standard MLflow function.
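A short sketch contrasting the main logging helpers (the file path is a placeholder; note that tagging uses set_tag, not log_tags):
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)                   # configuration metadata
    mlflow.log_metric("accuracy", 0.92)                # numeric performance result
    mlflow.log_artifact("/tmp/confusion_matrix.png")   # arbitrary file
    mlflow.set_tag("team", "data_science")             # free-form key/value tag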
What operation can you perform to go back to a previous version of a Delta table? Answer: C) Read with option(“versionAsOf”, N) and rewrite
Explanation: Reading the table with “versionAsOf” and writing it back restores a previous version on any Delta Lake release, which is the expected answer here. Note that newer Delta Lake versions also provide RESTORE TABLE ... TO VERSION AS OF N as a one-step rollback, so option B works on current runtimes as well.
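A brief sketch of both approaches (path and version number are placeholders):
# Read an old version and overwrite the current table with it
old_df = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/delta-table")
old_df.write.format("delta").mode("overwrite").save("/path/to/delta-table")
# On newer Delta Lake runtimes, RESTORE does the same in one step
spark.sql("RESTORE TABLE delta.`/path/to/delta-table` TO VERSION AS OF 5")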
When using the FeatureStoreClient to create a training set, what does the FeatureLookup parameter do? Answer: B) Specifies which features to retrieve and how to join them
Explanation: FeatureLookup specifies which features to retrieve from which feature tables and how to join them with the training dataset based on lookup keys.
How can you programmatically retrieve the metrics from a previous MLflow run? Answer: A) mlflow.get_run(run_id).data.metrics
Explanation: You can get a run’s details with mlflow.get_run() and then access its metrics via the .data.metrics attribute. This returns a dictionary of all logged metrics from that run.
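A minimal sketch, assuming run_id holds the ID of an earlier run:
import mlflow
run = mlflow.get_run(run_id)
print(run.data.metrics)   # dict of metric name -> last logged value
print(run.data.params)    # dict of logged parameters
# For bulk analysis, mlflow.search_runs(...) returns a DataFrame of runs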
Here are the concepts to review:
MLflow flavors provide a standardized way to package models for different frameworks.
Key Concepts:
Essential Operations:
The Model Registry provides a centralized repository for managing the full lifecycle of your ML models.
Key Concepts:
Essential Operations:
Automating the model lifecycle enables CI/CD workflows for ML models.
Key Concepts:
Essential Operations:
# 1. Basic model with sklearn flavor
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
# Train a model
X = np.random.rand(100, 4)
y = X[:, 0] + 2 * X[:, 1] + np.random.rand(100)
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)
# Log with sklearn flavor
with mlflow.start_run() as run:
mlflow.sklearn.log_model(model, "sklearn_model")
model_uri = f"runs:/{run.info.run_id}/sklearn_model"
# Load the model
loaded_model = mlflow.sklearn.load_model(model_uri)
predictions = loaded_model.predict(X)
# 2. Custom PyFunc model with preprocessing
import mlflow.pyfunc
# Define a custom model class with preprocessing
class CustomRFModel(mlflow.pyfunc.PythonModel):
def __init__(self, model):
self.model = model
def predict(self, context, model_input):
# Add preprocessing logic
if isinstance(model_input, pd.DataFrame):
# Scale numeric features
numeric_cols = model_input.select_dtypes(include=[np.number]).columns
model_input[numeric_cols] = model_input[numeric_cols] - model_input[numeric_cols].mean()
model_input[numeric_cols] = model_input[numeric_cols] / model_input[numeric_cols].std()
# Return predictions
return self.model.predict(model_input)
# Create and log the custom model
custom_model = CustomRFModel(model)
with mlflow.start_run() as run:
# Define the model signature
from mlflow.models.signature import infer_signature
signature = infer_signature(X, model.predict(X))
# Provide an input example
input_example = pd.DataFrame(X[0:5])
# Log the model with all metadata
mlflow.pyfunc.log_model(
"custom_model",
python_model=custom_model,
signature=signature,
input_example=input_example
)
custom_model_uri = f"runs:/{run.info.run_id}/custom_model"
# Load and use the custom model
loaded_custom_model = mlflow.pyfunc.load_model(custom_model_uri)
custom_predictions = loaded_custom_model.predict(X)
# 1. Register a model directly from a run
import mlflow.sklearn
from mlflow.tracking import MlflowClient
client = MlflowClient()
# First, log a model with MLflow
with mlflow.start_run() as run:
mlflow.sklearn.log_model(model, "sk_model")
run_id = run.info.run_id
model_uri = f"runs:/{run_id}/sk_model"
# Register the model
model_name = "RandomForestRegressor"
mv = mlflow.register_model(model_uri, model_name)
print(f"Name: {mv.name}")
print(f"Version: {mv.version}")
# 2. Add description and tags to the registered model
client.update_registered_model(
name=model_name,
description="Random Forest Regressor for predicting target values"
)
client.set_registered_model_tag(
name=model_name,
key="team",
value="data_science"
)
# 3. Add description and tags to a specific model version
client.update_model_version(
name=model_name,
version=mv.version,
description="Model trained with 100 trees and default parameters"
)
client.set_model_version_tag(
name=model_name,
version=mv.version,
key="train_data",
value="synthetic_data"
)
# 4. Transition model to staging
client.transition_model_version_stage(
name=model_name,
version=mv.version,
stage="Staging"
)
# 5. Create a new version and transition to production
with mlflow.start_run() as new_run:
mlflow.sklearn.log_model(
model,
"sk_model",
registered_model_name=model_name
)
new_run_id = new_run.info.run_id
# Find the latest version (version numbers come back as strings, so compare as integers)
latest_version = max(int(mv.version) for mv in client.search_model_versions(f"name='{model_name}'"))
# Transition to production
client.transition_model_version_stage(
name=model_name,
version=latest_version,
stage="Production"
)
# 6. Load a specific model version by stage
prod_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
staging_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")
# 7. Archive older versions
client.transition_model_version_stage(
name=model_name,
version=1, # Assuming this is the older version
stage="Archived"
)
# 1. Creating webhooks for the Model Registry
# Registry webhooks are managed with the databricks-registry-webhooks package
# (pip install databricks-registry-webhooks), not the standard MlflowClient
from databricks_registry_webhooks import RegistryWebhooksClient, JobSpec, HttpUrlSpec
webhooks_client = RegistryWebhooksClient()
# Create a job-triggered webhook that fires when a model version changes stage
# (the triggered job can inspect the event payload and act only on "Staging")
staging_webhook = webhooks_client.create_webhook(
    model_name=model_name,
    events=["MODEL_VERSION_TRANSITIONED_STAGE"],
    job_spec=JobSpec(
        job_id="123456",  # Replace with actual job ID
        workspace_url="https://your-workspace.cloud.databricks.com",
        access_token="<databricks-token>"
    ),
    description="Trigger test job on stage transition"
)
# Create an HTTP webhook that notifies an external service on stage transitions
production_webhook = webhooks_client.create_webhook(
    model_name=model_name,
    events=["MODEL_VERSION_TRANSITIONED_STAGE"],
    http_url_spec=HttpUrlSpec(
        url="https://your-service.example.com/webhook",
        authorization="Bearer your-token-here"
    ),
    description="Notify external service on stage transition"
)
# 2. List all webhooks for the model
all_webhooks = webhooks_client.list_webhooks(model_name=model_name)
for webhook in all_webhooks:
    print(f"ID: {webhook.id}, Events: {webhook.events}")
# 3. Delete a webhook
webhooks_client.delete_webhook(id=staging_webhook.id)
# 4. Set up Databricks Jobs for model automation
# Note: This would typically be done through the Databricks UI or API
# Here's a conceptual example of what the jobs would look like:
# Job 1: Train Model (scheduled or triggered)
"""
{
"name": "Train-RF-Model",
"tasks": [
{
"task_key": "train_model",
"notebook_task": {
"notebook_path": "/Path/To/Training/Notebook",
"source": "WORKSPACE"
},
"job_cluster_key": "training_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "training_cluster",
"new_cluster": {
"spark_version": "10.4.x-cpu-ml-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
}
}
]
}
"""
# Job 2: Test Model (triggered by webhook)
"""
{
"name": "Test-RF-Model",
"tasks": [
{
"task_key": "test_model",
"notebook_task": {
"notebook_path": "/Path/To/Testing/Notebook",
"base_parameters": {
"model_name": "",
"model_version": ""
}
},
"job_cluster_key": "testing_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "testing_cluster",
"new_cluster": {
"spark_version": "10.4.x-cpu-ml-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 1
}
}
]
}
"""
# Job 3: Deploy Model (triggered by webhook)
"""
{
"name": "Deploy-RF-Model",
"tasks": [
{
"task_key": "deploy_model",
"notebook_task": {
"notebook_path": "/Path/To/Deployment/Notebook",
"base_parameters": {
"model_name": "",
"model_version": ""
}
},
"job_cluster_key": "deployment_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "deployment_cluster",
"new_cluster": {
"spark_version": "10.4.x-cpu-ml-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 1
}
}
]
}
"""
Which MLflow flavor would you use to create a model with custom preprocessing logic? A) mlflow.sklearn B) mlflow.custom C) mlflow.pyfunc D) mlflow.generic
What happens when you register a model that already exists in the Model Registry? A) It overwrites the existing model B) It creates a new version of the model C) It returns an error that the model already exists D) It creates a copy with a different name
Which of the following is NOT a standard stage in the MLflow Model Registry? A) Development B) Staging C) Production D) Archived
When creating a webhook for model registry events, which of these is NOT a valid event type? A) MODEL_VERSION_CREATED B) MODEL_VERSION_TRANSITIONED_STAGE C) REGISTERED_MODEL_CREATED D) MODEL_VERSION_DEPLOYED
What is the correct way to load a production model from the registry? A) mlflow.pyfunc.load_model(“models:/model_name/production”) B) mlflow.sklearn.load_production_model(“model_name”) C) mlflow.load_model(“models:/model_name/Production”) D) mlflow.pyfunc.load_model(“models:/model_name/Production”)
What is the purpose of adding a model signature to an MLflow model? A) Digitally sign the model to verify its creator B) Define the expected input and output schema C) Increase the model’s security in the registry D) Document who approved the model for production
Which client method is used to move a model from Staging to Production? A) client.promote_model_version() B) client.update_model_stage() C) client.transition_model_version_stage() D) client.set_model_version_status()
What would you include in a pyfunc model’s predict() method to implement custom preprocessing? A) Data cleaning and feature transformation logic before passing to the model B) Hyperparameter optimization logic C) Model retraining logic if performance decreases D) Post-processing of model outputs only
What is the correct API to add a description to a model version? A) client.set_model_version_description() B) client.update_model_version() C) client.add_model_description() D) client.update_registered_model_version()
What type of compute is recommended for production Databricks Jobs? A) All-purpose clusters B) Job clusters C) Single-node clusters D) Interactive clusters
Which MLflow flavor would you use to create a model with custom preprocessing logic? Answer: C) mlflow.pyfunc
Explanation: The mlflow.pyfunc flavor allows you to create custom Python models by extending the PythonModel class, enabling you to implement custom preprocessing logic in the predict() method. The other flavors are for specific frameworks (sklearn) or don’t exist (custom, generic).
What happens when you register a model that already exists in the Model Registry? Answer: B) It creates a new version of the model
Explanation: When registering a model with a name that already exists in the registry, MLflow creates a new version under that name rather than overwriting the existing model or returning an error.
Which of the following is NOT a standard stage in the MLflow Model Registry? Answer: A) Development
Explanation: The standard stages in the MLflow Model Registry are None (default), Staging, Production, and Archived. “Development” is not a standard stage.
When creating a webhook for model registry events, which of these is NOT a valid event type? Answer: D) MODEL_VERSION_DEPLOYED
Explanation: Valid event types include MODEL_VERSION_CREATED, MODEL_VERSION_TRANSITIONED_STAGE, and REGISTERED_MODEL_CREATED. MODEL_VERSION_DEPLOYED is not a standard event type in MLflow webhooks.
What is the correct way to load a production model from the registry? Answer: D) mlflow.pyfunc.load_model(“models:/model_name/Production”)
Explanation: The correct URI format is “models:/model_name/stage” and the stage name is case-sensitive, with “Production” being the proper casing. MLflow’s pyfunc loader is the universal way to load any model.
What is the purpose of adding a model signature to an MLflow model? Answer: B) Define the expected input and output schema
Explanation: A model signature in MLflow defines the expected data types and shapes for inputs and outputs, allowing for validation when the model is used for inference.
Which client method is used to move a model from Staging to Production? Answer: C) client.transition_model_version_stage()
Explanation: The transition_model_version_stage() method of the MLflowClient is used to change a model version’s stage (e.g., from Staging to Production). The other methods don’t exist in the MLflow API.
What would you include in a pyfunc model’s predict() method to implement custom preprocessing? Answer: A) Data cleaning and feature transformation logic before passing to the model
Explanation: The predict() method in a custom pyfunc model is where you implement preprocessing logic like data cleaning and feature transformations before passing the data to the underlying model.
What is the correct API to add a description to a model version? Answer: B) client.update_model_version()
Explanation: The update_model_version() method is used to update metadata for a specific model version, including its description. The other options are not valid MLflow API methods.
What type of compute is recommended for production Databricks Jobs? Answer: B) Job clusters
Explanation: Job clusters are purpose-built for production workloads in Databricks. They start when a job begins and terminate when it completes, optimizing costs. All-purpose clusters are for interactive work, not production jobs.
Here are the concepts to review:
Batch deployment is the most common pattern for model inference in production environments.
Key Concepts:
Essential Operations:
Streaming deployment enables continuous inference on real-time data streams.
Key Concepts:
Essential Operations:
Real-time serving provides low-latency inference for individual requests.
Key Concepts:
Essential Operations:
# 1. Load a registered model for batch inference
import mlflow
import pandas as pd
from pyspark.sql.functions import struct, col
# Load the model from the registry
model_uri = "models:/RandomForestRegressor/Production"
model = mlflow.pyfunc.load_model(model_uri)
# 2. Create a Pandas UDF for batch inference
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
# Define the UDF using the loaded model
@pandas_udf(DoubleType())
def predict_udf(features: pd.DataFrame) -> pd.Series:
    # A struct column arrives as a pandas DataFrame with one column per struct field
    return pd.Series(model.predict(features))
# 3. Apply the UDF to a DataFrame
# Assuming df has feature columns feature1, feature2, etc.
predictions_df = df.withColumn(
"prediction",
predict_udf(struct(*[col(c) for c in feature_cols]))
)
# 4. Create a Spark UDF for distributed inference
# This approach is more efficient for large-scale inference
spark_udf = mlflow.pyfunc.spark_udf(
spark=spark,
model_uri=model_uri,
result_type=DoubleType()
)
# Apply the Spark UDF
predictions_df = df.withColumn(
"prediction",
spark_udf(struct(*[col(c) for c in feature_cols]))
)
# 5. Optimize batch inference with partitioning
# First, save predictions to a Delta table
predictions_df.write.format("delta") \
.partitionBy("date_col") \
.mode("append") \
.save("/path/to/predictions")
# Optimize with Z-ordering for faster queries
spark.sql("""
OPTIMIZE delta.`/path/to/predictions`
ZORDER BY (customer_id)
""")
# 6. Use Feature Store batch scoring
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# Score a batch of data with a registered model and features from the Feature Store.
# score_batch needs only the model URI and a DataFrame of lookup keys; the feature
# lookups are resolved automatically because they were packaged with the model when
# it was logged via fs.log_model(..., training_set=...)
batch_df = fs.score_batch(
    model_uri=model_uri,
    df=inference_df  # must contain the lookup key column(s), e.g. customer_id
)
# 1. Set up a streaming data source
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
# Define the schema of the streaming data
schema = StructType([
StructField("customer_id", StringType(), True),
StructField("feature1", DoubleType(), True),
StructField("feature2", DoubleType(), True),
StructField("feature3", DoubleType(), True),
StructField("timestamp", StringType(), True)
])
# Read from a streaming source (e.g., Kafka)
streaming_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host:port") \
.option("subscribe", "input_topic") \
.load() \
.select(from_json(col("value").cast("string"), schema).alias("data")) \
.select("data.*")
# 2. Load the model for streaming inference
import mlflow
model_uri = "models:/RandomForestRegressor/Production"
model = mlflow.pyfunc.load_model(model_uri)
# 3. Apply the model using a UDF
from pyspark.sql.functions import pandas_udf, struct, lit, window, avg
from pyspark.sql.types import DoubleType
import pandas as pd
# Define the UDF
@pandas_udf(DoubleType())
def predict_udf(features: pd.DataFrame) -> pd.Series:
    # A struct column arrives as a pandas DataFrame with one column per struct field
    return pd.Series(model.predict(features))
# Apply the UDF to the streaming DataFrame
feature_cols = ["feature1", "feature2", "feature3"]
predictions_streaming_df = streaming_df.withColumn(
"prediction",
predict_udf(struct(*[col(c) for c in feature_cols]))
)
# 4. Write predictions to a streaming sink
query = predictions_streaming_df \
.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/path/to/checkpoint") \
.start("/path/to/streaming_predictions")
# 5. Alternative: Use a foreachBatch function for more control
def process_batch(batch_df, batch_id):
# Additional processing specific to each batch
processed_df = batch_df.withColumn("batch_id", lit(batch_id))
# Write to Delta table
processed_df.write.format("delta").mode("append").save("/path/to/predictions")
streaming_query = predictions_streaming_df \
.writeStream \
.foreachBatch(process_batch) \
.option("checkpointLocation", "/path/to/checkpoint") \
.trigger(processingTime="1 minute") \
.start()
# 6. Handle watermarking for late data
# Watermarking requires a real timestamp column, so cast the string field first
windowed_df = streaming_df \
    .withColumn("timestamp", col("timestamp").cast("timestamp")) \
    .withWatermark("timestamp", "10 minutes") \
.groupBy(
window(col("timestamp"), "5 minutes"),
col("customer_id")
) \
.agg(avg("feature1").alias("avg_feature1"))
# Then apply model to windowed data...
# Note: Much of the real-time serving setup is done through the Databricks UI or REST API
# Here's a conceptual guide for the key operations:
# 1. Enable Model Serving for a registered model
# Through UI: Go to Model Registry -> Select model -> Serving tab -> Enable serving
# Or programmatically via REST API:
"""
curl -X POST https://<databricks-instance>/api/2.0/serving-endpoints \
-H "Authorization: Bearer <token>" \
-d '{
"name": "rf-prediction-endpoint",
"config": {
"served_models": [{
"model_name": "RandomForestRegressor",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true
}]
}
}'
"""
# 2. Query the model endpoint programmatically
import requests
import json
def query_endpoint(endpoint_name, input_data):
url = f"https://<databricks-instance>/serving-endpoints/{endpoint_name}/invocations"
headers = {
"Authorization": f"Bearer <token>",
"Content-Type": "application/json"
}
data_json = json.dumps({
"dataframe_records": input_data
})
response = requests.post(url, headers=headers, data=data_json)
return response.json()
# Example input data
input_data = [
{"feature1": 0.5, "feature2": 0.2, "feature3": 0.8}
]
# Get predictions
predictions = query_endpoint("rf-prediction-endpoint", input_data)
print(predictions)
# 3. Integration with Feature Store for online serving
# First, ensure your Feature Store table is published for online serving
"""
# This would be done through the Feature Store UI or API
fs.publish_table(
name="customer_features",
online=True
)
"""
# Then, when making real-time predictions, send only the lookup keys; because the model
# was logged together with its Feature Store lookups, the serving endpoint retrieves the
# remaining features from the online store automatically
def query_with_features(endpoint_name, lookup_keys):
    url = f"https://<databricks-instance>/serving-endpoints/{endpoint_name}/invocations"
    headers = {
        "Authorization": f"Bearer <token>",
        "Content-Type": "application/json"
    }
    data_json = json.dumps({
        "dataframe_records": [{"customer_id": key} for key in lookup_keys]
    })
    response = requests.post(url, headers=headers, data=data_json)
    return response.json()
# Example lookup keys
lookup_keys = ["customer_123", "customer_456"]
# Get predictions with feature lookups
predictions = query_with_features("rf-prediction-endpoint", lookup_keys)
print(predictions)
Which deployment pattern is most appropriate for generating predictions for millions of customers once per day? A) Real-time serving B) Batch inference C) Streaming inference D) Online serving
What is the primary advantage of using mlflow.pyfunc.spark_udf() over a regular pandas UDF? A) It allows for more complex preprocessing B) It distributes the model across the cluster efficiently C) It enables automatic model retraining D) It provides lower latency for small datasets
When implementing a streaming inference pipeline, what Spark feature helps deal with late-arriving data? A) Checkpointing B) Watermarking C) Trigger once D) Output mode
What is NOT a typical use case for real-time model serving? A) Fraud detection during a transaction B) Real-time product recommendations on a website C) Daily customer churn prediction reports D) Instant credit approval decisions
Which method allows the most efficient batch scoring using features from the Feature Store? A) fs.read_table() followed by model.predict() B) fs.score_batch() C) mlflow.pyfunc.spark_udf() with feature lookup D) model.predict() with feature joiner
What is a benefit of Z-ordering a Delta table containing model predictions? A) It compresses the data to save storage space B) It enables faster queries on specific columns C) It enforces schema validation D) It guarantees ACID transactions
When converting a batch inference pipeline to streaming, what component must be changed? A) The model itself B) The input data source and potentially the output sink C) The MLflow tracking system D) The cluster configuration only
What is the recommended way to handle state information in a streaming model inference pipeline? A) Store state in a separate database B) Use Spark’s stateful processing with watermarking C) Avoid stateful operations in model inference D) Write custom state handlers
Which Databricks feature enables scaling real-time model serving to zero when not in use? A) Auto Scaling B) Scale-to-Zero C) Serverless Endpoints D) Cluster Autotermination
What’s the primary benefit of partitioning a Delta table containing model predictions? A) It improves write performance by distributing data B) It enables query pruning for faster reads on partition columns C) It guarantees data consistency D) It allows for better compression
Which deployment pattern is most appropriate for generating predictions for millions of customers once per day? Answer: B) Batch inference
Explanation: Batch inference is ideal for high-volume, scheduled processing where predictions don’t need to be generated in real-time. Running predictions for millions of customers once daily is a classic batch inference scenario.
What is the primary advantage of using mlflow.pyfunc.spark_udf() over a regular pandas UDF? Answer: B) It distributes the model across the cluster efficiently
Explanation: The spark_udf function from MLflow efficiently distributes model inference across a Spark cluster, allowing for better parallelization and throughput compared to a standard pandas UDF implementation.
When implementing a streaming inference pipeline, what Spark feature helps deal with late-arriving data? Answer: B) Watermarking
Explanation: Watermarking in Spark Structured Streaming allows you to specify how late data can arrive and still be processed, which is essential for handling out-of-order data in streaming pipelines.
What is NOT a typical use case for real-time model serving? Answer: C) Daily customer churn prediction reports
Explanation: Daily churn prediction reports are a batch processing use case, not requiring real-time serving. The other options (fraud detection, real-time recommendations, and instant credit decisions) all require immediate responses and are suitable for real-time serving.
Which method allows the most efficient batch scoring using features from the Feature Store? Answer: B) fs.score_batch()
Explanation: The score_batch() method from the FeatureStoreClient is specifically designed for efficient batch scoring with features from the Feature Store, handling the feature lookups and model prediction in an optimized way.
What is a benefit of Z-ordering a Delta table containing model predictions? Answer: B) It enables faster queries on specific columns
Explanation: Z-ordering co-locates related data in the same files, allowing for more efficient data skipping during queries, which significantly speeds up queries that filter on the Z-ordered columns.
When converting a batch inference pipeline to streaming, what component must be changed? Answer: B) The input data source and potentially the output sink
Explanation: To convert a batch pipeline to streaming, you must change how data is read (using readStream instead of read) and how results are written (using writeStream instead of write). The model itself doesn’t necessarily change.
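As a side-by-side sketch (paths are placeholders), only the read and write calls change; the scoring UDF from earlier stays the same:
# Batch version
batch_df = spark.read.format("delta").load("/path/to/input")
batch_df.withColumn("prediction", spark_udf(struct(*feature_cols))) \
    .write.format("delta").mode("append").save("/path/to/predictions")
# Streaming version of the same pipeline
stream_df = spark.readStream.format("delta").load("/path/to/input")
stream_df.withColumn("prediction", spark_udf(struct(*feature_cols))) \
    .writeStream.format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/predictions")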
What is the recommended way to handle state information in a streaming model inference pipeline? Answer: B) Use Spark’s stateful processing with watermarking
Explanation: Spark Structured Streaming provides built-in stateful processing capabilities, which, when combined with watermarking, allow for proper handling of state across streaming micro-batches.
Which Databricks feature enables scaling real-time model serving to zero when not in use? Answer: B) Scale-to-Zero
Explanation: Databricks Model Serving includes a Scale-to-Zero feature that allows endpoints to automatically scale down to zero compute when they’re not receiving requests, reducing costs.
What’s the primary benefit of partitioning a Delta table containing model predictions? Answer: B) It enables query pruning for faster reads on partition columns
Explanation: Partitioning a Delta table allows the query optimizer to skip irrelevant partitions (data files) during read operations, significantly improving query performance when filtering on the partition columns.
Here are the concepts to review:
Understanding different types of drift is essential for monitoring ML systems effectively.
Key Concepts:
Essential Operations:
Statistical methods help quantify and detect drift in production ML systems.
Key Concepts:
Essential Operations:
End-to-end systems for monitoring and responding to drift in production.
Key Concepts:
Essential Operations:
# 1. Set up data for drift analysis
import pandas as pd
import numpy as np
from scipy import stats
import mlflow
import matplotlib.pyplot as plt
from pyspark.sql import functions as F
# Load reference data (training distribution)
reference_df = spark.table("feature_store.training_features")
# Load current data (production distribution)
current_df = spark.table("feature_store.current_features")
# 2. Basic feature drift detection with summary statistics
# Convert to pandas for easier statistical analysis
reference_pd = reference_df.toPandas()
current_pd = current_df.toPandas()
# Compare summary statistics for numeric features
numeric_features = ["feature1", "feature2", "feature3"]
print("Reference vs Current Summary Statistics:")
for feature in numeric_features:
ref_mean = reference_pd[feature].mean()
ref_std = reference_pd[feature].std()
curr_mean = current_pd[feature].mean()
curr_std = current_pd[feature].std()
mean_shift = abs(ref_mean - curr_mean) / ref_std
print(f"Feature: {feature}")
print(f" Reference: mean={ref_mean:.4f}, std={ref_std:.4f}")
print(f" Current: mean={curr_mean:.4f}, std={curr_std:.4f}")
print(f" Mean shift (in std): {mean_shift:.4f}")
print(f" Significant drift: {mean_shift > 0.5}")
print("")
# 3. Categorical feature drift detection
categorical_features = ["category1", "category2"]
for feature in categorical_features:
# Calculate value frequencies
ref_counts = reference_pd[feature].value_counts(normalize=True)
curr_counts = current_pd[feature].value_counts(normalize=True)
# Combine and fill missing categories with 0
all_categories = set(ref_counts.index) | set(curr_counts.index)
ref_dist = {cat: ref_counts.get(cat, 0) for cat in all_categories}
curr_dist = {cat: curr_counts.get(cat, 0) for cat in all_categories}
print(f"Feature: {feature}")
print(f" Reference distribution: {ref_dist}")
print(f" Current distribution: {curr_dist}")
# Check for new or missing categories
missing_cats = set(ref_counts.index) - set(curr_counts.index)
new_cats = set(curr_counts.index) - set(ref_counts.index)
if missing_cats:
print(f" Missing categories in current data: {missing_cats}")
if new_cats:
print(f" New categories in current data: {new_cats}")
print("")
# 4. Visualize distributions for key features
for feature in numeric_features:
plt.figure(figsize=(10, 6))
plt.hist(reference_pd[feature], alpha=0.5, label="Reference")
plt.hist(current_pd[feature], alpha=0.5, label="Current")
plt.title(f"Distribution Comparison: {feature}")
plt.legend()
plt.savefig(f"/tmp/{feature}_drift.png")
# Log the plot to MLflow
with mlflow.start_run(run_name=f"drift_detection_{feature}"):
mlflow.log_artifact(f"/tmp/{feature}_drift.png")
# 1. Kolmogorov-Smirnov test for numerical features
from scipy import stats
import numpy as np
def ks_test_drift(reference, current, feature, alpha=0.05):
"""
Perform Kolmogorov-Smirnov test for distribution drift
"""
# Remove nulls for statistical testing
ref_data = reference[feature].dropna().values
curr_data = current[feature].dropna().values
# Perform KS test
ks_stat, p_value = stats.ks_2samp(ref_data, curr_data)
# Determine if drift detected
drift_detected = p_value < alpha
return {
"feature": feature,
"ks_statistic": ks_stat,
"p_value": p_value,
"drift_detected": drift_detected,
"test": "Kolmogorov-Smirnov"
}
# Run KS test for each numeric feature
ks_results = []
for feature in numeric_features:
result = ks_test_drift(reference_pd, current_pd, feature)
ks_results.append(result)
print(f"KS Test - {feature}: Drift Detected = {result['drift_detected']} (p-value: {result['p_value']:.6f})")
# 2. Jensen-Shannon divergence for distribution comparison
def jensen_shannon_divergence(p, q):
"""
Calculate Jensen-Shannon divergence between two distributions
"""
# Ensure valid probability distributions (sum to 1)
p = np.asarray(p) / np.sum(p)
q = np.asarray(q) / np.sum(q)
# Calculate the average distribution
m = (p + q) / 2
    # Calculate JS divergence with scipy.stats.entropy (base 2 keeps the value in [0, 1])
    divergence = (stats.entropy(p, m, base=2) + stats.entropy(q, m, base=2)) / 2
return divergence
def js_test_drift(reference, current, feature, bins=20, threshold=0.1):
"""
Use Jensen-Shannon divergence to detect drift in numeric features
"""
# Remove nulls
ref_data = reference[feature].dropna().values
curr_data = current[feature].dropna().values
# Create histogram distributions
min_val = min(ref_data.min(), curr_data.min())
max_val = max(ref_data.max(), curr_data.max())
ref_hist, _ = np.histogram(ref_data, bins=bins, range=(min_val, max_val), density=True)
curr_hist, _ = np.histogram(curr_data, bins=bins, range=(min_val, max_val), density=True)
# Add small epsilon to avoid zero probabilities
epsilon = 1e-10
ref_hist = ref_hist + epsilon
curr_hist = curr_hist + epsilon
# Calculate JS divergence
js_div = jensen_shannon_divergence(ref_hist, curr_hist)
# Determine if drift detected
drift_detected = js_div > threshold
return {
"feature": feature,
"js_divergence": js_div,
"drift_detected": drift_detected,
"threshold": threshold,
"test": "Jensen-Shannon"
}
# Run JS divergence test for each numeric feature
js_results = []
for feature in numeric_features:
result = js_test_drift(reference_pd, current_pd, feature)
js_results.append(result)
print(f"JS Divergence - {feature}: Drift Detected = {result['drift_detected']} (divergence: {result['js_divergence']:.6f})")
# 3. Chi-square test for categorical features
def chi_square_test_drift(reference, current, feature, alpha=0.05):
"""
Perform Chi-square test for categorical distribution drift
"""
# Get value counts
ref_counts = reference[feature].value_counts()
curr_counts = current[feature].value_counts()
# Get union of all categories
all_categories = set(ref_counts.index) | set(curr_counts.index)
# Create arrays with counts for each category
ref_array = np.array([ref_counts.get(cat, 0) for cat in all_categories])
curr_array = np.array([curr_counts.get(cat, 0) for cat in all_categories])
# Chi-square test requires expected counts >= 5, so filter categories
valid_mask = (ref_array >= 5) & (curr_array >= 5)
if np.sum(valid_mask) < 2:
return {
"feature": feature,
"chi2_statistic": np.nan,
"p_value": np.nan,
"drift_detected": False,
"test": "Chi-square",
"note": "Not enough valid categories for Chi-square test"
}
# Filter arrays
ref_array = ref_array[valid_mask]
curr_array = curr_array[valid_mask]
    # Perform Chi-square test; scale the reference counts so the expected
    # frequencies sum to the same total as the observed counts
    expected_array = ref_array * (curr_array.sum() / ref_array.sum())
    chi2_stat, p_value = stats.chisquare(curr_array, expected_array)
# Determine if drift detected
drift_detected = p_value < alpha
return {
"feature": feature,
"chi2_statistic": chi2_stat,
"p_value": p_value,
"drift_detected": drift_detected,
"test": "Chi-square"
}
# Run Chi-square test for each categorical feature
chi2_results = []
for feature in categorical_features:
result = chi_square_test_drift(reference_pd, current_pd, feature)
chi2_results.append(result)
if np.isnan(result['p_value']):
print(f"Chi-square Test - {feature}: {result['note']}")
else:
print(f"Chi-square Test - {feature}: Drift Detected = {result['drift_detected']} (p-value: {result['p_value']:.6f})")
# 1. Set up a workflow to detect and respond to drift
from datetime import datetime
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
def detect_and_respond_to_drift(reference_table, current_table, model_name, version, drift_threshold=0.05):
"""
End-to-end workflow to detect drift and trigger retraining if needed
"""
# Load data
reference_df = spark.table(reference_table)
current_df = spark.table(current_table)
# Convert to pandas for analysis
reference_pd = reference_df.toPandas()
current_pd = current_df.toPandas()
# Get feature names
numeric_features = [col for col in current_pd.columns if current_pd[col].dtype in ['int64', 'float64']]
categorical_features = [col for col in current_pd.columns if current_pd[col].dtype == 'object']
# Start an MLflow run to track drift analysis
with mlflow.start_run(run_name=f"drift_analysis_{model_name}_v{version}") as run:
# Log basic info
mlflow.log_param("model_name", model_name)
mlflow.log_param("model_version", version)
mlflow.log_param("reference_table", reference_table)
mlflow.log_param("current_table", current_table)
mlflow.log_param("analysis_time", datetime.now().isoformat())
# Initialize drift flags
feature_drift_detected = False
data_quality_drift_detected = False
# 1. Check data quality
for feature in numeric_features + categorical_features:
ref_missing = reference_pd[feature].isna().mean()
curr_missing = current_pd[feature].isna().mean()
missing_diff = abs(ref_missing - curr_missing)
mlflow.log_metric(f"missing_diff_{feature}", missing_diff)
if missing_diff > 0.05: # 5% threshold for missing value difference
data_quality_drift_detected = True
mlflow.log_metric(f"data_quality_drift_{feature}", 1)
else:
mlflow.log_metric(f"data_quality_drift_{feature}", 0)
# 2. Statistical tests for feature drift
drift_features = []
# For numeric features
for feature in numeric_features:
# Run KS test
ks_result = ks_test_drift(reference_pd, current_pd, feature)
# Log results
mlflow.log_metric(f"ks_stat_{feature}", ks_result["ks_statistic"])
mlflow.log_metric(f"ks_pvalue_{feature}", ks_result["p_value"])
mlflow.log_metric(f"ks_drift_{feature}", int(ks_result["drift_detected"]))
# Run JS divergence test
js_result = js_test_drift(reference_pd, current_pd, feature)
# Log results
mlflow.log_metric(f"js_div_{feature}", js_result["js_divergence"])
mlflow.log_metric(f"js_drift_{feature}", int(js_result["drift_detected"]))
# If either test detects drift, mark feature as drifted
if ks_result["drift_detected"] or js_result["drift_detected"]:
feature_drift_detected = True
drift_features.append(feature)
# For categorical features
for feature in categorical_features:
# Run Chi-square test
chi2_result = chi_square_test_drift(reference_pd, current_pd, feature)
# Log results if test was valid
if not np.isnan(chi2_result["p_value"]):
mlflow.log_metric(f"chi2_stat_{feature}", chi2_result["chi2_statistic"])
mlflow.log_metric(f"chi2_pvalue_{feature}", chi2_result["p_value"])
mlflow.log_metric(f"chi2_drift_{feature}", int(chi2_result["drift_detected"]))
if chi2_result["drift_detected"]:
feature_drift_detected = True
drift_features.append(feature)
# Log overall drift status
mlflow.log_metric("feature_drift_detected", int(feature_drift_detected))
mlflow.log_metric("data_quality_drift_detected", int(data_quality_drift_detected))
# Log list of drifted features
if drift_features:
mlflow.log_param("drift_features", ",".join(drift_features))
# 3. Determine if retraining is needed
retraining_needed = feature_drift_detected or data_quality_drift_detected
mlflow.log_metric("retraining_needed", int(retraining_needed))
# 4. If retraining needed, trigger retraining job
if retraining_needed:
print(f"Drift detected for model {model_name} (v{version}). Triggering retraining job.")
# In a real implementation, you would trigger a Databricks job here
# For example, using the Databricks Jobs API
"""
import requests
response = requests.post(
f"{workspace_url}/api/2.0/jobs/run-now",
headers={"Authorization": f"Bearer {token}"},
json={
"job_id": retraining_job_id,
"notebook_params": {
"model_name": model_name,
"drift_features": ",".join(drift_features)
}
}
)
"""
# For this exercise, just log that we would trigger retraining
mlflow.log_param("retraining_job_triggered", "True")
return {
"run_id": run.info.run_id,
"feature_drift_detected": feature_drift_detected,
"data_quality_drift_detected": data_quality_drift_detected,
"retraining_needed": retraining_needed,
"drift_features": drift_features
}
# 2. Evaluate if a retrained model performs better on current data
def evaluate_model_improvement(current_data, original_model_uri, new_model_uri):
"""
Compare performance of original and new models on current data
"""
# Load the original and new models
original_model = mlflow.pyfunc.load_model(original_model_uri)
new_model = mlflow.pyfunc.load_model(new_model_uri)
# Prepare evaluation data
X = current_data.drop("target", axis=1)
y = current_data["target"]
# Make predictions with both models
original_preds = original_model.predict(X)
new_preds = new_model.predict(X)
# Calculate performance metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# For regression
original_mse = mean_squared_error(y, original_preds)
new_mse = mean_squared_error(y, new_preds)
original_mae = mean_absolute_error(y, original_preds)
new_mae = mean_absolute_error(y, new_preds)
original_r2 = r2_score(y, original_preds)
new_r2 = r2_score(y, new_preds)
# Log comparison to MLflow
with mlflow.start_run(run_name="model_comparison") as run:
mlflow.log_param("original_model_uri", original_model_uri)
mlflow.log_param("new_model_uri", new_model_uri)
mlflow.log_metric("original_mse", original_mse)
mlflow.log_metric("new_mse", new_mse)
mlflow.log_metric("mse_improvement", original_mse - new_mse)
mlflow.log_metric("original_mae", original_mae)
mlflow.log_metric("new_mae", new_mae)
mlflow.log_metric("mae_improvement", original_mae - new_mae)
mlflow.log_metric("original_r2", original_r2)
mlflow.log_metric("new_r2", new_r2)
mlflow.log_metric("r2_improvement", new_r2 - original_r2)
# Determine if new model is better
is_improved = (new_mse < original_mse)
mlflow.log_metric("is_improved", int(is_improved))
result = {
"run_id": run.info.run_id,
"original_mse": original_mse,
"new_mse": new_mse,
"mse_improvement": original_mse - new_mse,
"is_improved": is_improved
}
return result
# 3. Set up scheduled monitoring workflow
"""
# This would be implemented as a Databricks job
# The following is pseudocode for a job notebook
# Get the model info
model_name = "regression_model"
model_version = mlflow.tracking.MlflowClient().get_latest_versions(model_name, stages=["Production"])[0].version
# Detect drift
drift_result = detect_and_respond_to_drift(
reference_table="feature_store.training_features",
current_table="feature_store.current_features",
model_name=model_name,
version=model_version
)
# If drift detected and retraining was triggered, evaluate improvement after retraining
if drift_result["retraining_needed"]:
# Wait for retraining job to complete (in real workflow)
# ...
# Get new model URI
new_model_version = mlflow.tracking.MlflowClient().get_latest_versions(model_name, stages=["None"])[0].version
# Evaluate improvement
improvement_result = evaluate_model_improvement(
current_data=spark.table("feature_store.current_features").toPandas(),
original_model_uri=f"models:/{model_name}/{model_version}",
new_model_uri=f"models:/{model_name}/{new_model_version}"
)
# If improved, promote new model to production
if improvement_result["is_improved"]:
mlflow.tracking.MlflowClient().transition_model_version_stage(
name=model_name,
version=new_model_version,
stage="Production"
)
"""
What type of drift occurs when the statistical properties of the target variable change over time? A) Feature drift B) Label drift C) Concept drift D) Data quality drift
Which of the following statistical tests is most appropriate for detecting drift in categorical features? A) Kolmogorov-Smirnov test B) Jensen-Shannon divergence C) Chi-square test D) t-test
When using the Kolmogorov-Smirnov test to detect drift, what does a p-value < 0.05 indicate? A) The distributions are similar B) The distributions are significantly different C) There is not enough data to make a determination D) The test is inconclusive
What is concept drift in the context of machine learning? A) When the model’s code changes over time B) When the statistical properties of features change C) When the relationship between input features and target variable changes D) When users’ understanding of the model changes
What is NOT a typical response to detected feature drift? A) Retraining the model with recent data B) Adjusting feature engineering steps C) Reverting to a previous model version D) Deleting the features that show drift
Which metric is NOT commonly used for monitoring numeric feature drift? A) Mean and standard deviation B) Quantiles (median, percentiles) C) Kolmogorov-Smirnov statistic D) Gini coefficient
What is an advantage of using Jensen-Shannon divergence over Kolmogorov-Smirnov test for drift detection? A) Jensen-Shannon works with categorical features B) Jensen-Shannon is always more accurate C) Jensen-Shannon is symmetric and bounded between 0 and 1 D) Jensen-Shannon requires less data
In a comprehensive drift detection system, why is it important to monitor both individual features and overall model performance? A) Feature drift might not always impact model performance B) It’s redundant but serves as a backup check C) Individual features are too noisy to monitor alone D) Model performance is always more important than feature drift
What is a valid approach to evaluate if a retrained model is better than the current production model? A) Compare their performance on the original training data B) Compare their performance on the most recent production data C) Check if the new model has more features D) Check if the new model is more complex
Which approach is NOT recommended for handling missing categories in categorical feature drift detection? A) Ignore new categories that weren’t in the training data B) Treat missing categories as a significant sign of drift C) Add a small epsilon to zero counts to avoid division by zero D) Include new categories in the distribution comparison
What type of drift occurs when the statistical properties of the target variable change over time? Answer: B) Label drift
Explanation: Label drift specifically refers to changes in the distribution of the target variable (labels) over time. Feature drift refers to changes in input features, while concept drift refers to changes in the relationship between features and the target.
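A minimal sketch of monitoring the label distribution itself (the column name "target" is an assumption), reusing the same KS test applied to features elsewhere in this section:
from scipy import stats
ref_labels = reference_pd["target"].dropna().values
curr_labels = current_pd["target"].dropna().values
ks_stat, p_value = stats.ks_2samp(ref_labels, curr_labels)
label_drift_detected = p_value < 0.05  # the label distribution differs significantly
print(f"Label drift detected: {label_drift_detected} (p-value: {p_value:.6f})")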
Which of the following statistical tests is most appropriate for detecting drift in categorical features? Answer: C) Chi-square test
Explanation: The Chi-square test is designed for comparing categorical distributions. Kolmogorov-Smirnov and Jensen-Shannon are better suited for continuous distributions, while t-tests compare means of continuous variables.
When using the Kolmogorov-Smirnov test to detect drift, what does a p-value < 0.05 indicate? Answer: B) The distributions are significantly different
Explanation: In statistical hypothesis testing, a p-value below the significance level (commonly 0.05) means we reject the null hypothesis, which in the case of the KS test is that the distributions are the same. Therefore, p < 0.05 indicates the distributions are significantly different, suggesting drift.
What is concept drift in the context of machine learning? Answer: C) When the relationship between input features and target variable changes
Explanation: Concept drift occurs when the underlying relationship between input features and the target variable changes over time, making the learned patterns less valid. This can happen even when the feature distributions themselves remain stable.
What is NOT a typical response to detected feature drift? Answer: D) Deleting the features that show drift
Explanation: Simply deleting drifted features is rarely the right approach, as those features may still contain valuable information. Typical responses include retraining the model, adjusting feature engineering, or reverting to a previous version while investigating.
Which metric is NOT commonly used for monitoring numeric feature drift? Answer: D) Gini coefficient
Explanation: The Gini coefficient is primarily used to measure inequality in distributions, particularly in economics. It’s not commonly used for feature drift detection. Mean, standard deviation, quantiles, and KS statistics are standard metrics for monitoring numeric features.
What is an advantage of using Jensen-Shannon divergence over Kolmogorov-Smirnov test for drift detection? Answer: C) Jensen-Shannon is symmetric and bounded between 0 and 1
Explanation: The Jensen-Shannon divergence has the advantage of being symmetric (the order of distributions doesn’t matter) and bounded between 0 and 1, making it easier to interpret and set thresholds. The KS test doesn’t have these properties.
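A small sketch illustrating both properties with the jensen_shannon_divergence helper defined earlier (the distributions are made up):
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(jensen_shannon_divergence(p, q))            # equals jensen_shannon_divergence(q, p): symmetric
print(jensen_shannon_divergence(p, p))            # identical distributions give 0
print(jensen_shannon_divergence([1, 0], [0, 1]))  # disjoint distributions give the maximum of 1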
In a comprehensive drift detection system, why is it important to monitor both individual features and overall model performance? Answer: A) Feature drift might not always impact model performance
Explanation: Feature drift doesn’t always translate to degraded model performance, as the model might be robust to certain changes or the drift might occur in less important features. Monitoring both gives a more complete picture of system health.
What is a valid approach to evaluate if a retrained model is better than the current production model? Answer: B) Compare their performance on the most recent production data
Explanation: The most valid approach is to compare both models on the most recent production data, as this reflects the current reality the model will face. Using the original training data would be inappropriate as it doesn’t represent current conditions.
Which approach is NOT recommended for handling missing categories in categorical feature drift detection? Answer: A) Ignore new categories that weren’t in the training data
Explanation: Ignoring new categories that appear in production but weren’t in training data is not recommended, as these new categories often represent significant drift and should be accounted for in drift detection. The other approaches are valid techniques for handling categorical comparisons.
Here are the concepts to review:
Delta Lake Operations:
Feature Store:
MLflow Experiment Tracking:
MLflow Flavors and Custom Models:
Model Registry:
Model Lifecycle Automation:
Batch Deployment:
Streaming Deployment:
Real-time Serving:
Drift Types:
Drift Detection:
Comprehensive Monitoring:
Section 1: Experimentation & Data Management
Which command is used to optimize a Delta table by co-locating related data? A) OPTIMIZE with VACUUM B) OPTIMIZE with ZORDER BY C) COMPACT with ZORDER BY D) VACUUM with ZORDER BY
When using Feature Store to create a training set, what does the FeatureLookup parameter specify? A) The data source for feature values B) Which features to retrieve and how to join them C) How to compute new features on the fly D) Which features to exclude from the dataset
How can you include preprocessing logic in a model logged with MLflow? A) Create a custom PyFunc model with preprocessing in the predict method B) Add preprocessing as a separate step in the MLflow pipeline C) Use mlflow.sklearn.autolog() with preprocess=True D) Log the preprocessor as a separate artifact
What is the correct way to read a specific version of a Delta table? A) spark.read.format("delta").option("versionAsOf", 5).load("/path/to/table") B) spark.read.format("delta").version(5).load("/path/to/table") C) spark.read.format("delta").option("version", 5).load("/path/to/table") D) spark.read.format("delta").history(5).load("/path/to/table")
Which method should you use to track nested runs in MLflow? A) mlflow.start_run(nested=True) B) mlflow.log_nested_run() C) mlflow.create_child_run() D) mlflow.track_nested()
When writing to a Feature Store table with existing data, which mode ensures only specific records are updated? A) “overwrite” B) “append” C) “merge” D) “update”
What is a model signature in MLflow used for? A) Digitally signing the model to verify its creator B) Defining the expected input and output schema C) Ensuring the model can be used in production D) Providing a unique identifier for the model
How do you access the metrics from a specific MLflow run programmatically? A) mlflow.get_run(run_id).data.metrics B) mlflow.client.get_metrics(run_id) C) mlflow.runs.get_metrics(run_id) D) mlflow.tracking.get_run_metrics(run_id)
Section 2: Model Lifecycle Management
Which MLflow Client method is used to change a model’s stage from Staging to Production? A) client.update_model_version() B) client.change_stage() C) client.transition_model_version_stage() D) client.set_model_version_stage()
What happens when you register a model with a name that already exists in the Model Registry? A) It overwrites the existing model B) It creates a new version of the model C) It returns an error D) It creates a copy with a suffix
Which webhook event is triggered when a model version is moved to the Production stage? A) MODEL_VERSION_CREATED B) MODEL_VERSION_TRANSITIONED_STAGE C) MODEL_MOVED_TO_PRODUCTION D) REGISTERED_MODEL_STAGE_CHANGED
How can you add a description to a specific version of a registered model? A) client.add_description(model_name, version, description) B) client.update_model_version(model_name, version, description=description) C) client.set_model_version_description(model_name, version, description) D) client.add_model_version_description(model_name, version, description)
Which environment is recommended for running production-grade automated ML pipelines? A) All-purpose clusters B) Job clusters C) Interactive clusters D) Development clusters
How do you add a tag to a registered model? A) client.add_model_tag(name, key, value) B) client.set_registered_model_tag(name, key, value) C) client.update_registered_model(name, tags={key: value}) D) client.tag_model(name, key, value)
What is the purpose of the ModelVersion stage “Archived” in the Model Registry? A) To indicate the model is in storage but not actively used B) To keep old versions for reference without making them available for production C) To indicate the model has been backed up to cloud storage D) To delete the model from the registry
How would you programmatically get the latest production version of a model? A) client.get_latest_versions(name, stages=[“Production”])[0] B) client.get_model_version(name, “Production”) C) client.get_production_version(name) D) client.get_registered_model(name).production_version
Section 3: Model Deployment Strategies
Which deployment pattern is most appropriate for generating predictions on new data every hour? A) Real-time serving B) Batch inference C) Streaming inference D) Online prediction
What is the primary advantage of using mlflow.pyfunc.spark_udf() for model deployment? A) It allows model serving directly from Spark B) It distributes model inference across a Spark cluster C) It enables real-time predictions D) It automatically handles feature engineering
When implementing a streaming inference pipeline, what feature helps manage state across micro-batches? A) foreachBatch B) Watermarking C) checkpointLocation D) outputMode
Which option in Databricks Model Serving allows an endpoint to use no resources when not receiving requests? A) Auto Scaling B) Scale-to-Zero C) Serverless Endpoints D) Dynamic Allocation
What method from the Feature Store client is optimized for batch scoring with feature lookups? A) fs.batch_score() B) fs.score_batch() C) fs.lookup_and_score() D) fs.predict_batch()
When using a watermark in a Structured Streaming application, what does it help with? A) Data encryption B) Authentication of the data source C) Managing late-arriving data D) Reducing memory usage
What is a typical reason to choose batch deployment over real-time deployment? A) When predictions must be generated in milliseconds B) When high throughput is more important than low latency C) When user interaction requires immediate feedback D) When the model is small enough to fit in memory
What Spark Structured Streaming output mode should you use when you want to write only new predictions? A) “complete” B) “update” C) “append” D) “insert-only”
Section 4: Solution & Data Monitoring
What type of drift occurs when the relationship between features and the target variable changes? A) Feature drift B) Label drift C) Concept drift D) Data drift
Which statistical test is appropriate for detecting drift in numerical features? A) Chi-square test B) Kolmogorov-Smirnov test C) Fisher’s exact test D) Log-likelihood ratio test
What does a low p-value in the Kolmogorov-Smirnov test indicate? A) The distributions are similar B) The distributions are significantly different C) The test is inconclusive D) There is not enough data for the test
Which approach is NOT typically part of a comprehensive drift monitoring solution? A) Monitoring feature distributions B) Tracking model performance metrics C) Retraining models on a fixed schedule regardless of drift D) Evaluating new models against current data
What is an advantage of using Jensen-Shannon divergence for drift detection? A) It works well with categorical data B) It’s faster to compute than other tests C) It’s symmetric and bounded between 0 and 1 D) It requires fewer data points
What is the most appropriate way to compare a retrained model against the production model? A) Compare their performance on the original training data B) Compare their complexity (number of parameters) C) Compare their performance on recent production data D) Compare their training time and resource usage
When monitoring categorical features for drift, how should you handle new categories that weren’t in the training data? A) Ignore them as they weren’t part of the original distribution B) Consider them as a significant indicator of drift C) Group them all into an “Other” category D) Remove records with new categories from the analysis
What is the primary purpose of monitoring data quality alongside drift detection? A) To validate the data pipeline is working correctly B) To identify changes that might impact model performance before drift occurs C) To comply with regulations D) To ensure data storage is optimized
Additional Questions (covering all domains)
Which MLflow flavor would you use for a custom model with complex preprocessing logic? A) mlflow.sklearn B) mlflow.tensorflow C) mlflow.custom D) mlflow.pyfunc
How would you optimize a Delta table for queries that frequently filter on a date column and then need to access sorted user IDs? A) OPTIMIZE table ZORDER BY (date_col, user_id) B) OPTIMIZE table PARTITION BY (date_col) ZORDER BY (user_id) C) OPTIMIZE table ZORDER BY (user_id) D) OPTIMIZE table PARTITION BY (date_col, user_id)
What is the benefit of using Feature Store for model inference in production? A) It automatically retrains models when features drift B) It ensures consistent feature transformation between training and inference C) It reduces the number of features needed for a model D) It improves model accuracy by adding new features
In MLflow, what does setting nested=True in a run allow you to do? A) Nest models inside each other B) Create parent-child relationships between runs C) Use nested cross-validation D) Track nested parameters in dictionaries
What is the primary advantage of job clusters over all-purpose clusters for ML workflows? A) They are more secure B) They automatically scale based on workload C) They are optimized for cost by terminating when jobs complete D) They allow for more concurrent users
When implementing a feature store table that will be used for both training and online inference, what should you consider? A) Using a NoSQL database for storage B) Publishing the table for online serving C) Converting all features to numeric type D) Limiting the number of features to improve performance
Which approach would you use to deploy a model that needs to process a continuous stream of IoT sensor data? A) Real-time serving with RESTful endpoints B) Batch inference scheduled every minute C) Streaming inference with Structured Streaming D) On-device inference without cloud deployment
What does the MLflow model registry enable that experiment tracking alone does not? A) Logging model parameters and metrics B) Organizing model versions and transitions between stages C) Creating model artifacts D) Running hyperparameter tuning
Which method would you use to retrieve an artifact from a specific MLflow run? A) mlflow.get_artifact(run_id, path) B) mlflow.artifacts.download_artifacts(run_id, path) C) mlflow.tracking.download_artifacts(run_id, path) D) mlflow.tracking.MlflowClient().download_artifacts(run_id, path)
What is the purpose of a webhook in the MLflow Model Registry? A) To securely authenticate API calls B) To trigger automated actions when model events occur C) To monitor model performance in production D) To enable web-based model serving
Which deployment pattern is best for a use case requiring predictions on millions of records with no immediate time constraint? A) Real-time serving B) Batch inference C) Streaming inference D) Edge deployment
What is a suitable method to detect drift in a categorical feature with many possible values? A) Kolmogorov-Smirnov test B) Jensen-Shannon divergence C) Chi-square test D) t-test
How would you load a specific version of a model from the Model Registry? A) mlflow.pyfunc.load_model(“models:/model_name/version”) B) mlflow.pyfunc.load_model(“models:/model_name/3”) C) mlflow.load_model(“models:/model_name/3”) D) mlflow.load_version(“model_name”, 3)
Which Delta Lake operation helps maintain storage costs by removing old file versions? A) OPTIMIZE B) VACUUM C) COMPACT D) CLEAN
What is the best way to ensure a model’s preprocessing steps are consistently applied in production? A) Document the steps for manual implementation B) Include preprocessing in the model’s pipeline or custom PyFunc class C) Apply preprocessing in the application code D) Create a separate preprocessing service
What would you use to automatically track parameters, metrics, and artifacts for Spark ML models? A) mlflow.start_run() B) mlflow.spark.autolog() C) mlflow.log_model() D) spark.track_with_mlflow()
Which approach allows for the most efficient querying of model predictions stored in a Delta table? A) Saving predictions in JSON format B) Partitioning by frequently filtered columns and Z-ordering C) Using Parquet instead of Delta format D) Denormalizing the prediction data
What is the recommended way to deploy a model that needs to process real-time payments for fraud detection? A) Batch inference every hour B) Streaming inference with latency of minutes C) Real-time model serving with millisecond latency D) Edge deployment on payment terminals
When setting up a webhook for model registry events, which parameter indicates the webhook should trigger on model transitions to Production? A) events=[“PRODUCTION_TRANSITION”] B) stage=”Production” C) target_stage=”Production” D) production_only=True
What is the best practice for organizing experiments in MLflow? A) Create a new experiment for each model type B) Use the default experiment for all runs C) Create a new experiment for each business problem D) Create a new experiment for each modeler
Which of the following is NOT a standard stage in the MLflow Model Registry? A) Development B) Staging C) Production D) Archived
What is the recommended approach when concept drift is detected in a production model? A) Roll back to a previous model version B) Retrain the model with recent data and evaluate performance C) Add more features to the model D) Increase the model complexity
Which method would you use to register a model directly from an MLflow run? A) mlflow.register_model_from_run(run_id, name) B) mlflow.register_model(“runs:/run_id/model”, name) C) mlflow.models.register(run_id, name) D) mlflow.client.register_model(run_id, name)
What does the score_batch method in the Feature Store client do? A) Scores models for feature importance B) Retrieves features and applies a model in one operation C) Batches multiple scoring requests for efficiency D) Calculates feature statistics in batches
Which approach is LEAST appropriate for detecting data quality issues? A) Monitoring the percentage of missing values B) Checking for out-of-range values based on training data C) Comparing average prediction values over time D) Verifying schema consistency
What is a recommended practice when deploying models via Databricks Jobs? A) Use all-purpose clusters for cost efficiency B) Include validation steps before promoting to production C) Deploy all models as a single job D) Use the same runtime for all jobs
How should you implement incremental feature computation in a streaming context? A) Recompute all features on each batch B) Use window functions and stateful processing C) Store feature values in a database and update manually D) Use batch processing instead of streaming
Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? A) Real-time serving endpoints B) Webhooks and scheduled jobs C) Delta Lake time travel D) Feature importance calculation
Which command is used to optimize a Delta table by co-locating related data? Answer: B) OPTIMIZE with ZORDER BY
Explanation: The OPTIMIZE command with ZORDER BY co-locates related data in the same files, which improves query performance by reducing the amount of data that needs to be read.
When using Feature Store to create a training set, what does the FeatureLookup parameter specify? Answer: B) Which features to retrieve and how to join them
Explanation: FeatureLookup specifies which features to retrieve from which feature tables and how to join them with the training dataset using lookup keys.
How can you include preprocessing logic in a model logged with MLflow? Answer: A) Create a custom PyFunc model with preprocessing in the predict method
Explanation: Creating a custom PyFunc model allows you to implement preprocessing logic in the predict method, ensuring the same preprocessing is applied consistently during inference.
What is the correct way to read a specific version of a Delta table? Answer: A) spark.read.format("delta").option("versionAsOf", 5).load("/path/to/table")
Explanation: The option "versionAsOf" is the correct parameter for specifying a particular version of a Delta table when reading.
Which method should you use to track nested runs in MLflow? Answer: A) mlflow.start_run(nested=True)
Explanation: Setting nested=True in mlflow.start_run() creates a child run under the current parent run, enabling hierarchical organization of experiment runs.
When writing to a Feature Store table with existing data, which mode ensures only specific records are updated? Answer: C) “merge”
Explanation: The “merge” mode updates existing records based on matching primary keys and inserts new records, while “overwrite” would replace the entire table.
What is a model signature in MLflow used for? Answer: B) Defining the expected input and output schema
Explanation: A model signature defines the expected data types and shapes for model inputs and outputs, enabling validation when the model is used for inference.
How do you access the metrics from a specific MLflow run programmatically? Answer: A) mlflow.get_run(run_id).data.metrics
Explanation: The correct way to access metrics from an MLflow run is by using the get_run() method and accessing the metrics through the .data.metrics attribute.
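A minimal sketch; the run ID is a placeholder for an existing run's ID.
import mlflow
run = mlflow.get_run("<your-run-id>")
print(run.data.metrics)  # e.g. {"accuracy": 0.91}
print(run.data.params)   # logged parameters are exposed the same way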
Which MLflow Client method is used to change a model’s stage from Staging to Production? Answer: C) client.transition_model_version_stage()
Explanation: The transition_model_version_stage() method is used to change a model version’s stage in the Model Registry.
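A short sketch of the call; the model name "churn_model" and version 3 are hypothetical, and archive_existing_versions is optional.
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model",
    version=3,  # the model version to promote
    stage="Production",
    archive_existing_versions=True,  # optionally archive whatever is currently in Production
)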
What happens when you register a model with a name that already exists in the Model Registry? Answer: B) It creates a new version of the model
Explanation: Registering a model with an existing name creates a new version under that name rather than overwriting the existing model or returning an error.
Which webhook event is triggered when a model version is moved to the Production stage? Answer: B) MODEL_VERSION_TRANSITIONED_STAGE
Explanation: The MODEL_VERSION_TRANSITIONED_STAGE event is triggered when a model version’s stage changes, including transitions to Production.
How can you add a description to a specific version of a registered model? Answer: B) client.update_model_version(model_name, version, description=description)
Explanation: The update_model_version() method is used to update metadata for a specific model version, including its description.
Which environment is recommended for running production-grade automated ML pipelines? Answer: B) Job clusters
Explanation: Job clusters are purpose-built for production workloads. They start when a job begins and terminate when it completes, optimizing costs.
How do you add a tag to a registered model? Answer: B) client.set_registered_model_tag(name, key, value)
Explanation: The set_registered_model_tag() method is the correct API for adding or updating a tag on a registered model.
What is the purpose of the ModelVersion stage “Archived” in the Model Registry? Answer: B) To keep old versions for reference without making them available for production
Explanation: The “Archived” stage is used for model versions that are no longer actively used but should be kept for reference or compliance purposes.
How would you programmatically get the latest production version of a model? Answer: A) client.get_latest_versions(name, stages=["Production"])[0]
Explanation: The get_latest_versions() method with stages=["Production"] returns the latest model version in the Production stage.
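A quick sketch, again with a hypothetical model name:
from mlflow.tracking import MlflowClient
client = MlflowClient()
latest_prod = client.get_latest_versions("churn_model", stages=["Production"])[0]
print(latest_prod.version, latest_prod.current_stage, latest_prod.run_id)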
Which deployment pattern is most appropriate for generating predictions on new data every hour? Answer: B) Batch inference
Explanation: Batch inference is ideal for scheduled, periodic processing of data when real-time predictions aren’t required.
What is the primary advantage of using mlflow.pyfunc.spark_udf() for model deployment? Answer: B) It distributes model inference across a Spark cluster
Explanation: The spark_udf() function efficiently distributes model inference across a Spark cluster, enabling parallel processing for better performance.
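A sketch of distributed batch scoring with spark_udf; the model URI, input path, and feature column names are hypothetical.
import mlflow.pyfunc
from pyspark.sql.functions import struct
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")
batch_df = spark.read.format("delta").load("/path/to/inference-data")
scored_df = batch_df.withColumn("prediction", predict_udf(struct("feature1", "feature2")))
scored_df.write.format("delta").mode("append").save("/path/to/predictions")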
When implementing a streaming inference pipeline, what feature helps manage state across micro-batches? Answer: C) checkpointLocation
Explanation: The checkpointLocation option in Structured Streaming enables fault tolerance by saving state information between micro-batches.
Which option in Databricks Model Serving allows an endpoint to use no resources when not receiving requests? Answer: B) Scale-to-Zero
Explanation: Scale-to-Zero allows model serving endpoints to automatically scale down to zero compute when they’re not receiving requests, reducing costs.
What method from the Feature Store client is optimized for batch scoring with feature lookups? Answer: B) fs.score_batch()
Explanation: The score_batch() method is specifically designed for efficient batch scoring with features from the Feature Store.
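A sketch of the call; it assumes the model was logged with the Feature Store client (fs.log_model) so that feature lookup metadata travels with it, and the model URI and input path are hypothetical.
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# The batch DataFrame only needs the lookup keys; features are fetched automatically
batch_df = spark.read.format("delta").load("/path/to/new-customers")
predictions = fs.score_batch("models:/churn_model/Production", batch_df)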
When using a watermark in a Structured Streaming application, what does it help with? Answer: C) Managing late-arriving data
Explanation: Watermarking in Spark Structured Streaming allows you to specify how late data can arrive and still be processed, which is essential for handling out-of-order data.
What is a typical reason to choose batch deployment over real-time deployment? Answer: B) When high throughput is more important than low latency
Explanation: Batch deployment is typically chosen when processing large volumes of data (high throughput) is more important than immediate responses (low latency).
What Spark Structured Streaming output mode should you use when you want to write only new predictions? Answer: C) “append”
Explanation: The “append” output mode only writes new output records to the sink, which is appropriate for writing only new predictions.
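A minimal sketch of a streaming write that combines append mode with a checkpoint location; the input and output paths are hypothetical.
# Read a stream of already-scored events and write only new rows to a Delta sink
predictions_stream = spark.readStream.format("delta").load("/path/to/scored-events")
query = (
    predictions_stream.writeStream
        .format("delta")
        .outputMode("append")                                  # emit only new predictions
        .option("checkpointLocation", "/path/to/checkpoints")  # progress/state survives restarts
        .start("/path/to/streaming-predictions")
)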
What type of drift occurs when the relationship between features and the target variable changes? Answer: C) Concept drift
Explanation: Concept drift refers specifically to changes in the relationship between input features and the target variable, which affects model performance even if feature distributions remain stable.
Which statistical test is appropriate for detecting drift in numerical features? Answer: B) Kolmogorov-Smirnov test
Explanation: The Kolmogorov-Smirnov test is commonly used to detect drift in numerical features by comparing the distributions of two samples.
What does a low p-value in the Kolmogorov-Smirnov test indicate? Answer: B) The distributions are significantly different
Explanation: In statistical hypothesis testing, a low p-value (typically <0.05) means we reject the null hypothesis, which in the KS test is that the distributions are the same.
Which approach is NOT typically part of a comprehensive drift monitoring solution? Answer: C) Retraining models on a fixed schedule regardless of drift
Explanation: A comprehensive drift monitoring solution typically triggers retraining based on detected drift, not on a fixed schedule regardless of whether drift has occurred.
What is an advantage of using Jensen-Shannon divergence for drift detection? Answer: C) It’s symmetric and bounded between 0 and 1
Explanation: Jensen-Shannon divergence has the advantage of being symmetric and bounded between 0 and 1, making it easier to interpret and set thresholds.
What is the most appropriate way to compare a retrained model against the production model? Answer: C) Compare their performance on recent production data
Explanation: The most appropriate comparison is on recent production data, as this reflects the current conditions the model will face.
When monitoring categorical features for drift, how should you handle new categories that weren’t in the training data? Answer: B) Consider them as a significant indicator of drift
Explanation: New categories that weren’t present in training data are often significant indicators of drift and should be flagged as such.
What is the primary purpose of monitoring data quality alongside drift detection? Answer: B) To identify changes that might impact model performance before drift occurs
Explanation: Data quality monitoring helps identify issues (like missing values or outliers) that might impact model performance before they manifest as drift.
Which MLflow flavor would you use for a custom model with complex preprocessing logic? Answer: D) mlflow.pyfunc
Explanation: The MLflow pyfunc flavor allows you to create custom Python models by extending the PythonModel class, enabling complex preprocessing logic.
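A compact sketch of the pattern; the wrapped estimator and the imputation logic are illustrative assumptions, not a prescribed implementation.
import mlflow
import mlflow.pyfunc
import pandas as pd

class PreprocessAndPredict(mlflow.pyfunc.PythonModel):
    def __init__(self, model, feature_means):
        self.model = model                  # any fitted estimator with .predict()
        self.feature_means = feature_means  # imputation values computed at training time

    def predict(self, context, model_input: pd.DataFrame):
        filled = model_input.fillna(self.feature_means)  # apply the same preprocessing as training
        return self.model.predict(filled)

# with mlflow.start_run():
#     mlflow.pyfunc.log_model("model", python_model=PreprocessAndPredict(model, feature_means))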
How would you optimize a Delta table for queries that frequently filter on a date column and then need to access sorted user IDs? Answer: A) OPTIMIZE table ZORDER BY (date_col, user_id)
Explanation: Z-ordering by both columns will co-locate data with similar dates and user IDs, optimizing queries that filter or sort by these columns.
What is the benefit of using Feature Store for model inference in production? Answer: B) It ensures consistent feature transformation between training and inference
Explanation: Feature Store ensures that the same feature definitions and transformations are used consistently between training and inference.
In MLflow, what does setting nested=True in a run allow you to do? Answer: B) Create parent-child relationships between runs
Explanation: The nested=True parameter creates a hierarchical relationship between runs, with the current run becoming a child of the active parent run.
What is the primary advantage of job clusters over all-purpose clusters for ML workflows? Answer: C) They are optimized for cost by terminating when jobs complete
Explanation: Job clusters automatically terminate when a job completes, optimizing costs by not consuming resources when not needed.
When implementing a feature store table that will be used for both training and online inference, what should you consider? Answer: B) Publishing the table for online serving
Explanation: For a feature table to be available for online (real-time) inference, it needs to be explicitly published for online serving.
Which approach would you use to deploy a model that needs to process a continuous stream of IoT sensor data? Answer: C) Streaming inference with Structured Streaming
Explanation: Streaming inference with Structured Streaming is ideal for continuously processing real-time data streams like IoT sensor data.
What does the MLflow model registry enable that experiment tracking alone does not? Answer: B) Organizing model versions and transitions between stages
Explanation: The Model Registry adds versioning and stage transitions (None, Staging, Production, Archived) that aren’t available with experiment tracking alone.
Which method would you use to retrieve an artifact from a specific MLflow run? Answer: D) mlflow.tracking.MlflowClient().download_artifacts(run_id, path)
Explanation: The download_artifacts method of the MlflowClient is used to retrieve artifacts from a specific run.
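A minimal sketch; the run ID is a placeholder and the artifact path assumes an image was logged under that name.
from mlflow.tracking import MlflowClient
client = MlflowClient()
local_path = client.download_artifacts("<your-run-id>", "confusion_matrix.png")
print(f"Artifact downloaded to {local_path}")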
What is the purpose of a webhook in the MLflow Model Registry? Answer: B) To trigger automated actions when model events occur
Explanation: Webhooks enable automated actions (like triggering a job) in response to model events such as version creation or stage transitions.
Which deployment pattern is best for a use case requiring predictions on millions of records with no immediate time constraint? Answer: B) Batch inference
Explanation: Batch inference is the most efficient pattern for high-volume predictions when there’s no immediate time constraint.
What is a suitable method to detect drift in a categorical feature with many possible values? Answer: C) Chi-square test
Explanation: The Chi-square test is appropriate for comparing categorical distributions, including those with many possible values.
How would you load a specific version of a model from the Model Registry? Answer: B) mlflow.pyfunc.load_model("models:/model_name/3")
Explanation: To load a specific version, you use the URI format "models:/model_name/version_number".
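A quick sketch with a hypothetical model name:
import mlflow.pyfunc
model_v3 = mlflow.pyfunc.load_model("models:/churn_model/3")              # a specific version number
prod_model = mlflow.pyfunc.load_model("models:/churn_model/Production")   # or the current Production version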
Which Delta Lake operation helps maintain storage costs by removing old file versions? Answer: B) VACUUM
Explanation: The VACUUM command permanently removes files that are no longer needed by the table and have been marked for deletion by previous operations.
What is the best way to ensure a model’s preprocessing steps are consistently applied in production? Answer: B) Include preprocessing in the model’s pipeline or custom PyFunc class
Explanation: Including preprocessing in the model itself (via pipeline or custom class) ensures consistent application in all environments.
What would you use to automatically track parameters, metrics, and artifacts for Spark ML models? Answer: B) mlflow.spark.autolog()
Explanation: The mlflow.spark.autolog() function automatically logs parameters, metrics, and artifacts for Spark ML models.
Which approach allows for the most efficient querying of model predictions stored in a Delta table? Answer: B) Partitioning by frequently filtered columns and Z-ordering
Explanation: Partitioning by frequently filtered columns combined with Z-ordering provides the most efficient query performance on Delta tables.
What is the recommended way to deploy a model that needs to process real-time payments for fraud detection? Answer: C) Real-time model serving with millisecond latency
Explanation: Fraud detection during payment processing requires immediate responses, making real-time model serving with low latency the appropriate choice.
When setting up a webhook for model registry events, which parameter indicates the webhook should trigger on model transitions to Production? Answer: C) target_stage="Production"
Explanation: The target_stage parameter specifies that the webhook should trigger when a model is transitioned to a specific stage, in this case, Production.
What is the best practice for organizing experiments in MLflow? Answer: C) Create a new experiment for each business problem
Explanation: Creating experiments around business problems or use cases helps organize related model development efforts.
Which of the following is NOT a standard stage in the MLflow Model Registry? Answer: A) Development
Explanation: The standard stages in the MLflow Model Registry are None (default), Staging, Production, and Archived. “Development” is not a standard stage.
What is the recommended approach when concept drift is detected in a production model? Answer: B) Retrain the model with recent data and evaluate performance
Explanation: When concept drift is detected, retraining the model with recent data that reflects the new relationship between features and target is recommended.
Which method would you use to register a model directly from an MLflow run? Answer: B) mlflow.register_model("runs:/run_id/model", name)
Explanation: The register_model function with a runs URI is used to register a model directly from a run.
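A minimal sketch; the run ID is a placeholder, and it assumes the model was logged under the artifact path "model".
import mlflow
run_id = "<your-run-id>"
result = mlflow.register_model(f"runs:/{run_id}/model", "churn_model")
print(result.name, result.version)  # registering an existing name creates a new version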
What does the score_batch method in the Feature Store client do? Answer: B) Retrieves features and applies a model in one operation
Explanation: The score_batch method efficiently retrieves features from the Feature Store and applies a model in a single operation.
Which approach is LEAST appropriate for detecting data quality issues? Answer: C) Comparing average prediction values over time
Explanation: While useful for monitoring model behavior, comparing average predictions is not a direct method for detecting data quality issues like missing values or schema changes.
What is a recommended practice when deploying models via Databricks Jobs? Answer: B) Include validation steps before promoting to production
Explanation: Including validation steps in deployment jobs ensures that models meet quality standards before being promoted to production.
How should you implement incremental feature computation in a streaming context? Answer: B) Use window functions and stateful processing
Explanation: Window functions and stateful processing in Structured Streaming enable efficient incremental feature computation as new data arrives.
Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? Answer: B) Webhooks and scheduled jobs
Explanation: Webhooks and scheduled jobs provide the automation backbone needed to trigger retraining and deployment when drift is detected.
What is the correct way to create a Feature Store table? A) fs.create_table(name, primary_keys, df, description) B) fs.create_feature_table(name, keys, dataframe, description) C) fs.register_table(name, keys, dataframe, description) D) fs.create_new_table(name, primary_keys, dataframe, description)
How do you restore a Delta table to a specific timestamp?
A) spark.read.format("delta").option("timestampAsOf", timestamp).load(path).write.format("delta").mode("overwrite").save(path)
B) spark.restore.format("delta").timestamp(timestamp).save(path)
C) DeltaTable.forPath(spark, path).restoreToTimestamp(timestamp)
D) spark.sql(f"RESTORE TABLE delta.{path} TO TIMESTAMP AS OF '{timestamp}'")
Which MLflow method can automatically log parameters, metrics, and artifacts without explicit logging statements? A) mlflow.auto_track() B) mlflow.start_tracking() C) mlflow.autolog() D) mlflow.enable_logging()
What does the “version” field in the Delta table history tell you? A) The schema version of the Delta protocol B) The version number of each transaction in sequence C) The version of Spark used to write the table D) The version of the table schema
How would you share features from a feature table with a specific model during training? A) fs.get_features(table_name, feature_names) B) fs.create_training_set(df, feature_lookups, label) C) fs.lookup_features(table_name, feature_names, lookup_key) D) fs.select_features(table_name, feature_names, lookup_key)
What information does MLflow NOT automatically capture when autologging is enabled? A) Model parameters B) Performance metrics C) Feature importance D) Business context for the model
Which operation is NOT possible with Delta Lake? A) Time travel to query a previous version B) Schema evolution to add new columns C) Rollback to a previous version D) Real-time streaming without micro-batching
What is the purpose of specifying input_example when logging a model in MLflow? A) To validate the model’s input schema B) To provide an example for documentation and inference testing C) To set default values for the model D) To optimize model storage
What is the purpose of a custom PyFunc model class in MLflow? A) To implement models in programming languages other than Python B) To include custom preprocessing and postprocessing with the model C) To optimize model inference speed D) To create ensemble models from multiple base models
Which MLflow Model Registry stage is a model automatically assigned when first registered? A) Development B) None C) Staging D) Production
What method would you use to transition a model version from Staging to Production programmatically? A) client.set_model_version_stage(name, version, “Production”) B) client.transition_model_version_stage(name, version, “Production”) C) client.update_model_version(name, version, stage=”Production”) D) client.promote_model_version(name, version, “Staging”, “Production”)
How would you archive a specific version of a registered model using the MLflow client? A) client.archive_model_version(name, version) B) client.set_model_version_archived(name, version) C) client.update_model_version(name, version, status=”ARCHIVED”) D) client.transition_model_version_stage(name, version, “Archived”)
What is the purpose of model aliases in MLflow? A) To provide user-friendly names for complex models B) To create named references to specific model versions C) To specify which flavor to use when loading a model D) To categorize models by their intended use
Which method would you use to add a tag to a specific model version? A) client.add_model_version_tag(name, version, key, value) B) client.tag_model_version(name, version, key, value) C) client.set_model_version_tag(name, version, key, value) D) client.update_model_version(name, version, tags={key: value})
What happens when you create a webhook for MODEL_VERSION_CREATED events? A) It triggers when any model version is created in the workspace B) It triggers only when versions of the specified model are created C) It triggers when a model version is created or updated D) It triggers only when a model version is created by the webhook owner
Which type of compute is most appropriate for a CI/CD pipeline that automatically tests new model versions? A) All-purpose cluster B) Single-node cluster C) Job cluster D) Interactive cluster
What is the primary advantage of deploying a model with spark_udf() over a pandas UDF? A) It enables GPU acceleration B) It distributes model inference across the cluster C) It allows for more complex preprocessing D) It reduces model serving latency
Which deployment pattern is most appropriate for a model that needs to provide product recommendations when a user visits a webpage? A) Batch inference B) Streaming inference C) Real-time serving D) Embedded inference
What is the purpose of a checkpointLocation in a Structured Streaming application? A) To cache model predictions for faster retrieval B) To save the streaming state for fault tolerance C) To record model metrics for monitoring D) To store a copy of the model for rollback
When deploying a model for batch inference, which optimization technique would MOST improve query performance on specific customer segments? A) OPTIMIZE with bin-packing B) VACUUM to remove stale files C) Z-ORDER by customer segment columns D) Partitioning by random IDs
What is the benefit of using Feature Store with real-time model serving? A) It automatically updates feature values in real-time B) It provides low-latency access to pre-computed features C) It eliminates the need for feature transformations D) It improves model accuracy for real-time predictions
Which deployment pattern uses the least amount of resources for a model that needs to generate predictions once per week? A) Continuous streaming deployment B) Always-on real-time serving C) Scheduled batch inference with job clusters D) On-demand batch inference with all-purpose clusters
What is watermarking used for in a Structured Streaming application? A) To validate data authenticity B) To manage state for late-arriving data C) To encrypt sensitive data D) To compress streaming output
Which approach would you use to efficiently apply a model to a streaming source with millions of events per second? A) foreachBatch with spark_udf B) Real-time model serving endpoints C) Process each event individually with pandas UDFs D) Queue events and process in micro-batches
What type of drift occurs when the overall statistical distribution of input features changes? A) Concept drift B) Label drift C) Feature drift D) Model drift
Which statistical test is most appropriate for detecting drift in a categorical feature with many unique values? A) t-test B) Chi-square test C) Kolmogorov-Smirnov test D) Mann-Whitney U test
What is a limitation of using basic summary statistics (mean, median, etc.) for drift detection? A) They require too much computation B) They can miss changes in distribution shape while means remain similar C) They don’t work with large datasets D) They can only be applied to numerical features
When monitoring a production model, which metric should trigger immediate investigation if it suddenly increases? A) Model inference latency B) Number of unique users C) Prediction error rate compared to baseline D) Model version number
What is the Jensen-Shannon divergence used for in model monitoring? A) Measuring the computational efficiency of a model B) Quantifying the difference between probability distributions C) Evaluating model convergence during training D) Calculating feature importance scores
Which of the following is NOT typically a sign of concept drift? A) Stable error metrics but changing feature distributions B) Degrading error metrics with stable feature distributions C) Changes in both feature distributions and error metrics D) Increasing inference latency with no changes in data
What is a recommended approach for handling missing values when monitoring categorical features for drift? A) Treat missing values as a separate category B) Ignore records with missing values C) Impute missing values with the mode D) Replace missing values with a fixed string
Which metric would be LEAST useful for detecting label drift? A) Jensen-Shannon divergence of label distributions B) Chi-square test on label frequencies C) Percentage of missing labels D) Model training time
What is the primary benefit of implementing feature monitoring in the Feature Store? A) It automatically prevents drift B) It detects changes in feature distributions over time C) It optimizes feature computation D) It improves feature selection
How would you implement a model that requires both batch inference for daily reports and real-time inference for user interactions? A) Create two separate models with different code B) Deploy the same model artifact through both batch and real-time patterns C) Use only streaming inference as a compromise D) Convert all use cases to batch processing
What is the correct way to log a confusion matrix visualization to MLflow? A) mlflow.log_confusion_matrix(cm) B) mlflow.log_figure(fig) C) mlflow.log_metrics({“confusion_matrix”: cm}) D) mlflow.log_artifact(“/tmp/confusion_matrix.png”)
Which approach is most efficient for deploying a model that needs to process streaming data and update aggregated metrics in real-time? A) Real-time serving with REST endpoints B) Structured Streaming with stateful processing C) Batch inference run every minute D) Lambda architecture with separate batch and speed layers
What is a key benefit of using Delta tables to store model predictions? A) They automatically optimize model accuracy B) They provide ACID transactions and time travel capabilities C) They eliminate the need for feature engineering D) They automatically detect data drift
What would you use to track the lineage of a feature in the Feature Store? A) Delta table history B) MLflow tags C) Feature Store metadata and descriptions D) Webhook event logs
When evaluating a model for concept drift, what approach provides the most direct evidence? A) Comparing feature distributions between training and production B) Monitoring model prediction distributions C) Measuring model error on recent labeled data D) Tracking inference latency
Which deployment strategy requires the least code modification when converting from batch to streaming inference? A) Using foreachBatch with the same processing logic B) Reimplementing with Kafka consumers C) Creating a new REST API endpoint D) Implementing a custom streaming sink
What does a model signature in MLflow help prevent? A) Unauthorized model access B) Model versioning conflicts C) Input schema mismatches during inference D) Overwriting production models
Which approach would you use to handle a model with different preprocessing requirements for different feature types? A) Create separate models for each feature type B) Implement a custom PyFunc model with conditional preprocessing C) Use only features that share the same preprocessing D) Convert all features to the same type first
What is a best practice for organizing experiments in MLflow for a data science team? A) Use a single experiment for all team members B) Create separate experiments for each model iteration C) Organize experiments by business problem or use case D) Create a new experiment for each data scientist
How would you implement a comprehensive monitoring solution that detects both data drift and model performance issues? A) Monitor only prediction distributions B) Track feature distributions and model error metrics separately C) Compare model outputs between versions D) Run periodic A/B tests in production
Which strategy should you use when promoting a model to production if it performs better on some metrics but worse on others? A) Always prioritize the model with better accuracy B) Evaluate metrics based on business impact and prioritize accordingly C) Create an ensemble of both models D) Keep the current model to avoid any regression
What is the correct approach to handle evolving schemas in feature tables? A) Create a new feature table for each schema version B) Use Delta Lake schema evolution capabilities C) Maintain fixed schemas and reject new columns D) Convert all data to string type to avoid schema issues
Which method would you use to log multiple evaluation metrics for a model at once in MLflow? A) Call mlflow.log_metric() for each metric B) mlflow.log_metrics(metrics_dict) C) mlflow.log_dict(metrics_dict, “metrics.json”) D) mlflow.sklearn.log_metrics(metrics_dict)
What is a key benefit of using job clusters over all-purpose clusters for production ML workflows? A) They support more concurrent users B) They provide better interactive development experiences C) They are optimized for cost by terminating when jobs complete D) They allow for GPU acceleration
Which feature in Delta Lake is most important for implementing reproducible ML pipelines? A) Schema enforcement B) Time travel C) ACID transactions D) Z-ordering
How would you ensure that a model’s feature preprocessing is consistent between training and inference? A) Document the preprocessing steps carefully B) Include preprocessing in the model pipeline or custom PyFunc class C) Apply preprocessing in the application code D) Use separate preprocessing services
What is the recommended way to handle missing values in categorical features for a model in production? A) Drop records with missing values B) Apply the same imputation strategy used during training C) Return an error for records with missing values D) Use a default value not seen in training
What is the most appropriate response when detecting significant feature drift in a production model? A) Immediately roll back to a previous model version B) Assess impact on model performance and retrain if needed C) Add more features to compensate for the drift D) Switch to a more complex model architecture
Which approach allows for the most efficient querying of model predictions stored in a Delta table? A) Using the latest version of the table B) Partitioning by frequently filtered columns and Z-ordering C) Converting to Parquet format D) Using databricks SQL endpoints
When implementing a feature table that will be used for both batch and real-time inference, what should you consider? A) Using only numerical features for consistency B) Limiting the number of features to improve latency C) Publishing the table for online serving D) Creating separate tables for batch and real-time use
What is the primary purpose of the Model Registry in an MLOps workflow? A) To track experiments and hyperparameters B) To organize model versions and transitions between stages C) To store model artifacts D) To monitor model performance in production
What does the score_batch method in the Feature Store client do? A) Calculates feature importance scores B) Retrieves features and applies a model in one operation C) Evaluates model performance on batches of data D) Computes batch statistics for monitoring
Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? A) Delta Lake time travel B) Webhooks and scheduled jobs C) Feature Store table partitioning D) Custom model flavors
What should you do before promoting a new model version to production? A) Always retrain on the most recent data B) Validate performance against a holdout dataset C) Increase the model complexity D) Add more features to improve accuracy
What is a key consideration when designing a feature computation pipeline that needs to support both batch and streaming inference? A) Use only features that can be computed in real-time B) Implement feature logic that works in both paradigms C) Optimize exclusively for batch performance D) Create separate implementations for each pattern
Which approach is most effective for detecting concept drift in a model where ground truth labels are available with a delay? A) Monitor feature distributions only B) Calculate error metrics once labels become available C) Use unsupervised drift detection methods D) Compare model prediction distributions over time
Here are the answers to the second practice exam, along with explanations for each question:
What is the correct way to create a Feature Store table? Answer: A) fs.create_table(name, primary_keys, df, description)
Explanation: The correct method for creating a feature table is create_table() with parameters for the table name, primary keys, DataFrame, and optional description.
How do you restore a Delta table to a specific timestamp? Answer: A) spark.read.format("delta").option("timestampAsOf", timestamp).load(path).write.format("delta").mode("overwrite").save(path)
Explanation: Reading the table as of the timestamp and then overwriting the current table restores its earlier state using the standard DataFrame API. Newer Delta Lake releases also offer native restore operations, but the read-and-overwrite pattern shown in option A works on any version.
Which MLflow method can automatically log parameters, metrics, and artifacts without explicit logging statements? Answer: C) mlflow.autolog()
Explanation: mlflow.autolog() enables automatic logging of parameters, metrics, and artifacts for supported libraries without requiring explicit logging statements.
What does the “version” field in the Delta table history tell you? Answer: B) The version number of each transaction in sequence
Explanation: The version field in Delta table history represents the sequential version number of each transaction on the table, starting from 0 for the initial creation.
How would you share features from a feature table with a specific model during training? Answer: B) fs.create_training_set(df, feature_lookups, label)
Explanation: The create_training_set method creates a training dataset by joining features from feature tables with a DataFrame containing the entity keys and labels.
What information does MLflow NOT automatically capture when autologging is enabled? Answer: D) Business context for the model
Explanation: Autologging captures technical information like parameters, metrics, and artifacts, but cannot capture business context, which must be added manually.
Which operation is NOT possible with Delta Lake? Answer: D) Real-time streaming without micro-batching
Explanation: Delta Lake supports streaming through Spark Structured Streaming, which uses a micro-batch architecture. True “real-time” streaming without micro-batching is not supported.
What is the purpose of specifying input_example when logging a model in MLflow? Answer: B) To provide an example for documentation and inference testing
Explanation: An input example provides a sample input that can be used for documentation, inference testing, and as a reference for what the model expects.
What is the purpose of a custom PyFunc model class in MLflow? Answer: B) To include custom preprocessing and postprocessing with the model
Explanation: A custom PyFunc model allows you to include preprocessing and postprocessing logic with the model, ensuring consistent application during inference.
Which MLflow Model Registry stage is a model automatically assigned when first registered? Answer: B) None
Explanation: Newly registered models start in the “None” stage by default, and must be explicitly transitioned to other stages like Staging or Production.
What method would you use to transition a model version from Staging to Production programmatically? Answer: B) client.transition_model_version_stage(name, version, "Production")
Explanation: The transition_model_version_stage method is used to move a model version between registry stages.
How would you archive a specific version of a registered model using the MLflow client? Answer: D) client.transition_model_version_stage(name, version, "Archived")
Explanation: To archive a model version, you transition it to the “Archived” stage using the same method used for other stage transitions.
What is the purpose of model aliases in MLflow? Answer: B) To create named references to specific model versions
Explanation: Model aliases provide named references to specific model versions, allowing for more flexible development and deployment workflows.
Which method would you use to add a tag to a specific model version? Answer: C) client.set_model_version_tag(name, version, key, value)
Explanation: The set_model_version_tag method is used to add or update a tag on a specific version of a registered model.
What happens when you create a webhook for MODEL_VERSION_CREATED events? Answer: B) It triggers only when versions of the specified model are created
Explanation: Webhooks for MODEL_VERSION_CREATED events trigger only when new versions of the specified model are created, not for all models.
Which type of compute is most appropriate for a CI/CD pipeline that automatically tests new model versions? Answer: C) Job cluster
Explanation: Job clusters are designed for automated workloads like CI/CD pipelines. They start when a job begins and terminate when it completes, optimizing costs.
What is the primary advantage of deploying a model with spark_udf() over a pandas UDF? Answer: B) It distributes model inference across the cluster
Explanation: spark_udf() efficiently distributes model inference across a Spark cluster, allowing for better parallelization and throughput.
Which deployment pattern is most appropriate for a model that needs to provide product recommendations when a user visits a webpage? Answer: C) Real-time serving
Explanation: Real-time serving is appropriate for use cases requiring immediate predictions with low latency, such as providing recommendations when a user visits a webpage.
What is the purpose of a checkpointLocation in a Structured Streaming application? Answer: B) To save the streaming state for fault tolerance
Explanation: The checkpointLocation saves the state of a streaming query, enabling fault tolerance and exactly-once processing semantics.
When deploying a model for batch inference, which optimization technique would MOST improve query performance on specific customer segments? Answer: C) Z-ORDER by customer segment columns
Explanation: Z-ordering by customer segment columns co-locates related data, improving query performance when filtering or aggregating by those segments.
What is the benefit of using Feature Store with real-time model serving? Answer: B) It provides low-latency access to pre-computed features
Explanation: Feature Store provides low-latency access to pre-computed features via its online store, which is critical for real-time serving.
Which deployment pattern uses the least amount of resources for a model that needs to generate predictions once per week? Answer: C) Scheduled batch inference with job clusters
Explanation: Scheduled batch inference with job clusters that automatically terminate after completion is the most resource-efficient for weekly predictions.
What is watermarking used for in a Structured Streaming application? Answer: B) To manage state for late-arriving data
Explanation: Watermarking defines how late data can arrive and still be processed, helping manage state for stateful operations in streaming applications.
Which approach would you use to efficiently apply a model to a streaming source with millions of events per second? Answer: A) foreachBatch with spark_udf
Explanation: Using foreachBatch with spark_udf allows for efficient batch processing of events within a streaming context, which is ideal for high-throughput scenarios.
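A sketch of the pattern; the model URI, paths, and feature columns are hypothetical. The same scoring function could be reused verbatim in a scheduled batch job.
import mlflow.pyfunc
from pyspark.sql.functions import struct

predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/Production")

def score_microbatch(batch_df, batch_id):
    # Plain batch logic, applied to each micro-batch of the stream
    scored = batch_df.withColumn("prediction", predict_udf(struct("feature1", "feature2")))
    scored.write.format("delta").mode("append").save("/path/to/predictions")

(spark.readStream.format("delta").load("/path/to/events")
    .writeStream
    .foreachBatch(score_microbatch)
    .option("checkpointLocation", "/path/to/checkpoints")
    .start())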
What type of drift occurs when the overall statistical distribution of input features changes? Answer: C) Feature drift
Explanation: Feature drift specifically refers to changes in the statistical distribution of input features over time.
Which statistical test is most appropriate for detecting drift in a categorical feature with many unique values? Answer: B) Chi-square test
Explanation: The Chi-square test is appropriate for comparing categorical distributions, including those with many unique values.
What is a limitation of using basic summary statistics (mean, median, etc.) for drift detection? Answer: B) They can miss changes in distribution shape while means remain similar
Explanation: Summary statistics can miss important changes in distribution shape or variance if central tendencies like the mean remain similar.
When monitoring a production model, which metric should trigger immediate investigation if it suddenly increases? Answer: C) Prediction error rate compared to baseline
Explanation: A sudden increase in prediction error compared to baseline indicates a potential model degradation that requires immediate investigation.
What is the Jensen-Shannon divergence used for in model monitoring? Answer: B) Quantifying the difference between probability distributions
Explanation: Jensen-Shannon divergence is a method to measure the similarity between two probability distributions, making it useful for detecting drift.
Which of the following is NOT typically a sign of concept drift? Answer: D) Increasing inference latency with no changes in data
Explanation: Increasing inference latency without data changes is typically an infrastructure or resource issue, not concept drift, which involves changes in the relationship between features and targets.
What is a recommended approach for handling missing values when monitoring categorical features for drift? Answer: A) Treat missing values as a separate category
Explanation: Treating missing values as a separate category allows you to monitor changes in the rate of missingness, which can be an important indicator of drift.
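A minimal pandas sketch (the DataFrame and column name are hypothetical) of folding missing values into their own category before computing category frequencies:

import pandas as pd

# Hypothetical categorical column; treat NaN as its own "MISSING" category
categories = df["plan_type"].fillna("MISSING")
frequencies = categories.value_counts(normalize=True)
# Comparing these frequencies across time windows also tracks changes in the missingness rate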
Which metric would be LEAST useful for detecting label drift? Answer: D) Model training time
Explanation: Model training time has no direct relationship to label drift, which concerns changes in the distribution of target variables.
What is the primary benefit of implementing feature monitoring in the Feature Store? Answer: B) It detects changes in feature distributions over time
Explanation: Feature monitoring in the Feature Store helps detect changes in feature distributions over time, which is critical for maintaining model performance.
How would you implement a model that requires both batch inference for daily reports and real-time inference for user interactions? Answer: B) Deploy the same model artifact through both batch and real-time patterns
Explanation: The best practice is to use the same model artifact deployed through different patterns (batch and real-time) to ensure consistency.
What is the correct way to log a confusion matrix visualization to MLflow? Answer: D) mlflow.log_artifact("/tmp/confusion_matrix.png")
Explanation: To log a visualization, you typically save it as an image file and then use log_artifact to add it to the MLflow run.
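One way to do this (assuming y_test and y_pred already exist, and a recent scikit-learn version), saving the plot to a file and then logging the file as an artifact:

import mlflow
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

with mlflow.start_run(run_name="confusion-matrix-example"):
    # Render the confusion matrix and save it as an image file
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
    plt.savefig("/tmp/confusion_matrix.png")
    # Attach the saved image to the run
    mlflow.log_artifact("/tmp/confusion_matrix.png")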
Which approach is most efficient for deploying a model that needs to process streaming data and update aggregated metrics in real-time? Answer: B) Structured Streaming with stateful processing
Explanation: Structured Streaming with stateful processing is designed for efficiently processing streaming data and maintaining real-time aggregations.
What is a key benefit of using Delta tables to store model predictions? Answer: B) They provide ACID transactions and time travel capabilities
Explanation: Delta tables provide ACID transactions and time travel capabilities, ensuring data reliability and enabling historical analysis.
What would you use to track the lineage of a feature in the Feature Store? Answer: C) Feature Store metadata and descriptions
Explanation: Feature Store metadata and descriptions are used to document and track the lineage, sources, and transformations of features.
When evaluating a model for concept drift, what approach provides the most direct evidence? Answer: C) Measuring model error on recent labeled data
Explanation: Measuring model error on recent labeled data directly evaluates whether the relationship between features and target has changed.
Which deployment strategy requires the least code modification when converting from batch to streaming inference? Answer: A) Using foreachBatch with the same processing logic
Explanation: The foreachBatch method allows you to reuse batch processing logic within a streaming context with minimal code changes.
What does a model signature in MLflow help prevent? Answer: C) Input schema mismatches during inference
Explanation: A model signature defines the expected input and output schemas, helping prevent mismatches during inference.
Which approach would you use to handle a model with different preprocessing requirements for different feature types? Answer: B) Implement a custom PyFunc model with conditional preprocessing
Explanation: A custom PyFunc model allows you to implement conditional preprocessing logic based on feature types.
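A hedged sketch (the wrapped model, column groupings, and imputation choices are assumptions) of a custom PyFunc wrapper that applies different preprocessing per feature type before delegating to the underlying model:

import mlflow.pyfunc

class PreprocessedModel(mlflow.pyfunc.PythonModel):
    def __init__(self, model, numeric_cols, categorical_cols):
        self.model = model
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols

    def predict(self, context, model_input):
        data = model_input.copy()
        # Numeric features: impute with a constant (same strategy as training)
        data[self.numeric_cols] = data[self.numeric_cols].fillna(0)
        # Categorical features: impute with a sentinel category
        data[self.categorical_cols] = data[self.categorical_cols].fillna("MISSING")
        return self.model.predict(data)

# Log the wrapper so the preprocessing ships with the model
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        "model",
        python_model=PreprocessedModel(model, numeric_cols, categorical_cols),
    )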
What is a best practice for organizing experiments in MLflow for a data science team? Answer: C) Organize experiments by business problem or use case
Explanation: Organizing experiments by business problem or use case helps maintain clarity and focus in the team’s workflow.
How would you implement a comprehensive monitoring solution that detects both data drift and model performance issues? Answer: B) Track feature distributions and model error metrics separately
Explanation: A comprehensive solution tracks both feature distributions (to detect data drift) and model error metrics (to detect performance issues).
Which strategy should you use when promoting a model to production if it performs better on some metrics but worse on others? Answer: B) Evaluate metrics based on business impact and prioritize accordingly
Explanation: The best approach is to evaluate the business impact of each metric and prioritize those that align with the business goals.
What is the correct approach to handle evolving schemas in feature tables? Answer: B) Use Delta Lake schema evolution capabilities
Explanation: Delta Lake’s schema evolution capabilities allow you to add, modify, or remove columns while maintaining backward compatibility.
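A minimal sketch (paths and the DataFrame are hypothetical) of appending data with new columns using Delta Lake schema evolution:

# mergeSchema lets the write add new columns to the existing table schema
(updated_features_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/path/to/feature-table"))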
Which method would you use to log multiple evaluation metrics for a model at once in MLflow? Answer: B) mlflow.log_metrics(metrics_dict)
Explanation: The log_metrics method allows you to log multiple metrics at once by passing a dictionary of metric names and values.
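For example (metric values are illustrative), logging several metrics in a single call:

import mlflow

with mlflow.start_run():
    mlflow.log_metrics({"accuracy": 0.91, "precision": 0.88, "recall": 0.85, "f1": 0.86})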
What is a key benefit of using job clusters over all-purpose clusters for production ML workflows? Answer: C) They are optimized for cost by terminating when jobs complete
Explanation: Job clusters automatically terminate when a job completes, optimizing costs by not consuming resources when not needed.
Which feature in Delta Lake is most important for implementing reproducible ML pipelines? Answer: B) Time travel
Explanation: Time travel allows you to access specific versions of data, ensuring reproducibility in ML pipelines by using consistent data snapshots.
How would you ensure that a model’s feature preprocessing is consistent between training and inference? Answer: B) Include preprocessing in the model pipeline or custom PyFunc class
Explanation: Including preprocessing in the model pipeline or custom class ensures the same steps are applied consistently during both training and inference.
What is the recommended way to handle missing values in categorical features for a model in production? Answer: B) Apply the same imputation strategy used during training
Explanation: Consistency is key: apply the same imputation strategy used during training so the model receives data with the same structure in production.
What is the most appropriate response when detecting significant feature drift in a production model? Answer: B) Assess impact on model performance and retrain if needed
Explanation: The appropriate response is to first assess whether the drift impacts model performance, and then retrain if necessary.
Which approach allows for the most efficient querying of model predictions stored in a Delta table? Answer: B) Partitioning by frequently filtered columns and Z-ordering
Explanation: Partitioning by frequently filtered columns combined with Z-ordering provides the most efficient query performance.
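A hedged sketch (paths and columns are assumptions) that combines partitioning on a coarse, low-cardinality column with Z-ordering on finer-grained columns:

# Partition on a low-cardinality column, then Z-order within partitions
(predictions_df.write
    .format("delta")
    .partitionBy("prediction_date")
    .save("/path/to/predictions"))

spark.sql("OPTIMIZE delta.`/path/to/predictions` ZORDER BY (customer_segment, customer_id)")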
When implementing a feature table that will be used for both batch and real-time inference, what should you consider? Answer: C) Publishing the table for online serving
Explanation: For real-time inference, the feature table needs to be published for online serving to enable low-latency access.
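One possible sketch (the online store spec, region, and table name are assumptions; the spec class depends on your cloud) of publishing a feature table for online serving:

from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs = FeatureStoreClient()
online_store = AmazonDynamoDBSpec(region="us-west-2")

# Publish the offline feature table to the online store for low-latency lookups
fs.publish_table(
    name="customer_features",
    online_store=online_store,
    mode="merge",
)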
What is the primary purpose of the Model Registry in an MLOps workflow? Answer: B) To organize model versions and transitions between stages
Explanation: The Model Registry’s primary purpose is to organize model versions and manage transitions between stages (None, Staging, Production, Archived).
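A minimal sketch (model name and version are hypothetical) of moving a registered model version between stages with the MLflow client:

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model",
    version=3,
    stage="Production",
)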
What does the score_batch method in the Feature Store client do? Answer: B) Retrieves features and applies a model in one operation
Explanation: The score_batch method efficiently retrieves features from the Feature Store and applies a model in a single operation.
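A hedged sketch (the model URI and batch DataFrame are assumptions, and the model is assumed to have been logged with the Feature Store client) of scoring a batch in one call, with features looked up by the primary keys in the input:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
predictions = fs.score_batch(
    model_uri="models:/churn_model/Production",
    df=batch_df,  # must contain the lookup keys (e.g., customer_id)
)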
Which component is essential for implementing an automated ML pipeline that retrains when drift is detected? Answer: B) Webhooks and scheduled jobs
Explanation: Webhooks and scheduled jobs provide the automation backbone needed to trigger retraining and deployment when drift is detected.
What should you do before promoting a new model version to production? Answer: B) Validate performance against a holdout dataset
Explanation: Validating performance against a holdout dataset is a critical step to ensure the new model performs as expected before promotion.
What is a key consideration when designing a feature computation pipeline that needs to support both batch and streaming inference? Answer: B) Implement feature logic that works in both paradigms
Explanation: The feature logic should be implemented once in a form that runs in both batch and streaming paradigms, so features are computed identically regardless of how the pipeline is executed.
Which approach is most effective for detecting concept drift in a model where ground truth labels are available with a delay? Answer: B) Calculate error metrics once labels become available
Explanation: When ground truth labels are available (even with a delay), calculating error metrics provides the most direct measure of concept drift.
As you prepare for the actual exam, keep these strategies in mind: