End-to-End MLOps Pipeline using MLFlow (Part-3)

Peeush Agarwal
15 min read · Oct 10, 2024



Part 3: Model Deployment, Monitoring, and Continuous Improvement

  • 7. Model Evaluation
  • 8. Model Deployment with MLFlow
  • 9. Monitoring the Deployed Model
  • 10. Model Registry and Versioning
  • 11. Continuous Improvement
  • 12. Conclusion

Read Part 2: End-to-End MLOps Pipeline using MLFlow in case you missed the ML model training and hyperparameter tuning.

7. Model Evaluation

Evaluating a machine learning model is crucial to understanding how well it performs on unseen data. In this section, we will cover common regression metrics used to assess model performance, how to log these metrics using MLFlow, and how to visualize the results using MLFlow’s UI.

Metrics Explanation

For regression tasks like the bike prediction model, several evaluation metrics help quantify how well the model fits the data. Let’s briefly cover the most commonly used metrics:

  • R² (Coefficient of Determination):
    Measures how well the predicted values match the actual values. It typically ranges from 0 to 1, where 1 indicates perfect predictions (it can even be negative when the model fits worse than simply predicting the mean). A higher R² indicates a better fit.
    Formula: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
  • RMSE (Root Mean Squared Error):
    Provides a measure of the average error magnitude. RMSE is sensitive to outliers and is in the same unit as the target variable, making it easier to interpret. A lower RMSE means a better model.
    Formula: $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
  • MAE (Mean Absolute Error):
    Represents the average of the absolute differences between the predicted and actual values. Like RMSE, a lower MAE indicates a better model, but it is less sensitive to large errors (outliers).
    Formula: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$

Logging Metrics with MLFlow

Once we compute these metrics, it’s essential to track them using MLFlow for future reference and comparison across different model runs. MLFlow allows us to log metrics like R², RMSE, and MAE for each model we train.

Here’s an example of how to log evaluation metrics with MLFlow:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import mlflow

# Evaluate the model
y_pred = best_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

# Log the metrics using MLFlow
with mlflow.start_run():
    mlflow.log_metric("R2", r2)
    mlflow.log_metric("RMSE", rmse)
    mlflow.log_metric("MAE", mae)

print(f"R²: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")

In this code:

  • R², RMSE, and MAE are calculated using scikit-learn metrics.
  • These metrics are logged using mlflow.log_metric() inside an MLFlow run.
  • Once logged, MLFlow keeps a record of these metrics, making it easier to compare performance across different models and tuning strategies.

Visualizing Results with MLFlow UI

MLFlow’s powerful UI provides an easy way to visualize and compare the performance of different model runs. After logging the metrics, you can open the MLFlow UI to explore:

  1. Comparing Runs:
    In the MLFlow UI, you can compare different runs by looking at the R², RMSE, and MAE metrics across all experiments. This makes it simple to identify the model with the best performance.
  2. Plotting Metrics:
    MLFlow provides graphical visualizations for the metrics you logged. You can plot metrics such as RMSE over time or across different hyperparameter configurations to observe trends and choose the best model.

To launch the MLFlow UI, run the following command in your terminal:

mlflow ui

Once the UI is running, navigate to http://localhost:5000 in your browser. You’ll see a dashboard that allows you to explore individual runs, compare models, and visualize metrics. This helps in understanding the performance of different models at a glance.

MLFlow UI Snapshot
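
If you prefer to compare runs programmatically rather than through the UI, MLFlow also exposes a search API. Below is a minimal sketch using mlflow.search_runs(); the experiment name "bike-prediction" is an assumption and should be replaced with whatever your experiment is actually called:

import mlflow

# Assumption: the experiment name used for these runs; replace with your own.
experiment = mlflow.get_experiment_by_name("bike-prediction")

# Fetch all runs of the experiment as a pandas DataFrame,
# sorted so the lowest-RMSE run comes first.
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.RMSE ASC"],
)

# Show the metrics logged above for a quick side-by-side comparison.
print(runs[["run_id", "metrics.R2", "metrics.RMSE", "metrics.MAE"]].head())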

Why Logging Metrics and Visualizing Results Matters

By logging your metrics and using MLFlow’s visual tools:

  • Reproducibility: You can easily track and reproduce any experiment, ensuring that even the best-performing model from previous runs is recoverable.
  • Comparison: MLFlow allows you to efficiently compare the performance of various models and configurations, helping you make informed decisions.
  • Insights: Visualizing metrics can reveal important trends or issues (e.g., overfitting, underfitting) in your model, guiding further experimentation and tuning.

8. Model Deployment with MLFlow

After training and evaluating your machine learning model, the next step in the MLOps pipeline is deploying it so it can be used in production. This section will guide you through saving the best-performing model with MLFlow, serving the model through a REST API, and integrating it into larger systems using tools like Docker.

MLFlow Models: Saving the Best Performing Model

MLFlow makes it straightforward to save models along with their metadata for easy reuse and deployment. By saving the model, you not only preserve the trained model but also retain information about the parameters, environment, and code version used during training. This ensures that the deployment process is consistent and reproducible.

In MLFlow, models are stored in a standardized format that includes:

  • Model artifacts: The actual model file (e.g., .pkl or .h5).
  • Conda environment: Information about the Python environment (dependencies, versions) needed to run the model.
  • MLFlow model signature: Details about the input/output schema of the model.

Here’s how you can save the best-performing model:

import mlflow
import mlflow.sklearn
# Save the best model using MLFlow
with mlflow.start_run():
    # Take the first row of the training dataset as the model input example.
    input_example = X_train.iloc[[0]]
    # The signature is automatically inferred from the input example and its predicted output.
    mlflow.sklearn.log_model(best_model, "bike_prediction_model", input_example=input_example)

print("Model saved successfully!")

This code saves the best_model under the name "bike_prediction_model" in the MLFlow tracking server. You can later retrieve and use this model for deployment.
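
You can also load the logged model back into Python for batch scoring or further evaluation. Here is a minimal sketch using the generic pyfunc flavor; replace <run_id> with the run ID shown in the MLFlow UI:

import mlflow.pyfunc

# Assumption: substitute the run ID of the run that logged the model.
model_uri = "runs:/<run_id>/bike_prediction_model"

# Load the model through the framework-agnostic pyfunc interface.
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Score the held-out test set, just as in the evaluation step above.
predictions = loaded_model.predict(X_test)
print(predictions[:5])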

Serving the Model: Using MLFlow to Deploy via REST API

One of the biggest advantages of MLFlow is its seamless ability to serve models as a REST API, enabling external applications to interact with your model in real time. With MLFlow, you can easily serve the saved model through an HTTP endpoint.

Here’s how you can serve the model:

  • Serve the model locally using MLFlow’s built-in server:
mlflow models serve -m "runs:/<run_id>/bike_prediction_model" -p 1234 --no-conda

In this command:

  • -m specifies the model path. You can find this path by navigating to the MLFlow UI.
  • -p specifies the port to serve the model, e.g., port 1234.

Test the API by sending a request using curl or any API testing tool:

curl -X POST -H "Content-Type: application/json" -d '{"dataframe_split":{"columns":["season","holiday","workingday","weather","temp","atemp","humidity","windspeed","casual","registered","hour","day_of_week", "month","is_clear_weather","is_rainy_weather","is_holiday_workingday"],"data":[[3.0,0.0,1.0,1.0,33.62,40.15,59.0,0.0,29.0,98.0,11.0,1.0,7.0,1.0,0.0,0.0]]}}' http://127.0.0.1:1234/invocations

This API will return predictions based on the input data sent in the request. The API accepts data in the form of a JSON payload, where "columns" specify the feature names and "data" holds the feature values.
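
If you prefer calling the endpoint from Python instead of curl, the sketch below sends the same payload using the requests library (assumed to be installed in your environment):

import requests

# Same feature columns and sample row as in the curl example above.
payload = {
    "dataframe_split": {
        "columns": [
            "season", "holiday", "workingday", "weather", "temp", "atemp",
            "humidity", "windspeed", "casual", "registered", "hour",
            "day_of_week", "month", "is_clear_weather", "is_rainy_weather",
            "is_holiday_workingday",
        ],
        "data": [[3.0, 0.0, 1.0, 1.0, 33.62, 40.15, 59.0, 0.0, 29.0, 98.0,
                  11.0, 1.0, 7.0, 1.0, 0.0, 0.0]],
    }
}

# POST the JSON payload to the locally served MLFlow model.
response = requests.post("http://127.0.0.1:1234/invocations", json=payload)
print(response.json())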

9. Monitoring the Deployed Model

Once a model is deployed, the work isn’t done. Continuous monitoring is essential to ensure the model remains reliable and performs well as new data flows into the system. This section will cover why monitoring is critical, how to leverage MLFlow for tracking model drift, and how to log real-time performance metrics.

Why Monitoring is Important

After deployment, the environment in which your model operates can change, leading to potential issues such as model drift and data drift. Monitoring helps you:

  1. Detect Model Drift: Over time, the relationships in the data may shift, which can degrade your model’s performance. For example, changes in bike rental patterns due to new weather conditions or altered traffic patterns can impact the predictions.
  2. Ensure Model Accuracy: The performance of a model in production may not match the performance observed during training due to changes in input data distributions or system noise. Monitoring helps you catch these discrepancies early.
  3. Track Model Health: Regularly logging and evaluating key performance indicators (KPIs) like prediction accuracy or error rates helps identify any performance degradation and trigger corrective actions, such as retraining the model.

Without proper monitoring, your model may gradually become outdated or inaccurate, leading to poor business outcomes.

MLFlow for Monitoring

MLFlow offers built-in tools to monitor deployed models by logging live data, performance metrics, and detecting shifts in model behavior. With MLFlow, you can set up a robust feedback loop that tracks your model’s real-time performance and helps you compare it with the original metrics from the training phase.

Here’s how MLFlow can help with model monitoring:

  1. Log Live Data: Just like you logged metrics during training, you can log performance data from real-time predictions. This includes tracking metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and others on new, unseen data.
  2. Compare with Baseline Metrics: By comparing the performance of the live model with the baseline metrics logged during training, you can detect if the model is drifting or becoming less accurate over time.
  3. Versioning and Retraining: MLFlow allows you to version models and easily compare the current deployed version with previous models. This enables quick rollbacks or retraining when necessary.

Tracking Real-Time Performance

Tracking the real-time performance of a model is critical to ensuring it continues to deliver accurate predictions. MLFlow provides tools for logging performance metrics from live data, which can be compared against the training data to detect discrepancies.

Here’s an example of how you can monitor real-time performance using MLFlow:

import mlflow
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Simulating live data predictions
def monitor_real_time_predictions(model, X_live, y_true):
    y_pred = model.predict(X_live)

    # Calculate real-time metrics
    mae_live = mean_absolute_error(y_true, y_pred)
    rmse_live = np.sqrt(mean_squared_error(y_true, y_pred))

    # Log live data performance metrics using MLFlow
    with mlflow.start_run():
        mlflow.log_metric("MAE_live", mae_live)
        mlflow.log_metric("RMSE_live", rmse_live)

    print(f"Real-time MAE: {mae_live}")
    print(f"Real-time RMSE: {rmse_live}")

# Assuming you have live data and true labels
monitor_real_time_predictions(deployed_model, X_live, y_true_live)

In this code:

  • Real-time predictions are made using the deployed model.
  • MAE and RMSE metrics are calculated on the live data and logged using MLFlow.
  • These live metrics can be compared with the training metrics stored in MLFlow to assess the model’s performance drift.

Using MLFlow UI to Monitor and Compare

To make it easier to visualize how your model is performing over time, you can use MLFlow’s UI to track and compare metrics:

  1. View Metrics Over Time: In the MLFlow UI, you can view how real-time metrics evolve, helping you to detect any long-term performance degradation. You can visualize whether the error rate (RMSE or MAE) increases as the model sees more live data.
  2. Compare with Training Runs: MLFlow’s UI allows you to directly compare the metrics logged during training with the real-time metrics logged from production data. This comparison helps you identify whether the model’s accuracy is still aligned with the training performance or if retraining is needed.
  3. Alerts for Drift: You can also set up custom alerts based on certain thresholds. For example, if RMSE or MAE exceeds a certain limit, it could trigger an alert to notify the team of potential model drift (a minimal sketch of such a check follows below).
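
MLFlow itself does not ship a built-in alerting system, so the drift alerts mentioned in point 3 are typically implemented as a small custom check around the logged metrics. Here is a minimal sketch, assuming the training run’s RMSE serves as the baseline and that notify_team() is a hypothetical helper wired to your alerting channel (Slack, email, etc.):

import mlflow

# Assumption: the run ID of the training run whose RMSE is the baseline.
baseline_run = mlflow.get_run("<training_run_id>")
baseline_rmse = baseline_run.data.metrics["RMSE"]

# Allow live RMSE to exceed the baseline by at most 20% before alerting.
DRIFT_THRESHOLD = 1.2

def check_for_drift(rmse_live):
    """Compare live RMSE against the training baseline and alert on drift."""
    if rmse_live > baseline_rmse * DRIFT_THRESHOLD:
        # notify_team() is a hypothetical helper (e.g., a Slack or email webhook).
        notify_team(
            f"Possible model drift: live RMSE {rmse_live:.3f} "
            f"vs baseline {baseline_rmse:.3f}"
        )

# Example: call with the live RMSE computed in monitor_real_time_predictions()
check_for_drift(rmse_live=14.2)  # hypothetical live RMSE value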

Why Real-Time Monitoring Matters

By continuously monitoring real-time performance:

  • Early Detection: You can quickly detect issues like model or data drift, ensuring corrective actions (such as retraining or tuning) are taken before they cause serious problems.
  • Model Health Insights: Real-time metrics offer ongoing insights into how your model is performing in the real world, enabling you to make informed decisions on whether the model needs updates.
  • Sustained Performance: Monitoring ensures that your model’s accuracy and reliability are maintained over time, providing consistent and valuable predictions for your business.

10. Model Registry and Versioning

In any MLOps pipeline, keeping track of models and their versions is crucial for smooth deployment, monitoring, and iteration. MLFlow’s Model Registry provides an effective way to manage model versions and the entire lifecycle of a machine learning model — from development to production. This section will explain how to use MLFlow’s model registry to ensure robust versioning, transition models between different stages, and maintain model lifecycle management.

MLFlow Model Registry

The MLFlow Model Registry is designed to address key challenges in managing machine learning models, including:

  • Versioning: As new models are trained or updated, it’s essential to keep track of their versions to understand which models are in use and how they differ.
  • Lifecycle Management: Models in real-world applications move through multiple phases (e.g., testing, staging, and production). MLFlow helps track the lifecycle of models and control their transitions between these stages.
  • Collaborative Management: The model registry provides a central place for data scientists, ML engineers, and operations teams to collaborate on model management.

The model registry allows you to register, update, annotate, and transition models between different lifecycle stages.

Model Lifecycle Management

Models typically go through different stages during their lifecycle. In MLFlow, each model can be assigned to one of the following stages:

  1. None: The model is just registered, and no stage has been assigned yet.
  2. Staging: The model is ready for testing and validation in a production-like environment.
  3. Production: The model is serving predictions to real-world applications or users.
  4. Archived: The model is no longer in use but is retained for historical reference.

MLFlow makes it easy to move models through these stages while maintaining version control, ensuring that every change or update is tracked.

For example, after training a new version of the bike prediction model, you can first transition it to the Staging environment for testing. Once the model passes all tests, you can promote it to Production. If a model is deprecated, you can archive it for future reference.

Example Code: Registering and Promoting Models in MLFlow

To showcase how MLFlow handles model versioning and lifecycle management, let’s walk through the process of registering and promoting a model.

Registering the Model in the MLFlow Registry

Once you have trained and saved your model using MLFlow, the next step is to register it in the model registry. Here’s how you can register a model:

import mlflow
import mlflow.sklearn

# Start an MLFlow run
with mlflow.start_run() as run:
    # Take the first row of the training dataset as the model input example.
    input_example = X_train.iloc[[0]]

    # The signature is automatically inferred from the input example and its predicted output.
    mlflow.sklearn.log_model(best_model, "bike_prediction_model", input_example=input_example)

# Register the model in the model registry
model_uri = f"runs:/{run.info.run_id}/bike_prediction_model"
mlflow.register_model(model_uri, "BikePredictionModel")

This code logs your trained model and registers it in the MLFlow Model Registry under the name "BikePredictionModel". Each new registration is automatically assigned a version number.

Promoting Models through Different Stages

Once the model is registered, you can transition it between stages (e.g., Staging to Production) using the MLFlow client API:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition the model to staging
client.transition_model_version_stage(
    name="BikePredictionModel",
    version=1,
    stage="Staging"
)

# After testing, promote the model to production
client.transition_model_version_stage(
    name="BikePredictionModel",
    version=1,
    stage="Production"
)

In this example:

  • The model "BikePredictionModel", version 1, is first moved to the Staging phase for testing.
  • After it passes validation, it is promoted to Production, where it will be used for live predictions.

Tracking Model Versions

Each time you register a new version of a model, MLFlow keeps a record. You can access these versions and track their changes:

# List all versions of the registered model
model_versions = client.search_model_versions("name='BikePredictionModel'")
for version in model_versions:
    print(f"Version: {version.version}, Stage: {version.current_stage}")

This snippet lists all versions of the BikePredictionModel and shows their current stages, helping you track how models evolve over time.
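
Once a version has been promoted, downstream applications can load it from the registry by name and stage instead of by run ID, which keeps serving code stable as new versions are promoted. A minimal sketch using the models:/ URI scheme:

import mlflow.pyfunc

# Load whichever version of the model is currently in the Production stage.
production_model = mlflow.pyfunc.load_model("models:/BikePredictionModel/Production")

# Score new data with the production model (X_test used here for illustration).
predictions = production_model.predict(X_test)
print(predictions[:5])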

11. Continuous Improvement

As machine learning models are deployed in production environments, the data they interact with often evolves. This shift in data distribution can cause a model’s performance to degrade over time, making continuous improvement a critical component of any MLOps pipeline. In this section, we’ll explore the importance of keeping your models up-to-date, how to retrain models with new data, and how to automate this process using MLFlow.

Why Continuous Improvement Matters

Once a model is in production, its performance may drift due to changing trends in data, new user behaviors, or shifts in environmental factors. This phenomenon, often referred to as model drift, can significantly reduce the accuracy and reliability of your predictions. To counteract this, continuous improvement involves regularly retraining models with fresh data, ensuring they remain relevant and perform optimally.

In the case of the Bike Prediction Dataset, imagine that factors such as traffic patterns, weather trends, or holidays change over time. Without regular updates, a model trained on older data may no longer provide accurate predictions. Continuous improvement addresses this by keeping the model aligned with the latest data patterns.

Retraining Models

Model retraining involves periodically feeding the model with new batches of data, adjusting its parameters, and fine-tuning it to reflect the current state of the environment. This can be done manually or automatically. With MLFlow, this process becomes efficient through automated tracking and logging, so you can easily compare retrained models with previous versions and determine which model performs best.

Key steps for retraining:

  1. Collect new data: Gather fresh data that reflects any changes in the real-world scenario. For the bike prediction model, this could involve updated weather patterns, new holidays, or shifts in bike rental demand.
  2. Retrain the model: Use the new data to retrain the model and adjust its parameters to improve predictions.
  3. Track and log retraining runs: Each retraining session should be logged in MLFlow to keep track of the model’s performance metrics and parameters, enabling seamless comparison with previous versions.

Model Retraining Pipeline

A robust MLOps pipeline should incorporate an automated model retraining loop, ensuring the model is continuously updated without manual intervention. With MLFlow, you can create a feedback loop that retrains models based on new data, logs the updated model versions, and evaluates their performance automatically.

Here’s an example of how a continuous improvement pipeline can be set up:

  1. Trigger the retraining process: This could be scheduled periodically (e.g., weekly or monthly) or triggered based on performance metrics (e.g., if the model’s accuracy drops below a certain threshold). For instance, if the bike prediction model starts underperforming, it could trigger a new retraining cycle.
  2. Automate data ingestion: Automatically ingest and preprocess new batches of data. This could be done through data pipelines that fetch updated data from a database, data warehouse, or other sources.
  3. Retrain the model and log the run: Use MLFlow to start a new run, retrain the model, and log the results, including new hyperparameters, evaluation metrics, and artifacts such as feature importance or validation curves.

Example code for automating the retraining and logging process:

import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def retrain_model(new_data):
    # Load new dataset
    X = new_data.drop(columns=["target"])
    y = new_data["target"]

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and train a new model (e.g., Linear Regression)
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Log the retrained model and metrics with MLFlow
    with mlflow.start_run():
        mlflow.sklearn.log_model(model, "retrained_bike_prediction_model")
        # Log metrics like R² and RMSE
        predictions = model.predict(X_test)
        r2 = model.score(X_test, y_test)
        mlflow.log_metric("r2_score", r2)
        # Add more metrics and artifacts as needed

    print(f"Retrained model logged with R²: {r2}")

# Retrain with new data
new_data = ...  # Load updated bike prediction dataset
retrain_model(new_data)

  4. Evaluate and compare new models: Use MLFlow’s experiment tracking features to compare the performance of retrained models against older versions. You can visually inspect metrics like R², RMSE, and MAE to determine if the new model performs better and whether it should be promoted to production (a minimal promotion check is sketched after this list).
  5. Deploy the updated model: Once the new model proves to be an improvement, promote it to production using the MLFlow model registry and update the serving infrastructure.
  6. Monitor performance: Continuously monitor the model’s performance in production to ensure it remains stable. If any issues arise, the retraining loop can be triggered again.
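
For steps 4 and 5, one lightweight approach is to compare the retrained run’s metrics with those of the current production model and promote the new registry version only if it improves. The sketch below assumes both runs logged an RMSE metric, that the retrained model was registered as version 2 of BikePredictionModel, and that you substitute the real run IDs:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Assumption: run IDs of the retraining run and of the current production training run.
candidate_rmse = client.get_run("<retrain_run_id>").data.metrics["RMSE"]
production_rmse = client.get_run("<production_run_id>").data.metrics["RMSE"]

# Promote the retrained model only if it beats the current production model.
if candidate_rmse < production_rmse:
    # Assumption: the retrained model was registered as version 2.
    client.transition_model_version_stage(
        name="BikePredictionModel",
        version=2,
        stage="Production",
        archive_existing_versions=True,  # archive the previously deployed version
    )
    print(f"Promoted retrained model (RMSE {candidate_rmse:.3f}) to Production.")
else:
    print("Retrained model did not improve; keeping the current Production version.")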

12. Conclusion

In this guide, we’ve walked through building a complete MLOps pipeline using MLFlow with the Bike Prediction Dataset as a practical example. From setting up the environment to deploying and monitoring the model, each step demonstrates how an MLOps pipeline can streamline the process of developing, tracking, and improving machine learning models in production. We’ve covered everything from data ingestion and preprocessing to model training, deployment, and continuous improvement — all while leveraging MLFlow’s robust tracking and lifecycle management features.

Summary of the Pipeline

To recap, the pipeline we built included:

  1. Data Preparation: Loading, cleaning, and feature engineering the bike prediction dataset.
  2. Model Training and Tracking: Using MLFlow to log parameters, metrics, and artifacts.
  3. Hyperparameter Tuning and Experimentation: Exploring different model variations and tracking them efficiently.
  4. Model Deployment and Monitoring: Serving the model via a REST API and continuously monitoring its performance in production.
  5. Continuous Improvement: Automatically retraining models with fresh data to ensure optimal performance.

The Importance of MLOps in Enterprises

In the real world, machine learning models are only as good as the systems that manage them. The era of manual model management is over — MLOps has evolved into an indispensable framework for scaling ML in the enterprise. By adopting MLOps, businesses can achieve:

  • Faster Deployment: Automated pipelines ensure models are deployed quickly and consistently.
  • Enhanced Monitoring: Proactive monitoring of models in production reduces risks of performance degradation.
  • Scalability: With robust versioning and lifecycle management, models can be scaled and retrained efficiently as data grows or shifts.

Tools like MLFlow take the complexity out of managing these pipelines. By providing visibility into model versions, performance metrics, and serving capabilities, MLFlow empowers teams to iterate faster and deploy more reliable models — giving businesses a competitive edge.

Final Thoughts

Whether you’re a beginner stepping into the world of machine learning or a professional looking to optimize your workflows, adopting MLOps practices is a game-changer. Not only does it streamline the end-to-end model lifecycle, but it also ensures that your models stay relevant, accurate, and scalable in dynamic environments. By using MLFlow, you gain a flexible, enterprise-grade solution that handles everything from experiment tracking to deployment, making it easier to focus on what truly matters — building great models and driving real business impact.

Take the leap into MLOps and start automating, tracking, and improving your models efficiently. The future of machine learning isn’t just about building models; it’s about managing them effectively at scale.

Written by Peeush Agarwal

ML Engineer @ ProcessPointTechnologies | Mosaic | MLOps | Data Science