End-to-End MLOps Pipeline using MLFlow (Part-2)
Part 2: Data Preprocessing, Model Training, and Experimentation
- 4. Data Preprocessing and Feature Engineering
- 5. Model Training with MLFlow Tracking
- 6. Hyperparameter Tuning and Experimentation
Read Part 1: End-to-End MLOps Pipeline using MLFlow if you missed the introduction.
4. Data Preprocessing and Feature Engineering
Effective data preprocessing and feature engineering are the building blocks of a successful machine learning model. In this section, we’ll walk through how to load the Bike Sharing Demand Dataset, clean it, and engineer useful features that can enhance the model’s predictive power. Proper preprocessing ensures that your model is trained on high-quality data, which is essential for its performance.
Loading the Dataset
We begin by loading the Bike Sharing Demand Dataset using the popular Python library, pandas. The dataset contains information on bike rentals, including timestamp data, weather conditions, and rental counts. You can download the dataset from Bike Sharing Demand | Kaggle.
import pandas as pd
data = pd.read_csv("../data/train.csv")
data.head()
The dataset typically contains the following columns:
- datetime: Date and time of the bike rental.
- season: Season in which the rental occurred (1: Spring, 2: Summer, 3: Fall, 4: Winter).
- holiday: Whether the day is a holiday (0: No, 1: Yes).
- workingday: Whether the day is neither a weekend nor a holiday (0: No, 1: Yes).
- weather: Weather conditions (1: Clear, 2: Misty, 3: Light Snow/Rain, 4: Heavy Snow/Rain).
- temp: Temperature in Celsius.
- atemp: “Feels like” temperature in Celsius.
- humidity: Humidity percentage.
- windspeed: Wind speed.
- casual: Number of non-registered user rentals initiated.
- registered: Number of registered user rentals initiated.
- count: The number of total rentals (target variable).
Data Cleaning: Handling Missing Data, Outliers, and Erroneous Values
Real-world data often contains missing values, outliers, or errors, and it’s essential to clean the data before model training.
- Handling Missing Data:
Check if the dataset contains any missing values. If there are, you can decide whether to impute (fill in) those values using methods like the mean or median of the column, or you can drop the rows with missing values entirely.
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)
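The Kaggle training split usually arrives complete, but if the check above does report missing values, a minimal sketch like the following (median imputation for the numeric weather columns, then dropping whatever remains) is one reasonable way to handle them:
# Fill numeric gaps with the column median (robust to outliers), then drop any leftover rows
if missing_values.any():
    for col in ["temp", "atemp", "humidity", "windspeed"]:
        data[col] = data[col].fillna(data[col].median())
    data = data.dropna()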
- Handling Outliers:
Outliers can distort model training and predictions, especially for regression models. Visualize key features like temp, humidity, and windspeed to identify extreme values and decide whether they should be capped or removed.
import matplotlib.pyplot as plt
# Box plot to detect outliers in windspeed
plt.boxplot(data['windspeed'])
plt.title('Windspeed Outliers')
plt.show()
# Cap windspeed at a maximum threshold if necessary
data['windspeed'] = data['windspeed'].clip(upper=40)
- Erroneous Values:
Some data may contain errors like negative values for variables where they don’t make sense (e.g., negative temperatures). In such cases, it’s essential to either correct or remove those records.
# Remove rows with negative or erroneous values
data = data[data['temp'] >= 0]
Feature Engineering: Creating New Features
Feature engineering is the process of transforming or creating new features that can improve the predictive power of the model. For the bike prediction dataset, we can create several meaningful features based on the raw data:
- Datetime Features: Extracting information from the datetime column, such as hour, day of the week, and month, can help capture patterns like peak rental hours and seasonal variations.
# Convert 'datetime' column to datetime object
data['datetime'] = pd.to_datetime(data['datetime'])
# Create new features
data['hour'] = data['datetime'].dt.hour
data['day_of_week'] = data['datetime'].dt.dayofweek
data['month'] = data['datetime'].dt.month
- Weather Patterns: The weather column is categorical but could be further refined. For instance, we could create binary features like is_clear_weather or is_rainy_weather to simplify the relationship with the target variable.
# Create binary weather features
data['is_clear_weather'] = (data['weather'] == 1).astype(int)
data['is_rainy_weather'] = (data['weather'] >= 3).astype(int)
- Holiday and Working Day Interaction: Combining the holiday and workingday columns can help the model better understand behavior on different types of days (e.g., a holiday that is also a working day might behave differently).
# Create a combined feature for holidays and working days
data['is_holiday_workingday'] = ((data['holiday'] == 1) & (data['workingday'] == 1)).astype(int)
These engineered features can significantly improve the model’s ability to predict bike rental demand by providing it with more context about the data.
# Drop irrelevant columns (the raw datetime is no longer needed after extracting its components)
data.drop(columns=["datetime"], inplace=True)
# Note: casual and registered sum to count, so you may also want to drop them to avoid target leakage
Data Split: Training and Testing Split
Once the dataset is loaded, the next step is to split the data into training and testing sets. We’ll use the training set to build the model, while the testing set will help us evaluate the model’s performance on unseen data.
In this case, we’ll set aside 20% of the data for testing using scikit-learn’s train_test_split function.
from sklearn.model_selection import train_test_split
# Split the data into features and target
X = data.drop(columns=["count"]) # Features (all columns except 'count')
y = data["count"]  # Target variable
# Perform an 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Verify the split
print(f"Training data size: {X_train.shape}")
print(f"Testing data size: {X_test.shape}")
This ensures we have a training set (80% of the data) to fit the model and a testing set (20%) to validate its performance.
5. Model Training with MLFlow Tracking
Now that we’ve preprocessed the data and engineered useful features, the next step is to train a machine learning model. In this section, we’ll choose an appropriate regression model for predicting bike rentals and use MLFlow Tracking to log the model’s parameters, metrics, and artifacts, ensuring that our experiments are properly documented and easy to compare.
Model Choice
For the Bike Sharing Demand Dataset, we’ll use Decision Tree Regressor from the scikit-learn library as our baseline model. Decision trees are simple yet powerful models that can handle non-linear relationships between features and the target variable, making them well-suited for our dataset.
You can, of course, experiment with other models like Linear Regression or more advanced models like XGBoost, but for this tutorial, we’ll stick with Decision Trees for simplicity and interpretability.
MLFlow Tracking: Logging Parameters, Metrics, and Artifacts
The power of MLFlow comes from its ability to track machine learning experiments seamlessly. Here’s what we’ll track during the model training:
- Model Parameters: Key hyperparameters like the maximum depth of the decision tree.
- Evaluation Metrics: Metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), which are commonly used to evaluate regression models (see the short sketch after this list).
- Artifacts: Visualizations like feature importance plots that help us interpret the model.
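For reference, both error metrics follow their standard definitions; the scikit-learn helpers used in the code below compute the equivalent of this small NumPy sketch (y_test and predictions refer to the test labels and model predictions produced in that code):
import numpy as np
# MAE: average absolute difference between actual and predicted rental counts
mae = np.mean(np.abs(y_test - predictions))
# RMSE: square root of the average squared difference; large errors are penalized more heavily
rmse = np.sqrt(np.mean((y_test - predictions) ** 2))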
Starting an MLFlow Run
Before training the model, we start an MLFlow run using the mlflow.start_run()
context manager. This ensures that all logs and artifacts are stored in the correct run, enabling us to later compare different experiments easily.
Let’s look at an example of how to log the model training process with MLFlow:
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
# Define the model
model = DecisionTreeRegressor(max_depth=10, random_state=42)

# Start an MLFlow run
with mlflow.start_run():
    # Log model parameters
    mlflow.log_param("model_type", "DecisionTreeRegressor")
    mlflow.log_param("max_depth", 10)

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test set
    predictions = model.predict(X_test)

    # Calculate evaluation metrics
    mae = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))

    # Log evaluation metrics
    mlflow.log_metric("MAE", mae)
    mlflow.log_metric("RMSE", rmse)

    # Plot and log feature importance as an artifact
    feature_importances = model.feature_importances_
    plt.figure(figsize=(10, 6))
    plt.barh(X_train.columns, feature_importances)
    plt.title("Feature Importance")
    plt.savefig("feature_importance.png")

    # Log the artifact (the feature importance plot)
    mlflow.log_artifact("feature_importance.png")

    # Log the model itself
    mlflow.sklearn.log_model(model, "model")

print(f"Model training complete. MAE: {mae}, RMSE: {rmse}")
Breaking Down the Code
- Starting an MLFlow Run:
The mlflow.start_run() function initiates a new run, which automatically logs all relevant parameters, metrics, and artifacts to a unique run ID. This run is stored in the MLFlow tracking server and can be accessed later for comparison.
- Logging Parameters:
In this example, we’re logging the key hyperparameters for our DecisionTreeRegressor, such as the max_depth of the tree.
mlflow.log_param("max_depth", 10)
- Logging Metrics:
We compute two evaluation metrics, MAE and RMSE, which are logged using mlflow.log_metric(). These metrics allow us to track how well the model performs on the test set.
mlflow.log_metric("MAE", mae)
mlflow.log_metric("RMSE", rmse)
- Logging Artifacts:
Artifacts can be any files that are relevant to the experiment, such as plots or model outputs. In this case, we plot the feature importance and log it as an artifact using mlflow.log_artifact().
plt.savefig("feature_importance.png")
mlflow.log_artifact("feature_importance.png")
- Logging the Model:
After training, we log the entire trained model using mlflow.sklearn.log_model(). This enables us to load the model later for deployment or further analysis.
mlflow.sklearn.log_model(model, "model")
Why Track Experiments with MLFlow?
Experiment tracking is a vital part of building a scalable and reproducible machine learning pipeline. By logging hyperparameters, metrics, and artifacts, MLFlow enables you to:
- Compare different runs and models.
- Reproduce successful experiments with the same configurations.
- Share experiment results with team members or stakeholders.
- Store and retrieve models for future use or deployment.
In this case, we can easily run different models (e.g., changing the depth of the decision tree or trying a different algorithm) and compare their performance by inspecting the logged parameters and metrics in the MLFlow UI.
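Beyond browsing the UI, logged runs can also be pulled into a pandas DataFrame programmatically. Here is a minimal sketch using mlflow.search_runs(), assuming the runs above were logged to the currently active experiment:
import mlflow
# Fetch the logged runs as a DataFrame, best (lowest) RMSE first
runs = mlflow.search_runs(order_by=["metrics.RMSE ASC"])
# Compare the logged parameters and metrics side by side
print(runs[["run_id", "params.max_depth", "metrics.MAE", "metrics.RMSE"]])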
6. Hyperparameter Tuning and Experimentation
Experimentation is a key step in the machine learning workflow, and hyperparameter tuning is one of the most effective ways to improve a model’s performance. In this section, we’ll explore why hyperparameter tuning is important, how to leverage MLFlow for logging and comparing different experiments, and provide example code using GridSearchCV or RandomizedSearchCV for hyperparameter optimization.
Why Experimentation is Important
Every machine learning model has a set of parameters that control its behavior, known as hyperparameters. Unlike model parameters (which are learned from the data), hyperparameters are set before training begins and can significantly impact a model’s performance. For instance, in a Decision Tree Regressor, the depth of the tree (max_depth) or the minimum samples required to split a node (min_samples_split) can affect both the accuracy and generalization of the model.
The goal of experimentation is to find the optimal combination of hyperparameters that results in the best performance on your validation or test set. Without hyperparameter tuning, your model might underperform or overfit the data, leading to suboptimal results in production environments.
MLFlow for Hyperparameter Tuning
MLFlow makes it easy to log and track each experiment during the hyperparameter tuning process. Whether you use manual tuning, GridSearchCV, or RandomizedSearchCV, MLFlow allows you to:
- Log the different sets of hyperparameters tested.
- Log evaluation metrics for each run.
- Compare and analyze the performance of different runs using the MLFlow UI.
This allows you to maintain a history of experiments, so you can easily revert to the best model, share results with your team, or further fine-tune the best configuration.
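As a concrete example, the model logged in the previous section (under the artifact path "model") can be reloaded from its run at any time. A minimal sketch, where the run ID is a placeholder you would copy from the MLFlow UI or from mlflow.search_runs():
import mlflow.sklearn
# Placeholder run ID: replace with a real one from the MLFlow UI or mlflow.search_runs()
run_id = "<your-run-id>"
# Reload the DecisionTreeRegressor that was logged with mlflow.sklearn.log_model(model, "model")
reloaded_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
print(reloaded_model.predict(X_test[:5]))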
Example Code: Using GridSearchCV with MLFlow
Let’s implement hyperparameter tuning using GridSearchCV while tracking everything in MLFlow. GridSearchCV automates the process of trying out different combinations of hyperparameters and selecting the one that performs best based on cross-validation.
We’ll continue with the DecisionTreeRegressor and test different values for max_depth and min_samples_split.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import mlflow
import mlflow.sklearn

# Define the model and the hyperparameter grid
model = DecisionTreeRegressor(random_state=42)
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 10, 20]
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1)

# Start an MLFlow run for hyperparameter tuning
with mlflow.start_run():
    # Perform grid search
    grid_search.fit(X_train, y_train)

    # Log the best hyperparameters
    best_params = grid_search.best_params_
    mlflow.log_param("best_max_depth", best_params['max_depth'])
    mlflow.log_param("best_min_samples_split", best_params['min_samples_split'])

    # Log the best score (cross-validated)
    best_score = -grid_search.best_score_  # Convert to positive since we used neg_mean_squared_error
    mlflow.log_metric("best_cross_val_score", best_score)

    # Log the final model
    best_model = grid_search.best_estimator_
    mlflow.sklearn.log_model(best_model, "best_model")

    # Evaluate the model on the test set and log the result
    test_predictions = best_model.predict(X_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    mlflow.log_metric("test_RMSE", test_rmse)

print(f"Best Hyperparameters: {best_params}")
print(f"Test RMSE: {test_rmse}")
Breaking Down the Code
- GridSearchCV Setup:
We define a DecisionTreeRegressor and a parameter grid (param_grid) with different values for max_depth and min_samples_split. GridSearchCV will try every possible combination and use cross-validation (cv=5) to find the best set of hyperparameters.
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 10, 20]
}
- MLFlow Experiment Tracking:
We start an MLFlow run with mlflow.start_run() and log the best hyperparameters and cross-validation score using mlflow.log_param() and mlflow.log_metric(). This ensures that each experiment is fully tracked, and we can later compare the performance of different runs.
mlflow.log_param("best_max_depth", best_params['max_depth'])
mlflow.log_metric("best_cross_val_score", best_score)
- Model Logging and Evaluation:
The best model is logged with mlflow.sklearn.log_model() for future use. We also evaluate the model on the test set and log the final Root Mean Squared Error (RMSE), which is a common metric for regression tasks.
mlflow.sklearn.log_model(best_model, "best_model")
mlflow.log_metric("test_RMSE", test_rmse)
RandomizedSearchCV: A More Efficient Alternative
While GridSearchCV tries every combination of hyperparameters in the grid, it can be computationally expensive, especially when dealing with a large parameter space. An alternative is RandomizedSearchCV, which randomly samples a fixed number of hyperparameter combinations from a distribution, making it more efficient.
Here’s an example using RandomizedSearchCV:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define the hyperparameter distributions
param_distributions = {
    'max_depth': randint(5, 20),
    'min_samples_split': randint(2, 20)
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions,
                                   n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42, verbose=1)

# Perform RandomizedSearchCV with MLFlow tracking
with mlflow.start_run():
    random_search.fit(X_train, y_train)

    # Log the best hyperparameters and metrics
    best_params = random_search.best_params_
    mlflow.log_param("best_max_depth", best_params['max_depth'])
    mlflow.log_param("best_min_samples_split", best_params['min_samples_split'])
    best_score = -random_search.best_score_  # Convert to positive MSE
    mlflow.log_metric("best_cross_val_score", best_score)

    # Log the best model
    best_model = random_search.best_estimator_
    mlflow.sklearn.log_model(best_model, "best_model")

    # Evaluate and log test set performance
    test_predictions = best_model.predict(X_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    mlflow.log_metric("test_RMSE", test_rmse)

print(f"Best Hyperparameters: {best_params}")
print(f"Test RMSE: {test_rmse}")
Why Use MLFlow for Hyperparameter Tuning?
MLFlow makes the process of hyperparameter tuning and experimentation easier to manage, thanks to its centralized logging system. Here’s why it’s so useful:
- Reproducibility: You can easily track what hyperparameters were used in each run, ensuring that successful experiments can be reproduced later.
- Comparison: By logging the performance metrics of each run, MLFlow allows you to visually compare the results using its UI to determine which hyperparameters yielded the best results.
- Scalability: MLFlow’s tracking feature can be extended to larger, more complex experiments with more models and tuning strategies.
What next?
Read Part 3: End-to-End MLOps Pipeline using MLFlow to learn about ML model deployment and the continuous monitoring pipeline.