Multi-Objective Machine Learning Hyperparameter Tuning

(Without Explicit Objective Weighting)

by Jacob Kravits

Tuning your machine learning model’s hyperparameters is a great way to tailor your model’s performance. Your model keeps overfitting to your training data? Tuning your hyperparameters can help! When you tune a model’s hyperparameters, you need to select some objective to quantify desirable model performance. For some problems, one of which will be discussed in this post, you may want to examine how model performance changes across several objectives. We will see that this process allows us to make a better-informed choice of hyperparameters!

What are hyperparameters and why should we “tune” them?

The specification of hyperparameter values (whether optimal, suboptimal, or default) is essential to many common machine learning algorithms because they specify model topology and ultimately determine model behavior. For example, the number of trees in a random forest or sizes of the hidden layers in a neural network would both be specified by hyperparameters. If you are looking for more explanation of what hyperparameters are and how they differ from regular parameters, I recommend reading this blog post (Brownlee 2017) or Probst, Wright, and Boulesteix (2019).

Tuning hyperparameters is the process of searching for the hyperparameter values that give the best model performance. To be clear, tuning hyperparameters is generally a “finishing touches” step when developing a machine learning model. Shahul ES said it well by stating that hyperparameter tuning is a great way to “extract the last juice out of your models” (ES 2021). Processes like model or feature selection cannot be overlooked. Moral of the story: if your model is severely under-performing or just flat-out broken, don’t look to hyperparameter tuning to fix all your problems.

Why consider multiple objectives?

For many problems, a single objective effectively captures desirable model performance. For example, consider the classic Iris classification problem where a model classifies types of Iris flowers. You are probably thinking that for this task you want a model that is as accurate as possible, and I would agree! For the Iris problem, I would pick the set of hyperparameters that maximizes model accuracy.

But now think about the other classic problem of breast cancer diagnosis based on the Breast Cancer Diagnostic dataset currently hosted in the University of California, Irvine (UCI) repository (Dua and Graff 2017). In this case, how your model classifies patients has very different implications. This is best visualized with its confusion matrix:

In the false negative case, your model is telling people they don’t have cancer when they do. In the false positive case, your model is falsely scaring people by telling them they do have cancer when they don’t. Each of these scenarios is undesirable but in different ways. One thing we could do is optimize our hyperparameters to perform well on objectives like false positive rate or true positive rate, which explicitly consider those undesirable cases. But in those cases, we are saying that we only care about one of those off-diagonal cases, which is often not true. Another common approach is to optimize a weighted sum of false positive rate and true positive rate. But then the question becomes: how do you weight the two objectives? Are they equally important? Maybe one is slightly more important? To further complicate the issue, some weighting schemes may not change which hyperparameters come out as optimal.
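To make the weighting problem concrete, here is a minimal sketch in plain Python. The three candidate models and their rates are hypothetical numbers chosen for illustration (they are not results from this post), and the score `w * TPR - (1 - w) * FPR` is just one simple way to write a weighted sum.

```python
# Hypothetical cross-validated results for three candidate models:
# name -> (true positive rate, false positive rate). Illustrative numbers only.
candidates = {
    "A": (0.95, 0.10),
    "B": (0.90, 0.04),
    "C": (0.85, 0.02),
}


def best_by_weighted_sum(weight_tpr):
    """Return the candidate maximizing w * TPR - (1 - w) * FPR."""
    weight_fpr = 1.0 - weight_tpr

    def score(name):
        tpr, fpr = candidates[name]
        # TPR is rewarded and FPR is penalized.
        return weight_tpr * tpr - weight_fpr * fpr

    return max(candidates, key=score)


for w in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"weight on TPR = {w:.1f} -> best candidate: {best_by_weighted_sum(w)}")
```

With these made-up numbers, weights of 0.3, 0.4, and 0.5 all select candidate B, while 0.2 selects C and 0.6 selects A. The choice of weights matters, but not always in the way you might expect.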

The good news is that very smart people have thought about how to solve multi-objective problems without needing to weight the objectives before the optimization. To use these methods, we will need to rethink what “optimality” means (which we will get to later). Through these methods we will be able to study the degree to which objectives trade off and make an informed choice of optimal hyperparameters. Let’s do an example!

Cancer Detection Example

We will use the previously introduced UCI Breast Cancer Diagnostic problem with a decision tree for this example. The code I am providing will be specific to using a decision tree in Python, but the methods could easily be adapted to many hyperparameterized machine learning algorithms in many modern programming languages.


This example will use the Scikit-Learn (Pedregosa et al. 2011), NumPy (Harris et al. 2020), Pandas (The pandas development team 2020), and Pymoo (Blank and Deb 2020) libraries for the actual analysis. We will use the HiPlot (Haziza, Rapin, and Synnaeve 2020) package to do some interactive visualization. Install them in your current environment if you haven’t already done so. If you want to go the GitHub route, here is the repository which contains the script as well as goodies needed to run the code I will provide! Specifically, that repository has a Dockerfile and virtual environment dependencies. I recommend using one of those two options to ensure consistent results with the blog post (I have also included instructions for those not familiar with either method). However, the code should run in any environment with Python 3 and all the proper dependencies installed.

Once you have everything installed you should be able to import everything as follows:

import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.model_selection
import sklearn.metrics
import sklearn.tree
import sklearn.ensemble
import pymoo.util.nds.non_dominated_sorting as nds
import hiplot as hip

cv_objs = [
    'Mean CV Accuracy',
    'Mean CV True Positive Rate',
    'Mean CV False Positive Rate',
    'Mean CV AUC'
]
cv_objs_max = ['Mean CV Accuracy', 'Mean CV True Positive Rate', 'Mean CV AUC']
test_objs = [
    'Test Accuracy',
    'Test True Positive Rate',
    'Test False Positive Rate',
    'Test AUC'
]

We also create some naming variables which we will leave in the global scope.

Data Preparation

Fortunately, the dataset is available for direct import via Scikit-Learn (so no need to manually download it yourself)! Feature selection is the process of determining only the most important features for your problem. Feature selection won’t be the focus of this post, so I encourage you to read Rahul Agarwal’s post (Agarwal 2020) or the seminal work of Langley (1994) if you are not familiar with feature selection. We do a basic feature selection, keeping the five features with the highest feature importances from a random forest. After we do our feature selection, we split the data and save 25% for testing our model. The split is stratified (if you aren’t familiar, see Brownlee 2020). I have provided a simple function and function call to do this:

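def data_preparation():
    test_size = 0.25
    number_features = 5
    # Import
    data = sklearn.datasets.load_breast_cancer(as_frame=True)
    features = data.feature_names.tolist()
    df = data.frame
    df['Classification'] = data['target'].replace(
        {1: 'benign', 0: 'malignant'}
    )
    # Feature Selection: rank features by random forest importance
    clf = sklearn.ensemble.RandomForestClassifier(random_state=1008).fit(
        df[features], df['Classification']
    )
    # Reconstructed: sort importances so the top features come first
    feature_importances = pd.Series(
        clf.feature_importances_, index=features
    ).sort_values(ascending=False)
    important_features = feature_importances[0:number_features].index.tolist()
    # Split: stratified, holding out 25% for testing
    # (random_state here is an assumption, reusing the seed from above)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(
            df[important_features],
            df['Classification'],
            test_size=test_size,
            stratify=df['Classification'],
            random_state=1008
        )
    return X_train, X_test, y_train, y_test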

X_train, X_test, y_train, y_test = data_preparation()

Default Hyperparameters

Let’s look at the performance of the default hyperparameters. These values are chosen by the library developers and are meant to perform decently on many problems. Default hyperparameters are used in the following code:

def default_hyperparameter(X_train, y_train):
    clf = sklearn.tree.DecisionTreeClassifier(random_state=1008).fit(
        X_train, y_train
    )
    return clf

clf_default = default_hyperparameter(X_train, y_train)
print('Train Accuracy:', sklearn.metrics.accuracy_score(
    y_train, clf_default.predict(X_train)))
print('Test Accuracy:', sklearn.metrics.accuracy_score(
    y_test, clf_default.predict(X_test)))

After running this code, we see that the training accuracy is 1.00 (perfect accuracy) and the test accuracy is 0.91. This means that our decision tree is being overfit to our training data. Here is a great opportunity to tune our hyperparameters so as not to overfit!

Single Objective Hyperparameter Tuning

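For this example, we will focus on two of the hyperparameters of a decision tree: min_samples_split and max_features. In this single objective version, we want to find the set of hyperparameters that maximizes accuracy. We will specify a “grid” of possible values over which we will tune: 21 values of min_samples_split (2, then 10 through 200 in steps of 10) and 4 values of max_features (2 through 5). This grid yields 84 possible combinations.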
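As a quick check on that count, the grid can be enumerated with nothing but the standard library. This is only a sketch that mirrors the values passed to the grid search in this post; the variable names are mine.

```python
from itertools import product

# Candidate values mirroring the grid used in this post:
# min_samples_split is 2, then 10 through 200 in steps of 10.
min_samples_split_values = [2] + list(range(10, 210, 10))
max_features_values = [2, 3, 4, 5]

# Every pairing of the two hyperparameters is one grid point.
grid = list(product(max_features_values, min_samples_split_values))

print(len(min_samples_split_values), "x", len(max_features_values), "=", len(grid))
```

This prints `21 x 4 = 84`, matching the number of combinations evaluated below.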

Additionally, we will be evaluating performance using five-fold cross validation. In this technique, we iteratively split our training data so as not to overfit our model to the entire training data set as was done in the previous section. For more information on what cross validation means, Saranya Mandava has a nice blog post about it (Mandava 2018). I also recommend the survey paper by Arlot and Celisse (2010).
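If the idea of “iteratively splitting” is new to you, the following is a simplified sketch of how k-fold cross validation carves up sample indices. It is plain, unshuffled k-fold written from scratch for illustration. When you pass an integer `cv` with a classifier, Scikit-Learn uses a stratified variant, so treat this as the idea rather than the exact procedure used in this post. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) pairs for plain k-fold CV."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=12, n_folds=5))
for i, (train, validation) in enumerate(folds, start=1):
    print(f"fold {i}: validate on {validation}, train on {len(train)} samples")
```

Each sample lands in exactly one validation fold, so every model is scored only on data it was not fit to, and the five scores are averaged.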

This analysis is applied in the following code:

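def single_objective_gridsearch(X_train, y_train):
    parameter_grid = {
        'min_samples_split': np.insert(np.arange(10, 210, 10), 0, 2),
        'max_features': [2, 3, 4, 5]
    }
    # Five-fold cross-validated grid search maximizing accuracy
    # (the GridSearchCV arguments are reconstructed from the surrounding text)
    gs = sklearn.model_selection.GridSearchCV(
        sklearn.tree.DecisionTreeClassifier(random_state=1008),
        parameter_grid,
        scoring='accuracy',
        cv=5
    ).fit(X_train, y_train)
    # Refit a tree on the full training set with the best hyperparameters
    clf = sklearn.tree.DecisionTreeClassifier(
        **gs.best_params_,
        random_state=1008
    ).fit(X_train, y_train)
    return clf, gs

clf_SO, gs_SO = single_objective_gridsearch(X_train, y_train)
print('CV Train Accuracy:', gs_SO.best_score_)
print('Test Accuracy:', sklearn.metrics.accuracy_score(
    y_test, gs_SO.predict(X_test)))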

Running this code yields that our cross-validated training accuracy has dropped to 0.94 (from 1.00) but our accuracy of predicting the test set has increased to 0.94 (from 0.91). This is great news; our model is no longer being overfit to our data!

But let’s return to our discussion about multiple objectives. Maximizing accuracy is like maximizing the “greatest good” which very much falls in line with the philosophy of Jeremy Bentham. What about other objectives like false positive rate or true positive rate (formulae linked here) which consider the minority of people that our model misclassifies? How do we consider those objectives without having to rank or weight them?

Multi-Objective Hyperparameter Tuning

In this multi-objective formulation, we will study the tradeoffs among the accuracy, false positive rate, true positive rate, and area under the receiver operating characteristic curve (AUC) objectives (formulae linked here). So, we start out by computing each of those four objectives for our 84 hyperparameter combinations in our grid. Just as in the single objective case, these objectives are evaluated in a five-fold cross-validated fashion on the training set. The code to do that is provided here:

def fpr(y_true, y_pred):
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()
    obj = fp / (fp + tn)
    return obj

def tpr(y_true, y_pred):
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()
    obj = tp / (tp + fn)
    return obj

def multi_objective_gridsearch(X_train, y_train):
    parameter_grid = {
        'min_samples_split': np.insert(np.arange(10, 210, 10), 0, 2),
        'max_features': [2, 3, 4, 5]
    }
    scoring = {
        'Accuracy': 'accuracy',
        'True Positive Rate': sklearn.metrics.make_scorer(tpr),
        'False Positive Rate': sklearn.metrics.make_scorer(fpr),
        'AUC': 'roc_auc'
    }
    # Reconstructed call: refit=False because no single metric picks a winner
    gs = sklearn.model_selection.GridSearchCV(
        sklearn.tree.DecisionTreeClassifier(random_state=1008),
        parameter_grid,
        scoring=scoring,
        refit=False,
        cv=5
    ).fit(X_train, y_train)
    df = pd.DataFrame(gs.cv_results_['params'])
    df['Mean CV Accuracy'] = gs.cv_results_['mean_test_Accuracy']
    df['Mean CV True Positive Rate'] = \
        gs.cv_results_['mean_test_True Positive Rate']
    df['Mean CV False Positive Rate'] = \
        gs.cv_results_['mean_test_False Positive Rate']
    df['Mean CV AUC'] = gs.cv_results_['mean_test_AUC']
    return df

df_all = multi_objective_gridsearch(X_train, y_train)

In this code you will notice that we defined our own true positive rate and false positive rate functions while the other two objectives are built in to Scikit-Learn. I wanted to show how easy it is to extend Scikit-Learn’s functionality!
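For reference, writing TP, FP, TN, and FN for the counts in the four cells of the confusion matrix, the tpr and fpr functions and the built-in accuracy score correspond to the standard definitions:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{True positive rate} &= \frac{TP}{TP + FN} \\
\text{False positive rate} &= \frac{FP}{FP + TN}
\end{aligned}
```

Accuracy and true positive rate are better when larger, while false positive rate is better when smaller, which is why the objectives have to be oriented consistently before they can be compared.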

How can we visualize the objective performance of our 84 hyperparameter combinations? I’m a big fan of interactive parallel plots which can be easily implemented via the HiPlot package although many similar tools exist in other languages (Raseman, Jacobson, and Kasprzyk 2019). Here is code to create a parallel plot using the HiPlot package:

def parallel_plot(df, color_column, invert_column):
    # Make Unique IDs
    df['Solution ID'] = df.index + 1
    df['Solution ID'] = df['Solution ID'].apply(lambda x: '{0:0>5}'.format(x))
    df['Solution ID'] = 'S'+df['Solution ID'].astype(str)
    # Create Plot
    exp = hip.Experiment.from_dataframe(df)
    exp.parameters_definition[color_column].colormap = 'interpolateViridis'
    # Reconstructed: any other hidden columns were lost from the original
    exp.display_data(hip.Displays.PARALLEL_PLOT).update(
        {
            'hide': [
                'Solution ID'
            ],
            'invert': invert_column
        }
    )
    exp.display_data(hip.Displays.TABLE).update({'hide': ['uid', 'from_uid']})
    return exp

# Reconstructed call: inverting the maximized objectives puts "optimal" at the bottom
# (in a notebook, exp_all.display() renders the plot)
exp_all = parallel_plot(
    df_all,
    color_column='Mean CV Accuracy',
    invert_column=cv_objs_max
)

So, what are we looking at here? Each hyperparameter combination is represented as a single line on this plot (you can see this as you hover over the table on the bottom). We oriented the axes such that down is optimal, meaning a solution that performed best on all objectives would be a straight line across the bottom. However, we don’t see any solutions with such behavior; instead, we see the objectives trading off performance with one another (lines crossing).

But let’s study two solutions in this plot to compare their performance: solution S00065 (max_features: 5, min_samples_split: 10) and solution S00005 (max_features: 2, min_samples_split: 40). I have linked to the filtered state of this webpage with only these two solutions here (note, this will filter solutions on both plots on this page, so I recommend clicking “Restore” on each plot after viewing). We see that solution S00065 does better on every objective than solution S00005. By that reasoning, there would never be a reason to pick solution S00005 if all we cared about were these four objectives. Commonly, we say that solution S00065 “dominates” solution S00005. In order for a solution to “dominate” another, it needs to perform the same or better on all objectives and strictly better on at least one. Continuing this logic, we only really care about the nondominated solutions (which is the converse of dominated). So we apply a nondominated sort to this set and re-plot:

def nondom_sort(df, objs, max_objs=None):
    df_sorting = df.copy()
    # Flip Objectives to Maximize
    if max_objs is not None:
        df_sorting[max_objs] = -1.0 * df_sorting[max_objs]
    # Non-dominated Sorting
    nondom_idx = nds.find_non_dominated(df_sorting[objs].values)
    return df.iloc[nondom_idx].copy()

df_non_dom = nondom_sort(df_all, cv_objs, max_objs=cv_objs_max)
# Reconstructed call, matching the first plot
exp_non_dom = parallel_plot(
    df_non_dom,
    color_column='Mean CV Accuracy',
    invert_column=cv_objs_max
)

We see that our original 84 solutions were filtered down to just 20 non-dominated solutions! I want to be clear that all 20 of these solutions are “optimal” which may be a bit hard to wrap your head around if you are new to multi-objective optimization. Another way to think about it: Imagine you tried every combination of objective weights for these four objectives; you would always get one of these 20 non-dominated hyperparameter solutions. This concept is also called Pareto optimality if you want to read further.
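If the dominance rule still feels abstract, here is a brute-force sketch of it in plain Python. This is not the Pymoo routine used in this post, just the definition written out directly. The `dominates` and `non_dominated` helpers and the four (accuracy, false positive rate) pairs are hypothetical.

```python
def dominates(a, b):
    """True if a dominates b, with every objective oriented to be minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return indices of points that no other point dominates (brute force)."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (accuracy, false positive rate) pairs; accuracy is negated so
# that both columns are minimized, mirroring the sign flip in nondom_sort.
raw = [(0.94, 0.05), (0.92, 0.08), (0.93, 0.03), (0.90, 0.03)]
points = [(-acc, fpr) for acc, fpr in raw]

print(non_dominated(points))
```

This prints `[0, 2]`. The second point loses to the first on both objectives, and the fourth ties the third on false positive rate but has lower accuracy, so only the first and third survive. Neither of those two beats the other on both objectives, which is exactly the kind of tradeoff the non-dominated set preserves.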

We can gain some insights into this problem through our plot of the set of non-dominated hyperparameters above! We can see that accuracy and false positive rate are generally redundant objectives. This means that by maximizing accuracy we are also reducing the number of people we falsely scare by telling them they have cancer when they don’t. We see that accuracy generally conflicts with true positive rate (i.e., you can’t increase performance in one objective without decreasing performance in the other). Recall that this means that by maximizing accuracy our model misses more people who truly have cancer, telling them they don’t have it when they do. What is nice about this multi-objective approach is that we can visually see the extent to which these objectives trade off. We can also pick a solution, like solution S00067, that compromises among all the objectives. There are also analytical methods for selecting a compromise solution which I won’t cover in this post but are outlined in Wang and Rangaiah (2017).

Do these objective preferences translate to the “test” set we omitted at the start of this exercise? For example, does a solution that has good cross-validated accuracy also have good accuracy on the test set? We can check this by ranking the test performance for each objective and comparing the test performance to the cross-validated performance. This is done in the following code:

def get_test_performance(X_train, X_test, y_train, y_test, params):
    # Fit Model with Specified Hyperparameters
    # (cast to int: a float min_samples_split would be read as a fraction)
    clf = sklearn.tree.DecisionTreeClassifier(
        min_samples_split=int(params['min_samples_split']),
        max_features=int(params['max_features']),
        random_state=1008
    ).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Compute Objectives on Test Set
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_test, y_pred).ravel()
    acc = (tp + tn) / (tn + fp + fn + tp)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    auc = sklearn.metrics.roc_auc_score(
        y_test,
        clf.predict_proba(X_test)[:, 1]
    )
    return pd.Series([acc, tpr, fpr, auc], test_objs)

# Non-Dominated Set Test Performance
df_non_dom_test = df_non_dom.apply(
    lambda row: get_test_performance(
        X_train, X_test, y_train, y_test, row
    ),
    axis=1
)
df_non_dom = df_non_dom.join(df_non_dom_test)
# Check if Performance is Preserved by Looking at Sorted Objective Values
for i, j in zip(cv_objs, test_objs):
    print(df_non_dom[[i, j]].sort_values(i, ascending=False))

After running this code, we see that the objective preferences generally translate to the test set (there are a few exceptions due to rounding). Of course, this behavior is expected given that we used a cross-validated approach to get the non-dominated set in the first place, but it’s nice to see the cross-validation is working. This is great news as we can be confident the objective preferences that we spent so much time forming and investigating throughout this process do in fact translate to new observations!
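If you would rather put a number on “generally translate” than eyeball sorted tables, one option is a rank correlation between the cross-validated and test values of an objective. The sketch below implements Spearman’s rank correlation from scratch, assuming no tied values. The accuracies are hypothetical and are not the results from this post.

```python
def ranks(values):
    """Rank values from 1 (largest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical accuracies for four solutions (not results from this post).
cv_accuracy = [0.94, 0.93, 0.91, 0.95]
test_accuracy = [0.93, 0.94, 0.90, 0.95]

rho = spearman(cv_accuracy, test_accuracy)
print(f"Spearman rank correlation: {rho:.2f}")
```

Here two of the four solutions swap places between cross validation and the test set, giving a correlation of 0.80. A value near 1 would indicate that the ordering of solutions is largely preserved. Real results often contain ties, in which case a library implementation that handles them is the safer choice.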


There you have it - a nice way to use a grid search and a non-dominated sort to conduct a simple, yet informative, multi-objective approach to tuning hyperparameters. We demonstrated how to use the default hyperparameters, how to conduct a single-objective grid search, and how to conduct a multi-objective grid search! Along the way we also showed some interactive plotting methods to view our results and make a choice of hyperparameters. Hopefully, I have inspired you to implement a similar approach for your own machine learning problem!

Dr. Joseph Kasprzyk, Dr. Kyri Baker, Dr. Kostas Andreadis, and I actually applied this methodology to the problem of dam hazard classification, another problem where the types of misclassification have different consequences. We optimized over many hyperparameters (and some other parameters of a geospatial model). So, we utilized a multi-objective evolutionary algorithm to solve our multi-objective problem instead of the methods used in this post. The code for that project, a video presentation, and the paper are all available here if you are interested!


