Skip to content

Multi-Metric Search on Imbalanced Data

Multi-metric search shines when your metrics disagree. On a balanced, easy dataset, accuracy, balanced accuracy, and F1 all crown the same candidate and there is nothing to choose between them. So we use a deliberately imbalanced problem — 90% of one class, 10% of the other — where a model can look great on accuracy while quietly ignoring the minority class. Here refit is a real decision, and the per-metric cv_results_ actually rank candidates differently.

Setup

We build a 2,000-sample binary problem with a 90/10 class split and a bit of label noise. The majority class is so dominant that a model predicting "always majority" already scores 90% accuracy — which is exactly why accuracy alone is misleading here.

python
import warnings
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn_genetic import (
    EvolutionConfig, GASearchCV, OptimizationConfig, PopulationConfig, RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.plots import plot_candidate_rankings
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer

warnings.filterwarnings("ignore")

RANDOM_STATE = 42

X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.9, 0.1],
    flip_y=0.03,
    random_state=RANDOM_STATE,
)
X = pd.DataFrame(X, columns=[f"f{i:02d}" for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RANDOM_STATE
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

counts = np.bincount(y)
print(f"class balance: {counts[0]} majority / {counts[1]} minority "
      f"({counts[1] / counts.sum():.0%} minority)")
print(f"train={X_train.shape}  test={X_test.shape}")
text
class balance: 1776 majority / 224 minority (11% minority)
train=(1400, 20)  test=(600, 20)

Define Multiple Metrics

A multi-metric search receives a dictionary of scorers. On this dataset the three metrics measure very different things:

  • accuracy — fraction correct; flattered by the dominant majority class.
  • balanced_accuracy — average recall across classes; punishes ignoring the minority.
  • f1 — harmonic mean of precision and recall on the minority class.

The refit parameter decides which metric chooses best_params_ and refits best_estimator_. We refit on balanced_accuracy so the final model is forced to take the minority class seriously.

python
scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1": make_scorer(f1_score),  # minority (positive) class F1
}
sorted(scoring)
text
['accuracy', 'balanced_accuracy', 'f1']

Configure GASearchCV

We tune a scaled LogisticRegression. The key knob for imbalance is class_weight: leaving it None chases accuracy, while "balanced" reweights the minority class — so different candidates will favor different metrics, exactly the tension we want to expose.

python
model = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(solver="saga", max_iter=1500, random_state=RANDOM_STATE)),
])

param_grid = {
    "logistic__C": Continuous(1e-3, 30.0, distribution="log-uniform"),
    "logistic__l1_ratio": Continuous(0.0, 1.0),
    "logistic__class_weight": Categorical([None, "balanced"]),
    "logistic__max_iter": Integer(1200, 1800),
}

search = GASearchCV(
    estimator=model,
    random_state=RANDOM_STATE,
    param_grid=param_grid,
    scoring=scoring,
    refit="balanced_accuracy",   # drives best_params_ and best_estimator_
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=12,
        generations=10,
        crossover_probability=ExponentialAdapter(initial_value=0.8, end_value=0.4, adaptive_rate=0.15),
        mutation_probability=InverseAdapter(initial_value=0.25, end_value=0.08, adaptive_rate=0.25),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[{
            "logistic__C": 1.0,
            "logistic__l1_ratio": 0.0,
            "logistic__class_weight": None,
            "logistic__max_iter": 1300,
        }],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True, verbose=False),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.20,
        diversity_control=True,
        diversity_threshold=0.30,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.10,
        fitness_sharing=True,
        sharing_radius=0.40,
    ),
)

callbacks = [
    DeltaThreshold(threshold=0.001, generations=5, metric="fitness_best"),
    ConsecutiveStopping(generations=7, metric="fitness_best"),
    TimerStopping(total_seconds=90),
]

search.fit(X_train, y_train, callbacks=callbacks)
print("fitted:", search.refit_metric)
text
INFO: DeltaThreshold callback met its criteria
INFO: Stopping the algorithm
fitted: balanced_accuracy

Best Parameters and Test Metrics

Because refit="balanced_accuracy", best_params_ and best_estimator_ are selected by the CV rank of that metric.

python
print("Refit metric:", search.refit_metric)
print("Best balanced-accuracy CV score:", round(search.best_score_, 4))
print("Best params:")
pprint(search.best_params_)

predictions = search.predict(X_test)
test_metrics = {
    "accuracy": round(accuracy_score(y_test, predictions), 4),
    "balanced_accuracy": round(balanced_accuracy_score(y_test, predictions), 4),
    "f1": round(f1_score(y_test, predictions), 4),
}
test_metrics
text
Refit metric: balanced_accuracy
Best balanced-accuracy CV score: 0.7046
Best params:
{'logistic__C': 0.040979535386389716,
 'logistic__class_weight': 'balanced',
 'logistic__l1_ratio': 0.9046094635389027,
 'logistic__max_iter': 1799}
{'accuracy': 0.73, 'balanced_accuracy': 0.698, 'f1': 0.352}

Explore Multi-Metric cv_results_

For multi-metric searches, cv_results_ contains one set of columns per metric. The point of this page is visible right here: sorting by each metric's rank surfaces a different top candidate.

python
results = pd.DataFrame(search.cv_results_)
metric_columns = [
    "mean_test_accuracy", "rank_test_accuracy",
    "mean_test_balanced_accuracy", "rank_test_balanced_accuracy",
    "mean_test_f1", "rank_test_f1",
]
param_columns = ["param_logistic__C", "param_logistic__class_weight"]

results[metric_columns + param_columns].sort_values("rank_test_balanced_accuracy").head()
text
     mean_test_accuracy  rank_test_accuracy  mean_test_balanced_accuracy  rank_test_balanced_accuracy  mean_test_f1  rank_test_f1  param_logistic__C param_logistic__class_weight
97             0.712123                  60                     0.704607                            1      0.351874             1           0.040980                     balanced
33             0.707845                  66                     0.702020                            2      0.347684             6           2.735191                     balanced
109            0.707845                  66                     0.702020                            2      0.347684             6           2.735191                     balanced
58             0.707130                  69                     0.701617                            4      0.347158             8          11.585019                     balanced
43             0.711408                  62                     0.701400                            5      0.349366             2           0.044433                     balanced

The Metrics Disagree

The same cv_results_ can point to different winners. Pulling the best row for each metric — without rerunning the search — shows the tradeoff explicitly: accuracy tends to prefer the unweighted model, while balanced accuracy and F1 reward the candidate that pays attention to the minority class.

python
best_rows = []
for metric_name in ["accuracy", "balanced_accuracy", "f1"]:
    row = results.sort_values(f"rank_test_{metric_name}").iloc[0]
    best_rows.append({
        "winning_metric": metric_name,
        "candidate_index": int(row.name),
        "accuracy": round(row["mean_test_accuracy"], 4),
        "balanced_accuracy": round(row["mean_test_balanced_accuracy"], 4),
        "f1": round(row["mean_test_f1"], 4),
        "class_weight": row["param_logistic__class_weight"],
        "C": round(float(row["param_logistic__C"]), 3),
    })

pd.DataFrame(best_rows)
text
      winning_metric  candidate_index  accuracy  balanced_accuracy      f1 class_weight      C
0           accuracy               75    0.8957             0.5630  0.2215         None  1.650
1  balanced_accuracy               97    0.7121             0.7046  0.3519     balanced  0.041
2                 f1               97    0.7121             0.7046  0.3519     balanced  0.041
python
winners = {
    m: int(results.sort_values(f"rank_test_{m}").iloc[0].name)
    for m in ["accuracy", "balanced_accuracy", "f1"]
}
distinct = len(set(winners.values()))
print("top candidate index per metric:", winners)
print(f"{distinct} distinct candidates win across the 3 metrics "
      f"-> the metrics disagree." if distinct > 1
      else "metrics agreed on a single candidate.")
text
top candidate index per metric: {'accuracy': 75, 'balanced_accuracy': 97, 'f1': 97}
2 distinct candidates win across the 3 metrics -> the metrics disagree.

For advanced users the useful question is not only "which candidate won?", but whether different metrics prefer the same region. Plotting the top candidates per metric makes those tradeoffs visible without rerunning the search.

python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for axis, metric in zip(axes, ["accuracy", "balanced_accuracy", "f1"]):
    plot_candidate_rankings(
        search,
        top_k=6,
        metric=metric,
        label_params=["logistic__C", "logistic__class_weight"],
        ax=axis,
        title=metric,
    )
fig.suptitle("Top candidates by metric — each metric prefers a different ordering")
fig.tight_layout()

Three side-by-side candidate-ranking plots, one per metric, showing different orderings

Each subplot ranks candidates by one metric; the orderings differ, so the refit choice matters.

Optimizer Telemetry

With multi-metric scoring the GA still optimizes a single scalar fitness — the selected refit metric. Telemetry explains how the optimizer moved through the space while optimizing balanced accuracy.

python
print(search.fit_stats_)
text
{'evaluated_candidates': 110, 'unique_candidates': 110, 'cross_validate_calls': 110, 'cache_hits': 0, 'duplicate_candidates': 0, 'skipped_invalid_candidates': 0, 'population_parallel_batches': 6, 'population_serial_batches': 0, 'random_immigrants': 0, 'local_refinement_candidates': 2}
python
history = pd.DataFrame(search.history)
cols = ["gen", "fitness", "fitness_max", "fitness_std",
        "unique_individual_ratio", "genotype_diversity", "stagnation_generations"]
history[[c for c in cols if c in history.columns]].tail()
text
   gen   fitness  fitness_max  fitness_std  unique_individual_ratio  genotype_diversity  stagnation_generations
0    0  0.591005     0.701215     0.078317                 1.000000            0.727273                       0
1    1  0.654288     0.701215     0.078101                 0.750000            0.431818                       1
2    2  0.578690     0.701400     0.073626                 0.666667            0.454545                       0
3    3  0.676753     0.701400     0.051773                 0.583333            0.431818                       1
4    4  0.644290     0.702020     0.066832                 0.583333            0.340909                       0

Practical Notes

  • Set refit to the metric that should define the final model before fitting; on imbalanced data, accuracy is rarely the right choice.
  • best_score_, best_params_, and best_estimator_ follow the refit metric, not every metric at once.
  • Use cv_results_ to inspect tradeoffs between metrics after fitting — when the ranks disagree, you are seeing a genuine modeling decision.
  • Use fit_stats_ and history to understand optimizer cost, diversity, stagnation, and convergence.

See Also

Released under the MIT License.