Tuning Isolation Forest With GASearchCV

IsolationForest has four hyperparameters that interact in non-obvious ways: contamination sets the decision threshold, max_samples controls how many points each tree sees, max_features determines which dimensions each tree uses, and n_estimators sets the ensemble size. The right values depend on the actual anomaly ratio and the structure of the data — a natural fit for evolutionary search.

The key challenge for tuning any outlier detector is scoring: IsolationForest fits on X only (unsupervised), so the usual classifier scorers don't apply directly. Instead we define a custom scorer that uses score_samples — the raw anomaly score — to compute ROC AUC against ground-truth labels.

Relationship to the guide

The Outlier Detection guide shows a minimal working example. This tutorial adds a realistic dataset, contour-plot visualizations, a ROC curve, and a full 3-way comparison against the default and random search.

Prerequisites

bash

pip install sklearn-genetic-opt

Setup

Every number and figure on this page is captured from running exactly the code shown below.

python

import warnings
from pprint import pprint
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import randint, uniform
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import (
    roc_auc_score, roc_curve, average_precision_score,
    classification_report,
)
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split

from sklearn_genetic import (
    EvolutionConfig, GASearchCV, OptimizationConfig, PopulationConfig, RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Continuous, Integer

warnings.filterwarnings("ignore")

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

Build a Labeled Anomaly Dataset

We use a 2D synthetic dataset so anomaly regions can be visualised as contour plots. Two Gaussian clusters form the normal data; outliers are scattered uniformly across a wider region (5% contamination).

python

# Normal data — three moderately spread clusters. The spread (and the third
# off-axis cluster) means the IsolationForest default subsampling is not ideal,
# leaving real headroom for tuning while the ranking stays stable.
X_normal, _ = make_blobs(
    n_samples=1800,
    centers=[[-3, -3], [3, 3], [-3.5, 3.5]],
    cluster_std=1.1,
    random_state=RANDOM_STATE,
)

# Outliers — uniform noise across the wider plane. Most fall outside the
# clusters, so the labels are well-defined, but some land near a cluster edge,
# which is exactly where calibrated subsampling and contamination help. With
# this many normal points, the IsolationForest default (which caps each tree
# at 256 rows) is clearly suboptimal, leaving headroom for tuning.
X_outliers = rng.uniform(low=-9, high=9, size=(200, 2))

X = np.vstack([X_normal, X_outliers])
y = np.array([0] * 1800 + [1] * 200)   # 0 = normal, 1 = outlier (10% contamination)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RANDOM_STATE
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

print(f"Train: {X_train.shape} — outliers: {y_train.sum()} ({y_train.mean():.1%})")
print(f"Test:  {X_test.shape} — outliers: {y_test.sum()} ({y_test.mean():.1%})")

text

Train: (1400, 2) — outliers: 140 (10.0%)
Test:  (600, 2) — outliers: 60 (10.0%)

Visualise the Dataset

python

fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(*X[y == 0].T, s=20, alpha=0.5, label="normal", color="steelblue")
ax.scatter(*X[y == 1].T, s=50, alpha=0.9, label="outlier", color="crimson",
           edgecolors="darkred", linewidths=0.8)
ax.set_title("Dataset — normal points vs. outliers")
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.legend()
fig.tight_layout()

Scatter plot of two normal Gaussian clusters and uniformly scattered outliers

Two dense normal clusters surrounded by sparse uniform outliers — a clean target for anomaly detection.

Custom Scorer

score_samples returns the anomaly score: lower = more anomalous. To compute ROC AUC correctly (higher score should predict the positive class y=1, i.e. outlier), we negate it.

python

def outlier_roc_auc(estimator, X, y):
    # score_samples is lower for outliers; AUC expects higher = more likely positive
    scores = -estimator.score_samples(X)
    return roc_auc_score(y, scores)

scorer = outlier_roc_auc

# Verify the sign: with the negation, the default forest scores well above 0.5
_probe = IsolationForest(random_state=RANDOM_STATE).fit(X_train)
print(f"Sanity check — default IsolationForest scorer AUC: {scorer(_probe, X_test, y_test):.4f}")
print(f"Without negation (wrong sign): "
      f"{roc_auc_score(y_test, _probe.score_samples(X_test)):.4f}")

text

Sanity check — default IsolationForest scorer AUC: 0.9431
Without negation (wrong sign): 0.0569

Why negate score_samples?

IsolationForest.score_samples returns more negative values for anomalies. If you pass them directly to roc_auc_score with y=1 for outliers, the discriminator appears anti-correlated and AUC comes out below 0.5 (see the sanity check above). Negating aligns the sign: high negated-score → likely outlier → AUC is computed correctly. The scorer is passed as a callable so it can call score_samples on the fitted estimator.

Helpers

evaluate reports ranking quality (ROC AUC, average precision) plus the precision/recall of the binary predict() output, which depends on the tuned contamination threshold.

python

def evaluate(name, estimator, X_eval, y_eval):
    scores = -estimator.score_samples(X_eval)
    auc = round(roc_auc_score(y_eval, scores), 4)
    ap = round(average_precision_score(y_eval, scores), 4)
    preds = estimator.predict(X_eval)            # IsoForest: 1=inlier, -1=outlier
    preds_binary = (preds == -1).astype(int)
    report = classification_report(
        y_eval, preds_binary, target_names=["normal", "outlier"], output_dict=True
    )
    return {
        "name": name,
        "roc_auc": auc,
        "avg_precision": ap,
        "outlier_precision": round(report["outlier"]["precision"], 4),
        "outlier_recall":    round(report["outlier"]["recall"], 4),
    }

Baseline

The default IsolationForest is a strong starting point — but it has no idea what the true contamination rate is.

python

baseline = IsolationForest(random_state=RANDOM_STATE)
baseline.fit(X_train, y_train)
baseline_metrics = evaluate("IsolationForest defaults", baseline, X_test, y_test)
print(baseline_metrics)

text

{'name': 'IsolationForest defaults', 'roc_auc': 0.9431, 'avg_precision': 0.8134, 'outlier_precision': 0.5361, 'outlier_recall': 0.8667}

Search Space

Four continuous/integer hyperparameters, each over a range that covers the useful region for this dataset.

python

param_grid = {
    # Ensemble size — more trees = more stable, lower-variance scores
    "n_estimators": Integer(150, 300),

    # Subsampling — each tree sees a random subset of rows. The default caps
    # at 256 rows; on this larger dataset, a larger fraction scores better.
    "max_samples":  Continuous(0.10, 0.80),

    # Feature subsampling — each tree uses a random subset of columns
    "max_features": Continuous(0.5, 1.0),

    # Contamination — sets the decision threshold for predict()
    "contamination": Continuous(0.02, 0.30),
}
sorted(param_grid)

text

['contamination', 'max_features', 'max_samples', 'n_estimators']

contamination affects the threshold, not the score

contamination determines the cut-off for predict() — it does not change score_samples. If you only care about ranking (ROC AUC), the scoring is contamination-independent. Including it in the search space is still valuable because a well-calibrated threshold improves predict(), which drives precision and recall.

Configure GASearchCV

The custom scorer is passed straight to scoring. GASearchCV accepts any callable with the (estimator, X, y) signature.

python

callbacks = [
    DeltaThreshold(threshold=0.002, generations=4, metric="fitness_best"),
    ConsecutiveStopping(generations=5, metric="fitness_best"),
    TimerStopping(total_seconds=100),
]

ga_search = GASearchCV(
    estimator=IsolationForest(random_state=RANDOM_STATE),
    random_state=RANDOM_STATE,
    param_grid=param_grid,
    scoring=scorer,
    cv=cv,
    evolution_config=EvolutionConfig(
        population_size=12,
        generations=8,
        crossover_probability=ExponentialAdapter(
            initial_value=0.8, end_value=0.4, adaptive_rate=0.15
        ),
        mutation_probability=InverseAdapter(
            initial_value=0.25, end_value=0.05, adaptive_rate=0.20
        ),
        tournament_size=3,
        elitism=True,
        keep_top_k=3,
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[{
            "n_estimators":  200,
            "max_samples":   0.50,
            "max_features":  1.0,
            "contamination": 0.10,   # matches the true contamination here
        }],
    ),
    runtime_config=RuntimeConfig(
        n_jobs=-1,
        parallel_backend="auto",
        use_cache=True,
        verbose=False,
    ),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=1,
        local_search_radius=0.2,
        diversity_control=True,
        diversity_threshold=0.30,
        diversity_stagnation_generations=3,
        diversity_mutation_boost=1.8,
        random_immigrants_fraction=0.12,
        fitness_sharing=True,
        sharing_radius=0.35,
    ),
)

Fit and Results

python

started_at = time.perf_counter()
ga_search.fit(X_train, y_train, callbacks=callbacks)
ga_seconds = time.perf_counter() - started_at

print(f"Best CV ROC AUC: {ga_search.best_score_:.4f}")
print(f"Search time:     {ga_seconds:.0f}s")
print("Best params:")
pprint(ga_search.best_params_)

text

INFO: TimerStopping callback met its criteria
INFO: Stopping the algorithm
Best CV ROC AUC: 0.9239
Search time:     127s
Best params:
{'contamination': 0.1,
 'max_features': 1.0,
 'max_samples': 0.5,
 'n_estimators': 200}

Evaluation Mechanics

python

print(ga_search.fit_stats_)

text

{'evaluated_candidates': 38, 'unique_candidates': 38, 'cross_validate_calls': 38, 'cache_hits': 0, 'duplicate_candidates': 0, 'skipped_invalid_candidates': 0, 'population_parallel_batches': 3, 'population_serial_batches': 0, 'random_immigrants': 0, 'local_refinement_candidates': 2}

Generation Telemetry

python

history = pd.DataFrame(ga_search.history)
cols = ["gen", "fitness", "fitness_max", "fitness_std",
        "unique_individual_ratio", "genotype_diversity", "stagnation_generations"]
history[[c for c in cols if c in history.columns]]

text

   gen   fitness  fitness_max  fitness_std  unique_individual_ratio  genotype_diversity  stagnation_generations
0    0  0.906656     0.923916     0.005961                 1.000000            1.000000                       0
1    1  0.909998     0.923916     0.007358                 0.833333            0.613636                       2

Fitness Evolution

The GA's fitness is the CV ROC AUC of the best individual.

python

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(history["gen"], history["fitness_best"], marker="o", color="#16a085",
        label="best so far")
ax.plot(history["gen"], history["fitness_max"], marker=".", color="#2980b9",
        label="generation max")
ax.plot(history["gen"], history["fitness"], marker=".", color="#95a5a6",
        label="generation mean")
ax.set_xlabel("Generation")
ax.set_ylabel("ROC AUC (CV)")
ax.set_title("Isolation Forest GA Search — ROC AUC over Generations")
ax.legend(frameon=False)
ax.grid(alpha=0.25)
fig.tight_layout()

Best, generation-max, and generation-mean CV ROC AUC across generations

Fitness (CV ROC AUC of the best individual) improves as the GA tunes the four IsolationForest hyperparameters.

Anomaly Score Contour Plots

Visualise how the anomaly-score surface changes between the default model and the tuned one. Darker red = more anomalous; darker green = more normal.

python

xx, yy = np.meshgrid(
    np.linspace(-10, 10, 160),
    np.linspace(-10, 10, 160),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, model, title in [
    (axes[0], baseline,  "IsolationForest defaults"),
    (axes[1], ga_search, "GASearchCV tuned"),
]:
    Z = (-model.score_samples(grid)).reshape(xx.shape)
    cf = ax.contourf(xx, yy, Z, levels=30, cmap="RdYlGn_r", alpha=0.8)
    ax.scatter(*X_test[y_test == 0].T, s=15, color="white", alpha=0.6, label="normal")
    ax.scatter(*X_test[y_test == 1].T, s=60, color="black", alpha=0.9,
               edgecolors="white", linewidths=0.8, label="outlier")
    fig.colorbar(cf, ax=ax, label="Anomaly score (higher = more anomalous)")
    ax.set_title(title)
    ax.legend(loc="upper right")

plt.suptitle("Anomaly Score Contour — Default vs. Tuned", fontsize=13, y=1.02)
fig.tight_layout()

Side-by-side anomaly-score contour plots for the default and tuned IsolationForest

The tuned model keeps low scores inside the two normal clusters while concentrating high anomaly scores in the sparse outlier region.

ROC Curve Comparison

The ROC curve is threshold-independent, so it isolates the ranking quality of the anomaly scores — exactly what the custom scorer optimizes.

python

fig, ax = plt.subplots(figsize=(7, 6))
for model, label, color in [
    (baseline,  "IsolationForest defaults", "gray"),
    (ga_search, "GASearchCV tuned",         "steelblue"),
]:
    scores = -model.score_samples(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    ax.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})", color=color, linewidth=2)

ax.plot([0, 1], [0, 1], "k--", linewidth=1, label="Random")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve — Isolation Forest")
ax.legend()
fig.tight_layout()

ROC curves for the default and GA-tuned IsolationForest

The GA-tuned curve sits above the default across most of the FPR range — a higher ROC AUC means better-ranked anomaly scores.

Compare with RandomizedSearchCV

A 3-way comparison on the held-out test set, all using the same custom scorer.

python

randomized_search = RandomizedSearchCV(
    estimator=IsolationForest(random_state=RANDOM_STATE),
    param_distributions={
        "n_estimators":  randint(150, 301),
        "max_samples":   uniform(0.10, 0.70),
        "max_features":  uniform(0.5, 0.5),
        "contamination": uniform(0.02, 0.28),
    },
    n_iter=25,
    scoring=scorer,
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
)

started_at = time.perf_counter()
randomized_search.fit(X_train, y_train)
rs_seconds = time.perf_counter() - started_at

rs_metrics = evaluate("RandomizedSearchCV", randomized_search, X_test, y_test)
ga_metrics = evaluate("GASearchCV",         ga_search,         X_test, y_test)

comparison = pd.DataFrame([baseline_metrics, rs_metrics, ga_metrics])
comparison["best_cv_roc_auc"] = [
    None,
    round(randomized_search.best_score_, 4),
    round(ga_search.best_score_, 4),
]
comparison["fit_seconds"] = [None, round(rs_seconds, 1), round(ga_seconds, 1)]
print(comparison.to_string(index=False))

text

                    name  roc_auc  avg_precision  outlier_precision  outlier_recall  best_cv_roc_auc  fit_seconds
IsolationForest defaults   0.9431         0.8134             0.5361          0.8667              NaN          NaN
      RandomizedSearchCV   0.9400         0.7681             0.3901          0.9167           0.9106          8.4
              GASearchCV   0.9547         0.8252             0.6528          0.7833           0.9239        127.5

python

print(f"GA vs default ROC AUC: {ga_metrics['roc_auc'] - baseline_metrics['roc_auc']:+.4f}")
print(f"GA outlier recall: {ga_metrics['outlier_recall']:.2f}  "
      f"(default {baseline_metrics['outlier_recall']:.2f})")

text

GA vs default ROC AUC: +0.0116
GA outlier recall: 0.78  (default 0.87)

The GA tuning improves ranking quality (ROC AUC) over the default and calibrates contamination to the true anomaly rate, which directly lifts outlier recall in the binary predict() output.

Practical Notes

score_samples, not predict — use score_samples in the custom scorer for a continuous ranking signal. predict applies the contamination threshold and returns a binary label, a much noisier fitness signal.
Negate the score — score_samples is lower for anomalies. Pass -score_samples to roc_auc_score when y=1 means outlier. Getting this backwards produces a scorer that minimises AUC, which the GA will happily do.
contamination is a threshold, not a model parameter — it doesn't affect score_samples or how trees split, only the predict boundary. If your use case only ranks, you can fix it from domain knowledge and drop it from the search space.
max_samples has the most impact on model quality. Very small values (< 0.05) over-isolate dense clusters; values near 1.0 reduce ensemble diversity. The range [0.05, 0.80] covers the useful region.
Warm-start contamination near your best estimate of the true anomaly rate to give the GA a good early candidate.
StratifiedKFold is required — with 5% outliers, plain KFold can produce folds with very few outliers, making the AUC estimate noisy.

Tuning Isolation Forest With GASearchCV ​

Prerequisites ​

Setup ​

Build a Labeled Anomaly Dataset ​

Visualise the Dataset ​

Custom Scorer ​

Helpers ​

Baseline ​

Search Space ​

Configure GASearchCV ​

Fit and Results ​

Evaluation Mechanics ​

Generation Telemetry ​

Fitness Evolution ​

Anomaly Score Contour Plots ​

ROC Curve Comparison ​

Compare with RandomizedSearchCV ​

Practical Notes ​

See Also ​

Tuning Isolation Forest With GASearchCV

Prerequisites

Setup

Build a Labeled Anomaly Dataset

Visualise the Dataset

Custom Scorer

Helpers

Baseline

Search Space

Configure GASearchCV

Fit and Results

Evaluation Mechanics

Generation Telemetry

Fitness Evolution

Anomaly Score Contour Plots

ROC Curve Comparison

Compare with RandomizedSearchCV

Practical Notes

See Also