Development version
This is the latest (dev) documentation. It may contain unreleased features or breaking changes. For the stable release, use version 0.13.
Benchmarks
This page compares GASearchCV with two widely used baselines — Optuna (Tree-structured Parzen Estimator) and scikit-learn's RandomizedSearchCV — using the experiment design that the hyperparameter-optimization community already agrees on.
The goal is comparability, not a leaderboard win. We reuse the datasets, estimators, and search spaces from a published benchmark suite so the numbers mean the same thing they mean elsewhere, then we add sklearn-genetic-opt under the same evaluation budget.
Why Bayesmark?
Frameworks like Optuna do not benchmark on ad-hoc problems. The Optuna team uses tools such as kurobako and benchmark suites such as Bayesmark (originally from the NeurIPS 2020 Black-Box Optimization Challenge). Bayesmark's recipe is deliberately simple and framework-agnostic:
- A handful of small, standard scikit-learn datasets.
- A handful of standard scikit-learn estimators, each with a fixed search space.
- Every optimizer gets the same number of evaluations.
- Each task is repeated over several seeds, and the best cross-validation score is reported.
Because the datasets and search spaces are fixed and public, any optimizer — Optuna, random search, or a genetic algorithm — can be dropped into the same harness and compared fairly. That is exactly what the benchmark script in this repository does.
Faithful reproduction
The per-model search spaces below are copied verbatim from Bayesmark's API_CONFIG. The only adaptation: Bayesmark searches some bounded parameters in logit space, a warp that none of the three optimizers expose directly, so those parameters are searched on their natural bounded range for every optimizer. This keeps the comparison even.
What We Compare
| Optimizer | Library | Strategy |
|---|---|---|
GASearchCV | sklearn-genetic-opt | Evolutionary search (DEAP) with the recommended "best" setup |
| Optuna | optuna | Bayesian optimization with the TPE sampler |
RandomizedSearchCV | scikit-learn | Random sampling — the standard strong baseline |
Datasets
From Bayesmark's DATA_LOADERS (we drop boston, which modern scikit-learn removed):
| Dataset | Task | Loader |
|---|---|---|
| iris | classification | load_iris |
| wine | classification | load_wine |
| breast | classification | load_breast_cancer |
| digits | classification | load_digits |
| diabetes | regression | load_diabetes |
Classification is scored by accuracy; regression by mean squared error (optimized as neg_mean_squared_error).
Models and Search Spaces
Copied from Bayesmark's API_CONFIG. log parameters use a log-uniform distribution; linear/logit parameters use a uniform distribution on the listed range.
| Model | Parameter | Type | Scale | Range |
|---|---|---|---|---|
| knn | n_neighbors | int | linear | 1 – 25 |
p | int | linear | 1 – 4 | |
| svm | C | real | log | 1.0 – 1e3 |
gamma | real | log | 1e-4 – 1e-3 | |
tol | real | log | 1e-5 – 1e-1 | |
| dt | max_depth | int | linear | 1 – 15 |
min_samples_split | real | logit | 0.01 – 0.99 | |
min_samples_leaf | real | logit | 0.01 – 0.49 | |
min_weight_fraction_leaf | real | logit | 0.01 – 0.49 | |
max_features | real | logit | 0.01 – 0.99 | |
min_impurity_decrease | real | linear | 0.0 – 0.5 | |
| rf | (same six as dt) | |||
| ada | n_estimators | int | linear | 10 – 100 |
learning_rate | real | log | 1e-4 – 1e1 | |
| lasso / linear | C (clf) / alpha (reg) | real | log | 1e-2 – 1e2 |
intercept_scaling (clf) | real | log | 1e-2 – 1e2 |
Scaled models (knn, svm, mlp, lasso, linear) are wrapped in a StandardScaler pipeline.
Equal Budget, Fair Comparison
Every optimizer gets the same function-evaluation budget (the number of candidate hyperparameter configurations it is allowed to score with cross-validation):
RandomizedSearchCV→n_iter = budget.- Optuna →
n_trials = budget. GASearchCV→population_size × (generations + 1) ≈ budget. Genetic search additionally caches and deduplicates repeated candidates, so it often evaluates fewer unique configurations than the nominal budget — a built-in efficiency, reported asevaluated_candidates.
The GASearchCV configuration is the recommended "best" setup: smart population initialization, elitism, local search on the top candidates, diversity control, random immigrants, and fitness sharing.
Running It Yourself
Optuna and SciPy are optional benchmarking dependencies — regular users of sklearn-genetic-opt never need them. Install the extra only when you want to run the comparison:
pip install sklearn-genetic-opt[benchmark]Then run the script from the repository root:
# Fast smoke run (a few datasets/models, small budget)
python benchmarks/benchmark_bayesmark.py --quick
# Full reproducible run with a JSON report
python benchmarks/benchmark_bayesmark.py \
--datasets iris wine breast diabetes \
--models knn svm dt rf ada \
--optimizers gasearch optuna randomized \
--budget 64 --seeds 3 \
--output-json benchmarks/bayesmark.json
# Pick a single task to inspect closely
python benchmarks/benchmark_bayesmark.py --datasets wine --models svm --budget 80If Optuna is not installed, the script prints a note and simply runs the remaining optimizers.
Results
The table below reports the mean ± standard deviation of the best cross-validation score across seeds. For classification the metric is accuracy (higher is better); for regression it is MSE (lower is better). The winner column marks the optimizer with the best mean on the optimized objective.
The numbers below come from benchmarks/benchmark_bayesmark.py with budget = 48 evaluations, 3 seeds, 3-fold CV (the command in Running It Yourself, restricted to the four model families and three datasets shown). Run it on your own hardware to reproduce — scores will vary slightly with library versions and seeds.
| dataset | model | metric | gasearch | optuna | randomized | winner |
|---|---|---|---|---|---|---|
| wine | knn | accuracy ↑ | 0.9832 ± 0.0046 | 0.9832 ± 0.0046 | 0.9813 ± 0.0027 | gasearch |
| wine | svm | accuracy ↑ | 0.9869 ± 0.0053 | 0.9851 ± 0.0026 | 0.9850 ± 0.0027 | gasearch |
| wine | dt | accuracy ↑ | 0.8520 ± 0.0117 | 0.8651 ± 0.0091 | 0.8463 ± 0.0312 | optuna |
| wine | rf | accuracy ↑ | 0.9645 ± 0.0095 | 0.9719 ± 0.0000 | 0.9532 ± 0.0096 | optuna |
| breast | knn | accuracy ↑ | 0.9684 ± 0.0014 | 0.9695 ± 0.0008 | 0.9678 ± 0.0008 | optuna |
| breast | svm | accuracy ↑ | 0.9789 ± 0.0000 | 0.9783 ± 0.0008 | 0.9783 ± 0.0008 | gasearch |
| breast | dt | accuracy ↑ | 0.9133 ± 0.0030 | 0.9151 ± 0.0017 | 0.9133 ± 0.0030 | optuna |
| breast | rf | accuracy ↑ | 0.9256 ± 0.0054 | 0.9285 ± 0.0072 | 0.9215 ± 0.0060 | optuna |
| diabetes | knn | MSE ↓ | 3223.40 ± 114.3 | 3229.59 ± 107.3 | 3223.47 ± 114.4 | gasearch |
| diabetes | svm | MSE ↓ | 3035.66 ± 28.6 | 3034.17 ± 30.1 | 3053.55 ± 33.9 | optuna |
| diabetes | dt | MSE ↓ | 3994.28 ± 183.6 | 3837.88 ± 106.5 | 3852.95 ± 159.9 | optuna |
| diabetes | rf | MSE ↓ | 3553.56 ± 219.1 | 3320.70 ± 52.5 | 3498.15 ± 188.9 | optuna |
↑ higher is better, ↓ lower is better. Bold marks the best mean per row.
At this small budget the three optimizers are within a hair of each other on most tasks — frequently inside one standard deviation. Two patterns stand out:
GASearchCVbeats or ties random search on 10 of 12 tasks, and wins outright on the smoother spaces (knn,svm).- Optuna's TPE has the edge on the rugged tree spaces (
dt,rf), where the six logit-warped split parameters create a bumpy objective that a Bayesian surrogate models well. With a larger budget and the GA's optimizer controls given more generations to work, that gap narrows — try--budget 96.
Reading the numbers
On small, smooth datasets like iris and wine, all three optimizers cluster near the same score — the search space is easy and there is little room to separate them. The differences that matter show up as problems get harder: larger or mixed spaces, more rugged objectives, and tighter budgets, where smarter exploration (TPE or evolutionary search) pulls ahead of random sampling. Always read the spread across seeds alongside the mean.
Does more budget close the gap?
Genetic algorithms are population-based: they need enough generations for diversity, fitness sharing, and local refinement to actually do their job. At a tiny budget that machinery barely gets going. Re-running the same tasks at budget 96 (the command above with --budget 96) shows where extra evaluations change the verdict:
| task | metric | gasearch 48 → 96 | optuna 48 → 96 | verdict change |
|---|---|---|---|---|
| wine / dt | accuracy ↑ | 0.8520 → 0.8690 | 0.8651 → 0.8670 | optuna → gasearch wins |
| diabetes / rf | MSE ↓ | 3553.6 → 3295.8 | 3320.7 → 3259.6 | gap 233 → 36 |
| breast / rf | accuracy ↑ | 0.9256 → 0.9303 | 0.9285 → 0.9391 | still optuna |
| wine / rf | accuracy ↑ | 0.9645 → 0.9664 | 0.9719 → 0.9776 | still optuna |
The honest read: more budget helps GASearchCV most exactly where you'd expect — rugged, multimodal objectives. On wine/dt it overtakes Optuna outright once it has room to exploit; on diabetes/rf the gap collapses from ~233 to ~36 MSE. But extra budget is not a magic wand: TPE also improves with more trials, so on the random-forest classification tasks Optuna keeps its edge. Evolutionary search closes gaps where the landscape rewards exploration; it doesn't manufacture a win where a Bayesian surrogate is simply the better-suited tool.
Mixed and Conditional Spaces
Bayesmark's spaces are almost entirely numeric — the home turf of Bayesian optimizers. Real hyperparameter tuning is often mixed and conditional: categorical switches that turn whole regions of the space on or off. The --suite mixed benchmark builds exactly that kind of space, where sklearn-genetic-opt's native categorical handling and population-based crossover have more to work with.
python benchmarks/benchmark_bayesmark.py --suite mixed --budget 96 --seeds 3Two estimators, both with categorical and conditionally-relevant parameters:
| Model | Categorical / conditional parameters |
|---|---|
svc_mixed (SVC) | kernel ∈ {linear, rbf, poly, sigmoid} (makes degree/coef0/gamma relevant only for some kernels), class_weight, shrinking |
histgb_mixed (HistGradientBoosting) | max_depth ∈ {None, 3, 5, 7, 9} (unbounded vs capped), interaction_cst ∈ {None, "no_interactions"}, plus numeric leaf/learning-rate parameters |
Run with budget = 64, 3 seeds, 3-fold CV (diabetes/svc_mixed is skipped — SVC is a classifier here):
| dataset | model | metric | gasearch | optuna | randomized | winner |
|---|---|---|---|---|---|---|
| wine | svc_mixed | accuracy ↑ | 0.9851 ± 0.0026 | 0.9868 ± 0.0053 | 0.9868 ± 0.0053 | optuna |
| wine | histgb_mixed | accuracy ↑ | 0.9831 ± 0.0080 | 0.9832 ± 0.0046 | 0.9794 ± 0.0054 | optuna |
| breast | svc_mixed | accuracy ↑ | 0.9801 ± 0.0022 | 0.9801 ± 0.0017 | 0.9783 ± 0.0008 | tie |
| breast | histgb_mixed | accuracy ↑ | 0.9713 ± 0.0054 | 0.9719 ± 0.0014 | 0.9707 ± 0.0051 | optuna |
| diabetes | histgb_mixed | MSE ↓ | 3248.9 ± 10.6 | 3212.7 ± 13.1 | 3231.2 ± 18.6 | optuna |
The honest result: a categorical/conditional space is not, by itself, enough to hand the win to evolutionary search. At this budget the three optimizers are again within one standard deviation on every task; GASearchCV beats random search on 3 of 5 and ties Optuna on breast/svc_mixed, but Optuna's TPE — which samples categorical parameters natively — stays competitive-to-best.
Two things have to line up for a population-based search to pull clearly ahead, and neither holds here: (1) enough budget for generations of crossover and niching to matter (these runs are still small), and (2) datasets that don't saturate (wine and breast top out near 0.98 no matter what you do). The regime where evolutionary search has a structural advantage is larger combined algorithm-selection-and-tuning (CASH) problems — pick the estimator and its parameters across many categorical branches, on data with real headroom. That is exactly the next benchmark, below.
CASH: Algorithm Selection + Tuning
This is the regime the previous section pointed to — and the one where evolutionary search has a structural reason to do well. In a CASH (Combined Algorithm Selection and Hyperparameter optimization) problem the optimizer must choose which estimator family to use and tune that family's hyperparameters at the same time. The space is deeply conditional: svm_gamma only matters when the SVM is selected, rf_max_depth only for the random forest, and so on.
The benchmark (benchmarks/benchmark_cash.py) searches across five families — RBF SVM, random forest, histogram gradient boosting, logistic regression, and k-NN — a 15-dimensional conditional space:
python benchmarks/benchmark_cash.py --dataset synth --budget 100 --seeds 3It is deliberately a fair fight:
- Optuna uses its native, define-by-run conditional API — it suggests the family first, then only that family's parameters. Conditional spaces are a headline Optuna strength, so it is not handicapped.
GASearchCVandRandomizedSearchCVget the same problem through a flatCASHClassifiermeta-estimator that exposes every family's parameters and routes to the selected one atfit. Both carry the flat-encoding tax of inactive genes.
The dataset has real headroom — a 1500×25 make_classification with a non-linear, multi-cluster boundary that tops out around 0.86 (it does not saturate near 1.0), so the estimator family genuinely matters and there is room to separate.
Equal-compute fairness matters here. GASearchCV's local search makes it evaluate more candidates than the nominal budget (population × generations plus refinements), so the script measures GASearchCV's realized evaluation count and gives the baselines the same count. Run with budget = 100 → 184 realized evaluations, 3 seeds, 3-fold CV:
| optimizer | mean accuracy ± std | best | mean evals | mean wall time |
|---|---|---|---|---|
| optuna | 0.8636 ± 0.0050 | 0.8707 | 184 | 59 s |
| randomized | 0.8591 ± 0.0055 | 0.8667 | 184 | 27 s |
| gasearch | 0.8582 ± 0.0087 | 0.8693 | 184 | 205 s |
This result did not go our way — and that's the point of an honest benchmark
We expected a CASH problem to favor evolutionary search. It did not. At equal compute, GASearchCV finished last — marginally behind even random search, clearly behind Optuna, and 3–8× slower in wall time.
The reason is structural, and it is worth understanding:
- Optuna's conditional API is the right tool for CASH. It models only the active parameters of the chosen family. Combined algorithm selection is a headline strength of define-by-run Bayesian optimization.
- A flat encoding fights the genetic operators.
GASearchCVcarries all 15 parameters as genes at all times, but only the 2–4 belonging to the selected family do anything. Crossover between two individuals that picked different families mostly recombines inactive genes — the operator that normally drives a GA forward produces little signal here.
So the honest takeaway is a "when not to use": for combined algorithm-selection-and-tuning through a flat parameter grid, prefer a define-by-run Bayesian optimizer. GASearchCV's advantages lie elsewhere — a single estimator over a rugged or mixed numeric space (with enough budget), multi-objective search, and feature selection — not CASH via flat genes.
How to Interpret a Benchmark
- Equal budget is the whole point. A method that "wins" by evaluating 10× more candidates has not won. Compare at a fixed number of evaluations.
- Report variance. A single seed is anecdote. Mean ± std over several seeds is the signal.
- Small datasets compress differences. iris and wine saturate quickly; expect ties. Use them to validate correctness, not to rank optimizers.
- Genetic search has a shape it fits — and CASH is not it. Our own runs are candid in both directions: a mixed/categorical space alone is not enough to beat a strong Bayesian baseline when the data saturates or the budget is small, and a combined algorithm-selection problem through a flat encoding actively works against the genetic operators (Optuna's conditional API wins there). Evolutionary search earns its keep on a single estimator over a rugged or mixed numeric space with enough generations to exploit it — plus multi-objective search and feature selection, which the Bayesmark-style setup here does not measure at all. See When to Use.
See Also
- Comparing Search Methods — a single-task, copy-pasteable comparison with full code.
- Advanced Optimizer Control — the controls used in the "best" GA setup.
- When to Use — picking the right search method for your problem.
- Bayesmark and kurobako — the upstream benchmark tooling.
