Model Save and Load¶
PyOD ships a small, versioned wrapper around joblib that solves
two recurring pain points: cross-sklearn-version compatibility for
saved models, and the absence of any record of what a saved model
was fit with. The recommended API lives in
pyod.utils.persistence.
Quick Start¶
from pyod.models.iforest import IForest
from pyod.utils.persistence import save, load
clf = IForest().fit(X_train)
# Save with a versioned envelope.
save(clf, "clf.pyod.joblib", metadata={"dataset": "demo"})
# Later, in a possibly different environment:
clf = load("clf.pyod.joblib")
# Or get the envelope back alongside the model:
clf, env = load("clf.pyod.joblib", return_metadata=True)
print(env["sklearn_version"], env["saved_at"])
The complete example in
examples/save_load_model_example.py
also covers the legacy joblib.dump / joblib.load flow as a
secondary alternative.
Trust Boundary¶
pickle and joblib deserialize arbitrary Python code. Load only
from sources you trust. This applies equally to raw joblib.load,
raw pickle.load, load(), and
compat_load(). The new wrapper does not
change this security model; it does not sandbox the unpickling step.
Why a Versioned Wrapper¶
Saving a fitted detector with plain joblib.dump writes the model
and nothing else. When a downstream user later calls joblib.load,
the running environment’s sklearn, numpy, scipy, joblib, and Python
versions may differ from the save environment in ways that change
predictions or break loading outright. Users on PyOD have reported
this exact failure mode (see issue
#519) when sklearn
evolves its internal pickle layout; the error message is
ValueError: node array from the pickle has an incompatible dtype.
save() records the dependency versions
in effect at save time alongside the model.
load() reads that envelope and emits a
clear warning when any binary-format dependency drifts, so the issue
surfaces at load time rather than during a later prediction
incident. The schema is documented and stable; future PyOD releases
will read envelopes written by earlier ones.
Loading Legacy Pickles¶
If you already have artifacts saved with raw joblib.dump and they
fail to load with the dtype-mismatch error,
compat_load() repairs the most common
case: sklearn introduced a new Tree-node field (missing_go_to_left
in 1.3) and old pickles do not carry it. compat_load patches
joblib’s unpickler so the saved Tree state is realigned to the
running sklearn’s dtype before sklearn’s own __setstate__ raises.
from pyod.utils.persistence import compat_load
clf = compat_load("legacy.joblib")
# Re-save under the new envelope to avoid repeating the dance:
from pyod.utils.persistence import save
save(clf, "legacy_resaved.pyod.joblib")
You usually do not need to call compat_load directly.
load() falls through to compat_load
automatically when joblib.load raises the documented dtype
error, and routes the recovered model through the same envelope or
legacy handler:
from pyod.utils.persistence import load
clf = load("legacy.joblib") # transparently recovers from dtype drift
The fall-through emits a UserWarning so the recovery does not
go unnoticed. Re-save with save() (or
re-fit on the current sklearn) to remove the dependency on the
compat path.
Decision Tree¶
Saving a new model?
-> use save(clf, path)
Loading a model and load(path) works without warnings?
-> done
Loading a model and load(path) succeeds with a "recovered" warning?
-> the artifact was repaired via compat_load; re-save with save()
Loading a model and load(path) raises?
-> if the error is about Tree-node dtype, try compat_load directly
and check whether the warning recommends re-fit. If it cannot
recover, re-fit on the current sklearn.
Cross-Sklearn-Version Compatibility¶
The most common cross-version failure is the sklearn Tree node dtype
evolving across minor releases. sklearn 1.3 added a
missing_go_to_left field to its Tree node struct; older pickles
omit that field, and loading them on 1.3 or later raises
ValueError: node array from the pickle has an incompatible dtype.
compat_load() is the supported escape
hatch for this case. It is allowlist-driven and conservative:
Missing fields in the saved dtype that PyOD has documented a safe default for (currently only
missing_go_to_left = 0, the pre-1.3 “do not route on missingness” behavior) are zero-filled.Missing fields without a documented default raise
ValueErrorrather than silently inventing a value.Field-level dtype changes beyond byte order (kind, signedness, itemsize, shape) raise
ValueErrorrather than silently casting.Byte-order-only differences are realigned safely.
Two caveats apply. First, compat_load is best-effort: predictions
on inputs that contain missing values may differ from what the
original training would have produced, because zero-filled defaults
for fields like missing_go_to_left need not match what the
original training would have implied. The durable fix is to re-fit on
the current sklearn. Second, compat_load only repairs the Tree
node dtype. Other cross-version sklearn changes (newly required
private cached state, newly added class attributes) are out of
scope. If compat_load succeeds but predictions still fail with a
different sklearn error, re-fit on the current sklearn.
Troubleshooting¶
Error text starts with |
Recommended action |
|---|---|
|
Try |
|
Safe to ignore; sklearn is reminding you the save and run versions differ. Re-save or re-fit when convenient. |
Other sklearn unpickling errors |
The artifact is incompatible beyond what |
Strict Mode¶
For version-pinned production environments, pass strict=True to
load():
from pyod.utils.persistence import load
clf = load("prod.pyod.joblib", strict=True)
Under strict mode, any drift in sklearn, joblib, numpy, or scipy
raises ValueError rather than emitting a warning. Drift in the
Python version does not raise because it is informational only.
Strict mode also rejects raw legacy artifacts (no envelope to
compare against) and refuses to return a model that required a
compat_load repair: strict callers must either re-save under the
current environment or re-fit.
Reading Envelope Metadata¶
load(path, return_metadata=True) returns a (model, envelope)
tuple where envelope is the full envelope dict minus the
model field:
from pyod.utils.persistence import load
clf, env = load("clf.pyod.joblib", return_metadata=True)
print(env["pyod_version"], env["sklearn_version"])
print(env["saved_at"], env["model_class"])
print(env["metadata"]) # whatever you passed to save(... metadata=...)
A future PyOD release plans a true header-only inspect_artifact
(reading metadata without unpickling the model), paired with a
.pyod zip container that separates metadata from the model
payload. Until that ships, load(..., return_metadata=True) is
the supported way to introspect a saved artifact, and it does
unpickle the model.
Neural Network Models¶
Saving deep-learning detectors that wrap torch.nn.Module (e.g.,
AutoEncoder, DeepSVDD, VAE) has separate constraints
that this module does not yet address; see issues
#88
and
#328
for the current workaround.