Utility Functions¶

pyod.utils.encoders module¶

Encoder abstraction for EmbeddingOD.

Provides BaseEncoder and concrete implementations for converting raw data (text, images) to numeric embeddings.

class pyod.utils.encoders.BaseEncoder[source]¶

Bases: ABC

Abstract base class for embedding encoders.

All encoders must implement the encode method, which converts raw input data to a 2D numpy array of shape (n_samples, n_features).

abstractmethod encode(X, batch_size=32, show_progress=True)[source]¶

Convert raw input to numeric embeddings.

Parameters¶

X : list or array-like: Raw input data.
batch_size : int, optional (default=32): Batch size for encoding.
show_progress : bool, optional (default=True): Whether to show a progress bar.

Returns¶

embeddings : numpy array of shape (n_samples, n_features)

class pyod.utils.encoders.CallableEncoder(fn)[source]¶

Bases: BaseEncoder

Encoder that wraps a user-provided callable.

Parameters¶

fn : callable: A function that accepts raw input and returns a numpy array of shape (n_samples, n_features).

Examples¶

>>> import numpy as np
>>> from pyod.utils.encoders import CallableEncoder
>>> encoder = CallableEncoder(fn=lambda X: np.random.randn(len(X), 10))
>>> embeddings = encoder.encode(["hello", "world"])
>>> embeddings.shape
(2, 10)

encode(X, batch_size=32, show_progress=True)[source]¶

Convert raw input to numeric embeddings.

Parameters¶

X : list or array-like: Raw input data.
batch_size : int, optional (default=32): Batch size for encoding.
show_progress : bool, optional (default=True): Whether to show a progress bar.

Returns¶

embeddings : numpy array of shape (n_samples, n_features)

class pyod.utils.encoders.MultiModalEncoder(encoders, weights=None)[source]¶

Bases: BaseEncoder

Encode multiple modalities and concatenate into a single embedding.

Each modality is encoded by its own encoder. The resulting embeddings are concatenated column-wise into a single feature matrix suitable for any PyOD detector.

Parameters¶

encoders : dict of {str: encoder}: Maps modality name to encoder. Each value can be:
- A string (resolved via resolve_encoder at encode time)
- A BaseEncoder instance
- 'passthrough' for pre-computed numeric features
weights : dict of {str: float} or None, optional (default=None): Per-modality scaling applied after encoding. Useful when embedding dimensions differ significantly across modalities.

Examples¶

>>> import numpy as np
>>> from pyod.utils.encoders import MultiModalEncoder
>>> encoder = MultiModalEncoder({
...     'text': 'all-MiniLM-L6-v2',
...     'tabular': 'passthrough',
... })
>>> data = {'text': ["hello", "world"],
...         'tabular': np.array([[1, 2], [3, 4]])}
>>> embeddings = encoder.encode(data)
>>> embeddings.shape[0]
2

encode(X, batch_size=32, show_progress=True)[source]¶

Encode multi-modal input and concatenate.

Parameters¶

X : dict of {str: data}: Maps modality name to input data. Keys must match the encoders dict. Individual samples may be None to indicate a missing modality for that sample; missing embeddings are imputed with the training mean (if fit_encode was called) or zeros.
batch_size : int, optional (default=32): Batch size for encoding.
show_progress : bool, optional (default=True): Whether to show a progress bar.

Returns¶

embeddings : numpy array of shape (n_samples, total_features)

fit_encode(X, batch_size=32, show_progress=True)[source]¶

Encode training data and store per-modality mean embeddings.

Call this during training (EmbeddingOD.fit) so that mean embeddings are available for imputing missing samples at test time. Subsequent calls to encode will use these stored means.

Parameters¶

X : dict of {str: data}: Training data. Should not contain None samples.

Returns¶

embeddings : numpy array of shape (n_samples, total_features)
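The concatenation, per-modality weighting, and mean imputation described for MultiModalEncoder can be sketched with NumPy alone. This is an illustration under stated assumptions, not PyOD's implementation: `concat_modalities` and its arguments are hypothetical names, the per-modality embeddings are assumed to be already computed, missing samples are marked with a boolean mask instead of None, and imputing before scaling is an assumed order.

```python
import numpy as np


def concat_modalities(embeddings, weights=None, missing=None, train_means=None):
    """Hypothetical helper: impute, scale, and concatenate per-modality embeddings."""
    weights = weights or {}
    missing = missing or {}
    train_means = train_means or {}
    blocks = []
    for name, emb in embeddings.items():
        emb = np.asarray(emb, dtype=float).copy()
        mask = missing.get(name)
        if mask is not None:
            # Use the stored training mean if present, otherwise zeros.
            fill = train_means.get(name, np.zeros(emb.shape[1]))
            emb[np.asarray(mask, dtype=bool)] = fill
        # Per-modality scaling (defaults to 1.0 when no weight is given).
        blocks.append(emb * weights.get(name, 1.0))
    # Column-wise concatenation into one feature matrix.
    return np.hstack(blocks)


text_emb = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
tab_emb = np.array([[1.0, 2.0], [3.0, 4.0]])
out = concat_modalities(
    {"text": text_emb, "tabular": tab_emb},
    weights={"tabular": 0.5},
    missing={"text": [False, True]},            # second text sample is missing
    train_means={"text": np.array([0.5, 0.5, 0.5])},
)
print(out.shape)  # (2, 5): 3 text columns + 2 tabular columns
```

The output width is the sum of the per-modality widths, which is why scaling can matter when one modality contributes far more columns than another.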

pyod.utils.encoders.resolve_encoder(encoder)[source]¶

Resolve an encoder from various input types.

Parameters¶

encoder : str, BaseEncoder, or callable

- If a BaseEncoder instance, it is returned as-is.
- If a callable, it is wrapped in CallableEncoder.
- If a string, it is looked up in the encoder registry. If not found, sentence-transformers is tried first, then HuggingFace AutoModel.

The auto-resolve fallback is designed for text embedding models. For image models (DINOv2, CLIP, etc.), use registry shortcuts (e.g., 'dinov2-small', 'clip-vit-base') instead of raw HuggingFace model IDs.

Returns¶

encoder : BaseEncoder

pyod.utils.encoders.sentence_transformer module¶

SentenceTransformerEncoder for EmbeddingOD.

class pyod.utils.encoders.sentence_transformer.SentenceTransformerEncoder(model_name='all-MiniLM-L6-v2', device=None, normalize=False, truncate_dim=None)[source]¶

Bases: BaseEncoder

Encoder using sentence-transformers library.

Wraps sentence_transformers.SentenceTransformer to produce text embeddings compatible with PyOD detectors.

Parameters¶

model_name : str or SentenceTransformer instance, optional (default='all-MiniLM-L6-v2')

- If str: a model name (HF Hub ID) or a local filesystem path. A local path is detected and loaded with local_files_only=True to prevent any network call.
- If SentenceTransformer instance: used directly, skipping the load. Useful for air-gapped environments or when you need custom model configuration not exposed via constructor parameters.

device : str or None, optional (default=None)

Device for inference ('cpu', 'cuda', etc.). None for auto-detection.

normalize : bool, optional (default=False)

L2-normalize output embeddings.

truncate_dim : int or None, optional (default=None)

Truncate embeddings to this dimensionality (Matryoshka).

Examples¶

>>> from pyod.utils.encoders.sentence_transformer import \
...     SentenceTransformerEncoder
>>> encoder = SentenceTransformerEncoder('all-MiniLM-L6-v2')
>>> embeddings = encoder.encode(["hello world", "anomaly text"])
>>> embeddings.shape
(2, 384)

>>> # Local filesystem path (air-gapped)
>>> enc = SentenceTransformerEncoder('/mnt/models/my-weights')

>>> # Pre-instantiated model object
>>> from sentence_transformers import SentenceTransformer
>>> my_model = SentenceTransformer('all-MiniLM-L6-v2')
>>> enc = SentenceTransformerEncoder(my_model)

encode(X, batch_size=32, show_progress=True)[source]¶

Encode text strings to embeddings.

Parameters¶

X : list of str: Text strings to encode.
batch_size : int, optional (default=32): Batch size for encoding.
show_progress : bool, optional (default=True): Whether to show a progress bar.

Returns¶

embeddings : numpy array of shape (n_samples, n_features)
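The normalize and truncate_dim options amount to two simple array operations. The sketch below shows them on a plain NumPy array; `postprocess` is a hypothetical name, and applying truncation before normalization is an assumption made here, not something the documentation above states.

```python
import numpy as np


def postprocess(embeddings, truncate_dim=None, normalize=False):
    """Hypothetical post-processing: Matryoshka-style truncation, then L2 normalization."""
    emb = np.asarray(embeddings, dtype=float)
    if truncate_dim is not None:
        # Keep only the leading truncate_dim dimensions.
        emb = emb[:, :truncate_dim]
    if normalize:
        # Divide each row by its L2 norm; clip to avoid dividing by zero.
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.clip(norms, 1e-12, None)
    return emb


raw = np.array([[3.0, 4.0, 12.0], [1.0, 0.0, 0.0]])
out = postprocess(raw, truncate_dim=2, normalize=True)
print(out)  # rows have unit length after truncation to 2 dimensions
```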

pyod.utils.encoders.openai_encoder module¶

OpenAIEncoder for EmbeddingOD.

class pyod.utils.encoders.openai_encoder.OpenAIEncoder(model_name='text-embedding-3-small', dimensions=None, api_key=None)[source]¶

Bases: BaseEncoder

Encoder using OpenAI Embeddings API.

Produces text embeddings via the OpenAI API. Handles batching (max 2048 items per request) internally.

Parameters¶

model_name : str, optional (default='text-embedding-3-small'): OpenAI embedding model name.
dimensions : int or None, optional (default=None): Truncate embeddings to this dimensionality (Matryoshka). Only supported by text-embedding-3-* models.
api_key : str or None, optional (default=None): OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.

Examples¶

>>> from pyod.utils.encoders.openai_encoder import OpenAIEncoder
>>> encoder = OpenAIEncoder('text-embedding-3-small')
>>> embeddings = encoder.encode(["normal text", "anomalous text"])

encode(X, batch_size=2048, show_progress=True)[source]¶

Encode text strings to embeddings via OpenAI API.

Parameters¶

X : list of str: Text strings to encode.
batch_size : int, optional (default=2048): Batch size. Capped at 2048 (OpenAI API limit).
show_progress : bool, optional (default=True): Show progress bar (not used for API calls).

Returns¶

embeddings : numpy array of shape (n_samples, n_features)
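The batching behaviour described above (requests capped at 2048 items) can be illustrated without any API call. `iter_batches` is a hypothetical helper written for this sketch; it only shows how a capped batch size splits the input, not how OpenAIEncoder issues requests.

```python
MAX_ITEMS_PER_REQUEST = 2048  # per-request limit stated in the text above


def iter_batches(items, batch_size=2048):
    """Hypothetical helper: yield slices no larger than the capped batch size."""
    size = max(1, min(batch_size, MAX_ITEMS_PER_REQUEST))
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = [f"doc {i}" for i in range(5000)]
# A requested batch size above the limit is capped at 2048.
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=4096)]
print(batch_sizes)
```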

pyod.utils.encoders.huggingface module¶

HuggingFaceEncoder for EmbeddingOD.

class pyod.utils.encoders.huggingface.HuggingFaceEncoder(model_name, device=None, pooling='cls', modality='text')[source]¶

Bases: BaseEncoder

Encoder using HuggingFace transformers.

Supports both text (AutoTokenizer + AutoModel) and image (AutoImageProcessor + AutoModel) modalities.

Parameters¶

model_name : str: HuggingFace model name or path.
device : str or None, optional (default=None): Device for inference. None for auto-detection.
pooling : str, optional (default='cls'): Pooling strategy: 'cls' for the CLS token, 'mean' for the mean of all token embeddings.
modality : str, optional (default='text'): Input modality: 'text' or 'image'.

Examples¶

>>> from pyod.utils.encoders.huggingface import HuggingFaceEncoder
>>> encoder = HuggingFaceEncoder('bert-base-uncased', modality='text')
>>> embeddings = encoder.encode(["hello", "world"])

encode(X, batch_size=32, show_progress=True)[source]¶

Encode text or images to embeddings.

Parameters¶

X : list of str (text) or list of PIL.Image (image): Input data.
batch_size : int, optional (default=32): Batch size for encoding.
show_progress : bool, optional (default=True): Whether to show a progress bar.

Returns¶

embeddings : numpy array of shape (n_samples, n_features)
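The two pooling strategies reduce a (n_samples, n_tokens, hidden_size) tensor of token embeddings to one vector per sample. The NumPy sketch below uses a hypothetical `pool` function and assumes the CLS token sits at position 0, as it does for BERT-style models; it ignores padding masks, which a real encoder may need to handle.

```python
import numpy as np


def pool(hidden_states, pooling="cls"):
    """Hypothetical pooling over the token axis of a 3D array."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    if pooling == "cls":
        # Take the first token's embedding (assumed to be CLS).
        return hidden_states[:, 0, :]
    if pooling == "mean":
        # Average over all token embeddings.
        return hidden_states.mean(axis=1)
    raise ValueError(f"Unknown pooling strategy: {pooling!r}")


# Two samples, three tokens each, hidden size four.
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)
cls_vectors = pool(hidden, "cls")
mean_vectors = pool(hidden, "mean")
print(cls_vectors.shape, mean_vectors.shape)
```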

pyod.utils.data module¶

Utility functions for manipulating data

pyod.utils.data.check_consistent_shape(X_train, y_train, X_test, y_test, y_train_pred, y_test_pred)[source]¶

Internal function to check that input data shapes are consistent.

Parameters¶

X_train : numpy array of shape (n_samples, n_features): The training samples.
y_train : list or array of shape (n_samples,): The ground truth of training samples.
X_test : numpy array of shape (n_samples, n_features): The test samples.
y_test : list or array of shape (n_samples,): The ground truth of test samples.
y_train_pred : numpy array of shape (n_samples,): The predicted binary labels of the training samples.
y_test_pred : numpy array of shape (n_samples,): The predicted binary labels of the test samples.

Returns¶

X_train : numpy array of shape (n_samples, n_features): The training samples.
y_train : list or array of shape (n_samples,): The ground truth of training samples.
X_test : numpy array of shape (n_samples, n_features): The test samples.
y_test : list or array of shape (n_samples,): The ground truth of test samples.
y_train_pred : numpy array of shape (n_samples,): The predicted binary labels of the training samples.
y_test_pred : numpy array of shape (n_samples,): The predicted binary labels of the test samples.

pyod.utils.data.evaluate_print(clf_name, y, y_pred)[source]¶

Utility function for evaluating and printing the results for examples. Default metrics include ROC AUC and precision @ rank n.

Parameters¶

clf_name : str: The name of the detector.
y : list or numpy array of shape (n_samples,): The ground truth. Binary (0: inliers, 1: outliers).
y_pred : list or numpy array of shape (n_samples,): The raw outlier scores as returned by a fitted model.

pyod.utils.data.generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1, train_only=False, offset=10, behaviour='new', random_state=None, n_nan=0, n_inf=0)[source]¶

Utility function to generate synthesized data. Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution. “X_train, X_test, y_train, y_test” are returned.

Parameters¶

n_train : int, (default=1000): The number of training points to generate.
n_test : int, (default=500): The number of test points to generate.
n_features : int, optional (default=2): The number of features (dimensions).
contamination : float in (0., 0.5), optional (default=0.1): The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
train_only : bool, optional (default=False): If True, generate training data only.
offset : int, optional (default=10): Adjust the value range of Gaussian and Uniform.
behaviour : str, default='new': Behaviour of the returned datasets, which can be either 'old' or 'new'. Passing behaviour='new' returns "X_train, X_test, y_train, y_test", while passing behaviour='old' returns "X_train, y_train, X_test, y_test".
random_state : int, RandomState instance or None, optional (default=None): If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_nan : int: The number of values that are missing (np.nan). Defaults to zero.
n_inf : int: The number of values that are infinite (np.inf). Defaults to zero.

Returns¶

X_train : numpy array of shape (n_train, n_features): Training data.
X_test : numpy array of shape (n_test, n_features): Test data.
y_train : numpy array of shape (n_train,): Training ground truth.
y_test : numpy array of shape (n_test,): Test ground truth.
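The generation scheme described above (Gaussian inliers, uniform outliers, a contamination fraction) can be sketched with NumPy. This is not the body of generate_data: the way `offset` positions the two distributions here is an assumption for illustration, and all variable names are local to the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, contamination, offset = 200, 2, 0.1, 10

n_outliers = int(n_samples * contamination)
n_inliers = n_samples - n_outliers

# Inliers: Gaussian cloud (centered at `offset` by assumption).
X_in = rng.normal(loc=offset, scale=1.0, size=(n_inliers, n_features))
# Outliers: uniform over a wider box (range chosen by assumption).
X_out = rng.uniform(low=-offset, high=offset, size=(n_outliers, n_features))

X = np.vstack([X_in, X_out])
# Ground truth follows the convention above: 0 for inliers, 1 for outliers.
y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
print(X.shape, int(y.sum()))
```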

pyod.utils.data.generate_data_categorical(n_train=1000, n_test=500, n_features=2, n_informative=2, n_category_in=2, n_category_out=2, contamination=0.1, shuffle=True, random_state=None)[source]¶

Utility function to generate synthesized categorical data.

Parameters¶

n_train : int, (default=1000): The number of training points to generate.
n_test : int, (default=500): The number of test points to generate.
n_features : int, optional (default=2): The number of features for each sample.
n_informative : int in (1, n_features), optional (default=2): The number of informative features in the outlier points. The higher the value, the easier the outlier detection should be. Note that n_informative must not exceed n_features.
n_category_in : int in (1, n_inliers), optional (default=2): The number of categories in the inlier points.
n_category_out : int in (1, n_outliers), optional (default=2): The number of categories in the outlier points.
contamination : float in (0., 0.5), optional (default=0.1): The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
shuffle : bool, optional (default=True): If True, inliers will be shuffled, which makes the distribution noisier.
random_state : int, RandomState instance or None, optional (default=None): If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns¶

X_train : numpy array of shape (n_train, n_features): Training data.
y_train : numpy array of shape (n_train,): Training ground truth.
X_test : numpy array of shape (n_test, n_features): Test data.
y_test : numpy array of shape (n_test,): Test ground truth.

pyod.utils.data.generate_data_clusters(n_train=1000, n_test=500, n_clusters=2, n_features=2, contamination=0.1, size='same', density='same', dist=0.25, random_state=None, return_in_clusters=False)[source]¶

Utility function to generate synthesized data in clusters. Generated data can involve the low-density pattern problem and global outliers, which are considered difficult tasks for outlier detection algorithms.

Parameters¶

n_train : int, (default=1000): The number of training points to generate.
n_test : int, (default=500): The number of test points to generate.
n_clusters : int, optional (default=2): The number of centers (i.e. clusters) to generate.
n_features : int, optional (default=2): The number of features for each sample.
contamination : float in (0., 0.5), optional (default=0.1): The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
size : str, optional (default='same'): Size of each cluster: 'same' generates clusters with the same size, 'different' generates clusters with different sizes.
density : str, optional (default='same'): Density of each cluster: 'same' generates clusters with the same density, 'different' generates clusters with different densities.
dist : float, optional (default=0.25): Distance between clusters. Should be between 0. and 1.0. It is used to avoid overlapping clusters as much as possible. However, if the number of samples and the number of clusters are too high, the clusters are unlikely to separate fully even if dist is set to 1.0.
random_state : int, RandomState instance or None, optional (default=None): If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
return_in_clusters : bool, optional (default=False): If True, the function returns X_train, y_train, X_test, y_test each as a list of numpy arrays where each index represents a cluster. If False, it returns X_train, y_train, X_test, y_test each as a numpy array after joining the per-cluster arrays.

Returns¶

X_train : numpy array of shape (n_train, n_features): Training data.
y_train : numpy array of shape (n_train,): Training ground truth.
X_test : numpy array of shape (n_test, n_features): Test data.
y_test : numpy array of shape (n_test,): Test ground truth.

pyod.utils.data.generate_graph_data(n_nodes=300, n_features=16, n_edges_per_node=5, contamination=0.1, random_state=None)[source]¶

Generate synthetic attributed graph data with planted anomalies.

Normal nodes have features from N(0, 1). Anomaly nodes have features shifted by +5 standard deviations. Edges are generated via random neighbor selection (undirected, no self-loops, no duplicates).

Parameters¶

n_nodes : int, default=300: Number of nodes.
n_features : int, default=16: Dimensionality of node features.
n_edges_per_node : int, default=5: Average number of edges per node (Poisson-sampled per node).
contamination : float, default=0.1: Fraction of nodes that are anomalies.
random_state : int, RandomState or None, default=None: Seed for reproducibility.

Returns¶

X : np.ndarray of shape (n_nodes, n_features): Node feature matrix (float32).
edge_index : np.ndarray of shape (2, n_edges): COO-format edge list (int64, undirected, no self-loops).
y : np.ndarray of shape (n_nodes,): Binary labels: 0 = normal, 1 = anomaly.

pyod.utils.data.generate_ts_data(n_train=500, n_test=200, n_channels=1, contamination=0.05, period=50, noise_std=0.3, anomaly_type='point', random_state=None)[source]¶

Generate synthetic time series data with injected anomalies.

Creates a sinusoidal base signal with Gaussian noise and injects anomalies at random locations. Follows conventions from the TS-AD literature (e.g., TSB-AD benchmark).

Parameters¶

n_train : int, optional (default=500): Length of training time series.
n_test : int, optional (default=200): Length of test time series.
n_channels : int, optional (default=1): Number of channels (univariate=1, multivariate>1).
contamination : float, optional (default=0.05): Fraction of timestamps that are anomalous (approximately). For subsequence anomalies, the total labeled timestamps are controlled to stay near this fraction.
period : int, optional (default=50): Period of the sinusoidal base signal.
noise_std : float, optional (default=0.3): Standard deviation of Gaussian noise.
anomaly_type : str, optional (default='point'): Type of anomaly: 'point' (spikes), 'subsequence' (shape change), or 'both'.
random_state : int, RandomState instance, or None (default=None): Random seed for reproducibility.

Returns¶

X_train : np.ndarray of shape (n_train,) or (n_train, n_channels): Training time series. Univariate returned as 1D.
X_test : np.ndarray of shape (n_test,) or (n_test, n_channels): Test time series.
y_train : np.ndarray of shape (n_train,): Binary labels (1=anomaly, 0=normal) for training.
y_test : np.ndarray of shape (n_test,): Binary labels for test.
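The recipe described above (sinusoidal base signal, Gaussian noise, injected point anomalies) can be sketched as follows. This is an independent illustration, not the body of generate_ts_data; the spike magnitude of 5.0 and the way anomaly positions are drawn are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, period, noise_std, contamination = 500, 50, 0.3, 0.05

# Base signal: sinusoid with the given period plus Gaussian noise.
t = np.arange(n)
signal = np.sin(2 * np.pi * t / period) + rng.normal(0.0, noise_std, size=n)

# Point anomalies: pick distinct timestamps and add a large spike to each.
n_anomalies = int(n * contamination)
idx = rng.choice(n, size=n_anomalies, replace=False)
labels = np.zeros(n, dtype=int)
labels[idx] = 1
signal[idx] += rng.choice([-1.0, 1.0], size=n_anomalies) * 5.0  # assumed magnitude

print(signal.shape, labels.sum())
```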

pyod.utils.data.get_outliers_inliers(X, y)[source]¶

Internal method to separate inliers from outliers.

Parameters¶

X : numpy array of shape (n_samples, n_features): The input samples.
y : list or array of shape (n_samples,): The ground truth of input samples.

Returns¶

X_outliers : numpy array of shape (n_outliers, n_features): Outliers.
X_inliers : numpy array of shape (n_inliers, n_features): Inliers.

pyod.utils.example module¶

Utility functions for running examples

pyod.utils.example.data_visualize(X_train, y_train, show_figure=True, save_figure=False)[source]¶

Utility function for visualizing the synthetic samples generated by the generate_data_clusters function.

Parameters¶

X_train : numpy array of shape (n_samples, n_features): The training samples.
y_train : list or array of shape (n_samples,): The ground truth of training samples.
show_figure : bool, optional (default=True): If set to True, show the figure.
save_figure : bool, optional (default=False): If set to True, save the figure locally.

pyod.utils.example.visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)[source]¶

Utility function for visualizing the results in examples. Internal use only.

Parameters¶

clf_name : str: The name of the detector.
X_train : numpy array of shape (n_samples, n_features): The training samples.
y_train : list or array of shape (n_samples,): The ground truth of training samples.
X_test : numpy array of shape (n_samples, n_features): The test samples.
y_test : list or array of shape (n_samples,): The ground truth of test samples.
y_train_pred : numpy array of shape (n_samples,): The predicted binary labels of the training samples.
y_test_pred : numpy array of shape (n_samples,): The predicted binary labels of the test samples.
show_figure : bool, optional (default=True): If set to True, show the figure.
save_figure : bool, optional (default=False): If set to True, save the figure locally.

pyod.utils.stat_models module¶

A collection of statistical models

pyod.utils.stat_models.column_ecdf(matrix: ndarray) → ndarray[source]¶

Utility function to compute the column-wise empirical cumulative distribution of a 2D feature matrix, where the rows are samples and the columns are features per sample. The accumulation is done in the positive direction of the sample axis.

For example, given p(0) = 0.3, p(1) = 0.2, p(2) = 0.1, and p(6) = 0.4, the ECDF is E(x) = p(X <= x), so E(-1) = 0, E(0) = 0.3, E(1) = 0.5, E(2) = 0.6, E(3) = 0.6, E(4) = 0.6, E(5) = 0.6, and E(6) = 1.
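The worked example can be reproduced with a small NumPy sketch that evaluates the ECDF of each column at its own sample values. `column_ecdf_sketch` is a hypothetical name and a deliberately simple O(n^2) formulation for illustration; it is not the sorted, compiled routine this module provides.

```python
import numpy as np


def column_ecdf_sketch(matrix):
    """Hypothetical helper: per-column ECDF evaluated at every sample value."""
    matrix = np.asarray(matrix, dtype=float)
    # less_equal[i, k, j] is True when sample k is <= sample i in column j.
    less_equal = matrix[None, :, :] <= matrix[:, None, :]
    # Averaging over k gives the fraction of samples <= each value.
    return less_equal.mean(axis=1)


# Ten samples matching p(0)=0.3, p(1)=0.2, p(2)=0.1, p(6)=0.4.
column = np.array([0, 0, 0, 1, 1, 2, 6, 6, 6, 6], dtype=float).reshape(-1, 1)
ecdf = column_ecdf_sketch(column)
print(ecdf.ravel())
```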

Returns¶

pyod.utils.stat_models.ecdf_terminate_equals_inplace(matrix: ndarray, probabilities: ndarray)[source]¶

This is a helper function for computing the ECDF of an array. It was factored out of the original function so that it can be compiled with Numba's njit for speed, because it needs a loop over all rows and columns of a matrix. It acts in place on the probabilities matrix.

Parameters¶

matrix : a feature matrix where the rows are samples and each column is a feature (expected to be sorted).

probabilities : a probability matrix that will be used to build the ECDF. It has values between 0 and 1 and is also sorted.

Returns¶

pyod.utils.stat_models.pairwise_distances_no_broadcast(X, Y)[source]¶

Utility function to calculate the row-wise Euclidean distance between two matrices. Unlike a pairwise calculation, this function does not broadcast.

For instance, if X and Y are both (4,3) matrices, the function returns a distance vector with shape (4,) instead of (4,4).

Parameters¶

X : array of shape (n_samples, n_features): First input samples.
Y : array of shape (n_samples, n_features): Second input samples.

Returns¶

distance : array of shape (n_samples,): Row-wise Euclidean distance of X and Y.
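A row-wise distance of this kind is a one-liner in NumPy. The sketch below uses a hypothetical `rowwise_euclidean` function to show the shape behaviour described above; it is not the library's implementation.

```python
import numpy as np


def rowwise_euclidean(X, Y):
    """Hypothetical helper: distance between matching rows of X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    # Sum squared differences along the feature axis only.
    return np.sqrt(np.sum((X - Y) ** 2, axis=1))


X = np.zeros((4, 3))
Y = np.ones((4, 3))
d = rowwise_euclidean(X, Y)
print(d.shape)  # (4,), one distance per row rather than a (4, 4) matrix
```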

pyod.utils.stat_models.pearsonr_mat(mat, w=None)[source]¶

Utility function to calculate the Pearson correlation matrix (row-wise).

Parameters¶

mat : numpy array of shape (n_samples, n_features): Input matrix.
w : numpy array of shape (n_features,): Weights.

Returns¶

pear_mat : numpy array of shape (n_samples, n_samples): Row-wise Pearson score matrix.

pyod.utils.stat_models.wpearsonr(x, y, w=None)[source]¶

Utility function to calculate the weighted Pearson correlation of two samples.

See https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation for more information

Parameters¶

x : array, shape (n,): Input x.
y : array, shape (n,): Input y.
w : array, shape (n,): Weights w.

Returns¶

scores : float in range of [-1,1]: Weighted Pearson correlation between x and y.
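The weighted Pearson correlation replaces ordinary means and covariances with weighted ones. The sketch below follows the standard definition (weighted covariance divided by the product of weighted standard deviations); `weighted_pearson` is a hypothetical name and the code is an illustration rather than the library's source.

```python
import numpy as np


def weighted_pearson(x, y, w=None):
    """Hypothetical helper: Pearson correlation with per-sample weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)

    def wmean(a):
        return np.sum(w * a) / np.sum(w)

    def wcov(a, b):
        return np.sum(w * (a - wmean(a)) * (b - wmean(b))) / np.sum(w)

    return wcov(x, y) / np.sqrt(wcov(x, x) * wcov(y, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r_equal = weighted_pearson(x, y)                       # equal weights
r_weighted = weighted_pearson(x, y, w=[5, 1, 1, 1, 5])  # emphasize the endpoints
print(r_equal, r_weighted)
```

With equal weights the result coincides with the ordinary Pearson coefficient, which is a convenient sanity check.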

pyod.utils.utility module¶

A set of utility functions to support outlier detection.

pyod.utils.utility.argmaxn(value_list, n, order='desc')[source]¶

Return the indices of the top n elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest elements.

Parameters¶

value_list : list, array, numpy array of shape (n_samples,)

A list containing all values.

n : int

The number of elements to select.

order : str, optional (default='desc')

The order to sort {'desc', 'asc'}:

- 'desc': descending
- 'asc': ascending

Returns¶

index_list : numpy array of shape (n,): The indices of the selected n elements.
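Selecting the indices of the n largest or smallest values can be sketched with np.argsort. `top_n_indices` is a hypothetical stand-in written for illustration; the order in which the library returns the selected indices may differ from this sketch.

```python
import numpy as np


def top_n_indices(values, n, order="desc"):
    """Hypothetical helper: indices of the n largest ('desc') or smallest ('asc') values."""
    values = np.asarray(values)
    ranked = np.argsort(values, kind="stable")  # ascending order of values
    if order == "desc":
        ranked = ranked[::-1]
    elif order != "asc":
        raise ValueError("order must be 'desc' or 'asc'")
    return ranked[:n]


values = [0.1, 0.9, 0.4, 0.7]
largest = top_n_indices(values, 2, order="desc")
smallest = top_n_indices(values, 2, order="asc")
print(largest, smallest)
```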

pyod.utils.utility.check_detector(detector)[source]¶

Check whether the fit and decision_function methods exist for the given detector.

Parameters¶

detector : pyod.models: Detector instance for which the check is performed.

pyod.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)[source]¶

Check if an input is within the defined range.

Parameters¶

param : int, float: The input parameter to check.
low : int, float: The lower bound of the range.
high : int, float: The higher bound of the range.
param_name : str, optional (default=''): The name of the parameter.
include_left : bool, optional (default=False): Whether to include the lower bound (lower bound <=).
include_right : bool, optional (default=False): Whether to include the higher bound (<= higher bound).

Returns¶

within_range : bool or raise errors: Whether the parameter is within the range of (low, high).

pyod.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)[source]¶

Randomly draw feature indices. Internal use only.

Modified from sklearn/ensemble/bagging.py

Parameters¶

random_state : RandomState: A random number generator instance to define the state of the random permutations generator.
bootstrap_features : bool: Specifies whether to bootstrap index generation.
n_features : int: Specifies the population size when generating indices.
min_features : int: Lower limit for the number of features to randomly sample.
max_features : int: Upper limit for the number of features to randomly sample.

Returns¶

feature_indices : numpy array, shape (n_samples,): Indices for features to bag.

pyod.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)[source]¶

Draw randomly sampled indices. Internal use only.

See sklearn/ensemble/bagging.py

Parameters¶

random_state : RandomState: A random number generator instance to define the state of the random permutations generator.
bootstrap : bool: Specifies whether to bootstrap index generation.
n_population : int: Specifies the population size when generating indices.
n_samples : int: Specifies the number of samples to draw.

Returns¶

indices : numpy array, shape (n_samples,): Randomly drawn indices.

pyod.utils.utility.get_diff_elements(li1, li2)[source]¶

Get the elements that are in li1 but not li2, and vice versa.

Parameters¶

li1 : list or numpy array: Input list 1.
li2 : list or numpy array: Input list 2.

Returns¶

difference : list: The difference between li1 and li2.

pyod.utils.utility.get_intersection(lst1, lst2)[source]¶

Get the overlap between two lists.

Parameters¶

lst1 : list or numpy array: Input list 1.
lst2 : list or numpy array: Input list 2.

Returns¶

intersection : list: The overlap between lst1 and lst2.

pyod.utils.utility.get_label_n(y, y_pred, n=None)[source]¶

Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters¶

y : list or numpy array of shape (n_samples,): The ground truth. Binary (0: inliers, 1: outliers).
y_pred : list or numpy array of shape (n_samples,): The raw outlier scores as returned by a fitted model.
n : int, optional (default=None): The number of outliers. If not defined, it is inferred from the ground truth.

Returns¶

labels : numpy array of shape (n_samples,): Binary labels (0: normal points, 1: outliers).

Examples¶

>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])

pyod.utils.utility.get_list_diff(li1, li2)[source]¶

Get the elements that are in li1 but not li2 (li1 - li2).

Parameters¶

li1 : list or numpy array: Input list 1.
li2 : list or numpy array: Input list 2.

Returns¶

difference : list: The difference between li1 and li2.

pyod.utils.utility.get_optimal_n_bins(X, upper_bound=None, epsilon=1)[source]¶

Determine the optimal number of bins for a histogram using the Birgé-Rozenholc method (see [MBirgeR06] for details).

See https://doi.org/10.1051/ps:2006001

Parameters¶

X : array-like of shape (n_samples, n_features): The samples to determine the optimal number of bins for.
upper_bound : int, default=None: The maximum value of n_bins to be considered. If set to None, np.sqrt(X.shape[0]) will be used as the upper bound.
epsilon : float, default=1: A stabilizing term added to the logarithm to prevent division by zero.

Returns¶

optimal_n_bins : int: The optimal value of n_bins according to the Birgé-Rozenholc method.
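The idea behind this kind of bin selection is to maximize a penalized log-likelihood over candidate bin counts D. The sketch below uses the penalty D - 1 + (log D)^2.5 from the cited paper and places epsilon inside the logarithm as the parameter description suggests. It is an assumption-laden illustration with a hypothetical name (`optimal_bins_sketch`), not a copy of the library routine, and its exact output may differ from get_optimal_n_bins.

```python
import numpy as np


def optimal_bins_sketch(x, upper_bound=None, epsilon=1.0):
    """Hypothetical helper: pick the bin count maximizing a penalized log-likelihood."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    if upper_bound is None:
        upper_bound = int(np.sqrt(n))
    best_d, best_score = 1, -np.inf
    for d in range(1, upper_bound + 1):
        counts, _ = np.histogram(x, bins=d)
        # Log-likelihood of a regular histogram with d bins (epsilon stabilizes empty bins).
        log_lik = np.sum(counts * np.log(d * counts / n + epsilon))
        # Penalty that grows with the number of bins.
        penalty = d - 1 + np.log(d) ** 2.5
        score = log_lik - penalty
        if score > best_score:
            best_d, best_score = d, score
    return best_d


rng = np.random.default_rng(0)
n_bins = optimal_bins_sketch(rng.normal(size=400))
print(n_bins)
```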

pyod.utils.utility.invert_order(scores, method='multiplication')[source]¶

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful while combining multiple detectors since their score order could be different.

Parameters¶

scores : list, array or numpy array with shape (n_samples,)

The list of values to be inverted.

method : str, optional (default='multiplication')

Methods used for order inversion. Valid methods are:

- 'multiplication': multiply by -1
- 'subtraction': max(scores) - scores

Returns¶

inverted_scores : numpy array of shape (n_samples,): The inverted list.

Examples¶

>>> from pyod.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])

pyod.utils.utility.precision_n_scores(y, y_pred, n=None)[source]¶

Utility function to calculate precision @ rank n.

Parameters¶

y : list or numpy array of shape (n_samples,): The ground truth. Binary (0: inliers, 1: outliers).
y_pred : list or numpy array of shape (n_samples,): The raw outlier scores as returned by a fitted model.
n : int, optional (default=None): The number of outliers. If not defined, it is inferred from the ground truth.

Returns¶

precision_at_rank_n : float: Precision at rank n score.
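Precision at rank n asks what fraction of the n highest-scoring samples are true outliers. The sketch below is a direct reading of that definition with a hypothetical `precision_at_n` function; the library's handling of tied scores may differ.

```python
import numpy as np


def precision_at_n(y, scores, n=None):
    """Hypothetical helper: fraction of true outliers among the top-n scores."""
    y = np.asarray(y)
    scores = np.asarray(scores, dtype=float)
    if n is None:
        # Infer n from the ground truth: the number of labeled outliers.
        n = int(np.sum(y))
    top = np.argsort(-scores, kind="stable")[:n]
    return float(np.mean(y[top]))


y = [0, 1, 1, 0, 0]
scores = [0.1, 0.5, 0.3, 0.2, 0.7]
p_at_n = precision_at_n(y, scores)
print(p_at_n)  # the two highest scores hit one true outlier out of two
```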

pyod.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)[source]¶

Turn raw outlier scores into binary labels (0 or 1).

Parameters¶

pred_scores : list or numpy array of shape (n_samples,): Raw outlier scores. Outliers are assumed to have larger values.
outliers_fraction : float in (0,1): Fraction of outliers.

Returns¶

outlier_labels : numpy array of shape (n_samples,): For each observation, tells whether or not it should be considered an outlier (0: inlier, 1: outlier).
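One common way to turn scores into labels given an expected outlier fraction is to threshold at the matching percentile. The sketch below assumes that approach; `scores_to_labels` is a hypothetical name, and the strict "greater than" comparison is a choice made for this illustration.

```python
import numpy as np


def scores_to_labels(scores, outliers_fraction=0.1):
    """Hypothetical helper: label the highest-scoring fraction as outliers."""
    scores = np.asarray(scores, dtype=float)
    # Scores above the (1 - fraction) percentile are flagged as outliers.
    threshold = np.percentile(scores, 100 * (1 - outliers_fraction))
    return (scores > threshold).astype(int)


scores = np.arange(10, dtype=float)  # 0.0, 1.0, ..., 9.0
labels = scores_to_labels(scores, outliers_fraction=0.2)
print(labels)
```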

pyod.utils.utility.standardizer(X, X_t=None, keep_scalar=False)[source]¶

Conduct Z-normalization on data so that input samples become zero-mean and unit-variance.

Parameters¶

X : numpy array of shape (n_samples, n_features): The training samples.
X_t : numpy array of shape (n_samples_new, n_features), optional (default=None): The data to be converted.
keep_scalar : bool, optional (default=False): Whether to also return the fitted scaler.

Returns¶

X_norm : numpy array of shape (n_samples, n_features): X after the Z-score normalization.
X_t_norm : numpy array of shape (n_samples_new, n_features): X_t after the Z-score normalization.
scalar : sklearn scaler object: The scaler used in the conversion.
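The key point of Z-normalization with a separate X_t is that the mean and standard deviation come from X only and are then reused for X_t. The NumPy sketch below shows that with a hypothetical `zscore_fit_transform` function; it does not return a scaler object and is not the library's implementation.

```python
import numpy as np


def zscore_fit_transform(X, X_t=None):
    """Hypothetical helper: standardize X, and X_t with X's statistics."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    X_norm = (X - mean) / std
    if X_t is None:
        return X_norm
    # New data is shifted and scaled with the training statistics.
    X_t_norm = (np.asarray(X_t, dtype=float) - mean) / std
    return X_norm, X_t_norm


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_t = np.array([[2.0, 20.0]])
X_norm, X_t_norm = zscore_fit_transform(X, X_t)
print(X_norm.mean(axis=0), X_norm.std(axis=0), X_t_norm)
```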

pyod.utils.persistence module¶

See Model Save and Load for the user-facing guide on saving and loading PyOD detectors, including cross-sklearn-version compatibility and strict mode.

Cross-sklearn-version model persistence for PyOD.

This module is the recommended way to save and load PyOD detectors. It wraps joblib with two capabilities the raw joblib.dump / joblib.load path does not provide:

- A versioned envelope written by save(). The envelope records the PyOD, sklearn, numpy, scipy, joblib, and Python versions in effect at save time. load() compares the envelope against the running environment and emits a UserWarning when any binary-format dependency drifts; load(…, strict=True) raises instead. This lets users detect dependency drift before it surprises them in production.
- A compat_load() helper that loads legacy artifacts whose sklearn Tree node dtype no longer matches the running sklearn (a recurring user pain documented in issue #519). compat_load uses joblib's own unpickler with the BUILD-opcode dispatch entry patched so that sklearn Tree state is realigned to the running dtype before sklearn.tree._tree.Tree.__setstate__ sees it.

load() automatically falls through to compat_load() when the underlying joblib.load raises the specific sklearn dtype ValueError, so users who only call load() get the rescue path transparently.

WARNING: pickle and joblib load arbitrary Python code. Load only from trusted sources. The compat_load helper does not change this security model.

See docs/model_persistence.rst for the user-facing guide.
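The envelope idea described above can be sketched with the standard library. This is a rough illustration under stated assumptions, not the module's code: save_sketch and check_drift are hypothetical helpers, the envelope keys ("versions", "metadata", "model") are invented for the example, pickle stands in for joblib, and the version strings in the drift check are made up.

```python
import os
import pickle
import platform
import tempfile
import warnings


def save_sketch(model, path, metadata=None):
    """Hypothetical: wrap the model in a header dict before pickling."""
    envelope = {
        # Only the Python version is recorded to keep the sketch stdlib-only.
        "versions": {"python": platform.python_version()},
        "metadata": metadata or {},
        "model": model,
    }
    with open(path, "wb") as fh:
        pickle.dump(envelope, fh)


def check_drift(saved, running, strict=False):
    """Hypothetical: compare recorded versions with the running ones."""
    drifted = {k: (v, running.get(k)) for k, v in saved.items()
               if running.get(k) != v}
    if drifted and strict:
        raise ValueError(f"dependency drift: {drifted}")
    if drifted:
        warnings.warn(f"dependency drift: {drifted}", UserWarning)
    return drifted


demo_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_sketch({"weights": [1, 2, 3]}, demo_path, metadata={"run_id": "demo"})

with open(demo_path, "rb") as fh:
    raw = pickle.load(fh)  # a raw load yields the envelope, not the model

# Made-up version strings, used only to trigger the warning path.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    drift = check_drift({"sklearn": "1.2.0"}, {"sklearn": "1.5.0"})
```

Passing strict=True to check_drift in this sketch raises ValueError instead of warning, mirroring the behaviour described for load(…, strict=True).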

pyod.utils.persistence.compat_load(path: Any, mmap_mode: str | None = None) → Any[source]¶

Load an artifact whose sklearn Tree node dtype no longer matches.

Mirrors joblib.load but plugs a dispatch-table override into joblib’s unpickler so that sklearn Tree state is realigned to the running sklearn dtype before Tree.__setstate__ raises.

Realignment is name-based and bounded by _TREE_NODE_FIELD_DEFAULTS plus _TREE_NODE_FIELD_RENAMES. Unknown added/removed fields, dtype kind/signedness/itemsize changes, and shape changes raise ValueError. Same-name byte-order-only differences realign safely.

Emits a UserWarning recommending re-fit ONLY when at least one Tree was actually realigned. A no-op pass-through on a non-tree artifact is silent.

Parameters¶

pathstr, pathlib.Path, or file-like: The artifact to load.
mmap_modestr or None, default None: Forwarded to joblib's underlying load path. Supported values mirror joblib's: None, 'r', 'r+', 'w+', 'c'.

Returns¶

objAny: The raw top-level object from the file (a fitted detector for legacy raw saves; an envelope dict for Phase 2 saves). Callers that need envelope unwrapping should use load().
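Name-based realignment of a structured array can be illustrated with NumPy. The sketch below is a simplified assumption of the idea, not the compat_load internals: realign_by_name is a hypothetical helper, the field names are illustrative, and field renames are left out.

```python
import numpy as np


def realign_by_name(state, target_dtype, defaults):
    """Hypothetical: copy fields by name into target_dtype, filling defaults."""
    target_dtype = np.dtype(target_dtype)
    src_names = set(state.dtype.names)
    tgt_names = set(target_dtype.names)
    removed = src_names - tgt_names
    if removed:
        raise ValueError(f"fields missing from target: {sorted(removed)}")
    unknown = (tgt_names - src_names) - set(defaults)
    if unknown:
        raise ValueError(f"no default for added fields: {sorted(unknown)}")
    out = np.zeros(state.shape, dtype=target_dtype)
    for name in target_dtype.names:
        if name not in src_names:
            out[name] = defaults[name]   # newly added field with a known default
            continue
        src, tgt = state.dtype[name], target_dtype[name]
        # Kind covers signedness ('i' vs 'u'); byte order is allowed to differ.
        if (src.kind, src.itemsize) != (tgt.kind, tgt.itemsize):
            raise ValueError(f"incompatible dtype for field {name!r}")
        out[name] = state[name]
    return out


old = np.array([(0, 0.5), (1, 1.5)],
               dtype=[("left_child", "<i8"), ("threshold", "<f8")])
new_dtype = [("left_child", "<i8"), ("threshold", "<f8"),
             ("missing_go_to_left", "u1")]
out = realign_by_name(old, new_dtype, defaults={"missing_go_to_left": 0})
```

Copying field by field means a byte-order-only difference is converted on assignment, while a change in kind or item size is rejected rather than silently cast.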

pyod.utils.persistence.load(path: Any, strict: bool = False, return_metadata: bool = False) → Any[source]¶

Load a PyOD detector saved by save() or by raw joblib.dump.

load() understands three input shapes:

An envelope dict written by save(). The envelope’s recorded dependency versions are compared against the running environment. Drift in sklearn, joblib, numpy, or scipy emits a UserWarning; strict=True raises ValueError instead.
A raw detector object written by joblib.dump(clf, path) on a previous PyOD release. Returned as-is when strict=False; raises under strict=True because legacy artifacts have no envelope to verify.
A file that fails the initial joblib.load with the sklearn Tree node dtype error. load() falls through to compat_load(path) and routes the recovered object through the same envelope/legacy handler. See module docstring.

Parameters¶

pathstr or pathlib.Path: Path to the artifact.
strictbool, default False: When True, version drift in any warn-severity dependency raises ValueError. info-severity drift (Python version) never raises. Legacy artifacts without an envelope also raise under strict mode.
return_metadatabool, default False: When True, return (model, envelope_without_model_field) instead of just the model. For legacy artifacts the second element is None.

Returns¶

modelAny: The unpickled model. When return_metadata=True, returns (model, envelope_dict_or_None).

Raises¶

ValueError: On schema-version mismatch, strict-mode drift, strict-mode legacy artifacts, or after a successful compat repair under strict mode.
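How the envelope and legacy shapes might be told apart, and how return_metadata shapes the result, can be sketched as follows. This is an assumption-laden illustration rather than the actual load() logic: unwrap_sketch is a hypothetical helper and the envelope keys "model" and "versions" are invented for the example.

```python
def unwrap_sketch(obj, strict=False, return_metadata=False):
    """Hypothetical: unwrap an envelope dict or pass a legacy object through."""
    is_envelope = isinstance(obj, dict) and "model" in obj and "versions" in obj
    if not is_envelope:
        # Legacy artifact: there is no envelope to verify.
        if strict:
            raise ValueError("legacy artifact has no envelope to verify")
        return (obj, None) if return_metadata else obj
    model = obj["model"]
    if return_metadata:
        # Return the envelope without its model field.
        meta = {k: v for k, v in obj.items() if k != "model"}
        return model, meta
    return model


envelope = {"model": "fitted-detector", "versions": {"sklearn": "1.5.0"}}
model_only = unwrap_sketch(envelope)
model, meta = unwrap_sketch(envelope, return_metadata=True)
legacy, legacy_meta = unwrap_sketch("raw-detector", return_metadata=True)
```

For a legacy object the second element is None, matching the description of return_metadata above.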

pyod.utils.persistence.save(model: Any, path: Any, metadata: dict | None = None) → None[source]¶

Save a fitted PyOD detector with a versioned envelope.

The envelope records every dependency version that can affect pickle/joblib layout, plus a save timestamp and a user-supplied metadata dict. The actual model object is written via joblib.dump; the only difference from raw joblib.dump(clf, path) is that the model is wrapped in a header dict the matching load() recognizes.

Parameters¶

modelAny: The fitted detector to save. Anything picklable will work; PyOD BaseDetector subclasses are the typical case.
pathstr or pathlib.Path: Destination file path.
metadatadict or None: Optional user-supplied metadata (training dataset id, feature schema hash, run id, anything). No schema is imposed; the dict round-trips as-is.

Returns¶

None

Notes¶

Loading the file with raw joblib.load returns the envelope dict, not the model. Use load() from this module to unwrap.