Utility Functions¶
pyod.utils.encoders module¶
Encoder abstraction for EmbeddingOD.
Provides BaseEncoder and concrete implementations for converting raw data (text, images) to numeric embeddings.
- class pyod.utils.encoders.BaseEncoder[source]¶
Bases: ABC
Abstract base class for embedding encoders.
All encoders must implement the encode method, which converts raw input data to a 2D numpy array of shape (n_samples, n_features).
- abstractmethod encode(X, batch_size=32, show_progress=True)[source]¶
Convert raw input to numeric embeddings.
Parameters¶
- Xlist or array-like
Raw input data.
- batch_sizeint, optional (default=32)
Batch size for encoding.
- show_progressbool, optional (default=True)
Whether to show a progress bar.
Returns¶
embeddings : numpy array of shape (n_samples, n_features)
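The encode contract can be illustrated without importing PyOD. The sketch below uses a stand-in abstract class; `SketchBaseEncoder` and `LengthEncoder` are hypothetical names, and only the method signature and the (n_samples, n_features) output shape are taken from the description above.

```python
from abc import ABC, abstractmethod

import numpy as np


class SketchBaseEncoder(ABC):
    """Stand-in for the abstract encoder interface (hypothetical, not PyOD's class)."""

    @abstractmethod
    def encode(self, X, batch_size=32, show_progress=True):
        """Return a 2D array of shape (n_samples, n_features)."""


class LengthEncoder(SketchBaseEncoder):
    """Toy encoder: maps each string to [character count, word count]."""

    def encode(self, X, batch_size=32, show_progress=True):
        rows = []
        # Walk the input in chunks of batch_size, as a real encoder would.
        for start in range(0, len(X), batch_size):
            batch = X[start:start + batch_size]
            rows.extend([len(s), len(s.split())] for s in batch)
        return np.asarray(rows, dtype=float).reshape(len(X), 2)


toy_embeddings = LengthEncoder().encode(["hello world", "anomaly"], batch_size=1)
```

The resulting 2D array is the kind of input any detector that consumes a numeric feature matrix could accept.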
- class pyod.utils.encoders.CallableEncoder(fn)[source]¶
Bases: BaseEncoder
Encoder that wraps a user-provided callable.
Parameters¶
- fncallable
A function that accepts raw input and returns a numpy array of shape (n_samples, n_features).
Examples¶
>>> import numpy as np
>>> from pyod.utils.encoders import CallableEncoder
>>> encoder = CallableEncoder(fn=lambda X: np.random.randn(len(X), 10))
>>> embeddings = encoder.encode(["hello", "world"])
>>> embeddings.shape
(2, 10)
- encode(X, batch_size=32, show_progress=True)[source]¶
Convert raw input to numeric embeddings.
Parameters¶
- Xlist or array-like
Raw input data.
- batch_sizeint, optional (default=32)
Batch size for encoding.
- show_progressbool, optional (default=True)
Whether to show a progress bar.
Returns¶
embeddings : numpy array of shape (n_samples, n_features)
- class pyod.utils.encoders.MultiModalEncoder(encoders, weights=None)[source]¶
Bases: BaseEncoder
Encode multiple modalities and concatenate into a single embedding.
Each modality is encoded by its own encoder. The resulting embeddings are concatenated column-wise into a single feature matrix suitable for any PyOD detector.
Parameters¶
- encodersdict of {str: encoder}
Maps modality name to encoder. Each value can be:
  - A string (resolved via resolve_encoder at encode time)
  - A BaseEncoder instance
  - 'passthrough' for pre-computed numeric features
- weightsdict of {str: float} or None, optional (default=None)
Per-modality scaling applied after encoding. Useful when embedding dimensions differ significantly across modalities.
Examples¶
>>> import numpy as np
>>> from pyod.utils.encoders import MultiModalEncoder
>>> encoder = MultiModalEncoder({
...     'text': 'all-MiniLM-L6-v2',
...     'tabular': 'passthrough',
... })
>>> data = {'text': ["hello", "world"], 'tabular': np.array([[1, 2], [3, 4]])}
>>> embeddings = encoder.encode(data)
>>> embeddings.shape[0]
2
- encode(X, batch_size=32, show_progress=True)[source]¶
Encode multi-modal input and concatenate.
Parameters¶
- Xdict of {str: data}
Maps modality name to input data. Keys must match the encoders dict. Individual samples may be None to indicate a missing modality for that sample; missing embeddings are imputed with the training mean (if fit_encode was called) or zeros.
- batch_sizeint, optional (default=32)
Batch size for encoding.
- show_progressbool, optional (default=True)
Show progress bar.
Returns¶
embeddings : numpy array of shape (n_samples, total_features)
- fit_encode(X, batch_size=32, show_progress=True)[source]¶
Encode training data and store per-modality mean embeddings.
Call this during training (EmbeddingOD.fit) so that mean embeddings are available for imputing missing samples at test time. Subsequent calls to encode will use these stored means.
Parameters¶
- Xdict of {str: data}
Training data. Should not contain None samples.
Returns¶
embeddings : numpy array of shape (n_samples, total_features)
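The weighting, column-wise concatenation, and missing-sample imputation described for MultiModalEncoder can be sketched with plain numpy. This is an illustration under assumptions, not the library's implementation: `concat_modalities` and `impute_missing` are hypothetical helpers, and the per-modality embeddings are supplied directly instead of being produced by real encoders.

```python
import numpy as np


def concat_modalities(embeddings, weights=None):
    """Scale each modality's matrix by its weight, then concatenate column-wise."""
    weights = weights or {}
    blocks = [np.asarray(emb, dtype=float) * weights.get(name, 1.0)
              for name, emb in embeddings.items()]
    return np.hstack(blocks)


def impute_missing(rows, dim, train_mean=None):
    """Replace None rows with the training mean if available, otherwise zeros."""
    fill = np.zeros(dim) if train_mean is None else np.asarray(train_mean, dtype=float)
    return np.vstack([fill if row is None else np.asarray(row, dtype=float)
                      for row in rows])


# Second text sample is missing and receives the (assumed) training mean.
text_emb = impute_missing([[1.0, 3.0], None], dim=2, train_mean=[2.0, 2.0])
tab_emb = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
combined = concat_modalities({"text": text_emb, "tabular": tab_emb},
                             weights={"text": 0.5})
```

Here total_features is 2 + 3 = 5, and the text block is halved before concatenation.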
- pyod.utils.encoders.resolve_encoder(encoder)[source]¶
Resolve an encoder from various input types.
Parameters¶
- encoderstr, BaseEncoder, or callable
If BaseEncoder instance, returned as-is.
If callable, wrapped in CallableEncoder.
If string, looked up in the encoder registry. If not found, tries sentence-transformers first, then HuggingFace AutoModel. The auto-resolve fallback is designed for text embedding models. For image models (DINOv2, CLIP, etc.), use registry shortcuts (e.g., ‘dinov2-small’, ‘clip-vit-base’) instead of raw HuggingFace model IDs.
Returns¶
encoder : BaseEncoder
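The resolution order described above (instance, callable, string) can be sketched as a small dispatch function. All names below (`SketchEncoder`, `SketchCallableEncoder`, `SKETCH_REGISTRY`, `sketch_resolve`) are hypothetical stand-ins, and the model-loading fallback for unknown strings is deliberately replaced by an error.

```python
class SketchEncoder:
    """Hypothetical minimal encoder base, standing in for BaseEncoder."""

    def encode(self, X):
        raise NotImplementedError


class SketchCallableEncoder(SketchEncoder):
    """Wraps a plain function so that it exposes an encode method."""

    def __init__(self, fn):
        self.fn = fn

    def encode(self, X):
        return self.fn(X)


# Hypothetical registry: maps a shortcut name to an encoder factory.
SKETCH_REGISTRY = {
    "char-count": lambda: SketchCallableEncoder(lambda X: [[len(s)] for s in X]),
}


def sketch_resolve(encoder):
    # 1. Encoder instances are returned as-is.
    if isinstance(encoder, SketchEncoder):
        return encoder
    # 2. Callables are wrapped.
    if callable(encoder):
        return SketchCallableEncoder(encoder)
    # 3. Strings are looked up in the registry.
    if isinstance(encoder, str):
        if encoder in SKETCH_REGISTRY:
            return SKETCH_REGISTRY[encoder]()
        # The library tries model loading here; this sketch simply fails.
        raise ValueError(f"unknown encoder name: {encoder!r}")
    raise TypeError(f"unsupported encoder type: {type(encoder).__name__}")


existing = SketchCallableEncoder(len)
resolved_from_name = sketch_resolve("char-count")
```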
pyod.utils.encoders.sentence_transformer module¶
SentenceTransformerEncoder for EmbeddingOD.
- class pyod.utils.encoders.sentence_transformer.SentenceTransformerEncoder(model_name='all-MiniLM-L6-v2', device=None, normalize=False, truncate_dim=None)[source]¶
Bases: BaseEncoder
Encoder using the sentence-transformers library.
Wraps sentence_transformers.SentenceTransformer to produce text embeddings compatible with PyOD detectors.
Parameters¶
- model_namestr, optional (default=’all-MiniLM-L6-v2’)
Name or path of a sentence-transformers model.
- devicestr or None, optional (default=None)
Device for inference (‘cpu’, ‘cuda’, etc.). None for auto-detection.
- normalizebool, optional (default=False)
L2-normalize output embeddings.
- truncate_dimint or None, optional (default=None)
Truncate embeddings to this dimensionality (Matryoshka).
Examples¶
>>> from pyod.utils.encoders.sentence_transformer import \
...     SentenceTransformerEncoder
>>> encoder = SentenceTransformerEncoder('all-MiniLM-L6-v2')
>>> embeddings = encoder.encode(["hello world", "anomaly text"])
>>> embeddings.shape
(2, 384)
- encode(X, batch_size=32, show_progress=True)[source]¶
Encode text strings to embeddings.
Parameters¶
- Xlist of str
Text strings to encode.
- batch_sizeint, optional (default=32)
Batch size for encoding.
- show_progressbool, optional (default=True)
Show progress bar.
Returns¶
embeddings : numpy array of shape (n_samples, n_features)
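The effect of truncate_dim and normalize on an embedding matrix can be shown with numpy alone. The sketch assumes truncation keeps the leading dimensions and is applied before L2 normalization; that ordering and the helper name `postprocess` are assumptions, not confirmed behaviour of the encoder.

```python
import numpy as np


def postprocess(embeddings, truncate_dim=None, normalize=False):
    """Optionally keep the first truncate_dim columns, then L2-normalize rows."""
    out = np.asarray(embeddings, dtype=float)
    if truncate_dim is not None:
        out = out[:, :truncate_dim]  # Matryoshka-style: keep leading dimensions
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.clip(norms, 1e-12, None)  # guard against zero rows
    return out


raw = np.array([[3.0, 4.0, 12.0], [0.0, 5.0, 1.0]])
processed = postprocess(raw, truncate_dim=2, normalize=True)
```

After this step every row of `processed` has unit Euclidean norm.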
pyod.utils.encoders.openai_encoder module¶
OpenAIEncoder for EmbeddingOD.
- class pyod.utils.encoders.openai_encoder.OpenAIEncoder(model_name='text-embedding-3-small', dimensions=None, api_key=None)[source]¶
Bases: BaseEncoder
Encoder using the OpenAI Embeddings API.
Produces text embeddings via the OpenAI API. Handles batching (max 2048 items per request) internally.
Parameters¶
- model_namestr, optional (default=’text-embedding-3-small’)
OpenAI embedding model name.
- dimensionsint or None, optional (default=None)
Truncate embeddings to this dimensionality (Matryoshka). Only supported by text-embedding-3-* models.
- api_keystr or None, optional (default=None)
OpenAI API key. Falls back to OPENAI_API_KEY environment variable.
Examples¶
>>> from pyod.utils.encoders.openai_encoder import OpenAIEncoder
>>> encoder = OpenAIEncoder('text-embedding-3-small')
>>> embeddings = encoder.encode(["normal text", "anomalous text"])
- encode(X, batch_size=2048, show_progress=True)[source]¶
Encode text strings to embeddings via OpenAI API.
Parameters¶
- Xlist of str
Text strings to encode.
- batch_sizeint, optional (default=2048)
Batch size. Capped at 2048 (OpenAI API limit).
- show_progressbool, optional (default=True)
Show progress bar (not used for API calls).
Returns¶
embeddings : numpy array of shape (n_samples, n_features)
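The batch-size cap can be pictured as simple chunking of the input list. The helper below is a hypothetical sketch that makes no API calls; only the 2048-item limit is taken from the description above.

```python
API_MAX_BATCH = 2048  # per-request item limit stated in the text above


def iter_batches(items, batch_size=2048):
    """Yield consecutive slices of at most min(batch_size, API_MAX_BATCH) items."""
    size = max(1, min(batch_size, API_MAX_BATCH))
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = [f"doc {i}" for i in range(5000)]
# A requested batch size above the limit is capped at 2048.
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=4096)]
```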
pyod.utils.encoders.huggingface module¶
HuggingFaceEncoder for EmbeddingOD.
- class pyod.utils.encoders.huggingface.HuggingFaceEncoder(model_name, device=None, pooling='cls', modality='text')[source]¶
Bases: BaseEncoder
Encoder using HuggingFace transformers.
Supports both text (AutoTokenizer + AutoModel) and image (AutoImageProcessor + AutoModel) modalities.
Parameters¶
- model_namestr
HuggingFace model name or path.
- devicestr or None, optional (default=None)
Device for inference. None for auto-detection.
- poolingstr, optional (default=’cls’)
Pooling strategy: ‘cls’ for CLS token, ‘mean’ for mean of all token embeddings.
- modalitystr, optional (default=’text’)
Input modality: ‘text’ or ‘image’.
Examples¶
>>> from pyod.utils.encoders.huggingface import HuggingFaceEncoder
>>> encoder = HuggingFaceEncoder('bert-base-uncased', modality='text')
>>> embeddings = encoder.encode(["hello", "world"])
- encode(X, batch_size=32, show_progress=True)[source]¶
Encode text or images to embeddings.
Parameters¶
- Xlist of str (text) or list of PIL.Image (image)
Input data.
- batch_sizeint, optional (default=32)
Batch size for encoding.
- show_progressbool, optional (default=True)
Show progress bar.
Returns¶
embeddings : numpy array of shape (n_samples, n_features)
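The two pooling strategies can be illustrated on a toy hidden-state tensor of shape (batch, tokens, hidden). The sketch assumes the CLS token sits at position 0 and that there is no padding to mask out; `pool` is a hypothetical helper, not the encoder's code.

```python
import numpy as np


def pool(hidden_states, pooling="cls"):
    """Reduce (batch, tokens, hidden) to (batch, hidden)."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    if pooling == "cls":
        return hidden_states[:, 0, :]      # first token of every sequence
    if pooling == "mean":
        return hidden_states.mean(axis=1)  # average over all tokens
    raise ValueError(f"unknown pooling: {pooling!r}")


hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
cls_vectors = pool(hidden, "cls")
mean_vectors = pool(hidden, "mean")
```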
pyod.utils.data module¶
Utility functions for manipulating data
- pyod.utils.data.check_consistent_shape(X_train, y_train, X_test, y_test, y_train_pred, y_test_pred)[source]¶
Internal function to check that input data shapes are consistent.
Parameters¶
- X_trainnumpy array of shape (n_samples, n_features)
The training samples.
- y_trainlist or array of shape (n_samples,)
The ground truth of training samples.
- X_testnumpy array of shape (n_samples, n_features)
The test samples.
- y_testlist or array of shape (n_samples,)
The ground truth of test samples.
- y_train_prednumpy array of shape (n_samples,)
The predicted binary labels of the training samples.
- y_test_prednumpy array of shape (n_samples,)
The predicted binary labels of the test samples.
Returns¶
- X_trainnumpy array of shape (n_samples, n_features)
The training samples.
- y_trainlist or array of shape (n_samples,)
The ground truth of training samples.
- X_testnumpy array of shape (n_samples, n_features)
The test samples.
- y_testlist or array of shape (n_samples,)
The ground truth of test samples.
- y_train_prednumpy array of shape (n_samples,)
The predicted binary labels of the training samples.
- y_test_prednumpy array of shape (n_samples,)
The predicted binary labels of the test samples.
- pyod.utils.data.evaluate_print(clf_name, y, y_pred)[source]¶
Utility function for evaluating and printing the results for examples. Default metrics include ROC AUC and precision @ rank n.
Parameters¶
- clf_namestr
The name of the detector.
- ylist or numpy array of shape (n_samples,)
The ground truth. Binary (0: inliers, 1: outliers).
- y_predlist or numpy array of shape (n_samples,)
The raw outlier scores as returned by a fitted model.
- pyod.utils.data.generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1, train_only=False, offset=10, behaviour='new', random_state=None, n_nan=0, n_inf=0)[source]¶
Utility function to generate synthesized data. Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution. “X_train, X_test, y_train, y_test” are returned.
Parameters¶
- n_trainint, (default=1000)
The number of training points to generate.
- n_testint, (default=500)
The number of test points to generate.
- n_featuresint, optional (default=2)
The number of features (dimensions).
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- train_onlybool, optional (default=False)
If true, generate train data only.
- offsetint, optional (default=10)
Adjust the value range of Gaussian and Uniform.
- behaviourstr, default=’new’
Behaviour of the returned datasets, which can be either ‘old’ or ‘new’. Passing behaviour='new' returns “X_train, X_test, y_train, y_test”, while passing behaviour='old' returns “X_train, y_train, X_test, y_test”.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- n_nanint
The number of values that are missing (np.nan). Defaults to zero.
- n_infint
The number of values that are infinite. (np.inf). Defaults to zero.
Returns¶
- X_trainnumpy array of shape (n_train, n_features)
Training data.
- X_testnumpy array of shape (n_test, n_features)
Test data.
- y_trainnumpy array of shape (n_train,)
Training ground truth.
- y_testnumpy array of shape (n_test,)
Test ground truth.
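The idea of Gaussian inliers plus uniform outliers can be sketched in a few lines of numpy. The location, scale, and range below are illustrative assumptions, and `sketch_generate` is a hypothetical function that does not reproduce generate_data's exact distributions or its train/test split.

```python
import numpy as np


def sketch_generate(n_samples=100, n_features=2, contamination=0.1,
                    offset=10, seed=0):
    """Gaussian inliers and uniform outliers with binary ground truth."""
    rng = np.random.default_rng(seed)
    n_outliers = int(round(n_samples * contamination))
    n_inliers = n_samples - n_outliers
    # Assumed parameters: inliers centred at `offset`, outliers spread widely.
    inliers = rng.normal(loc=offset, scale=1.0, size=(n_inliers, n_features))
    outliers = rng.uniform(low=-offset, high=offset, size=(n_outliers, n_features))
    X = np.vstack([inliers, outliers])
    y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
    return X, y


X_demo, y_demo = sketch_generate()
```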
- pyod.utils.data.generate_data_categorical(n_train=1000, n_test=500, n_features=2, n_informative=2, n_category_in=2, n_category_out=2, contamination=0.1, shuffle=True, random_state=None)[source]¶
Utility function to generate synthesized categorical data.
Parameters¶
- n_trainint, (default=1000)
The number of training points to generate.
- n_testint, (default=500)
The number of test points to generate.
- n_featuresint, optional (default=2)
The number of features for each sample.
- n_informativeint in (1, n_features), optional (default=2)
The number of informative features in the outlier points. The higher the value, the easier the outlier detection should be. Note that n_informative should not exceed n_features.
- n_category_inint in (1, n_inliers), optional (default=2)
The number of categories in the inlier points.
- n_category_outint in (1, n_outliers), optional (default=2)
The number of categories in the outlier points.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
- shufflebool, optional (default=True)
If True, inliers will be shuffled, which makes the distribution noisier.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Returns¶
- X_trainnumpy array of shape (n_train, n_features)
Training data.
- y_trainnumpy array of shape (n_train,)
Training ground truth.
- X_testnumpy array of shape (n_test, n_features)
Test data.
- y_testnumpy array of shape (n_test,)
Test ground truth.
- pyod.utils.data.generate_data_clusters(n_train=1000, n_test=500, n_clusters=2, n_features=2, contamination=0.1, size='same', density='same', dist=0.25, random_state=None, return_in_clusters=False)[source]¶
Utility function to generate synthesized data in clusters.
Generated data can involve the low-density pattern problem and global outliers, which are considered difficult tasks for outlier detection algorithms.
Parameters¶
- n_trainint, (default=1000)
The number of training points to generate.
- n_testint, (default=500)
The number of test points to generate.
- n_clustersint, optional (default=2)
The number of centers (i.e. clusters) to generate.
- n_featuresint, optional (default=2)
The number of features for each sample.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
- sizestr, optional (default=’same’)
Size of each cluster: ‘same’ generates clusters with the same size, ‘different’ generates clusters with different sizes.
- densitystr, optional (default=’same’)
Density of each cluster: ‘same’ generates clusters with the same density, ‘different’ generates clusters with different densities.
- distfloat, optional (default=0.25)
Distance between clusters. Should be between 0. and 1.0. It is used to avoid clusters overlapping as much as possible. However, if the number of samples and the number of clusters are too high, it is unlikely to separate them fully even if dist is set to 1.0.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- return_in_clustersbool, optional (default=False)
If True, the function returns X_train, y_train, X_test, y_test each as a list of numpy arrays where each index represents a cluster. If False, it returns X_train, y_train, X_test, y_test each as a numpy array after joining the sequence of cluster arrays.
Returns¶
- X_trainnumpy array of shape (n_train, n_features)
Training data.
- y_trainnumpy array of shape (n_train,)
Training ground truth.
- X_testnumpy array of shape (n_test, n_features)
Test data.
- y_testnumpy array of shape (n_test,)
Test ground truth.
- pyod.utils.data.generate_graph_data(n_nodes=300, n_features=16, n_edges_per_node=5, contamination=0.1, random_state=None)[source]¶
Generate synthetic attributed graph data with planted anomalies.
Normal nodes have features from N(0, 1). Anomaly nodes have features shifted by +5 standard deviations. Edges are generated via random neighbor selection (undirected, no self-loops, no duplicates).
Parameters¶
- n_nodesint, default=300
Number of nodes.
- n_featuresint, default=16
Dimensionality of node features.
- n_edges_per_nodeint, default=5
Average number of edges per node (Poisson-sampled per node).
- contaminationfloat, default=0.1
Fraction of nodes that are anomalies.
- random_stateint, RandomState or None, default=None
Seed for reproducibility.
Returns¶
- Xnp.ndarray of shape (n_nodes, n_features)
Node feature matrix (float32).
- edge_indexnp.ndarray of shape (2, n_edges)
COO-format edge list (int64, undirected, no self-loops).
- ynp.ndarray of shape (n_nodes,)
Binary labels: 0 = normal, 1 = anomaly.
- pyod.utils.data.generate_ts_data(n_train=500, n_test=200, n_channels=1, contamination=0.05, period=50, noise_std=0.3, anomaly_type='point', random_state=None)[source]¶
Generate synthetic time series data with injected anomalies.
Creates a sinusoidal base signal with Gaussian noise and injects anomalies at random locations. Follows conventions from the TS-AD literature (e.g., TSB-AD benchmark).
Parameters¶
- n_trainint, optional (default=500)
Length of training time series.
- n_testint, optional (default=200)
Length of test time series.
- n_channelsint, optional (default=1)
Number of channels (univariate=1, multivariate>1).
- contaminationfloat, optional (default=0.05)
Fraction of timestamps that are anomalous (approximately). For subsequence anomalies, the total labeled timestamps are controlled to stay near this fraction.
- periodint, optional (default=50)
Period of the sinusoidal base signal.
- noise_stdfloat, optional (default=0.3)
Standard deviation of Gaussian noise.
- anomaly_typestr, optional (default=’point’)
Type of anomaly: ‘point’ (spikes), ‘subsequence’ (shape change), or ‘both’.
- random_stateint, RandomState instance, or None (default=None)
Random seed for reproducibility.
Returns¶
- X_trainnp.ndarray of shape (n_train,) or (n_train, n_channels)
Training time series. Univariate returned as 1D.
- X_testnp.ndarray of shape (n_test,) or (n_test, n_channels)
Test time series.
- y_trainnp.ndarray of shape (n_train,)
Binary labels (1=anomaly, 0=normal) for training.
- y_testnp.ndarray of shape (n_test,)
Binary labels for test.
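A sinusoidal base signal with Gaussian noise and injected point anomalies can be sketched as follows. The spike magnitude and the helper name `sketch_ts` are assumptions chosen for illustration; the sketch covers only the univariate 'point' case and does not reproduce generate_ts_data.

```python
import numpy as np


def sketch_ts(n=200, period=50, noise_std=0.3, contamination=0.05, seed=0):
    """Univariate sine wave plus noise, with spikes at random timestamps."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    signal = np.sin(2 * np.pi * t / period) + rng.normal(0.0, noise_std, size=n)
    labels = np.zeros(n, dtype=int)
    n_anomalies = max(1, int(round(n * contamination)))
    idx = rng.choice(n, size=n_anomalies, replace=False)
    # Assumed spike size: 5 units up or down, well outside the noise band.
    signal[idx] += 5.0 * rng.choice([-1.0, 1.0], size=n_anomalies)
    labels[idx] = 1
    return signal, labels


ts_values, ts_labels = sketch_ts()
```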
- pyod.utils.data.get_outliers_inliers(X, y)[source]¶
Internal method to separate inliers from outliers.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples
- ylist or array of shape (n_samples,)
The ground truth of input samples.
Returns¶
- X_outliersnumpy array of shape (n_samples, n_features)
Outliers.
- X_inliersnumpy array of shape (n_samples, n_features)
Inliers.
pyod.utils.example module¶
Utility functions for running examples
- pyod.utils.example.data_visualize(X_train, y_train, show_figure=True, save_figure=False)[source]¶
Utility function for visualizing the synthetic samples generated by the generate_data_clusters function.
Parameters¶
- X_trainnumpy array of shape (n_samples, n_features)
The training samples.
- y_trainlist or array of shape (n_samples,)
The ground truth of training samples.
- show_figurebool, optional (default=True)
If set to True, show the figure.
- save_figurebool, optional (default=False)
If set to True, save the figure to local disk.
- pyod.utils.example.visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)[source]¶
Utility function for visualizing the results in examples. Internal use only.
Parameters¶
- clf_namestr
The name of the detector.
- X_trainnumpy array of shape (n_samples, n_features)
The training samples.
- y_trainlist or array of shape (n_samples,)
The ground truth of training samples.
- X_testnumpy array of shape (n_samples, n_features)
The test samples.
- y_testlist or array of shape (n_samples,)
The ground truth of test samples.
- y_train_prednumpy array of shape (n_samples,)
The predicted binary labels of the training samples.
- y_test_prednumpy array of shape (n_samples,)
The predicted binary labels of the test samples.
- show_figurebool, optional (default=True)
If set to True, show the figure.
- save_figurebool, optional (default=False)
If set to True, save the figure to local disk.
pyod.utils.stat_models module¶
A collection of statistical models
- pyod.utils.stat_models.column_ecdf(matrix: ndarray) ndarray[source]¶
Utility function to compute the column wise empirical cumulative distribution of a 2D feature matrix, where the rows are samples and the columns are features per sample. The accumulation is done in the positive direction of the sample axis.
For example, given p(1) = 0.2, p(0) = 0.3, p(2) = 0.1, p(6) = 0.4 and the ECDF defined as E(x) = p(X <= x), so that E(5) = p(X <= 5), the ECDF E would be E(-1) = 0, E(0) = 0.3, E(1) = 0.5, E(2) = 0.6, E(3) = 0.6, E(4) = 0.6, E(5) = 0.6, E(6) = 1.
Similar to and tested against: https://www.statsmodels.org/stable/generated/statsmodels.distributions.empirical_distribution.ECDF.html
Returns¶
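The worked ECDF example can be reproduced for a single column with numpy. The ten samples below are chosen so that their relative frequencies match p(0) = 0.3, p(1) = 0.2, p(2) = 0.1, p(6) = 0.4; `ecdf_at` is a hypothetical helper and not the column_ecdf implementation.

```python
import numpy as np


def ecdf_at(samples, query):
    """Fraction of samples that are <= each query value."""
    sorted_samples = np.sort(np.asarray(samples))
    # side="right" returns the count of samples <= query.
    counts = np.searchsorted(sorted_samples, query, side="right")
    return counts / sorted_samples.size


samples = [0, 0, 0, 1, 1, 2, 6, 6, 6, 6]
ecdf_values = ecdf_at(samples, [-1, 0, 1, 2, 5, 6])
```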
- pyod.utils.stat_models.ecdf_terminate_equals_inplace(matrix: ndarray, probabilities: ndarray)[source]¶
This is a helper function for computing the ECDF of an array. It has been factored out of the original function so that it can be compiled with Numba's njit for increased speed, as it needs a loop over all rows and columns of a matrix. It acts in place on the probabilities matrix.
Parameters¶
matrix : a feature matrix where the rows are samples and each column is a feature (expected to be sorted).
probabilities : a probability matrix that will be used for building the ECDF. It has values between 0 and 1 and is also sorted.
Returns¶
- pyod.utils.stat_models.pairwise_distances_no_broadcast(X, Y)[source]¶
Utility function to calculate the row-wise Euclidean distance of two matrices. Unlike a pair-wise calculation, this function does not broadcast.
For instance, X and Y are both (4,3) matrices, the function would return a distance vector with shape (4,), instead of (4,4).
Parameters¶
- Xarray of shape (n_samples, n_features)
First input samples
- Yarray of shape (n_samples, n_features)
Second input samples
Returns¶
- distancearray of shape (n_samples,)
Row-wise euclidean distance of X and Y
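The difference from a full pairwise computation is easiest to see on the (4, 3) case mentioned above. The helper `rowwise_euclidean` is a hypothetical numpy sketch of the row-wise distance, not the library function.

```python
import numpy as np


def rowwise_euclidean(X, Y):
    """Distance between X[i] and Y[i] for every row i; no (n, n) matrix is built."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    return np.sqrt(np.sum((X - Y) ** 2, axis=1))


X_rows = np.zeros((4, 3))
Y_rows = np.array([[3, 4, 0], [0, 0, 0], [1, 2, 2], [0, 6, 8]], dtype=float)
distances = rowwise_euclidean(X_rows, Y_rows)
```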
- pyod.utils.stat_models.pearsonr_mat(mat, w=None)[source]¶
Utility function to calculate the Pearson correlation matrix (row-wise).
Parameters¶
- matnumpy array of shape (n_samples, n_features)
Input matrix.
- wnumpy array of shape (n_features,)
Weights.
Returns¶
- pear_matnumpy array of shape (n_samples, n_samples)
Row-wise pearson score matrix.
- pyod.utils.stat_models.wpearsonr(x, y, w=None)[source]¶
Utility function to calculate the weighted Pearson correlation of two samples.
See https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation for more information
Parameters¶
- xarray, shape (n,)
Input x.
- yarray, shape (n,)
Input y.
- warray, shape (n,)
Weights w.
Returns¶
- scoresfloat in range of [-1,1]
Weighted Pearson Correlation between x and y.
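A weighted Pearson correlation replaces the plain means, variances, and covariance with their weighted counterparts. The sketch below follows that standard definition; `weighted_pearson` is a hypothetical helper, and treating w=None as uniform weights is an assumption.

```python
import numpy as np


def weighted_pearson(x, y, w=None):
    """Weighted covariance divided by the product of weighted standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    mean_x = np.average(x, weights=w)
    mean_y = np.average(y, weights=w)
    cov_xy = np.average((x - mean_x) * (y - mean_y), weights=w)
    var_x = np.average((x - mean_x) ** 2, weights=w)
    var_y = np.average((y - mean_y) ** 2, weights=w)
    return float(cov_xy / np.sqrt(var_x * var_y))


# A perfectly linear relation gives +1 (or -1) regardless of the weights.
score_pos = weighted_pearson([1, 2, 3, 4], [3, 5, 7, 9], w=[1, 2, 3, 4])
score_neg = weighted_pearson([1, 2, 3, 4], [4, 3, 2, 1])
```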
pyod.utils.utility module¶
A set of utility functions to support outlier detection.
- pyod.utils.utility.argmaxn(value_list, n, order='desc')[source]¶
Return the index of top n elements in the list if order is set to ‘desc’, otherwise return the index of n smallest ones.
Parameters¶
- value_listlist, array, numpy array of shape (n_samples,)
A list containing all values.
- nint
The number of elements to select.
- orderstr, optional (default=’desc’)
The order to sort {‘desc’, ‘asc’}:
‘desc’: descending
‘asc’: ascending
Returns¶
- index_listnumpy array of shape (n,)
The index of the top n elements.
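Selecting the indices of the n largest or smallest values can be sketched with a full sort. `top_n_indices` is a hypothetical helper; the order in which the library returns the selected indices may differ from this sketch.

```python
import numpy as np


def top_n_indices(values, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values."""
    ranked = np.argsort(np.asarray(values), kind="stable")  # ascending order
    if order == "desc":
        ranked = ranked[::-1]
    elif order != "asc":
        raise ValueError("order must be 'desc' or 'asc'")
    return ranked[:n]


largest = top_n_indices([0.1, 0.9, 0.4, 0.7], 2)
smallest = top_n_indices([0.1, 0.9, 0.4, 0.7], 2, order="asc")
```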
- pyod.utils.utility.check_detector(detector)[source]¶
Checks if fit and decision_function methods exist for given detector
Parameters¶
- detectorpyod.models
Detector instance for which the check is performed.
- pyod.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)[source]¶
Check if an input is within the defined range.
Parameters¶
- paramint, float
The input parameter to check.
- lowint, float
The lower bound of the range.
- highint, float
The higher bound of the range.
- param_namestr, optional (default=’’)
The name of the parameter.
- include_leftbool, optional (default=False)
Whether to include the lower bound (lower bound <=).
- include_rightbool, optional (default=False)
Whether to include the higher bound (<= higher bound).
Returns¶
- within_rangebool or raise errors
Whether the parameter is within the range of (low, high)
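The role of include_left and include_right is to switch each bound between strict and non-strict comparison. The sketch below shows only that logic; `in_range` is a hypothetical helper that returns a bool and never raises.

```python
def in_range(param, low, high, include_left=False, include_right=False):
    """True if param lies between low and high under the chosen bound inclusivity."""
    left_ok = param >= low if include_left else param > low
    right_ok = param <= high if include_right else param < high
    return left_ok and right_ok


# 0.5 sits exactly on the upper bound: excluded by default, included on request.
strict_result = in_range(0.5, 0.0, 0.5)
inclusive_result = in_range(0.5, 0.0, 0.5, include_right=True)
```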
- pyod.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)[source]¶
Randomly draw feature indices. Internal use only.
Modified from sklearn/ensemble/bagging.py
Parameters¶
- random_stateRandomState
A random number generator instance to define the state of the random permutations generator.
- bootstrap_featuresbool
Specifies whether to bootstrap index generation
- n_featuresint
Specifies the population size when generating indices
- min_featuresint
Lower limit for number of features to randomly sample
- max_featuresint
Upper limit for number of features to randomly sample
Returns¶
- feature_indicesnumpy array, shape (n_samples,)
Indices for features to bag
- pyod.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)[source]¶
Draw randomly sampled indices. Internal use only.
See sklearn/ensemble/bagging.py
Parameters¶
- random_stateRandomState
A random number generator instance to define the state of the random permutations generator.
- bootstrapbool
Specifies whether to bootstrap index generation
- n_populationint
Specifies the population size when generating indices
- n_samplesint
Specifies number of samples to draw
Returns¶
- indicesnumpy array, shape (n_samples,)
randomly drawn indices
- pyod.utils.utility.get_diff_elements(li1, li2)[source]¶
Get the elements in li1 but not in li2, and vice versa.
Parameters¶
- li1list or numpy array
Input list 1.
- li2list or numpy array
Input list 2.
Returns¶
- differencelist
The difference between li1 and li2.
- pyod.utils.utility.get_intersection(lst1, lst2)[source]¶
Get the overlap between two lists.
Parameters¶
- lst1list or numpy array
Input list 1.
- lst2list or numpy array
Input list 2.
Returns¶
- intersectionlist
The overlap between lst1 and lst2.
- pyod.utils.utility.get_label_n(y, y_pred, n=None)[source]¶
Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.
Parameters¶
- ylist or numpy array of shape (n_samples,)
The ground truth. Binary (0: inliers, 1: outliers).
- y_predlist or numpy array of shape (n_samples,)
The raw outlier scores as returned by a fitted model.
- nint, optional (default=None)
The number of outliers. If not defined, it is inferred from the ground truth.
Returns¶
- labelsnumpy array of shape (n_samples,)
Binary labels (0: normal points, 1: outliers).
Examples¶
>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])
- pyod.utils.utility.get_list_diff(li1, li2)[source]¶
Get the elements in li1 but not in li2 (li1 - li2).
Parameters¶
- li1list or numpy array
Input list 1.
- li2list or numpy array
Input list 2.
Returns¶
- differencelist
The difference between li1 and li2.
- pyod.utils.utility.get_optimal_n_bins(X, upper_bound=None, epsilon=1)[source]¶
Determine the optimal number of bins for a histogram using the Birgé-Rozenholc method (see [MBirgeR06] for details).
See https://doi.org/10.1051/ps:2006001
Parameters¶
- Xarray-like of shape (n_samples, n_features)
The samples to determine the optimal number of bins for.
- upper_boundint, default=None
The maximum value of n_bins to be considered. If set to None, np.sqrt(X.shape[0]) will be used as upper bound.
- epsilonfloat, default = 1
A stabilizing term added to the logarithm to prevent division by zero.
Returns¶
- optimal_n_binsint
The optimal value of n_bins according to the Birgé-Rozenholc method
- pyod.utils.utility.invert_order(scores, method='multiplication')[source]¶
Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful while combining multiple detectors since their score order could be different.
Parameters¶
- scoreslist, array or numpy array with shape (n_samples,)
The list of values to be inverted
- methodstr, optional (default=’multiplication’)
Methods used for order inversion. Valid methods are:
‘multiplication’: multiply by -1
‘subtraction’: max(scores) - scores
Returns¶
- inverted_scoresnumpy array of shape (n_samples,)
The inverted list
Examples¶
>>> from pyod.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])
- pyod.utils.utility.precision_n_scores(y, y_pred, n=None)[source]¶
Utility function to calculate precision @ rank n.
Parameters¶
- ylist or numpy array of shape (n_samples,)
The ground truth. Binary (0: inliers, 1: outliers).
- y_predlist or numpy array of shape (n_samples,)
The raw outlier scores as returned by a fitted model.
- nint, optional (default=None)
The number of outliers. If not defined, it is inferred from the ground truth.
Returns¶
- precision_at_rank_nfloat
Precision at rank n score.
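Precision at rank n is the fraction of true outliers among the n highest-scoring samples. The sketch below uses the same toy data as the get_label_n example; `label_top_n` and `precision_at_n` are hypothetical helpers, and ties are broken by sort order, which may differ from the library's handling.

```python
import numpy as np


def label_top_n(scores, n):
    """Assign 1 to the n highest scores and 0 to everything else."""
    scores = np.asarray(scores, dtype=float)
    labels = np.zeros(scores.size, dtype=int)
    labels[np.argsort(scores)[::-1][:n]] = 1
    return labels


def precision_at_n(y, scores, n=None):
    """Share of the top-n scored samples that are true outliers."""
    y = np.asarray(y)
    n = int(y.sum()) if n is None else n  # infer n from the ground truth
    predicted = label_top_n(scores, n)
    return float(np.sum((predicted == 1) & (y == 1)) / n)


y_true = [0, 1, 1, 0, 0]
raw_scores = [0.1, 0.5, 0.3, 0.2, 0.7]
top_labels = label_top_n(raw_scores, 2)
p_at_n = precision_at_n(y_true, raw_scores)
```

Only one of the two top-scored samples is a true outlier here, so the precision at rank 2 is 0.5.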
- pyod.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)[source]¶
Turn raw outlier scores into binary labels (0 or 1).
Parameters¶
- pred_scoreslist or numpy array of shape (n_samples,)
Raw outlier scores. Outliers are assumed to have larger values.
- outliers_fractionfloat in (0,1)
Proportion of outliers.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, the binary label that tells whether it should be considered an outlier (1) or an inlier (0), given the specified outliers_fraction.
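One common way to turn scores into labels is to threshold at the percentile implied by the outlier fraction. The sketch below assumes that approach; `scores_to_labels` is a hypothetical helper and the library's exact thresholding rule may differ.

```python
import numpy as np


def scores_to_labels(scores, outliers_fraction=0.1):
    """Label as outliers (1) the scores above the (1 - fraction) percentile."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, 100 * (1 - outliers_fraction))
    return (scores > threshold).astype(int)


demo_scores = np.arange(10, dtype=float)  # 0.0, 1.0, ..., 9.0
demo_labels = scores_to_labels(demo_scores, outliers_fraction=0.1)
```

With ten evenly spaced scores and a fraction of 0.1, only the largest score is flagged.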
- pyod.utils.utility.standardizer(X, X_t=None, keep_scalar=False)[source]¶
Conduct Z-normalization on data so that the input samples have zero mean and unit variance.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training samples
- X_tnumpy array of shape (n_samples_new, n_features), optional (default=None)
The data to be converted
- keep_scalarbool, optional (default=False)
The flag to indicate whether to return the scaler
Returns¶
- X_normnumpy array of shape (n_samples, n_features)
X after the Z-score normalization
- X_t_normnumpy array of shape (n_samples_new, n_features)
X_t after the Z-score normalization
- scalarsklearn scaler object
The scaler used in the conversion
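The key point of Z-normalization with a second dataset is that X_t is transformed with the mean and standard deviation learned from X. The numpy sketch below shows that; `z_normalize` is a hypothetical helper that does not use or return an sklearn scaler object.

```python
import numpy as np


def z_normalize(X, X_t=None):
    """Standardize X column-wise; apply the same statistics to X_t if given."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    X_norm = (X - mean) / std
    if X_t is None:
        return X_norm
    return X_norm, (np.asarray(X_t, dtype=float) - mean) / std


train = [[1.0, 10.0], [3.0, 30.0]]
new = [[2.0, 20.0]]
train_norm, new_norm = z_normalize(train, new)
```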
pyod.utils.persistence module¶
See Model Save and Load for the user-facing guide on saving and loading PyOD detectors, including cross-sklearn-version compatibility and strict mode.
Cross-sklearn-version model persistence for PyOD.
This module is the recommended way to save and load PyOD detectors. It wraps joblib with two capabilities the raw joblib.dump / joblib.load path does not provide:
- A versioned envelope written by save(). The envelope records the PyOD, sklearn, numpy, scipy, joblib, and Python versions in effect at save time. load() compares the envelope against the running environment and emits a UserWarning when any binary-format dependency drifts; load(…, strict=True) raises instead. This lets users detect dependency drift before it surprises them in production.
- A compat_load() helper that loads legacy artifacts whose sklearn Tree node dtype no longer matches the running sklearn (a recurring user pain documented in issue #519). compat_load uses joblib’s own unpickler with the BUILD-opcode dispatch entry patched so that sklearn Tree state is realigned to the running dtype before sklearn.tree._tree.Tree.__setstate__ sees it.
load() automatically falls through to compat_load() when the underlying joblib.load raises the specific sklearn dtype ValueError, so users who only call load() get the rescue path transparently.
WARNING: pickle and joblib load arbitrary Python code. Load only from trusted sources. The compat_load helper does not change this security model.
See docs/model_persistence.rst for the user-facing guide.
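The fall-through from load() to compat_load() described above can be pictured with a small control-flow sketch. It is not the module's implementation: `load_with_rescue`, the stand-in loaders, and the `DTYPE_ERROR_MARKER` substring used to recognize the dtype error are all assumptions made for illustration.

```python
# Assumption: a substring that identifies the sklearn Tree dtype error message.
DTYPE_ERROR_MARKER = "incompatible dtype"


def load_with_rescue(path, primary_load, rescue_load):
    """Hypothetical helper: try the normal loader, rescue only the dtype error."""
    try:
        return primary_load(path)
    except ValueError as exc:
        if DTYPE_ERROR_MARKER not in str(exc):
            raise  # unrelated ValueErrors are not swallowed
        return rescue_load(path)


def failing_load(path):
    # Stand-in for a plain load that hits the Tree node dtype mismatch.
    raise ValueError("node array from the pickle has an incompatible dtype")


def rescue(path):
    # Stand-in for the compatibility loader.
    return {"recovered_from": path}


result = load_with_rescue("model.joblib", failing_load, rescue)
```

The point of matching on one specific error is that every other failure still surfaces to the caller unchanged.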
- pyod.utils.persistence.compat_load(path: Any, mmap_mode: str | None = None) -> Any[source]¶
Load an artifact whose sklearn Tree node dtype no longer matches.
Mirrors joblib.load but plugs a dispatch-table override into joblib’s unpickler so that sklearn Tree state is realigned to the running sklearn dtype before Tree.__setstate__ raises.
Realignment is name-based and bounded by _TREE_NODE_FIELD_DEFAULTS plus _TREE_NODE_FIELD_RENAMES. Unknown added/removed fields, dtype kind/signedness/itemsize changes, and shape changes raise ValueError. Same-name byte-order-only differences realign safely.
Emits a UserWarning recommending re-fit ONLY when at least one Tree was actually realigned. A no-op pass-through on a non-tree artifact is silent.
Parameters¶
- pathstr, pathlib.Path, or file-like
The artifact to load.
- mmap_modestr or None, default None
Forwarded to joblib’s underlying load path. Supported values mirror joblib’s: None, ‘r’, ‘r+’, ‘w+’, ‘c’.
Returns¶
- objAny
The raw top-level object from the file (a fitted detector for legacy raw saves; an envelope dict for Phase 2 saves). Callers that need envelope unwrapping should use load().
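To make the name-based realignment rules concrete, here is a minimal sketch on a NumPy structured array. It is an illustration under stated assumptions, not the code compat_load runs: `realign_by_name` is a hypothetical helper, the `defaults` argument plays the role the text assigns to _TREE_NODE_FIELD_DEFAULTS, the field names are illustrative, and renames are left out for brevity.

```python
import numpy as np


def realign_by_name(nodes, target_dtype, defaults):
    """Hypothetical helper: copy a structured array into target_dtype by field name."""
    target_dtype = np.dtype(target_dtype)
    removed = set(nodes.dtype.names) - set(target_dtype.names)
    if removed:
        raise ValueError(f"unknown removed fields: {sorted(removed)}")
    out = np.zeros(nodes.shape, dtype=target_dtype)
    for name in target_dtype.names:
        if name in nodes.dtype.names:
            src, dst = nodes.dtype[name], target_dtype[name]
            # Kind covers signedness ('i' vs 'u'); itemsize covers width.
            if (src.kind, src.itemsize) != (dst.kind, dst.itemsize):
                raise ValueError(f"field {name!r} changed from {src} to {dst}")
            out[name] = nodes[name]  # byte-order-only differences convert safely
        elif name in defaults:
            out[name] = defaults[name]  # known added field: fill with its default
        else:
            raise ValueError(f"unknown added field: {name!r}")
    return out


# Illustrative dtypes: the "old" layout is big-endian and lacks one field.
old_dtype = np.dtype([("left_child", ">i8"), ("threshold", ">f8")])
new_dtype = np.dtype(
    [("left_child", "<i8"), ("threshold", "<f8"), ("missing_go_to_left", "u1")]
)
old_nodes = np.array([(1, 0.5), (-1, -2.0)], dtype=old_dtype)
new_nodes = realign_by_name(old_nodes, new_dtype, defaults={"missing_go_to_left": 0})
```

Refusing anything outside the known defaults keeps the repair bounded: an unexpected layout change fails loudly instead of producing a silently corrupted tree.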
- pyod.utils.persistence.load(path: Any, strict: bool = False, return_metadata: bool = False) -> Any[source]¶
Load a PyOD detector saved by save() or by raw joblib.dump.
load() understands three input shapes:
1. An envelope dict written by save(). The envelope’s recorded dependency versions are compared against the running environment. Drift in sklearn, joblib, numpy, or scipy emits a UserWarning; strict=True raises ValueError instead.
2. A raw detector object written by joblib.dump(clf, path) on a previous PyOD release. Returned as-is when strict=False; raises under strict=True because legacy artifacts have no envelope to verify.
3. A file that fails the initial joblib.load with the sklearn Tree node dtype error. load() falls through to compat_load(path) and routes the recovered object through the same envelope/legacy handler. See module docstring.
Parameters¶
- pathstr or pathlib.Path
Path to the artifact.
- strictbool, default False
When True, version drift in any warn-severity dependency raises ValueError. info-severity drift (Python version) never raises. Legacy artifacts without an envelope also raise under strict mode.
- return_metadatabool, default False
When True, return (model, envelope_without_model_field) instead of just the model. For legacy artifacts the second element is None.
Returns¶
- modelAny
The unpickled model. When return_metadata=True, returns (model, envelope_dict_or_None).
Raises¶
- ValueError
On schema-version mismatch, strict-mode drift, strict-mode legacy artifacts, or after a successful compat repair under strict mode.
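The warn-versus-raise behaviour of the drift check can be sketched with the standard library alone. This is a simplified model under assumptions, not the module's code: `check_drift`, the `WARN_SEVERITY` tuple, the dictionary layout, and the version strings are all made up for illustration, and Python-version drift is simply ignored here.

```python
import warnings

# Assumption: packages whose drift warns by default and raises under strict mode.
WARN_SEVERITY = ("sklearn", "joblib", "numpy", "scipy")


def check_drift(recorded, running, strict=False):
    """Hypothetical helper: return the warn-severity packages whose version changed."""
    drifted = [n for n in WARN_SEVERITY if recorded.get(n) != running.get(n)]
    if drifted:
        message = "dependency drift since save: " + ", ".join(
            f"{n} {recorded.get(n)} -> {running.get(n)}" for n in drifted
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, UserWarning)
    return drifted


# Made-up version numbers, used only to exercise the helper.
recorded = {
    "sklearn": "1.2.2",
    "numpy": "1.26.4",
    "scipy": "1.11.4",
    "joblib": "1.3.2",
    "python": "3.10.13",
}
running = dict(recorded, sklearn="1.5.0", python="3.12.1")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    drifted = check_drift(recorded, running)
```

With strict=True the same inputs raise ValueError instead of warning, while a change confined to the "python" entry passes in both modes.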
- pyod.utils.persistence.save(model: Any, path: Any, metadata: dict | None = None) -> None[source]¶
Save a fitted PyOD detector with a versioned envelope.
The envelope records every dependency version that can affect pickle/joblib layout, plus a save timestamp and a user-supplied metadata dict. The actual model object is written via joblib.dump; the only difference from raw joblib.dump(clf, path) is that the model is wrapped in a header dict the matching load() recognizes.
Parameters¶
- modelAny
The fitted detector to save. Anything picklable will work; PyOD BaseDetector subclasses are the typical case.
- pathstr or pathlib.Path
Destination file path.
- metadatadict or None
Optional user-supplied metadata (training dataset id, feature schema hash, run id, anything). No schema is imposed; the dict round-trips as-is.
Returns¶
None
Notes¶
Loading the file with raw joblib.load returns the envelope dict, not the model. Use load() from this module to unwrap.
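The envelope idea, and why a raw load hands back a dict rather than the model, can be shown with a small stand-in built on pickle. Everything here is an assumption made for illustration: the helper names, the envelope keys ("model", "versions", "saved_at", "metadata"), and the use of pickle in place of joblib do not reflect the actual on-disk schema.

```python
import pickle
import platform
import tempfile
import time
from pathlib import Path


def save_with_envelope(model, path, metadata=None):
    """Hypothetical helper: wrap the model in a header dict before writing it."""
    envelope = {
        "model": model,
        "versions": {"python": platform.python_version()},
        "saved_at": time.time(),
        "metadata": metadata or {},
    }
    with open(path, "wb") as f:
        pickle.dump(envelope, f)


def load_from_envelope(path):
    """Hypothetical helper: unwrap an envelope, or pass a legacy object through."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if isinstance(obj, dict) and "model" in obj:
        return obj["model"]
    return obj


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "detector.pkl"
    # A plain list stands in for a fitted detector.
    save_with_envelope([0.1, 0.2, 0.9], artifact, metadata={"run_id": "demo"})
    with open(artifact, "rb") as f:
        raw = pickle.load(f)  # raw load: the envelope dict, not the model
    model = load_from_envelope(artifact)  # matching loader: the model itself
```

As with the real module, this stand-in unpickles arbitrary objects, so it should only ever be pointed at trusted files.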