All Models#

pyod.models.abod module#

Angle-based Outlier Detector (ABOD)

class pyod.models.abod.ABOD(contamination=0.1, n_neighbors=5, method='fast')[source]#

Bases: BaseDetector

ABOD class for Angle-based Outlier Detection. For an observation, the variance of its weighted cosine scores to all neighbors could be viewed as the outlying score. See [BKZ+08] for details.

Two versions of ABOD are supported:

  • Fast ABOD: uses k nearest neighbors to approximate the variance computation.

  • Original ABOD: considers all training points, with high time complexity of O(n^3).
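A minimal usage sketch (hedged: generate_data from pyod.utils.data is used here purely for illustration; any numpy array of shape (n_samples, n_features) works as input):

>>> from pyod.models.abod import ABOD
>>> from pyod.utils.data import generate_data
>>> X_train, X_test, y_train, y_test = generate_data(
...     n_train=200, n_test=100, contamination=0.1)
>>> clf = ABOD(contamination=0.1, n_neighbors=5, method='fast')
>>> clf = clf.fit(X_train)
>>> train_scores = clf.decision_scores_  # raw outlier scores on training data
>>> test_labels = clf.predict(X_test)    # binary labels: 0 inlier, 1 outlier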

Parameters#

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_neighbors : int, optional (default=5)

Number of neighbors to use by default for k neighbors queries.

method : str, optional (default=’fast’)

Valid values for method are:

  • ‘fast’: fast ABOD. Only consider n_neighbors of training points

  • ‘default’: original ABOD with all training points, which could be slow

Attributes#

decision_scores_ : numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_ : float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_ : int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
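The three attributes are linked as follows (a hedged sketch of the post-processing applied after fitting, not a verbatim copy of the library internals):

>>> import numpy as np
>>> contamination = 0.1
>>> scores = clf.decision_scores_  # from a fitted detector, as above
>>> # threshold_ cuts off the top `contamination` fraction of training scores
>>> threshold_ = np.percentile(scores, 100 * (1 - contamination))
>>> labels_ = (scores > threshold_).astype(int)  # 1 = outlier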

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

self : object

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
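The recommended replacement pattern:

>>> clf.fit(X)            # instead of labels = clf.fit_predict(X)
>>> labels = clf.labels_  # binary labels on the training data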

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : numpy array of shape (n_samples,)

The ground truth of the input samples (labels), used for computing the evaluation metric.

scoring : str, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deep : bool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

params : mapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidence : numpy array of shape (n_samples,)

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidence : numpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

method : str, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probability : numpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. The second dimension depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).
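For intuition, the ‘linear’ conversion behaves like min-max scaling of the raw scores against the training score range (a hedged sketch, not the exact library code):

>>> import numpy as np
>>> train_scores = clf.decision_scores_      # from a fitted detector
>>> test_scores = clf.decision_function(X)
>>> # scale by the training score range, clipped to [0, 1]
>>> proba_outlier = np.clip(
...     (test_scores - train_scores.min())
...     / (train_scores.max() - train_scores.min()), 0, 1)
>>> proba = np.column_stack([1 - proba_outlier, proba_outlier])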

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.alad module#

Using Adversarially Learned Anomaly Detection

class pyod.models.alad.ALAD(activation_hidden_gen='tanh', activation_hidden_disc='tanh', output_activation=None, dropout_rate=0.2, latent_dim=2, dec_layers=[5, 10, 25], enc_layers=[25, 10, 5], disc_xx_layers=[25, 10, 5], disc_zz_layers=[25, 10, 5], disc_xz_layers=[25, 10, 5], learning_rate_gen=0.0001, learning_rate_disc=0.0001, add_recon_loss=False, lambda_recon_loss=0.1, epochs=200, verbose=0, preprocessing=False, add_disc_zz_loss=True, spectral_normalization=False, batch_size=32, contamination=0.1)[source]#

Bases: BaseDetector

Adversarially Learned Anomaly Detection (ALAD). Paper: https://arxiv.org/pdf/1812.02288.pdf

See [BZRF+18] for details.
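A minimal usage sketch (hedged; ALAD needs a TensorFlow backend, and X_train/X_test stand for any numpy arrays of shape (n_samples, n_features)):

>>> from pyod.models.alad import ALAD
>>> clf = ALAD(epochs=200, batch_size=32, contamination=0.1)
>>> clf = clf.fit(X_train)
>>> scores = clf.decision_function(X_test)  # higher = more abnormal
>>> labels = clf.predict(X_test)            # 0 inlier, 1 outlier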

Parameters#

output_activation : str, optional (default=None)

Activation function to use for the output layers of the encoder and decoder. See https://keras.io/activations/

activation_hidden_disc : str, optional (default=’tanh’)

Activation function to use for hidden layers in discriminators. See https://keras.io/activations/

activation_hidden_gen : str, optional (default=’tanh’)

Activation function to use for hidden layers in encoder and decoder (i.e. generator). See https://keras.io/activations/

epochs : int, optional (default=200)

Number of epochs to train the model.

batch_size : int, optional (default=32)

Number of samples per gradient update.

dropout_rate : float in (0., 1), optional (default=0.2)

The dropout to be used across all layers.

dec_layers : list, optional (default=[5,10,25])

List that indicates the number of nodes per hidden layer for the decoder network. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

enc_layers : list, optional (default=[25,10,5])

List that indicates the number of nodes per hidden layer for the encoder network. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

disc_xx_layers : list, optional (default=[25,10,5])

List that indicates the number of nodes per hidden layer for discriminator_xx. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

disc_zz_layers : list, optional (default=[25,10,5])

List that indicates the number of nodes per hidden layer for discriminator_zz. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

disc_xz_layers : list, optional (default=[25,10,5])

List that indicates the number of nodes per hidden layer for discriminator_xz. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

learning_rate_gen : float in (0., 1), optional (default=0.0001)

Learning rate for training the encoder and decoder.

learning_rate_disc : float in (0., 1), optional (default=0.0001)

Learning rate for training the discriminators.

add_recon_loss : bool, optional (default=False)

Add an extra loss for encoder and decoder based on the reconstruction error.

lambda_recon_loss : float in (0., 1), optional (default=0.1)

If add_recon_loss=True, the reconstruction loss gets multiplied by lambda_recon_loss and added to the total loss for the generator (i.e. encoder and decoder).

preprocessing : bool, optional (default=False)

If True, apply standardization on the data.

verbose : int, optional (default=0)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

Attributes#

decision_scores_ : numpy array of shape (n_samples,)

The outlier scores of the training data [0,1]. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_ : float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_ : int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None, noise_std=0.1)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

self : object

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : numpy array of shape (n_samples,)

The ground truth of the input samples (labels), used for computing the evaluation metric.

scoring : str, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_outlier_scores(X_norm)[source]#
get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deep : bool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

params : mapping of string to any

Parameter names mapped to their values.

plot_learning_curves(start_ind=0, window_smoothening=10)[source]#
predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidence : numpy array of shape (n_samples,)

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidence : numpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

method : str, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probability : numpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. The second dimension depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

train_more(X, epochs=100, noise_std=0.1)[source]#

This function allows the researcher to perform extra training epochs on top of the fixed number run by the fit() function.
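For example (a hedged sketch; epochs and noise_std mirror the signature above):

>>> clf = ALAD().fit(X_train)           # initial training
>>> clf.train_more(X_train, epochs=50)  # continue for 50 more epochs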

train_step(data)[source]#

pyod.models.anogan module#

Anomaly Detection with Generative Adversarial Networks (AnoGAN) Paper: https://arxiv.org/pdf/1703.05921.pdf Note that this is a different implementation of AnoGAN from the one at https://github.com/fuchami/ANOGAN

class pyod.models.anogan.AnoGAN(activation_hidden='tanh', dropout_rate=0.2, latent_dim_G=2, G_layers=[20, 10, 3, 10, 20], verbose=0, D_layers=[20, 10, 5], index_D_layer_for_recon_error=1, epochs=500, preprocessing=False, learning_rate=0.001, learning_rate_query=0.01, epochs_query=20, batch_size=32, output_activation=None, contamination=0.1)[source]#

Bases: BaseDetector

Anomaly Detection with Generative Adversarial Networks (AnoGAN). See the original paper “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery”.

See [BSSeebockW+17] for details.
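A minimal usage sketch (hedged; X_train/X_test stand for any numpy arrays of shape (n_samples, n_features)). Note that scoring is comparatively slow, since each query sample is first approximated in the generator’s latent space (see epochs_query below):

>>> from pyod.models.anogan import AnoGAN
>>> clf = AnoGAN(epochs=500, batch_size=32, contamination=0.1)
>>> clf = clf.fit(X_train)
>>> labels = clf.predict(X_test)  # 0 inlier, 1 outlier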

Parameters#

output_activation : str, optional (default=None)

Activation function to use for output layer. See https://keras.io/activations/

activation_hidden : str, optional (default=’tanh’)

Activation function to use for hidden layers. See https://keras.io/activations/

epochs : int, optional (default=500)

Number of epochs to train the model.

batch_size : int, optional (default=32)

Number of samples per gradient update.

dropout_rate : float in (0., 1), optional (default=0.2)

The dropout to be used across all layers.

G_layers : list, optional (default=[20,10,3,10,20])

List that indicates the number of nodes per hidden layer for the generator. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

D_layers : list, optional (default=[20,10,5])

List that indicates the number of nodes per hidden layer for the discriminator. Thus, [10,10] indicates 2 hidden layers having 10 nodes each.

learning_rate : float in (0., 1), optional (default=0.001)

Learning rate of training the network.

index_D_layer_for_recon_error : int, optional (default=1)

This is the index of the hidden layer in the discriminator for which the reconstruction error will be determined between query sample and the sample created from the latent space.

learning_rate_query : float in (0., 1), optional (default=0.01)

Learning rate for the backpropagation steps needed to find a point in the latent space of the generator that approximates the query sample.

epochs_query : int, optional (default=20)

Number of epochs to approximate the query sample in the latent space of the generator.

preprocessing : bool, optional (default=False)

If True, apply standardization on the data.

verbose : int, optional (default=0)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

Attributes#

decision_scores_ : numpy array of shape (n_samples,)

The outlier scores of the training data [0,1]. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_ : float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_ : int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

self : object

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : numpy array of shape (n_samples,)

The ground truth of the input samples (labels), used for computing the evaluation metric.

scoring : str, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

fit_query(query_sample)[source]#
get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deep : bool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

params : mapping of string to any

Parameter names mapped to their values.

plot_learning_curves(start_ind=0, window_smoothening=10)[source]#
predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidence : numpy array of shape (n_samples,)

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidence : numpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

method : str, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probability : numpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. The second dimension depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

train_step(data)[source]#

pyod.models.auto_encoder module#

Using Auto Encoder with Outlier Detection

class pyod.models.auto_encoder.AutoEncoder(hidden_neurons=None, hidden_activation='relu', output_activation='sigmoid', loss=<function mean_squared_error>, optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=1, random_state=None, contamination=0.1)[source]#

Bases: BaseDetector

Auto Encoder (AE) is a type of neural network for learning useful data representations in an unsupervised manner. Similar to PCA, AE could be used to detect outlying objects in the data by calculating the reconstruction errors. See [BAgg15] Chapter 3 for details.
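A minimal usage sketch (hedged; X_train/X_test stand for any numpy arrays of shape (n_samples, n_features)):

>>> from pyod.models.auto_encoder import AutoEncoder
>>> clf = AutoEncoder(hidden_neurons=[64, 32, 32, 64], epochs=100,
...                   contamination=0.1, verbose=0)
>>> clf = clf.fit(X_train)
>>> train_scores = clf.decision_scores_  # reconstruction-error based scores
>>> labels = clf.predict(X_test)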

Parameters#

hidden_neurons : list, optional (default=[64, 32, 32, 64])

The number of neurons per hidden layer.

hidden_activation : str, optional (default=’relu’)

Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://keras.io/activations/

output_activation : str, optional (default=’sigmoid’)

Activation function to use for output layer. See https://keras.io/activations/

Activation function to use for output layer. See https://keras.io/activations/

loss : str or obj, optional (default=keras.losses.mean_squared_error)

String (name of objective function) or objective function. See https://keras.io/losses/

optimizer : str, optional (default=’adam’)

String (name of optimizer) or optimizer instance. See https://keras.io/optimizers/

epochs : int, optional (default=100)

Number of epochs to train the model.

Number of epochs to train the model.

batch_size : int, optional (default=32)

Number of samples per gradient update.

dropout_rate : float in (0., 1), optional (default=0.2)

The dropout to be used across all layers.

l2_regularizer : float in (0., 1), optional (default=0.1)

The regularization strength of activity_regularizer applied on each layer. By default, l2 regularizer is used. See https://keras.io/regularizers/

validation_size : float in (0., 1), optional (default=0.1)

The percentage of data to be used for validation.

preprocessing : bool, optional (default=True)

If True, apply standardization on the data.

verbose : int, optional (default=1)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

  • 2 = one line per epoch.

For verbose >= 1, model summary may be printed.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

Attributes#

encoding_dim_ : int

The number of neurons in the encoding layer.

compression_rate_ : float

The ratio between the number of original features and the number of neurons in the encoding layer.

model_ : Keras Object

The underlying AutoEncoder in Keras.

history_ : Keras Object

The AutoEncoder training history.

decision_scores_ : numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_ : float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_ : int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

self : object

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : numpy array of shape (n_samples,)

The ground truth of the input samples (labels), used for computing the evaluation metric.

scoring : str, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deep : bool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

params : mapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidence : numpy array of shape (n_samples,)

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidence : numpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

method : str, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probability : numpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. The second dimension depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.auto_encoder_torch module#

Using AutoEncoder with Outlier Detection (PyTorch)

class pyod.models.auto_encoder_torch.AutoEncoder(hidden_neurons=None, hidden_activation='relu', batch_norm=True, learning_rate=0.001, epochs=100, batch_size=32, dropout_rate=0.2, weight_decay=1e-05, preprocessing=True, loss_fn=None, contamination=0.1, device=None)[source]#

Bases: BaseDetector

Auto Encoder (AE) is a type of neural network for learning useful data representations in an unsupervised manner. Similar to PCA, AE could be used to detect outlying objects in the data by calculating the reconstruction errors. See [BAgg15] Chapter 3 for details.

Notes#

This is the PyTorch version of AutoEncoder. See auto_encoder.py for the TensorFlow version.

The documentation is not finished!
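A minimal usage sketch (hedged; X_train/X_test stand for any numpy arrays of shape (n_samples, n_features)):

>>> from pyod.models.auto_encoder_torch import AutoEncoder
>>> # hidden_neurons=[64, 32] yields the symmetric network
>>> # [n_features, 64, 32, 32, 64, n_features]
>>> clf = AutoEncoder(hidden_neurons=[64, 32], epochs=100, batch_size=32)
>>> clf = clf.fit(X_train)
>>> labels = clf.predict(X_test)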

Parameters#

hidden_neurons : list, optional (default=[64, 32])

The number of neurons per hidden layer. The network therefore has the structure [n_features, 64, 32, 32, 64, n_features].

hidden_activation : str, optional (default=’relu’)

Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://pytorch.org/docs/stable/nn.html for details. Currently only ‘relu’ (nn.ReLU()), ‘sigmoid’ (nn.Sigmoid()) and ‘tanh’ (nn.Tanh()) are supported. See pyod/utils/torch_utility.py for details.

batch_norm : boolean, optional (default=True)

Whether to apply Batch Normalization. See https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html

learning_rate : float, optional (default=1e-3)

Learning rate for the optimizer. This learning_rate is given to an Adam optimizer (torch.optim.Adam). See https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

epochs : int, optional (default=100)

Number of epochs to train the model.

batch_size : int, optional (default=32)

Number of samples per gradient update.

dropout_rate : float in (0., 1), optional (default=0.2)

The dropout to be used across all layers.

weight_decay : float, optional (default=1e-5)

The weight decay for Adam optimizer. See https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

preprocessing : bool, optional (default=True)

If True, apply standardization on the data.

loss_fn : obj, optional (default=torch.nn.MSELoss)

Loss function instance which implements torch.nn._Loss. One of https://pytorch.org/docs/stable/nn.html#loss-functions or a custom loss. Custom losses are currently unstable.

verbose : int, optional (default=1)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

  • 2 = one line per epoch.

For verbose >= 1, model summary may be printed. !CURRENTLY NOT SUPPORTED.!

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. !CURRENTLY NOT SUPPORTED.!

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

Attributes#

encoding_dim_ : int

The number of neurons in the encoding layer.

compression_rate_ : float

The ratio between the number of original features and the number of neurons in the encoding layer.

model_ : torch Object

The underlying AutoEncoder in PyTorch.

history_ : obj

The AutoEncoder training history.

decision_scores_ : numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_ : float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_ : int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scores : numpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

self : object

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : Ignored

Not used, present for API consistency by convention.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : numpy array of shape (n_samples,)

The ground truth of the input samples (labels), used for computing the evaluation metric.

scoring : str, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deep : bool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

params : mapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidence : numpy array of shape (n_samples,)

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidence : numpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

method : str, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probability : numpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. The second dimension depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

class pyod.models.auto_encoder_torch.InnerAutoencoder(n_features, hidden_neurons=(128, 64), dropout_rate=0.2, batch_norm=True, hidden_activation='relu')[source]#

Bases: Module

add_module(name: str, module: Module | None) → None#

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:

name (str): name of the child module. The child module can be accessed from this module using the given name.

module (Module): child module to be added to the module.

apply(fn: Callable[[Module], None]) → T#

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also torch.nn.init).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16() → T#

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse: bool = True) → Iterator[Tensor]#

Return an iterator over module buffers.

Args:

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
children() → Iterator[Module]#

Return an iterator over immediate children modules.

Yields:

Module: a child module

compile(*args, **kwargs)#

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu() → T#

Move all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device: int | device | None = None) → T#

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

double() → T#

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

eval() → T#

Set the module in evaluation mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

extra_repr() → str#

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

float() → T#

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

forward(x)[source]#

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
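For instance (a hedged sketch; InnerAutoencoder’s constructor signature is taken from above):

>>> import torch
>>> net = InnerAutoencoder(n_features=10, hidden_neurons=(128, 64))
>>> x = torch.randn(4, 10)   # batch of 4 samples with 10 features
>>> out = net(x)             # preferred: runs registered hooks, then forward(x)
>>> # out = net.forward(x)   # works, but silently skips hooks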

get_buffer(target: str) → Tensor#

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:

target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

get_extra_state() → Any#

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target: str) → Parameter#

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:

target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

get_submodule(target: str) → Module#

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:

target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

half() → T#

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device: int | device | None = None) → T#

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on IPU while being optimized.

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

load_state_dict(state_dict: Mapping[str, Any], strict: bool = True, assign: bool = False)#

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:

state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved while when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:

NamedTuple with missing_keys and unexpected_keys fields:

  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
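A typical save/load round trip (a hedged sketch; ‘ae.pt’ is a placeholder path and net is the InnerAutoencoder instance from the earlier example):

>>> torch.save(net.state_dict(), 'ae.pt')    # persist parameters and buffers
>>> net2 = InnerAutoencoder(n_features=10)   # fresh module, same architecture
>>> missing, unexpected = net2.load_state_dict(torch.load('ae.pt'), strict=True)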

modules() → Iterator[Module]#

Return an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix: str = '', recurse: bool = True, remove_duplicate: bool = True) → Iterator[Tuple[str, Tensor]]#

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children() → Iterator[Tuple[str, Module]]#

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo: Set[Module] | None = None, prefix: str = '', remove_duplicate: bool = True)#

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix: str = '', recurse: bool = True, remove_duplicate: bool = True) → Iterator[Tuple[str, Parameter]]#

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
parameters(recurse: bool = True) → Iterator[Parameter]#

Return an iterator over module parameters.

This is typically passed to an optimizer.

Args:

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
register_backward_hook(hook: Callable[[Module, Tuple[Tensor, ...] | Tensor, Tuple[Tensor, ...] | Tensor], None | Tuple[Tensor, ...] | Tensor]) → RemovableHandle#

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:

torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name: str, tensor: Tensor | None, persistent: bool = True) → None#

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:

name (str): name of the buffer. The buffer can be accessed from this module using the given name.

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook: Callable[[T, Tuple[Any, ...], Any], Any | None] | Callable[[T, Tuple[Any, ...], Dict[str, Any], Any], Any | None], *, prepend: bool = False, with_kwargs: bool = False, always_call: bool = False) RemovableHandle#

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the output. It can also modify the input in-place, but this will have no effect on forward since the hook is called after forward() has run. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and is expected to return the output, possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output

Args:

hook (Callable): The user-defined hook to be registered.
prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
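
For illustration, a minimal self-contained sketch of a forward hook that logs output shapes (the hook name and layer here are illustrative, not part of the API):

>>> import torch
>>> import torch.nn as nn
>>> def shape_hook(module, args, output):
...     # runs after forward(); returning None keeps the output unchanged
...     print(type(module).__name__, '->', tuple(output.shape))
>>> layer = nn.Linear(4, 2)
>>> handle = layer.register_forward_hook(shape_hook)
>>> _ = layer(torch.randn(3, 4))
Linear -> (3, 2)
>>> handle.remove()  # detach the hook once it is no longer needed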

register_forward_pre_hook(hook: Callable[[T, Tuple[Any, ...]], Any | None] | Callable[[T, Tuple[Any, ...], Dict[str, Any]], Tuple[Any, Dict[str, Any]] | None], *, prepend: bool = False, with_kwargs: bool = False) RemovableHandle#

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value from the hook; a single returned value will be wrapped into a tuple (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is True, the forward pre-hook will be passed the kwargs given to the forward function, and if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs

Args:

hook (Callable): The user-defined hook to be registered.
prepend (bool): If True, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
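
As a sketch, a pre-hook that rescales the positional input before forward() runs (names are illustrative; a single returned value would be wrapped into a tuple as described above):

>>> import torch
>>> import torch.nn as nn
>>> def double_input(module, args):
...     # the returned tuple replaces the positional arguments to forward()
...     return (args[0] * 2,)
>>> layer = nn.Linear(3, 3)
>>> handle = layer.register_forward_pre_hook(double_input)
>>> out = layer(torch.ones(1, 3))  # forward() actually sees a tensor of twos
>>> handle.remove()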

register_full_backward_hook(hook: Callable[[Module, Tuple[Tensor, ...] | Tensor, Tuple[Tensor, ...] | Tensor], None | Tuple[Tensor, ...] | Tensor], prepend: bool = False) RemovableHandle#

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.
prepend (bool): If True, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
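
A minimal sketch of a backward hook that inspects gradient shapes without modifying them (names are illustrative):

>>> import torch
>>> import torch.nn as nn
>>> def grad_shape_hook(module, grad_input, grad_output):
...     # returning None leaves grad_input unchanged
...     print('grad_output shape:', tuple(grad_output[0].shape))
>>> layer = nn.Linear(2, 2)
>>> handle = layer.register_full_backward_hook(grad_shape_hook)
>>> layer(torch.randn(1, 2)).sum().backward()
grad_output shape: (1, 2)
>>> handle.remove()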

register_full_backward_pre_hook(hook: Callable[[Module, Tuple[Tensor, ...] | Tensor], None | Tuple[Tensor, ...] | Tensor], prepend: bool = False) RemovableHandle#

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.
prepend (bool): If True, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)#

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_module(name: str, module: Module | None) None#

Alias for add_module().

register_parameter(name: str, param: Parameter | None) None#

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.
param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.
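
A short sketch (the parameter name 'scale' is illustrative):

>>> import torch
>>> import torch.nn as nn
>>> m = nn.Module()
>>> m.register_parameter('scale', nn.Parameter(torch.ones(1)))
>>> m.scale
Parameter containing:
tensor([1.], requires_grad=True)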

register_state_dict_pre_hook(hook)#

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

requires_grad_(requires_grad: bool = True) T#

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See the documentation on locally disabling gradient computation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self
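
As a sketch, freezing the first layer of a small model for finetuning (the model layout is illustrative):

>>> import torch.nn as nn
>>> model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
>>> _ = model[0].requires_grad_(False)  # freeze the first layer in-place
>>> [p.requires_grad for p in model.parameters()]
[False, False, True, True]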

set_extra_state(state: Any)#

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory() T#

See torch.Tensor.share_memory_().

state_dict(*args, destination=None, prefix='', keep_vars=False)#

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)#

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module
dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device: int | str | device | None, recurse: bool = True) T#

Move the parameters and buffers to the specified device without copying storage.

Args:

device (torch.device): The desired device of the parameters and buffers in this module.
recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.

Returns:

Module: self

train(mode: bool = True) T#

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:

mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

type(dst_type: dtype | str) T#

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

xpu(device: int | device | None = None) T#

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

zero_grad(set_to_none: bool = True) None#

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

class pyod.models.auto_encoder_torch.PyODDataset(X, y=None, mean=None, std=None)[source]#

Bases: Dataset

PyOD Dataset class for PyTorch Dataloader

pyod.models.cblof module#

Clustering Based Local Outlier Factor (CBLOF)

class pyod.models.cblof.CBLOF(n_clusters=8, contamination=0.1, clustering_estimator=None, alpha=0.9, beta=5, use_weights=False, check_estimator=False, random_state=None, n_jobs=1)[source]#

Bases: BaseDetector

The CBLOF operator calculates the outlier score based on cluster-based local outlier factor.

CBLOF takes as an input the data set and the cluster model that was generated by a clustering algorithm. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster.

Use weighting for outlier factor based on the sizes of the clusters as proposed in the original publication. Since this might lead to unexpected behavior (outliers close to small clusters are not found), it is disabled by default. Outlier scores are then computed solely based on their distance to the closest large cluster center.

By default, kMeans is used as the clustering algorithm instead of the Squeezer algorithm mentioned in the original paper, for multiple reasons.

See [BHXD03] for details.
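
A minimal usage sketch on synthetic data (the data and parameter choices are illustrative):

>>> import numpy as np
>>> from pyod.models.cblof import CBLOF
>>> X = np.random.RandomState(42).randn(300, 2)
>>> clf = CBLOF(n_clusters=8, contamination=0.1, random_state=42)
>>> _ = clf.fit(X)                  # fit() returns self
>>> scores = clf.decision_scores_   # raw outlier scores on the training data
>>> labels = clf.labels_            # binary labels derived from threshold_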

Parameters#

n_clustersint, optional (default=8)

The number of clusters to form as well as the number of centroids to generate.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

clustering_estimatorEstimator, optional (default=None)

The base clustering algorithm for performing data clustering. A valid clustering algorithm should be passed in. The estimator should have standard sklearn APIs, fit() and predict(). The estimator should have attributes labels_ and cluster_centers_. If cluster_centers_ is not in the attributes once the model is fit, it is calculated as the mean of the samples in a cluster.

If not set, CBLOF uses KMeans for scalability. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

alphafloat in (0.5, 1), optional (default=0.9)

Coefficient for deciding small and large clusters. The ratio of the number of samples in large clusters to the number of samples in small clusters.

betaint or float in (1,), optional (default=5).

Coefficient for deciding small and large clusters. For a list of clusters sorted by size |C1|, |C2|, …, |Cn|, beta = |Ck|/|Ck-1|.

use_weightsbool, optional (default=False)

If set to True, the sizes of clusters are used as weights in the outlier score calculation.

check_estimatorbool, optional (default=False)

If set to True, check whether the base estimator is consistent with sklearn standard.

Warning

check_estimator may throw errors with scikit-learn 0.20 and above.

random_stateint, RandomState or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

clustering_estimator_Estimator, sklearn instance

Base estimator for clustering.

cluster_labels_list of shape (n_samples,)

Cluster assignment for the training samples.

n_clusters_int

Actual number of clusters (possibly different from n_clusters).

cluster_sizes_list of shape (n_clusters_,)

The size of each cluster once fitted with the training data.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

cluster_centers_numpy array of shape (n_clusters_, n_features)

The center of each cluster.

small_cluster_labels_list of cluster numbers

The cluster assignments belonging to small clusters.

large_cluster_labels_list of cluster numbers

The cluster assignments belonging to large clusters.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].
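
As a sketch of approach 1, the ‘linear’ conversion amounts to min-max scaling of the raw scores; the fitted detector performs this internally, and the snippet below is illustrative only:

>>> import numpy as np
>>> scores = np.array([0.2, 0.5, 1.3, 4.0])
>>> (scores - scores.min()) / (scores.max() - scores.min())
array([0.        , 0.07894737, 0.28947368, 1.        ])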

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.cof module#

Connectivity-Based Outlier Factor (COF) Algorithm

class pyod.models.cof.COF(contamination=0.1, n_neighbors=20, method='fast')[source]#

Bases: BaseDetector

Connectivity-Based Outlier Factor (COF). COF uses the ratio of the average chaining distance of a data point to the average of the average chaining distances of its k nearest neighbors as the outlier score for an observation.

See [BTCFC02] for details.

Two versions of COF are supported:

  • Fast COF: computes the entire pairwise distance matrix at the cost of a O(n^2) memory requirement.

  • Memory-efficient COF: calculates pairwise distances incrementally. Use this implementation when it is not feasible to fit the n-by-n distance matrix in memory. This leads to a linear overhead because many distances will have to be recalculated.
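
A minimal usage sketch (synthetic data; parameter choices are illustrative):

>>> import numpy as np
>>> from pyod.models.cof import COF
>>> X = np.random.RandomState(0).randn(200, 3)
>>> clf = COF(n_neighbors=20, method='fast')
>>> _ = clf.fit(X)
>>> clf.decision_scores_.shape
(200,)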

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_neighborsint, optional (default=20)

Number of neighbors to use by default for k neighbors queries. Note that n_neighbors should be less than the number of samples. If n_neighbors is larger than the number of samples provided, all samples will be used.

methodstring, optional (default=’fast’)

Valid values for method are:

  • ‘fast’ Fast COF, computes the full pairwise distance matrix up front.

  • ‘memory’ Memory-efficient COF, computes pairwise distances only when needed at the cost of computational speed.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

n_neighbors_: int

Number of neighbors to use by default for k neighbors queries.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.combination module#

A collection of model combination functionalities.

pyod.models.combination.aom(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]#

Average of Maximum - An ensemble method for combining multiple estimators. See [BAS15] for details.

The estimators are first divided into subgroups, and the maximum score within each subgroup is taken as the subgroup score. The final combined score is the average of all subgroup scores.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

The score matrix outputted from various estimators

n_bucketsint, optional (default=5)

The number of subgroups to build

methodstr, optional (default=’static’)

{‘static’, ‘dynamic’}, if ‘dynamic’, build subgroups randomly with dynamic bucket size.

bootstrap_estimatorsbool, optional (default=False)

Whether estimators are drawn with replacement.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Returns#

combined_scoresNumpy array of shape (n_samples,)

The combined outlier scores.
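
A usage sketch combining scores from ten hypothetical detectors (random scores for illustration):

>>> import numpy as np
>>> from pyod.models.combination import aom
>>> scores = np.random.RandomState(0).rand(100, 10)  # (n_samples, n_estimators)
>>> combined = aom(scores, n_buckets=5)  # 5 subgroups of 2 estimators each
>>> combined.shape
(100,)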

pyod.models.combination.average(scores, estimator_weights=None)[source]#

Combination method to merge the outlier scores from multiple estimators by taking the average.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

Score matrix from multiple estimators on the same samples.

estimator_weightslist of shape (1, n_estimators)

If specified, a weighted average is used.

Returns#

combined_scoresnumpy array of shape (n_samples, )

The combined outlier scores.

pyod.models.combination.majority_vote(scores, weights=None)[source]#

Combination method to merge the scores from multiple estimators by majority vote.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

Score matrix from multiple estimators on the same samples.

weightsnumpy array of shape (1, n_estimators)

If specified, a weighted majority vote is used.

Returns#

combined_scoresnumpy array of shape (n_samples, )

The combined scores.

pyod.models.combination.maximization(scores)[source]#

Combination method to merge the outlier scores from multiple estimators by taking the maximum.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

Score matrix from multiple estimators on the same samples.

Returns#

combined_scoresnumpy array of shape (n_samples, )

The combined outlier scores.

pyod.models.combination.median(scores)[source]#

Combination method to merge the scores from multiple estimators by taking the median.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

Score matrix from multiple estimators on the same samples.

Returns#

combined_scoresnumpy array of shape (n_samples, )

The combined scores.
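
A small worked example of the simple combination rules on a 4-sample, 3-detector score matrix:

>>> import numpy as np
>>> from pyod.models.combination import average, maximization, median
>>> scores = np.arange(12.0).reshape(4, 3)  # rows are samples, columns detectors
>>> average(scores)
array([ 1.,  4.,  7., 10.])
>>> maximization(scores)
array([ 2.,  5.,  8., 11.])
>>> median(scores)
array([ 1.,  4.,  7., 10.])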

pyod.models.combination.moa(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]#

Maximization of Average - An ensemble method for combining multiple estimators. See [BAS15] for details.

The estimators are first divided into subgroups, and the average score within each subgroup is taken as the subgroup score. The final combined score is the maximum of all subgroup scores.

Parameters#

scoresnumpy array of shape (n_samples, n_estimators)

The score matrix outputted from various estimators

n_bucketsint, optional (default=5)

The number of subgroups to build

methodstr, optional (default=’static’)

{‘static’, ‘dynamic’}, if ‘dynamic’, build subgroups randomly with dynamic bucket size.

bootstrap_estimatorsbool, optional (default=False)

Whether estimators are drawn with replacement.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Returns#

combined_scoresNumpy array of shape (n_samples,)

The combined outlier scores.

pyod.models.cd module#

Cook’s distance outlier detection (CD)

class pyod.models.cd.CD(contamination=0.1, model=LinearRegression())[source]#

Bases: BaseDetector

Cook’s distance can be used to identify points that negatively affect a regression model. A combination of each observation’s leverage and residual values is used in the measurement. Higher leverage and residuals relate to higher Cook’s distances. Note that this method is unsupervised and requires at least two features for X with which to calculate the mean Cook’s distance for each datapoint. Read more in [BCoo77].
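
A minimal usage sketch (synthetic data; at least two features are required, as noted above):

>>> import numpy as np
>>> from pyod.models.cd import CD
>>> X = np.random.RandomState(1).randn(100, 3)
>>> clf = CD(contamination=0.1)
>>> _ = clf.fit(X)
>>> scores = clf.decision_scores_   # mean Cook's distance per datapoint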

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

modelobject, optional (default=LinearRegression())

Regression model used to calculate the Cook’s distance

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.copod module#

Copula Based Outlier Detector (COPOD)

class pyod.models.copod.COPOD(contamination=0.1, n_jobs=1)[source]#

Bases: BaseDetector

COPOD class for Copula Based Outlier Detector. COPOD is a parameter-free, highly interpretable outlier detection algorithm based on empirical copula models. See [BLZB+20] for details.
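
A minimal usage sketch (synthetic data for illustration):

>>> import numpy as np
>>> from pyod.models.copod import COPOD
>>> X = np.random.RandomState(0).randn(500, 4)
>>> clf = COPOD(contamination=0.1)
>>> _ = clf.fit(X)
>>> y_pred = clf.predict(X)        # binary outlier labels
>>> proba = clf.predict_proba(X)   # outlier probabilities in [0, 1]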

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_jobsoptional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#
Predict raw anomaly score of X using the fitted detector.

For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

explain_outlier(ind, columns=None, cutoffs=None, feature_names=None, file_name=None, file_type=None)[source]#

Plot dimensional outlier graph for a given data point within the dataset.

Parameters#

indint

The index of the data point one wishes to obtain a dimensional outlier graph for.

columnslist

Specify a list of features/dimensions for plotting. If not specified, use all features.

cutoffslist of floats in (0., 1), optional (default=[0.95, 0.99])

The significance cutoff bands of the dimensional outlier graph.

feature_nameslist of strings

The display names of all columns of the dataset, to show on the x-axis of the plot.

file_namestring

The name to save the figure

file_typestring

The file type to save the figure

Returns#

Plotmatplotlib plot

The dimensional outlier graph for data point with index ind.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.copod.skew(X, axis=0)[source]#

pyod.models.deep_svdd module#

Deep One-Class Classification for outlier detection

class pyod.models.deep_svdd.DeepSVDD(c=None, use_ae=False, hidden_neurons=None, hidden_activation='relu', output_activation='sigmoid', optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=1, random_state=None, contamination=0.1)[source]#

Bases: BaseDetector

Deep One-Class Classifier with AutoEncoder (AE) is a type of neural network for learning useful data representations in an unsupervised way. DeepSVDD trains a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data, forcing the network to extract the common factors of variation. Similar to PCA, DeepSVDD could be used to detect outlying objects in the data by calculating the distance from the center. See [BRVG+18] for details.
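
A minimal usage sketch, assuming the optional Keras/TensorFlow dependency is installed (data and settings are illustrative):

>>> import numpy as np
>>> from pyod.models.deep_svdd import DeepSVDD
>>> X = np.random.RandomState(0).randn(1000, 8)
>>> clf = DeepSVDD(epochs=5, verbose=0, random_state=0)
>>> _ = clf.fit(X)
>>> scores = clf.decision_scores_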

Parameters#

c: float, optional (default='forward_nn_pass')

The Deep SVDD center. By default it is calculated from the first forward pass after network initialization. To get repeatable results when c is set to None, set random_state.

use_ae: bool, optional (default=False)

If set to True, the AutoEncoder variant of DeepSVDD is used: the hidden_neurons are mirrored to form the decoder.

hidden_neuronslist, optional (default=[64, 32])

The number of neurons per hidden layer. If use_ae is True, the neurons are mirrored, e.g. [64, 32] -> [64, 32, 32, 64, n_features].

hidden_activationstr, optional (default=’relu’)

Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://keras.io/activations/

output_activationstr, optional (default=’sigmoid’)

Activation function to use for output layer. See https://keras.io/activations/

optimizerstr, optional (default=’adam’)

String (name of optimizer) or optimizer instance. See https://keras.io/optimizers/

epochsint, optional (default=100)

Number of epochs to train the model.

batch_sizeint, optional (default=32)

Number of samples per gradient update.

dropout_ratefloat in (0., 1), optional (default=0.2)

The dropout to be used across all layers.

l2_regularizerfloat in (0., 1), optional (default=0.1)

The regularization strength of activity_regularizer applied on each layer. By default, l2 regularizer is used. See https://keras.io/regularizers/

validation_sizefloat in (0., 1), optional (default=0.1)

The percentage of data to be used for validation.

preprocessingbool, optional (default=True)

If True, apply standardization on the data.

verboseint, optional (default=1)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

  • 2 = one line per epoch.

For verbose >= 1, model summary may be printed.

random_staterandom_state: int, RandomState instance or None, optional

(default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

Attributes#

model_Keras Object

The underlying DeepSVDD model in Keras.

history_: Keras Object

The AutoEncoder training history.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
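
For the ‘linear’ option above, the conversion is a plain min-max scaling of the raw scores; a minimal numpy sketch with made-up score values:

    import numpy as np

    scores = np.array([0.2, 0.5, 0.9, 3.1])  # hypothetical decision scores

    # Min-max scale into [0, 1]; higher values mean more outlying.
    outlier_proba = (scores - scores.min()) / (scores.max() - scores.min())

    # Two-column output: [proba of normal, proba of outliers].
    proba = np.column_stack([1. - outlier_proba, outlier_proba])
    print(proba)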

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.dif module#

Deep Isolation Forest for Anomaly Detection (DIF)

class pyod.models.dif.DIF(batch_size=1000, representation_dim=20, hidden_neurons=None, hidden_activation='tanh', skip_connection=False, n_ensemble=50, n_estimators=6, max_samples=256, contamination=0.1, random_state=None, device=None)[source]#

Bases: BaseDetector

Deep Isolation Forest (DIF) is an extension of iForest. It uses a deep representation ensemble to achieve non-linear isolation on the original data space. See [BXPWW23] for details.

Parameters#

batch_sizeint, optional (default=1000)

Number of samples per gradient update.

representation_dim, int, optional (default=20)

Dimensionality of the representation space.

hidden_neurons, list, optional (default=[64, 32])

The number of neurons per hidden layer, so the network has the structure [n_features, hidden_neurons[0], hidden_neurons[1], …, representation_dim].

hidden_activation, str, optional (default=’tanh’)

Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. Currently only ‘relu’ (nn.ReLU()), ‘sigmoid’ (nn.Sigmoid()), and ‘tanh’ (nn.Tanh()) are supported; see https://pytorch.org/docs/stable/nn.html and pyod/utils/torch_utility.py for details.

skip_connection, boolean, optional (default=False)

If True, apply skip-connection in the neural network structure.

n_ensemble, int, optional (default=50)

The number of deep representation ensemble members.

n_estimators, int, optional (default=6)

The number of isolation trees built for each representation.

max_samples, int, optional (default=256)

The number of samples to draw from X to train each base isolation tree.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

random_stateint or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.

device, ‘cuda’, ‘cpu’, or None, optional (default=None)

If ‘cuda’, use GPU acceleration in torch; if ‘cpu’, use CPU in torch; if None, automatically determine whether a GPU is available.

Attributes#

net_lstlist of torch.Module

The list of representation neural networks.

iForest_lstlist of iForest

The list of instantiated iForest models.

x_reduced_lst: list of numpy array

The list of training data representations.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
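
A minimal usage sketch (illustrative; the synthetic data and the reduced ensemble size are assumptions made to keep the example fast):

    import numpy as np
    from pyod.models.dif import DIF

    rng = np.random.RandomState(0)
    X_train = np.r_[rng.randn(300, 8), rng.uniform(-6., 6., size=(30, 8))]

    # n_ensemble is reduced from the default of 50 to keep the sketch quick.
    clf = DIF(n_ensemble=5, n_estimators=6, max_samples=256, random_state=0)
    clf.fit(X_train)

    scores = clf.decision_function(X_train)  # higher = more abnormal
    print(scores[:5])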

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.ecod module#

Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions (ECOD)

class pyod.models.ecod.ECOD(contamination=0.1, n_jobs=1)[source]#

Bases: BaseDetector

ECOD class for Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. ECOD is a parameter-free, highly interpretable outlier detection algorithm based on empirical CDFs. See [BLZB+22] for details.

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_jobsoptional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
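
A minimal usage sketch (illustrative; the synthetic data is an assumption):

    import numpy as np
    from pyod.models.ecod import ECOD

    rng = np.random.RandomState(1)
    X_train = np.r_[rng.randn(200, 4), rng.uniform(-6., 6., size=(20, 4))]
    X_test = rng.randn(10, 4)

    clf = ECOD(contamination=0.1)  # no other model parameters to tune
    clf.fit(X_train)

    print(clf.labels_[:5])      # binary training labels
    print(clf.predict(X_test))  # 0 = inlier, 1 = outlier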

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

explain_outlier(ind, columns=None, cutoffs=None, feature_names=None, file_name=None, file_type=None)[source]#

Plot dimensional outlier graph for a given data point within the dataset.

Parameters#

indint

The index of the data point one wishes to obtain a dimensional outlier graph for.

columnslist

Specify a list of features/dimensions for plotting. If not specified, use all features.

cutoffslist of floats in (0., 1), optional (default=[0.95, 0.99])

The significance cutoff bands of the dimensional outlier graph.

feature_nameslist of strings

The display names of all columns of the dataset, to show on the x-axis of the plot.

file_namestring

The file name used to save the figure.

file_typestring

The file type used to save the figure.

Returns#

Plotmatplotlib plot

The dimensional outlier graph for data point with index ind.
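
Continuing from a fitted ECOD instance such as clf in the sketch above, a hypothetical call (requires matplotlib for plotting):

    # Plot the per-dimension outlier scores of the first training sample
    # against the default 95% / 99% significance cutoff bands.
    clf.explain_outlier(ind=0, cutoffs=[0.95, 0.99])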

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.ecod.skew(X, axis=0)[source]#

pyod.models.feature_bagging module#

Feature bagging detector

class pyod.models.feature_bagging.FeatureBagging(base_estimator=None, n_estimators=10, contamination=0.1, max_features=1.0, bootstrap_features=False, check_detector=True, check_estimator=False, n_jobs=1, random_state=None, combination='average', verbose=0, estimator_params=None)[source]#

Bases: BaseDetector

A feature bagging detector is a meta estimator that fits a number of base detectors on various sub-samples of the dataset and uses averaging or other combination methods to improve predictive accuracy and control over-fitting.

The sub-sample size is always the same as the original input sample size but the features are randomly sampled from half of the features to all features.

By default, LOF is used as the base estimator. However, any estimator could be used as the base estimator, such as kNN and ABOD.

Feature bagging first constructs n subsamples by randomly selecting a subset of features, which induces diversity among the base estimators.

Finally, the prediction score is generated by averaging/taking the maximum of all base detectors. See [BLK05] for details.

Parameters#

base_estimatorobject or None, optional (default=None)

The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a LOF detector.

n_estimatorsint, optional (default=10)

The number of base estimators in the ensemble.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

max_featuresint or float, optional (default=1.0)

The number of features to draw from X to train each base estimator.

  • If int, then draw max_features features.

  • If float, then draw max_features * X.shape[1] features.

bootstrap_featuresbool, optional (default=False)

Whether features are drawn with replacement.

check_detectorbool, optional (default=True)

If set to True, check whether the base estimator is consistent with pyod standard.

check_estimatorbool, optional (default=False)

If set to True, check whether the base estimator is consistent with sklearn standard.

Deprecated since version 0.6.9: check_estimator will be removed in pyod 0.8.0.; it will be replaced by check_detector.

n_jobsoptional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_stateint, RandomState or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

combinationstr, optional (default=’average’)

The method of combination:

  • if ‘average’: take the average of all detectors

  • if ‘max’: take the maximum scores of all detectors

verboseint, optional (default=0)

Controls the verbosity of the building process.

estimator_paramsdict, optional (default=None)

The list of attributes to use as parameters when instantiating a new base estimator. If none are given, default parameters are used.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
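
A minimal usage sketch (illustrative): the default LOF base detector is swapped for a kNN detector, which the class description above names as a valid choice; the data and parameter values are assumptions.

    import numpy as np
    from pyod.models.feature_bagging import FeatureBagging
    from pyod.models.knn import KNN

    rng = np.random.RandomState(2)
    X_train = np.r_[rng.randn(200, 6), rng.uniform(-6., 6., size=(20, 6))]

    # Combine the per-detector scores by taking their maximum.
    clf = FeatureBagging(base_estimator=KNN(), n_estimators=10,
                         combination='max', random_state=2)
    clf.fit(X_train)

    print(clf.decision_scores_[:5])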

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.gmm module#

Outlier detection based on Gaussian Mixture Model (GMM).

class pyod.models.gmm.GMM(n_components=1, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, contamination=0.1)[source]#

Bases: BaseDetector

Wrapper of scikit-learn Gaussian Mixture Model with more functionalities. Unsupervised Outlier Detection.

See [BAgg15] Chapter 2 for details.

Parameters#

n_componentsint, default=1

The number of mixture components.

covariance_type{‘full’, ‘tied’, ‘diag’, ‘spherical’}, default=’full’

String describing the type of covariance parameters to use.

tolfloat, default=1e-3

The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.

reg_covarfloat, default=1e-6

Non-negative regularization added to the diagonal of covariance. Ensures that the covariance matrices are all positive.

max_iterint, default=100

The number of EM iterations to perform.

n_initint, default=1

The number of initializations to perform. The best results are kept.

init_params{‘kmeans’, ‘random’}, default=’kmeans’

The method used to initialize the weights, the means and the precisions.

weights_initarray-like of shape (n_components, ), default=None

The user-provided initial weights. If it is None, weights are initialized using the init_params method.

means_initarray-like of shape (n_components, n_features), default=None

The user-provided initial means. If it is None, means are initialized using the init_params method.

precisions_initarray-like, default=None

The user-provided initial precisions (inverse of the covariance matrices). If it is None, precisions are initialized using the ‘init_params’ method.

random_stateint, RandomState instance or None, default=None

Controls the random seed given to the method chosen to initialize the parameters.

warm_startbool, default=False

If ‘warm_start’ is True, the solution of the last fitting is used as initialization for the next call of fit().

verboseint, default=0

Enable verbose output.

verbose_intervalint, default=10

Number of iteration done before the next print.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set.

Attributes#

weights_array-like of shape (n_components,)

The weights of each mixture components.

means_array-like of shape (n_components, n_features)

The mean of each mixture component.

covariances_array-like

The covariance of each mixture component.

precisions_array-like

The precision matrices for each component in the mixture.

precisions_cholesky_array-like

The cholesky decomposition of the precision matrices of each mixture component.

converged_bool

True when convergence was reached in fit(), False otherwise.

n_iter_int

Number of step used by the best fit of EM to reach the convergence.

lower_bound_float

Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
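
A minimal usage sketch (illustrative; the two synthetic Gaussian clusters plus uniform noise are assumptions):

    import numpy as np
    from pyod.models.gmm import GMM

    rng = np.random.RandomState(3)
    X_train = np.r_[rng.randn(150, 2), rng.randn(150, 2) + 5.,
                    rng.uniform(-10., 15., size=(30, 2))]

    # Points with low likelihood under the fitted two-component mixture
    # receive high outlier scores.
    clf = GMM(n_components=2, covariance_type='full', contamination=0.1,
              random_state=3)
    clf.fit(X_train)

    print(clf.means_)       # fitted component means
    print(clf.labels_[:5])  # 0 = inlier, 1 = outlier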

property converged_#

True when convergence was reached in fit(), False otherwise. Decorator for scikit-learn Gaussian Mixture Model attributes.

property covariances_#

The covariance of each mixture component. Decorator for scikit-learn Gaussian Mixture Model attributes.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

property lower_bound_#

Lower bound value on the log-likelihood of the best fit of EM. Decorator for scikit-learn Gaussian Mixture Model attributes.

property means_#

The mean of each mixture component. Decorator for scikit-learn Gaussian Mixture Model attributes.

property n_iter_#

Number of step used by the best fit of EM to reach the convergence. Decorator for scikit-learn Gaussian Mixture Model attributes.

property precisions_#

The precision matrices for each component in the mixture. Decorator for scikit-learn Gaussian Mixture Model attributes.

property precisions_cholesky_#
The Cholesky decomposition of the precision matrices of each mixture component. Decorator for scikit-learn Gaussian Mixture Model attributes.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

property weights_#

The weights of each mixture components. Decorator for scikit-learn Gaussian Mixture Model attributes.

pyod.models.hbos module#

Histogram-based Outlier Detection (HBOS)

class pyod.models.hbos.HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)[source]#

Bases: BaseDetector

Histogram-based outlier detection (HBOS) is an efficient unsupervised method. It assumes feature independence and calculates the degree of outlyingness by building histograms. See [BGD12] for details.

Two versions of HBOS are supported:

  • Static number of bins: uses a static number of bins for all features.

  • Automatic number of bins: every feature uses a number of bins deemed to be optimal according to the Birge-Rozenblac method ([BBirgeR06]).

Parameters#

n_binsint or string, optional (default=10)

The number of bins. “auto” uses the Birge-Rozenblac method for automatic selection of the optimal number of bins for each feature.

alphafloat in (0, 1), optional (default=0.1)

The regularizer for preventing overflow.

tolfloat in (0, 1), optional (default=0.5)

The parameter to decide the flexibility while dealing with the samples falling outside the bins.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Attributes#

bin_edges_numpy array of shape (n_bins + 1, n_features )

The edges of the bins.

hist_numpy array of shape (n_bins, n_features)

The density of each histogram.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
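
A minimal usage sketch (illustrative data; the parameter values mirror the defaults documented above):

    import numpy as np
    from pyod.models.hbos import HBOS

    rng = np.random.RandomState(4)
    X_train = np.r_[rng.randn(200, 3), rng.uniform(-6., 6., size=(20, 3))]

    # Static binning; pass n_bins="auto" for Birge-Rozenblac selection.
    clf = HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)
    clf.fit(X_train)

    print(clf.bin_edges_.shape)  # (n_bins + 1, n_features)
    print(clf.hist_.shape)       # (n_bins, n_features)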

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.iforest module#

IsolationForest Outlier Detector. Implemented on top of the scikit-learn library.

class pyod.models.iforest.IForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=1, behaviour='old', random_state=None, verbose=0)[source]#

Bases: BaseDetector

Wrapper of scikit-learn Isolation Forest with more functionalities.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. See [BLTZ08, BLTZ12] for details.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.

Parameters#

n_estimatorsint, optional (default=100)

The number of base estimators in the ensemble.

max_samplesint or float, optional (default=”auto”)

The number of samples to draw from X to train each base estimator.

  • If int, then draw max_samples samples.

  • If float, then draw max_samples * X.shape[0] samples.

  • If “auto”, then max_samples=min(256, n_samples).

If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

max_featuresint or float, optional (default=1.0)

The number of features to draw from X to train each base estimator.

  • If int, then draw max_features features.

  • If float, then draw max_features * X.shape[1] features.

bootstrapbool, optional (default=False)

If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.

n_jobsinteger, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

behaviourstr, default=’old’

Behaviour of the decision_function which can be either ‘old’ or ‘new’. Passing behaviour='new' makes the decision_function change to match other anomaly detection algorithm API which will be the default behaviour in the future. As explained in details in the offset_ attribute documentation, the decision_function becomes dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers.

New in version 0.7.0: behaviour is added in 0.7.0 for back-compatibility purpose.

Deprecated since version 0.20: behaviour='old' is deprecated in sklearn 0.20 and will not be possible in 0.22.

Deprecated since version 0.22: behaviour parameter will be deprecated in sklearn 0.22 and removed in 0.24.

Warning

Only applicable for sklearn 0.20 above.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

verboseint, optional (default=0)

Controls the verbosity of the tree building process.

Attributes#

estimators_list of DecisionTreeClassifier

The collection of fitted sub-estimators.

estimators_samples_list of arrays

The subset of drawn samples (i.e., the in-bag samples) for each base estimator.

max_samples_integer

The actual number of samples.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
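
A minimal usage sketch (illustrative; the synthetic data is an assumption):

    import numpy as np
    from pyod.models.iforest import IForest

    rng = np.random.RandomState(5)
    X_train = np.r_[rng.randn(500, 4), rng.uniform(-8., 8., size=(50, 4))]

    clf = IForest(n_estimators=100, max_samples='auto', contamination=0.1,
                  random_state=5)
    clf.fit(X_train)

    scores = clf.decision_function(X_train)  # higher = more anomalous
    print(scores[-5:])  # the injected uniform points tend to score high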

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

property estimators_#

The collection of fitted sub-estimators. Decorator for scikit-learn Isolation Forest attributes.

property estimators_features_#

The indices of the subset of features used to train the estimators. Decorator for scikit-learn Isolation Forest attributes.

property estimators_samples_#

The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Decorator for scikit-learn Isolation Forest attributes.

property feature_importances_#

The impurity-based feature importance. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

impurity-based feature importance can be misleading for high cardinality features (many unique values). See https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html as an alternative.

Returns#

feature_importances_ndarray of shape (n_features,)

The values of this array sum to 1, unless all trees are single node trees consisting of only the root node, in which case it will be an array of zeros.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

property max_samples_#

The actual number of samples. Decorator for scikit-learn Isolation Forest attributes.

property n_features_in_#

The number of features seen during the fit. Decorator for scikit-learn Isolation Forest attributes.

property offset_#

Offset used to define the decision function from the raw scores. Decorator for scikit-learn Isolation Forest attributes.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.inne module#

Isolation-based anomaly detection using nearest-neighbor ensembles. Part of the code is adapted from https://github.com/xhan97/inne

class pyod.models.inne.INNE(n_estimators=200, max_samples='auto', contamination=0.1, random_state=None)[source]#

Bases: BaseDetector

Isolation-based anomaly detection using nearest-neighbor ensembles.

The INNE algorithm uses the nearest neighbour ensemble to isolate anomalies. It partitions the data space into regions using a subsample and determines an isolation score for each region. As each region adapts to the local distribution, the calculated isolation score is a local measure relative to the local neighbourhood, enabling it to detect both global and local anomalies. INNE has linear time complexity and can efficiently handle large and high-dimensional datasets with complex distributions.

See [BBTA+18] for details.

Parameters#

n_estimatorsint, default=200

The number of base estimators in the ensemble.

max_samplesint or float, optional (default=”auto”)

The number of samples to draw from X to train each base estimator.

  • If int, then draw max_samples samples.

  • If float, then draw max_samples * X.shape[0] samples.

  • If “auto”, then max_samples=min(8, n_samples).

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

max_samples_integer

The actual number of samples.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.kde module#

Kernel Density Estimation (KDE) for Unsupervised Outlier Detection.

class pyod.models.kde.KDE(contamination=0.1, bandwidth=1.0, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None)[source]#

Bases: BaseDetector

KDE class for outlier detection.

For an observation, its negative log probability density could be viewed as the outlying score.

See [BLLP07] for details.
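
A minimal usage sketch with illustrative numpy data; since the score is a negative log probability density, remote points receive higher scores:

    import numpy as np
    from pyod.models.kde import KDE

    rng = np.random.RandomState(0)
    X_train = rng.randn(300, 2)                  # dense inlier cloud
    X_test = np.array([[0.0, 0.0], [5.0, 5.0]])  # one typical point, one remote point

    detector = KDE(bandwidth=1.0, contamination=0.1)
    detector.fit(X_train)

    # The remote point should receive the larger score (lower log-density).
    print(detector.decision_function(X_test))
    print(detector.predict(X_test))  # expected: [0 1]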

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

bandwidthfloat, optional (default=1.0)

The bandwidth of the kernel.

algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’}, optional

Algorithm used to compute the kernel density estimator:

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

leaf_sizeint, optional (default = 30)

Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

metricstring or callable, default ‘minkowski’

metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Distance matrices are not supported.

Valid values for metric are:

  • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

  • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for scipy.spatial.distance for details on these metrics.

metric_paramsdict, optional (default = None)

Additional keyword arguments for the metric function.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.knn module#

k-Nearest Neighbors Detector (kNN)

class pyod.models.knn.KNN(contamination=0.1, n_neighbors=5, method='largest', radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, **kwargs)[source]#

Bases: BaseDetector

kNN class for outlier detection. For an observation, its distance to its kth nearest neighbor could be viewed as the outlying score. It could be viewed as a way to measure the density. See [BAP02, BRRS00] for details.

Three kNN detectors are supported:

  • ‘largest’: use the distance to the kth neighbor as the outlier score

  • ‘mean’: use the average of all k neighbors as the outlier score

  • ‘median’: use the median of the distance to k neighbors as the outlier score
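
A brief sketch comparing the three variants on the same illustrative numpy data; only the constructor and decision_scores_ documented below are assumed:

    import numpy as np
    from pyod.models.knn import KNN

    rng = np.random.RandomState(1)
    X = np.vstack([rng.randn(200, 2), rng.uniform(-8.0, 8.0, size=(10, 2))])

    for method in ('largest', 'mean', 'median'):
        detector = KNN(n_neighbors=5, method=method, contamination=0.05)
        detector.fit(X)
        # Each variant aggregates the k-neighbor distances differently.
        print(method, detector.decision_scores_[:3])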

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_neighborsint, optional (default = 5)

Number of neighbors to use by default for k neighbors queries.

methodstr, optional (default=’largest’)

{‘largest’, ‘mean’, ‘median’}

  • ‘largest’: use the distance to the kth neighbor as the outlier score

  • ‘mean’: use the average of all k neighbors as the outlier score

  • ‘median’: use the median of the distance to k neighbors as the outlier score

radiusfloat, optional (default = 1.0)

Range of parameter space to use by default for radius_neighbors queries.

algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘brute’ will use a brute-force search.

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

Deprecated since version 0.7.4: algorithm is deprecated in PyOD 0.7.4 and will be removed in 0.7.6. BallTree will always be used for consistency.

leaf_sizeint, optional (default = 30)

Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

metricstring or callable, default ‘minkowski’

metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Distance matrices are not supported.

Valid values for metric are:

  • from scikit-learn: [‘cityblock’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

  • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for scipy.spatial.distance for details on these metrics.

pinteger, optional (default = 2)

Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances

metric_paramsdict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobsint, optional (default = 1)

The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.kpca module#

Kernel Principal Component Analysis (KPCA) Outlier Detector

class pyod.models.kpca.KPCA(contamination=0.1, n_components=None, n_selected_components=None, kernel='rbf', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, eigen_solver='auto', tol=0, max_iter=None, remove_zero_eig=False, copy_X=True, n_jobs=None, sampling=False, subset_size=20, random_state=None)[source]#

Bases: BaseDetector

KPCA class for outlier detection.

PCA is performed on the feature space uniquely determined by the kernel, and the reconstruction error on the feature space is used as the anomaly score.

See [BHof07] for details: Heiko Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863-874, 2007. https://www.sciencedirect.com/science/article/pii/S0031320306003414
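
A minimal usage sketch with illustrative numpy data; the reconstruction error in the kernel feature space serves as the anomaly score:

    import numpy as np
    from pyod.models.kpca import KPCA

    rng = np.random.RandomState(7)
    X = np.vstack([rng.randn(150, 3), rng.uniform(-5.0, 5.0, size=(8, 3))])

    detector = KPCA(kernel='rbf', n_components=10, contamination=0.05,
                    random_state=7)
    detector.fit(X)

    print(detector.decision_scores_.shape)  # (158,)
    print(int(detector.labels_.sum()))      # about 158 * 0.05, i.e. ~8 flagged points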

Parameters#

n_componentsint, optional (default=None)

Number of components. If None, all non-zero components are kept.

n_selected_componentsint, optional (default=None)

Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.

kernelstring {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’, ‘precomputed’}, optional (default=’rbf’)

Kernel used for PCA.

gammafloat, optional (default=None)

Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features.

degreeint, optional (default=3)

Degree for poly kernels. Ignored by other kernels.

coef0float, optional (default=1)

Independent term in poly and sigmoid kernels. Ignored by other kernels.

kernel_paramsdict, optional (default=None)

Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.

alphafloat, optional (default=1.0)

Hyperparameter of the ridge regression that learns the inverse transform (when inverse_transform=True).

eigen_solverstring, {‘auto’, ‘dense’, ‘arpack’, ‘randomized’}, default=’auto’

Select the eigensolver to use. If n_components is much less than the number of training samples, randomized (or arpack, to a smaller extent) may be more efficient than the dense eigensolver. Randomized SVD is performed according to the method of Halko et al.

auto :

the solver is selected by a default policy based on n_samples (the number of training samples) and n_components: if the number of components to extract is less than 10 (strict) and the number of samples is more than 200 (strict), the ‘arpack’ method is enabled. Otherwise the exact full eigenvalue decomposition is computed and optionally truncated afterwards (‘dense’ method).

dense :

run exact full eigenvalue decomposition calling the standard LAPACK solver via scipy.linalg.eigh, and select the components by postprocessing.

arpack :

run SVD truncated to n_components calling ARPACK solver using scipy.sparse.linalg.eigsh. It requires strictly 0 < n_components < n_samples

randomized :

run randomized SVD. The implementation selects eigenvalues based on their modulus; therefore, using this method can lead to unexpected results if the kernel is not positive semi-definite.

tolfloat, optional (default=0)

Convergence tolerance for arpack. If 0, optimal value will be chosen by arpack.

max_iterint, optional (default=None)

Maximum number of iterations for arpack. If None, optimal value will be chosen by arpack.

remove_zero_eigbool, optional (default=False)

If True, then all components with zero eigenvalues are removed, so that the number of components in the output may be < n_components (and sometimes even zero due to numerical instability). When n_components is None, this parameter is ignored and components with zero eigenvalues are removed regardless.

copy_Xbool, optional (default=True)

If True, input X is copied and stored by the model in the X_fit_ attribute. If no further changes will be done to X, setting copy_X=False saves memory by storing a reference.

n_jobsint, optional (default=None)

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

samplingbool, optional (default=False)

If True, a subset of the dataset is sampled once, in order to reduce time complexity while maintaining detection performance.

subset_sizefloat in (0., 1.0) or int (0, n_samples), optional (default=20)

If sampling is True, specifies the size of the subset to draw.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

class pyod.models.kpca.PyODKernelPCA(n_components=None, kernel='rbf', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False, eigen_solver='auto', tol=0, max_iter=None, remove_zero_eig=False, copy_X=True, n_jobs=None, random_state=None)[source]#

Bases: KernelPCA

A wrapper class for scikit-learn’s KernelPCA class.
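
A short sketch of the wrapper in isolation, on illustrative numpy data; it behaves like scikit-learn’s KernelPCA and is used internally by KPCA:

    import numpy as np
    from pyod.models.kpca import PyODKernelPCA

    X = np.random.RandomState(3).randn(50, 4)

    kpca = PyODKernelPCA(n_components=2, kernel='rbf')
    X_new = kpca.fit_transform(X)
    print(X_new.shape)  # (50, 2)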

fit(X, y=None)#

Fit the model from data in X.

Parameters#

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Returns the instance itself.

fit_transform(X, y=None, **params)#

Fit the model from data in X and transform X.

Parameters#

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

yIgnored

Not used, present for API consistency by convention.

**paramskwargs

Parameters (keyword arguments) and values passed to the fit_transform instance.

Returns#

X_newndarray of shape (n_samples, n_components)

The transformed data.

property get_centerer#

Return a protected member _centerer.

get_feature_names_out(input_features=None)#

Get output feature names for transformation.

The feature names out will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: [“class_name0”, “class_name1”, “class_name2”].

Parameters#

input_featuresarray-like of str or None, default=None

Only used to validate feature names with the names seen in fit.

Returns#

feature_names_outndarray of str objects

Transformed feature names.

property get_kernel#

Return a protected member _get_kernel.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns#

routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters#

deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsdict

Parameter names mapped to their values.

inverse_transform(X)#

Transform X back to original space.

inverse_transform approximates the inverse transformation using a learned pre-image. The pre-image is learned by kernel ridge regression of the original data on their low-dimensional representation vectors.

Note

When users want to compute inverse transformation for ‘linear’ kernel, it is recommended that they use PCA instead. Unlike PCA, KernelPCA’s inverse_transform does not reconstruct the mean of data when ‘linear’ kernel is used due to the use of centered kernel.

Parameters#

X{array-like, sparse matrix} of shape (n_samples, n_components)

Low-dimensional representation vectors, where n_samples is the number of samples and n_components is the number of components.

Returns#

X_newndarray of shape (n_samples, n_features)

The reconstructed data in the original space.

References#

Bakır, Gökhan H., Jason Weston, and Bernhard Schölkopf. “Learning to find pre-images.” Advances in neural information processing systems 16 (2004): 449-456.

set_output(*, transform=None)#

Set output container.

See the scikit-learn example “Introducing the set_output API” for an example of how to use the API.

Parameters#

transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

New in version 1.4: “polars” option was added.

Returns#

selfestimator instance

Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters#

**paramsdict

Estimator parameters.

Returns#

selfestimator instance

Estimator instance.

transform(X)#

Transform X.

Parameters#

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

Returns#

X_newndarray of shape (n_samples, n_components)

The transformed data.

pyod.models.lmdd module#

Linear Model Deviation-based outlier detection (LMDD).

class pyod.models.lmdd.LMDD(contamination=0.1, n_iter=50, dis_measure='aad', random_state=None)[source]#

Bases: BaseDetector

Linear Method for Deviation-based Outlier Detection.

LMDD employs the concept of the smoothing factor, which indicates how much the dissimilarity can be reduced by removing a subset of elements from the dataset. Read more in [BAAR96].

Note: this implementation has minor modifications to make it output scores instead of labels.
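
A minimal usage sketch with illustrative numpy data; a small n_iter is used here only to keep the run fast:

    import numpy as np
    from pyod.models.lmdd import LMDD

    rng = np.random.RandomState(5)
    X = np.vstack([rng.randn(100, 2), rng.uniform(-6.0, 6.0, size=(5, 2))])

    # 'aad' (average absolute deviation) is the default dissimilarity measure.
    detector = LMDD(n_iter=20, dis_measure='aad', contamination=0.05,
                    random_state=5)
    detector.fit(X)
    print(detector.labels_)  # 0 for inliers, 1 for flagged outliers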

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_iterint, optional (default=50)

Number of iterations; in each iteration, the process is repeated after randomizing the order of the input. Note that n_iter strongly affects accuracy: higher values yield better accuracy but longer execution times.

dis_measure: str, optional (default=’aad’)

Dissimilarity measure to be used in calculating the smoothing factor for points, options available:

  • ‘aad’: Average Absolute Deviation

  • ‘var’: Variance

  • ‘iqr’: Interquartile Range

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.loda module#

Loda: Lightweight on-line detector of anomalies. Adapted from tilitools (https://github.com/nicococo/tilitools).

class pyod.models.loda.LODA(contamination=0.1, n_bins=10, n_random_cuts=100)[source]#

Bases: BaseDetector

Loda: Lightweight on-line detector of anomalies. See [BPevny16] for more information.

Two versions of LODA are supported:

  • Static number of bins: uses a static number of bins for all random cuts.

  • Automatic number of bins: every random cut uses a number of bins deemed to be optimal according to the Birge-Rozenblac method ([BBirgeR06]).
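
A sketch of both modes on illustrative numpy data; n_bins='auto' enables the Birge-Rozenblac bin selection:

    import numpy as np
    from pyod.models.loda import LODA

    rng = np.random.RandomState(9)
    X = np.vstack([rng.randn(300, 5), rng.uniform(-7.0, 7.0, size=(15, 5))])

    static = LODA(n_bins=10, n_random_cuts=100).fit(X)    # fixed number of bins
    auto = LODA(n_bins='auto', n_random_cuts=100).fit(X)  # Birge-Rozenblac selection

    # Both expose the same fitted attributes.
    print(static.decision_scores_[:3])
    print(auto.decision_scores_[:3])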

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_binsint or string, optional (default = 10)

The number of bins for the histogram. If set to “auto”, the Birge-Rozenblac method will be used to automatically determine the optimal number of bins.

n_random_cutsint, optional (default = 100)

The number of random cuts.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being an outlier. Two approaches are possible:

  1. Simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.

  2. Use unifying scores; see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, returns the outlier probability, ranging in [0,1]. Note the output shape depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.lof module#

Local Outlier Factor (LOF). Implemented using the scikit-learn library.

class pyod.models.lof.LOF(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination=0.1, n_jobs=1, novelty=True)[source]#

Bases: BaseDetector

Wrapper of scikit-learn LOF Class with more functionalities. Unsupervised Outlier Detection using Local Outlier Factor (LOF).

The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers. See [BBKNS00] for details.
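
A minimal usage sketch with illustrative numpy data; the wrapper exposes the same fit/decision_scores_ API as the other detectors:

    import numpy as np
    from pyod.models.lof import LOF

    rng = np.random.RandomState(11)
    X = np.vstack([rng.randn(200, 2), rng.uniform(-8.0, 8.0, size=(10, 2))])

    detector = LOF(n_neighbors=20, contamination=0.05)
    detector.fit(X)

    print(detector.n_neighbors_)          # actual number of neighbors used
    print(detector.decision_scores_[:5])  # higher scores are more abnormal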

Parameters#

n_neighborsint, optional (default=20)

Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.

algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘brute’ will use a brute-force search.

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_sizeint, optional (default=30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

metricstring or callable, default ‘minkowski’

metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If ‘precomputed’, the training input X is expected to be a distance matrix.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Valid values for metric are:

  • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

  • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for scipy.spatial.distance for details on these metrics: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

pinteger, optional (default = 2)

Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances

metric_paramsdict, optional (default = None)

Additional keyword arguments for the metric function.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

n_jobsint, optional (default = 1)

The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.

noveltybool (default=True)

In scikit-learn, LocalOutlierFactor is by default only meant to be used for outlier detection (novelty=False); this wrapper sets novelty=True. Set novelty to True if you want to use LocalOutlierFactor for novelty detection. In this case, be aware that you should only use predict, decision_function and score_samples on new unseen data and not on the training set.

Attributes#

n_neighbors_int

The actual number of neighbors used for kneighbors queries.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.
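For instance, a hedged sketch of retrieving labels together with confidence (assuming a fitted detector clf and a test array X_test as in the example above):

>>> labels, confidence = clf.predict(X_test, return_confidence=True)
>>> # labels is an (n_samples,) array of 0/1; confidence is (n_samples,) in [0, 1]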

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
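As an illustrative sketch of the two conversion modes (assuming the same fitted clf and X_test as above; both return an (n_samples, 2) array):

>>> proba_linear = clf.predict_proba(X_test, method='linear')  # min-max scaling of raw scores
>>> proba_unify = clf.predict_proba(X_test, method='unify')    # unified scores per [BKKSZ11]
>>> p_outlier = proba_linear[:, 1]  # column 1 holds the outlier probability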

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.loci module#

Local Correlation Integral (LOCI). Part of the code is adapted from https://github.com/Cloudy10/loci

class pyod.models.loci.LOCI(contamination=0.1, alpha=0.5, k=3)[source]#

Bases: BaseDetector

Local Correlation Integral.

LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters), and offers the following advantages and novelties: (a) It provides an automatic, data-dictated cut-off to determine whether a point is an outlier; in contrast, previous methods force users to pick cut-offs without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score. (c) It can be computed as quickly as the best previous methods. Read more in [BPKGF03].

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

alphafloat, default = 0.5

The neighbourhood parameter measures how large a neighbourhood should be considered “local”.

k: int, default = 3

An outlier cutoff threshold for determining whether or not a point should be considered an outlier.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Examples#

>>> from pyod.models.loci import LOCI
>>> from pyod.utils.data import generate_data
>>> n_train = 50
>>> n_test = 50
>>> contamination = 0.1
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=n_train, n_test=n_test,
...     contamination=contamination, random_state=42)
>>> clf = LOCI()
>>> clf.fit(X_train)
LOCI(alpha=0.5, contamination=0.1, k=None)
decision_function(X)[source]#

Predict raw anomaly scores of X using the fitted detector.

The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit the model using X as training data.

Parameters#

Xarray, shape (n_samples, n_features)

Training data.

yIgnored

Not used, present for API consistency by convention.

Returns#

self : object

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.lunar module#

LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks

class pyod.models.lunar.LUNAR(model_type='WEIGHT', n_neighbours=5, negative_sampling='MIXED', val_size=0.1, scaler=MinMaxScaler(), epsilon=0.1, proportion=1.0, n_epochs=200, lr=0.001, wd=0.1, verbose=0, contamination=0.1)[source]#

Bases: BaseDetector

LUNAR class for outlier detection. See https://www.aaai.org/AAAI22Papers/AAAI-51.GoodgeA.pdf for details. For an observation, its ordered list of distances to its k nearest neighbours is input to a neural network, with one of the following outputs:

  1. SCORE_MODEL: network directly outputs the anomaly score.

  2. WEIGHT_MODEL: network outputs a set of weights for the k distances; the anomaly score is then the sum of weighted distances.

See [BGHNN22] for details.

Parameters#

model_type: str in [‘WEIGHT’, ‘SCORE’], optional (default = ‘WEIGHT’)

Whether to use WEIGHT_MODEL or SCORE_MODEL for anomaly scoring.

n_neighbours: int, optional (default = 5)

Number of neighbors to use by default for k neighbors queries.

negative_sampling: str in [‘UNIFORM’, ‘SUBSPACE’, ‘MIXED’], optional (default = ‘MIXED’)

Type of negative samples to use between:

  • ‘UNIFORM’: uniformly distributed samples

  • ‘SUBSPACE’: subspace perturbation (additive random noise in a subset of features)

  • ‘MIXED’: a combination of both types of samples

val_size: float in [0,1], optional (default = 0.1)

Proportion of samples to be used for model validation

scaler: object in {StandardScaler(), MinMaxScaler()}, optional (default = MinMaxScaler())

Method of data normalization

epsilon: float, optional (default = 0.1)

Hyper-parameter for the generation of negative samples. A smaller epsilon results in negative samples more similar to normal samples.

proportion: float, optional (default = 1.0)

Hyper-parameter for the proportion of negative samples to use relative to the number of normal training samples.

n_epochs: int, optional (default = 200)

Number of epochs to train neural network.

lr: float, optional (default = 0.001)

Learning rate.

wd: float, optional (default = 0.1)

Weight decay.

verbose: int in {0,1}, optional (default = 0)

To view or hide training progress

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
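Examples#

A minimal sketch (hedged: LUNAR trains a small neural network, so a deep learning backend such as PyTorch is assumed to be installed; the tiny epoch count is purely for illustration):

>>> from pyod.models.lunar import LUNAR
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=200, n_test=50,
...     contamination=0.1, random_state=42)
>>> clf = LUNAR(n_neighbours=5, n_epochs=10, verbose=0)
>>> clf = clf.fit(X_train)
>>> test_scores = clf.decision_function(X_test)  # distances-to-neighbours fed to the network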

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is assumed to be 0 for all training samples.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Overwritten with 0 for all training samples (assumed to be normal).

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.lscp module#

Locally Selective Combination of Parallel Outlier Ensembles (LSCP). Adapted from the original implementation.

class pyod.models.lscp.LSCP(detector_list, local_region_size=30, local_max_features=1.0, n_bins=10, random_state=None, contamination=0.1)[source]#

Bases: BaseDetector

Locally Selective Combination of Parallel Outlier Ensembles

LSCP is an unsupervised parallel outlier detection ensemble which selects competent detectors in the local region of a test instance. This implementation uses an Average of Maximum strategy. First, a heterogeneous list of base detectors is fit to the training data, and a pseudo ground truth for each training instance is generated by taking the maximum outlier score.

For each test instance:

1) The local region is defined to be the set of nearest training points in randomly sampled feature subspaces which occur more frequently than a defined threshold over multiple iterations.

2) Using the local region, a local pseudo ground truth is defined and the Pearson correlation is calculated between each base detector’s training outlier scores and the pseudo ground truth.

3) A histogram is built out of the Pearson correlation scores; detectors in the largest bin are selected as competent base detectors for the given test instance.

4) The average outlier score of the selected competent detectors is taken to be the final score.

See [BZNHL19] for details.

Parameters#

detector_listList, length must be greater than 1

Base unsupervised outlier detectors from PyOD. (Note: requires fit and decision_function methods)

local_region_sizeint, optional (default=30)

Number of training points to consider in each iteration of the local region generation process (30 by default).

local_max_featuresfloat in (0.5, 1.), optional (default=1.0)

Maximum proportion of number of features to consider when defining the local region (1.0 by default).

n_binsint, optional (default=10)

Number of bins to use when selecting the local region

random_stateRandomState, optional (default=None)

A random number generator instance to define the state of the random permutations generator.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (0.1 by default).

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Examples#

>>> from pyod.utils.data import generate_data
>>> from pyod.utils.utility import standardizer
>>> from pyod.models.lscp import LSCP
>>> from pyod.models.lof import LOF
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=50, n_test=50,
...     contamination=0.1, random_state=42)
>>> X_train, X_test = standardizer(X_train, X_test)
>>> detector_list = [LOF(), LOF()]
>>> clf = LSCP(detector_list)
>>> clf.fit(X_train)
LSCP(...)
decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.mad module#

Median Absolute Deviation (MAD) Algorithm. Strictly for univariate data.

class pyod.models.mad.MAD(threshold=3.5, contamination=0.1)[source]#

Bases: BaseDetector

Median Absolute Deviation: measures the distance between data points and the median, expressed in units of the median absolute deviation. See [BIH93] for details.

Parameters#

thresholdfloat, optional (default=3.5)

The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
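Examples#

A minimal sketch (hedged: MAD is strictly univariate, so the input must be a single column; the values here are made up for illustration):

>>> import numpy as np
>>> from pyod.models.mad import MAD
>>> X = np.array([1.0, 1.2, 0.9, 1.1, 8.0]).reshape(-1, 1)  # shape (n_samples, 1)
>>> clf = MAD(threshold=3.5)
>>> clf = clf.fit(X)
>>> scores = clf.decision_scores_  # modified z-scores; 8.0 should score far above 3.5
>>> labels = clf.labels_           # 1 flags observations beyond the threshold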

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator. Note that n_features must equal 1.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples. Note that n_features must equal 1.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.mcd module#

Outlier Detection with Minimum Covariance Determinant (MCD)

class pyod.models.mcd.MCD(contamination=0.1, store_precision=True, assume_centered=False, support_fraction=None, random_state=None)[source]#

Bases: BaseDetector

Detecting outliers in a Gaussian distributed dataset using Minimum Covariance Determinant (MCD): robust estimator of covariance.

The Minimum Covariance Determinant covariance estimator is to be applied on Gaussian-distributed data, but could still be relevant on data drawn from a unimodal, symmetric distribution. It is not meant to be used with multi-modal data (the algorithm used to fit a MinCovDet object is likely to fail in such a case). One should consider projection pursuit methods to deal with multi-modal datasets.

First fit a minimum covariance determinant model, and then compute the Mahalanobis distance as the outlier degree of the data.

See [BHR04, BRD99] for details.

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

store_precisionbool

Specify if the estimated precision is stored.

assume_centeredbool

If True, the support of the robust location and the covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful to work with data whose mean is significantly equal to zero but is not exactly zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.

support_fractionfloat, 0 < support_fraction < 1

The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: [n_sample + n_features + 1] / 2

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

raw_location_array-like, shape (n_features,)

The raw robust estimated location before correction and re-weighting.

raw_covariance_array-like, shape (n_features, n_features)

The raw robust estimated covariance before correction and re-weighting.

raw_support_array-like, shape (n_samples,)

A mask of the observations that have been used to compute the raw robust estimates of location and shape, before correction and re-weighting.

location_array-like, shape (n_features,)

Estimated robust location

covariance_array-like, shape (n_features, n_features)

Estimated robust covariance matrix

precision_array-like, shape (n_features, n_features)

Estimated pseudo inverse matrix. (stored only if store_precision is True)

support_array-like, shape (n_samples,)

A mask of the observations that have been used to compute the robust estimates of location and shape.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted. These are the Mahalanobis distances of the training set observations (on which fit is called).

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
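Examples#

A minimal sketch (hedged: generate_data as in the other examples; MCD assumes roughly Gaussian, unimodal data, which generate_data produces):

>>> from pyod.models.mcd import MCD
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=200, n_test=50,
...     contamination=0.1, random_state=42)
>>> clf = MCD(contamination=0.1, random_state=42)
>>> clf = clf.fit(X_train)
>>> test_scores = clf.decision_function(X_test)  # Mahalanobis-based outlier degrees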

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.mo_gaal module#

Multiple-Objective Generative Adversarial Active Learning. Part of the code is adapted from https://github.com/leibinghe/GAAL-based-outlier-detection

class pyod.models.mo_gaal.MO_GAAL(k=10, stop_epochs=20, lr_d=0.01, lr_g=0.0001, momentum=0.9, contamination=0.1)[source]#

Bases: BaseDetector

Multi-Objective Generative Adversarial Active Learning.

MO_GAAL directly generates informative potential outliers to assist the classifier in describing a boundary that can separate outliers from normal data effectively. Moreover, to prevent the generator from falling into the mode collapse problem, the network structure of SO-GAAL is expanded from a single generator (SO-GAAL) to multiple generators with different objectives (MO-GAAL) to generate a reasonable reference distribution for the whole dataset. Read more in [BLLZ+19].

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

kint, optional (default=10)

The number of sub generators.

stop_epochsint, optional (default=20)

The number of epochs of training. The total number of epochs equals three times stop_epochs.

lr_dfloat, optional (default=0.01)

The learning rate of the discriminator.

lr_gfloat, optional (default=0.0001)

The learning rate of the generator.

momentumfloat, optional (default=0.9)

The momentum parameter for SGD.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
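Examples#

A minimal sketch (hedged: MO_GAAL trains k sub-generators adversarially, so a deep learning backend is assumed and training is slow; the small k and stop_epochs below are for illustration only):

>>> from pyod.models.mo_gaal import MO_GAAL
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=200, n_test=50,
...     contamination=0.1, random_state=42)
>>> clf = MO_GAAL(k=3, stop_epochs=2, contamination=0.1)
>>> clf = clf.fit(X_train)
>>> test_scores = clf.decision_function(X_test)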

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.ocsvm module#

One-class SVM detector. Implemented on scikit-learn library.

class pyod.models.ocsvm.OCSVM(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, contamination=0.1)[source]#

Bases: BaseDetector

Wrapper of the scikit-learn one-class SVM class with more functionalities. Unsupervised outlier detection.

Estimate the support of a high-dimensional distribution.

The implementation is based on libsvm. See http://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection and [BScholkopfPST+01].

Parameters#

kernelstring, optional (default=’rbf’)

Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.

nufloat, optional

An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.

degreeint, optional (default=3)

Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.

gammafloat, optional (default=’auto’)

Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used instead.

coef0float, optional (default=0.0)

Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.

tolfloat, optional

Tolerance for stopping criterion.

shrinkingbool, optional

Whether to use the shrinking heuristic.

cache_sizefloat, optional

Specify the size of the kernel cache (in MB).

verbosebool, default: False

Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.

max_iterint, optional (default=-1)

Hard limit on iterations within solver, or -1 for no limit.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Attributes#

support_array-like, shape = [n_SV]

Indices of support vectors.

support_vectors_array-like, shape = [nSV, n_features]

Support vectors.

dual_coef_array, shape = [1, n_SV]

Coefficients of the support vectors in the decision function.

coef_array, shape = [1, n_features]

Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.

coef_ is a readonly property derived from dual_coef_ and support_vectors_.

intercept_array, shape = [1,]

Constant in the decision function.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
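Examples#

A minimal sketch (hedged: standardizing features first, as in the LSCP example above, generally helps RBF-kernel SVMs; nu and gamma are left at their defaults):

>>> from pyod.models.ocsvm import OCSVM
>>> from pyod.utils.data import generate_data
>>> from pyod.utils.utility import standardizer
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=200, n_test=50,
...     contamination=0.1, random_state=42)
>>> X_train, X_test = standardizer(X_train, X_test)
>>> clf = OCSVM(kernel='rbf', contamination=0.1)
>>> clf = clf.fit(X_train)
>>> test_labels = clf.predict(X_test)  # 0 = inlier, 1 = outlier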

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None, sample_weight=None, **params)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

sample_weightarray-like, shape (n_samples,)

Per-sample weights. Rescale C per sample. Higher weights force the classifier to put more emphasis on these points.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
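A minimal sketch of the ‘linear’ conversion described in approach 1 above; this min-max scaling illustrates the idea and is not necessarily pyod’s exact implementation:

    import numpy as np

    def minmax_outlier_proba(decision_scores):
        # Min-max scale the raw outlier scores into [0, 1].
        s = np.asarray(decision_scores, dtype=float)
        p_outlier = (s - s.min()) / (s.max() - s.min())
        # Two columns: [probability of normal, probability of outlier].
        return np.column_stack([1.0 - p_outlier, p_outlier])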

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.pca module#

Principal Component Analysis (PCA) Outlier Detector

class pyod.models.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)[source]#

Bases: BaseDetector

Principal component analysis (PCA) can be used in detecting outliers. PCA is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

In this procedure, the covariance matrix of the data is decomposed into orthogonal vectors, called eigenvectors, each associated with an eigenvalue. The eigenvectors with high eigenvalues capture most of the variance in the data.

Therefore, a low-dimensional hyperplane constructed from the k leading eigenvectors can capture most of the variance in the data. Outliers, however, deviate from normal data points, and this deviation is most visible on the hyperplanes constructed from the eigenvectors with small eigenvalues.

Therefore, outlier scores can be obtained as the sum of a sample's projected distances on all selected eigenvectors. See [BAgg15, BSCSC03] for details.

Score(X) = sum of the weighted Euclidean distances from each sample to the hyperplane constructed by the selected eigenvectors
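A minimal usage sketch; the synthetic data and parameter values are illustrative:

    import numpy as np
    from pyod.models.pca import PCA

    rng = np.random.RandomState(42)
    X_train = rng.randn(200, 5)
    X_train[:10] += 6.0  # shift a few points away from the bulk

    clf = PCA(n_components=3, contamination=0.05)
    clf.fit(X_train)
    print(clf.decision_scores_[:5])  # raw outlier scores of the training data
    print(clf.labels_[:5])           # binary labels from thresholding the scores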

Parameters#

n_componentsint, float, None or string

Number of components to keep. If n_components is not set, all components are kept:

n_components == min(n_samples, n_features)

If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. If 0 < n_components < 1 and svd_solver == ‘full’, the number of components is selected such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. n_components cannot be equal to n_features for svd_solver == ‘arpack’.

n_selected_componentsint, optional (default=None)

Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

copybool (default True)

If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.

whitenbool, optional (default False)

When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

svd_solverstring {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto :

the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

full :

run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing

arpack :

run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1]

randomized :

run randomized SVD by the method of Halko et al.

tolfloat >= 0, optional (default .0)

Tolerance for singular values computed by svd_solver == ‘arpack’.

iterated_powerint >= 0, or ‘auto’, (default ‘auto’)

Number of iterations for the power method computed by svd_solver == ‘randomized’.

random_stateint, RandomState instance or None, optional (default None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.

weightedbool, optional (default=True)

If True, the eigenvalues are used in the score computation: the eigenvectors with small eigenvalues carry more weight in the outlier score calculation.

standardizationbool, optional (default=True)

If True, perform standardization first to convert data to zero mean and unit variance. See http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

Attributes#

components_array, shape (n_components, n_features)

Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

explained_variance_array, shape (n_components,)

The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

explained_variance_ratio_array, shape (n_components,)

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.

singular_values_array, shape (n_components,)

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

mean_array, shape (n_features,)

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

n_components_int

The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or n_features if n_components is None.

noise_variance_float

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

property explained_variance_#

The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

Property decorator exposing the underlying scikit-learn PCA attribute.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth labels of the input samples, used by the evaluation metric.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

property noise_variance_#

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

Property decorator exposing the underlying scikit-learn PCA attribute.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.
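Continuing the usage sketch above (clf is the fitted PCA detector; the test data is illustrative):

    import numpy as np

    X_test = np.random.RandomState(1).randn(10, 5)
    labels, confidence = clf.predict(X_test, return_confidence=True)
    print(labels)      # 0 for inliers, 1 for outliers
    print(confidence)  # per-sample prediction confidence in [0, 1]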

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.qmcd module#

Quasi-Monte Carlo Discrepancy outlier detection (QMCD)

class pyod.models.qmcd.QMCD(contamination=0.1)[source]#

Bases: BaseDetector

The Wrap-around Quasi-Monte Carlo discrepancy is a uniformity criterion used to assess the space filling of a number of samples in a hypercube. It quantifies the distance between the continuous uniform distribution on a hypercube and the discrete uniform distribution on distinct sample points. Therefore, lower discrepancy values for a sample point indicate that it provides better coverage of the parameter space with regard to the rest of the samples. This method is kernel based: the higher a sample’s discrepancy score relative to the rest of the samples, the higher the likelihood of it being an outlier. Read more in [BFM01].
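A minimal usage sketch; the uniform synthetic data and values are illustrative:

    import numpy as np
    from pyod.models.qmcd import QMCD

    rng = np.random.RandomState(0)
    X = rng.rand(300, 3)  # samples in the unit hypercube

    clf = QMCD(contamination=0.1)
    clf.fit(X)
    print(clf.decision_scores_[:5])
    print(clf.labels_[:5])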

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The independent and dependent/target samples, with the target samples being the last column of the numpy array, e.g., X = np.append(x, y.reshape(-1,1), axis=1). Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth labels of the input samples, used by the evaluation metric.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.rgraph module#

R-graph

class pyod.models.rgraph.RGraph(transition_steps=10, n_nonzero=10, gamma=50.0, gamma_nz=True, algorithm='lasso_lars', tau=1.0, maxiter_lasso=1000, preprocessing=True, contamination=0.1, blocksize_test_data=10, support_init='L2', maxiter=40, support_size=100, active_support=True, fit_intercept_LR=False, verbose=True)[source]#

Bases: BaseDetector

Outlier Detection via R-graph. Paper: https://openaccess.thecvf.com/content_cvpr_2017/papers/You_Provable_Self-Representation_Based_CVPR_2017_paper.pdf See [BYRV17] for details.
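A minimal usage sketch; the synthetic data and parameter values are illustrative, and fitting may be slow for large n_samples:

    import numpy as np
    from pyod.models.rgraph import RGraph

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)

    clf = RGraph(transition_steps=10, n_nonzero=10, gamma=50.0, verbose=False)
    clf.fit(X)
    print(clf.labels_[:10])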

Parameters#

transition_stepsint, optional (default=10)

Number of transition steps that are taken in the graph, after which the outlier scores are determined.

gamma : float

gamma_nzboolean, default True

gamma and gamma_nz together determine the parameter alpha. When gamma_nz = False, alpha = gamma. When gamma_nz = True, alpha = gamma * alpha0, where alpha0 is the largest number such that the solution to the optimization problem with alpha = alpha0 is the zero vector (see Proposition 1 in [1]). Therefore, when gamma_nz = True, gamma should be a value greater than 1.0. A good choice is typically in the range [5, 500].

taufloat, default 1.0

Parameter for elastic net penalty term. When tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. When tau = 0.0, the method reduces to least squares regression (LSR).

algorithmstring, default lasso_lars

Algorithm for computing the representation. Either lasso_lars or lasso_cd. Note: lasso_lars and lasso_cd only support tau = 1. For cases tau << 1, linear regression is used.

fit_intercept_LR: bool, optional (default=False)

For gamma > 10000, linear regression is used instead of lasso_lars or lasso_cd. This parameter determines whether the intercept for that linear regression model is calculated.

maxiter_lassoint, default 1000

The maximum number of iterations for lasso_lars and lasso_cd.

n_nonzeroint, default 10

This is an upper bound on the number of nonzero entries of each representation vector. If there are more than n_nonzero nonzero entries, only the n_nonzero entries with the largest absolute values are kept.

active_support: boolean, default True

Set to True to use the active support algorithm in [1] for solving the optimization problem. This should significantly reduce the running time when n_samples is large.

active_support_params: dictionary of string to any, optional

Parameters (keyword arguments) and values for the active support algorithm. It may be used to set the parameters support_init, support_size and maxiter; see active_support_elastic_net for details. Example: active_support_params={‘support_size’:50, ‘maxiter’:100}. Ignored when active_support=False.

preprocessingbool, optional (default=True)

If True, apply standardization on the data.

verboseint, optional (default=1)

Verbosity mode.

  • 0 = silent

  • 1 = progress bar

  • 2 = one line per epoch.

For verbose >= 1, model summary may be printed.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.

blocksize_test_data: int, optional (default=10)

The test set is split into blocks of size blocksize_test_data to at least partially separate the test and train sets.

Attributes#

transition_matrix_numpy array of shape (n_samples,)

Transition matrix from the last fitted data; this might include training and test data.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

active_support_elastic_net(X, y, alpha, tau=1.0, algorithm='lasso_lars', support_init='L2', support_size=100, maxiter=40, maxiter_lasso=1000)[source]#
Source: https://github.com/ChongYou/subspace-clustering/blob/master/cluster/selfrepresentation.py

An active-support-based algorithm for solving the elastic net optimization problem min_{c} tau ||c||_1 + (1 - tau)/2 ||c||_2^2 + alpha/2 ||y - c X||_2^2.

Parameters#

X : array-like, shape (n_samples, n_features)

y : array-like, shape (1, n_features)

alpha : float

tau : float, default 1.0

algorithmstring, default lasso_lars

Algorithm for solving the subproblems. Either lasso_lars, lasso_cd or spams (installation of the spams package is required). Note: lasso_lars and lasso_cd only support tau = 1.

support_init: string, default L2

This determines how the active support is initialized. It can be either knn or L2.

support_size: int, default 100

This determines the size of the working set. A small support_size decreases the runtime per iteration while increasing the number of iterations.

maxiter: int, default 40

Termination condition for active support update.

Returns#

cnumpy array of shape (n_samples,)

The optimal solution to the optimization problem.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

elastic_net_subspace_clustering(X, gamma=50.0, gamma_nz=True, tau=1.0, algorithm='lasso_lars', fit_intercept_LR=False, active_support=True, active_support_params=None, n_nonzero=50, maxiter_lasso=1000)[source]#

Source: https://github.com/ChongYou/subspace-clustering/blob/master/cluster/selfrepresentation.py

Elastic net subspace clustering (EnSC) [1]. Compute the self-representation matrix C by solving the following optimization problem: min_{c_j} tau ||c_j||_1 + (1-tau)/2 ||c_j||_2^2 + alpha/2 ||x_j - c_j X||_2^2 s.t. c_jj = 0, where c_j and x_j are the j-th rows of C and X, respectively.

Parameter algorithm specifies the algorithm for solving the optimization problem. lasso_lars and lasso_cd are algorithms implemented in sklearn; spams refers to the same algorithm as lasso_lars but is implemented in the spams package available at http://spams-devel.gforge.inria.fr/ (installation required). In principle, all three algorithms give the same result. For large scale data (e.g. with > 5000 data points), use any of these algorithms in conjunction with active_support=True. It adopts an efficient active support strategy that solves the optimization problem by breaking it into a sequence of small scale optimization problems as described in [1]. If tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. If tau = 0.0, the method reduces to least squares regression (LSR) [3]. Note: lasso_lars and lasso_cd only support tau = 1.

Parameters#

X : array-like, shape (n_samples, n_features)

Input data to be clustered.

gamma : float

gamma_nz : boolean, default True

gamma and gamma_nz together determine the parameter alpha. When gamma_nz = False, alpha = gamma. When gamma_nz = True, alpha = gamma * alpha0, where alpha0 is the largest number such that the solution to the optimization problem with alpha = alpha0 is the zero vector (see Proposition 1 in [1]). Therefore, when gamma_nz = True, gamma should be a value greater than 1.0. A good choice is typically in the range [5, 500].

taufloat, default 1.0

Parameter for elastic net penalty term. When tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. When tau = 0.0, the method reduces to least squares regression (LSR) [3].

algorithmstring, default lasso_lars

Algorithm for computing the representation. Either lasso_lars or lasso_cd or spams (installation of spams package is required). Note: lasso_lars and lasso_cd only support tau = 1.

n_nonzeroint, default 50

This is an upper bound on the number of nonzero entries of each representation vector. If there are more than n_nonzero nonzero entries, only the n_nonzero entries with the largest absolute values are kept.

active_support: boolean, default True

Set to True to use the active support algorithm in [1] for solving the optimization problem. This should significantly reduce the running time when n_samples is large.

active_support_params: dictionary of string to any, optional

Parameters (keyword arguments) and values for the active support algorithm. It may be used to set the parameters support_init, support_size and maxiter, see active_support_elastic_net for details. Example: active_support_params={‘support_size’:50, ‘maxiter’:100} Ignored when active_support=False

Returns#

representation_matrix_csr matrix, shape: n_samples by n_samples

The self-representation matrix.

References#

[1] C. You, C.-G. Li, D. Robinson, R. Vidal, Oracle Based Active Set Algorithm for Scalable Elastic Net Subspace Clustering, CVPR 2016.

[2] E. Elhamifar, R. Vidal, Sparse Subspace Clustering: Algorithm, Theory, and Applications, TPAMI 2013.

[3] C. Lu, et al., Robust and Efficient Subspace Segmentation via Least Squares Regression, ECCV 2012.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth labels of the input samples, used by the evaluation metric.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.rod module#

Rotation-based Outlier Detector (ROD)

class pyod.models.rod.ROD(contamination=0.1, parallel_execution=False)[source]#

Bases: BaseDetector

Rotation-based Outlier Detection (ROD) is a robust and parameter-free algorithm that requires no statistical distribution assumptions and works intuitively in three-dimensional space, where the 3D vectors representing the data points are rotated about the geometric median twice counterclockwise using the Rodrigues rotation formula. The results of the rotation are parallelepipeds whose volumes are mathematically analyzed as cost functions and used to calculate the median absolute deviations to obtain the outlying score. For dimensions higher than 3, the overall score is calculated by averaging the scores of all the 3D subspaces that result from decomposing the original data space. See [BABC20] for details.
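A minimal usage sketch; the synthetic 3D data and values are illustrative:

    import numpy as np
    from pyod.models.rod import ROD

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)

    clf = ROD(contamination=0.1)  # parallel_execution left at False for low-dimensional data
    clf.fit(X)
    print(clf.labels_[:10])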

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

parallel_execution: bool, optional (default=False).

If set to True, the algorithm will run in parallel for a better execution time. It is recommended to set this parameter to True ONLY for high-dimensional data (more than 10 dimensions) and if proper hardware is available.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth labels of the input samples, used by the evaluation metric.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.rod.angle(v1, v2)[source]#

Find the angle between two 3D vectors.

Parameters#

v1 : list, first vector

v2 : list, second vector

Returns#

angle : float, the angle
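A minimal sketch of the standard dot-product formula behind such a helper; this is an illustration, not necessarily pyod’s exact implementation:

    import numpy as np

    def angle_between(v1, v2):
        # Angle (in radians) between two 3D vectors via the normalized dot product.
        v1 = np.asarray(v1, dtype=float)
        v2 = np.asarray(v2, dtype=float)
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding errors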

pyod.models.rod.euclidean(v1, v2, c=False)[source]#

Find the Euclidean distance between two vectors or between a vector and a collection of vectors.

Parameters#

v1 : list, first 3D vector or collection of vectors

v2 : list, second 3D vector

c : bool (default=False), if True, v1 is a list of vectors

Returns#

list of lists of Euclidean distances if c==True; otherwise float, the Euclidean distance

pyod.models.rod.geometric_median(x, eps=1e-05)[source]#

Find the multivariate geometric L1-median by applying the Vardi and Zhang algorithm.

Parameters#

x : array-like, the data points

eps : float (default=1e-5), a threshold to indicate when to stop

Returns#

gm : array, Geometric L1-median
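An illustrative Weiszfeld-style iteration for the geometric L1-median; the Vardi and Zhang variant used here differs in details, so treat this only as a sketch:

    import numpy as np

    def geometric_median_sketch(x, eps=1e-5, max_iter=500):
        x = np.asarray(x, dtype=float)
        gm = x.mean(axis=0)  # start from the centroid
        for _ in range(max_iter):
            d = np.linalg.norm(x - gm, axis=1)
            d = np.maximum(d, eps)  # avoid division by zero at data points
            new_gm = (x / d[:, None]).sum(axis=0) / (1.0 / d).sum()
            if np.linalg.norm(new_gm - gm) < eps:
                return new_gm
            gm = new_gm
        return gm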

pyod.models.rod.mad(costs, median=None)[source]#

Apply the robust median absolute deviation (MAD) to measure the inconsistency/variability of the rotation costs.

Parameters#

costs : list of rotation costs

median : float (default=None), MAD median

Returns#

zfloat

The modified z-scores.
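A minimal sketch of modified z-scores via MAD, using the conventional 0.6745 consistency constant; this illustrates the idea rather than pyod’s exact code:

    import numpy as np

    def modified_z_scores(costs, median=None):
        costs = np.asarray(costs, dtype=float)
        med = np.median(costs) if median is None else median
        mad = np.median(np.abs(costs - med))  # median absolute deviation
        return 0.6745 * (costs - med) / mad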

pyod.models.rod.process_sub(subspace, gm, median, scaler1, scaler2)[source]#

Apply ROD on a 3D subspace, then process the scores with a sigmoid so they are comparable across subspaces.

Parameters#

subspace : array-like, 3D subspace of the data

gm : list, the geometric median

median : float, MAD median

scaler1 : obj, MinMaxScaler of Angles group 1

scaler2 : obj, MinMaxScaler of Angles group 2

Returns#

ROD decision scores with sigmoid applied, gm, scaler1, scaler2

pyod.models.rod.rod_3D(x, gm=None, median=None, scaler1=None, scaler2=None)[source]#

Find ROD scores for 3D data. Note that gm, scaler1, and scaler2 will be returned as they are, without being changed, if the model has already been fit.

Parameters#

x : array-like, 3D data points

gm : list (default=None), the geometric median

median : float (default=None), MAD median

scaler1 : obj (default=None), MinMaxScaler of Angles group 1

scaler2 : obj (default=None), MinMaxScaler of Angles group 2

Returns#

decision_scores, gm, scaler1, scaler2

pyod.models.rod.rod_nD(X, parallel, gm=None, median=None, data_scaler=None, angles_scalers1=None, angles_scalers2=None)[source]#
Find ROD overall scores when the data has more than 3 dimensions:

  • scale the dataset using RobustScaler,

  • decompose the full space into combinations of 3D subspaces,

  • apply ROD on each combination,

  • squish the scores per subspace, so we compare apples to apples,

  • calculate the average of the ROD scores of all subspaces per observation.

Note that if gm, data_scaler, angles_scalers1, and angles_scalers2 are None, this is a fit() process: they will be calculated and returned to the class to be saved for future predictions. Otherwise, it is a prediction process.
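A minimal sketch of the decomposition step (enumerating all 3D feature combinations); the per-subspace scoring and averaging described above would run over these:

    from itertools import combinations

    import numpy as np

    def subspaces_3d(X):
        # Yield every 3D subspace, i.e. every combination of three feature columns.
        X = np.asarray(X)
        for cols in combinations(range(X.shape[1]), 3):
            yield X[:, cols]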

Parameters#

X : array-like, data points

parallel : bool, True runs the algorithm in parallel

gm : list (default=None), the geometric median

median : list (default=None), MAD medians

data_scaler : obj (default=None), RobustScaler of data

angles_scalers1 : list (default=None), MinMaxScalers of Angles group 1

angles_scalers2 : list (default=None), MinMaxScalers of Angles group 2

Returns#

ROD decision scores, gm, median, data_scaler, angles_scalers1, angles_scalers2

pyod.models.rod.scale_angles(gammas, scaler1=None, scaler2=None)[source]#

Scale all angles: angles <= 90 degrees are scaled within [0, 54.7] and angles > 90 degrees are scaled within [90, 126].

Parameters#

gammas : list, angles

scaler1 : obj (default=None), MinMaxScaler of Angles group 1

scaler2 : obj (default=None), MinMaxScaler of Angles group 2

Returns#

scaled angles, scaler1, scaler2

pyod.models.rod.sigmoid(x)[source]#

Implementation of the sigmoid function.

Parameters#

x : array-like, decision scores

Returns#

array-like, x after applying sigmoid
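A minimal sketch of the transformation described above:

    import numpy as np

    def sigmoid_sketch(x):
        # Map decision scores element-wise into (0, 1).
        return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))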

pyod.models.sampling module#

Outlier detection based on Sampling (SP)

class pyod.models.sampling.Sampling(contamination=0.1, subset_size=20, metric='minkowski', metric_params=None, random_state=None)[source]#

Bases: BaseDetector

Sampling class for outlier detection.

Sugiyama, M., Borgwardt, K. M.: Rapid Distance-Based Outlier Detection via Sampling, Advances in Neural Information Processing Systems (NIPS 2013), 467-475, 2013.

See [BSB13] for details.
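A minimal usage sketch, including a callable metric as described under the metric parameter below; the data and values are illustrative:

    import numpy as np
    from pyod.models.sampling import Sampling

    def manhattan(a, b):
        # Callable metric: takes two 1D arrays, returns one distance value.
        return float(np.abs(a - b).sum())

    rng = np.random.RandomState(42)
    X = rng.randn(300, 4)

    clf = Sampling(subset_size=20, metric=manhattan, random_state=42)
    clf.fit(X)
    print(clf.labels_[:10])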

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

subset_sizefloat in (0., 1.0) or int (0, n_samples), optional (default=20)

The size of the subset of the data set. Sampling a subset from the data set is performed only once.

metricstring or callable, default ‘minkowski’

Metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Distance matrices are not supported.

Valid values for metric are:

  • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

  • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for scipy.spatial.distance for details on these metrics.

metric_paramsdict, optional (default = None)

Additional keyword arguments for the metric function.

random_stateint, RandomState instance or None, optional (default None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The test input samples.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth labels of the input samples, used by the evaluation metric.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

Probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.sod module#

Subspace Outlier Detection (SOD)

class pyod.models.sod.SOD(contamination=0.1, n_neighbors=20, ref_set=10, alpha=0.8)[source]#

Bases: BaseDetector

The subspace outlier detection (SOD) scheme aims to detect outliers in varying subspaces of a high-dimensional feature space. For each data object, SOD explores the axis-parallel subspace spanned by the data object’s neighbors and determines how much the object deviates from the neighbors in this subspace.

See [BKKrogerSZ09] for details.
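A minimal usage sketch; note that ref_set must be smaller than n_neighbors, and the data and values are illustrative:

    import numpy as np
    from pyod.models.sod import SOD

    rng = np.random.RandomState(42)
    X = rng.randn(200, 10)

    clf = SOD(n_neighbors=20, ref_set=10, alpha=0.8, contamination=0.1)
    clf.fit(X)
    print(clf.labels_[:10])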

Parameters#

n_neighborsint, optional (default=20)

Number of neighbors to use by default for k neighbors queries.

ref_set: int, optional (default=10)

Specifies the number of shared nearest neighbors to create the reference set. Note that ref_set must be smaller than n_neighbors.

alpha: float in (0., 1.), optional (default=0.8)

Specifies the lower limit for selecting subspaces. 0.8 is set as the default, as suggested in the original paper.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
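For intuition, the ‘linear’ conversion can be sketched in plain NumPy (an illustration of min-max scaling only; PyOD’s actual implementation uses statistics learned from the training scores):

>>> import numpy as np
>>> scores = np.array([0.2, 1.5, 3.0, 9.7])  # hypothetical decision scores
>>> proba_outlier = (scores - scores.min()) / (scores.max() - scores.min())
>>> proba = np.column_stack([1 - proba_outlier, proba_outlier])
>>> proba.shape  # [proba of normal, proba of outliers] per sample
(4, 2)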

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.so_gaal module#

Single-Objective Generative Adversarial Active Learning. Part of the code is adapted from https://github.com/leibinghe/GAAL-based-outlier-detection

class pyod.models.so_gaal.SO_GAAL(stop_epochs=20, lr_d=0.01, lr_g=0.0001, momentum=0.9, contamination=0.1)[source]#

Bases: BaseDetector

Single-Objective Generative Adversarial Active Learning.

SO-GAAL directly generates informative potential outliers to assist the classifier in describing a boundary that can separate outliers from normal data effectively. Moreover, to prevent the generator from falling into the mode collapse problem, the network structure of SO-GAAL is expanded from a single generator (SO-GAAL) to multiple generators with different objectives (MO-GAAL) to generate a reasonable reference distribution for the whole dataset. Read more in [BLLZ+19].

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

stop_epochsint, optional (default=20)

The number of epochs of training. The total number of epochs equals three times stop_epochs.

lr_dfloat, optional (default=0.01)

The learning rate of the discriminator.

lr_gfloat, optional (default=0.0001)

The learning rate of the generator.

momentumfloat, optional (default=0.9)

The momentum parameter for SGD.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
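A minimal usage sketch (not part of the original docstring; SO_GAAL trains a GAN, so it needs the deep-learning backend of your PyOD build and is noticeably slower than the classical detectors):

>>> from pyod.models.so_gaal import SO_GAAL
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=500, n_test=100, contamination=0.1, random_state=42)
>>> clf = SO_GAAL(stop_epochs=20, lr_d=0.01, lr_g=0.0001, contamination=0.1)
>>> clf = clf.fit(X_train)                       # adversarial training loop
>>> test_scores = clf.decision_function(X_test)  # higher = more abnormal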

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.sos module#

Stochastic Outlier Selection (SOS). Part of the code is adapted from https://github.com/jeroenjanssens/scikit-sos

class pyod.models.sos.SOS(contamination=0.1, perplexity=4.5, metric='euclidean', eps=1e-05)[source]#

Bases: BaseDetector

Stochastic Outlier Selection.

SOS employs the concept of affinity to quantify the relationship from one data point to another data point. Affinity is proportional to the similarity between two data points. So, a data point has little affinity with a dissimilar data point. A data point is selected as an outlier when all the other data points have insufficient affinity with it. Read more in [BJHuszarPvdH12].

Parameters#

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

perplexityfloat, optional (default=4.5)

A smooth measure of the effective number of neighbours. The perplexity parameter is similar to the parameter k in the kNN algorithm (the number of nearest neighbors). The range of perplexity can be any real number between 1 and n-1, where n is the number of samples.

metric: str, default ‘euclidean’

Metric used for the distance computation. Any metric from scipy.spatial.distance can be used.

Valid values for metric are:

  • ‘euclidean’

  • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for scipy.spatial.distance for details on these metrics: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

epsfloat, optional (default = 1e-5)

Tolerance threshold for floating point errors.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Examples#

>>> from pyod.models.sos import SOS
>>> from pyod.utils.data import generate_data
>>> n_train = 50
>>> n_test = 50
>>> contamination = 0.1
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=n_train, n_test=n_test,
...     contamination=contamination, random_state=42)
>>>
>>> clf = SOS()
>>> clf.fit(X_train)
SOS(contamination=0.1, eps=1e-05, metric='euclidean', perplexity=4.5)
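The fitted detector can then score and label unseen data; a short continuation of the example above:

>>> test_scores = clf.decision_function(X_test)  # raw outlier scores
>>> test_labels = clf.predict(X_test)            # binary labels, 0 or 1
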
decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.suod module#

SUOD

class pyod.models.suod.SUOD(base_estimators=None, contamination=0.1, combination='average', n_jobs=None, rp_clf_list=None, rp_ng_clf_list=None, rp_flag_global=True, target_dim_frac=0.5, jl_method='basic', bps_flag=True, approx_clf_list=None, approx_ng_clf_list=None, approx_flag_global=True, approx_clf=None, verbose=False)[source]#

Bases: BaseDetector

SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large scale unsupervised outlier detector training and prediction. See [BZHC+21] for details.

Parameters#

base_estimatorslist, length must be greater than 1

A list of base estimators. Certain methods must be present, e.g., fit and predict.

combinationstr, optional (default=’average’)

Decide how to aggregate the results from multiple models:

  • “average” : average the results from all base detectors

  • “maximization” : output the max value across all base detectors

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

n_jobsoptional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of jobs that can actually run in parallel.

rp_clf_listlist, optional (default=None)

The list of outlier detection models to use random projection. The detector name should be consistent with PyOD.

rp_ng_clf_listlist, optional (default=None)

The list of outlier detection models NOT to use random projection. The detector name should be consistent with PyOD.

rp_flag_globalbool, optional (default=True)

If set to False, random projection is turned off for all base models.

target_dim_fracfloat in (0., 1), optional (default=0.5)

The target compression ratio.

jl_methodstring, optional (default = ‘basic’)

The JL projection method (a brief NumPy sketch follows this list):

  • “basic”: each component of the transformation matrix is taken at random in N(0,1).

  • “discrete”: each component of the transformation matrix is taken at random in {-1,1}.

  • “circulant”: the first row of the transformation matrix is taken at random in N(0,1), and each row is obtained from the previous one by a one-left shift.

  • “toeplitz”: the first row and column of the transformation matrix are taken at random in N(0,1), and each diagonal has a constant value taken from these first vectors.
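For intuition, the “basic” variant can be sketched in a few lines of NumPy (illustrative only; SUOD’s internal projection code may differ in scaling and implementation details):

>>> import numpy as np
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 64)                      # 100 samples, 64 features
>>> k = 32                                     # target dim, e.g. target_dim_frac * 64
>>> R = rng.normal(size=(64, k)) / np.sqrt(k)  # entries drawn from N(0,1)
>>> X_low = X @ R                              # randomly projected data
>>> X_low.shape
(100, 32)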

bps_flagbool, optional (default=True)

If set to False, balanced parallel scheduling is turned off.

approx_clf_listlist, optional (default=None)

The list of outlier detection models to use pseudo-supervised approximation. The detector name should be consistent with PyOD.

approx_ng_clf_listlist, optional (default=None)

The list of outlier detection models NOT to use pseudo-supervised approximation. The detector name should be consistent with PyOD.

approx_flag_globalbool, optional (default=True)

If set to False, pseudo-supervised approximation is turned off.

approx_clfobject, optional (default: sklearn RandomForestRegressor)

The supervised model used to approximate unsupervised models.

cost_forecast_loc_fitstr, optional

The location of the pretrained cost prediction forecast for training.

cost_forecast_loc_predstr, optional

The location of the pretrained cost prediction forecast for prediction.

verboseint, optional (default=0)

Controls the verbosity of the building process.

Attributes#

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
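A minimal usage sketch (not part of the original docstring; it assumes the LOF, HBOS, and IForest detectors from pyod.models and an installed suod package):

>>> from pyod.models.suod import SUOD
>>> from pyod.models.lof import LOF
>>> from pyod.models.hbos import HBOS
>>> from pyod.models.iforest import IForest
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=1000, n_test=200, contamination=0.1, random_state=42)
>>> base_estimators = [LOF(n_neighbors=15), LOF(n_neighbors=25),
...                    HBOS(), IForest(n_estimators=100)]
>>> clf = SUOD(base_estimators=base_estimators, combination='average',
...            n_jobs=2, verbose=False)
>>> clf = clf.fit(X_train)
>>> test_scores = clf.decision_function(X_test)  # aggregated scores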

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detectors.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is ignored in unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

selfobject

Fitted estimator.

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

pyod.models.thresholds module#

pyod.models.thresholds.AUCP(**kwargs)[source]#

AUCP class for Area Under Curve Percentage thresholder.

Use the area under the curve (AUC) to evaluate a non-parametric means to threshold scores generated by the decision_scores, where outliers are set to any value beyond the point where the AUC of the KDE is less than (mean + abs(mean-median)) percent of the total KDE AUC.
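These thresholders are used in place of a fixed contamination value; a minimal wiring sketch (assuming your PyOD version accepts a thresholder instance as the contamination argument, shown here with the KNN detector from pyod.models.knn as an arbitrary choice):

>>> from pyod.models.knn import KNN
>>> from pyod.models.thresholds import FILTER
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=300, n_test=100, contamination=0.1, random_state=42)
>>> clf = KNN(contamination=FILTER())  # threshold picked by the thresholder
>>> clf = clf.fit(X_train)
>>> labels = clf.labels_               # binary labels from the learned threshold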

pyod.models.thresholds.BOOT(**kwargs)[source]#

BOOT class for Bootstrapping thresholder.

Use a bootstrapping based method to find a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the mean of the confidence intervals.

Parameters#

random_stateint, optional (default=1234)

Random seed for bootstrapping a confidence interval. Can also be set to None.

pyod.models.thresholds.CHAU(**kwargs)[source]#

CHAU class for Chauvenet’s criterion thresholder.

Use Chauvenet’s criterion to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value below Chauvenet’s criterion.

Parameters#

method{‘mean’, ‘median’, ‘gmean’}, optional (default=’mean’)

Calculate the area normal to distance using a scaler

  • ‘mean’: Construct a scaler with the mean of the scores

  • ‘median’: Construct a scaler with the median of the scores

  • ‘gmean’: Construct a scaler with the geometric mean of the scores

pyod.models.thresholds.CLF(**kwargs)[source]#

CLF class for Trained Classifier thresholder.

Use the trained linear classifier to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond 0.

Parameters#

method{‘simple’, ‘complex’}, optional (default=’complex’)

Type of linear model

  • ‘simple’: Uses only the scores

  • ‘complex’: Uses the scores, log of the scores, and the scores’ PDF

pyod.models.thresholds.CLUST(**kwargs)[source]#

CLUST class for clustering type thresholders.

Use the clustering methods to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value not labelled as part of the main cluster.

Parameters#

method{‘agg’, ‘birch’, ‘bang’, ‘bgm’, ‘bsas’, ‘dbscan’, ‘ema’, ‘kmeans’, ‘mbsas’, ‘mshift’, ‘optics’, ‘somsc’, ‘spec’, ‘xmeans’}, optional (default=’spec’)

Clustering method

  • ‘agg’: Agglomerative

  • ‘birch’: Balanced Iterative Reducing and Clustering using Hierarchies

  • ‘bang’: BANG

  • ‘bgm’: Bayesian Gaussian Mixture

  • ‘bsas’: Basic Sequential Algorithmic Scheme

  • ‘dbscan’: Density-based spatial clustering of applications with noise

  • ‘ema’: Expectation-Maximization clustering algorithm for Gaussian Mixture Model

  • ‘kmeans’: K-means

  • ‘mbsas’: Modified Basic Sequential Algorithmic Scheme

  • ‘mshift’: Mean shift

  • ‘optics’: Ordering Points To Identify Clustering Structure

  • ‘somsc’: Self-organized feature map

  • ‘spec’: Clustering to a projection of the normalized Laplacian

  • ‘xmeans’: X-means

random_stateint, optional (default=1234)

Random seed for the BayesianGaussianMixture clustering (method=’bgm’). Can also be set to None.

pyod.models.thresholds.CPD(**kwargs)[source]#

CPD class for Change Point Detection thresholder.

Use change point detection to find a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the detected change point.

Parameters#

method{‘Dynp’, ‘KernelCPD’, ‘Binseg’, ‘BottomUp’}, optional (default=’Dynp’)

Method for change point detection

  • ‘Dynp’: Dynamic programming (optimal minimum sum of errors per partition)

  • ‘KernelCPD’: RBF kernel function (optimal minimum sum of errors per partition)

  • ‘Binseg’: Binary segmentation

  • ‘BottomUp’: Bottom-up segmentation

transform{‘cdf’, ‘kde’}, optional (default=’cdf’)

Data transformation method prior to fit

  • ‘cdf’: Use the cumulative distribution function

  • ‘kde’: Use the kernel density estimation

pyod.models.thresholds.DECOMP(**kwargs)[source]#

DECOMP class for Decomposition based thresholders.

Use decomposition to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the maximum of the decomposed matrix that results from decomposing the cumulative distribution function of the decision scores.

Parameters#

method{‘NMF’, ‘PCA’, ‘GRP’, ‘SRP’}, optional (default=’PCA’)

Method to use for decomposition

  • ‘NMF’: Non-Negative Matrix Factorization

  • ‘PCA’: Principal Component Analysis

  • ‘GRP’: Gaussian Random Projection

  • ‘SRP’: Sparse Random Projection

random_stateint, optional (default=1234)

Random seed for the decomposition algorithm. Can also be set to None.

pyod.models.thresholds.DSN(**kwargs)[source]#

DSN class for Distance Shift from Normal thresholder.

Use the distance shift from normal to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the distance calculated by the selected metric.

Parameters#

metric{‘JS’, ‘WS’, ‘ENG’, ‘BHT’, ‘HLL’, ‘HI’, ‘LK’, ‘LP’, ‘MAH’, ‘TMT’, ‘RES’, ‘KS’, ‘INT’, ‘MMD’}, optional (default=’MAH’)

Metric to use for distance computation

  • ‘JS’: Jensen-Shannon distance

  • ‘WS’: Wasserstein or Earth Movers distance

  • ‘ENG’: Energy distance

  • ‘BHT’: Bhattacharyya distance

  • ‘HLL’: Hellinger distance

  • ‘HI’: Histogram intersection distance

  • ‘LK’: Lukaszyk-Karmowski metric for normal distributions

  • ‘LP’: Levy-Prokhorov metric

  • ‘MAH’: Mahalanobis distance

  • ‘TMT’: Tanimoto distance

  • ‘RES’: Studentized residual distance

  • ‘KS’: Kolmogorov-Smirnov distance

  • ‘INT’: Weighted spline interpolated distance

  • ‘MMD’: Maximum Mean Discrepancy distance

random_stateint, optional (default=1234)

Random seed for the normal distribution. Can also be set to None.

pyod.models.thresholds.EB(**kwargs)[source]#

EB class for Elliptical Boundary thresholder.

Use pseudo-random elliptical boundaries to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond a pseudo-random elliptical boundary set between inliers and outliers.

pyod.models.thresholds.FGD(**kwargs)[source]#

FGD class for Fixed Gradient Descent thresholder.

Use the fixed gradient descent to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond where the first derivative of the kde with respect to the decision scores passes the mean of the first and second inflection points.

pyod.models.thresholds.FILTER(**kwargs)[source]#

FILTER class for Filtering based thresholders.

Use the filtering based methods to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the maximum filter value.

Parameters#

method{‘gaussian’, ‘savgol’, ‘hilbert’, ‘wiener’, ‘medfilt’, ‘decimate’, ‘detrend’, ‘resample’}, optional (default=’savgol’)

Method to filter the scores

  • ‘gaussian’: use a gaussian based filter

  • ‘savgol’: use the savgol based filter

  • ‘hilbert’: use the hilbert based filter

  • ‘wiener’: use the wiener based filter

  • ‘medfilt’: use a median based filter

  • ‘decimate’: use a decimate based filter

  • ‘detrend’: use a detrend based filter

  • ‘resample’: use a resampling based filter

sigmaint, optional (default=’auto’)

Variable specific to each filter type, default sets sigma to len(scores)*np.std(scores)

  • ‘gaussian’: standard deviation for Gaussian kernel

  • ‘savgol’: savgol filter window size

  • ‘hilbert’: number of Fourier components

  • ‘medfilt’: kernel size

  • ‘decimate’: downsampling factor

  • ‘detrend’: number of break points

  • ‘resample’: resampling window size

pyod.models.thresholds.FWFM(**kwargs)[source]#

FWFM class for Full Width at Full Minimum thresholder.

Use the full width at full minimum (aka base width) to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the base width.

pyod.models.thresholds.GESD(**kwargs)[source]#

GESD class for Generalized Extreme Studentized Deviate thresholder.

Use the generalized extreme studentized deviate to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value no less than the smallest detected outlier.

Parameters#

max_outliersint, optional (default=’auto’)

Maximum number of outliers that the dataset may have. Default sets max_outliers to half the size of the dataset.

alphafloat, optional (default=0.05)

significance level

pyod.models.thresholds.HIST(**kwargs)[source]#

HIST class for Histogram based thresholders.

Use histogram methods as described in scikit-image.filters to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set by histogram-generated thresholds depending on the selected method.

Parameters#

nbinsint, optional (default=’auto’)

Number of bins to use in the histogram, default set to int(len(scores)**0.7)

method{‘otsu’, ‘yen’, ‘isodata’, ‘li’, ‘minimum’, ‘triangle’}, optional (default=’triangle’)

Histogram filtering based method

  • ‘otsu’: OTSU’s method for filtering

  • ‘yen’: Yen’s method for filtering

  • ‘isodata’: Ridler-Calvard or inter-means method for filtering

  • ‘li’: Li’s iterative Minimum Cross Entropy method for filtering

  • ‘minimum’: Minimum between two maxima via smoothing method for filtering

  • ‘triangle’: Triangle algorithm method for filtering

pyod.models.thresholds.IQR(**kwargs)[source]#

IQR class for Inter-Quartile Region thresholder.

Use the inter-quartile region to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the third quartile plus 1.5 times the inter-quartile region.

pyod.models.thresholds.KARCH(**kwargs)[source]#

KARCH class for Riemannian Center of Mass thresholder.

Use the Karcher mean (Riemannian Center of Mass) to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the Karcher mean + one standard deviation of the decision_scores.

Parameters#

ndimint, optional (default=2)

Number of dimensions to construct the Euclidean manifold

method{‘simple’, ‘complex’}, optional (default=’complex’)

Method for computing the Karcher mean

  • ‘simple’: Compute the Karcher mean using the 1D array of scores

  • ‘complex’: Compute the Karcher mean between a 2D array dot product of the scores and the sorted scores arrays

pyod.models.thresholds.MAD(**kwargs)[source]#

MAD class for Median Absolute Deviation thresholder.

Use the median absolute deviation to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the mean plus the median absolute deviation over the standard deviation.

pyod.models.thresholds.MCST(**kwargs)[source]#

MCST class for Monte Carlo Shapiro Tests thresholder.

Use uniform random sampling and statistical testing to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the minimum value left after iterative Shapiro-Wilk tests have occurred. Note that accuracy decreases with array size; for good results the array should have fewer than 1000 values. Even so, this threshold method may fail at any array size.

Parameters#

random_stateint, optional (default=1234)

Random seed for the uniform distribution. Can also be set to None.

pyod.models.thresholds.META(**kwargs)[source]#

META class for Meta-modelling thresholder.

Use a trained meta-model to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set based on the trained meta-model classifier.

Parameters#

method{‘LIN’, ‘GNB’, ‘GNBC’, ‘GNBM’}, optional (default=’GNBM’)

Select the trained meta-model:

  • ‘LIN’: RidgeCV trained linear classifier meta-model on true labels

  • ‘GNB’: Gaussian Naive Bayes trained classifier meta-model on true labels

  • ‘GNBC’: Gaussian Naive Bayes trained classifier meta-model on best contamination

  • ‘GNBM’: Gaussian Naive Bayes multivariate trained classifier meta-model

pyod.models.thresholds.MOLL(**kwargs)[source]#

MOLL class for Friedrichs’ mollifier thresholder.

Use the Friedrichs’ mollifier to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond one minus the maximum of the smoothed dataset via convolution.

pyod.models.thresholds.MTT(**kwargs)[source]#

MTT class for Modified Thompson Tau test thresholder.

Use the modified Thompson Tau test to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the smallest outlier detected by the test.

Parameters#

strictness[1,2,3,4,5], optional (default=4)

Level of strictness corresponding to the t-Student distribution map to sample

pyod.models.thresholds.OCSVM(**kwargs)[source]#

OCSVM class for One-Class Support Vector Machine thresholder.

Use a one-class svm to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are determined by the one-class svm using a polynomial kernel with the polynomial degree either set or determined by regression internally.

Parameters#

model{‘poly’, ‘sgd’}, optional (default=’sgd’)

OCSVM model to apply

  • ‘poly’: Use a polynomial kernel with a regular OCSVM

  • ‘sgd’: Use the Additive Chi2 kernel approximation with an SGDOneClassSVM

degreeint, optional (default=’auto’)

Polynomial degree to use for the one-class svm. Default ‘auto’ finds the optimal degree with linear regression

gammafloat, optional (default=’auto’)

Kernel coefficient for polynomial fit for the one-class svm. Default ‘auto’ uses 1 / n_features

criterion{‘aic’, ‘bic’}, optional (default=’bic’)

regression performance metric. AIC is the Akaike Information Criterion, and BIC is the Bayesian Information Criterion. This only applies when degree is set to ‘auto’

nufloat, optional (default=’auto’)

An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Default ‘auto’ sets nu as the ratio between the number of points that are less than or equal to the median plus the absolute difference between the mean and geometric mean, and the number of points in the entire dataset.

tolfloat, optional (default=1e-3)

The stopping criterion for the one-class svm

random_stateint, optional (default=1234)

Random seed for the SVM’s data sampling. Can also be set to None.

pyod.models.thresholds.QMCD(**kwargs)[source]#

QMCD class for Quasi-Monte Carlo Discrepancy thresholder.

Use the quasi-Monte Carlo discrepancy to evaluate a non-parametric means to threshold scores generated by the decision_scores, where outliers are set to any value beyond a percentile or quantile of one minus the discrepancy. (Note: a discrepancy quantifies the distance between the continuous uniform distribution on a hypercube and the discrete uniform distribution on distinct sample points.)

Parameters#

method{‘CD’, ‘WD’, ‘MD’, ‘L2-star’}, optional (default=’WD’)

Type of discrepancy

  • ‘CD’: Centered Discrepancy

  • ‘WD’: Wrap-around Discrepancy

  • ‘MD’: Mix between CD/WD

  • ‘L2-star’: L2-star discrepancy

lim{‘Q’, ‘P’}, optional (default=’P’)

Filtering method to threshold scores using 1 - discrepancy

  • ‘Q’: Use quantile limiting

  • ‘P’: Use percentile limiting

pyod.models.thresholds.REGR(**kwargs)[source]#

REGR class for Regression based thresholder.

Use regression to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the y-intercept value of the linear fit.

Parameters#

method{‘siegel’, ‘theil’}, optional (default=’siegel’)

Regression based method to calculate the y-intercept

  • ‘siegel’: implements a method for robust linear regression using repeated medians

  • ‘theil’: implements a method for robust linear regression using paired values

random_stateint, optional (default=1234)

random seed for the normal distribution. Can also be set to None

pyod.models.thresholds.VAE(**kwargs)[source]#

VAE class for Variational AutoEncoder thresholder.

Use a VAE to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the maximum minus the minimum of the reconstructed distribution probabilities after encoding.

Parameters#

verbosebool, optional (default=False)

display training progress

devicestr, optional (default=’cpu’)

device for pytorch

latent_dimsint, optional (default=’auto’)

number of latent dimensions the encoder will map the scores to. Default ‘auto’ applies automatic dimensionality selection using a profile likelihood.

random_stateint, optional (default=1234)

random seed for the normal distribution. Can also be set to None

epochsint, optional (default=100)

number of epochs to train the VAE

batch_sizeint, optional (default=64)

batch size for the dataloader during training

lossstr, optional (default=’kl’)

Loss function during training

  • ‘kl’ : use the combined negative log likelihood and Kullback-Leibler divergence

  • ‘mmd’: use the combined negative log likelihood and maximum mean discrepancy

Attributes#

thresh_ : threshold value that separates inliers from outliers

pyod.models.thresholds.WIND(**kwargs)[source]#

WIND class for topological Winding number thresholder.

Use the topological winding number (with respect to the origin) to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the mean intersection point calculated from the winding number.

Parameters#

random_stateint, optional (default=1234)

Random seed for the normal distribution. Can also be set to None.

pyod.models.thresholds.YJ(**kwargs)[source]#

YJ class for Yeo-Johnson transformation thresholder.

Use the Yeo-Johnson transformation to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond the max value in the YJ transformed data.

pyod.models.thresholds.ZSCORE(**kwargs)[source]#

ZSCORE class for ZSCORE thresholder.

Use the zscore to evaluate a non-parametric means to threshold scores generated by the decision_scores where outliers are set to any value beyond a zscore of one.
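For intuition, the z-score rule is the simplest of these thresholders and can be sketched directly in NumPy (an illustration of the rule, not the module’s exact implementation):

>>> import numpy as np
>>> scores = np.array([0.1, 0.3, 0.2, 0.25, 2.4])  # hypothetical decision scores
>>> z = (scores - scores.mean()) / scores.std()
>>> labels = (z > 1).astype(int)                   # 1 marks an outlier
>>> labels
array([0, 0, 0, 0, 1])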

pyod.models.vae module#

Variational Auto Encoder (VAE) and beta-VAE for Unsupervised Outlier Detection

Reference:

[BKW13] Kingma, Diederik, Welling ‘Auto-Encoding Variational Bayes’ https://arxiv.org/abs/1312.6114

[BBHP+18] Burgess et al. ‘Understanding disentangling in beta-VAE’ https://arxiv.org/pdf/1804.03599.pdf

class pyod.models.vae.VAE(encoder_neurons=None, decoder_neurons=None, latent_dim=2, hidden_activation='relu', output_activation='sigmoid', loss=<function mean_squared_error>, optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=1, random_state=None, contamination=0.1, gamma=1.0, capacity=0.0)[source]#

Bases: BaseDetector

Variational auto encoder: the encoder maps X onto a latent space Z, and the decoder samples Z from N(0,1). VAE_loss = Reconstruction_loss + KL_loss.

Reference: see [BKW13] Kingma, Diederik, Welling ‘Auto-Encoding Variational Bayes’ https://arxiv.org/abs/1312.6114 for details.

beta-VAE: in the loss, the emphasis is on the KL_loss and the capacity of a bottleneck: VAE_loss = Reconstruction_loss + gamma*KL_loss.

Reference: see [BBHP+18] Burgess et al. ‘Understanding disentangling in beta-VAE’ https://arxiv.org/pdf/1804.03599.pdf for details.

Parameters#

encoder_neuronslist, optional (default=[128, 64, 32])

The number of neurons per hidden layer in encoder.

decoder_neuronslist, optional (default=[32, 64, 128])

The number of neurons per hidden layer in decoder.

hidden_activationstr, optional (default=’relu’)

Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://keras.io/activations/

output_activationstr, optional (default=’sigmoid’)

Activation function to use for output layer. See https://keras.io/activations/

lossstr or obj, optional (default=keras.losses.mean_squared_error)

String (name of objective function) or objective function. See https://keras.io/losses/

gammafloat, optional (default=1.0)

Coefficient of beta VAE regime. Default is regular VAE.

capacityfloat, optional (default=0.0)

Maximum capacity of a loss bottleneck.

optimizerstr, optional (default=’adam’)

String (name of optimizer) or optimizer instance. See https://keras.io/optimizers/

epochsint, optional (default=100)

Number of epochs to train the model.

batch_sizeint, optional (default=32)

Number of samples per gradient update.

dropout_ratefloat in (0., 1), optional (default=0.2)

The dropout rate to be used across all layers.

l2_regularizerfloat in (0., 1), optional (default=0.1)

The regularization strength of activity_regularizer applied on each layer. By default, l2 regularizer is used. See https://keras.io/regularizers/

validation_sizefloat in (0., 1), optional (default=0.1)

The percentage of data to be used for validation.

preprocessingbool, optional (default=True)

If True, apply standardization on the data.

verboseint, optional (default=1)

verbose mode.

  • 0 = silent

  • 1 = progress bar

  • 2 = one line per epoch.

For verbose >= 1, model summary may be printed.

random_stateint, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

contaminationfloat in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Attributes#

encoding_dim_int

The number of neurons in the encoding layer.

compression_rate_float

The ratio between the number of original features and the number of neurons in the encoding layer.

model_Keras Object

The underlying AutoEncoder in Keras.

history_: Keras Object

The AutoEncoder training history.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

threshold_float

The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
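A minimal usage sketch (not part of the original docstring; this VAE requires the Keras/TensorFlow backend of your PyOD build, and the layer sizes below are arbitrary choices for 8 input features):

>>> from pyod.models.vae import VAE
>>> from pyod.utils.data import generate_data
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=500, n_test=100, n_features=8,
...     contamination=0.1, random_state=42)
>>> clf = VAE(encoder_neurons=[8, 4, 2], decoder_neurons=[2, 4, 8],
...           latent_dim=2, epochs=20, verbose=0, contamination=0.1)
>>> clf = clf.fit(X_train)
>>> test_scores = clf.decision_function(X_test)  # reconstruction-based scores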

decision_function(X)[source]#

Predict raw anomaly score of X using the fitted detector.

The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y=None)[source]#

Fit detector. y is optional for unsupervised methods.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,), optional (default=None)

The ground truth of the input samples (labels).

fit_predict(X, y=None)#

Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

yIgnored

Not used, present for API consistency by convention.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0.; it will be replaced by calling fit function first and then accessing labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X, return_confidence=False)#

Predict if a particular sample is an outlier or not.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

confidencenumpy array of shape (n_samples,).

Only if return_confidence is set to True.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].

predict_proba(X, method='linear', return_confidence=False)#

Predict the probability of a sample being outlier. Two approaches are possible:

  1. simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.

  2. use unifying scores, see [BKKSZ11].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

methodstr, optional (default=’linear’)

probability conversion method. It must be one of ‘linear’ or ‘unify’.

return_confidenceboolean, optional(default=False)

If True, also return the confidence of prediction.

Returns#

outlier_probabilitynumpy array of shape (n_samples, n_classes)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).

sampling(args)[source]#

Reparametrization by sampling from an isotropic Gaussian N(0,I): instead of sampling from the likelihood Q(z|X) directly, sample epsilon from N(0,I) and compute the latent variable as z = z_mean + sqrt(var) * epsilon.

Parameters#

argstensor

Mean and log of variance of Q(z|X).

Returns#

ztensor

Sampled latent variable.
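A NumPy sketch of the reparametrization trick this method implements (illustrative; the actual method operates on Keras tensors):

>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> z_mean = np.array([[0.5, -1.0]])           # mean of Q(z|X)
>>> z_log_var = np.array([[0.1, 0.2]])         # log of variance of Q(z|X)
>>> epsilon = rng.normal(size=z_mean.shape)    # epsilon ~ N(0, I)
>>> z = z_mean + np.exp(0.5 * z_log_var) * epsilon  # z_mean + sqrt(var) * eps
>>> z.shape
(1, 2)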

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object

vae_loss(inputs, outputs, z_mean, z_log)[source]#

Loss = Reconstruction loss + Kullback-Leibler divergence loss (the ELBO). Set gamma > 1 and capacity != 0 for the beta-VAE regime.
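A NumPy sketch of this objective (assuming per-sample mean squared error as the reconstruction term and the Burgess et al. capacity form; illustrative, not the exact Keras implementation):

>>> import numpy as np
>>> def vae_loss_sketch(inputs, outputs, z_mean, z_log, gamma=1.0, capacity=0.0):
...     recon = np.mean((inputs - outputs) ** 2, axis=-1)      # reconstruction loss
...     kl = -0.5 * np.sum(1 + z_log - z_mean ** 2 - np.exp(z_log), axis=-1)
...     return np.mean(recon + gamma * np.abs(kl - capacity))  # beta-VAE objective
>>> x = np.zeros((4, 8)); x_hat = np.full((4, 8), 0.1)
>>> zm = np.zeros((4, 2)); zl = np.zeros((4, 2))
>>> round(float(vae_loss_sketch(x, x_hat, zm, zl)), 4)
0.01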

pyod.models.xgbod module#

XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. A semi-supervised outlier detection framework.

class pyod.models.xgbod.XGBOD(estimator_list=None, standardization_flag_list=None, max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, **kwargs)[source]#

Bases: BaseDetector

XGBOD class for outlier detection. It first uses the passed-in unsupervised outlier detectors to extract a richer representation of the data and then concatenates the newly generated features to the original features to construct an augmented feature space. An XGBoost classifier is then applied on this augmented feature space. Read more in [BZH18].

Parameters#

estimator_listlist, optional (default=None)

The list of pyod detectors passed in for unsupervised learning.

standardization_flag_listlist, optional (default=None)

The list of boolean flags for indicating whether to perform standardization for each detector.

max_depthint

Maximum tree depth for base learners.

learning_ratefloat

Boosting learning rate (xgb’s “eta”)

n_estimatorsint

Number of boosted trees to fit.

silentbool

Whether to print messages while running boosting.

objectivestring or callable

Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

boosterstring

Specify which booster to use: gbtree, gblinear or dart.

n_jobsint

Number of parallel threads used to run xgboost. (replaces nthread)

gammafloat

Minimum loss reduction required to make a further partition on a leaf node of the tree.

min_child_weightint

Minimum sum of instance weight (hessian) needed in a child.

max_delta_stepint

Maximum delta step we allow each tree’s weight estimation to be.

subsamplefloat

Subsample ratio of the training instance.

colsample_bytreefloat

Subsample ratio of columns when constructing each tree.

colsample_bylevelfloat

Subsample ratio of columns for each split, in each level.

reg_alphafloat (xgb’s alpha)

L1 regularization term on weights.

reg_lambdafloat (xgb’s lambda)

L2 regularization term on weights.

scale_pos_weightfloat

Balancing of positive and negative weights.

base_scorefloat

The initial prediction score of all instances, global bias.

random_stateint

Random number seed. (replaces seed)

importance_type: string, default “gain”

The feature importance type for the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

**kwargsdict, optional

Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note: **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Attributes#

n_detector_int

The number of unsupervised detectors used.

clf_object

The XGBoost classifier.

decision_scores_numpy array of shape (n_samples,)

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

labels_int, either 0 or 1

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

decision_function(X)[source]#

Predict raw anomaly scores of X using the fitted detector.

The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples. Sparse matrices are accepted only if they are supported by the base estimator.

Returns#

anomaly_scoresnumpy array of shape (n_samples,)

The anomaly score of the input samples.

fit(X, y)[source]#

Fit the model using X and y as training data.

Parameters#

Xnumpy array of shape (n_samples, n_features)

Training data.

ynumpy array of shape (n_samples,)

The ground truth (binary label)

  • 0 : inliers

  • 1 : outliers

Returns#

self : object

fit_predict(X, y)[source]#

Fit the detector first and then predict whether a particular sample is an outlier or not. Unlike in unsupervised models, y is required here.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth (binary label): 0 for inliers, 1 for outliers.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute, for consistency.

fit_predict_score(X, y, scoring='roc_auc_score')[source]#

Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

ynumpy array of shape (n_samples,)

The ground truth (binary label) of the input samples, used to compute the evaluation score.

scoringstr, optional (default=’roc_auc_score’)

Evaluation metric:

  • ‘roc_auc_score’: ROC score

  • ‘prc_n_score’: Precision @ rank n score

Returns#

score : float

get_params(deep=True)#

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters#

deepbool, optional (default=True)

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns#

paramsmapping of string to any

Parameter names mapped to their values.

predict(X)[source]#

Predict if a particular sample is an outlier or not by calling the xgboost predict function.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

outlier_labelsnumpy array of shape (n_samples,)

For each observation, tells whether or not it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.

predict_confidence(X)#

Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

confidencenumpy array of shape (n_samples,)

For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in [0,1].

predict_proba(X)[source]#

Predict the probability of a sample being an outlier by calling the xgboost predict_proba function.

Parameters#

Xnumpy array of shape (n_samples, n_features)

The input samples.

Returns#

outlier_probabilitynumpy array of shape (n_samples,)

For each observation, returns the outlier probability, ranging in [0,1].

set_params(**params)#

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns#

self : object
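
Putting the class together, a minimal end-to-end sketch (the synthetic data and hyperparameter values are assumptions; the default estimator_list is used):

    import numpy as np
    from pyod.models.xgbod import XGBOD

    rng = np.random.RandomState(0)
    X_train = np.vstack([rng.randn(190, 3),
                         rng.uniform(4, 5, size=(10, 3))])
    y_train = np.r_[np.zeros(190), np.ones(10)]   # ground-truth labels are required
    X_test = rng.randn(20, 3)

    clf = XGBOD(n_estimators=50, random_state=0)
    clf.fit(X_train, y_train)                 # supervised fit: y is used here

    labels = clf.predict(X_test)              # 0 = inlier, 1 = outlier
    scores = clf.decision_function(X_test)    # higher means more abnormal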

Module contents#

References

[BAgg15]

Charu C Aggarwal. Outlier analysis. In Data mining, 75–79. Springer, 2015.

[BAS15]

Charu C Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24–47, 2015.

[BABC20]

Yahya Almardeny, Noureddine Boujnah, and Frances Cleary. A novel outlier detection method for multivariate data. IEEE Transactions on Knowledge and Data Engineering, 2020.

[BAP02]

Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, 15–27. Springer, 2002.

[BAAR96]

Andreas Arning, Rakesh Agrawal, and Prabhakar Raghavan. A linear method for deviation detection in large databases. In KDD, volume 1141, 972–981. 1996.

[BBTA+18]

Tharindu R Bandaragoda, Kai Ming Ting, David Albrecht, Fei Tony Liu, Ye Zhu, and Jonathan R Wells. Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4):968–998, 2018.

[BBirgeR06]

Lucien Birgé and Yves Rozenholc. How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10:24–45, 2006.

[BBKNS00]

Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, 93–104. ACM, 2000.

[BBHP+18]

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

[BCoo77]

R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.

[BFM01]

Kai-Tai Fang and Chang-Xing Ma. Wrap-around l2-discrepancy of random sampling, latin hypercube and uniform designs. Journal of complexity, 17(4):608–624, 2001.

[BGD12]

Markus Goldstein and Andreas Dengel. Histogram-based outlier score (hbos): a fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pages 59–63, 2012.

[BGHNN22]

Adam Goodge, Bryan Hooi, See-Kiong Ng, and Wee Siong Ng. Lunar: unifying local outlier detection methods via graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 6737–6745. 2022.

[BHR04]

Johanna Hardin and David M Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625–638, 2004.

[BHXD03]

Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.

[BHof07]

Heiko Hoffmann. Kernel pca for novelty detection. Pattern recognition, 40(3):863–874, 2007.

[BIH93]

Boris Iglewicz and David Caster Hoaglin. How to detect and handle outliers. Volume 16. Asq Press, 1993.

[BJHuszarPvdH12]

JHM Janssens, Ferenc Huszár, EO Postma, and HJ van den Herik. Stochastic outlier selection. Technical Report TiCC TR 2012-001, Tilburg University, Tilburg Center for Cognition and Communication, Tilburg, The Netherlands, 2012.

[BKW13]

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[BKKSZ11]

Hans-Peter Kriegel, Peer Kroger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.

[BKKrogerSZ09]

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 831–838. Springer, 2009.

[BKZ+08]

Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 444–452. ACM, 2008.

[BLLP07]

Longin Jan Latecki, Aleksandar Lazarevic, and Dragoljub Pokrajac. Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, 61–75. Springer, 2007.

[BLK05]

Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 157–166. ACM, 2005.

[BLZB+20]

Zheng Li, Yue Zhao, Nicola Botta, Cezar Ionescu, and Xiyang Hu. COPOD: copula-based outlier detection. In IEEE International Conference on Data Mining (ICDM). IEEE, 2020.

[BLTZ08]

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, 413–422. IEEE, 2008.

[BLTZ12]

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):3, 2012.

[BLLZ+19]

Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering, 2019.

[BPKGF03]

Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B Gibbons, and Christos Faloutsos. Loci: fast outlier detection using the local correlation integral. In Data Engineering, 2003. Proceedings. 19th International Conference on, 315–326. IEEE, 2003.

[BPVD20]

Lorenzo Perini, Vincent Vercruyssen, and Jesse Davis. Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 227–243. Springer, 2020.

[BPevny16]

Tomáš Pevný. Loda: lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.

[BRRS00]

Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In ACM Sigmod Record, volume 29, 427–438. ACM, 2000.

[BRD99]

Peter J Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223, 1999.

[BRVG+18]

Lukas Ruff, Robert Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. International conference on machine learning, 2018.

[BSSeebockW+17]

Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, 146–157. Springer, 2017.

[BScholkopfPST+01]

Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.

[BSCSC03]

Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering, 2003.

[BSB13]

Mahito Sugiyama and Karsten Borgwardt. Rapid distance-based outlier detection via sampling. Advances in neural information processing systems, 2013.

[BTCFC02]

Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David W Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 535–548. Springer, 2002.

[BXPWW23]

Hongzuo Xu, Guansong Pang, Yijie Wang, and Yongjun Wang. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 1–14, 2023. doi:10.1109/TKDE.2023.3270293.

[BYRV17]

Chong You, Daniel P Robinson, and René Vidal. Provable self-representation based outlier detection in a union of subspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3395–3404. 2017.

[BZRF+18]

Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, and Vijay Chandrasekhar. Adversarially learned anomaly detection. In 2018 IEEE International conference on data mining (ICDM), 727–736. IEEE, 2018.

[BZH18]

Yue Zhao and Maciej K Hryniewicki. Xgbod: improving supervised outlier detection with unsupervised representation learning. In International Joint Conference on Neural Networks (IJCNN). IEEE, 2018.

[BZHC+21]

Yue Zhao, Xiyang Hu, Cheng Cheng, Cong Wang, Changlin Wan, Wen Wang, Jianing Yang, Haoping Bai, Zheng Li, Cao Xiao, Yunlong Wang, Zhi Qiao, Jimeng Sun, and Leman Akoglu. Suod: accelerating large-scale unsupervised heterogeneous outlier detection. Proceedings of Machine Learning and Systems, 2021.

[BZNHL19]

Yue Zhao, Zain Nasrullah, Maciej K Hryniewicki, and Zheng Li. LSCP: locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, 585–593. Calgary, Canada, May 2019. SIAM. doi:10.1137/1.9781611975673.66.