All Models¶
pyod.models.abod module¶
Angle-based Outlier Detector (ABOD)
- class pyod.models.abod.ABOD(contamination=0.1, n_neighbors=5, method='fast')[source]¶
Bases: BaseDetector
ABOD class for Angle-based Outlier Detection. For an observation, the variance of its weighted cosine scores to all neighbors can be viewed as its outlying score. See [BKZ+08] for details.
Two versions of ABOD are supported:
- Fast ABOD: uses the k nearest neighbors to approximate the score.
- Original ABOD: considers all training points, with a high time complexity of O(n^3).
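To make the score described above concrete, here is a minimal pure-Python sketch of the original (all-pairs) variant. It is not PyOD's implementation: the helper name `angle_variance` and the toy data are assumptions made for this illustration. A small variance means the remaining points all lie in roughly the same direction, which is typical for an outlier.

```python
from itertools import combinations


def angle_variance(point, others):
    """Variance of the distance-weighted cosines over all pairs of other points.

    Assumes no point in `others` coincides with `point` (to avoid dividing by zero).
    """
    values = []
    for a, b in combinations(others, 2):
        # Difference vectors from the observation to each of the two other points.
        pa = [ai - pi for ai, pi in zip(a, point)]
        pb = [bi - pi for bi, pi in zip(b, point)]
        dot = sum(x * y for x, y in zip(pa, pb))
        norm_a_sq = sum(x * x for x in pa)
        norm_b_sq = sum(x * x for x in pb)
        # Cosine weighted by the inverse product of the distances.
        values.append(dot / (norm_a_sq * norm_b_sq))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Five clustered points and one far-away point (index 5).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
variances = [angle_variance(p, data[:i] + data[i + 1:]) for i, p in enumerate(data)]
most_outlying = min(range(len(data)), key=lambda i: variances[i])
```

Since the detectors documented here treat higher scores as more abnormal, a detector built on this quantity would typically report the negated variance.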
Parameters¶
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_neighbors: int, optional (default=5)
Number of neighbors to use by default for k neighbors queries.
- method: str, optional (default='fast')
Valid values for method are:
'fast': fast ABOD. Only considers n_neighbors of the training points.
'default': original ABOD with all training points, which could be slow.
Attributes¶
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
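The relationship between contamination, the threshold and the binary labels can be sketched in plain Python. This is a literal reading of the description above (the top n_samples * contamination scores are flagged), not PyOD's code; the library may compute the threshold differently, for example with a percentile. The function name `label_by_contamination` is an assumption for this example.

```python
def label_by_contamination(scores, contamination=0.1):
    # Number of samples to flag, at least one.
    n_outliers = max(1, int(round(len(scores) * contamination)))
    # The threshold is the score of the last sample that is still flagged.
    ranked = sorted(scores, reverse=True)
    threshold = ranked[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels


scores = [0.1, 0.2, 0.15, 0.9, 0.12, 0.18, 0.11, 0.14, 0.13, 0.16]
threshold, labels = label_by_contamination(scores, contamination=0.1)
```

With ten scores and a contamination of 0.1, exactly one sample (the one scoring 0.9) is labelled 1.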
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scores: numpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers) against which the fitted detector is scored.
- scoring: str, optional (default='roc_auc_score')
Evaluation metric:
'roc_auc_score': ROC score
'prc_n_score': Precision @ rank n score
Returns¶
- score: float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
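The two metrics named above can be illustrated without any library. The following is a minimal sketch, assuming binary ground truth labels (1 for outliers) and scores where higher means more abnormal; the helper names `roc_auc` and `precision_at_n` are chosen for this example and are not PyOD functions, and library implementations may handle ties and thresholds differently.

```python
def roc_auc(y_true, scores):
    # Probability that a randomly chosen outlier scores higher than a randomly
    # chosen inlier; ties count as half a win.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_n(y_true, scores):
    # n is the number of true outliers; check how many of the n highest-scoring
    # samples are true outliers.
    n = sum(y_true)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n


y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(y_true, scores)
prec = precision_at_n(y_true, scores)
```

Here five of the six outlier/inlier pairs are ranked correctly, and one of the two top-ranked samples is a true outlier.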
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in the range [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
- 'linear': simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
- 'unify': use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default='linear')
Probability conversion method. It must be one of 'linear' or 'unify'.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, gives the probability that it should be considered an outlier according to the fitted model, in the range [0,1]. The number of columns depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).
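The 'linear' conversion can be sketched as a min-max rescaling against the training scores. This is a minimal illustration, not the library's code: the function name `linear_outlier_proba` and the clipping of out-of-range test scores to [0, 1] are assumptions made for this example.

```python
def linear_outlier_proba(train_scores, test_scores):
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    rows = []
    for s in test_scores:
        # Rescale against the range seen during training.
        p = (s - lo) / span if span > 0 else 0.0
        # Scores outside the training range are clipped into [0, 1].
        p = min(1.0, max(0.0, p))
        # Two columns: [proba of normal, proba of outlier].
        rows.append([1.0 - p, p])
    return rows


proba = linear_outlier_proba([0.0, 2.0, 4.0], [1.0, 5.0, -1.0])
```

Each row sums to 1, matching the two-class layout described in the Returns section.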
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
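The effect of T on the rejection threshold 1-2exp(-T) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes each prediction comes with a confidence value in [0, 1] where 1 means fully stable, and that a prediction is rejected when its confidence does not exceed the threshold. The helper names are hypothetical.

```python
import math


def rejection_threshold(T=32):
    # Threshold from the parameter description: 1 - 2 * exp(-T).
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, confidences, T=32):
    # Replace low-confidence predictions with the rejection label -2.
    thr = rejection_threshold(T)
    return [-2 if c <= thr else y for y, c in zip(labels, confidences)]


labels = [0, 1, 0, 1]
confidences = [1.0, 0.6, 0.99, 1.0]
result = apply_rejection(labels, confidences, T=32)
```

With T=32 the threshold is only about 2.5e-14 below 1, so in this sketch only predictions with a confidence of essentially 1 are kept; a smaller T lowers the threshold and rejects fewer predictions.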
pyod.models.ae1svm module¶
AE-1SVM for outlier detection (PyTorch). Source: https://arxiv.org/pdf/1804.04888. There is another implementation of this model by Minh Nghia (TensorFlow): https://github.com/minh-nghia/AE-1SVM
- class pyod.models.ae1svm.AE1SVM(hidden_neurons=None, hidden_activation='relu', batch_norm=True, learning_rate=0.001, epochs=50, batch_size=32, dropout_rate=0.2, weight_decay=1e-05, preprocessing=True, loss_fn=None, contamination=0.1, alpha=1.0, sigma=1.0, nu=0.1, kernel_approx_features=1000)[source]¶
Bases: BaseDetector
Auto Encoder with One-class SVM for anomaly detection.
Note: self.device is needed; otherwise tensors may not all be on the same device when running on a machine with a GPU.
Parameters¶
- hidden_neurons: list, optional (default=[64, 32])
Number of neurons in each hidden layer.
- hidden_activation: str, optional (default='relu')
Activation function for the hidden layers.
- batch_norm: bool, optional (default=True)
Whether to use batch normalization.
- learning_rate: float, optional (default=1e-3)
Learning rate for training the model.
- epochs: int, optional (default=50)
Number of training epochs.
- batch_size: int, optional (default=32)
Size of each training batch.
- dropout_rate: float, optional (default=0.2)
Dropout rate for regularization.
- weight_decay: float, optional (default=1e-5)
Weight decay (L2 penalty) for the optimizer.
- preprocessing: bool, optional (default=True)
Whether to apply standard scaling to the input data.
- loss_fn: callable, optional (default=torch.nn.MSELoss)
Loss function to use for reconstruction loss.
- contamination: float, optional (default=0.1)
Proportion of outliers in the data.
- alpha: float, optional (default=1.0)
Weight for the reconstruction loss in the final loss computation.
- sigma: float, optional (default=1.0)
Scaling factor for the random Fourier features.
- nu: float, optional (default=0.1)
Parameter for the SVM loss.
- kernel_approx_features: int, optional (default=1000)
Number of random Fourier features to approximate the kernel.
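To illustrate what approximating a kernel with random Fourier features means, here is a minimal pure-Python sketch of the standard construction for a Gaussian (RBF) kernel. It is not the AE1SVM code: the function name `make_rff`, the use of sigma as the kernel bandwidth, and the seed are assumptions made for this example. The dot product of two feature vectors approximates exp(-||x - y||^2 / (2 * sigma^2)), and the approximation improves as the number of features grows.

```python
import math
import random


def make_rff(n_inputs, n_features=1000, sigma=1.0, seed=0):
    rng = random.Random(seed)
    # Random projection directions drawn from N(0, 1 / sigma^2).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(n_inputs)]
               for _ in range(n_features)]
    # Random phase offsets drawn uniformly from [0, 2 * pi].
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform


transform = make_rff(n_inputs=2, n_features=2000, sigma=1.0)
x, y = [0.2, -0.1], [0.5, 0.3]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * 1.0 ** 2))
```

The explicit feature map lets a linear model operate in an approximate kernel space, so the cost grows with the number of features rather than with the number of training pairs.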
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
Parameters¶
- X: numpy.ndarray
The input samples.
Returns¶
- numpy.ndarray
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit the model to the data.
Parameters¶
- X: numpy.ndarray
Input data.
- y: None
Ignored, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers) against which the fitted detector is scored.
- scoring: str, optional (default='roc_auc_score')
Evaluation metric:
'roc_auc_score': ROC score
'prc_n_score': Precision @ rank n score
Returns¶
- score: float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in the range [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
- 'linear': simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
- 'unify': use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default='linear')
Probability conversion method. It must be one of 'linear' or 'unify'.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, gives the probability that it should be considered an outlier according to the fitted model, in the range [0,1]. The number of columns depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
pyod.models.alad module¶
Using Adversarially Learned Anomaly Detection
- class pyod.models.alad.ALAD(activation_hidden_gen='tanh', activation_hidden_disc='tanh', output_activation=None, dropout_rate=0.2, latent_dim=2, dec_layers=[5, 10, 25], enc_layers=[25, 10, 5], disc_xx_layers=[25, 10, 5], disc_zz_layers=[25, 10, 5], disc_xz_layers=[25, 10, 5], learning_rate_gen=0.0001, learning_rate_disc=0.0001, add_recon_loss=False, lambda_recon_loss=0.1, epochs=200, verbose=0, preprocessing=False, add_disc_zz_loss=True, spectral_normalization=False, batch_size=32, contamination=0.1, device=None)[source]¶
Bases: BaseDetector
Adversarially Learned Anomaly Detection (ALAD). Paper: https://arxiv.org/pdf/1812.02288.pdf
See [BZRF+18] for details.
Parameters¶
- output_activation: str, optional (default=None)
Activation function to use for the output layers of the encoder and decoder.
- activation_hidden_disc: str, optional (default='tanh')
Activation function to use for hidden layers in the discriminators.
- activation_hidden_gen: str, optional (default='tanh')
Activation function to use for hidden layers in the encoder and decoder (i.e. the generator).
- epochs: int, optional (default=200)
Number of epochs to train the model.
- batch_size: int, optional (default=32)
Number of samples per gradient update.
- dropout_rate: float in (0., 1), optional (default=0.2)
The dropout to be used across all layers.
- dec_layers: list, optional (default=[5,10,25])
List that indicates the number of nodes per hidden layer for the decoder network. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- enc_layers: list, optional (default=[25,10,5])
List that indicates the number of nodes per hidden layer for the encoder network. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- disc_xx_layers: list, optional (default=[25,10,5])
List that indicates the number of nodes per hidden layer for discriminator_xx. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- disc_zz_layers: list, optional (default=[25,10,5])
List that indicates the number of nodes per hidden layer for discriminator_zz. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- disc_xz_layers: list, optional (default=[25,10,5])
List that indicates the number of nodes per hidden layer for discriminator_xz. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- learning_rate_gen: float in (0., 1), optional (default=0.0001)
Learning rate for training the encoder and decoder.
- learning_rate_disc: float in (0., 1), optional (default=0.0001)
Learning rate for training the discriminators.
- add_recon_loss: bool, optional (default=False)
Add an extra loss for the encoder and decoder based on the reconstruction error.
- lambda_recon_loss: float in (0., 1), optional (default=0.1)
If add_recon_loss=True, the reconstruction loss is multiplied by lambda_recon_loss and added to the total loss for the generator (i.e. encoder and decoder).
- preprocessing: bool, optional (default=False)
If True, apply standardization on the data.
- verbose: int, optional (default=0)
Verbosity mode. 0 = silent, 1 = progress bar.
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the decision function.
- device: str or None, optional (default=None)
The device to use for computation. If None, the default device will be used. Possible values include 'cpu' or 'gpu'. This parameter allows the user to specify the preferred device for running the model.
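Two of the parameter descriptions above lend themselves to a small worked example: how a layer list expands into a chain of layers, and how the reconstruction loss enters the generator loss. The sketch below is illustrative only. The helper names, the choice of 30 input features, and the assumption that the encoder ends at latent_dim while the decoder ends at the input width are all made up for this example and are not taken from the ALAD code.

```python
def layer_shapes(n_inputs, hidden_layers, n_outputs):
    # Chain input -> hidden layers -> output as (in_features, out_features) pairs.
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


def generator_loss(adversarial_loss, recon_loss,
                   add_recon_loss=False, lambda_recon_loss=0.1):
    # The reconstruction term is only added when add_recon_loss is True.
    if add_recon_loss:
        return adversarial_loss + lambda_recon_loss * recon_loss
    return adversarial_loss


# Hypothetical data with 30 features and a 2-dimensional latent space.
encoder_shapes = layer_shapes(n_inputs=30, hidden_layers=[25, 10, 5], n_outputs=2)
decoder_shapes = layer_shapes(n_inputs=2, hidden_layers=[5, 10, 25], n_outputs=30)
total = generator_loss(1.0, 2.0, add_recon_loss=True, lambda_recon_loss=0.1)
```

With the default layer lists, the encoder narrows step by step towards the latent space and the decoder mirrors it back to the input width.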
Attributes¶
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data [0,1]. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scores: numpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None, noise_std=0.1)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers) against which the fitted detector is scored.
- scoring: str, optional (default='roc_auc_score')
Evaluation metric:
'roc_auc_score': ROC score
'prc_n_score': Precision @ rank n score
Returns¶
- score: float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0. It will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in the range [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
- 'linear': simply use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
- 'unify': use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default='linear')
Probability conversion method. It must be one of 'linear' or 'unify'.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, gives the probability that it should be considered an outlier according to the fitted model, in the range [0,1]. The number of columns depends on the number of classes, which is 2 by default ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
pyod.models.anogan module¶
Anomaly Detection with Generative Adversarial Networks (AnoGAN). Paper: https://arxiv.org/pdf/1703.05921.pdf. Note that this is a different implementation of AnoGAN from the one at https://github.com/fuchami/ANOGAN
- class pyod.models.anogan.AnoGAN(activation_hidden='tanh', dropout_rate=0.2, latent_dim_G=2, G_layers=[20, 10, 3, 10, 20], verbose=0, D_layers=[20, 10, 5], index_D_layer_for_recon_error=1, epochs=500, preprocessing=False, learning_rate=0.001, learning_rate_query=0.01, epochs_query=20, batch_size=32, output_activation=None, contamination=0.1, device=None)[source]¶
Bases: BaseDetector
Anomaly Detection with Generative Adversarial Networks (AnoGAN). See the original paper “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery”.
See [BSSeebockW+17] for details.
Parameters¶
- output_activation: str, optional (default=None)
Activation function to use for the output layer.
- activation_hidden: str, optional (default='tanh')
Activation function to use for the hidden layers.
- epochs: int, optional (default=500)
Number of epochs to train the model.
- batch_size: int, optional (default=32)
Number of samples per gradient update.
- dropout_rate: float in (0., 1), optional (default=0.2)
The dropout to be used across all layers.
- G_layers: list, optional (default=[20,10,3,10,20])
List that indicates the number of nodes per hidden layer for the generator. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- D_layers: list, optional (default=[20,10,5])
List that indicates the number of nodes per hidden layer for the discriminator. Thus, [10,10] indicates 2 hidden layers with 10 nodes each.
- learning_rate: float in (0., 1), optional (default=0.001)
Learning rate for training the network.
- index_D_layer_for_recon_error: int, optional (default=1)
The index of the hidden layer in the discriminator at which the reconstruction error between the query sample and the sample created from the latent space is determined.
- learning_rate_query: float in (0., 1), optional (default=0.01)
Learning rate for the backpropagation steps needed to find a point in the latent space of the generator that approximates the query sample.
- epochs_query: int, optional (default=20)
Number of epochs used to approximate the query sample in the latent space of the generator.
- preprocessing: bool, optional (default=False)
If True, apply standardization on the data.
- verbose: int, optional (default=0)
Verbosity mode. 0 = silent, 1 = progress bar.
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the decision function.
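The role of learning_rate_query and epochs_query can be illustrated with a toy latent-space search. The sketch below is a strong simplification and not the AnoGAN code: it assumes a fixed linear stand-in for the trained generator, measures the residual directly in input space rather than on a discriminator layer, and uses hypothetical names (`generator`, `query_score`, `step_size`, `n_steps`) and a step size chosen so that this toy problem converges within 20 steps.

```python
def generator(z):
    # Toy stand-in for a trained generator: maps a 1-D latent value onto the line y = 2x.
    return (z, 2.0 * z)


def query_score(x, step_size=0.05, n_steps=20):
    z = 0.0  # starting point in latent space
    for _ in range(n_steps):
        g = generator(z)
        # Gradient of ||G(z) - x||^2 with respect to z for this linear generator.
        grad = 2.0 * (g[0] - x[0]) + 4.0 * (g[1] - x[1])
        z -= step_size * grad
    g = generator(z)
    # Remaining squared distance between the query and its best reconstruction.
    return (g[0] - x[0]) ** 2 + (g[1] - x[1]) ** 2


score_normal = query_score((1.0, 2.0))      # lies on the generator's manifold
score_anomalous = query_score((1.0, -2.0))  # cannot be reproduced by the generator
```

A query the generator can reproduce ends with a residual near zero, while one far from the learned manifold keeps a large residual, which is what makes the residual usable as an anomaly score.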
Attributes¶
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data [0,1]. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
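The two scoring options named above, ROC AUC and precision @ rank n, can be sketched without any dependency. This is an illustrative sketch, assuming y holds ground truth labels (1 for outliers) and that n equals the number of true outliers; the function names are hypothetical and are not the library's own evaluation utilities.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that an outlier outscores an inlier."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def precision_at_rank_n(y_true, scores):
    """Precision among the n highest scores, with n = number of true outliers."""
    n = sum(y_true)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n


# Illustrative labels and scores only.
y = [0, 0, 1, 0, 1]
s = [0.1, 0.4, 0.35, 0.2, 0.8]
print(roc_auc(y, s), precision_at_rank_n(y, s))
```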
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
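The 'linear' conversion can be illustrated with a small min-max sketch. It assumes the minimum and maximum are taken from the training scores and that results outside [0,1] are clipped; both are assumptions for illustration, and linear_outlier_proba is a hypothetical helper rather than the library's implementation.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max scale test scores using the range of the training scores.

    Returns one [proba of normal, proba of outlier] row per test sample.
    """
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    rows = []
    for s in test_scores:
        p = (s - lo) / span if span > 0 else 0.0
        # Clip so that unseen extreme scores still map into [0, 1].
        p = min(1.0, max(0.0, p))
        rows.append([1.0 - p, p])
    return rows


proba = linear_outlier_proba([1.0, 2.0, 3.0, 5.0], [0.5, 3.0, 9.0])
print(proba)
```

Each row sums to 1, matching the two-column output shape described in Returns.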
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
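The labelling convention (0, 1, or -2 for rejection) can be sketched as follows. The sketch assumes each prediction comes with a confidence value in [0,1] that is directly comparable to the threshold 1-2exp(-T), and that predictions at or below the threshold are rejected; how the library derives that confidence is not shown here, and labels_with_rejection is a hypothetical name.

```python
import math

REJECT = -2  # label used for rejected predictions, as documented above


def labels_with_rejection(labels, confidence, T=32):
    """Replace low-confidence labels with the rejection label.

    Assumption: a prediction is kept only if its confidence is strictly
    above the threshold 1 - 2 * exp(-T).
    """
    threshold = 1.0 - 2.0 * math.exp(-T)
    return [label if conf > threshold else REJECT
            for label, conf in zip(labels, confidence)]


# Illustrative predictions and confidences; T=2 gives a threshold near 0.73.
out = labels_with_rejection([0, 1, 0, 1], [0.95, 0.6, 0.5, 0.99], T=2)
print(out)
```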
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
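The <component>__<parameter> naming can be illustrated by splitting keys on the double underscore. This is only a sketch of the naming convention; split_nested_params is a hypothetical helper, and the example parameter names are chosen for illustration.

```python
def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> entries."""
    own, nested = {}, {}
    for key, value in params.items():
        # partition splits on the first "__" only, so deeper nesting
        # stays inside the sub-key for the component to resolve.
        name, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params({
    "contamination": 0.05,
    "clustering_estimator__n_clusters": 4,
})
print(own, nested)
```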
pyod.models.auto_encoder module¶
Using AutoEncoder with Outlier Detection
- class pyod.models.auto_encoder.AutoEncoder(contamination=0.1, preprocessing=True, lr=0.001, epoch_num=10, batch_size=32, optimizer_name='adam', device=None, random_state=42, use_compile=False, compile_mode='default', verbose=1, optimizer_params: dict = {'weight_decay': 1e-05}, hidden_neuron_list=[64, 32], hidden_activation_name='relu', batch_norm=True, dropout_rate=0.2)[source]¶
Bases:
BaseDeepLearningDetector
An Auto Encoder (AE) is a type of neural network for learning useful data representations in an unsupervised manner. Similar to PCA, an AE can be used to detect outlying objects in the data by calculating the reconstruction errors. See [BAgg15] Chapter 3 for details.
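The idea of scoring by reconstruction error can be sketched without a neural network. The example assumes the error is the Euclidean distance between a sample and its reconstruction, which is one common choice rather than a confirmed detail of this class; the reconstructions below are made-up stand-ins for model output.

```python
import math


def reconstruction_error(x, x_hat):
    """Euclidean distance between a sample and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


# Hypothetical samples and reconstructions: the second sample is
# reconstructed poorly, so it receives the larger outlier score.
samples = [[1.0, 2.0], [0.0, 0.0]]
reconstructions = [[1.0, 2.0], [3.0, 4.0]]
scores = [reconstruction_error(x, x_hat)
          for x, x_hat in zip(samples, reconstructions)]
print(scores)
```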
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- preprocessingbool, optional (default=True)
If True, apply the preprocessing procedure before training models.
- lrfloat, optional (default=1e-3)
The initial learning rate for the optimizer.
- epoch_numint, optional (default=10)
The number of epochs for training.
- batch_sizeint, optional (default=32)
The batch size for training.
- optimizer_namestr, optional (default=’adam’)
The name of the optimizer used to train the model.
- devicestr, optional (default=None)
The device to use for the model. If None, it will be decided automatically. If you want to use MPS, set it to ‘mps’.
- random_stateint, optional (default=42)
The random seed for reproducibility.
- use_compilebool, optional (default=False)
Whether to compile the model. If True, the model will be compiled before training. This is only available for PyTorch version >= 2.0.0 and Python < 3.12.
- compile_modestr, optional (default=’default’)
The mode to compile the model. Can be either “default”, “reduce-overhead”, “max-autotune” or “max-autotune-no-cudagraphs”. See https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile for details.
- verboseint, optional (default=1)
Verbosity mode: 0 = silent, 1 = progress bar, 2 = one line per epoch.
- optimizer_paramsdict, optional (default={‘weight_decay’: 1e-5})
Additional parameters for the optimizer. For example, optimizer_params={‘weight_decay’: 1e-5}.
- hidden_neuron_listlist, optional (default=[64, 32])
The number of neurons per hidden layers. So the network has the structure as [feature_size, 64, 32, 32, 64, feature_size].
- hidden_activation_namestr, optional (default=’relu’)
The activation function used in hidden layers.
- batch_normboolean, optional (default=True)
Whether to apply Batch Normalization. See https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
- dropout_ratefloat in (0., 1), optional (default=0.2)
The dropout to be used across all layers.
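The mirrored structure described for hidden_neuron_list can be reproduced with a small helper. This sketch only computes layer widths and builds no model; autoencoder_layer_sizes is a hypothetical name.

```python
def autoencoder_layer_sizes(feature_size, hidden_neuron_list=(64, 32)):
    """Layer widths of a symmetric autoencoder: encoder, then mirrored decoder."""
    encoder = list(hidden_neuron_list)
    decoder = encoder[::-1]
    return [feature_size] + encoder + decoder + [feature_size]


sizes = autoencoder_layer_sizes(20)
print(sizes)  # [20, 64, 32, 32, 64, 20]
```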
Attributes¶
- modeltorch.nn.Module
The underlying AutoEncoder model.
- optimizertorch.optim
The optimizer used to train the model.
- criteriontorch.nn.modules
The loss function used to train the model.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- build_model()[source]¶
Need to define model in this method. self.feature_size is the number of features in the input data.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X, batch_size=None)¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- batch_sizeint, optional (default=None)
The batch size for processing the input samples. If not specified, the default batch size is used.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- evaluate(data_loader)¶
Evaluate the deep learning model.
Parameters¶
- data_loadertorch.utils.data.DataLoader
The data loader for evaluating the model.
Returns¶
- outlier_scoresnumpy array of shape (n_samples,)
The outlier scores of the input samples.
- fit(X, y=None)¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,), optional (default=None)
The ground truth of input samples. Not used in unsupervised methods.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- train(train_loader)¶
Train the deep learning model.
Parameters¶
- train_loadertorch.utils.data.DataLoader
The data loader for training the model.
- training_forward(batch_data)[source]¶
Forward pass for training the model. Abstract method to be implemented.
Parameters¶
- batch_datatuple
The batch data for training the model.
Returns¶
- lossfloat or tuple of float
The loss.item of the model, or a tuple of loss.item if there are multiple losses.
- training_prepare()¶
pyod.models.auto_encoder_torch module¶
pyod.models.cblof module¶
Clustering Based Local Outlier Factor (CBLOF)
- class pyod.models.cblof.CBLOF(n_clusters=8, contamination=0.1, clustering_estimator=None, alpha=0.9, beta=5, use_weights=False, check_estimator=False, random_state=None, n_jobs=1)[source]¶
Bases:
BaseDetector
The CBLOF operator calculates the outlier score based on cluster-based local outlier factor.
CBLOF takes as an input the data set and the cluster model that was generated by a clustering algorithm. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster.
The original publication weights the outlier factor by the sizes of the clusters. Since this might lead to unexpected behavior (outliers close to small clusters are not found), weighting is disabled by default, and outlier scores are computed solely from the distance to the closest large cluster center.
By default, KMeans is used as the clustering algorithm instead of the Squeezer algorithm mentioned in the original paper, for multiple reasons.
See [BHXD03] for details.
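The unweighted scoring described above, distance to the closest large cluster center, can be sketched in a few lines. This is a simplified illustration that treats every point the same way; the library may handle points inside large clusters differently, and cblof_like_scores is a hypothetical name.

```python
import math


def cblof_like_scores(points, centers, large_cluster_ids):
    """Score each point by its distance to the nearest large cluster center.

    Assumption: unweighted variant (use_weights=False), with the set of
    large clusters already decided.
    """
    large_centers = [centers[i] for i in large_cluster_ids]
    return [min(math.dist(p, c) for c in large_centers) for p in points]


# Illustrative centers: clusters 0 and 1 are large, cluster 2 is small.
centers = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
points = [(0.0, 1.0), (5.0, 5.0), (10.0, 9.0)]
scores = cblof_like_scores(points, centers, large_cluster_ids=[0, 1])
print(scores)
```

The point sitting on the small cluster's center receives the highest score because it is far from both large centers.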
Parameters¶
- n_clustersint, optional (default=8)
The number of clusters to form as well as the number of centroids to generate.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- clustering_estimatorEstimator, optional (default=None)
The base clustering algorithm for performing data clustering. A valid clustering algorithm should be passed in. The estimator should have standard sklearn APIs, fit() and predict(). The estimator should have attributes labels_ and cluster_centers_. If cluster_centers_ is not in the attributes once the model is fit, it is calculated as the mean of the samples in a cluster.
If not set, CBLOF uses KMeans for scalability. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- alphafloat in (0.5, 1), optional (default=0.9)
Coefficient for deciding small and large clusters. The ratio of the number of samples in large clusters to the number of samples in small clusters.
- betaint or float in (1,), optional (default=5).
Coefficient for deciding small and large clusters. For a list of clusters sorted by size |C1|, |C2|, …, |Cn|, beta = |Ck|/|Ck-1|
- use_weightsbool, optional (default=False)
If set to True, the sizes of clusters are used as weights in outlier score calculation.
- check_estimatorbool, optional (default=False)
If set to True, check whether the base estimator is consistent with sklearn standard.
Warning
check_estimator may throw errors with scikit-learn 0.20 and above.
- random_stateint, RandomState or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
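How alpha and beta separate large from small clusters can be sketched as follows. This follows a simplified reading of the rule in the original CBLOF paper: with clusters sorted by decreasing size, the boundary is the first position where either the large clusters cover at least alpha of the samples or the size ratio to the next cluster is at least beta. The library may combine the two conditions differently, and split_large_small is a hypothetical helper.

```python
def split_large_small(cluster_sizes, alpha=0.9, beta=5):
    """Return (large_ids, small_ids) for clusters given by their sizes."""
    # Cluster indices ordered from the largest to the smallest cluster.
    order = sorted(range(len(cluster_sizes)),
                   key=lambda i: cluster_sizes[i], reverse=True)
    n_samples = sum(cluster_sizes)
    running = 0
    boundary = len(order)  # default: every cluster counts as large
    for rank, idx in enumerate(order[:-1], start=1):
        running += cluster_sizes[idx]
        next_size = cluster_sizes[order[rank]]
        covers_alpha = running >= alpha * n_samples
        size_drop = next_size > 0 and cluster_sizes[idx] / next_size >= beta
        if covers_alpha or size_drop:
            boundary = rank
            break
    return order[:boundary], order[boundary:]


# Illustrative sizes: two big clusters hold 95% of the samples.
large, small = split_large_small([500, 20, 450, 30])
print(large, small)
```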
Attributes¶
- clustering_estimator_Estimator, sklearn instance
Base estimator for clustering.
- cluster_labels_list of shape (n_samples,)
Cluster assignment for the training samples.
- n_clusters_int
Actual number of clusters (possibly different from n_clusters).
- cluster_sizes_list of shape (n_clusters_,)
The size of each cluster once fitted with the training data.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- cluster_centers_numpy array of shape (n_clusters_, n_features)
The center of each cluster.
- small_cluster_labels_list of clusters numbers
The cluster assignments belonging to small clusters.
- large_cluster_labels_list of clusters numbers
The cluster assignments belonging to large clusters.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
pyod.models.cof module¶
Connectivity-Based Outlier Factor (COF) Algorithm
- class pyod.models.cof.COF(contamination=0.1, n_neighbors=20, method='fast')[source]¶
Bases:
BaseDetector
Connectivity-Based Outlier Factor (COF). COF uses the ratio of the average chaining distance of a data point to the average of the average chaining distances of its k nearest neighbors as the outlier score for observations.
See [BTCFC02] for details.
Two versions of COF are supported:
Fast COF: computes the entire pairwise distance matrix at the cost of an O(n^2) memory requirement.
Memory efficient COF: calculates pairwise distances incrementally. Use this implementation when it is not feasible to fit the n-by-n distance in memory. This leads to a linear overhead because many distances will have to be recalculated.
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_neighborsint, optional (default=20)
Number of neighbors to use by default for k neighbors queries. Note that n_neighbors should be less than the number of samples. If n_neighbors is larger than the number of samples provided, all samples will be used.
- methodstring, optional (default=’fast’)
Valid values for method are:
‘fast’ Fast COF, computes the full pairwise distance matrix up front.
‘memory’ Memory-efficient COF, computes pairwise distances only when needed at the cost of computational speed.
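The trade-off between the two methods can be illustrated by contrasting a precomputed distance matrix with on-demand distance rows. The sketch shows only the memory versus recomputation trade-off, not the COF score itself; both function names are hypothetical.

```python
import math


def full_distance_matrix(points):
    """'fast' style: store all n * n pairwise distances up front."""
    return [[math.dist(p, q) for q in points] for p in points]


def distances_from(points, i):
    """'memory' style: compute one row of distances only when needed."""
    return [math.dist(points[i], q) for q in points]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = full_distance_matrix(points)  # O(n^2) memory, computed once
row = distances_from(points, 1)        # O(n) memory, recomputed per query
print(matrix[0][1], row)
```

Both approaches yield the same distances; the second simply pays for them again each time a row is requested.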
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- n_neighbors_int
Number of neighbors to use by default for k neighbors queries.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
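The two scoring options can be approximated without any library. The sketch below is a plain-Python illustration, not PyOD's implementation: roc_auc uses the pairwise-ranking definition of ROC AUC, and precision_at_rank_n assumes "rank n" means the number of true outliers. Both function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_rank_n(y_true, scores):
    """Precision among the n highest-scoring samples, n = number of true outliers."""
    n = sum(y_true)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n


y_true = [0, 0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.2, 0.9, 0.8, 0.7]
auc = roc_auc(y_true, scores)
p_at_n = precision_at_rank_n(y_true, scores)
print(auc, p_at_n)
```

In this toy data one inlier (score 0.8) ranks above one outlier (score 0.7), which lowers both metrics below 1.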
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores; see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
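The min-max conversion can be sketched in a few lines. This is an illustration under assumptions, not the library code: it scales new scores by the minimum and maximum of the training scores, clips to [0, 1], and returns one [proba of normal, proba of outlier] pair per sample. The function name linear_outlier_proba is hypothetical.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max scale scores against the training range and clip to [0, 1]."""
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    proba = []
    for s in test_scores:
        p = (s - lo) / span if span > 0 else 0.0
        p = min(1.0, max(0.0, p))  # scores outside the training range are clipped
        proba.append([1.0 - p, p])  # [proba of normal, proba of outlier]
    return proba


proba = linear_outlier_proba([1.0, 2.0, 5.0], [0.5, 3.0, 9.0])
print(proba)
```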
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
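The effect of T on rejections can be sketched as follows. The threshold 1-2exp(-T) and the label -2 come from the description above; the comparison rule (reject when the confidence is less than or equal to the threshold) and the function name reject_low_confidence are assumptions made for illustration.

```python
import math

REJECTED = -2  # label used for rejected predictions


def reject_low_confidence(labels, confidences, T=32):
    """Replace labels whose confidence is at or below 1 - 2*exp(-T) with -2."""
    threshold = 1 - 2 * math.exp(-T)
    return [REJECTED if c <= threshold else y
            for y, c in zip(labels, confidences)]


labels = [0, 1, 1, 0]
confidences = [1.0, 0.60, 1.0, 0.95]
strict = reject_low_confidence(labels, confidences, T=32)  # threshold just below 1
lenient = reject_low_confidence(labels, confidences, T=1)  # threshold about 0.26
print(strict, lenient)
```

With the default T=32 the threshold is so close to 1 that only fully confident predictions survive, while a small T keeps almost everything.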
pyod.models.combination module¶
A collection of model combination functionalities.
- pyod.models.combination.aom(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]¶
Average of Maximum - An ensemble method for combining multiple estimators. See [BAS15] for details.
The estimators are first divided into subgroups, and the maximum score within each subgroup is taken as the subgroup score. The final score is the average of all subgroup scores.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
The score matrix outputted from various estimators
- n_bucketsint, optional (default=5)
The number of subgroups to build
- methodstr, optional (default=’static’)
{‘static’, ‘dynamic’}, if ‘dynamic’, build subgroups randomly with dynamic bucket size.
- bootstrap_estimatorsbool, optional (default=False)
Whether estimators are drawn with replacement.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Returns¶
- combined_scoresNumpy array of shape (n_samples,)
The combined outlier scores.
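A minimal plain-Python sketch of the Average of Maximum idea, assuming static buckets of equal size taken in column order. PyOD's aom may assign estimators to buckets differently (for example after shuffling), so this illustrates the technique rather than reproducing the library function; aom_static is a hypothetical name.

```python
def aom_static(scores, n_buckets):
    """Average of Maximum over contiguous, equally sized estimator buckets."""
    n_estimators = len(scores[0])
    if n_estimators % n_buckets != 0:
        raise ValueError("this sketch needs n_estimators divisible by n_buckets")
    size = n_estimators // n_buckets
    combined = []
    for row in scores:
        # Maximum score inside each bucket of estimators.
        bucket_max = [max(row[i:i + size]) for i in range(0, n_estimators, size)]
        # Average of the per-bucket maxima.
        combined.append(sum(bucket_max) / n_buckets)
    return combined


scores = [
    [0.1, 0.3, 0.2, 0.4],  # sample 0, scored by 4 estimators
    [0.9, 0.5, 0.7, 0.8],  # sample 1
]
combined = aom_static(scores, n_buckets=2)
print(combined)
```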
- pyod.models.combination.average(scores, estimator_weights=None)[source]¶
Combination method to merge the outlier scores from multiple estimators by taking the average.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
Score matrix from multiple estimators on the same samples.
- estimator_weightslist of shape (1, n_estimators)
If specified, a weighted average is used.
Returns¶
- combined_scoresnumpy array of shape (n_samples, )
The combined outlier scores.
- pyod.models.combination.majority_vote(scores, weights=None)[source]¶
Combination method to merge the scores from multiple estimators by majority vote.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
Score matrix from multiple estimators on the same samples.
- weightsnumpy array of shape (1, n_estimators)
If specified, a weighted majority vote is used.
Returns¶
- combined_scoresnumpy array of shape (n_samples, )
The combined scores.
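As a rough illustration of weighted majority voting, the sketch below assumes each row holds one discrete vote per estimator (for example 0/1 labels) and returns the value with the largest total weight. The function name and the tie-breaking rule (first value seen wins) are assumptions, not PyOD behavior.

```python
from collections import defaultdict


def weighted_majority_vote(votes, weights=None):
    """Return, per sample, the vote value with the largest total weight."""
    n_estimators = len(votes[0])
    weights = weights or [1.0] * n_estimators
    combined = []
    for row in votes:
        tally = defaultdict(float)
        for label, w in zip(row, weights):
            tally[label] += w
        # On ties, max() keeps the value that was inserted first.
        combined.append(max(tally, key=tally.get))
    return combined


votes = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
plain = weighted_majority_vote(votes)
weighted = weighted_majority_vote(votes, weights=[3.0, 1.0, 1.0])
print(plain, weighted)
```

Giving the first estimator a weight of 3 flips the last sample, where it was previously outvoted.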
- pyod.models.combination.maximization(scores)[source]¶
Combination method to merge the outlier scores from multiple estimators by taking the maximum.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
Score matrix from multiple estimators on the same samples.
Returns¶
- combined_scoresnumpy array of shape (n_samples, )
The combined outlier scores.
- pyod.models.combination.median(scores)[source]¶
Combination method to merge the scores from multiple estimators by taking the median.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
Score matrix from multiple estimators on the same samples.
Returns¶
- combined_scoresnumpy array of shape (n_samples, )
The combined scores.
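The average, maximization and median combiners documented above all reduce each row of the score matrix to a single value. A dependency-free sketch of the three reductions, with hypothetical function names and nested lists standing in for numpy arrays:

```python
from statistics import median


def combine_average(scores, estimator_weights=None):
    """Row-wise (optionally weighted) mean of the estimator scores."""
    n_estimators = len(scores[0])
    w = estimator_weights or [1.0] * n_estimators
    total = sum(w)
    return [sum(s * wi for s, wi in zip(row, w)) / total for row in scores]


def combine_maximization(scores):
    """Row-wise maximum of the estimator scores."""
    return [max(row) for row in scores]


def combine_median(scores):
    """Row-wise median of the estimator scores."""
    return [median(row) for row in scores]


scores = [
    [1.0, 2.0, 6.0],
    [4.0, 4.0, 10.0],
]
avg = combine_average(scores)
wavg = combine_average(scores, estimator_weights=[1.0, 1.0, 2.0])
mx = combine_maximization(scores)
med = combine_median(scores)
print(avg, wavg, mx, med)
```

Scores from different detectors are usually on different scales, so in practice they would typically be standardized before being combined this way.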
- pyod.models.combination.moa(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]¶
Maximization of Average - An ensemble method for combining multiple estimators. See [BAS15] for details.
The estimators are first divided into subgroups, and the average score within each subgroup is taken as the subgroup score. The final score is the maximum of all subgroup scores.
Parameters¶
- scoresnumpy array of shape (n_samples, n_estimators)
The score matrix outputted from various estimators
- n_bucketsint, optional (default=5)
The number of subgroups to build
- methodstr, optional (default=’static’)
{‘static’, ‘dynamic’}, if ‘dynamic’, build subgroups randomly with dynamic bucket size.
- bootstrap_estimatorsbool, optional (default=False)
Whether estimators are drawn with replacement.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Returns¶
- combined_scoresNumpy array of shape (n_samples,)
The combined outlier scores.
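A plain-Python sketch of Maximization of Average with randomly assigned, equally sized buckets. The seeding via random.Random(random_state) only mimics the role of the random_state parameter; the bucket assignment is an assumption and may differ from what PyOD's moa does. moa_static is a hypothetical name.

```python
import random


def moa_static(scores, n_buckets, random_state=None):
    """Maximum of per-bucket averages over shuffled, equally sized buckets."""
    n_estimators = len(scores[0])
    if n_estimators % n_buckets != 0:
        raise ValueError("this sketch needs n_estimators divisible by n_buckets")
    order = list(range(n_estimators))
    random.Random(random_state).shuffle(order)  # seeded, reproducible shuffle
    size = n_estimators // n_buckets
    buckets = [order[i:i + size] for i in range(0, n_estimators, size)]
    combined = []
    for row in scores:
        bucket_avg = [sum(row[j] for j in bucket) / size for bucket in buckets]
        combined.append(max(bucket_avg))
    return combined


scores = [
    [1.0, 2.0, 3.0, 6.0],
    [4.0, 4.0, 8.0, 0.0],
]
one_bucket = moa_static(scores, n_buckets=1, random_state=0)    # plain average
four_buckets = moa_static(scores, n_buckets=4, random_state=0)  # plain maximum
two_buckets = moa_static(scores, n_buckets=2, random_state=0)   # in between
print(one_bucket, two_buckets, four_buckets)
```

The two extreme bucket counts show that the method interpolates between simple averaging and simple maximization.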
pyod.models.cd module¶
Cook’s distance outlier detection (CD)
- class pyod.models.cd.CD(contamination=0.1, model=LinearRegression())[source]¶
Bases:
BaseDetector
Cook’s distance can be used to identify points that negatively affect a regression model. A combination of each observation’s leverage and residual values is used in the measurement. Higher leverage and residuals relate to higher Cook’s distances. Note that this method is unsupervised and requires at least two features for X with which to calculate the mean Cook’s distance for each datapoint. See [BCoo77] for details.
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- modelobject, optional (default=LinearRegression())
Regression model used to calculate the Cook’s distance
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
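To make the leverage-and-residual idea concrete, here is a textbook Cook's distance for a simple one-predictor least-squares fit, written in plain Python. It is a sketch under assumptions: the CD detector described above works on a multi-feature X and averages distances per datapoint, which this example does not reproduce. The function name cooks_distance is illustrative.

```python
def cooks_distance(x, y):
    """Cook's distance of each point for the least-squares line y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage h_ii for simple linear regression.
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 2.0]
distances = cooks_distance(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
print(most_influential, distances)
```

The last point lies far from the other x values and off their trend, so it combines high leverage with a noticeable residual and dominates the distances.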
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores; see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
pyod.models.copod module¶
Copula Based Outlier Detector (COPOD)
- class pyod.models.copod.COPOD(contamination=0.1, n_jobs=1)[source]¶
Bases:
BaseDetector
COPOD class for Copula Based Outlier Detector. COPOD is a parameter-free, highly interpretable outlier detection algorithm based on empirical copula models. See [BLZB+20] for details.
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_jobsoptional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
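The copula-based scoring idea can be sketched with empirical tail probabilities. The example below is a simplified illustration, not the COPOD implementation: for every feature it estimates left-tail and right-tail empirical probabilities, sums their negative logarithms across features, and keeps the larger of the two sums. The skewness correction described in [BLZB+20] is deliberately omitted, and tail_scores is a hypothetical name.

```python
import math


def tail_scores(X):
    """Sum of negative log empirical tail probabilities, max over left/right tails."""
    n = len(X)
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    scores = []
    for row in X:
        left = right = 0.0
        for j, column in enumerate(columns):
            # Empirical probabilities; both are at least 1/n, so log is defined.
            left_prob = sum(v <= row[j] for v in column) / n
            right_prob = sum(v >= row[j] for v in column) / n
            left += -math.log(left_prob)
            right += -math.log(right_prob)
        scores.append(max(left, right))
    return scores


X = [[1, 1], [2, 2], [3, 3], [2, 1], [1, 2], [10, 10]]
scores = tail_scores(X)
most_abnormal = max(range(len(X)), key=lambda i: scores[i])
print(most_abnormal, scores)
```

The point [10, 10] sits in the extreme right tail of both features, so it receives the highest score.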
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- explain_outlier(ind, columns=None, cutoffs=None, feature_names=None, file_name=None, file_type=None)[source]¶
Plot dimensional outlier graph for a given data point within the dataset.
Parameters¶
- indint
The index of the data point one wishes to obtain a dimensional outlier graph for.
- columnslist
Specify a list of features/dimensions for plotting. If not specified, use all features.
- cutoffslist of floats in (0., 1), optional (default=[0.95, 0.99])
The significance cutoff bands of the dimensional outlier graph.
- feature_nameslist of strings
The display names of all columns of the dataset, to show on the x-axis of the plot.
- file_namestring
The name to save the figure
- file_typestring
The file type to save the figure
Returns¶
- Plotmatplotlib plot
The dimensional outlier graph for data point with index ind.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples, used to compute the score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores; see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
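The <component>__<parameter> convention used by get_params and set_params can be illustrated with a toy class. ToyEstimator below is a hypothetical stand-in written for this example; it is neither a PyOD nor a scikit-learn class and only mimics how nested keys are exposed and routed.

```python
class ToyEstimator:
    """Hypothetical estimator that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def get_params(self, deep=True):
        out = dict(self.params)
        if deep:
            # Expose parameters of nested estimators as <component>__<parameter>.
            for name, value in self.params.items():
                if isinstance(value, ToyEstimator):
                    for sub, sub_value in value.get_params().items():
                        out[f"{name}__{sub}"] = sub_value
        return out

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub = key.partition("__")
            if sep:
                # Route the remainder of the key to the nested component.
                self.params[name].set_params(**{sub: value})
            else:
                self.params[name] = value
        return self


inner = ToyEstimator(n_neighbors=5)
outer = ToyEstimator(contamination=0.1, detector=inner)
returned = outer.set_params(contamination=0.05, detector__n_neighbors=20)
params = outer.get_params()
print(params["contamination"], params["detector__n_neighbors"])
```

Returning self from set_params is what allows calls to be chained.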
pyod.models.deep_svdd module¶
Deep One-Class Classification for outlier detection
- class pyod.models.deep_svdd.DeepSVDD(n_features, c=None, use_ae=False, hidden_neurons=None, hidden_activation='relu', output_activation='sigmoid', optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=1, random_state=None, contamination=0.1)[source]¶
Bases:
BaseDetector
Deep One-Class Classifier with AutoEncoder (AE) is a type of neural network for learning useful data representations in an unsupervised way. DeepSVDD trains a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data, forcing the network to extract the common factors of variation. Similar to PCA, DeepSVDD can be used to detect outlying objects in the data by calculating their distance from the center. See [BRVG+18] for details.
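The scoring step can be pictured without any neural network. In the sketch below the network outputs are replaced by hand-written vectors, the center c is assumed to be the mean of the training representations, and the score is assumed to be the squared Euclidean distance to c. The names center_of and svdd_scores are illustrative only.

```python
def center_of(representations):
    """Mean vector of the given representations (assumed choice of center c)."""
    n = len(representations)
    dim = len(representations[0])
    return [sum(r[k] for r in representations) / n for k in range(dim)]


def svdd_scores(representations, c):
    """Squared Euclidean distance of each representation to the center c."""
    return [sum((rk - ck) ** 2 for rk, ck in zip(r, c)) for r in representations]


# Stand-ins for network outputs; a real model would produce these vectors.
train_repr = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
c = center_of(train_repr)
scores = svdd_scores([[1.0, 1.0], [5.0, 1.0]], c)
print(c, scores)
```

A representation that coincides with the center scores 0, and the score grows as the representation moves away from the hypersphere center.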
Parameters¶
- n_features: int
Number of features in the input data.
- c: float, optional (default=None)
Deep SVDD center. If None, it is calculated from the first forward pass after network initialization. To get reproducible results when c is None, set random_state.
- use_ae: bool, optional (default=False)
If True, use the AutoEncoder variant of DeepSVDD, which mirrors the layers given in hidden_neurons.
- hidden_neurons: list, optional (default=[64, 32])
The number of neurons per hidden layer. If use_ae is True, the layers are mirrored, e.g. [64, 32] -> [64, 32, 32, 64, n_features].
- hidden_activationstr, optional (default=’relu’)
Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://keras.io/activations/
- output_activationstr, optional (default=’sigmoid’)
Activation function to use for output layer. See https://keras.io/activations/
- optimizerstr, optional (default=’adam’)
String (name of optimizer) or optimizer instance. See https://keras.io/optimizers/
- epochsint, optional (default=100)
Number of epochs to train the model.
- batch_sizeint, optional (default=32)
Number of samples per gradient update.
- dropout_ratefloat in (0., 1), optional (default=0.2)
The dropout to be used across all layers.
- l2_regularizerfloat in (0., 1), optional (default=0.1)
The regularization strength of activity_regularizer applied on each layer. By default, l2 regularizer is used. See https://keras.io/regularizers/
- validation_sizefloat in (0., 1), optional (default=0.1)
The percentage of data to be used for validation.
- preprocessingbool, optional (default=True)
If True, apply standardization on the data.
- random_state: int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.
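The layer mirroring described for use_ae and hidden_neurons can be written out as a small helper. This only reproduces the [64, 32] -> [64, 32, 32, 64, n_features] example given in the parameter description; the function name layer_sizes is hypothetical and the actual network construction is not shown.

```python
def layer_sizes(n_features, hidden_neurons=None, use_ae=False):
    """Return the list of layer widths implied by hidden_neurons and use_ae."""
    hidden = list(hidden_neurons) if hidden_neurons is not None else [64, 32]
    if use_ae:
        # Mirror the hidden layers and finish with a layer of width n_features.
        return hidden + hidden[::-1] + [n_features]
    return hidden


plain = layer_sizes(n_features=10)
mirrored = layer_sizes(n_features=10, hidden_neurons=[64, 32], use_ae=True)
print(plain, mirrored)
```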
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
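As a rough illustration of the two scoring options named above, the sketch below computes a ROC AUC and a precision @ rank n from ground-truth labels and outlier scores using only the standard library. The function names are hypothetical, and the definitions used here (pairwise ranking for AUC, top-n ranking for precision with n defaulting to the number of true outliers) are common conventions rather than a statement of how pyod implements them.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that an outlier outranks an inlier.

    Ties count as half a win. Assumes y_true contains both 0s and 1s.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("y_true must contain both inliers and outliers")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y_true)  # assumption: n defaults to the number of outliers
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:n]) / n


y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = roc_auc(y_true, scores)
p_at_n = precision_at_n(y_true, scores)
print(auc, p_at_n)
```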
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
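The 'linear' conversion described above can be sketched as follows. This is an illustrative stand-in, not pyod's code: it assumes the min and max are taken from the training scores and that converted values are clipped to [0, 1]. The function name is hypothetical.

```python
def linear_outlier_probability(train_scores, test_scores):
    """Min-max convert scores to [0, 1] and return [P(normal), P(outlier)] rows.

    Assumptions: scaling bounds come from the training scores, and values
    falling outside those bounds are clipped.
    """
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    rows = []
    for s in test_scores:
        p = (s - lo) / span if span > 0 else 0.0
        p = min(1.0, max(0.0, p))  # clip to the unit interval
        rows.append([1.0 - p, p])
    return rows


proba = linear_outlier_probability([1.0, 2.0, 5.0], [1.0, 3.0, 7.0])
print(proba)
```

Each row has two entries, matching the (n_samples, n_classes) shape documented for the default two-class case.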
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
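The effect of T on the rejection threshold 1-2exp(-T) can be seen in a small sketch. It assumes that a prediction is rejected (label -2) when its confidence is at or below the threshold. That rule, and the helper names, are illustrative assumptions rather than a description of pyod's internals.

```python
import math


def rejection_threshold(T=32):
    """Rejection threshold 1 - 2*exp(-T), as given in the parameter description."""
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, confidences, T=32):
    """Replace low-confidence predictions with -2.

    Assumption: a prediction is rejected when its confidence is at or
    below the rejection threshold.
    """
    thr = rejection_threshold(T)
    return [-2 if c <= thr else y for y, c in zip(labels, confidences)]


labels = [0, 1, 1]
confidences = [0.9, 0.5, 0.2]
low_T = apply_rejection(labels, confidences, T=1)   # threshold is about 0.264
high_T = apply_rejection(labels, confidences, T=2)  # threshold is about 0.729
print(low_T, high_T)
```

Raising T moves the threshold toward 1, so more predictions are rejected, which is consistent with the parameter description.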
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
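The <component>__<parameter> naming convention can be illustrated with a short sketch that separates top-level parameters from nested ones. The helper name and the example parameter names are hypothetical. Real estimators perform additional validation that is not shown here.

```python
def split_nested_params(params):
    """Split a flat params dict into own parameters and per-component ones.

    Keys containing a double underscore are treated as
    <component>__<parameter>; everything else is a top-level parameter.
    """
    own, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params({
    "contamination": 0.05,
    "base_estimator__n_neighbors": 20,
    "base_estimator__method": "fast",
})
print(own)
print(nested)
```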
pyod.models.devnet module¶
Deep anomaly detection with deviation networks. Parts of the code are adapted from https://github.com/GuansongPang/deviation-network
- class pyod.models.devnet.DevNet(network_depth=2, batch_size=512, epochs=50, nb_batch=20, known_outliers=30, cont_rate=0.02, data_format=0, random_seed=42, device=None, contamination=0.1)[source]¶
Bases: BaseDetector
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly scores of X using the fitted detector.
The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')[source]¶
Fit the detector with labels, predict on samples, and evaluate the model by predefined metrics.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The labels or target values corresponding to X.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.dif module¶
Deep Isolation Forest for Anomaly Detection (DIF)
- class pyod.models.dif.DIF(batch_size=1000, representation_dim=20, hidden_neurons=None, hidden_activation='tanh', skip_connection=False, n_ensemble=50, n_estimators=6, max_samples=256, contamination=0.1, random_state=None, device=None)[source]¶
Bases: BaseDetector
Deep Isolation Forest (DIF) is an extension of iForest. It uses a deep representation ensemble to achieve non-linear isolation on the original data space. See [BXPWW23] for details.
Parameters¶
- batch_sizeint, optional (default=1000)
Number of samples per gradient update.
- representation_dim, int, optional (default=20)
Dimensionality of the representation space.
- hidden_neurons, list, optional (default=[64, 32])
The number of neurons per hidden layer. The network thus has the structure [n_features, hidden_neurons[0], hidden_neurons[1], …, representation_dim].
- hidden_activation, str, optional (default=’tanh’)
Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://pytorch.org/docs/stable/nn.html for details. Currently only ‘relu’ (nn.ReLU()), ‘sigmoid’ (nn.Sigmoid()), and ‘tanh’ (nn.Tanh()) are supported. See pyod/utils/torch_utility.py for details.
- skip_connection, boolean, optional (default=False)
If True, apply skip-connection in the neural network structure.
- n_ensemble, int, optional (default=50)
The number of deep representation ensemble members.
- n_estimators, int, optional (default=6)
The number of isolation forests for each representation.
- max_samples, int, optional (default=256)
The number of samples to draw from X to train each base isolation tree.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- random_stateint or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- device, ‘cuda’, ‘cpu’, or None, optional (default=None)
If ‘cuda’, use GPU acceleration in torch; if ‘cpu’, use the CPU in torch; if None, automatically determine whether a GPU is available.
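The network structure described for hidden_neurons and the interplay of n_ensemble and n_estimators can be made concrete with a small sketch. The helper names are hypothetical, the [64, 32] default is taken from the parameter description above, and no torch code is involved. It only computes layer sizes and counts.

```python
def layer_sizes(n_features, hidden_neurons=(64, 32), representation_dim=20):
    """Layer widths in the order
    [n_features, hidden_neurons[0], ..., representation_dim]."""
    return [n_features, *hidden_neurons, representation_dim]


def layer_shapes(sizes):
    """(in_features, out_features) pairs for consecutive layers."""
    return list(zip(sizes[:-1], sizes[1:]))


def total_forests(n_ensemble=50, n_estimators=6):
    """One group of n_estimators isolation forests per ensemble member."""
    return n_ensemble * n_estimators


sizes = layer_sizes(n_features=10)
shapes = layer_shapes(sizes)
forests = total_forests()
print(sizes, shapes, forests)
```

With the documented defaults, 50 representations with 6 isolation forests each give 300 forests in total.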
Attributes¶
- net_lstlist of torch.Module
The list of representation neural networks.
- iForest_lstlist of iForest
The list of instantiated iForest models.
- x_reduced_lst: list of numpy array
The list of training data representations
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
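One way to picture a "parameter names mapped to their values" result is to read the constructor signature and look up the matching attributes, which is the general approach scikit-learn style estimators follow. The class below is a hypothetical stand-in written for illustration only. It is not a pyod detector and ignores the deep option.

```python
import inspect


class TinyDetector:
    """Hypothetical estimator-like class used only for this illustration."""

    def __init__(self, contamination=0.1, n_neighbors=5):
        # Each constructor argument is stored under the same attribute name,
        # which is what makes signature-based introspection possible.
        self.contamination = contamination
        self.n_neighbors = n_neighbors

    def get_params(self):
        """Map each constructor argument name to its current value."""
        names = inspect.signature(self.__init__).parameters
        return {name: getattr(self, name) for name in names}


params = TinyDetector(n_neighbors=7).get_params()
print(params)
```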
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.ecod module¶
Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions (ECOD)
- class pyod.models.ecod.ECOD(contamination=0.1, n_jobs=1)[source]¶
Bases: BaseDetector
ECOD class for Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions (ECOD). ECOD is a parameter-free, highly interpretable outlier detection algorithm based on empirical CDF functions. See [] for details.
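To give a feel for scoring based on empirical CDFs, here is a deliberately simplified sketch. It is not the ECOD algorithm as implemented in pyod, which may aggregate tail probabilities differently. It only assumes that, per feature, a point is more outlying the smaller its left or right empirical tail probability is, and that per-feature scores are summed. The function name is hypothetical.

```python
import math


def ecdf_tail_scores(X):
    """Sum over features of the larger negative log tail probability.

    For each feature, the left tail is the fraction of samples <= value and
    the right tail is the fraction of samples >= value (both are > 0 because
    the sample itself is counted).
    """
    n = len(X)
    n_features = len(X[0])
    scores = [0.0] * n
    for j in range(n_features):
        column = [row[j] for row in X]
        for i, v in enumerate(column):
            left = sum(1 for u in column if u <= v) / n
            right = sum(1 for u in column if u >= v) / n
            scores[i] += max(-math.log(left), -math.log(right))
    return scores


X = [
    [0.0, 0.1],
    [0.1, 0.0],
    [-0.1, 0.05],
    [0.05, -0.1],
    [10.0, 10.0],  # extreme in both features
]
scores = ecdf_tail_scores(X)
most_outlying = max(range(len(X)), key=scores.__getitem__)
print(scores, most_outlying)
```

Because the score is built from ranks, it needs no tuning parameters, and each feature's contribution can be inspected separately.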
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_jobsoptional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound on the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default=[1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, prints the expected rejection rate, the upper bound on the rejection rate, and the upper bound on the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- explain_outlier(ind, columns=None, cutoffs=None, feature_names=None, file_name=None, file_type=None)[source]¶
Plot dimensional outlier graph for a given data point within the dataset.
Parameters¶
- indint
The index of the data point one wishes to obtain a dimensional outlier graph for.
- columnslist
Specify a list of features/dimensions for plotting. If not specified, use all features.
- cutoffslist of floats in (0., 1), optional (default=[0.95, 0.99])
The significance cutoff bands of the dimensional outlier graph.
- feature_nameslist of strings
The display names of all columns of the dataset, to show on the x-axis of the plot.
- file_namestring
The name to save the figure
- file_typestring
The file type to save the figure
Returns¶
- Plotmatplotlib plot
The dimensional outlier graph for data point with index ind.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- return_stats: bool, optional (default = False)
If true, it also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
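To make the role of T concrete, here is a minimal sketch of the threshold formula 1 - 2exp(-T) and of the -2 rejection label. The helper names are hypothetical, and the rule "reject when the confidence is at or below the threshold" is an assumption of this sketch rather than a statement about PyOD's internals.

```python
import math


def rejection_threshold(T=32):
    """Rejection threshold 1 - 2*exp(-T), as given in the parameter description."""
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, confidences, T=32):
    """Replace low-confidence labels with -2.

    Assumption of this sketch: a prediction is rejected when its
    confidence is at or below the threshold.
    """
    thr = rejection_threshold(T)
    return [lab if conf > thr else -2 for lab, conf in zip(labels, confidences)]


# Hypothetical predictions and confidence values.
labels = [0, 1, 0, 1]
confidences = [1.0, 0.95, 0.40, 1.0]
lenient = apply_rejection(labels, confidences, T=1)   # low threshold, nothing rejected
strict = apply_rejection(labels, confidences, T=32)   # threshold just below 1
```

With the default T=32 the threshold sits extremely close to 1, so only predictions with essentially full confidence are kept.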
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.feature_bagging module¶
Feature bagging detector
- class pyod.models.feature_bagging.FeatureBagging(base_estimator=None, n_estimators=10, contamination=0.1, max_features=1.0, bootstrap_features=False, check_detector=True, check_estimator=False, n_jobs=1, random_state=None, combination='average', verbose=0, estimator_params=None)[source]¶
Bases:
BaseDetector
A feature bagging detector is a meta estimator that fits a number of base detectors on various sub-samples of the dataset and uses averaging or other combination methods to improve the predictive accuracy and control over-fitting.
The sub-sample size is always the same as the original input sample size, but the number of features drawn is chosen at random between half of the features and all features.
By default, LOF is used as the base estimator. However, any estimator could be used as the base estimator, such as kNN and ABOD.
Feature bagging first constructs n subsamples by randomly selecting a subset of features, which induces diversity among the base estimators.
Finally, the prediction score is generated by averaging/taking the maximum of all base detectors. See [BLK05] for details.
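The procedure described above can be sketched in plain Python. This is a simplified illustration, not the PyOD implementation: the base detector is a stand-in (distance to the column-wise mean instead of LOF), and the function names and toy data are hypothetical.

```python
import random


def centroid_distance_scores(rows):
    """Stand-in base detector: Euclidean distance to the column-wise mean."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - mean[j]) ** 2 for j in range(d)) ** 0.5 for r in rows]


def feature_bagging_scores(X, n_estimators=10, combination="average", seed=0):
    """Score every sample with detectors fitted on random feature subsets."""
    rng = random.Random(seed)
    d = len(X[0])
    per_estimator = []
    for _ in range(n_estimators):
        # Draw between d/2 and d features without replacement; keep all rows.
        k = rng.randint(max(1, d // 2), d)
        cols = rng.sample(range(d), k)
        subset = [[row[j] for j in cols] for row in X]
        per_estimator.append(centroid_distance_scores(subset))
    # Regroup so that each tuple holds all estimators' scores for one sample.
    by_sample = list(zip(*per_estimator))
    if combination == "max":
        return [max(s) for s in by_sample]
    return [sum(s) / len(s) for s in by_sample]


# Four clustered points and one obvious outlier (last row).
X = [[0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.2, 0.1], [0.2, 0.1, 0.1, 0.0],
     [0.0, 0.2, 0.1, 0.1], [5.0, 5.0, 5.0, 5.0]]
avg_scores = feature_bagging_scores(X, combination="average")
max_scores = feature_bagging_scores(X, combination="max")
```

Because both calls share the same seed, they use the same feature subsets and differ only in how the per-estimator scores are combined.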
Parameters¶
- base_estimatorobject or None, optional (default=None)
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a LOF detector.
- n_estimatorsint, optional (default=10)
The number of base estimators in the ensemble.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- max_featuresint or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.
- bootstrap_featuresbool, optional (default=False)
Whether features are drawn with replacement.
- check_detectorbool, optional (default=True)
If set to True, check whether the base estimator is consistent with pyod standard.
- check_estimatorbool, optional (default=False)
If set to True, check whether the base estimator is consistent with sklearn standard.
Deprecated since version 0.6.9: check_estimator will be removed in pyod 0.8.0; it will be replaced by check_detector.
- n_jobsoptional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
- random_stateint, RandomState or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- combinationstr, optional (default=’average’)
The method of combination:
if ‘average’: take the average of all detectors
if ‘max’: take the maximum scores of all detectors
- verboseint, optional (default=0)
Controls the verbosity of the building process.
- estimator_paramsdict, optional (default=None)
The list of attributes to use as parameters when instantiating a new base estimator. If none are given, default parameters are used.
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
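The relationship between contamination, threshold_ and labels_ can be illustrated with a short sketch. It is an approximation for intuition only: the helper name is hypothetical, and taking the cut-off as the score of the last sample among the n_samples * contamination most abnormal ones is an assumption; the library may use a percentile-based rule that differs in detail.

```python
def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a cut-off and binary labels from outlier scores."""
    n_outliers = max(1, int(round(len(decision_scores) * contamination)))
    ranked = sorted(decision_scores, reverse=True)
    # Assumption: the cut-off is the score of the last sample that still
    # counts among the n_samples * contamination most abnormal ones.
    threshold = ranked[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in decision_scores]
    return threshold, labels


# Hypothetical training scores: nine ordinary samples and one extreme one.
scores = [0.1, 0.3, 0.2, 0.4, 0.2, 0.1, 0.3, 0.2, 0.1, 9.0]
threshold, labels = threshold_and_labels(scores, contamination=0.1)
```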
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels used to compute the score (0 for inliers, 1 for outliers).
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
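For readers unfamiliar with the ‘prc_n_score’ metric, the sketch below shows one common way to compute precision @ rank n, assuming y holds ground-truth labels (0 for inliers, 1 for outliers) and n is the number of true outliers. The function name is hypothetical and the library's own routine may differ in details such as tie handling.

```python
def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n highest-scoring samples,
    where n is the number of true outliers in y_true."""
    n = sum(y_true)
    if n == 0:
        raise ValueError("y_true contains no outliers")
    # Indices of the n samples with the largest outlier scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n


# Hypothetical labels and outlier scores: two true outliers, one ranked in the top two.
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.7, 0.9, 0.2, 0.3, 0.1]
p_at_n = precision_at_rank_n(y_true, scores)
```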
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- return_stats: bool, optional (default = False)
If true, it also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.gmm module¶
Outlier detection based on Gaussian Mixture Model (GMM).
- class pyod.models.gmm.GMM(n_components=1, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, contamination=0.1)[source]¶
Bases:
BaseDetector
Wrapper of scikit-learn Gaussian Mixture Model with more functionalities. Unsupervised Outlier Detection.
See [BAgg15] Chapter 2 for details.
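As a rough intuition for how a mixture model can yield outlier scores, the sketch below fits a single one-dimensional Gaussian (the n_components=1 case) and scores points by their negative log-density. Treating the negative log-likelihood as the outlier score is an assumption of this sketch, and the function name and data are hypothetical; the wrapper itself delegates fitting to scikit-learn.

```python
import math
import statistics


def gaussian_outlier_scores(train, query):
    """Negative log-density of each query value under a Gaussian fitted to train."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    scores = []
    for x in query:
        # Log of the normal density, written out term by term.
        log_pdf = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))
        # Low density means high outlier score.
        scores.append(-log_pdf)
    return scores


# Hypothetical training values around 10 and two query points.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
scores = gaussian_outlier_scores(train, [10.0, 13.0])
```

The query point far from the training mass receives the larger score, in line with the convention that higher scores are more abnormal.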
Parameters¶
- n_componentsint, default=1
The number of mixture components.
- covariance_type{‘full’, ‘tied’, ‘diag’, ‘spherical’}, default=’full’
String describing the type of covariance parameters to use.
- tolfloat, default=1e-3
The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.
- reg_covarfloat, default=1e-6
Non-negative regularization added to the diagonal of covariance. Allows to assure that the covariance matrices are all positive.
- max_iterint, default=100
The number of EM iterations to perform.
- n_initint, default=1
The number of initializations to perform. The best results are kept.
- init_params{‘kmeans’, ‘random’}, default=’kmeans’
The method used to initialize the weights, the means and the precisions.
- weights_initarray-like of shape (n_components, ), default=None
The user-provided initial weights. If it is None, weights are initialized using the init_params method.
- means_initarray-like of shape (n_components, n_features), default=None
The user-provided initial means. If it is None, means are initialized using the init_params method.
- precisions_initarray-like, default=None
The user-provided initial precisions (inverse of the covariance matrices). If it is None, precisions are initialized using the ‘init_params’ method.
- random_stateint, RandomState instance or None, default=None
Controls the random seed given to the method chosen to initialize the parameters.
- warm_startbool, default=False
If ‘warm_start’ is True, the solution of the last fitting is used as initialization for the next call of fit().
- verboseint, default=0
Enable verbose output.
- verbose_intervalint, default=10
Number of iteration done before the next print.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set.
Attributes¶
- weights_array-like of shape (n_components,)
The weights of each mixture component.
- means_array-like of shape (n_components, n_features)
The mean of each mixture component.
- covariances_array-like
The covariance of each mixture component.
- precisions_array-like
The precision matrices for each component in the mixture.
- precisions_cholesky_array-like
The cholesky decomposition of the precision matrices of each mixture component.
- converged_bool
True when convergence was reached in fit(), False otherwise.
- n_iter_int
Number of steps used by the best fit of EM to reach convergence.
- lower_bound_float
Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- sample_weightarray-like, shape (n_samples,)
Per-sample weights. Rescale C per sample. Higher weights force the classifier to put more emphasis on these points.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels used to compute the score (0 for inliers, 1 for outliers).
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- property precisions_¶
The precision matrices for each component in the mixture. Decorator for scikit-learn Gaussian Mixture Model attributes.
- property precisions_cholesky_¶
The Cholesky decomposition of the precision matrices of each mixture component. Decorator for scikit-learn Gaussian Mixture Model attributes.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- return_stats: bool, optional (default = False)
If true, it also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
pyod.models.hbos module¶
Histogram-based Outlier Detection (HBOS)
- class pyod.models.hbos.HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)[source]¶
Bases:
BaseDetector
Histogram-based outlier detection (HBOS) is an efficient unsupervised method. It assumes feature independence and calculates the degree of outlyingness by building histograms. See [BGD12] for details.
Two versions of HBOS are supported:
Static number of bins: uses a static number of bins for all features.
Automatic number of bins: every feature uses a number of bins deemed to be optimal according to the Birgé-Rozenholc method ([BBirgeR06]).
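The static-bin variant can be sketched as follows. This is a simplified illustration under stated assumptions, not the PyOD implementation: it uses relative bin frequencies rather than normalized densities, adds alpha inside the logarithm as a regularizer, and the function name and toy data are hypothetical.

```python
import math


def hbos_scores(X, n_bins=10, alpha=0.1):
    """Sum over features of -log(relative bin frequency + alpha)."""
    n, d = len(X), len(X[0])
    scores = [0.0] * n
    for j in range(d):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features
        counts = [0] * n_bins
        bins = []
        for v in col:
            b = min(int((v - lo) / width), n_bins - 1)  # put the maximum in the last bin
            bins.append(b)
            counts[b] += 1
        for i, b in enumerate(bins):
            # Features are treated as independent, so per-feature terms add up.
            scores[i] += -math.log(counts[b] / n + alpha)
    return scores


# Five clustered points and one point far away in both features (last row).
X = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2], [1.2, 2.0], [8.0, 9.0]]
scores = hbos_scores(X, n_bins=5)
```

Samples that fall into sparsely populated bins accumulate larger terms, so the isolated point ends up with the highest score.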
Parameters¶
- n_binsint or string, optional (default=10)
The number of bins. “auto” uses the Birgé-Rozenholc method for automatic selection of the optimal number of bins for each feature.
- alphafloat in (0, 1), optional (default=0.1)
The regularizer for preventing overflow.
- tolfloat in (0, 1), optional (default=0.5)
The parameter that controls the flexibility when dealing with samples falling outside the bins.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
Attributes¶
- bin_edges_numpy array of shape (n_bins + 1, n_features )
The edges of the bins.
- hist_numpy array of shape (n_bins, n_features)
The density of each histogram.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more predictions are rejected.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels used to compute the score (0 for inliers, 1 for outliers).
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
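To make the role of T concrete, the sketch below computes the rejection threshold 1-2exp(-T) and applies it to a list of labels. The per-sample `stability` values (in [0, 1], where 1 means a fully stable prediction) and the helper names are assumptions made for illustration; how pyod derives such a value internally is not shown here.

```python
import math


def rejection_threshold(T):
    """Threshold from the text: 1 - 2 * exp(-T)."""
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, stability, T=32):
    """Replace a label by -2 when its (assumed) stability is at or below the threshold."""
    thr = rejection_threshold(T)
    return [-2 if s <= thr else y for y, s in zip(labels, stability)]


labels = [0, 1, 0, 1]
stability = [1.0, 0.9, 0.2, 1.0]  # hypothetical values, not pyod output
lenient = apply_rejection(labels, stability, T=1)   # threshold is about 0.264
strict = apply_rejection(labels, stability, T=32)   # threshold is just below 1
print(lenient, strict)
```

With a small T only the least stable prediction is rejected; with the default T=32 everything short of full stability is rejected, which matches the statement that a higher T produces more rejections.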
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
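The <component>__<parameter> naming can be illustrated by splitting keys on the first double underscore. This is a simplified sketch of the convention only; `group_nested_params` and the component name `detector` are hypothetical and do not appear in pyod or scikit-learn.

```python
def group_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' style keys."""
    top_level, nested = {}, {}
    for key, value in params.items():
        # Split on the first '__' only, so deeper nesting stays in sub_name.
        name, delim, sub_name = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_name] = value
        else:
            top_level[name] = value
    return top_level, nested


top, nested = group_nested_params({
    "contamination": 0.05,
    "detector__n_neighbors": 10,   # 'detector' is a made-up component name
    "detector__method": "mean",
})
print(top, nested)
```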
pyod.models.iforest module¶
IsolationForest Outlier Detector. Implemented on scikit-learn library.
- class pyod.models.iforest.IForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=1, behaviour='old', random_state=None, verbose=0)[source]¶
Bases:
BaseDetector
Wrapper of scikit-learn Isolation Forest with more functionalities.
The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. See [BLTZ08, BLTZ12] for details.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
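The link between isolation and path length can be seen in a toy one-dimensional sketch. It is a deliberate simplification: there is no subsampling or feature selection, and the helpers `isolation_depth` and `mean_depth` are illustrative names, not part of pyod or scikit-learn.

```python
import random


def isolation_depth(values, target, rng):
    """Number of random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # only duplicates left, cannot split further
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(values, target, n_trees=200, seed=0):
    """Average the depth over many random partitionings (a tiny 'forest')."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees


cluster = [i / 100 for i in range(100)]  # dense inliers in [0, 1)
data = cluster + [10.0]                  # one isolated point
outlier_depth = mean_depth(data, 10.0)
inlier_depth = mean_depth(data, 0.5)
print(outlier_depth, inlier_depth)
```

The isolated point is usually separated by the first split, while a point inside the dense cluster needs many more, so its average depth is clearly larger.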
Parameters¶
- n_estimatorsint, optional (default=100)
The number of base estimators in the ensemble.
- max_samplesint or float, optional (default=”auto”)
The number of samples to draw from X to train each base estimator.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
If “auto”, then max_samples=min(256, n_samples).
If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- max_featuresint or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.
- bootstrapbool, optional (default=False)
If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
- n_jobsinteger, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
- behaviourstr, default=’old’
Behaviour of the decision_function, which can be either ‘old’ or ‘new’. Passing behaviour='new' makes the decision_function change to match the API of other anomaly detection algorithms, which will be the default behaviour in the future. As explained in detail in the offset_ attribute documentation, the decision_function becomes dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers.
Added in version 0.7.0: behaviour is added in 0.7.0 for back-compatibility purposes.
Deprecated since version 0.20: behaviour='old' is deprecated in sklearn 0.20 and will not be possible in 0.22.
Deprecated since version 0.22: the behaviour parameter will be deprecated in sklearn 0.22 and removed in 0.24.
Warning
Only applicable for sklearn 0.20 and above.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- verboseint, optional (default=0)
Controls the verbosity of the tree building process.
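The rules listed for max_samples above (int, float, or "auto") can be summarised in a small sketch. `resolve_max_samples` is a hypothetical helper written for this illustration; the underlying scikit-learn code may validate its input more strictly.

```python
def resolve_max_samples(max_samples, n_samples, auto_cap=256):
    """Turn the max_samples argument into an actual number of samples per tree."""
    if max_samples == "auto":
        return min(auto_cap, n_samples)
    if isinstance(max_samples, int):
        # Asking for more samples than available means using all of them.
        return min(max_samples, n_samples)
    if isinstance(max_samples, float):
        # A float is read as a fraction of the training set size.
        return int(max_samples * n_samples)
    raise TypeError("max_samples must be an int, a float, or 'auto'")


print(resolve_max_samples("auto", 1000))  # capped at 256
print(resolve_max_samples(0.5, 1000))     # half of the data
```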
Attributes¶
- estimators_list of ExtraTreeRegressor
The collection of fitted sub-estimators.
- estimators_samples_list of arrays
The subset of drawn samples (i.e., the in-bag samples) for each base estimator.
- max_samples_integer
The actual number of samples used to train each base estimator.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is set so that roughly the n_samples * contamination most abnormal samples in decision_scores_ lie above it. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
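The relation between contamination, threshold_ and labels_ can be sketched as follows. This is a simplified rank-based version under the assumption that exactly the n_samples * contamination highest scores are flagged; pyod's own rule may handle ties and interpolation differently, and `threshold_and_labels` is a hypothetical helper.

```python
def threshold_and_labels(scores, contamination=0.1):
    """Flag the n_samples * contamination highest scores as outliers."""
    n_outliers = max(1, int(len(scores) * contamination))
    ranked = sorted(scores, reverse=True)
    # Score of the least abnormal sample that is still flagged.
    threshold = ranked[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels


scores = [0.1, 0.3, 0.2, 0.9, 0.25, 0.15, 0.4, 0.35, 0.05, 0.12]
threshold, labels = threshold_and_labels(scores, contamination=0.1)
threshold_20, labels_20 = threshold_and_labels(scores, contamination=0.2)
print(threshold, labels)
```

Raising contamination lowers the threshold and flags more training samples.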
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate, an upper bound of the rejection rate, and an upper bound on the cost.
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
- verbose: bool, optional (default = False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- property feature_importances_¶
The impurity-based feature importance. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html as an alternative.
Returns¶
- feature_importances_ndarray of shape (n_features,)
The values of this array sum to 1, unless all trees are single node trees consisting of only the root node, in which case it will be an array of zeros.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
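The two evaluation metrics named above can be sketched in plain Python. Both functions are illustrative re-implementations under stated assumptions (labels are 0/1, n defaults to the number of true outliers); they are not pyod's own scoring utilities, whose handling of ties may differ.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random inlier."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y_true)  # assumption: n defaults to the number of true outliers
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:n]) / n


y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = roc_auc(y_true, scores)
p_at_n = precision_at_n(y_true, scores)
print(auc, p_at_n)
```

Here one outlier is ranked first but the other falls below two inliers, so both metrics stay well under 1.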
- property max_samples_¶
The actual number of samples. Decorator for scikit-learn Isolation Forest attributes.
- property n_features_in_¶
The number of features seen during the fit. Decorator for scikit-learn Isolation Forest attributes.
- property offset_¶
Offset used to define the decision function from the raw scores. Decorator for scikit-learn Isolation Forest attributes.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
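One way to picture such a confidence value is sketched below, loosely following the idea in [BPVD20]: estimate how likely a score is to exceed a random training score, then ask how probable it is that the sample would still rank among the flagged fraction. The formulas and the helper names are assumptions made for illustration and are not guaranteed to match pyod's computation.

```python
import math


def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


def outlier_probability(train_scores, x, contamination=0.1):
    """Assumed estimate of how likely x is to be flagged under resampled training sets."""
    n = len(train_scores)
    n_below = sum(1 for s in train_scores if s <= x)
    # Smoothed estimate of P(a random training score <= x).
    p = (1 + n_below) / (2 + n)
    # x is flagged when more than n - n*contamination training scores lie below it.
    k = n - int(n * contamination)
    return binom_sf(k, n, p)


def prediction_confidence(train_scores, x, threshold, contamination=0.1):
    """Confidence in the predicted label: complement the value for predicted inliers."""
    p_out = outlier_probability(train_scores, x, contamination)
    return p_out if x > threshold else 1.0 - p_out


train = [float(i) for i in range(1, 21)]
high = outlier_probability(train, 100.0)
low = outlier_probability(train, 0.0)
print(high, low)
```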
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
pyod.models.inne module¶
Isolation-based anomaly detection using nearest-neighbor ensembles. Parts of the code are adapted from https://github.com/xhan97/inne
- class pyod.models.inne.INNE(n_estimators=200, max_samples='auto', contamination=0.1, random_state=None)[source]¶
Bases:
BaseDetector
Isolation-based anomaly detection using nearest-neighbor ensembles.
The INNE algorithm uses the nearest neighbour ensemble to isolate anomalies. It partitions the data space into regions using a subsample and determines an isolation score for each region. As each region adapts to local distribution, the calculated isolation score is a local measure that is relative to the local neighbourhood, enabling it to detect both global and local anomalies. INNE has linear time complexity to efficiently handle large and high-dimensional datasets with complex distributions.
See [BBTA+18] for details.
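The region-based isolation score described above can be sketched for a single subsample. The sketch assumes each subsampled point defines a hypersphere whose radius is the distance to its nearest neighbour in the subsample, and that the score compares that radius with the radius of the neighbour's hypersphere; `inne_score` is an illustrative helper, and the full detector would average such scores over many subsamples.

```python
import math


def inne_score(x, centers):
    """Isolation score of x with respect to one subsample of centers."""
    radii, nearest = [], []
    for i, c in enumerate(centers):
        # Radius = distance to the nearest other center in the subsample.
        d, j = min((math.dist(c, o), j) for j, o in enumerate(centers) if j != i)
        radii.append(d)
        nearest.append(j)
    covering = [i for i, c in enumerate(centers) if math.dist(x, c) <= radii[i]]
    if not covering:
        return 1.0  # outside every hypersphere: maximally isolated
    i = min(covering, key=lambda k: radii[k])  # smallest covering hypersphere
    # Ratio of the neighbour's radius to this radius makes the score local.
    return 1.0 - radii[nearest[i]] / radii[i]


centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
dense = inne_score((0.2, 0.1), centers)
sparse = inne_score((4.0, 4.0), centers)
far = inne_score((20.0, 20.0), centers)
print(dense, sparse, far)
```

A point in the dense corner scores 0, a point covered only by the large, sparse hypersphere scores close to 1, and a point outside all hyperspheres scores exactly 1.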
Parameters¶
- n_estimatorsint, default=200
The number of base estimators in the ensemble.
- max_samplesint or float, optional (default=”auto”)
The number of samples to draw from X to train each base estimator.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
If “auto”, then max_samples=min(8, n_samples).
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Attributes¶
- max_samples_integer
The actual number of samples used to train each base estimator.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is set so that roughly the n_samples * contamination most abnormal samples in decision_scores_ lie above it. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate, an upper bound of the rejection rate, and an upper bound on the cost.
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
- verbose: bool, optional (default = False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float
Only if return_stats is True.
- upperbound_rejection_rate: float
Only if return_stats is True.
- upperbound_cost: float
Only if return_stats is True.
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.kde module¶
Kernel Density Estimation (KDE) for Unsupervised Outlier Detection.
- class pyod.models.kde.KDE(contamination=0.1, bandwidth=1.0, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None)[source]¶
Bases:
BaseDetector
KDE class for outlier detection.
For an observation, its negative log probability density could be viewed as the outlying score.
See [BLLP07] for details.
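The scoring idea can be illustrated with a one-dimensional Gaussian kernel. This is a minimal sketch under the assumption of a Gaussian kernel and scalar inputs; `neg_log_density` is a hypothetical helper, whereas the detector itself relies on scikit-learn's tree-based estimator.

```python
import math


def neg_log_density(x, train, bandwidth=1.0):
    """Negative log of a Gaussian kernel density estimate at x (1-D)."""
    # Log of each kernel term, combined with log-sum-exp for numerical stability.
    log_terms = [-0.5 * ((x - t) / bandwidth) ** 2 for t in train]
    m = max(log_terms)
    log_sum = m + math.log(sum(math.exp(lt - m) for lt in log_terms))
    log_norm = math.log(len(train) * bandwidth * math.sqrt(2 * math.pi))
    return -(log_sum - log_norm)


train = [-0.5, 0.0, 0.3, 0.6]
near = neg_log_density(0.1, train)
far = neg_log_density(8.0, train)
print(near, far)
```

A point far from the training data has a low estimated density and therefore a high outlying score.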
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- bandwidthfloat, optional (default=1.0)
The bandwidth of the kernel.
- algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’}, optional
Algorithm used to compute the kernel density estimator:
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
- leaf_sizeint, optional (default = 30)
Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- metricstring or callable, default ‘minkowski’
metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
Distance matrices are not supported.
Valid values for metric are:
from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for scipy.spatial.distance for details on these metrics.
- metric_paramsdict, optional (default = None)
Additional keyword arguments for the metric function.
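As described for the metric parameter above, a callable metric takes two rows and returns a single distance. The function below is a made-up example of that shape; `weighted_manhattan` and its `weights` argument are assumptions for illustration only.

```python
def weighted_manhattan(a, b, weights=(1.0, 2.0)):
    """Distance between two rows: weighted sum of absolute coordinate differences."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


d = weighted_manhattan((0.0, 0.0), (3.0, 4.0))
print(d)
```

Because such a function is called once per pair of rows, it is slower than passing a built-in metric name as a string.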
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is set so that roughly the n_samples * contamination most abnormal samples in decision_scores_ lie above it. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate, an upper bound of the rejection rate, and an upper bound on the cost.
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r). A negative c_r (the signature default is -1) is replaced by the contamination rate.
- verbose: bool, optional (default = False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float
The expected rejection rate.
- upperbound_rejection_rate: float
The upper bound for the rejection rate, satisfied with probability 1-delta.
- upperbound_cost: float
The upper bound for the cost.
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
- Predict if a particular sample is an outlier or not,
allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
It allows to set the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If true, it returns also three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate: float, if return_stats is True
upperbound_rejection_rate: float, if return_stats is True
upperbound_cost: float, if return_stats is True
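The rejection rule can be sketched as below. The helper apply_rejection is hypothetical, and the sketch assumes that a prediction is rejected when its confidence is less than or equal to the threshold 1-2exp(-T); the confidence values themselves are taken as given.

```python
import math

REJECT = -2  # label used for rejected predictions, as stated above


def apply_rejection(labels, confidences, T=32):
    """Replace low-confidence labels with REJECT.

    Assumption: a prediction is rejected when its confidence is less than
    or equal to the threshold 1 - 2*exp(-T).
    """
    threshold = 1.0 - 2.0 * math.exp(-T)
    return [REJECT if c <= threshold else y
            for y, c in zip(labels, confidences)]


# With T=1 the threshold is 1 - 2/e, roughly 0.264.
result = apply_rejection([0, 1, 1], [0.9, 0.2, 0.5], T=1)
```

Because the threshold approaches 1 as T grows, larger values of T reject more predictions, which matches the description of T above.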
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
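The <component>__<parameter> naming can be illustrated with a small sketch. The function split_nested_params and the component name "detector" are hypothetical; the sketch only shows how such keys can be split on the first double underscore and is not the library's implementation.

```python
def split_nested_params(params):
    """Group {'component__param': value} keys by component.

    Keys without '__' are treated as parameters of the top-level object
    and are stored under the empty-string component name.
    """
    grouped = {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault("", {})[key] = value
    return grouped


grouped = split_nested_params(
    {"contamination": 0.05, "detector__n_neighbors": 10}
)
```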
pyod.models.knn module¶
k-Nearest Neighbors Detector (kNN)
- class pyod.models.knn.KNN(contamination=0.1, n_neighbors=5, method='largest', radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, **kwargs)[source]¶
Bases:
BaseDetector
kNN class for outlier detection. For an observation, its distance to its kth nearest neighbor could be viewed as the outlying score. It could be viewed as a way to measure the density. See [BAP02, BRRS00] for details.
Three kNN detectors are supported:
largest: use the distance to the kth neighbor as the outlier score
mean: use the average distance to all k neighbors as the outlier score
median: use the median of the distances to the k neighbors as the outlier score
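The three scoring variants can be sketched with a brute-force search. The function knn_outlier_score is hypothetical and uses Euclidean distance only; it assumes the query point is not itself part of the training points.

```python
import math
import statistics


def knn_outlier_score(point, train_points, k=5, method="largest"):
    """Score `point` by its distances to its k nearest training points.

    Brute-force sketch; assumes `point` itself is not in `train_points`.
    """
    dists = sorted(math.dist(point, q) for q in train_points)[:k]
    if method == "largest":
        return dists[-1]  # distance to the kth nearest neighbor
    if method == "mean":
        return statistics.fmean(dists)
    if method == "median":
        return statistics.median(dists)
    raise ValueError(f"unknown method: {method!r}")


train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
scores = {m: knn_outlier_score((0.0, 1.0), train, k=3, method=m)
          for m in ("largest", "mean", "median")}
```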
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_neighborsint, optional (default = 5)
Number of neighbors to use by default for k neighbors queries.
- methodstr, optional (default=’largest’)
{‘largest’, ‘mean’, ‘median’}
‘largest’: use the distance to the kth neighbor as the outlier score
‘mean’: use the average distance to all k neighbors as the outlier score
‘median’: use the median of the distances to the k neighbors as the outlier score
- radiusfloat, optional (default = 1.0)
Range of parameter space to use by default for radius_neighbors queries.
- algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
Algorithm used to compute the nearest neighbors:
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
Deprecated since version 0.7.4: algorithm is deprecated in PyOD 0.7.4 and will no longer be supported in 0.7.6. BallTree has to be used for consistency.
- leaf_sizeint, optional (default = 30)
Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- metricstring or callable, default ‘minkowski’
metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
Distance matrices are not supported.
Valid values for metric are:
from scikit-learn: [‘cityblock’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for scipy.spatial.distance for details on these metrics.
- pinteger, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances
- metric_paramsdict, optional (default = None)
Additional keyword arguments for the metric function.
- n_jobsint, optional (default = 1)
The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.
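The role of p in the Minkowski metric can be checked with a short sketch. The function minkowski_distance below is a hypothetical plain-Python helper, not the scikit-learn routine.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski_distance(a, b, p=1)  # Manhattan (l1) distance
d2 = minkowski_distance(a, b, p=2)  # Euclidean (l2) distance
```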
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
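How a contamination value can turn scores into a threshold and binary labels is sketched below. Both helpers are hypothetical, and the sketch assumes that the threshold is the (1 - contamination) percentile of the training scores and that a sample is labelled 1 when its score is strictly above that threshold.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def threshold_and_labels(scores, contamination=0.1):
    """Assumption: the threshold is the (1 - contamination) percentile of
    the scores, and a sample is labelled 1 when strictly above it."""
    threshold = percentile(scores, 100.0 * (1.0 - contamination))
    labels = [1 if s > threshold else 0 for s in scores]
    return threshold, labels


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
threshold, labels = threshold_and_labels(scores, contamination=0.1)
```

With ten scores and contamination=0.1, only the single most abnormal sample ends up above the threshold.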
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate: float, the expected rejection rate
upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.
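The "Precision @ rank n" metric can be sketched as follows. The function precision_at_rank_n is a hypothetical, simplified helper: it assumes n defaults to the number of true outliers and breaks ties by input order, which may differ from the library's own rule.

```python
def precision_at_rank_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples.

    Assumption: when n is None it defaults to the number of true outliers.
    Ties are broken by input order, which may differ from a library's rule.
    """
    if n is None:
        n = sum(y_true)
    if n <= 0:
        raise ValueError("n must be positive")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n]
    return sum(y_true[i] for i in top) / n


y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.9, 0.8, 0.2, 0.3]
p_at_n = precision_at_rank_n(y_true, scores)
```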
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
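The 'unify' option refers to the unifying scores of [BKKSZ11]. A minimal sketch of Gaussian scaling, one of the conversions in that framework, is shown below. The helper unify_outlier_proba is hypothetical; it assumes the mean and population standard deviation are estimated from the training scores and that negative values of the error function are clipped to 0.

```python
import math
import statistics


def unify_outlier_proba(train_scores, test_scores):
    """Gaussian-scaling sketch of the 'unify' conversion.

    Assumptions: mean and (population) standard deviation come from the
    training scores, and negative values of erf are clipped to 0.
    """
    mu = statistics.fmean(train_scores)
    sigma = statistics.pstdev(train_scores)
    if sigma == 0:
        raise ValueError("training scores are constant; cannot rescale")
    rows = []
    for s in test_scores:
        p_out = math.erf((s - mu) / (sigma * math.sqrt(2.0)))
        p_out = min(1.0, max(0.0, p_out))
        rows.append([1.0 - p_out, p_out])
    return rows


probs = unify_outlier_proba([1.0, 2.0, 3.0], [2.0, 3.0, 10.0])
```

A score equal to the training mean maps to an outlier probability of 0, and scores far above the mean approach 1.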
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If true, it returns also three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate: float, if return_stats is True
upperbound_rejection_rate: float, if return_stats is True
upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.kpca module¶
Kernel Principal Component Analysis (KPCA) Outlier Detector
- class pyod.models.kpca.KPCA(contamination=0.1, n_components=None, n_selected_components=None, kernel='rbf', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, eigen_solver='auto', tol=0, max_iter=None, remove_zero_eig=False, copy_X=True, n_jobs=None, sampling=False, subset_size=20, random_state=None)[source]¶
Bases:
BaseDetector
KPCA class for outlier detection.
PCA is performed on the feature space uniquely determined by the kernel, and the reconstruction error on the feature space is used as the anomaly score.
See [BHof07] Heiko Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognition, vol.40, no.3, pp. 863-874, 2007. https://www.sciencedirect.com/science/article/pii/S0031320306003414 for details.
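To make "the feature space determined by the kernel" more concrete, the default RBF kernel can be sketched in plain Python. The function rbf_kernel is a hypothetical helper; it assumes gamma falls back to 1/n_features when it is not set. The kernel value plays the role of an inner product in the implicit feature space, and the sketch does not cover the PCA or reconstruction-error steps.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

    When gamma is None it falls back to 1 / n_features.
    """
    if gamma is None:
        gamma = 1.0 / len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0))  # identical points
k_far = rbf_kernel((0.0, 0.0), (3.0, 4.0))   # squared distance 25
```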
Parameters¶
- n_componentsint, optional (default=None)
Number of components. If None, all non-zero components are kept.
- n_selected_componentsint, optional (default=None)
Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.
- kernelstring {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’, ‘precomputed’}, optional (default=’rbf’)
Kernel used for PCA.
- gammafloat, optional (default=None)
Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features.
- degreeint, optional (default=3)
Degree for poly kernels. Ignored by other kernels.
- coef0float, optional (default=1)
Independent term in poly and sigmoid kernels. Ignored by other kernels.
- kernel_paramsdict, optional (default=None)
Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.
- alphafloat, optional (default=1.0)
Hyperparameter of the ridge regression that learns the inverse transform (when inverse_transform=True).
- eigen_solverstring, {‘auto’, ‘dense’, ‘arpack’, ‘randomized’}, default=’auto’
Select eigensolver to use. If n_components is much less than the number of training samples, randomized (or arpack to a lesser extent) may be more efficient than the dense eigensolver. Randomized SVD is performed according to the method of Halko et al.
- auto :
the solver is selected by a default policy based on n_samples (the number of training samples) and n_components: if the number of components to extract is less than 10 (strict) and the number of samples is more than 200 (strict), the ‘arpack’ method is enabled. Otherwise the exact full eigenvalue decomposition is computed and optionally truncated afterwards (‘dense’ method).
- dense :
run exact full eigenvalue decomposition calling the standard LAPACK solver via scipy.linalg.eigh, and select the components by postprocessing.
- arpack :
run SVD truncated to n_components calling ARPACK solver using scipy.sparse.linalg.eigsh. It requires strictly 0 < n_components < n_samples
- randomized :
run randomized SVD. The implementation selects eigenvalues based on their modulus; therefore using this method can lead to unexpected results if the kernel is not positive semi-definite.
- tolfloat, optional (default=0)
Convergence tolerance for arpack. If 0, optimal value will be chosen by arpack.
- max_iterint, optional (default=None)
Maximum number of iterations for arpack. If None, optimal value will be chosen by arpack.
- remove_zero_eigbool, optional (default=False)
If True, then all components with zero eigenvalues are removed, so that the number of components in the output may be < n_components (and sometimes even zero due to numerical instability). When n_components is None, this parameter is ignored and components with zero eigenvalues are removed regardless.
- copy_Xbool, optional (default=True)
If True, input X is copied and stored by the model in the X_fit_ attribute. If no further changes will be done to X, setting copy_X=False saves memory by storing a reference.
- n_jobsint, optional (default=None)
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- samplingbool, optional (default=False)
If True, a subset is sampled from the dataset only once, in order to reduce time complexity while preserving detection performance.
- subset_sizefloat in (0., 1.0) or int (0, n_samples), optional (default=20)
If sampling is True, the size of subset is specified.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
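The 'auto' policy described for eigen_solver can be written out as a small decision function. The name choose_eigen_solver is hypothetical, and treating n_components=None as "all components" is an assumption made for this sketch.

```python
def choose_eigen_solver(n_samples, n_components=None):
    """Return 'arpack' or 'dense' following the 'auto' rule described above.

    Assumption: n_components=None is treated as n_samples, i.e. all
    components are requested.
    """
    if n_components is None:
        n_components = n_samples
    # Both comparisons are strict, as stated in the parameter description.
    if n_components < 10 and n_samples > 200:
        return "arpack"
    return "dense"


small_problem = choose_eigen_solver(n_samples=150, n_components=5)
large_problem = choose_eigen_solver(n_samples=1000, n_components=5)
all_components = choose_eigen_solver(n_samples=1000)
```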
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate: float, the expected rejection rate
upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.
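The ROC score mentioned above can be computed directly from its ranking interpretation. The function roc_auc below is a hypothetical plain-Python sketch intended for small inputs; it is not the evaluation routine used by the library.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random outlier outscores a random inlier.

    Ties count as half. O(n_pos * n_neg), fine for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 0, 1], [0.1, 0.9, 0.8, 0.2, 0.3])
```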
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If true, it returns also three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate: float, if return_stats is True
upperbound_rejection_rate: float, if return_stats is True
upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.lmdd module¶
Linear Model Deviation-based Outlier Detection (LMDD).
- class pyod.models.lmdd.LMDD(contamination=0.1, n_iter=50, dis_measure='aad', random_state=None)[source]¶
Bases:
BaseDetector
Linear Method for Deviation-based Outlier Detection.
LMDD employs the concept of the smoothing factor, which indicates how much the dissimilarity can be reduced by removing a subset of elements from the data set. Read more in [BAAR96].
Note: this implementation has a minor modification to make it output scores instead of labels.
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_iterint, optional (default=50)
Number of iterations; in each iteration, the process is repeated after randomizing the order of the input. Note that n_iter strongly affects accuracy: higher values give better accuracy at the cost of longer execution time.
- dis_measure: str, optional (default=’aad’)
Dissimilarity measure to be used in calculating the smoothing factor for points, options available:
‘aad’: Average Absolute Deviation
‘var’: Variance
‘iqr’: Interquartile Range
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
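The three dissimilarity measures can be sketched for a one-dimensional sample as below. Both helpers are hypothetical, and the exact definitions are assumptions: 'aad' is taken as the mean absolute deviation from the mean, 'var' as the population variance, and 'iqr' as the difference between linearly interpolated 75th and 25th percentiles. The last line shows how removing one element reduces the dissimilarity, which is the quantity the smoothing factor builds on; it is not the full smoothing factor.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def dissimilarity(values, dis_measure="aad"):
    """Dissimilarity of a 1-D sample under one of the three listed measures.

    Assumptions: 'aad' is the mean absolute deviation from the mean, 'var'
    is the population variance, and 'iqr' uses interpolated percentiles.
    """
    if dis_measure == "aad":
        mean = statistics.fmean(values)
        return statistics.fmean([abs(v - mean) for v in values])
    if dis_measure == "var":
        return statistics.pvariance(values)
    if dis_measure == "iqr":
        return percentile(values, 75) - percentile(values, 25)
    raise ValueError(f"unknown dis_measure: {dis_measure!r}")


data = [1.0, 2.0, 3.0, 4.0, 100.0]
aad, var, iqr = (dissimilarity(data, m) for m in ("aad", "var", "iqr"))
# Reduction in dissimilarity obtained by removing the last element.
reduction = dissimilarity(data, "aad") - dissimilarity(data[:-1], "aad")
```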
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate: float, the expected rejection rate
upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring could be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
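The 'linear' conversion described above can be illustrated with a small standard-library sketch. It assumes the scaling range is taken from the training scores and that out-of-range values are clipped; the function name `linear_outlier_proba` is hypothetical and this is not pyod's own code.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max scale test scores using the range of the training scores."""
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    probs = []
    for s in test_scores:
        # Guard against a constant training score (span == 0).
        p_out = 0.0 if span == 0 else (s - lo) / span
        # Clip so scores outside the training range stay inside [0, 1].
        p_out = min(1.0, max(0.0, p_out))
        # Two columns: [probability of normal, probability of outlier].
        probs.append([1.0 - p_out, p_out])
    return probs


train = [0.2, 0.5, 1.0, 3.0]
proba = linear_outlier_proba(train, [0.2, 1.6, 5.0])
```

Each row sums to 1, matching the two-class layout ([proba of normal, proba of outliers]) described in the Returns entry.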
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate : float, if return_stats is True
upperbound_rejection_rate : float, if return_stats is True
upperbound_cost : float, if return_stats is True
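To make the role of T concrete, the following sketch computes the rejection threshold 1-2exp(-T) and replaces low-confidence labels with -2. It assumes the confidences are already scaled to [0, 1] and that a prediction is rejected when its confidence is at or below the threshold; both points, and the helper names, are assumptions rather than details taken from the library.

```python
import math

REJECT = -2  # label used for rejected predictions, as described above


def rejection_threshold(T):
    """Rejection threshold 1 - 2*exp(-T)."""
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, confidences, T=32):
    """Replace labels whose confidence is at or below the threshold."""
    thr = rejection_threshold(T)
    return [REJECT if c <= thr else y for y, c in zip(labels, confidences)]


labels = [0, 1, 1]
confidences = [1.0, 0.4, 0.9]
low_T = apply_rejection(labels, confidences, T=1)   # threshold about 0.264
high_T = apply_rejection(labels, confidences, T=3)  # threshold about 0.900
```

With T=1 nothing is rejected, while T=3 rejects the two less confident predictions, which is consistent with "the higher the value of T, the more rejections are made".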
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
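The `<component>__<parameter>` convention can be sketched with a toy class. `Component` below is a hypothetical stand-in, not scikit-learn's BaseEstimator, and it skips the parameter validation a real estimator would perform.

```python
class Component:
    """Hypothetical estimator-like object holding plain attributes."""

    def __init__(self, **params):
        self.__dict__.update(params)

    def set_params(self, **params):
        nested = {}
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if sep:
                # "<component>__<parameter>": defer to the named sub-object.
                nested.setdefault(name, {})[sub_key] = value
            else:
                setattr(self, name, value)
        for name, sub_params in nested.items():
            getattr(self, name).set_params(**sub_params)
        return self


detector = Component(contamination=0.1)
pipeline = Component(scaler=Component(with_mean=True), detector=detector)
result = pipeline.set_params(detector__contamination=0.05,
                             scaler__with_mean=False)
```

Returning `self` mirrors the documented return value and allows calls to be chained.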
pyod.models.loda module¶
Loda: Lightweight on-line detector of anomalies. Adapted from tilitools (https://github.com/nicococo/tilitools) by
- class pyod.models.loda.LODA(contamination=0.1, n_bins=10, n_random_cuts=100)[source]¶
Bases:
BaseDetector
Loda: Lightweight on-line detector of anomalies. See [BPevny16] for more information.
Two versions of LODA are supported:
Static number of bins: uses a static number of bins for all random cuts.
Automatic number of bins: every random cut uses a number of bins deemed to be optimal according to the Birgé-Rozenholc method ([BBirgeR06]).
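As a rough illustration of the static-bins variant, the sketch below projects the data onto sparse random directions (the "random cuts"), builds one histogram per direction, and scores a sample by its average negative log density. It is a simplified standard-library sketch under stated assumptions: about sqrt(d) non-zero weights per cut, equal-width bins, and a fixed floor density for empty bins and out-of-range values. The helper names are hypothetical and pyod's implementation may differ in all of these details.

```python
import math
import random

FLOOR = 1e-12  # density assigned to empty bins and out-of-range values


def fit_loda(X, n_bins=10, n_random_cuts=100, seed=0):
    """Build one sparse random projection and histogram per random cut."""
    rng = random.Random(seed)
    n_features = len(X[0])
    n_nonzero = max(1, round(math.sqrt(n_features)))
    cuts = []
    for _ in range(n_random_cuts):
        # Sparse projection vector: only n_nonzero weights are non-zero.
        w = [0.0] * n_features
        for j in rng.sample(range(n_features), n_nonzero):
            w[j] = rng.gauss(0.0, 1.0)
        z = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
        lo, hi = min(z), max(z)
        width = (hi - lo) / n_bins or 1.0
        counts = [0] * n_bins
        for v in z:
            counts[min(n_bins - 1, int((v - lo) / width))] += 1
        # Normalised bin densities, floored to avoid log(0).
        dens = [max(c / (len(z) * width), FLOOR) for c in counts]
        cuts.append((w, lo, hi, width, dens))
    return cuts


def loda_scores(cuts, X):
    """Average negative log density over all cuts (higher = more abnormal)."""
    scores = []
    for x in X:
        total = 0.0
        for w, lo, hi, width, dens in cuts:
            v = sum(wi * xi for wi, xi in zip(w, x))
            if lo <= v <= hi:
                d = dens[min(len(dens) - 1, int((v - lo) / width))]
            else:
                d = FLOOR
            total -= math.log(d)
        scores.append(total / len(cuts))
    return scores


data_rng = random.Random(1)
X_train = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
cuts = fit_loda(X_train, n_bins=10, n_random_cuts=20)
scores = loda_scores(cuts, X_train + [[8.0, 8.0]])
```

The last entry of `scores` belongs to the far-away point (8, 8), which falls outside every histogram and therefore receives the largest score.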
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- n_binsint or string, optional (default = 10)
The number of bins for the histogram. If set to “auto”, the Birgé-Rozenholc method will be used to automatically determine the optimal number of bins.
- n_random_cutsint, optional (default = 100)
The number of random cuts.
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
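The relationship between contamination, threshold_ and labels_ can be sketched as follows. This simplified version picks the score of the k-th most abnormal training sample, with k = n_samples * contamination, and labels everything at or above it as an outlier. The library may compute the cut-off differently (for example with a percentile and a strict inequality), so treat the function below as an assumption-laden illustration with a hypothetical name.

```python
def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a score cut-off and binary labels from training scores."""
    # Number of samples to flag; at least one so the threshold is defined.
    n_outliers = max(1, int(len(decision_scores) * contamination))
    ranked = sorted(decision_scores, reverse=True)
    threshold = ranked[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in decision_scores]
    return threshold, labels


scores = [0.1, 0.3, 0.2, 5.0, 0.4, 0.25, 0.15, 0.35, 0.05, 0.45]
threshold, labels = threshold_and_labels(scores, contamination=0.1)
```

With ten samples and contamination=0.1, exactly one sample (the score 5.0) is labelled 1.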
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate : float, the expected rejection rate
upperbound_rejection_rate : float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost : float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
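The two scoring options named above, ROC and precision @ rank n, can be computed separately from ground truth labels and outlier scores. The sketch below uses only the standard library; the function names are hypothetical, and the precision variant simply takes the top-n ranked samples, which may differ from the library's own handling of ties and thresholds.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one inlier and one outlier")
    # Count outlier/inlier pairs ranked correctly; ties count as half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def precision_at_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y_true)  # default: n equals the number of true outliers
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:n]) / n


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
p_at_n = precision_at_n(y_true, scores)
```

In practice the scores would come from decision_scores_ (training data) or decision_function (new data) of a fitted detector.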
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate : float, if return_stats is True
upperbound_rejection_rate : float, if return_stats is True
upperbound_cost : float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.lof module¶
Local Outlier Factor (LOF). Implemented on top of the scikit-learn library.
- class pyod.models.lof.LOF(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination=0.1, n_jobs=1, novelty=True)[source]¶
Bases:
BaseDetector
Wrapper of scikit-learn LOF Class with more functionalities. Unsupervised Outlier Detection using Local Outlier Factor (LOF).
The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers. See [BBKNS00] for details.
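The density comparison described above can be written out in a few lines. The sketch follows the textbook LOF definitions (k-distance, reachability distance, local reachability density) using only the standard library. It is an illustration rather than the scikit-learn implementation that this class wraps, and it does not handle duplicate points, which would make a reachability sum zero.

```python
import math


def lof_scores(X, n_neighbors=2):
    """Textbook Local Outlier Factor for a small list of points."""
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    # k nearest neighbours of each point, excluding the point itself.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:n_neighbors]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def local_reachability_density(i):
        # Reachability distance of i from j: max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)
    ]


X = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
scores = lof_scores(X, n_neighbors=2)
```

The four corners of the unit square get a factor of about 1 (same density as their neighbours), while the isolated point (6, 6) gets a much larger value.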
Parameters¶
- n_neighborsint, optional (default=20)
Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.
- algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
Algorithm used to compute the nearest neighbors:
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
- leaf_sizeint, optional (default=30)
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- metricstring or callable, default ‘minkowski’
metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If ‘precomputed’, the training input X is expected to be a distance matrix.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
Valid values for metric are:
from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for scipy.spatial.distance for details on these metrics: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
- pinteger, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances
- metric_paramsdict, optional (default = None)
Additional keyword arguments for the metric function.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.
- n_jobsint, optional (default = 1)
The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods.
- noveltybool (default=True)
By default, scikit-learn's LocalOutlierFactor is only meant to be used for outlier detection (novelty=False); this wrapper defaults to novelty=True, as shown in the class signature. Set novelty to True if you want to use LocalOutlierFactor for novelty detection. In this case be aware that you should only use predict, decision_function and score_samples on new unseen data and not on the training set.
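The effect of the p parameter listed above is easiest to see on a single pair of points. The helper below is a plain standard-library sketch of the Minkowski distance, not the scikit-learn routine; its name is chosen here for illustration.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski (l_p) distance between two equal-length vectors."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
d_manhattan = minkowski_distance(a, b, p=1)  # l1: |3| + |4|
d_euclidean = minkowski_distance(a, b, p=2)  # l2: sqrt(3^2 + 4^2)
```

For these points p=1 gives 7 and p=2 gives 5, matching the manhattan and euclidean cases named in the parameter description.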
Attributes¶
- n_neighbors_int
The actual number of neighbors used for kneighbors queries.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate : float, the expected rejection rate
upperbound_rejection_rate : float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost : float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
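One way such a stability estimate could be computed is sketched below, loosely following the idea referenced in [BPVD20]: estimate how extreme a test score is relative to the training scores, then ask how likely it is that the sample would still land on the same side of the contamination cut-off. The smoothed estimate, the binomial step and all names below are assumptions made for illustration and are not taken from the pyod source.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def prediction_confidence(train_scores, threshold, contamination, test_scores):
    n = len(train_scores)
    n_inliers = n - int(n * contamination)
    confidences = []
    for s in test_scores:
        # Smoothed fraction of training scores that are no larger than s.
        t = sum(1 for v in train_scores if v <= s)
        p = (1 + t) / (2 + n)
        # Probability that the sample would rank among the outliers.
        conf_outlier = min(1.0, max(0.0, 1.0 - binom_cdf(n_inliers, n, p)))
        is_outlier = s > threshold
        confidences.append(conf_outlier if is_outlier else 1.0 - conf_outlier)
    return confidences


train_scores = [i / 10 for i in range(100)]
conf = prediction_confidence(train_scores, threshold=8.95,
                             contamination=0.1,
                             test_scores=[-1.0, 8.9, 50.0])
```

Scores far from the cut-off (the first and last test values) receive a confidence close to 1, while the score just below the cut-off receives a noticeably lower one.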
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
expected_rejection_rate : float, if return_stats is True
upperbound_rejection_rate : float, if return_stats is True
upperbound_cost : float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.loci module¶
Local Correlation Integral (LOCI). Parts of the code are adapted from https://github.com/Cloudy10/loci
- class pyod.models.loci.LOCI(contamination=0.1, alpha=0.5, k=3)[source]¶
Bases:
BaseDetector
Local Correlation Integral.
LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). It offers the following advantages and novelties:
(a) It provides an automatic, data-dictated cut-off to determine whether a point is an outlier; in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset.
(b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score.
(c) It can be computed as quickly as the best previous methods.
Read more in [BPKGF03].
Parameters¶
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- alphafloat, default = 0.5
The neighbourhood parameter, which controls how large a neighbourhood should be considered “local”.
- k: int, default = 3
An outlier cutoff threshold for determining whether or not a point should be considered an outlier.
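The roles of alpha and k can be illustrated with the multi-granularity deviation factor (MDEF) at a single radius r: a point is flagged when its MDEF exceeds k times the normalised deviation of neighbour counts. The sketch below evaluates only one hand-picked radius, whereas the full method examines a range of radii; the function name and the radius argument are assumptions made for this example and it is not the library's implementation.

```python
import math


def loci_flags(X, r, alpha=0.5, k=3):
    """Flag points whose MDEF at radius r exceeds k * sigma_MDEF."""
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    # n(p, alpha*r): size of each point's counting neighbourhood (self included).
    n_count = [sum(1 for d in row if d <= alpha * r) for row in dist]
    flags = []
    for i in range(n):
        # Sampling neighbourhood: all points within radius r of point i.
        counts = [n_count[j] for j in range(n) if dist[i][j] <= r]
        n_hat = sum(counts) / len(counts)
        sigma = math.sqrt(sum((c - n_hat) ** 2 for c in counts) / len(counts))
        mdef = 1.0 - n_count[i] / n_hat
        sigma_mdef = sigma / n_hat
        flags.append(mdef > k * sigma_mdef)
    return flags


# Twenty clustered 1-D points plus one isolated point at 3.0.
X = [(i * 0.05,) for i in range(20)] + [(3.0,)]
flags = loci_flags(X, r=4.0)
```

Only the isolated point is flagged: its counting neighbourhood contains just itself, while the neighbourhoods of the points around it contain twenty points each.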
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
Examples¶
>>> from pyod.models.loci import LOCI
>>> from pyod.utils.data import generate_data
>>> n_train = 50
>>> n_test = 50
>>> contamination = 0.1
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=n_train, n_test=n_test,
...     contamination=contamination, random_state=42)
>>> clf = LOCI()
>>> clf.fit(X_train)
LOCI(alpha=0.5, contamination=0.1, k=3)
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1-2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
expected_rejection_rate : float, the expected rejection rate
upperbound_rejection_rate : float, the upper bound for the rejection rate, satisfied with probability 1-delta
upperbound_cost : float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly scores of X using the fitted detector.
The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit the model using X as training data.
Parameters¶
- Xarray, shape (n_samples, n_features)
Training data.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
self : object
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; for consistency, call fit first and then access the labels_ attribute instead. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
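The ‘linear’ conversion described above can be sketched as follows. This is an illustration rather than the library's code: it assumes the min and max are taken from the training scores and that rescaled test scores are clipped to [0, 1]. The helper name linear_outlier_proba is hypothetical.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max scale test scores using the range of the training scores."""
    lo, hi = min(train_scores), max(train_scores)
    if hi == lo:
        raise ValueError("training scores are constant; cannot rescale")
    proba = []
    for s in test_scores:
        p_out = (s - lo) / (hi - lo)
        p_out = min(1.0, max(0.0, p_out))   # clip to [0, 1]
        proba.append([1.0 - p_out, p_out])  # [proba of normal, proba of outlier]
    return proba


train_scores = [1.0, 2.0, 3.0, 5.0]
print(linear_outlier_proba(train_scores, [0.0, 2.0, 6.0]))
```

Each row has two columns, matching the (n_samples, n_classes) shape documented for outlier_probability.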
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
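The role of T can be illustrated with a short sketch. It assumes each prediction comes with a confidence value in [0, 1] and that a prediction is rejected when this value does not exceed 1 - 2exp(-T). How the library derives that value internally is not shown here, and the helper names are hypothetical.

```python
import math

REJECT = -2  # label used for rejected predictions, as documented above


def rejection_threshold(T):
    """Threshold 1 - 2*exp(-T) from the description of T."""
    return 1.0 - 2.0 * math.exp(-T)


def apply_rejection(labels, confidence, T):
    """Replace labels whose confidence does not exceed the threshold.

    `confidence` is assumed to be a per-sample value in [0, 1].
    """
    thr = rejection_threshold(T)
    return [lab if c > thr else REJECT for lab, c in zip(labels, confidence)]


# With T=2 the threshold is about 0.729, so only the middle sample is rejected.
print(apply_rejection([0, 1, 0], [0.99, 0.40, 0.80], T=2))
```

Because the threshold grows toward 1 as T increases, larger values of T reject more predictions, in line with the parameter description.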
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
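The `<component>__<parameter>` naming convention can be illustrated with a small sketch that splits such a key at the first double underscore. The helper split_nested_param is hypothetical and only mirrors the naming rule; it is not the estimator's own code.

```python
def split_nested_param(name):
    """Split 'component__parameter' into (component, parameter).

    Plain names without a double underscore return (None, name).
    """
    component, sep, parameter = name.partition("__")
    if not sep:
        return None, name
    return component, parameter


print(split_nested_param("detector__n_neighbors"))
print(split_nested_param("contamination"))
```

Only the first separator is split off, so a deeper key such as `a__b__c` yields `('a', 'b__c')` and the remainder can be resolved by the nested object in turn.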
pyod.models.lunar module¶
LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks
- class pyod.models.lunar.LUNAR(model_type='WEIGHT', n_neighbours=5, negative_sampling='MIXED', val_size=0.1, scaler=MinMaxScaler(), epsilon=0.1, proportion=1.0, n_epochs=200, lr=0.001, wd=0.1, verbose=0, contamination=0.1)[source]¶
Bases:
BaseDetector
LUNAR class for outlier detection. See https://www.aaai.org/AAAI22Papers/AAAI-51.GoodgeA.pdf for details. For an observation, its ordered list of distances to its k nearest neighbours is input to a neural network, with one of the following outputs:
- SCORE_MODEL: network directly outputs the anomaly score.
- WEIGHT_MODEL: network outputs a set of weights for the k distances; the anomaly score is then the sum of weighted distances.
See [BGHNN22] for details.
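The WEIGHT_MODEL scoring rule can be sketched in a few lines. In LUNAR the weights come from the trained network; here they are supplied by hand purely for illustration, and the function name is hypothetical.

```python
def weighted_distance_score(distances, weights):
    """Anomaly score as the sum of weighted k-nearest-neighbour distances."""
    if len(distances) != len(weights):
        raise ValueError("one weight is required per neighbour distance")
    return sum(w * d for w, d in zip(weights, distances))


# Ordered distances from one observation to its k = 3 nearest neighbours.
knn_distances = [0.5, 1.0, 2.0]
# Hand-picked weights standing in for the network output.
weights = [0.5, 0.25, 0.25]
print(weighted_distance_score(knn_distances, weights))
```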
Parameters¶
- model_type: str in [‘WEIGHT’, ‘SCORE’], optional (default = ‘WEIGHT’)
Whether to use WEIGHT_MODEL or SCORE_MODEL for anomaly scoring.
- n_neighbours: int, optional (default = 5)
Number of neighbors to use by default for k neighbors queries.
- negative_sampling: str in [‘UNIFORM’, ‘SUBSPACE’, ‘MIXED’], optional (default = ‘MIXED’)
Type of negative samples to use between:
‘UNIFORM’: uniformly distributed samples
‘SUBSPACE’: subspace perturbation (additive random noise in a subset of features)
‘MIXED’: a combination of both types of samples
- val_size: float in [0,1], optional (default = 0.1)
Proportion of samples to be used for model validation
- scaler: object in {StandardScaler(), MinMaxScaler()}, optional (default = MinMaxScaler())
Method of data normalization
- epsilon: float, optional (default = 0.1)
Hyper-parameter for the generation of negative samples. A smaller epsilon results in negative samples more similar to normal samples.
- proportion: float, optional (default = 1.0)
Hyper-parameter for the proportion of negative samples to use relative to the number of normal training samples.
- n_epochs: int, optional (default = 200)
Number of epochs to train neural network.
- lr: float, optional (default = 0.001)
Learning rate.
- wd: float, optional (default = 0.1)
Weight decay.
- verbose: int in {0,1}, optional (default = 0)
To view or hide training progress
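The two basic negative sampling types named under negative_sampling can be sketched as below. This is one plausible reading of the parameter descriptions, not LUNAR's actual code: it assumes uniform samples are drawn within a given value range, and that subspace perturbation adds Gaussian noise scaled by epsilon to a random subset of features. The function names and the perturb_prob argument are hypothetical.

```python
import random


def uniform_negatives(n_samples, n_features, low, high, rng):
    """Draw negative samples uniformly from [low, high] in every feature."""
    return [[rng.uniform(low, high) for _ in range(n_features)]
            for _ in range(n_samples)]


def subspace_negatives(X, epsilon, rng, perturb_prob=0.5):
    """Add Gaussian noise of scale epsilon to a random subset of features.

    Each feature of each row is perturbed with probability perturb_prob
    (an assumed value); a smaller epsilon keeps the result closer to X.
    """
    return [[v + epsilon * rng.gauss(0.0, 1.0) if rng.random() < perturb_prob else v
             for v in row]
            for row in X]


rng = random.Random(0)
X = [[0.1, 0.2], [0.3, 0.4]]
print(uniform_negatives(2, 2, 0.0, 1.0, rng))
print(subspace_negatives(X, epsilon=0.1, rng=rng))
```

Under this reading, ‘MIXED’ would combine samples produced by both functions.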
Attributes¶
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is assumed to be 0 for all training samples.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Overwritten with 0 for all training samples (assumed to be normal).
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.lscp module¶
Locally Selective Combination of Parallel Outlier Ensembles (LSCP). Adapted from the original implementation.
- class pyod.models.lscp.LSCP(detector_list, local_region_size=30, local_max_features=1.0, n_bins=10, random_state=None, contamination=0.1)[source]¶
Bases:
BaseDetector
Locally Selective Combination of Parallel Outlier Ensembles
LSCP is an unsupervised parallel outlier detection ensemble which selects competent detectors in the local region of a test instance. This implementation uses an Average of Maximum strategy. First, a heterogeneous list of base detectors is fit to the training data; then a pseudo ground truth for each training instance is generated by taking the maximum outlier score.
For each test instance:
1) The local region is defined to be the set of nearest training points in randomly sampled feature subspaces which occur more frequently than a defined threshold over multiple iterations.
2) Using the local region, a local pseudo ground truth is defined and the Pearson correlation is calculated between each base detector’s training outlier scores and the pseudo ground truth.
3) A histogram is built out of Pearson correlation scores; detectors in the largest bin are selected as competent base detectors for the given test instance.
4) The average outlier score of the selected competent detectors is taken to be the final score.
See [BZNHL19] for details.
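Steps 2 to 4 above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the local region is taken as given, the pseudo ground truth is the row-wise maximum of the local training scores, and "largest bin" is read as the most populated histogram bin. The function name select_and_average is hypothetical.

```python
import numpy as np


def select_and_average(local_train_scores, test_scores, n_bins=10):
    """Pick competent detectors for one test instance and average their scores.

    local_train_scores: (n_local, n_detectors) training scores in the local region.
    test_scores: (n_detectors,) score of each detector for the test instance.
    """
    local_train_scores = np.asarray(local_train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    # Local pseudo ground truth: maximum score across detectors.
    pseudo_truth = local_train_scores.max(axis=1)
    # Step 2: Pearson correlation of each detector with the pseudo ground truth.
    corr = np.array([
        np.corrcoef(local_train_scores[:, j], pseudo_truth)[0, 1]
        for j in range(local_train_scores.shape[1])
    ])
    corr = np.nan_to_num(corr)  # constant score columns give NaN; treat as 0
    # Step 3: histogram of correlations; keep detectors in the fullest bin.
    hist, edges = np.histogram(corr, bins=n_bins)
    b = int(np.argmax(hist))
    selected = np.where((corr >= edges[b]) & (corr <= edges[b + 1]))[0]
    # Step 4: average the selected detectors' scores for the test instance.
    return float(test_scores[selected].mean()), selected


local_scores = [[1, 2, 4], [2, 4, 3], [3, 6, 2], [4, 8, 1]]
score, selected = select_and_average(local_scores, [0.2, 0.4, 0.9], n_bins=2)
print(score, selected)
```

In this toy input the first two detectors rise with the pseudo ground truth while the third falls, so only the first two are averaged.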
Parameters¶
- detector_listList, length must be greater than 1
Base unsupervised outlier detectors from PyOD. (Note: requires fit and decision_function methods)
- local_region_sizeint, optional (default=30)
Number of training points to consider in each iteration of the local region generation process (30 by default).
- local_max_featuresfloat in (0.5, 1.), optional (default=1.0)
Maximum proportion of number of features to consider when defining the local region (1.0 by default).
- n_binsint, optional (default=10)
Number of bins to use when selecting the local region
- random_stateRandomState, optional (default=None)
A random number generator instance to define the state of the random permutations generator.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (0.1 by default).
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
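The relationship between contamination, threshold_ and labels_ can be sketched as below. It assumes the threshold is the (1 - contamination) percentile of the training scores and that labels are obtained with a strict "greater than" comparison; the helper name is hypothetical.

```python
import numpy as np


def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a score threshold and binary labels from training scores."""
    scores = np.asarray(decision_scores, dtype=float)
    # Score above which roughly n_samples * contamination samples lie.
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)  # 1 = outlier, 0 = inlier
    return threshold, labels


scores = np.arange(1, 11) / 10.0  # 0.1, 0.2, ..., 1.0
threshold, labels = threshold_and_labels(scores, contamination=0.1)
print(threshold, labels)
```

With ten samples and contamination=0.1, only the single highest score is labelled as an outlier.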
Examples¶
>>> from pyod.utils.data import generate_data
>>> from pyod.utils.utility import standardizer
>>> from pyod.models.lscp import LSCP
>>> from pyod.models.lof import LOF
>>> X_train, y_train, X_test, y_test = generate_data(
...     n_train=50, n_test=50,
...     contamination=0.1, random_state=42)
>>> X_train, X_test = standardizer(X_train, X_test)
>>> detector_list = [LOF(), LOF()]
>>> clf = LSCP(detector_list)
>>> clf.fit(X_train)
LSCP(...)
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.mad module¶
Median Absolute Deviation (MAD) algorithm. Strictly for univariate data.
- class pyod.models.mad.MAD(threshold=3.5, contamination=0.1)[source]¶
Bases:
BaseDetector
Median Absolute Deviation: for measuring the distances between data points and the median in terms of median distance. See [BIH93] for details.
Parameters¶
- thresholdfloat, optional (default=3.5)
The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
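The modified z-score referred to by threshold can be sketched for a univariate sample as follows. The constant 0.6745 follows the usual Iglewicz and Hoaglin formulation and is an assumption about the scaling used here; the function name is hypothetical.

```python
from statistics import median


def modified_z_scores(x, consistency=0.6745):
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = median(x)
    mad = median(abs(v - med) for v in x)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this sample")
    return [consistency * abs(v - med) / mad for v in x]


data = [1, 2, 3, 4, 100]
scores = modified_z_scores(data)
labels = [int(s > 3.5) for s in scores]  # 3.5 is the documented default threshold
print(scores)
print(labels)
```

Here the median is 3 and the MAD is 1, so only the value 100 exceeds the default threshold of 3.5.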
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If True, prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector. The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator. Note that n_features must equal 1.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples. Note that n_features must equal 1.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.mcd module¶
Outlier Detection with Minimum Covariance Determinant (MCD)
- class pyod.models.mcd.MCD(contamination=0.1, store_precision=True, assume_centered=False, support_fraction=None, random_state=None)[source]¶
Bases: BaseDetector
Detect outliers in a Gaussian-distributed dataset using the Minimum Covariance Determinant (MCD), a robust estimator of covariance.
The Minimum Covariance Determinant covariance estimator is to be applied on Gaussian-distributed data, but could still be relevant on data drawn from a unimodal, symmetric distribution. It is not meant to be used with multi-modal data (the algorithm used to fit a MinCovDet object is likely to fail in such a case). One should consider projection pursuit methods to deal with multi-modal datasets.
The detector first fits a minimum covariance determinant model and then uses the Mahalanobis distance as the outlier score of each sample.
See [BHR04, BRD99] for details.
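The scoring step can be sketched in plain Python. This is a minimal illustration, assuming a 2-D dataset whose location and covariance have already been estimated (for example by an MCD fit). The helper name `mahalanobis_sq_2d` and the sample values are hypothetical, and this is not PyOD's implementation.

```python
def mahalanobis_sq_2d(point, location, covariance):
    """Squared Mahalanobis distance of a 2-D point from `location`."""
    (a, b), (c, d) = covariance
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = point[0] - location[0]
    dy = point[1] - location[1]
    # Quadratic form (p - mu)^T * inv(cov) * (p - mu).
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


# Assumed estimates: the data spreads more along x (variance 4) than y (variance 1).
location = (0.0, 0.0)
covariance = ((4.0, 0.0), (0.0, 1.0))
points = [(2.0, 0.0), (0.0, 2.0), (0.0, 0.0)]

scores = [mahalanobis_sq_2d(p, location, covariance) for p in points]
print(scores)  # [1.0, 4.0, 0.0]
```

The two off-center points are equally far from the location in Euclidean terms, but the one displaced along the low-variance axis receives the larger score.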
Parameters¶
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- store_precision: bool, optional (default=True)
Specify if the estimated precision is stored.
- assume_centered: bool, optional (default=False)
If True, the support of the robust location and the covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful to work with data whose mean is approximately, but not exactly, equal to zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.
- support_fraction: float, 0 < support_fraction < 1, optional (default=None)
The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: [n_samples + n_features + 1] / 2
- random_state: int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Attributes¶
- raw_location_: array-like, shape (n_features,)
The raw robust estimated location before correction and re-weighting.
- raw_covariance_: array-like, shape (n_features, n_features)
The raw robust estimated covariance before correction and re-weighting.
- raw_support_: array-like, shape (n_samples,)
A mask of the observations that have been used to compute the raw robust estimates of location and shape, before correction and re-weighting.
- location_: array-like, shape (n_features,)
Estimated robust location.
- covariance_: array-like, shape (n_features, n_features)
Estimated robust covariance matrix.
- precision_: array-like, shape (n_features, n_features)
Estimated pseudo-inverse matrix (stored only if store_precision is True).
- support_: array-like, shape (n_samples,)
A mask of the observations that have been used to compute the robust estimates of location and shape.
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted. These are the Mahalanobis distances of the training set observations (the set on which fit is called).
- threshold_: float
The threshold is based on contamination. It is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
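How threshold_ and labels_ follow from decision_scores_ and contamination can be sketched as below. This is a simplified illustration in plain Python, assuming the threshold is the score of the (n_samples * contamination)-th most abnormal training sample. The function name `threshold_and_labels` and the sample scores are hypothetical, and the library's own computation (for example its handling of ties or use of a percentile) may differ.

```python
def threshold_and_labels(scores, contamination=0.1):
    """Return (threshold, labels) for a list of training outlier scores."""
    if not 0.0 < contamination < 0.5:
        raise ValueError("contamination must be in (0, 0.5)")
    # Number of training samples expected to be outliers (at least one).
    n_outliers = max(1, round(len(scores) * contamination))
    # Threshold = score of the n_outliers-th most abnormal sample.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    # 1 marks an outlier, 0 an inlier.
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels


scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.25, 0.12, 0.18, 0.22, 0.95]
threshold, labels = threshold_and_labels(scores, contamination=0.2)
print(threshold)  # 0.9
print(labels)     # [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```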
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
How the anomaly score of an input sample is computed depends on the detector algorithm. For consistency, outliers are assigned larger anomaly scores.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scores: numpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoring: str, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
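The ‘prc_n_score’ metric named above can be illustrated with a small sketch. It assumes that ground-truth labels are available and that n is the number of true outliers, so the metric is the fraction of true outliers among the n highest-scoring samples. The function name `precision_at_rank_n` and the data are hypothetical, and the library's own evaluation helper may handle ties differently.

```python
def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n highest-scoring samples,
    where n is the number of true outliers in y_true."""
    n = sum(y_true)
    if n == 0:
        raise ValueError("y_true contains no outliers")
    # Indices of the n samples with the largest outlier scores.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_n = ranked[:n]
    return sum(y_true[i] for i in top_n) / n


y_true = [0, 0, 1, 1, 0]            # two true outliers, so n = 2
scores = [0.1, 0.9, 0.8, 0.2, 0.3]  # outlier scores from some detector
p_at_n = precision_at_rank_n(y_true, scores)
print(p_at_n)  # 0.5
```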
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- params: mapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default=’linear’)
Probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, the outlier probability according to the fitted model, ranging in [0,1]. The number of columns depends on the number of classes, which is 2 by default ([probability of normal, probability of outlier]).
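The min-max conversion behind the ‘linear’ method can be sketched as follows. This is a minimal illustration, assuming the minimum and maximum are taken from the training scores and that rescaled values are clipped to [0,1]. The function name `linear_outlier_proba` and the numbers are hypothetical, and this is not the library's code.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max rescale test scores using the range of the training scores.

    Returns one [proba of normal, proba of outlier] pair per test sample.
    """
    lo, hi = min(train_scores), max(train_scores)
    if hi == lo:
        raise ValueError("training scores are constant; cannot rescale")
    proba = []
    for s in test_scores:
        p = (s - lo) / (hi - lo)
        # Test scores can fall outside the training range, so clip to [0, 1].
        p = min(1.0, max(0.0, p))
        proba.append([1.0 - p, p])
    return proba


proba = linear_outlier_proba([1.0, 3.0, 5.0], [0.0, 2.0, 6.0])
print(proba)  # [[1.0, 0.0], [0.75, 0.25], [0.0, 1.0]]
```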
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict whether a particular sample is an outlier, allowing the detector to reject (i.e., output = -2) low-confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
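The effect of T on rejection can be sketched in a few lines. The example assumes that each prediction comes with a confidence in [0,1] and that a prediction is rejected when its confidence does not exceed the threshold 1 - 2exp(-T). The function name `apply_rejection` and the confidence values are hypothetical, and the library's exact rule may differ.

```python
import math

REJECT = -2  # label used for rejected predictions


def apply_rejection(labels, confidences, T=32):
    """Replace low-confidence labels with REJECT."""
    threshold = 1.0 - 2.0 * math.exp(-T)
    return [REJECT if c <= threshold else y
            for y, c in zip(labels, confidences)]


labels = [1, 0, 0, 1]
confidences = [0.99, 0.50, 0.80, 0.60]

# With T=2 the threshold is about 0.729, so two predictions are rejected.
result = apply_rejection(labels, confidences, T=2)
print(result)  # [1, -2, 0, -2]
```

As T grows, the threshold approaches 1, which is why higher values of T lead to more rejections.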
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
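The `<component>__<parameter>` naming convention accepted by set_params can be illustrated with a short sketch that splits such a name at the first double underscore. The function name `split_nested_param` and the parameter names used below are hypothetical and only show the convention, not how any estimator applies it.

```python
def split_nested_param(name):
    """Split '<component>__<parameter>' into (component, parameter).

    Returns (None, name) when the name refers to a plain parameter.
    """
    component, sep, parameter = name.partition("__")
    if not sep:
        return None, name
    return component, parameter


plain = split_nested_param("contamination")
nested = split_nested_param("detector__contamination")
deep = split_nested_param("pipe__detector__nu")
print(plain)   # (None, 'contamination')
print(nested)  # ('detector', 'contamination')
print(deep)    # ('pipe', 'detector__nu')
```

In the last case the remainder still contains a separator, so the named component would resolve it in turn.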
pyod.models.mo_gaal module¶
Multiple-Objective Generative Adversarial Active Learning. Parts of the code are adapted from https://github.com/leibinghe/GAAL-based-outlier-detection
- class pyod.models.mo_gaal.MO_GAAL(k=10, stop_epochs=20, lr_d=0.01, lr_g=0.0001, momentum=0.9, contamination=0.1)[source]¶
Bases: BaseDetector
Multi-Objective Generative Adversarial Active Learning.
MO_GAAL directly generates informative potential outliers to assist the classifier in describing a boundary that can separate outliers from normal data effectively. Moreover, to prevent the generator from falling into the mode collapse problem, the network structure is expanded from a single generator (SO-GAAL) to multiple generators with different objectives (MO-GAAL) to generate a reasonable reference distribution for the whole dataset. Read more in [BLLZ+19].
Parameters¶
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- k: int, optional (default=10)
The number of sub generators.
- stop_epochs: int, optional (default=20)
The number of epochs of training. The total number of epochs equals three times stop_epochs.
- lr_d: float, optional (default=0.01)
The learning rate of the discriminator.
- lr_g: float, optional (default=0.0001)
The learning rate of the generator.
- momentum: float, optional (default=0.9)
The momentum parameter for SGD.
Attributes¶
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
How the anomaly score of an input sample is computed depends on the detector algorithm. For consistency, outliers are assigned larger anomaly scores.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scores: numpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoring: str, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- params: mapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default=’linear’)
Probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, the outlier probability according to the fitted model, ranging in [0,1]. The number of columns depends on the number of classes, which is 2 by default ([probability of normal, probability of outlier]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict whether a particular sample is an outlier, allowing the detector to reject (i.e., output = -2) low-confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.ocsvm module¶
One-class SVM detector. Implemented on top of the scikit-learn library.
- class pyod.models.ocsvm.OCSVM(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, contamination=0.1)[source]¶
Bases: BaseDetector
Wrapper of the scikit-learn one-class SVM class with additional functionality for unsupervised outlier detection.
Estimate the support of a high-dimensional distribution.
The implementation is based on libsvm. See http://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection and [BScholkopfPST+01].
Parameters¶
- kernel: string, optional (default=’rbf’)
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.
- nu: float, optional (default=0.5)
An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
- degree: int, optional (default=3)
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
- gamma: float, optional (default=’auto’)
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used instead.
- coef0: float, optional (default=0.0)
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
- tol: float, optional (default=0.001)
Tolerance for stopping criterion.
- shrinking: bool, optional (default=True)
Whether to use the shrinking heuristic.
- cache_size: float, optional (default=200)
Specify the size of the kernel cache (in MB).
- verbose: bool, optional (default=False)
Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
- max_iter: int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.
- contamination: float in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
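To make the role of gamma concrete, the following sketch evaluates the standard RBF kernel exp(-gamma * ||x - y||^2) in plain Python, falling back to 1/n_features when gamma is ‘auto’ as described above. The function name `rbf_kernel` and the sample vectors are hypothetical; this only illustrates the formula and is not the libsvm implementation.

```python
import math


def rbf_kernel(x, y, gamma="auto"):
    """RBF kernel value for two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if gamma == "auto":
        gamma = 1.0 / len(x)  # 1 / n_features
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


k_same = rbf_kernel([0.0, 0.0], [0.0, 0.0])  # identical points
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # gamma = 0.5, squared distance = 2
print(k_same)  # 1.0
print(k_far)   # exp(-1), roughly 0.368
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood.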
Attributes¶
- support_: array-like, shape = [n_SV]
Indices of support vectors.
- support_vectors_: array-like, shape = [n_SV, n_features]
Support vectors.
- dual_coef_: array, shape = [1, n_SV]
Coefficients of the support vectors in the decision function.
- coef_: array, shape = [1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a read-only property derived from dual_coef_ and support_vectors_.
- intercept_: array, shape = [1,]
Constant in the decision function.
- decision_scores_: numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_: float
The threshold is based on contamination. It is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged. The threshold is calculated for generating binary outlier labels.
- labels_: int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
Add a reject option to the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate=True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default=False)
If True, print the expected rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
How the anomaly score of an input sample is computed depends on the detector algorithm. For consistency, outliers are assigned larger anomaly scores.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scores: numpy array of shape (n_samples,)
The anomaly score of the input samples.
- fit(X, y=None, sample_weight=None, **params)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
- sample_weight: array-like, shape (n_samples,)
Per-sample weights. Rescale C per sample. Higher weights force the classifier to put more emphasis on these points.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels of the input samples (0 for inliers, 1 for outliers), used to compute the evaluation score.
- scoring: str, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can then be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- params: mapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Simply use min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
2. Use unifying scores, see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default=’linear’)
Probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, the outlier probability according to the fitted model, ranging in [0,1]. The number of columns depends on the number of classes, which is 2 by default ([probability of normal, probability of outlier]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict whether a particular sample is an outlier, allowing the detector to reject (i.e., output = -2) low-confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default=False)
If True, also return three additional float values: the estimated rejection rate, the upper bound of the rejection rate, and the upper bound of the cost.
- delta: float, optional (default=0.1)
The upper bound of the rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1, 1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.pca module¶
Principal Component Analysis (PCA) Outlier Detector
- class pyod.models.pca.PCA(n_components=None, n_selected_components=None, contamination=0.1, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, weighted=True, standardization=True)[source]¶
Bases:
BaseDetector
Principal component analysis (PCA) can be used in detecting outliers. PCA is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
In this procedure, the covariance matrix of the data is decomposed into orthogonal vectors, called eigenvectors, each associated with an eigenvalue. The eigenvectors with high eigenvalues capture most of the variance in the data.
Therefore, a low-dimensional hyperplane constructed from the k leading eigenvectors can capture most of the variance in the data. Outliers, however, differ from normal data points, and the difference is more obvious on the hyperplane constructed from the eigenvectors with small eigenvalues.
Outlier scores can therefore be obtained as the sum of the projected distances of a sample on all eigenvectors. See [BAgg15, BSCSC03] for details.
Score(X) = sum of the weighted Euclidean distances between each sample and the hyperplane constructed by the selected eigenvectors
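The scoring idea described above can be sketched with NumPy. This is only an illustration under stated assumptions, not the library's implementation: the helper name pca_outlier_scores is hypothetical, the data are standardized first, and each distance is divided by the explained variance ratio of its eigenvector so that directions with small eigenvalues weigh more.

```python
import numpy as np

def pca_outlier_scores(X, weighted=True):
    """Sum of (optionally weighted) Euclidean distances from each sample to every eigenvector."""
    # Standardize to zero mean and unit variance (assumption: mirrors standardization=True).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decomposition of the covariance matrix of the standardized data.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    ratios = eigvals / eigvals.sum()      # explained variance ratio per eigenvector
    components = eigvecs.T                # one eigenvector per row
    # Distance between every sample and every eigenvector: shape (n_samples, n_components).
    dists = np.linalg.norm(Z[:, None, :] - components[None, :, :], axis=2)
    if weighted:
        # Dividing by the ratio gives small-eigenvalue directions more influence.
        dists = dists / ratios
    return dists.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.vstack([X, [10.0, -10.0, 10.0]])   # planted outlier at index 200
scores = pca_outlier_scores(X)
print(int(np.argmax(scores)))
```

The planted point lies far from every eigenvector, so it receives the largest score.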
Parameters¶
- n_componentsint, float, None or string
Number of components to keep. If n_components is not set, all components are kept:
n_components == min(n_samples, n_features)
If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension.
If 0 < n_components < 1 and svd_solver == ‘full’, the number of components is selected such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
n_components cannot be equal to n_features for svd_solver == ‘arpack’.
- n_selected_componentsint, optional (default=None)
Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
- copybool (default True)
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
- whitenbool, optional (default False)
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
- svd_solverstring {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
- auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
- full :
run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
- arpack :
run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1]
- randomized :
run randomized SVD by the method of Halko et al.
- tolfloat >= 0, optional (default .0)
Tolerance for singular values computed by svd_solver == ‘arpack’.
- iterated_powerint >= 0, or ‘auto’, (default ‘auto’)
Number of iterations for the power method computed by svd_solver == ‘randomized’.
- random_stateint, RandomState instance or None, optional (default None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
- weightedbool, optional (default=True)
If True, the eigenvalues are used in score computation. The eigenvectors with small eigenvalues carry more importance in the outlier score calculation.
- standardizationbool, optional (default=True)
If True, perform standardization first to convert data to zero mean and unit variance. See http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
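When n_components is given as a fraction between 0 and 1, the number of components is chosen from the cumulative explained variance. The selection rule can be sketched as follows, assuming the eigenvalues of the covariance matrix are already known; choose_n_components is a hypothetical helper, not part of the library.

```python
def choose_n_components(eigenvalues, variance_threshold):
    """Smallest k whose cumulative explained-variance ratio exceeds the threshold."""
    ordered = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total               # add this component's variance ratio
        if cumulative > variance_threshold:
            return k
    return len(ordered)

eigenvalues = [4.0, 3.0, 2.0, 1.0]
k = choose_n_components(eigenvalues, 0.8)
print(k)
```

With these eigenvalues the first two components explain about 70% of the variance, so a threshold of 0.8 needs three components.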
Attributes¶
- components_array, shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
- explained_variance_array, shape (n_components,)
The amount of variance explained by each of the selected components.
Equal to the n_components largest eigenvalues of the covariance matrix of X.
- explained_variance_ratio_array, shape (n_components,)
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.
- singular_values_array, shape (n_components,)
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
- mean_array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
- n_components_int
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or n_features if n_components is None.
- noise_variance_float
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of the (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
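The relationship between contamination, threshold_ and labels_ can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: threshold_and_labels is a hypothetical helper, and the threshold is taken as the score of the (n_samples * contamination)-th most abnormal sample, whereas the library may compute it differently (for example with an interpolated percentile).

```python
def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a score threshold and binary labels from a contamination rate."""
    # Number of training samples expected to be outliers (at least one).
    n_outliers = max(1, round(len(decision_scores) * contamination))
    # Score of the n_outliers-th most abnormal sample.
    threshold = sorted(decision_scores, reverse=True)[n_outliers - 1]
    # 1 marks outliers, 0 marks inliers.
    labels = [1 if s >= threshold else 0 for s in decision_scores]
    return threshold, labels

scores = [0.2, 0.1, 0.9, 0.3, 0.25, 0.15, 0.4, 0.35, 1.5, 0.05]
threshold, labels = threshold_and_labels(scores, contamination=0.2)
print(threshold, labels)
```

With ten scores and contamination=0.2, the two highest-scoring samples are labelled 1.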
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- property explained_variance_¶
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
Decorator for scikit-learn PCA attributes.
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- selfobject
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground-truth binary labels (0 for inliers, 1 for outliers), against which the evaluation metric is computed.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
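The two metrics named above can be computed directly from ground-truth labels and outlier scores once a detector has been fitted. The following pure-Python sketch shows one way to do so; roc_auc and precision_at_rank_n are hypothetical helpers written for illustration, and in practice an evaluation library would normally be used instead.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random inlier (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n top-scored samples, with n = number of true outliers."""
    n = sum(y_true)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / n

y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
print(roc_auc(y_true, scores), precision_at_rank_n(y_true, scores))
```

Here five of the six outlier/inlier pairs are ranked correctly, and one of the two top-scored samples is a true outlier.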
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- property noise_variance_¶
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of the (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
Decorator for scikit-learn PCA attributes.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
‘linear’: use min-max conversion to linearly transform the outlier scores into the range [0,1]. The model must be fitted first.
‘unify’: use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
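The min-max conversion can be sketched in a few lines. This is an illustration, not the library's code: linear_outlier_probability is a hypothetical helper, and it assumes the scaling bounds come from the training scores and that values outside that range are clipped to [0,1].

```python
def linear_outlier_probability(train_scores, test_scores):
    """Min-max scale test scores with the training range; return [P(normal), P(outlier)] rows."""
    lo, hi = min(train_scores), max(train_scores)
    probabilities = []
    for score in test_scores:
        p_outlier = (score - lo) / (hi - lo)
        p_outlier = min(1.0, max(0.0, p_outlier))   # clip scores outside the training range
        probabilities.append([1.0 - p_outlier, p_outlier])
    return probabilities

train_scores = [1.0, 2.0, 3.0, 5.0]
proba = linear_outlier_probability(train_scores, [0.5, 3.0, 9.0])
print(proba)
```

Each row has two entries, matching the documented shape (n_samples, n_classes) with two classes.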
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
- Predict if a particular sample is an outlier or not,
allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
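To see how T drives the number of rejections, the threshold 1 - 2exp(-T) can be evaluated directly. The sketch below is illustrative only: apply_rejection is a hypothetical helper, and it assumes a prediction is rejected when its confidence does not exceed the threshold.

```python
import math

REJECT = -2  # label documented above for rejected predictions

def rejection_threshold(T=32):
    """Rejection threshold 1 - 2*exp(-T); it approaches 1 as T grows."""
    return 1.0 - 2.0 * math.exp(-T)

def apply_rejection(labels, confidences, T=32):
    threshold = rejection_threshold(T)
    # Assumption: keep a label only when its confidence is strictly above the threshold.
    return [label if conf > threshold else REJECT
            for label, conf in zip(labels, confidences)]

labels = [0, 1, 1, 0]
confidences = [1.0, 0.7, 1.0, 0.2]
with_reject = apply_rejection(labels, confidences, T=1)
print(rejection_threshold(1), with_reject)
```

With T=1 the threshold is roughly 0.26, so only the least confident prediction is rejected; with the default T=32 the threshold is so close to 1 that only fully confident predictions are kept.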
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
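The <component>__<parameter> naming convention used by set_params can be illustrated with a toy example. Everything here is hypothetical (the Component and Wrapper classes and the set_nested_params helper); it only mimics the idea of splitting a name on the double underscore and is not scikit-learn's implementation.

```python
class Component:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

class Wrapper:
    def __init__(self, component, steps=3):
        self.component = component
        self.steps = steps

def set_nested_params(obj, **params):
    """Set plain attributes directly and route 'name__sub' keys to the nested object."""
    for key, value in params.items():
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            # 'component__alpha' -> set 'alpha' on obj.component
            set_nested_params(getattr(obj, name), **{sub_key: value})
        else:
            setattr(obj, name, value)
    return obj

model = set_nested_params(Wrapper(Component()), steps=5, component__alpha=0.25)
print(model.steps, model.component.alpha)
```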
pyod.models.qmcd module¶
Quasi-Monte Carlo Discrepancy outlier detection (QMCD)
- class pyod.models.qmcd.QMCD(contamination=0.1)[source]¶
Bases:
BaseDetector
The Wrap-around Quasi-Monte Carlo discrepancy is a uniformity criterion which is used to assess the space filling of a number of samples in a hypercube. It quantifies the distance between the continuous uniform distribution on a hypercube and the discrete uniform distribution on distinct sample points. Therefore, a lower discrepancy value for a sample point indicates that it provides better coverage of the parameter space with regard to the rest of the samples. This method is kernel based: the higher the discrepancy score of a sample relative to the rest of the samples, the higher the likelihood of it being an outlier. See [BFM01] for details.
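For intuition, the squared wrap-around L2 discrepancy of a whole point set in the unit hypercube can be computed with a short pure-Python function. This sketch shows the global criterion only; wraparound_discrepancy is a hypothetical helper, and the per-sample score produced by the detector is derived differently.

```python
def wraparound_discrepancy(points):
    """Squared wrap-around L2 discrepancy of points in [0, 1]^d."""
    n = len(points)
    d = len(points[0])
    total = 0.0
    for x in points:
        for y in points:
            prod = 1.0
            for xk, yk in zip(x, y):
                diff = abs(xk - yk)
                # Kernel term per dimension; diff * (1 - diff) is the wrap-around distance term.
                prod *= 1.5 - diff * (1.0 - diff)
            total += prod
    return -((4.0 / 3.0) ** d) + total / n ** 2

spread = [[0.25], [0.75]]      # evenly spread points
clustered = [[0.5], [0.5]]     # both points at the same location
print(wraparound_discrepancy(spread), wraparound_discrepancy(clustered))
```

The evenly spread set yields the lower value, consistent with lower discrepancy meaning better coverage.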
Parameters¶
Attributes¶
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The independent and dependent/target samples, with the target samples being the last column of the numpy array, e.g., X = np.append(x, y.reshape(-1,1), axis=1). Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
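The input layout described for X, with the target as the last column, can be assembled as shown below. The array values are made up for illustration.

```python
import numpy as np

x = np.array([[0.1, 0.2], [0.4, 0.5], [0.7, 0.8]])   # independent variables, shape (3, 2)
y = np.array([1.0, 2.0, 3.0])                        # target values, shape (3,)

# Append the target as the last column so that X has shape (n_samples, n_features + 1).
X = np.append(x, y.reshape(-1, 1), axis=1)
print(X.shape)
```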
- fit(X, y=None)[source]¶
Fit detector
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- yIgnored
Not used, present for API consistency by convention.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- ynumpy array of shape (n_samples,)
The ground-truth binary labels (0 for inliers, 1 for outliers), against which the evaluation metric is computed.
- scoringstr, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
score : float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deepbool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- paramsmapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidencenumpy array of shape (n_samples,).
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidencenumpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set was perturbed. Return a probability, ranging in [0,1].
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being outlier. Two approaches are possible:
simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- methodstr, optional (default=’linear’)
probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidenceboolean, optional(default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probabilitynumpy array of shape (n_samples, n_classes)
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1]. Note it depends on the number of classes, which is by default 2 classes ([proba of normal, proba of outliers]).
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
- Predict if a particular sample is an outlier or not,
allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The input samples.
- Tint, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labelsnumpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object. See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Returns¶
self : object
pyod.models.rgraph module¶
R-graph
- class pyod.models.rgraph.RGraph(transition_steps=10, n_nonzero=10, gamma=50.0, gamma_nz=True, algorithm='lasso_lars', tau=1.0, maxiter_lasso=1000, preprocessing=True, contamination=0.1, blocksize_test_data=10, support_init='L2', maxiter=40, support_size=100, active_support=True, fit_intercept_LR=False, verbose=True)[source]¶
Bases:
BaseDetector
Outlier Detection via R-graph. Paper: https://openaccess.thecvf.com/content_cvpr_2017/papers/You_Provable_Self-Representation_Based_CVPR_2017_paper.pdf See [BYRV17] for details.
Parameters¶
- transition_stepsint, optional (default=10)
Number of transition steps that are taken in the graph, after which the outlier scores are determined.
gamma : float
- gamma_nzboolean, default True
gamma and gamma_nz together determine the parameter alpha. When gamma_nz = False, alpha = gamma. When gamma_nz = True, alpha = gamma * alpha0, where alpha0 is the largest number such that the solution to the optimization problem with alpha = alpha0 is the zero vector (see Proposition 1 in [1]). Therefore, when gamma_nz = True, gamma should be a value greater than 1.0. A good choice is typically in the range [5, 500].
- taufloat, default 1.0
Parameter for elastic net penalty term. When tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. When tau = 0.0, the method reduces to least squares regression (LSR).
- algorithmstring, default lasso_lars
Algorithm for computing the representation. Either lasso_lars or lasso_cd. Note: lasso_lars and lasso_cd only support tau = 1. For cases tau << 1 linear regression is used.
- fit_intercept_LR: bool, optional (default=False)
For gamma > 10000 linear regression is used instead of lasso_lars or lasso_cd. This parameter determines whether the intercept for the model is calculated.
- maxiter_lassoint, default 1000
The maximum number of iterations for lasso_lars and lasso_cd.
- n_nonzeroint, default 10
This is an upper bound on the number of nonzero entries of each representation vector. If there are more than n_nonzero nonzero entries, only the top n_nonzero number of entries with largest absolute value are kept.
- active_support: boolean, default True
Set to True to use the active support algorithm in [1] for solving the optimization problem. This should significantly reduce the running time when n_samples is large.
- active_support_params: dictionary of string to any, optional
Parameters (keyword arguments) and values for the active support algorithm. It may be used to set the parameters support_init, support_size and maxiter, see active_support_elastic_net for details. Example: active_support_params={‘support_size’:50, ‘maxiter’:100}. Ignored when active_support=False.
- preprocessingbool, optional (default=True)
If True, apply standardization on the data.
- verboseint, optional (default=1)
Verbosity mode.
0 = silent
1 = progress bar
2 = one line per epoch.
For verbose >= 1, model summary may be printed.
- random_state: int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- contaminationfloat in (0., 0.5), optional (default=0.1)
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function.
- blocksize_test_data: int, optional (default=10)
The test set is split into blocks of size blocksize_test_data to at least partially separate the test and train sets.
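The truncation controlled by n_nonzero, which keeps only the entries of a representation vector with the largest absolute value, can be sketched as follows. keep_largest_entries is a hypothetical helper written for illustration and does not reproduce the library's code.

```python
def keep_largest_entries(vector, n_nonzero):
    """Zero out all but the n_nonzero entries with the largest absolute value."""
    if sum(1 for v in vector if v != 0) <= n_nonzero:
        return list(vector)                      # already sparse enough
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:n_nonzero])                # indices of the largest-magnitude entries
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

c = [0.05, -0.9, 0.3, 0.0, -0.4]
sparse_c = keep_largest_entries(c, n_nonzero=2)
print(sparse_c)
```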
Attributes¶
- transition_matrix_numpy array of shape (n_samples,)
Transition matrix from the last fitted data, this might include training + test data
- decision_scores_numpy array of shape (n_samples,)
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.
- threshold_float
The threshold is based on contamination. It is the n_samples * contamination most abnormal samples in decision_scores_. The threshold is calculated for generating binary outlier labels.
- labels_int, either 0 or 1
The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.
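The role of transition_steps and the transition matrix can be illustrated with a small random-walk sketch. It rests on assumptions drawn from the general idea of the R-graph approach rather than from the library's code: the walk starts from a uniform distribution, the visiting probabilities are averaged over the steps, and the negated average is used as the outlier score so that rarely visited samples score higher. random_walk_scores is a hypothetical helper.

```python
def random_walk_scores(transition_matrix, transition_steps=10):
    """Negated average visiting probability of each node over a fixed number of steps."""
    n = len(transition_matrix)
    pi = [1.0 / n] * n                      # uniform starting distribution
    pi_sum = [0.0] * n
    for _ in range(transition_steps):
        # One step of the Markov chain: pi <- pi @ P
        pi = [sum(pi[i] * transition_matrix[i][j] for i in range(n)) for j in range(n)]
        pi_sum = [a + b for a, b in zip(pi_sum, pi)]
    return [-total / transition_steps for total in pi_sum]

# Nodes 0 and 1 point to each other; node 2 points to both, but nothing points back to it.
P = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
scores = random_walk_scores(P, transition_steps=10)
print(scores.index(max(scores)))
```

Node 2 is never visited after the first step, so it receives the highest score.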
- active_support_elastic_net(X, y, alpha, tau=1.0, algorithm='lasso_lars', support_init='L2', support_size=100, maxiter=40, maxiter_lasso=1000)[source]¶
- Source: https://github.com/ChongYou/subspace-clustering/blob/master/cluster/selfrepresentation.py
An active support based algorithm for solving the elastic net optimization problem min_{c} tau ||c||_1 + (1-tau)/2 ||c||_2^2 + alpha / 2 ||y - c X ||_2^2.
Parameters¶
X : array-like, shape (n_samples, n_features)
y : array-like, shape (1, n_features)
alpha : float
tau : float, default 1.0
- algorithmstring, default lasso_lars
Algorithm for solving the subproblems. Either lasso_lars or lasso_cd or spams (installation of spams package is required). Note: lasso_lars and lasso_cd only support tau = 1.
- support_init: string, default L2
This determines how the active support is initialized. It can be either knn or L2.
- support_size: int, default 100
This determines the size of the working set. A small support_size decreases the runtime per iteration while increasing the number of iterations.
- maxiter: int, default 40
Termination condition for active support update.
Returns¶
- carray of shape (n_samples,)
The optimal solution to the optimization problem.
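The objective being minimized, tau ||c||_1 + (1-tau)/2 ||c||_2^2 + alpha / 2 ||y - c X||_2^2, can be evaluated for a candidate c with a few lines of pure Python. elastic_net_objective is a hypothetical helper for checking values by hand; it does not solve the optimization problem.

```python
def elastic_net_objective(c, X, y, alpha, tau=1.0):
    """Evaluate tau*||c||_1 + (1-tau)/2*||c||_2^2 + alpha/2*||y - c X||_2^2."""
    n_features = len(y)
    # c X: linear combination of the rows of X with coefficients c.
    reconstruction = [sum(c[i] * X[i][k] for i in range(len(c))) for k in range(n_features)]
    l1 = sum(abs(ci) for ci in c)
    l2_sq = sum(ci * ci for ci in c)
    residual_sq = sum((y[k] - reconstruction[k]) ** 2 for k in range(n_features))
    return tau * l1 + (1.0 - tau) / 2.0 * l2_sq + alpha / 2.0 * residual_sq

X = [[1.0, 0.0], [0.0, 1.0]]   # two samples, two features
y = [1.0, 2.0]
print(elastic_net_objective([0.0, 0.0], X, y, alpha=2.0))
print(elastic_net_objective([1.0, 2.0], X, y, alpha=2.0))
```

The zero vector pays only the reconstruction penalty, while c = [1, 2] reconstructs y exactly and pays only the sparsity penalty.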
- compute_rejection_stats(T=32, delta=0.1, c_fp=1, c_fn=1, c_r=-1, verbose=False)¶
- Add reject option into the unsupervised detector.
This comes with guarantees: an estimate of the expected rejection rate (return_rejectrate=True), an upper bound of the rejection rate (return_ub_rejectrate= True), and an upper bound on the cost (return_ub_cost=True).
Parameters¶
- T: int, optional(default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
Costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
- verbose: bool, optional (default = False)
If true, it prints the expected rejection rate, the upper bound rejection rate, and the upper bound of the cost.
Returns¶
- expected_rejection_rate: float, the expected rejection rate
- upperbound_rejection_rate: float, the upper bound for the rejection rate, satisfied with probability 1-delta
- upperbound_cost: float, the upper bound for the cost
- decision_function(X)[source]¶
Predict raw anomaly score of X using the fitted detector.
The anomaly score of an input sample is computed based on different detector algorithms. For consistency, outliers are assigned with larger anomaly scores.
Parameters¶
- Xnumpy array of shape (n_samples, n_features)
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns¶
- anomaly_scoresnumpy array of shape (n_samples,)
The anomaly score of the input samples.
- elastic_net_subspace_clustering(X, gamma=50.0, gamma_nz=True, tau=1.0, algorithm='lasso_lars', fit_intercept_LR=False, active_support=True, active_support_params=None, n_nonzero=50, maxiter_lasso=1000)[source]¶
Source: https://github.com/ChongYou/subspace-clustering/blob/master/cluster/selfrepresentation.py
Elastic net subspace clustering (EnSC) [1]. Compute self-representation matrix C from solving the following optimization problem min_{c_j} tau ||c_j||_1 + (1-tau)/2 ||c_j||_2^2 + alpha / 2 ||x_j - c_j X ||_2^2 s.t. c_jj = 0, where c_j and x_j are the j-th rows of C and X, respectively.
The algorithm parameter specifies the algorithm for solving the optimization problem. lasso_lars and lasso_cd are algorithms implemented in sklearn; spams refers to the same algorithm as lasso_lars but is implemented in the spams package, available at http://spams-devel.gforge.inria.fr/ (installation required). In principle, all three algorithms give the same result.
For large-scale data (e.g. with > 5000 data points), use any of these algorithms in conjunction with active_support=True. It adopts an efficient active support strategy that solves the optimization problem by breaking it into a sequence of small-scale optimization problems, as described in [1].
If tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. If tau = 0.0, the method reduces to least squares regression (LSR) [3]. Note: lasso_lars and lasso_cd only support tau = 1.
Parameters¶
- X: array-like, shape (n_samples, n_features)
Input data to be clustered.
- gamma: float
- gamma_nz: boolean, default True
gamma and gamma_nz together determine the parameter alpha. When gamma_nz = False, alpha = gamma. When gamma_nz = True, alpha = gamma * alpha0, where alpha0 is the largest number such that the solution to the optimization problem with alpha = alpha0 is the zero vector (see Proposition 1 in [1]). Therefore, when gamma_nz = True, gamma should be a value greater than 1.0. A good choice is typically in the range [5, 500].
- tau: float, default 1.0
Parameter for elastic net penalty term. When tau = 1.0, the method reduces to sparse subspace clustering with basis pursuit (SSC-BP) [2]. When tau = 0.0, the method reduces to least squares regression (LSR) [3].
- algorithm: string, default lasso_lars
Algorithm for computing the representation. Either lasso_lars or lasso_cd or spams (installation of spams package is required). Note: lasso_lars and lasso_cd only support tau = 1.
- n_nonzero: int, default 50
This is an upper bound on the number of nonzero entries of each representation vector. If there are more than n_nonzero nonzero entries, only the top n_nonzero number of entries with largest absolute value are kept.
- active_support: boolean, default True
Set to True to use the active support algorithm in [1] for solving the optimization problem. This should significantly reduce the running time when n_samples is large.
- active_support_params: dictionary of string to any, optional
Parameters (keyword arguments) and values for the active support algorithm. It may be used to set the parameters support_init, support_size and maxiter; see active_support_elastic_net for details. Example: active_support_params={'support_size': 50, 'maxiter': 100}. Ignored when active_support=False.
Returns¶
- representation_matrix_: csr matrix, shape: n_samples by n_samples
The self-representation matrix.
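The n_nonzero parameter described above caps the number of nonzero entries in each representation vector by keeping only the largest-magnitude ones. A minimal sketch of that truncation step follows; the helper name `keep_top_entries` is hypothetical, and the tie-breaking rule (earlier index wins) is an assumption of this sketch.

```python
def keep_top_entries(c, n_nonzero):
    """Zero out all but the n_nonzero largest-magnitude entries of c.

    Hypothetical helper for illustration; ties are broken in favour of
    the earlier index because Python's sort is stable.
    """
    if sum(1 for v in c if v != 0) <= n_nonzero:
        return list(c)  # already sparse enough, nothing to drop
    order = sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)
    keep = set(order[:n_nonzero])
    return [v if i in keep else 0.0 for i, v in enumerate(c)]


row = [0.1, -0.5, 0.3, 0.0, 0.2]
print(keep_top_entries(row, 2))
```

Only -0.5 and 0.3 survive in this example, since they have the two largest absolute values.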
References¶
[1] C. You, C.-G. Li, D. Robinson, R. Vidal, Oracle Based Active Set Algorithm for Scalable Elastic Net Subspace Clustering, CVPR 2016
[2] E. Elhamifar, R. Vidal, Sparse Subspace Clustering: Algorithm, Theory, and Applications, TPAMI 2013
[3] C. Lu, et al. Robust and efficient subspace segmentation via least squares regression, ECCV 2012
- fit(X, y=None)[source]¶
Fit detector. y is ignored in unsupervised methods.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- self: object
Fitted estimator.
- fit_predict(X, y=None)¶
Fit detector first and then predict whether a particular sample is an outlier or not. y is ignored in unsupervised models.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: Ignored
Not used, present for API consistency by convention.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Deprecated since version 0.6.9: fit_predict will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency.
- fit_predict_score(X, y, scoring='roc_auc_score')¶
Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- y: numpy array of shape (n_samples,)
The ground truth labels used to compute the evaluation score (0 for inliers, 1 for outliers).
- scoring: str, optional (default=’roc_auc_score’)
Evaluation metric:
‘roc_auc_score’: ROC score
‘prc_n_score’: Precision @ rank n score
Returns¶
- score: float
Deprecated since version 0.6.9: fit_predict_score will be removed in pyod 0.8.0; it will be replaced by calling the fit function first and then accessing the labels_ attribute for consistency. Scoring can be done by calling an evaluation method, e.g., AUC ROC.
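The 'prc_n_score' option above refers to precision at rank n. One common reading of that metric is sketched below in plain Python: take n to be the number of true outliers, rank samples by score, and measure the fraction of true outliers among the top n. The function name `precision_at_rank_n`, the choice of n, and the handling of ties are assumptions of this sketch and may differ from the library's implementation.

```python
def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n highest-scoring samples.

    n is taken to be the number of true outliers in y_true (assumption).
    y_true uses 1 for outliers and 0 for inliers.
    """
    n = sum(y_true)
    if n == 0:
        raise ValueError("y_true contains no outliers")
    # Indices sorted from highest to lowest outlier score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_n = ranked[:n]
    return sum(y_true[i] for i in top_n) / n


y_true = [0, 0, 1, 1, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.3]
print(precision_at_rank_n(y_true, scores))
```

Here only one of the two top-ranked samples is a true outlier, so the score is one half.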
- get_params(deep=True)¶
Get parameters for this estimator.
See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.
Parameters¶
- deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns¶
- params: mapping of string to any
Parameter names mapped to their values.
- predict(X, return_confidence=False)¶
Predict if a particular sample is an outlier or not.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
- confidence: numpy array of shape (n_samples,)
Only if return_confidence is set to True.
- predict_confidence(X)¶
Predict the model’s confidence in making the same prediction under slightly different training sets. See [BPVD20].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
Returns¶
- confidence: numpy array of shape (n_samples,)
For each observation, tells how consistently the model would make the same prediction if the training set were perturbed. Returns a probability in the range [0, 1].
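One way such a confidence could be estimated is sketched below in pure Python. This is an assumption-laden illustration, not the library's code: it treats the smoothed fraction of training scores at or below a test score as a success probability, then uses a binomial tail to ask how likely the sample is to land among the outliers under resampling. The names `binom_cdf` and `confidence_sketch`, and the exact estimator, are hypothetical; see [BPVD20] for the actual method.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p), computed directly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def confidence_sketch(train_scores, test_scores, contamination, threshold):
    """Illustrative confidence estimate (assumed form, not the library's)."""
    n = len(train_scores)
    n_inliers = n - int(n * contamination)
    result = []
    for s in test_scores:
        # Smoothed fraction of training scores not exceeding the test score.
        n_lower = sum(1 for t in train_scores if t <= s)
        p = (1 + n_lower) / (2 + n)
        # Probability of being ranked among the outliers; clamped against
        # floating-point round-off.
        p_outlier = min(1.0, max(0.0, 1.0 - binom_cdf(n_inliers, n, p)))
        is_outlier = s > threshold
        result.append(p_outlier if is_outlier else 1.0 - p_outlier)
    return result


train = [float(v) for v in range(1, 11)]
conf = confidence_sketch(train, [0.0, 100.0], contamination=0.1, threshold=9.5)
print(conf)
```

In this toy run, a score far below every training score gives a confident inlier prediction, while the small training set keeps the confidence of the outlier prediction moderate.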
- predict_proba(X, method='linear', return_confidence=False)¶
Predict the probability of a sample being an outlier. Two approaches are possible:
1. Use min-max conversion to linearly transform the outlier scores into the range [0, 1]. The model must be fitted first.
2. Use unifying scores; see [BKKSZ11].
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- method: str, optional (default=’linear’)
Probability conversion method. It must be one of ‘linear’ or ‘unify’.
- return_confidence: boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
- outlier_probability: numpy array of shape (n_samples, n_classes)
For each observation, gives the probability of it being an outlier according to the fitted model, in the range [0, 1]. The number of columns depends on the number of classes, which is 2 by default ([probability of normal, probability of outlier]).
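The 'linear' conversion can be sketched as a min-max rescaling of outlier scores. The version below is a minimal illustration that assumes the scaling range comes from the training scores and that out-of-range test scores are clipped to [0, 1]; both choices, and the name `linear_outlier_proba`, are assumptions of this sketch rather than confirmed library behaviour.

```python
def linear_outlier_proba(train_scores, test_scores):
    """Min-max rescale test scores using the training score range.

    Returns one [p_normal, p_outlier] row per test sample.
    """
    lo, hi = min(train_scores), max(train_scores)
    span = hi - lo
    rows = []
    for s in test_scores:
        p_out = 0.0 if span == 0 else (s - lo) / span
        p_out = min(1.0, max(0.0, p_out))  # clip scores outside the range
        rows.append([1.0 - p_out, p_out])
    return rows


proba = linear_outlier_proba([1.0, 2.0, 3.0, 5.0], [1.0, 3.0, 10.0])
print(proba)
```

Each row sums to 1, matching the two-column layout described in the Returns section.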
- predict_with_rejection(X, T=32, return_stats=False, delta=0.1, c_fp=1, c_fn=1, c_r=-1)¶
Predict if a particular sample is an outlier or not, allowing the detector to reject (i.e., output = -2) low confidence predictions.
Parameters¶
- X: numpy array of shape (n_samples, n_features)
The input samples.
- T: int, optional (default=32)
Sets the rejection threshold to 1 - 2exp(-T). The higher the value of T, the more rejections are made.
- return_stats: bool, optional (default = False)
If True, also returns three additional float values: the estimated rejection rate, the upper bound rejection rate, and the upper bound of the cost.
- delta: float, optional (default = 0.1)
The upper bound rejection rate holds with probability 1-delta.
- c_fp, c_fn, c_r: floats (positive), optional (default = [1,1, contamination])
costs for false positive predictions (c_fp), false negative predictions (c_fn) and rejections (c_r).
Returns¶
- outlier_labels: numpy array of shape (n_samples,)
For each observation, it tells whether it should be considered as an outlier according to the fitted model. 0 stands for inliers, 1 for outliers and -2 for rejection.
- expected_rejection_rate: float, if return_stats is True
- upperbound_rejection_rate: float, if return_stats is True
- upperbound_cost: float, if return_stats is True
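The output convention (0, 1, or -2 for a rejected prediction) can be illustrated with a small sketch. It assumes each prediction comes with a confidence value in [0, 1] and that a prediction is rejected when its confidence does not exceed 1 - 2exp(-T); the exact comparison the library performs is not stated here, so treat the rule and the name `apply_rejection` as hypothetical.

```python
import math

REJECT = -2  # label used for rejected predictions


def apply_rejection(labels, confidences, T=32):
    """Replace low-confidence labels with REJECT (assumed comparison rule)."""
    threshold = 1.0 - 2.0 * math.exp(-T)
    return [
        REJECT if conf <= threshold else label
        for label, conf in zip(labels, confidences)
    ]


labels = [0, 1, 1]
confidences = [0.9, 0.2, 0.99]
print(apply_rejection(labels, confidences, T=2))
```

With T=2 the threshold is roughly 0.73, so only the middle prediction is rejected in this example; a larger T raises the threshold and rejects more.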
- set_params(**params)¶
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form