Utility Functions#

pyod.utils.data module#

Utility functions for manipulating data

pyod.utils.data.check_consistent_shape(X_train, y_train, X_test, y_test, y_train_pred, y_test_pred)[source]#

Internal function to check that input data shapes are consistent.

Parameters#

X_train : numpy array of shape (n_samples, n_features)

The training samples.

y_train : list or array of shape (n_samples,)

The ground truth of training samples.

X_test : numpy array of shape (n_samples, n_features)

The test samples.

y_test : list or array of shape (n_samples,)

The ground truth of test samples.

y_train_pred : numpy array of shape (n_samples,)

The predicted binary labels of the training samples.

y_test_pred : numpy array of shape (n_samples,)

The predicted binary labels of the test samples.

Returns#

X_train : numpy array of shape (n_samples, n_features)

The training samples.

y_train : list or array of shape (n_samples,)

The ground truth of training samples.

X_test : numpy array of shape (n_samples, n_features)

The test samples.

y_test : list or array of shape (n_samples,)

The ground truth of test samples.

y_train_pred : numpy array of shape (n_samples,)

The predicted binary labels of the training samples.

y_test_pred : numpy array of shape (n_samples,)

The predicted binary labels of the test samples.

pyod.utils.data.evaluate_print(clf_name, y, y_pred)[source]#

Utility function for evaluating and printing the results for examples. Default metrics include ROC and Precision @ n.

Parameters#

clf_name : str

The name of the detector.

y : list or numpy array of shape (n_samples,)

The ground truth. Binary (0: inliers, 1: outliers).

y_pred : list or numpy array of shape (n_samples,)

The raw outlier scores as returned by a fitted model.
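To make the ROC metric concrete, here is a dependency-free sketch that computes ROC AUC as the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counting as one half. The helper name `roc_auc_sketch` is hypothetical, and this is an illustration of the metric rather than the code that evaluate_print runs.

```python
def roc_auc_sketch(y, y_pred):
    """ROC AUC via pairwise comparison of outlier and inlier scores."""
    outlier_scores = [s for label, s in zip(y, y_pred) if label == 1]
    inlier_scores = [s for label, s in zip(y, y_pred) if label == 0]
    if not outlier_scores or not inlier_scores:
        raise ValueError("y must contain both inliers (0) and outliers (1)")

    wins = 0.0
    for out_score in outlier_scores:
        for in_score in inlier_scores:
            if out_score > in_score:
                wins += 1.0      # outlier ranked above inlier
            elif out_score == in_score:
                wins += 0.5      # ties count as half a win
    return wins / (len(outlier_scores) * len(inlier_scores))


y = [0, 1, 1, 0, 0]
y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
auc = roc_auc_sketch(y, y_pred)  # 4 of the 6 outlier/inlier pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so it is only meant for small illustrative inputs.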

pyod.utils.data.generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1, train_only=False, offset=10, behaviour='new', random_state=None, n_nan=0, n_inf=0)[source]#

Utility function to generate synthesized data. Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution. With the default behaviour='new', “X_train, X_test, y_train, y_test” are returned.

Parameters#

n_train : int, optional (default=1000)

The number of training points to generate.

n_test : int, optional (default=500)

The number of test points to generate.

n_features : int, optional (default=2)

The number of features (dimensions).

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

train_only : bool, optional (default=False)

If True, generate training data only.

offset : int, optional (default=10)

Adjusts the value range of the Gaussian and uniform distributions.

behaviour : str, optional (default=’new’)

Behaviour of the returned datasets, which can be either ‘old’ or ‘new’. Passing behaviour='new' returns “X_train, X_test, y_train, y_test”, while passing behaviour='old' returns “X_train, y_train, X_test, y_test”.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

n_nan : int

The number of values that are missing (np.nan). Defaults to zero.

n_inf : int

The number of values that are infinite (np.inf). Defaults to zero.

Returns#

X_train : numpy array of shape (n_train, n_features)

Training data.

X_test : numpy array of shape (n_test, n_features)

Test data.

y_train : numpy array of shape (n_train,)

Training ground truth.

y_test : numpy array of shape (n_test,)

Test ground truth.
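The generation scheme described above (Gaussian inliers, uniform outliers) can be sketched with NumPy alone. The function name `generate_data_sketch`, the unit-variance Gaussian centred at `offset`, and the uniform range [-offset, offset] are all assumptions made for illustration; the exact distributions used by generate_data may differ.

```python
import numpy as np


def generate_data_sketch(n_samples=200, n_features=2, contamination=0.1,
                         offset=10, random_state=None):
    """Return (X, y) with Gaussian inliers (0) and uniform outliers (1)."""
    rng = np.random.RandomState(random_state)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers

    # Assumption: inliers are unit-variance Gaussian noise centred at `offset`.
    inliers = rng.randn(n_inliers, n_features) + offset
    # Assumption: outliers are spread uniformly over [-offset, offset].
    outliers = rng.uniform(-offset, offset, size=(n_outliers, n_features))

    X = np.vstack([inliers, outliers])
    y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
    return X, y


X, y = generate_data_sketch(n_samples=200, contamination=0.1, random_state=42)
```

Fixing `random_state` makes the draw reproducible, which mirrors the role of the random_state parameter documented above.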

pyod.utils.data.generate_data_categorical(n_train=1000, n_test=500, n_features=2, n_informative=2, n_category_in=2, n_category_out=2, contamination=0.1, shuffle=True, random_state=None)[source]#

Utility function to generate synthesized categorical data.

Parameters#

n_train : int, optional (default=1000)

The number of training points to generate.

n_test : int, optional (default=500)

The number of test points to generate.

n_features : int, optional (default=2)

The number of features for each sample.

n_informative : int in (1, n_features), optional (default=2)

The number of informative features in the outlier points. The higher the value, the easier the outlier detection should be. Note that n_informative should not exceed n_features.

n_category_in : int in (1, n_inliers), optional (default=2)

The number of categories in the inlier points.

n_category_out : int in (1, n_outliers), optional (default=2)

The number of categories in the outlier points.

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

shuffle : bool, optional (default=True)

If True, inliers will be shuffled, which makes the distribution noisier.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns#

X_train : numpy array of shape (n_train, n_features)

Training data.

y_train : numpy array of shape (n_train,)

Training ground truth.

X_test : numpy array of shape (n_test, n_features)

Test data.

y_test : numpy array of shape (n_test,)

Test ground truth.

pyod.utils.data.generate_data_clusters(n_train=1000, n_test=500, n_clusters=2, n_features=2, contamination=0.1, size='same', density='same', dist=0.25, random_state=None, return_in_clusters=False)[source]#

Utility function to generate synthesized data in clusters.

The generated data can involve the low-density pattern problem and global outliers, both of which are considered difficult tasks for outlier detection algorithms.

Parameters#

n_train : int, optional (default=1000)

The number of training points to generate.

n_test : int, optional (default=500)

The number of test points to generate.

n_clusters : int, optional (default=2)

The number of centers (i.e. clusters) to generate.

n_features : int, optional (default=2)

The number of features for each sample.

contamination : float in (0., 0.5), optional (default=0.1)

The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

size : str, optional (default=’same’)

Size of each cluster: ‘same’ generates clusters with the same size, ‘different’ generates clusters with different sizes.

density : str, optional (default=’same’)

Density of each cluster: ‘same’ generates clusters with the same density, ‘different’ generates clusters with different densities.

dist : float, optional (default=0.25)

Distance between clusters. Should be between 0. and 1.0. It is used to avoid overlapping clusters as much as possible. However, if the number of samples and the number of clusters are too high, the clusters are unlikely to be fully separated even if dist is set to 1.0.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

return_in_clusters : bool, optional (default=False)

If True, the function returns X_train, y_train, X_test, y_test each as a list of numpy arrays, where each index represents a cluster. If False, it returns X_train, y_train, X_test, y_test each as a single numpy array obtained by joining the per-cluster arrays.

Returns#

X_train : numpy array of shape (n_train, n_features)

Training data.

y_train : numpy array of shape (n_train,)

Training ground truth.

X_test : numpy array of shape (n_test, n_features)

Test data.

y_test : numpy array of shape (n_test,)

Test ground truth.

pyod.utils.data.get_outliers_inliers(X, y)[source]#

Internal method to separate inliers from outliers.

Parameters#

X : numpy array of shape (n_samples, n_features)

The input samples.

y : list or array of shape (n_samples,)

The ground truth of input samples.

Returns#

X_outliers : numpy array of shape (n_outliers, n_features)

Outliers.

X_inliers : numpy array of shape (n_inliers, n_features)

Inliers.
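The separation amounts to boolean masking on the label vector. The NumPy sketch below assumes the labelling convention used elsewhere in this module (1 for outliers, 0 for inliers); it illustrates the idea rather than reproducing the library code.

```python
import numpy as np

X = np.array([[0.0, 0.1],
              [5.0, 5.2],
              [0.2, -0.1],
              [-4.8, 6.0]])
y = np.array([0, 1, 0, 1])

# Boolean masks select the rows whose label matches.
X_outliers = X[y == 1]
X_inliers = X[y == 0]
```

Each returned array keeps all features but only the rows of its own class, so the row counts add up to n_samples.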

pyod.utils.example module#

Utility functions for running examples

pyod.utils.example.data_visualize(X_train, y_train, show_figure=True, save_figure=False)[source]#

Utility function for visualizing the synthetic samples generated by the generate_data_clusters function.

Parameters#

X_train : numpy array of shape (n_samples, n_features)

The training samples.

y_train : list or array of shape (n_samples,)

The ground truth of training samples.

show_figure : bool, optional (default=True)

If set to True, show the figure.

save_figure : bool, optional (default=False)

If set to True, save the figure locally.

pyod.utils.example.visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)[source]#

Utility function for visualizing the results in examples. Internal use only.

Parameters#

clf_name : str

The name of the detector.

X_train : numpy array of shape (n_samples, n_features)

The training samples.

y_train : list or array of shape (n_samples,)

The ground truth of training samples.

X_test : numpy array of shape (n_samples, n_features)

The test samples.

y_test : list or array of shape (n_samples,)

The ground truth of test samples.

y_train_pred : numpy array of shape (n_samples,)

The predicted binary labels of the training samples.

y_test_pred : numpy array of shape (n_samples,)

The predicted binary labels of the test samples.

show_figure : bool, optional (default=True)

If set to True, show the figure.

save_figure : bool, optional (default=False)

If set to True, save the figure locally.

pyod.utils.stat_models module#

A collection of statistical models

pyod.utils.stat_models.column_ecdf(matrix: ndarray) → ndarray[source]#

Utility function to compute the column-wise empirical cumulative distribution of a 2D feature matrix, where the rows are samples and the columns are features per sample. The accumulation is done in the positive direction of the sample axis.

For example, given p(0) = 0.3, p(1) = 0.2, p(2) = 0.1 and p(6) = 0.4, the ECDF E(x) = p(X <= x) takes the values E(-1) = 0, E(0) = 0.3, E(1) = 0.5, E(2) = 0.6, E(3) = 0.6, E(4) = 0.6, E(5) = 0.6, E(6) = 1.

Similar to and tested against: https://www.statsmodels.org/stable/generated/statsmodels.distributions.empirical_distribution.ECDF.html
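The worked example above can be reproduced with a brute-force NumPy sketch. The name `column_ecdf_sketch` is hypothetical, and returning one ECDF value per entry in the original row order is an assumption about the output layout; the comparison-based approach here is chosen for clarity, not speed, and is not the library's implementation.

```python
import numpy as np


def column_ecdf_sketch(matrix):
    """For every entry, the fraction of entries in its column that are <= it."""
    matrix = np.asarray(matrix, dtype=float)
    # less_equal[i, j, k] is True when matrix[j, k] <= matrix[i, k].
    less_equal = matrix[np.newaxis, :, :] <= matrix[:, np.newaxis, :]
    # Averaging over j gives the empirical P(X <= matrix[i, k]) per column k.
    return less_equal.mean(axis=1)


# Ten samples matching p(0) = 0.3, p(1) = 0.2, p(2) = 0.1, p(6) = 0.4.
column = np.array([0, 0, 0, 1, 1, 2, 6, 6, 6, 6], dtype=float).reshape(-1, 1)
ecdf = column_ecdf_sketch(column)
```

The intermediate boolean array has shape (n_samples, n_samples, n_features), so this sketch only suits small inputs.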

Returns#

pyod.utils.stat_models.ecdf_terminate_equals_inplace(matrix: ndarray, probabilities: ndarray)[source]#

This is a helper function for computing the ECDF of an array. It has been separated from the original function so that it can use Numba's njit compiler for increased speed, as it unfortunately needs a loop over all rows and columns of a matrix. It acts in place on the probabilities matrix.

Parameters#

matrix : a feature matrix where the rows are samples and each column is a feature (expected to be sorted!)

probabilities : a probability matrix that will be used to build the ECDF. It has values between 0 and 1 and is also sorted.

Returns#

pyod.utils.stat_models.pairwise_distances_no_broadcast(X, Y)[source]#

Utility function to calculate the row-wise Euclidean distance of two matrices. Unlike a pairwise calculation, this function does not broadcast.

For instance, if X and Y are both (4, 3) matrices, the function returns a distance vector of shape (4,) instead of a (4, 4) matrix.

Parameters#

X : array of shape (n_samples, n_features)

First input samples.

Y : array of shape (n_samples, n_features)

Second input samples.

Returns#

distance : array of shape (n_samples,)

Row-wise Euclidean distance of X and Y.
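A minimal NumPy sketch of the row-wise (non-broadcast) distance follows. The function name `rowwise_euclidean` is hypothetical; the point is only that row i of X is compared with row i of Y, so two (4, 3) inputs yield four distances.

```python
import numpy as np


def rowwise_euclidean(X, Y):
    """Euclidean distance between matching rows of X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    # Sum squared differences across features, one value per row.
    return np.sqrt(np.sum((X - Y) ** 2, axis=1))


X = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]])
Y = np.array([[3, 4, 0], [1, 1, 1], [2, 2, 4], [0, 3, 7]])
distance = rowwise_euclidean(X, Y)  # shape (4,), not (4, 4)
```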

pyod.utils.stat_models.pearsonr_mat(mat, w=None)[source]#

Utility function to calculate the Pearson correlation matrix (row-wise).

Parameters#

mat : numpy array of shape (n_samples, n_features)

Input matrix.

w : numpy array of shape (n_features,)

Weights.

Returns#

pear_mat : numpy array of shape (n_samples, n_samples)

Row-wise Pearson score matrix.

pyod.utils.stat_models.wpearsonr(x, y, w=None)[source]#

Utility function to calculate the weighted Pearson correlation of two samples.

See https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation for more information.

Parameters#

x : array, shape (n,)

Input x.

y : array, shape (n,)

Input y.

w : array, shape (n,)

Weights w.

Returns#

scores : float in range of [-1, 1]

Weighted Pearson correlation between x and y.
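The weighted Pearson correlation replaces plain means and covariances with weighted ones. The sketch below uses the standard textbook definition; the name `weighted_pearson` is hypothetical, and treating a missing `w` as uniform weights is an assumption rather than documented behaviour.

```python
import numpy as np


def weighted_pearson(x, y, w=None):
    """Weighted Pearson correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: no weights means every observation counts equally.
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)

    mean_x = np.sum(w * x) / np.sum(w)
    mean_y = np.sum(w * y) / np.sum(w)
    cov_xy = np.sum(w * (x - mean_x) * (y - mean_y))
    var_x = np.sum(w * (x - mean_x) ** 2)
    var_y = np.sum(w * (y - mean_y) ** 2)
    # The common 1 / sum(w) factor cancels between numerator and denominator.
    return cov_xy / np.sqrt(var_x * var_y)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
w = np.array([1.0, 1.0, 2.0, 2.0, 4.0])

r_weighted = weighted_pearson(x, y, w)
r_unweighted = weighted_pearson(x, y)
```

With uniform weights the result coincides with the ordinary Pearson coefficient, which gives a quick sanity check.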

pyod.utils.utility module#

A set of utility functions to support outlier detection.

pyod.utils.utility.argmaxn(value_list, n, order='desc')[source]#

Return the indices of the top n elements in the list if order is set to ‘desc’; otherwise return the indices of the n smallest ones.

Parameters#

value_list : list, array, numpy array of shape (n_samples,)

A list containing all values.

n : int

The number of elements to select.

order : str, optional (default=’desc’)

The order to sort {‘desc’, ‘asc’}:

  • ‘desc’: descending

  • ‘asc’: ascending

Returns#

index_list : numpy array of shape (n,)

The indices of the top n elements.
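The selection can be sketched with a full sort in NumPy. The name `top_n_indices` is hypothetical, and the returned indices here are ordered by rank, which is an assumption; the library function may return the same indices in a different order.

```python
import numpy as np


def top_n_indices(value_list, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values."""
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    values = np.asarray(value_list)
    ranked = np.argsort(values)  # indices sorted by ascending value
    if order == "desc":
        ranked = ranked[::-1]    # flip to descending value
    return ranked[:n]


values = [0.1, 0.5, 0.3, 0.2, 0.7]
largest = top_n_indices(values, 2)                 # indices of 0.7 and 0.5
smallest = top_n_indices(values, 2, order="asc")   # indices of 0.1 and 0.2
```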

pyod.utils.utility.check_detector(detector)[source]#

Checks if fit and decision_function methods exist for the given detector.

Parameters#

detector : pyod.models

Detector instance for which the check is performed.
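The check is a form of duck typing: any object exposing callable `fit` and `decision_function` attributes qualifies. The sketch below returns a boolean for illustration; the helper and the two toy classes are hypothetical, and how check_detector itself reports a failure is not shown here.

```python
def has_detector_interface(detector):
    """True if the object exposes callable fit and decision_function."""
    return all(callable(getattr(detector, name, None))
               for name in ("fit", "decision_function"))


class MeanDistanceDetector:
    """Toy detector: scores are absolute distances from the training mean."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def decision_function(self, X):
        return [abs(x - self.mean_) for x in X]


class NotADetector:
    """Has fit but lacks decision_function."""

    def fit(self, X):
        return self


ok = has_detector_interface(MeanDistanceDetector())
missing = has_detector_interface(NotADetector())
```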

pyod.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)[source]#

Check if an input is within the defined range.

Parameters#

param : int, float

The input parameter to check.

low : int, float

The lower bound of the range.

high : int, float

The higher bound of the range.

param_name : str, optional (default=’’)

The name of the parameter.

include_left : bool, optional (default=False)

Whether to include the lower bound (lower bound <=).

include_right : bool, optional (default=False)

Whether to include the higher bound (<= higher bound).

Returns#

within_range : bool or raise errors

Whether the parameter is within the range of (low, high).
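The effect of include_left and include_right on the boundary comparison can be sketched as follows. The function `in_range` is a hypothetical stand-in that only returns a boolean; per the Returns entry above, the library function may raise an error instead of returning False.

```python
def in_range(param, low, high, include_left=False, include_right=False):
    """Check low < param < high, optionally closing either end."""
    above_low = param >= low if include_left else param > low
    below_high = param <= high if include_right else param < high
    return above_low and below_high


open_interval = in_range(0.5, 0, 0.5)                          # 0.5 excluded
closed_right = in_range(0.5, 0, 0.5, include_right=True)       # 0.5 included
closed_left = in_range(0, 0, 0.5, include_left=True)           # 0 included
```

A typical use is validating a contamination value, which the parameter descriptions in this page restrict to the open interval (0., 0.5).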

pyod.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)[source]#

Randomly draw feature indices. Internal use only.

Modified from sklearn/ensemble/bagging.py

Parameters#

random_state : RandomState

A random number generator instance to define the state of the random permutations generator.

bootstrap_features : bool

Specifies whether to bootstrap index generation.

n_features : int

Specifies the population size when generating indices.

min_features : int

Lower limit for the number of features to randomly sample.

max_features : int

Upper limit for the number of features to randomly sample.

Returns#

feature_indices : numpy array, shape (n_samples,)

Indices for features to bag.

pyod.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)[source]#

Draw randomly sampled indices. Internal use only.

See sklearn/ensemble/bagging.py

Parameters#

random_state : RandomState

A random number generator instance to define the state of the random permutations generator.

bootstrap : bool

Specifies whether to bootstrap index generation.

n_population : int

Specifies the population size when generating indices.

n_samples : int

Specifies the number of samples to draw.

Returns#

indices : numpy array, shape (n_samples,)

Randomly drawn indices.
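The bootstrap flag decides whether indices are drawn with or without replacement. The sketch below shows one way to do that with a NumPy RandomState; the name `draw_indices` and the use of `randint` and `permutation` are assumptions for illustration, not a copy of the library or scikit-learn code.

```python
import numpy as np


def draw_indices(random_state, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population)."""
    if bootstrap:
        # Sampling with replacement: indices may repeat.
        return random_state.randint(0, n_population, n_samples)
    # Sampling without replacement: every index appears at most once.
    return random_state.permutation(n_population)[:n_samples]


rng = np.random.RandomState(0)
with_replacement = draw_indices(rng, True, n_population=10, n_samples=6)
without_replacement = draw_indices(rng, False, n_population=10, n_samples=6)
```

Passing the same RandomState instance through successive calls keeps the whole sequence of draws reproducible from a single seed.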

pyod.utils.utility.get_diff_elements(li1, li2)[source]#

Get the elements that are in li1 but not in li2, and vice versa.

Parameters#

li1 : list or numpy array

Input list 1.

li2 : list or numpy array

Input list 2.

Returns#

difference : list

The difference between li1 and li2.

pyod.utils.utility.get_intersection(lst1, lst2)[source]#

Get the overlap between two lists.

Parameters#

lst1 : list or numpy array

Input list 1.

lst2 : list or numpy array

Input list 2.

Returns#

intersection : list

The elements common to lst1 and lst2.

pyod.utils.utility.get_label_n(y, y_pred, n=None)[source]#

Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters#

y : list or numpy array of shape (n_samples,)

The ground truth. Binary (0: inliers, 1: outliers).

y_pred : list or numpy array of shape (n_samples,)

The raw outlier scores as returned by a fitted model.

n : int, optional (default=None)

The number of outliers. If not defined, it is inferred from the ground truth.

Returns#

labels : numpy array of shape (n_samples,)

Binary labels (0: normal points, 1: outliers).

Examples#

>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])

pyod.utils.utility.get_list_diff(li1, li2)[source]#

Get the elements that are in li1 but not in li2 (li1 - li2).

Parameters#

li1 : list or numpy array

Input list 1.

li2 : list or numpy array

Input list 2.

Returns#

difference : list

The difference between li1 and li2.
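The three list helpers (get_diff_elements, get_intersection and get_list_diff) correspond to the symmetric difference, the intersection and the one-sided difference of two collections. The pure-Python sketch below uses hypothetical names and preserves input order; element order and duplicate handling in the library functions may differ.

```python
def list_diff(li1, li2):
    """Elements of li1 that do not appear in li2 (li1 - li2)."""
    excluded = set(li2)
    return [item for item in li1 if item not in excluded]


def diff_elements(li1, li2):
    """Elements that appear in exactly one of the two lists."""
    return list_diff(li1, li2) + list_diff(li2, li1)


def intersection(lst1, lst2):
    """Elements of lst1 that also appear in lst2."""
    shared = set(lst2)
    return [item for item in lst1 if item in shared]


a = [1, 2, 3, 4]
b = [3, 4, 5]

only_in_a = list_diff(a, b)       # in a, not in b
in_one_only = diff_elements(a, b)  # in a or b, but not both
in_both = intersection(a, b)       # in a and b
```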

pyod.utils.utility.get_optimal_n_bins(X, upper_bound=None, epsilon=1)[source]#

Determine the optimal number of bins for a histogram using the Birgé-Rozenholc method (see [BBirgeR06] for details).

See https://doi.org/10.1051/ps:2006001

Parameters#

X : array-like of shape (n_samples, n_features)

The samples to determine the optimal number of bins for.

upper_bound : int, default=None

The maximum value of n_bins to be considered. If set to None, np.sqrt(X.shape[0]) will be used as the upper bound.

epsilon : float, default=1

A stabilizing term added to the logarithm to prevent division by zero.

Returns#

optimal_n_bins : int

The optimal value of n_bins according to the Birgé-Rozenholc method.
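The method picks the bin count that maximizes a penalized log-likelihood of the regular histogram. The sketch below assumes the penalty D - 1 + (log D)^2.5 for D bins and assumes that epsilon is added inside the logarithm; both choices, along with the name `optimal_n_bins_sketch`, are illustrative assumptions and may not match the library code term for term.

```python
import numpy as np


def optimal_n_bins_sketch(x, upper_bound=None, epsilon=1.0):
    """Pick the bin count maximizing a penalized histogram log-likelihood."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    if upper_bound is None:
        upper_bound = int(np.sqrt(n))

    candidates = np.arange(1, upper_bound + 1)
    scores = np.empty(candidates.size)
    for i, d in enumerate(candidates):
        counts, _ = np.histogram(x, bins=d)
        # Assumption: epsilon sits inside the log so empty bins stay finite.
        log_likelihood = np.sum(counts * np.log(d * counts / n + epsilon))
        # Assumption: penalty D - 1 + (log D)^2.5 for D bins.
        penalty = d - 1 + np.log(d) ** 2.5
        scores[i] = log_likelihood - penalty
    return int(candidates[np.argmax(scores)])


rng = np.random.RandomState(0)
samples = rng.randn(400)
n_bins = optimal_n_bins_sketch(samples)            # searches 1..20 bins
n_bins_capped = optimal_n_bins_sketch(samples, upper_bound=5)
```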

pyod.utils.utility.invert_order(scores, method='multiplication')[source]#

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders could differ.

Parameters#

scores : list, array or numpy array with shape (n_samples,)

The list of values to be inverted.

method : str, optional (default=’multiplication’)

Methods used for order inversion. Valid methods are:

  • ‘multiplication’: multiply by -1

  • ‘subtraction’: max(scores) - scores

Returns#

inverted_scores : numpy array of shape (n_samples,)

The inverted list.

Examples#

>>> from pyod.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])

pyod.utils.utility.precision_n_scores(y, y_pred, n=None)[source]#

Utility function to calculate precision @ rank n.

Parameters#

y : list or numpy array of shape (n_samples,)

The ground truth. Binary (0: inliers, 1: outliers).

y_pred : list or numpy array of shape (n_samples,)

The raw outlier scores as returned by a fitted model.

n : int, optional (default=None)

The number of outliers. If not defined, it is inferred from the ground truth.

Returns#

precision_at_rank_n : float

Precision at rank n score.
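Precision @ rank n is the fraction of true outliers among the n highest-scoring samples. The dependency-free sketch below uses the hypothetical name `precision_at_n` and a plain sort to pick the top n; tie handling in the library function may differ.

```python
def precision_at_n(y, y_pred, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y)  # infer the number of outliers from the ground truth
    if n <= 0:
        raise ValueError("n must be positive")
    # Indices sorted by descending outlier score.
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    top_n = ranked[:n]
    return sum(y[i] for i in top_n) / n


y = [0, 1, 1, 0, 0]
y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
p_at_n = precision_at_n(y, y_pred)          # n inferred as 2
p_at_3 = precision_at_n(y, y_pred, n=3)
```

In this example the two highest scores belong to one inlier and one outlier, so only half of the top-ranked samples are true outliers.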

pyod.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)[source]#

Turn raw outlier scores into binary labels (0 or 1).

Parameters#

pred_scores : list or numpy array of shape (n_samples,)

Raw outlier scores. Outliers are assumed to have larger values.

outliers_fraction : float in (0, 1), optional (default=0.1)

Fraction of outliers.

Returns#

outlier_labels : numpy array of shape (n_samples,)

For each observation, tells whether or not it should be considered an outlier. Binary labels (0: inliers, 1: outliers).
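One common way to turn scores into labels is to threshold at the percentile implied by the outlier fraction. The NumPy sketch below follows that idea under the stated assumption that larger scores mean more outlying; the name `scores_to_labels` and the strict `>` comparison are illustrative choices, not a statement about the library's exact rule.

```python
import numpy as np


def scores_to_labels(pred_scores, outliers_fraction=0.1):
    """Label the highest-scoring fraction of samples as outliers (1)."""
    scores = np.asarray(pred_scores, dtype=float)
    # Scores above this percentile are flagged as outliers.
    threshold = np.percentile(scores, 100 * (1 - outliers_fraction))
    return (scores > threshold).astype(int)


scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = scores_to_labels(scores, outliers_fraction=0.2)
```

With ten evenly spaced scores and a fraction of 0.2, the 80th percentile falls between 0.7 and 0.8, so the two largest scores are labelled 1.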

pyod.utils.utility.standardizer(X, X_t=None, keep_scalar=False)[source]#

Conduct Z-normalization on the data so that the input samples have zero mean and unit variance.

Parameters#

X : numpy array of shape (n_samples, n_features)

The training samples.

X_t : numpy array of shape (n_samples_new, n_features), optional (default=None)

The data to be converted.

keep_scalar : bool, optional (default=False)

Flag indicating whether to return the scaler.

Returns#

X_norm : numpy array of shape (n_samples, n_features)

X after the Z-score normalization.

X_t_norm : numpy array of shape (n_samples_new, n_features)

X_t after the Z-score normalization.

scalar : sklearn scaler object

The scaler used in the conversion.
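Z-normalization subtracts the per-feature mean and divides by the per-feature standard deviation, both estimated on the training samples and then reused for any new data. The NumPy sketch below uses the hypothetical name `z_normalize`, the population standard deviation, and leaves constant features unscaled; these are assumptions for illustration, and the sketch does not return a scaler object.

```python
import numpy as np


def z_normalize(X, X_t=None):
    """Standardize X; optionally apply the same transform to X_t."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    X_norm = (X - mean) / std
    if X_t is None:
        return X_norm
    # New data is scaled with the statistics learned from X.
    X_t_norm = (np.asarray(X_t, dtype=float) - mean) / std
    return X_norm, X_t_norm


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_t = np.array([[2.0, 40.0]])
X_norm, X_t_norm = z_normalize(X, X_t)
```

Reusing the training statistics for X_t keeps training and test data on the same scale, which matters when scores from the two sets are compared.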