Utility Functions

pyod.utils.data module

Utility functions for manipulating data

pyod.utils.data.check_consistent_shape(X_train, y_train, X_test, y_test, y_train_pred, y_test_pred)

Internal function to check that input data shapes are consistent.

Parameters
  • X_train (numpy array of shape (n_samples, n_features)) – The training samples.

  • y_train (list or array of shape (n_samples,)) – The ground truth of training samples.

  • X_test (numpy array of shape (n_samples, n_features)) – The test samples.

  • y_test (list or array of shape (n_samples,)) – The ground truth of test samples.

  • y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.

  • y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.

Returns

  • X_train (numpy array of shape (n_samples, n_features)) – The training samples.

  • y_train (list or array of shape (n_samples,)) – The ground truth of training samples.

  • X_test (numpy array of shape (n_samples, n_features)) – The test samples.

  • y_test (list or array of shape (n_samples,)) – The ground truth of test samples.

  • y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.

  • y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.

pyod.utils.data.evaluate_print(clf_name, y, y_pred)

Utility function for evaluating and printing the results for examples. Default metrics include ROC and Precision @ n.

Parameters
  • clf_name (str) – The name of the detector.

  • y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).

  • y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
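
As a rough illustration of the ROC metric mentioned above, the sketch below computes the area under the ROC curve as the fraction of (outlier, inlier) pairs in which the outlier receives the higher score. The helper name roc_auc_sketch is hypothetical, and the code is a plain-Python stand-in rather than the routine evaluate_print calls internally.

```python
def roc_auc_sketch(y, scores):
    """Pairwise-ranking estimate of ROC AUC (hypothetical helper)."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A pair counts 1 if the outlier scores higher, 0.5 on a tie.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y = [0, 1, 1, 0, 0]
scores = [0.1, 0.5, 0.3, 0.2, 0.7]
auc = roc_auc_sketch(y, scores)
print("ROC:", round(auc, 4))  # ROC: 0.6667
```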

pyod.utils.data.generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1, train_only=False, offset=10, behaviour='old', random_state=None)

Utility function to generate synthesized data. Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution.

Parameters
  • n_train (int, (default=1000)) – The number of training points to generate.

  • n_test (int, (default=500)) – The number of test points to generate.

  • n_features (int, optional (default=2)) – The number of features (dimensions).

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

  • train_only (bool, optional (default=False)) – If True, generate train data only.

  • offset (int, optional (default=10)) – Adjust the value range of Gaussian and Uniform.

  • behaviour (str, default='old') –

    Behaviour of the returned datasets, which can be either ‘old’ or ‘new’. Passing behaviour='new' returns “X_train, y_train, X_test, y_test”, while passing behaviour='old' returns “X_train, X_test, y_train, y_test”.

    New in version 0.7.0: behaviour was added in 0.7.0 for backward compatibility.

    Deprecated since version 0.7.0: behaviour='old' is deprecated in 0.7.0 and will not be possible in 0.7.2.

    Deprecated since version 0.7.2: the behaviour parameter will be deprecated in 0.7.2 and removed in 0.8.0.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Returns

  • X_train (numpy array of shape (n_train, n_features)) – Training data.

  • y_train (numpy array of shape (n_train,)) – Training ground truth.

  • X_test (numpy array of shape (n_test, n_features)) – Test data.

  • y_test (numpy array of shape (n_test,)) – Test ground truth.
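
The generation scheme described above (Gaussian inliers, uniform outliers, a contamination fraction) can be sketched with the standard library alone. The function name, the unit-variance Gaussian centred at offset, and the uniform box [-offset, offset] are assumptions made for illustration; they are not taken from the library's source.

```python
import random


def sketch_generate_data(n_samples=200, n_features=2, contamination=0.1,
                         offset=10, seed=42):
    """Hypothetical stand-in: Gaussian inliers plus uniform outliers."""
    rng = random.Random(seed)
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Inliers: independent Gaussians centred at `offset` (assumed layout).
    inliers = [[rng.gauss(offset, 1.0) for _ in range(n_features)]
               for _ in range(n_inliers)]
    # Outliers: uniform over a wider box around the origin (assumed layout).
    outliers = [[rng.uniform(-offset, offset) for _ in range(n_features)]
                for _ in range(n_outliers)]
    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers  # 0: inlier, 1: outlier
    return X, y


X, y = sketch_generate_data()
print(len(X), sum(y))  # 200 20
```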

pyod.utils.data.generate_data_clusters(n_train=1000, n_test=500, n_clusters=2, n_features=2, contamination=0.1, size='same', density='same', dist=0.25, random_state=None, return_in_clusters=False)

Utility function to generate synthesized data in clusters.

Generated data can involve the low density pattern problem and global outliers, which are considered difficult tasks for outlier detection algorithms.

Parameters
  • n_train (int, (default=1000)) – The number of training points to generate.

  • n_test (int, (default=500)) – The number of test points to generate.

  • n_clusters (int, optional (default=2)) – The number of centers (i.e. clusters) to generate.

  • n_features (int, optional (default=2)) – The number of features for each sample.

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

  • size (str, optional (default='same')) – Size of each cluster: ‘same’ generates clusters with the same size, ‘different’ generates clusters with different sizes.

  • density (str, optional (default='same')) – Density of each cluster: ‘same’ generates clusters with the same density, ‘different’ generates clusters with different densities.

  • dist (float, optional (default=0.25)) – Distance between clusters. Should be between 0. and 1.0. It is used to avoid overlapping clusters as much as possible. However, if the number of samples and the number of clusters are too high, the clusters are unlikely to be fully separated even if dist is set to 1.0.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • return_in_clusters (bool, optional (default=False)) – If True, the function returns x_train, y_train, x_test, y_test each as a list of numpy arrays where each index represents a cluster. If False, it returns x_train, y_train, x_test, y_test each as a numpy array after joining the sequence of cluster arrays.

Returns

  • X_train (numpy array of shape (n_train, n_features)) – Training data.

  • y_train (numpy array of shape (n_train,)) – Training ground truth.

  • X_test (numpy array of shape (n_test, n_features)) – Test data.

  • y_test (numpy array of shape (n_test,)) – Test ground truth.

pyod.utils.data.get_outliers_inliers(X, y)

Internal method to separate inliers from outliers.

Parameters
  • X (numpy array of shape (n_samples, n_features)) – The input samples

  • y (list or array of shape (n_samples,)) – The ground truth of input samples.

Returns

  • X_outliers (numpy array of shape (n_outliers, n_features)) – Outliers.

  • X_inliers (numpy array of shape (n_inliers, n_features)) – Inliers.
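
The separation amounts to filtering rows by their ground-truth label. A minimal sketch on plain lists, with the hypothetical helper name split_outliers_inliers and the 0/1 label convention used elsewhere in this module:

```python
def split_outliers_inliers(X, y):
    """Split rows of X by label: 1 marks an outlier, 0 an inlier."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    X_outliers = [row for row, label in zip(X, y) if label == 1]
    X_inliers = [row for row, label in zip(X, y) if label == 0]
    return X_outliers, X_inliers


X = [[0.1, 0.2], [9.0, 9.5], [0.3, 0.1], [8.7, 9.9]]
y = [0, 1, 0, 1]
X_outliers, X_inliers = split_outliers_inliers(X, y)
print(X_outliers)  # [[9.0, 9.5], [8.7, 9.9]]
print(X_inliers)   # [[0.1, 0.2], [0.3, 0.1]]
```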

pyod.utils.example module

Utility functions for running examples

pyod.utils.example.data_visualize(X_train, y_train, show_figure=True, save_figure=False)

Utility function for visualizing the synthetic samples generated by the generate_data_clusters function.

Parameters
  • X_train (numpy array of shape (n_samples, n_features)) – The training samples.

  • y_train (list or array of shape (n_samples,)) – The ground truth of training samples.

  • show_figure (bool, optional (default=True)) – If set to True, show the figure.

  • save_figure (bool, optional (default=False)) – If set to True, save the figure locally.

pyod.utils.example.visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

Utility function for visualizing the results in examples. Internal use only.

Parameters
  • clf_name (str) – The name of the detector.

  • X_train (numpy array of shape (n_samples, n_features)) – The training samples.

  • y_train (list or array of shape (n_samples,)) – The ground truth of training samples.

  • X_test (numpy array of shape (n_samples, n_features)) – The test samples.

  • y_test (list or array of shape (n_samples,)) – The ground truth of test samples.

  • y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.

  • y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.

  • show_figure (bool, optional (default=True)) – If set to True, show the figure.

  • save_figure (bool, optional (default=False)) – If set to True, save the figure locally.

pyod.utils.stat_models module

A collection of statistical models

pyod.utils.stat_models.pairwise_distances_no_broadcast(X, Y)

Utility function to calculate the row-wise Euclidean distance between two matrices. Unlike pair-wise calculation, this function does not broadcast.

For instance, X and Y are both (4,3) matrices, the function would return a distance vector with shape (4,), instead of (4,4).

Parameters
  • X (array of shape (n_samples, n_features)) – First input samples

  • Y (array of shape (n_samples, n_features)) – Second input samples

Returns

distance – Row-wise euclidean distance of X and Y

Return type

array of shape (n_samples,)
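
To make the difference from a full pair-wise computation concrete, here is a plain-Python sketch that pairs row i of X only with row i of Y. The name rowwise_euclidean is hypothetical; the library operates on numpy arrays.

```python
import math


def rowwise_euclidean(X, Y):
    """Distance between X[i] and Y[i] for each i (no cross pairs)."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(x_row, y_row)))
            for x_row, y_row in zip(X, Y)]


# Two (4, 3) inputs give 4 distances, not a (4, 4) matrix.
X = [[0, 0, 0], [1, 1, 1], [3, 0, 0], [0, 0, 2]]
Y = [[0, 0, 0], [1, 1, 1], [0, 4, 0], [0, 0, 0]]
print(rowwise_euclidean(X, Y))  # [0.0, 0.0, 5.0, 2.0]
```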

pyod.utils.stat_models.pearsonr_mat(mat, w=None)

Utility function to calculate the Pearson correlation matrix (row-wise).

Parameters
  • mat (numpy array of shape (n_samples, n_features)) – Input matrix.

  • w (numpy array of shape (n_features,)) – Weights.

Returns

pear_mat – Row-wise pearson score matrix.

Return type

numpy array of shape (n_samples, n_samples)

pyod.utils.stat_models.wpearsonr(x, y, w=None)

Utility function to calculate the weighted Pearson correlation of two samples.

See https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation for more information

Parameters
  • x (array, shape (n,)) – Input x.

  • y (array, shape (n,)) – Input y.

  • w (array, shape (n,)) – Weights w.

Returns

scores – Weighted Pearson Correlation between x and y.

Return type

float in range of [-1,1]
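
A weighted Pearson correlation replaces the plain means, variances and covariance with their weighted counterparts. The sketch below follows that textbook definition; the helper name weighted_pearson and the choice of uniform weights when w is None are assumptions for illustration.

```python
import math


def weighted_pearson(x, y, w=None):
    """Weighted Pearson correlation from weighted means and covariances."""
    if w is None:
        w = [1.0] * len(x)  # assumed default: equal weights
    w_sum = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    cov_xy = sum(wi * (xi - mean_x) * (yi - mean_y)
                 for wi, xi, yi in zip(w, x, y)) / w_sum
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / w_sum
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / w_sum
    return cov_xy / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(weighted_pearson(x, y, w=[1.0, 2.0, 3.0, 4.0]))
```

Because y is an exact linear function of x here, the result is 1 up to floating-point rounding, whatever positive weights are used.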

pyod.utils.utility module

A set of utility functions to support outlier detection.

pyod.utils.utility.argmaxn(value_list, n, order='desc')

Return the indices of the top n elements in the list if order is set to ‘desc’; otherwise return the indices of the n smallest ones.

Parameters
  • value_list (list, array, numpy array of shape (n_samples,)) – A list containing all values.

  • n (int) – The number of elements to select.

  • order (str, optional (default='desc')) –

    The order to sort {‘desc’, ‘asc’}:

    • ‘desc’: descending

    • ‘asc’: ascending

Returns

index_list – The index of the top n elements.

Return type

numpy array of shape (n,)
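
The selection can be pictured as sorting positions by value and keeping the first n. In this sketch the name top_n_indices is hypothetical, and the indices come back ordered by value, which the library function may not guarantee.

```python
def top_n_indices(values, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values."""
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    ranked = sorted(range(len(values)), key=lambda i: values[i],
                    reverse=(order == "desc"))
    return ranked[:n]


values = [0.1, 0.5, 0.3, 0.2, 0.7]
print(top_n_indices(values, 2))               # [4, 1]
print(top_n_indices(values, 2, order="asc"))  # [0, 3]
```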

pyod.utils.utility.check_detector(detector)

Check whether the fit and decision_function methods exist for the given detector.

Parameters

detector (pyod.models) – Detector instance for which the check is performed.
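
The check is essentially duck typing: an object qualifies if it exposes callable fit and decision_function attributes. The sketch below returns a boolean for clarity; the helper name and both toy classes are hypothetical, and how the library reports a failed check is not shown here.

```python
def has_detector_interface(detector):
    """True if the object has callable fit and decision_function."""
    required = ("fit", "decision_function")
    return all(callable(getattr(detector, name, None)) for name in required)


class GoodDetector:
    def fit(self, X):
        return self

    def decision_function(self, X):
        return [0.0 for _ in X]


class BadDetector:
    def fit(self, X):
        return self


print(has_detector_interface(GoodDetector()))  # True
print(has_detector_interface(BadDetector()))   # False
```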

pyod.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)

Check if an input is within the defined range.

Parameters
  • param (int, float) – The input parameter to check.

  • low (int, float) – The lower bound of the range.

  • high (int, float) – The higher bound of the range.

  • param_name (str, optional (default='')) – The name of the parameter.

  • include_left (bool, optional (default=False)) – Whether to include the lower bound (lower bound <=).

  • include_right (bool, optional (default=False)) – Whether to include the higher bound (<= higher bound).

Returns

within_range – Whether the parameter is within the range of (low, high)

Return type

bool or raise errors
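
The two include flags switch each bound between strict and non-strict comparison. A minimal sketch of that logic, using the hypothetical helper in_range, which only returns a boolean and does not raise:

```python
def in_range(param, low, high, include_left=False, include_right=False):
    """Range test with optionally inclusive bounds."""
    left_ok = param >= low if include_left else param > low
    right_ok = param <= high if include_right else param < high
    return left_ok and right_ok


print(in_range(0.5, 0, 1))                     # True
print(in_range(0, 0, 1))                       # False, lower bound excluded
print(in_range(0, 0, 1, include_left=True))    # True
print(in_range(1, 0, 1, include_right=True))   # True
```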

pyod.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)

Randomly draw feature indices. Internal use only.

Modified from sklearn/ensemble/bagging.py

Parameters
  • random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.

  • bootstrap_features (bool) – Specifies whether to bootstrap index generation

  • n_features (int) – Specifies the population size when generating indices

  • min_features (int) – Lower limit for number of features to randomly sample

  • max_features (int) – Upper limit for number of features to randomly sample

Returns

feature_indices – Indices for features to bag

Return type

numpy array, shape (n_samples,)

pyod.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)

Draw randomly sampled indices. Internal use only.

See sklearn/ensemble/bagging.py

Parameters
  • random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.

  • bootstrap (bool) – Specifies whether to bootstrap index generation

  • n_population (int) – Specifies the population size when generating indices

  • n_samples (int) – Specifies number of samples to draw

Returns

indices – randomly drawn indices

Return type

numpy array, shape (n_samples,)
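
The bootstrap flag decides between sampling with replacement (indices may repeat) and without replacement (all indices distinct). The sketch below shows both modes with the standard library's random.Random in place of a numpy RandomState; draw_indices is a hypothetical name.

```python
import random


def draw_indices(rng, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population)."""
    if bootstrap:
        # With replacement: the same index can be drawn more than once.
        return [rng.randrange(n_population) for _ in range(n_samples)]
    # Without replacement: every drawn index is unique.
    return rng.sample(range(n_population), n_samples)


rng = random.Random(0)
print(draw_indices(rng, True, 10, 5))
print(draw_indices(rng, False, 10, 5))
```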

pyod.utils.utility.get_label_n(y, y_pred, n=None)

Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters
  • y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).

  • y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.

  • n (int, optional (default=None)) – The number of outliers. If not defined, infer using ground truth.

Returns

labels – Binary labels (0: normal points, 1: outliers).

Return type

numpy array of shape (n_samples,)

Examples

>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])

pyod.utils.utility.invert_order(scores, method='multiplication')

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful while combining multiple detectors since their score order could be different.

Parameters
  • scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted

  • method (str, optional (default='multiplication')) –

    Methods used for order inversion. Valid methods are:

    • ‘multiplication’: multiply by -1

    • ‘subtraction’: max(scores) - scores

Returns

inverted_scores – The inverted list

Return type

numpy array of shape (n_samples,)

Examples

>>> from pyod.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])

pyod.utils.utility.precision_n_scores(y, y_pred, n=None)

Utility function to calculate precision @ rank n.

Parameters
  • y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).

  • y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.

  • n (int, optional (default=None)) – The number of outliers. If not defined, infer using ground truth.

Returns

precision_at_rank_n – Precision at rank n score.

Return type

float
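
Precision @ rank n asks how many of the n highest-scoring samples are true outliers. The sketch below implements that definition directly by sorting; the helper name precision_at_n is hypothetical, and the library may break ties or pick the cut-off differently.

```python
def precision_at_n(y, scores, n=None):
    """Share of true outliers among the n highest-scoring samples."""
    if n is None:
        n = sum(y)  # infer n from the ground truth
    if n <= 0:
        raise ValueError("n must be positive")
    top = sorted(range(len(scores)), key=lambda i: scores[i],
                 reverse=True)[:n]
    return sum(y[i] for i in top) / n


y = [0, 1, 1, 0, 0]
scores = [0.1, 0.5, 0.3, 0.2, 0.7]
print(precision_at_n(y, scores))  # 0.5
```

With two true outliers, the top two scores are 0.7 (an inlier) and 0.5 (an outlier), so one of the two flagged points is correct.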

pyod.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)

Turn raw outlier scores into binary labels (0 or 1).

Parameters
  • pred_scores (list or numpy array of shape (n_samples,)) – Raw outlier scores. Outliers are assumed to have larger values.

  • outliers_fraction (float in (0,1), optional (default=0.1)) – The proportion of outliers.

Returns

outlier_labels – For each observation, tells whether or not it should be considered an outlier: 0 for inliers and 1 for outliers.

Return type

numpy array of shape (n_samples,)
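
One common way to turn scores into labels is to flag everything above the (1 - outliers_fraction) percentile of the scores. The sketch below assumes that thresholding rule and a linear-interpolation percentile; both helper names are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def labels_from_scores(scores, outliers_fraction=0.1):
    """Label 1 for scores above the assumed percentile threshold."""
    threshold = percentile(scores, 100 * (1 - outliers_fraction))
    return [1 if s > threshold else 0 for s in scores]


scores = [0.1, 0.5, 0.3, 0.2, 0.7, 0.4, 0.15, 0.25, 0.35, 0.9]
print(labels_from_scores(scores, outliers_fraction=0.2))
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```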

pyod.utils.utility.standardizer(X, X_t=None, keep_scalar=False)

Conduct Z-normalization on data so that the input samples have zero mean and unit variance.

Parameters
  • X (numpy array of shape (n_samples, n_features)) – The training samples

  • X_t (numpy array of shape (n_samples_new, n_features), optional (default=None)) – The data to be converted

  • keep_scalar (bool, optional (default=False)) – The flag to indicate whether to return the scaler

Returns

  • X_norm (numpy array of shape (n_samples, n_features)) – X after the Z-score normalization

  • X_t_norm (numpy array of shape (n_samples_new, n_features)) – X_t after the Z-score normalization

  • scalar (sklearn scaler object) – The scaler used in conversion
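
The key point of Z-normalization with a second data set is that the mean and standard deviation are estimated on X only and then reused for X_t. The plain-Python sketch below shows that idea; the name z_normalize, the use of the population standard deviation, and the fallback scale of 1 for constant features are assumptions for illustration.

```python
import math


def z_normalize(X, X_t=None):
    """Scale X (and optionally X_t) using per-feature statistics of X."""
    n = len(X)
    n_features = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n
        stds.append(math.sqrt(var) or 1.0)  # avoid division by zero

    def transform(data):
        return [[(row[j] - means[j]) / stds[j] for j in range(n_features)]
                for row in data]

    if X_t is None:
        return transform(X)
    return transform(X), transform(X_t)


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_t = [[2.0, 20.0]]
X_norm, X_t_norm = z_normalize(X, X_t)
print(X_t_norm)  # [[0.0, 0.0]]
```

The test row equals the training mean in both features, so it maps to zero after scaling.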