# Welcome to PyOD documentation!


PyOD is a comprehensive and scalable **Python toolkit** for **detecting outlying objects** in
multivariate data. This exciting yet challenging field is commonly referred to as
Outlier Detection or Anomaly Detection.
Since 2017, PyOD has been successfully used in various academic research projects and commercial products [AZH18a][AZH18b][AZHNL19].
PyOD is featured for:

- **Unified APIs, detailed documentation, and interactive examples** across various algorithms.
- **Advanced models**, including **Neural Networks/Deep Learning** and **Outlier Ensembles**.
- **Optimized performance with JIT and parallelization** when possible, using numba and joblib.
- **Compatible with both Python 2 & 3** (scikit-learn compatible as well).

**Important Notes**:
PyOD contains neural-network-based models, e.g., AutoEncoders, which are
implemented in Keras. However, PyOD does **NOT** install **Keras** and/or
**TensorFlow** automatically. This reduces the risk of interfering with your local installations.
If you want to use neural-network-based models, you should install Keras and a backend library such as TensorFlow manually.
Instructions are provided in the neural-net FAQ.
Similarly, some models, e.g., XGBOD, depend on **xgboost**, which is **NOT** installed by default.


# Quick Introduction

The PyOD toolkit consists of three major groups of functionalities:

**(i) Individual Detection Algorithms**:

- Linear Models for Outlier Detection:

Type | Abbr | Algorithm | Year | Class | Ref |
---|---|---|---|---|---|
Linear Model | PCA | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) | 2003 | `pyod.models.pca.PCA` | [ASCSC03] |
Linear Model | MCD | Minimum Covariance Determinant (use the Mahalanobis distances as the outlier scores) | 1999 | `pyod.models.mcd.MCD` | [ARD99][AHR04] |
Linear Model | OCSVM | One-Class Support Vector Machines | 2003 | `pyod.models.ocsvm.OCSVM` | [AMP03] |
Proximity-Based | LOF | Local Outlier Factor | 2000 | `pyod.models.lof.LOF` | [ABKNS00] |
Proximity-Based | CBLOF | Clustering-Based Local Outlier Factor | 2003 | `pyod.models.cblof.CBLOF` | [AHXD03] |
Proximity-Based | LOCI | LOCI: Fast outlier detection using the local correlation integral | 2003 | `pyod.models.loci.LOCI` | [APKGF03] |
Proximity-Based | HBOS | Histogram-based Outlier Score | 2012 | `pyod.models.hbos.HBOS` | [AGD12] |
Proximity-Based | kNN | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) | 2000 | `pyod.models.knn.KNN` | [ARRS00][AAP02] |
Proximity-Based | AvgKNN | Average kNN (use the average distance to k nearest neighbors as the outlier score) | 2002 | `pyod.models.knn.KNN` | [ARRS00][AAP02] |
Proximity-Based | MedKNN | Median kNN (use the median distance to k nearest neighbors as the outlier score) | 2002 | `pyod.models.knn.KNN` | [ARRS00][AAP02] |
Probabilistic | ABOD | Angle-Based Outlier Detection | 2008 | `pyod.models.abod.ABOD` | [AKZ+08] |
Probabilistic | FastABOD | Fast Angle-Based Outlier Detection using approximation | 2008 | `pyod.models.abod.ABOD` | [AKZ+08] |
Probabilistic | SOS | Stochastic Outlier Selection | 2012 | `pyod.models.sos.SOS` | [AJHuszarPvdH12] |
Outlier Ensembles | IForest | Isolation Forest | 2008 | `pyod.models.iforest.IForest` | [ALTZ08][ALTZ12] |
Outlier Ensembles | | Feature Bagging | 2005 | `pyod.models.feature_bagging.FeatureBagging` | [ALK05] |
Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | `pyod.models.lscp.LSCP` | [AZHNL19] |
Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 | `pyod.models.xgbod.XGBOD` | [AZH18a] |
Neural Networks | AutoEncoder | Fully connected AutoEncoder (use reconstruction error as the outlier score) | 2015 | `pyod.models.auto_encoder.AutoEncoder` | [AAgg15] |
Neural Networks | SO_GAAL | Single-Objective Generative Adversarial Active Learning | 2019 | `pyod.models.so_gaal.SO_GAAL` | [ALLZ+18] |
Neural Networks | MO_GAAL | Multiple-Objective Generative Adversarial Active Learning | 2019 | `pyod.models.mo_gaal.MO_GAAL` | [ALLZ+18] |
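As an illustration of the proximity-based family above, the kNN detector's outlier score, the distance to the kth nearest neighbor, can be sketched in a few lines of NumPy. This is a simplified brute-force sketch for intuition, not PyOD's actual `pyod.models.knn.KNN` implementation (which uses efficient neighbor search):

```python
import numpy as np

def knn_scores(X, k=2):
    """Outlier score = Euclidean distance to the k-th nearest neighbor."""
    # Brute-force pairwise distance matrix (n x n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the distance to the point
    # itself (0), so the k-th nearest neighbor sits at column k.
    return np.sort(dist, axis=1)[:, k]

# A tight cluster plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = knn_scores(X, k=2)
print(scores)  # the isolated point receives the largest score
```

Replacing the sort column with a mean or median over columns 1..k gives the AvgKNN and MedKNN variants from the same table row family.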

**(ii) Outlier Ensembles & Outlier Detector Combination Frameworks**:

Type | Abbr | Algorithm | Year | Class | Ref |
---|---|---|---|---|---|
Outlier Ensembles | | Feature Bagging | 2005 | `pyod.models.feature_bagging.FeatureBagging` | [ALK05] |
Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | `pyod.models.lscp.LSCP` | [AZHNL19] |
Combination | Average | Simple combination by averaging the scores | 2015 | `pyod.models.combination.average()` | [AAS15] |
Combination | Weighted Average | Simple combination by averaging the scores with detector weights | 2015 | `pyod.models.combination.average()` | [AAS15] |
Combination | Maximization | Simple combination by taking the maximum scores | 2015 | `pyod.models.combination.maximization()` | [AAS15] |
Combination | AOM | Average of Maximum | 2015 | `pyod.models.combination.aom()` | [AAS15] |
Combination | MOA | Maximization of Average | 2015 | `pyod.models.combination.moa()` | [AAS15] |
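The combination rules in the table operate on a score matrix of shape `(n_samples, n_detectors)`. The sketch below mirrors their semantics in plain NumPy, assuming scores have already been standardized and using a fixed even split into buckets (PyOD's `aom()`/`moa()` assign detectors to buckets randomly by default):

```python
import numpy as np

# Hypothetical standardized score matrix: 4 samples x 4 detectors.
scores = np.array([
    [0.1, 0.2, 0.1, 0.3],
    [0.9, 0.8, 0.7, 0.9],
    [0.2, 0.1, 0.3, 0.2],
    [0.4, 0.9, 0.2, 0.1],
])

avg = scores.mean(axis=1)   # Average: mean score per sample
mx = scores.max(axis=1)     # Maximization: max score per sample

# AOM: take the max inside each bucket of detectors, then average the
# bucket maxima. MOA is the dual: average per bucket, then take the max.
buckets = np.split(scores, 2, axis=1)  # two buckets of 2 detectors each
aom = np.mean([b.max(axis=1) for b in buckets], axis=0)
moa = np.max([b.mean(axis=1) for b in buckets], axis=0)
```

AOM and MOA trade off between averaging (stable but dilutes strong signals) and maximization (sensitive but noisy), which is why [AAS15] proposes these bucketed hybrids.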

**(iii) Utility Functions**:

Type | Name | Function |
---|---|---|
Data | `pyod.utils.data.generate_data()` | Synthesized data generation; normal data is generated by a multivariate Gaussian and outliers are generated by a uniform distribution |
Stat | `pyod.utils.stat_models.wpearsonr()` | Calculate the weighted Pearson correlation of two samples |
Utility | `pyod.utils.utility.get_label_n()` | Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores |
Utility | `pyod.utils.utility.precision_n_scores()` | Calculate precision @ rank n |

**A comparison of all implemented models** is made available below
(Code, Jupyter Notebooks):

For Jupyter Notebooks, please navigate to **“/notebooks/Compare All Models.ipynb”**

# Key APIs & Attributes

The following APIs are applicable to all detector models for easy use.

`pyod.models.base.BaseDetector.fit()`

: Fit detector. y is optional for unsupervised methods.

`pyod.models.base.BaseDetector.fit_predict()`

: Fit detector first and then predict whether a particular sample is an outlier or not.

`pyod.models.base.BaseDetector.fit_predict_score()`

: Fit the detector, predict on samples, and evaluate the model by predefined metrics, e.g., ROC.

`pyod.models.base.BaseDetector.decision_function()`

: Predict raw anomaly scores of X using the fitted detector.

`pyod.models.base.BaseDetector.predict()`

: Predict whether a particular sample is an outlier or not using the fitted detector.

`pyod.models.base.BaseDetector.predict_proba()`

: Predict the probability of a sample being an outlier using the fitted detector.

Key Attributes of a fitted model:

`pyod.models.base.BaseDetector.decision_scores_`

: The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores.

`pyod.models.base.BaseDetector.labels_`

: The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.
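The workflow these APIs and attributes imply can be illustrated with a minimal toy detector that follows the same contract. The `MeanDistanceDetector` below is a hypothetical stand-in, not a PyOD model: it scores each point by its distance to the training mean, but exposes `fit()`, `decision_function()`, `predict()`, `decision_scores_`, and `labels_` exactly as described above:

```python
import numpy as np

class MeanDistanceDetector:
    """Toy detector following the BaseDetector contract:
    fit() learns from X, decision_function() returns raw outlier scores,
    predict() thresholds them into binary labels (0 inlier, 1 outlier)."""

    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X, y=None):  # y is optional (unsupervised)
        X = np.asarray(X, dtype=float)
        self.center_ = X.mean(axis=0)
        self.decision_scores_ = self.decision_function(X)
        # Threshold chosen so that a `contamination` fraction of the
        # training points are flagged as outliers.
        self.threshold_ = np.quantile(self.decision_scores_,
                                      1 - self.contamination)
        self.labels_ = (self.decision_scores_ > self.threshold_).astype(int)
        return self

    def decision_function(self, X):
        # The higher the score, the more abnormal the sample.
        return np.linalg.norm(np.asarray(X, dtype=float) - self.center_,
                              axis=1)

    def predict(self, X):
        return (self.decision_function(X) > self.threshold_).astype(int)

clf = MeanDistanceDetector(contamination=0.25).fit(
    [[0, 0], [0.1, 0], [0, 0.1], [4, 4]])
print(clf.labels_)                      # the far point is flagged
print(clf.predict([[0, 0.05], [5, 5]])) # scoring unseen samples
```

With a real PyOD detector, e.g. `pyod.models.knn.KNN`, the calling code looks the same: construct, `fit(X_train)`, then read `labels_`/`decision_scores_` or call `predict()` on new data.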


# References

[AAgg15] | Charu C Aggarwal. Outlier analysis. In Data mining, 75–79. Springer, 2015. |

[AAS15] | (1, 2, 3, 4, 5) Charu C Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24–47, 2015. |

[AAP02] | (1, 2, 3) Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, 15–27. Springer, 2002. |

[ABKNS00] | Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, 93–104. ACM, 2000. |

[AGD12] | Markus Goldstein and Andreas Dengel. Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pages 59–63, 2012. |

[AHR04] | Johanna Hardin and David M Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625–638, 2004. |

[AHXD03] | Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003. |

[AJHuszarPvdH12] | JHM Janssens, Ferenc Huszár, EO Postma, and HJ van den Herik. Stochastic outlier selection. Technical Report, Technical report TiCC TR 2012-001, Tilburg University, Tilburg Center for Cognition and Communication, Tilburg, The Netherlands, 2012. |

[AKZ+08] | (1, 2) Hans-Peter Kriegel, Arthur Zimek, and others. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 444–452. ACM, 2008. |

[ALK05] | (1, 2) Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 157–166. ACM, 2005. |

[ALTZ08] | Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, 413–422. IEEE, 2008. |

[ALTZ12] | Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):3, 2012. |

[ALLZ+18] | (1, 2) Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. Generative adversarial active learning for unsupervised outlier detection. arXiv preprint arXiv:1809.10816, 2018. |

[AMP03] | Junshui Ma and Simon Perkins. Time-series novelty detection using one-class support vector machines. In Neural Networks, 2003. Proceedings of the International Joint Conference on, volume 3, 1741–1745. IEEE, 2003. |

[APKGF03] | Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B Gibbons, and Christos Faloutsos. Loci: fast outlier detection using the local correlation integral. In Data Engineering, 2003. Proceedings. 19th International Conference on, 315–326. IEEE, 2003. |

[ARRS00] | (1, 2, 3) Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In ACM Sigmod Record, volume 29, 427–438. ACM, 2000. |

[ARD99] | Peter J Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223, 1999. |

[ASCSC03] | Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING, 2003. |

[AZH18a] | (1, 2) Yue Zhao and Maciej K Hryniewicki. XGBOD: improving supervised outlier detection with unsupervised representation learning. In Neural Networks, 2018. Proceedings of the International Joint Conference on. IEEE, 2018. |

[AZHNL19] | (1, 2, 3) Yue Zhao, Maciej K Hryniewicki, Zain Nasrullah, and Zheng Li. LSCP: locally selective combination in parallel outlier ensembles. In SIAM International Conference on Data Mining (SDM). Calgary, Canada, May 2019. Society for Industrial and Applied Mathematics. |

[AZH18b] | Yue Zhao and Maciej K. Hryniewicki. DCSO: dynamic combination of detector scores for outlier ensembles. In ACM SIGKDD Workshop on Outlier Detection De-constructed (ODD v5.0). ACM, 2018. |