Text and Image Detectors¶
PyOD’s EmbeddingOD chains foundation model encoders (sentence-transformers, OpenAI, HuggingFace) with any PyOD detector for text and image anomaly detection. Detector rankings are taken from NLP-ADBench.
See Layer 1: Text and Image Anomaly Detection for usage.
pyod.models.embedding module¶
EmbeddingOD and MultiModalOD: Anomaly detection via foundation model embeddings.
EmbeddingOD chains any embedding encoder with any PyOD detector, enabling anomaly detection on text, image, and other non-tabular data through PyOD’s standard API. MultiModalOD extends this to multi-modal data by running separate detectors per modality and fusing their scores.
- class pyod.models.embedding.EmbeddingOD(encoder, detector='LUNAR', contamination=0.1, batch_size=32, cache_embeddings=False, reduce_dim=None, standardize=True, random_state=None)[source]¶
Bases: BaseDetector
Anomaly detection on raw data via embedding + detector pipeline.
Chains any embedding encoder with any PyOD detector. Encode raw data (text, images, or other modalities) into numeric embeddings, then apply outlier detection in the embedding space.
This implements the two-step approach shown to outperform end-to-end methods in NLP-ADBench (Li et al., EMNLP 2025) and TAD-Bench (Cao et al., 2025).
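The two-step idea (encode, then score in embedding space) can be sketched without any model downloads. The snippet below is a toy illustration only: `toy_encode`, `fit_scaler`, `transform`, and `knn_score` are hypothetical helpers written for this example, the three handcrafted features stand in for a real foundation-model embedding, and none of it is PyOD's actual implementation.

```python
import math
import statistics


def toy_encode(texts):
    """Hypothetical stand-in encoder: text -> [length, digit ratio, uppercase ratio]."""
    rows = []
    for t in texts:
        n = max(len(t), 1)
        rows.append([
            float(len(t)),
            sum(ch.isdigit() for ch in t) / n,
            sum(ch.isupper() for ch in t) / n,
        ])
    return rows


def fit_scaler(rows):
    """Per-feature mean and standard deviation learned on the training embeddings."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features
    return means, stds


def transform(rows, means, stds):
    """Standardize embeddings with the training statistics."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def knn_score(query, train, k=2):
    """Distance to the k-th nearest training embedding (larger = more abnormal)."""
    return sorted(math.dist(query, t) for t in train)[k - 1]


train_texts = [
    "order shipped today",
    "order arrived late",
    "order was cancelled",
    "order shipped monday",
]
test_texts = ["order shipped friday", "WIN 1000000 $$$ NOW 9999"]

# Step 1: encode and standardize. Step 2: score in embedding space.
means, stds = fit_scaler(toy_encode(train_texts))
train_emb = transform(toy_encode(train_texts), means, stds)
test_emb = transform(toy_encode(test_texts), means, stds)
scores = [knn_score(q, train_emb) for q in test_emb]
```

The second test string differs from the training texts in all three features, so it receives the larger score.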
Parameters¶
- encoder : str, BaseEncoder, SentenceTransformer instance, or callable
Embedding encoder. Accepts:
- Registry shortcut: 'all-MiniLM-L6-v2', 'text-embedding-3-small', 'dinov2-base'
- HuggingFace model ID: 'sentence-transformers/all-MiniLM-L6-v2'
- Local filesystem path: '/path/to/local/weights', loaded without any network call and suitable for air-gapped environments.
- Pre-instantiated SentenceTransformer: passed directly, no reload.
- BaseEncoder instance
- Callable: fn(X) -> np.ndarray of shape (n_samples, n_features)
- detector : str or BaseDetector, optional (default='LUNAR')
Any PyOD detector. A string resolves to a default-configured instance. Default is LUNAR (best performer in NLP-ADBench).
- contamination : float, optional (default=0.1)
Expected proportion of outliers in the dataset. Must be in (0, 0.5].
- batch_size : int, optional (default=32)
Batch size for encoding.
- cache_embeddings : bool, optional (default=False)
Cache training embeddings to avoid re-encoding. Recommended for API-based encoders (e.g., OpenAI).
- reduce_dim : int or None, optional (default=None)
If set, apply PCA to reduce embedding dimensionality before detection. Recommended for embeddings >1000 dims with distance-based detectors (KNN, LOF).
- standardize : bool, optional (default=True)
Apply StandardScaler to embeddings before detection. Matches the preprocessing pipeline in NLP-ADBench.
- random_state : int, RandomState instance or None, optional (default=None)
Controls stochastic parts of EmbeddingOD. The seed is forwarded to (a) the dimensionality-reduction PCA when reduce_dim is set (PCA may pick a randomized SVD solver on high-dimensional embeddings) and (b) the string-resolved inner detector when that detector class declares an explicit random_state parameter (e.g., the default 'LUNAR' preset, or 'IForest'). It does NOT control the external encoder’s own inference (e.g., sentence-transformers, DINOv2), which is treated as deterministic given fixed weights. When ADEngine(random_state=...) builds a preset plan, the engine seed flows here automatically.
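The rule in (b), forwarding the seed only when the detector class declares `random_state`, can be sketched with `inspect.signature`. The two detector classes and `build_detector` below are hypothetical stand-ins written for this illustration; they are not PyOD code, and the library's actual resolution logic may differ.

```python
import inspect


class SeededDetector:
    """Hypothetical detector whose constructor declares random_state."""

    def __init__(self, n_estimators=100, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class UnseededDetector:
    """Hypothetical detector with no random_state parameter."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


def build_detector(cls, random_state=None):
    """Instantiate cls, passing the seed only if the constructor accepts it."""
    params = inspect.signature(cls.__init__).parameters
    kwargs = {}
    if random_state is not None and "random_state" in params:
        kwargs["random_state"] = random_state
    return cls(**kwargs)


seeded = build_detector(SeededDetector, random_state=42)
unseeded = build_detector(UnseededDetector, random_state=42)
```

Checking the signature first keeps a single seed argument from raising a `TypeError` on detectors that have no stochastic component.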
Attributes¶
- decision_scores_ : numpy array of shape (n_samples,)
Outlier scores of the training data. Higher is more abnormal.
- threshold_ : float
Score threshold based on contamination.
- labels_ : numpy array of shape (n_samples,)
Binary labels of training data (0: inlier, 1: outlier).
- encoder_ : BaseEncoder
The resolved encoder instance.
- detector_ : BaseDetector
The resolved and fitted detector instance.
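How `threshold_` and `labels_` relate to `contamination` can be sketched as a percentile cut on the training scores. This assumes the common PyOD convention of taking the `100 * (1 - contamination)` percentile and labeling scores strictly above it as outliers; `percentile` and `threshold_and_labels` are hypothetical helper names, and the library itself works on NumPy arrays.

```python
import math


def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def threshold_and_labels(train_scores, contamination=0.1):
    """Assumed rule: cut at the (1 - contamination) percentile of training scores."""
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    threshold = percentile(train_scores, 100.0 * (1.0 - contamination))
    labels = [int(s > threshold) for s in train_scores]
    return threshold, labels


train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
threshold, labels = threshold_and_labels(train_scores, contamination=0.1)
```

With ten training scores and `contamination=0.1`, only the single largest score lands above the threshold.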
Examples¶
>>> from pyod.models.embedding import EmbeddingOD
>>> clf = EmbeddingOD(encoder='all-MiniLM-L6-v2', detector='KNN')
>>> clf.fit(train_texts)
>>> scores = clf.decision_function(test_texts)
>>> labels = clf.predict(test_texts)

>>> # Air-gapped: local filesystem weights
>>> clf = EmbeddingOD(encoder='/path/to/local/weights', detector='KNN')
>>> clf.fit(texts)

>>> # Pre-instantiated model (e.g., shared across multiple classifiers)
>>> from sentence_transformers import SentenceTransformer
>>> my_model = SentenceTransformer('all-MiniLM-L6-v2')
>>> clf = EmbeddingOD(encoder=my_model, detector='IForest')
>>> clf.fit(texts)
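The callable form of `encoder` only needs to map raw inputs to a 2-D array of shape (n_samples, n_features). A minimal sketch of such a callable follows; `hashed_trigram_encoder` is a hypothetical example (character trigrams hashed into fixed-width, L2-normalized vectors) chosen because it needs no model weights, and it is not part of PyOD.

```python
import zlib

import numpy as np


def hashed_trigram_encoder(texts, n_features=64):
    """Hypothetical callable encoder: hash character trigrams into fixed-width vectors."""
    out = np.zeros((len(texts), n_features), dtype=np.float64)
    for i, text in enumerate(texts):
        padded = f"  {text.lower()} "
        for j in range(len(padded) - 2):
            # crc32 is deterministic across runs, unlike the built-in hash() for str.
            bucket = zlib.crc32(padded[j:j + 3].encode("utf-8")) % n_features
            out[i, bucket] += 1.0
        norm = np.linalg.norm(out[i])
        if norm > 0:
            out[i] /= norm  # L2-normalize each row
    return out


emb = hashed_trigram_encoder(["hello world", "hello there"])
```

Going by the parameter description above, a function like this could be passed as `encoder=hashed_trigram_encoder`; that usage is inferred from the documented contract rather than verified here.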
- decision_function(X)[source]¶
Predict raw anomaly scores for X.
Parameters¶
- X : list or array-like
Raw input data in the same format as fit().
Returns¶
- anomaly_scores : numpy array of shape (n_samples,)
Anomaly scores. Higher is more abnormal.
- fit(X, y=None)[source]¶
Fit detector on raw input data.
Encodes X into embeddings, applies preprocessing, then fits the inner detector.
Parameters¶
- X : list or array-like
Raw input data (e.g., list of strings for text, list of PIL Images for images).
- y : Ignored
Not used, present for API consistency.
Returns¶
- self : object
Fitted estimator.
- classmethod for_audio(quality='balanced', **kwargs)[source]¶
Create an EmbeddingOD configured for audio anomaly detection.
Uses a handcrafted 74-dim acoustic feature encoder (20 MFCC, 12 chroma, and 5 spectral descriptors, each as mean and standard deviation over frames) followed by a classical PyOD detector. This embed-then-detect pattern with classical detectors is competitive on standard audio anomaly detection benchmarks and needs no GPU. Requires pyod[audio] (librosa, soundfile).
Input clips may be file paths, waveform arrays, or (waveform, sample_rate) tuples.
Parameters¶
- quality : str, optional (default='balanced')
‘fast’: handcrafted features + IForest.
‘balanced’: handcrafted features + KNN.
‘best’: handcrafted features + LUNAR (requires torch).
- **kwargs
Override any EmbeddingOD parameter.
Returns¶
clf : EmbeddingOD
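The 74-dim figure follows from the description: 20 + 12 + 5 = 37 per-frame descriptors, each summarized by two statistics. The sketch below shows that pooling step on synthetic numbers. `pool_frames` is a hypothetical helper, and both the ordering (all means, then all standard deviations) and the use of the population standard deviation are assumptions; the real encoder extracts the per-frame features with librosa.

```python
import statistics

N_MFCC, N_CHROMA, N_SPECTRAL = 20, 12, 5
N_DESCRIPTORS = N_MFCC + N_CHROMA + N_SPECTRAL  # 37 descriptors per frame


def pool_frames(frames):
    """Collapse per-frame descriptors into one clip-level vector: means, then stds."""
    columns = list(zip(*frames))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means + stds


# Synthetic stand-in for real per-frame features: 50 frames x 37 descriptors.
frames = [[float((f * 7 + d) % 11) for d in range(N_DESCRIPTORS)] for f in range(50)]
clip_vector = pool_frames(frames)
```

Pooling over frames gives every clip the same vector length regardless of its duration, which is what lets a tabular detector consume it.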
- classmethod for_image(quality='balanced', **kwargs)[source]¶
Create an EmbeddingOD configured for image anomaly detection.
Configurations are informed by AnomalyDINO (WACV 2025).
Parameters¶
- quality : str, optional (default='balanced')
‘fast’: DINOv2-small (384d) + KNN.
‘balanced’: DINOv2-base (768d) + LOF.
‘best’: DINOv2-large (1024d) + KNN.
- **kwargs
Override any EmbeddingOD parameter.
Returns¶
clf : EmbeddingOD
- classmethod for_text(quality='balanced', **kwargs)[source]¶
Create an EmbeddingOD configured for text anomaly detection.
Configurations are informed by NLP-ADBench (EMNLP 2025).
Parameters¶
- quality : str, optional (default='balanced')
‘fast’: MiniLM encoder (384d) + KNN. No API key needed.
‘balanced’: mpnet encoder (768d) + LUNAR. No API key needed.
‘best’: OpenAI large (3072d) + LUNAR. Requires API key.
- **kwargs
Override any EmbeddingOD parameter.
Returns¶
clf : EmbeddingOD
- predict_proba(X, method='linear', return_confidence=False)[source]¶
Predict the probability of a sample being an outlier.
Overrides the base implementation to handle list inputs (raw data such as text or images) which do not have a .shape attribute.
Parameters¶
- X : list or array-like
Raw input data in the same format as fit().
- method : str, optional (default='linear')
Probability conversion method. One of 'linear' or 'unify'.
- return_confidence : boolean, optional (default=False)
If True, also return the confidence of prediction.
Returns¶
outlier_probability : numpy array of shape (n_samples, n_classes)
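The two conversion methods can be sketched as follows, assuming they behave like PyOD's base detector: 'linear' min-max scales against the training scores and clips to [0, 1], while 'unify' applies a Gaussian error-function transform centered on the training mean. `proba_linear` and `proba_unify` are hypothetical helper names, and the override in EmbeddingOD may differ in detail.

```python
import math
import statistics


def proba_linear(train_scores, test_scores):
    """Min-max scale test scores by the training range, clipped to [0, 1]."""
    lo, hi = min(train_scores), max(train_scores)
    span = (hi - lo) or 1.0  # guard against constant training scores
    rows = []
    for s in test_scores:
        p = min(max((s - lo) / span, 0.0), 1.0)
        rows.append([1.0 - p, p])  # columns: [P(inlier), P(outlier)]
    return rows


def proba_unify(train_scores, test_scores):
    """erf-based scaling around the training mean, clipped to [0, 1]."""
    mu = statistics.fmean(train_scores)
    sigma = statistics.pstdev(train_scores) or 1.0
    rows = []
    for s in test_scores:
        p = min(max(math.erf((s - mu) / (sigma * math.sqrt(2.0))), 0.0), 1.0)
        rows.append([1.0 - p, p])
    return rows


train_scores = [1.0, 2.0, 3.0, 5.0]
linear = proba_linear(train_scores, [0.0, 3.0, 9.0])
unify = proba_unify(train_scores, [2.75, 0.0])
```

Each row has two columns, matching the documented (n_samples, n_classes) shape with n_classes = 2.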
- set_predict_proba_request(*, method: bool | None | str = '$UNCHANGED$', return_confidence: bool | None | str = '$UNCHANGED$') EmbeddingOD¶
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- method : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for method parameter in predict_proba.
- return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for return_confidence parameter in predict_proba.
Returns¶
- self : object
The updated object.
- set_predict_request(*, return_confidence: bool | None | str = '$UNCHANGED$') EmbeddingOD¶
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for return_confidence parameter in predict.
Returns¶
- self : object
The updated object.
- class pyod.models.embedding.MultiModalOD(modalities, combination='average', contamination=0.1, standardize_scores=True)[source]¶
Bases: BaseDetector
Multi-modal anomaly detection via score fusion.
Runs a separate detector per modality and combines their anomaly scores. Each modality can use a different detector and encoder. Score combination uses PyOD’s existing combination functions.
This is complementary to using MultiModalEncoder with EmbeddingOD (early/feature fusion). Score fusion is preferred when modalities have very different characteristics or when per-modality anomaly scores are independently meaningful.
Parameters¶
- modalities : dict of {str: BaseDetector}
Maps modality name to a detector. Each detector can be:
- An EmbeddingOD instance (for text/image modalities)
- Any BaseDetector instance (for tabular modalities)
- combination : str, optional (default='average')
Score combination method. One of 'average', 'maximization', 'median'.
- contamination : float, optional (default=0.1)
Expected proportion of outliers. Used for threshold and labels on the combined scores.
- standardize_scores : bool, optional (default=True)
Standardize per-modality scores to zero mean and unit variance before combination. Recommended when detectors produce scores on different scales.
Attributes¶
- decision_scores_ : numpy array of shape (n_samples,)
Combined outlier scores of the training data.
- threshold_ : float
Score threshold based on contamination.
- labels_ : numpy array of shape (n_samples,)
Binary labels (0: inlier, 1: outlier).
- detectors_ : dict of {str: BaseDetector}
The fitted detectors per modality.
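Score fusion as described above (standardize each modality's scores, then combine per sample) can be sketched in a few lines. `zscore`, `COMBINERS`, and `fuse` are hypothetical names for this illustration; PyOD uses its own combination functions on NumPy arrays, so treat this as a reading aid rather than the library's code.

```python
import statistics


def zscore(scores):
    """Standardize one modality's scores to zero mean and unit variance."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # guard constant scores
    return [(s - mu) / sigma for s in scores]


# Per-sample reducers matching the documented combination names.
COMBINERS = {
    "average": statistics.fmean,
    "maximization": max,
    "median": statistics.median,
}


def fuse(per_modality_scores, combination="average", standardize_scores=True):
    """Combine per-modality score lists into one fused score per sample."""
    columns = list(per_modality_scores.values())
    if standardize_scores:
        columns = [zscore(col) for col in columns]
    combine = COMBINERS[combination]
    return [combine(sample) for sample in zip(*columns)]


scores = {
    "text": [0.1, 0.2, 0.3, 0.9],
    "tabular": [10.0, 12.0, 11.0, 40.0],
}
fused_avg = fuse(scores, combination="average")
fused_max = fuse(scores, combination="maximization")
```

Without standardization, the tabular scores here would dominate the average simply because they are two orders of magnitude larger than the text scores.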
Examples¶
>>> from pyod.models.embedding import EmbeddingOD, MultiModalOD
>>> from pyod.models.knn import KNN
>>> clf = MultiModalOD(modalities={
...     'text': EmbeddingOD(encoder='all-MiniLM-L6-v2', detector='KNN'),
...     'tabular': KNN(),
... })
>>> data = {'text': train_texts, 'tabular': X_train}
>>> clf.fit(data)
>>> scores = clf.decision_function(data)
- decision_function(X)[source]¶
Predict combined anomaly scores for X.
Parameters¶
- X : dict of {str: data}
Maps modality name to test data. A modality value of None means that modality is entirely missing for all test samples; its score is imputed as 0. When standardize_scores=True (default), 0 is the training mean, so the missing modality contributes “average” to the combined score. When standardize_scores=False, 0 is a raw score and may not be neutral; enable standardization for principled missing-data handling. Note that imputation reduces variance in the fused score compared to training, so predict() thresholds may be less calibrated. Use decision_function() and apply custom thresholds for best results with missing modalities.
Returns¶
anomaly_scores : numpy array of shape (n_samples,)
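The effect of a missing modality on the fused score can be seen in a small sketch. It assumes standardization with training statistics and 'average' combination; `fit_stats` and `fused_scores` are hypothetical helpers, not PyOD functions.

```python
import statistics


def fit_stats(train_scores):
    """Training mean and standard deviation for one modality."""
    return statistics.fmean(train_scores), statistics.pstdev(train_scores) or 1.0


def fused_scores(test_scores, stats, n_samples):
    """Average standardized scores; a modality whose value is None contributes 0."""
    columns = []
    for name, (mu, sigma) in stats.items():
        raw = test_scores.get(name)
        if raw is None:
            # After standardization, 0 corresponds to the training mean.
            columns.append([0.0] * n_samples)
        else:
            columns.append([(s - mu) / sigma for s in raw])
    return [statistics.fmean(sample) for sample in zip(*columns)]


stats = {
    "text": fit_stats([0.1, 0.2, 0.3, 0.4]),
    "tabular": fit_stats([10.0, 20.0, 30.0, 40.0]),
}
full = fused_scores({"text": [0.1, 0.4], "tabular": [10.0, 40.0]}, stats, n_samples=2)
partial = fused_scores({"text": [0.1, 0.4], "tabular": None}, stats, n_samples=2)
```

With two modalities and one missing, each fused score is half of what it would be if both agreed, which is the reduced spread that can leave a threshold fitted on full training data poorly calibrated.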
- fit(X, y=None)[source]¶
Fit a detector per modality on the input data.
Parameters¶
- X : dict of {str: data}
Maps modality name to training data. Keys must match the modalities dict.
- y : Ignored
Not used, present for API consistency.
Returns¶
- self : object
Fitted estimator.
- set_predict_proba_request(*, method: bool | None | str = '$UNCHANGED$', return_confidence: bool | None | str = '$UNCHANGED$') MultiModalOD¶
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- method : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for method parameter in predict_proba.
- return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for return_confidence parameter in predict_proba.
Returns¶
- self : object
The updated object.
- set_predict_request(*, return_confidence: bool | None | str = '$UNCHANGED$') MultiModalOD¶
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Parameters¶
- return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for return_confidence parameter in predict.
Returns¶
- self : object
The updated object.