Text and Image Detectors

PyOD’s EmbeddingOD chains foundation model encoders (sentence-transformers, OpenAI, HuggingFace) with any PyOD detector for text and image anomaly detection. The default detector choices follow the rankings reported in NLP-ADBench.

See Layer 1: Text and Image Anomaly Detection for usage.

pyod.models.embedding module

EmbeddingOD and MultiModalOD: Anomaly detection via foundation model embeddings.

EmbeddingOD chains any embedding encoder with any PyOD detector, enabling anomaly detection on text, image, and other non-tabular data through PyOD’s standard API. MultiModalOD extends this to multi-modal data by running separate detectors per modality and fusing their scores.

class pyod.models.embedding.EmbeddingOD(encoder, detector='LUNAR', contamination=0.1, batch_size=32, cache_embeddings=False, reduce_dim=None, standardize=True, random_state=None)

Bases: BaseDetector

Anomaly detection on raw data via embedding + detector pipeline.

Chains any embedding encoder with any PyOD detector. Encode raw data (text, images, or other modalities) into numeric embeddings, then apply outlier detection in the embedding space.

This implements the two-step approach shown to outperform end-to-end methods in NLP-ADBench (Li et al., EMNLP 2025) and TAD-Bench (Cao et al., 2025).

Parameters

encoder : str, BaseEncoder, SentenceTransformer instance, or callable

Embedding encoder. Accepts:

  • Registry shortcut: 'all-MiniLM-L6-v2', 'text-embedding-3-small', 'dinov2-base'

  • HuggingFace model ID: 'sentence-transformers/all-MiniLM-L6-v2'

  • Local filesystem path: '/path/to/local/weights', loaded without any network call and suitable for air-gapped environments.

  • Pre-instantiated SentenceTransformer: passed directly, no reload.

  • BaseEncoder instance

  • Callable: fn(X) -> np.ndarray of shape (n_samples, n_features)

detector : str or BaseDetector, optional (default='LUNAR')

Any PyOD detector. A string resolves to a default-configured instance. The default is LUNAR (the best performer in NLP-ADBench).

contamination : float, optional (default=0.1)

Expected proportion of outliers in the dataset. Must be in (0, 0.5].

batch_size : int, optional (default=32)

Batch size for encoding.

cache_embeddings : bool, optional (default=False)

Cache training embeddings to avoid re-encoding. Recommended for API-based encoders (e.g., OpenAI).

reduce_dim : int or None, optional (default=None)

If set, apply PCA to reduce embedding dimensionality before detection. Recommended when embeddings have more than 1000 dimensions and the detector is distance-based (KNN, LOF).

standardize : bool, optional (default=True)

Apply StandardScaler to embeddings before detection. Matches the preprocessing pipeline in NLP-ADBench.

random_state : int, RandomState instance or None, optional (default=None)

Controls stochastic parts of EmbeddingOD. The seed is forwarded to (a) the dimensionality-reduction PCA when reduce_dim is set (PCA may pick a randomized SVD solver on high-dimensional embeddings) and (b) the string-resolved inner detector when that detector class declares an explicit random_state parameter (e.g., the default 'LUNAR' preset, or 'IForest'). It does NOT control the external encoder’s own inference (e.g., sentence-transformers, DINOv2), which is treated as deterministic given fixed weights. When ADEngine(random_state=...) builds a preset plan, the engine seed flows here automatically.

Attributes

decision_scores_ : numpy array of shape (n_samples,)

Outlier scores of the training data. Higher is more abnormal.

threshold_ : float

Score threshold based on contamination.

labels_ : numpy array of shape (n_samples,)

Binary labels of training data (0: inlier, 1: outlier).

encoder_ : BaseEncoder

The resolved encoder instance.

detector_ : BaseDetector

The resolved and fitted detector instance.

Examples

>>> from pyod.models.embedding import EmbeddingOD
>>> clf = EmbeddingOD(encoder='all-MiniLM-L6-v2', detector='KNN')
>>> clf.fit(train_texts)
>>> scores = clf.decision_function(test_texts)
>>> labels = clf.predict(test_texts)

>>> # Air-gapped: local filesystem weights
>>> clf = EmbeddingOD(encoder='/path/to/local/weights', detector='KNN')
>>> clf.fit(texts)

>>> # Pre-instantiated model (e.g., shared across multiple classifiers)
>>> from sentence_transformers import SentenceTransformer
>>> my_model = SentenceTransformer('all-MiniLM-L6-v2')
>>> clf = EmbeddingOD(encoder=my_model, detector='IForest')
>>> clf.fit(texts)
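The callable form of encoder only has to map a batch of raw items to a 2-D array of shape (n_samples, n_features). The following is a minimal sketch of such a callable, assuming only NumPy; hashed_trigram_encoder and its n_features argument are hypothetical names chosen for illustration, and character-trigram hashing is a toy stand-in for a real embedding model.

```python
import zlib

import numpy as np


def hashed_trigram_encoder(texts, n_features=64):
    """Hypothetical encoder: hash character trigrams into a fixed-width vector."""
    X = np.zeros((len(texts), n_features), dtype=np.float64)
    for i, text in enumerate(texts):
        padded = f"  {text.lower()} "
        for j in range(len(padded) - 2):
            # crc32 is stable across processes, unlike the built-in hash()
            idx = zlib.crc32(padded[j:j + 3].encode("utf-8")) % n_features
            X[i, idx] += 1.0
        # L2-normalise each row so text length does not dominate distances
        norm = np.linalg.norm(X[i])
        if norm > 0:
            X[i] /= norm
    return X


texts = ["order shipped on time", "order arrived on time", "zzzz qqqq ####"]
emb = hashed_trigram_encoder(texts)  # shape (n_samples, n_features)
```

Since the documented contract is fn(X) -> np.ndarray, a function like this could be passed as encoder=hashed_trigram_encoder, with the detector then operating on the returned array.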

decision_function(X)

Predict raw anomaly scores for X.

Parameters

X : list or array-like

Raw input data in the same format as fit().

Returns

anomaly_scores : numpy array of shape (n_samples,)

Anomaly scores. Higher is more abnormal.
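How raw scores relate to threshold_ and the binary labels can be sketched as follows. This is a sketch under an assumption, not a statement of the library's internals: it takes the threshold to be the (1 - contamination) quantile of the training scores and labels anything strictly above it as an outlier. The synthetic scores are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.normal(size=200)  # stand-in for decision_scores_
contamination = 0.1

# Assumed rule: threshold is the (1 - contamination) quantile of training scores
threshold = np.percentile(train_scores, 100 * (1 - contamination))
train_labels = (train_scores > threshold).astype(int)

# New scores are compared against the same, fixed threshold
new_scores = np.array([threshold - 1.0, threshold + 1.0])
new_labels = (new_scores > threshold).astype(int)
```

Under this rule, roughly a contamination fraction of the training samples ends up labelled 1, while the fraction flagged among new samples depends on how their scores are distributed.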

fit(X, y=None)

Fit detector on raw input data.

Encodes X into embeddings, applies preprocessing, then fits the inner detector.

Parameters

X : list or array-like

Raw input data (e.g., list of strings for text, list of PIL Images for images).

y : Ignored

Not used, present for API consistency.

Returns

self : object

Fitted estimator.
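The encode, preprocess, detect sequence can be illustrated with plain NumPy. The sketch below is not the library's implementation: the helper names (fit_pipeline, transform, knn_scores) are hypothetical, the order "standardize, then PCA" is an assumption, PCA is done with a plain SVD, and a k-th nearest neighbour distance stands in for the inner detector. The inputs are assumed to be embeddings that an encoder has already produced.

```python
import numpy as np


def fit_pipeline(emb, reduce_dim=None, standardize=True):
    """Learn preprocessing statistics from training embeddings."""
    state = {"mean": None, "std": None, "pca_mean": None, "components": None}
    Z = emb.astype(np.float64)
    if standardize:
        state["mean"] = Z.mean(axis=0)
        std = Z.std(axis=0)
        state["std"] = np.where(std > 0, std, 1.0)  # guard constant columns
        Z = (Z - state["mean"]) / state["std"]
    if reduce_dim is not None:
        state["pca_mean"] = Z.mean(axis=0)
        # Rows of vt are principal directions, ordered by explained variance
        _, _, vt = np.linalg.svd(Z - state["pca_mean"], full_matrices=False)
        state["components"] = vt[:reduce_dim]
        Z = (Z - state["pca_mean"]) @ state["components"].T
    state["train"] = Z
    return state


def transform(state, emb):
    """Apply the training-time preprocessing to new embeddings."""
    Z = emb.astype(np.float64)
    if state["mean"] is not None:
        Z = (Z - state["mean"]) / state["std"]
    if state["components"] is not None:
        Z = (Z - state["pca_mean"]) @ state["components"].T
    return Z


def knn_scores(state, emb, k=5):
    """Score = distance to the k-th nearest training point (higher = more abnormal)."""
    Z = transform(state, emb)
    d = np.linalg.norm(Z[:, None, :] - state["train"][None, :, :], axis=2)
    return np.sort(d, axis=1)[:, k - 1]


rng = np.random.default_rng(0)
train_emb = rng.normal(size=(100, 16))
test_emb = np.vstack([rng.normal(size=(5, 16)), rng.normal(loc=8.0, size=(1, 16))])

state = fit_pipeline(train_emb, reduce_dim=4)
scores = knn_scores(state, test_emb)
```

The point of the sketch is that the statistics (mean, scale, principal directions) are learned once from the training embeddings and then reused unchanged when scoring new data.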

classmethod for_audio(quality='balanced', **kwargs)

Create an EmbeddingOD configured for audio anomaly detection.

Uses a handcrafted 74-dim acoustic feature encoder (20 MFCC, 12 chroma, and 5 spectral descriptors, each as mean and standard deviation over frames) followed by a classical PyOD detector. This embed-then-detect pattern with classical detectors is competitive on standard audio anomaly detection benchmarks and needs no GPU. Requires pyod[audio] (librosa, soundfile).

Input clips may be file paths, waveform arrays, or (waveform, sample_rate) tuples.

Parameters

quality : str, optional (default='balanced')
  • ‘fast’: handcrafted features + IForest.

  • ‘balanced’: handcrafted features + KNN.

  • ‘best’: handcrafted features + LUNAR (requires torch).

**kwargs

Override any EmbeddingOD parameter.

Returns

clf : EmbeddingOD
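The 74-dimensional figure follows from pooling 20 + 12 + 5 = 37 frame-level descriptors with two statistics each. A small sketch of that pooling step, assuming the frame-level features are already available as a (n_descriptors, n_frames) array; random values stand in for real acoustic features, pool_frames is a hypothetical helper, and the [means, stds] layout is an assumption about ordering.

```python
import numpy as np

N_MFCC, N_CHROMA, N_SPECTRAL = 20, 12, 5


def pool_frames(frame_features):
    """Collapse (n_descriptors, n_frames) into one vector of [means, stds]."""
    return np.concatenate([frame_features.mean(axis=1), frame_features.std(axis=1)])


rng = np.random.default_rng(0)
n_frames = 120  # depends on clip length and hop size in practice
frames = rng.normal(size=(N_MFCC + N_CHROMA + N_SPECTRAL, n_frames))
clip_vector = pool_frames(frames)  # one fixed-length vector per clip
```

Pooling over frames is what makes clips of different durations comparable: every clip maps to a vector of the same length regardless of n_frames.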

classmethod for_image(quality='balanced', **kwargs)

Create an EmbeddingOD configured for image anomaly detection.

Configurations are informed by AnomalyDINO (WACV 2025).

Parameters

quality : str, optional (default='balanced')
  • ‘fast’: DINOv2-small (384d) + KNN.

  • ‘balanced’: DINOv2-base (768d) + LOF.

  • ‘best’: DINOv2-large (1024d) + KNN.

**kwargs

Override any EmbeddingOD parameter.

Returns

clf : EmbeddingOD

classmethod for_text(quality='balanced', **kwargs)

Create an EmbeddingOD configured for text anomaly detection.

Configurations are informed by NLP-ADBench (EMNLP 2025).

Parameters

quality : str, optional (default='balanced')
  • ‘fast’: MiniLM encoder (384d) + KNN. No API key needed.

  • ‘balanced’: mpnet encoder (768d) + LUNAR. No API key needed.

  • ‘best’: OpenAI large (3072d) + LUNAR. Requires API key.

**kwargs

Override any EmbeddingOD parameter.

Returns

clf : EmbeddingOD

predict_proba(X, method='linear', return_confidence=False)

Predict the probability of a sample being an outlier.

Overrides the base implementation to handle list inputs (raw data such as text or images) which do not have a .shape attribute.

Parameters

X : list or array-like

Raw input data in the same format as fit().

method : str, optional (default='linear')

Probability conversion method. One of 'linear' or 'unify'.

return_confidence : boolean, optional (default=False)

If True, also return the confidence of prediction.

Returns

outlier_probability : numpy array of shape (n_samples, n_classes)
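For intuition about the 'linear' option, the sketch below assumes it means min-max scaling of new scores against the range of the training scores, clipped to [0, 1], with the two output columns ordered as [inlier, outlier]. Both the scaling rule and the column order are assumptions made for illustration, and linear_proba is a hypothetical helper rather than a library function.

```python
import numpy as np


def linear_proba(train_scores, test_scores):
    """Min-max scale test scores by the training range (assumes max > min)."""
    lo, hi = train_scores.min(), train_scores.max()
    p_outlier = np.clip((test_scores - lo) / (hi - lo), 0.0, 1.0)
    # Column 0: inlier probability, column 1: outlier probability
    return np.column_stack([1.0 - p_outlier, p_outlier])


train_scores = np.array([0.0, 1.0, 2.0, 4.0])
test_scores = np.array([-1.0, 1.0, 3.0, 10.0])
proba = linear_proba(train_scores, test_scores)
```

Scores outside the training range are clipped, so a test score far above the training maximum saturates at an outlier probability of 1.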

set_predict_proba_request(*, method: bool | None | str = '$UNCHANGED$', return_confidence: bool | None | str = '$UNCHANGED$') -> EmbeddingOD

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

method : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for method parameter in predict_proba.

return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for return_confidence parameter in predict_proba.

Returns

self : object

The updated object.

set_predict_request(*, return_confidence: bool | None | str = '$UNCHANGED$') -> EmbeddingOD

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for return_confidence parameter in predict.

Returns

self : object

The updated object.

class pyod.models.embedding.MultiModalOD(modalities, combination='average', contamination=0.1, standardize_scores=True)

Bases: BaseDetector

Multi-modal anomaly detection via score fusion.

Runs a separate detector per modality and combines their anomaly scores. Each modality can use a different detector and encoder. Score combination uses PyOD’s existing combination functions.

This is complementary to using MultiModalEncoder with EmbeddingOD (early/feature fusion). Score fusion is preferred when modalities have very different characteristics or when per-modality anomaly scores are independently meaningful.

Parameters

modalities : dict of {str: BaseDetector}

Maps modality name to a detector. Each detector can be:

  • An EmbeddingOD instance (for text/image modalities)

  • Any BaseDetector instance (for tabular modalities)

combination : str, optional (default='average')

Score combination method. One of 'average', 'maximization', 'median'.

contamination : float, optional (default=0.1)

Expected proportion of outliers. Used for threshold and labels on the combined scores.

standardize_scores : bool, optional (default=True)

Standardize per-modality scores to zero mean and unit variance before combination. Recommended when detectors produce scores on different scales.

Attributes

decision_scores_ : numpy array of shape (n_samples,)

Combined outlier scores of the training data.

threshold_ : float

Score threshold based on contamination.

labels_ : numpy array of shape (n_samples,)

Binary labels (0: inlier, 1: outlier).

detectors_ : dict of {str: BaseDetector}

The fitted detectors per modality.

Examples

>>> from pyod.models.embedding import EmbeddingOD, MultiModalOD
>>> from pyod.models.knn import KNN
>>> clf = MultiModalOD(modalities={
...     'text': EmbeddingOD(encoder='all-MiniLM-L6-v2', detector='KNN'),
...     'tabular': KNN(),
... })
>>> data = {'text': train_texts, 'tabular': X_train}
>>> clf.fit(data)
>>> scores = clf.decision_function(data)

decision_function(X)

Predict combined anomaly scores for X.

Parameters

X : dict of {str: data}

Maps modality name to test data. A modality value of None means that modality is entirely missing for all test samples; its score is imputed as 0. When standardize_scores=True (default), 0 is the training mean, so the missing modality contributes “average” to the combined score. When standardize_scores=False, 0 is a raw score and may not be neutral; enable standardization for principled missing-data handling. Note that imputation reduces variance in the fused score compared to training, so predict() thresholds may be less calibrated. Use decision_function() and apply custom thresholds for best results with missing modalities.

Returns

anomaly_scores : numpy array of shape (n_samples,)
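The effect of imputing a missing modality with 0 can be seen in a small NumPy sketch of standardize-then-average fusion. The helpers fit_fusion and fuse are hypothetical and only mirror the behaviour described above (z-scoring with training statistics, 'average' combination, 0 for a modality passed as None); they are not the library's code.

```python
import numpy as np


def fit_fusion(train_scores):
    """Record per-modality mean and std of the training scores."""
    stats = {}
    for name, s in train_scores.items():
        std = s.std()
        stats[name] = (s.mean(), std if std > 0 else 1.0)
    return stats


def fuse(stats, scores, n_samples):
    """Average z-scored modality scores; a None modality contributes 0."""
    columns = []
    for name, (mean, std) in stats.items():
        s = scores.get(name)
        if s is None:
            # 0 equals the training mean after standardization
            columns.append(np.zeros(n_samples))
        else:
            columns.append((np.asarray(s, dtype=float) - mean) / std)
    return np.mean(np.column_stack(columns), axis=1)


train = {"text": np.array([1.0, 2.0, 3.0]), "tabular": np.array([10.0, 20.0, 30.0])}
stats = fit_fusion(train)

full = fuse(stats, {"text": np.array([3.0]), "tabular": np.array([30.0])}, n_samples=1)
partial = fuse(stats, {"text": np.array([3.0]), "tabular": None}, n_samples=1)
```

Here the sample is equally abnormal in both modalities, yet dropping one modality halves its fused score. This shrinkage toward 0 is why a threshold calibrated on fully observed training data can be too strict when a modality is missing.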

fit(X, y=None)

Fit a detector per modality on the input data.

Parameters

X : dict of {str: data}

Maps modality name to training data. Keys must match the modalities dict.

y : Ignored

Not used, present for API consistency.

Returns

self : object

Fitted estimator.

set_predict_proba_request(*, method: bool | None | str = '$UNCHANGED$', return_confidence: bool | None | str = '$UNCHANGED$') -> MultiModalOD

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

method : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for method parameter in predict_proba.

return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for return_confidence parameter in predict_proba.

Returns

self : object

The updated object.

set_predict_request(*, return_confidence: bool | None | str = '$UNCHANGED$') -> MultiModalOD

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters

return_confidence : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for return_confidence parameter in predict.

Returns

self : object

The updated object.