Layer 3: Agentic Investigation

PyOD 3’s od-expert skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality diagnostics, adaptive escalation, and iteration on user feedback, all without requiring the user to be an outlier detection (OD) expert.

[Figure: PyOD 3 agentic investigation demo on a diabetes screening dataset]

The demo is a 4-turn agentic conversation on a diabetes screening dataset (768 patients, 8 clinical features; shipped with PyOD as examples/data/pima.csv). The dark callouts alongside the agent’s turns show the od-expert skill’s live decision-making: modality triage, top-10 pitfall checks, and the 11 adaptive escalation triggers.


What the od-expert skill encodes

In v3.2.0, the skill grew from a 78-line API reference to roughly 1000 lines of expert content split across an always-loaded SKILL.md and six on-demand reference files. The content encodes:

  • A master decision tree that routes the user’s data to one of five modalities (tabular, time series, graph, text, image) based on observable properties.

  • Top-10 critical pitfalls, always in the agent’s working memory: unscaled features for distance-based detectors, contamination mismatches, deep learning on tiny data, missing optional extras like pyod[graph], raw-score reporting, single-detector runs, and so on. The agent walks each one before calling engine.run.

  • 11 adaptive escalation triggers that decide when the agent proceeds autonomously and when it pauses to ask the user. Examples: modality ambiguity (T1), contamination uncertainty (T2), detector disagreement post-run (T3), labels mentioned but not provided (T5), and high-stakes domain hints (T8 – medical, fraud, safety). If no trigger fires, the agent runs end-to-end without interruption.

  • On-demand reference files for each modality (references/tabular.md, references/time_series.md, references/graph.md, references/text_image.md, plus references/workflow.md and references/pitfalls.md). The agent loads them when the modality decision routes there.

  • A KB-derived detector list for each modality, refreshed from pyod.utils.knowledge by scripts/regen_skill.py at build time. Every detector name in the skill is mechanically validated against the live KB by a CI safety net test (pyod/test/test_skill_kb_consistency.py) so drift fails the build.

  • An API safety net (pyod/test/test_skill_api_refs.py) that walks ADEngine and InvestigationState via a live dry run and validates every state.X / state.X['key'] / engine.X reference in the skill content. Added in v3.2.1 after a regression that shipped invented API names in v3.2.0.

See the skill maintenance methodology guide for the full pattern and for how to add a new skill.
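
As a rough illustration of the routing idea, the sketch below asks the modality questions in the order the agent walks them (timestamps, graph structure, text/image, otherwise tabular). The DataTraits class, its field names, the check order, and route_modality are hypothetical and are not part of PyOD or the skill files; only the modality names and reference file paths come from the list above.

```python
from dataclasses import dataclass


@dataclass
class DataTraits:
    """Hypothetical summary of observable properties of a dataset."""
    has_timestamps: bool = False
    has_graph_structure: bool = False
    is_text: bool = False
    is_image: bool = False


# Reference file per modality, as listed in the skill description.
REFERENCE_FILES = {
    "time_series": "references/time_series.md",
    "graph": "references/graph.md",
    "text": "references/text_image.md",
    "image": "references/text_image.md",
    "tabular": "references/tabular.md",
}


def route_modality(traits: DataTraits) -> str:
    """Return one of the five modalities; tabular is the fallback."""
    if traits.has_timestamps:
        return "time_series"
    if traits.has_graph_structure:
        return "graph"
    if traits.is_text:
        return "text"
    if traits.is_image:
        return "image"
    return "tabular"


# A dataset with no timestamps, no graph, and no text/image content.
pima_traits = DataTraits()
modality = route_modality(pima_traits)
reference = REFERENCE_FILES[modality]
print(modality, reference)
```

In the real skill the decision is made by the agent reading SKILL.md, not by a Python function; the sketch only makes the branching explicit.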


How It Works

When a user asks about anomalies in their data, PyOD’s od-expert skill auto-activates based on intent keywords. The agent then:

  1. Walks the master decision tree – timestamps, graph structure, text/image, or tabular? – then loads the matching references/<modality>.md.

  2. Walks the top-10 pitfall checklist – is any pitfall active for this data? Example: feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors) and the agent recommends a pre-scaling step or flags it in the report.

  3. Walks the 11 escalation triggers – does anything about the request call for a pause? Example: “medical screening” fires Trigger 8 (high-stakes domain) and the agent commits to dual-detector validation and a confidence caveat.

  4. Selects detectors – calls engine.plan(state) to pick the top 3 from PyOD’s 60-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). The benchmark ranks seed the plan; the agent may override them based on its own judgment or the user’s constraints. Each plan entry in state.plans has detector_name, confidence, reason, and evidence.

  5. Runs in parallel – executes all selected detectors and builds a rank-normalized consensus in state.consensus.

  6. Re-walks a subset of triggers post-run – detector disagreement (T3), cutoff instability (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates.

  7. Generates a report – Markdown or JSON, always including a “what I assumed and why” block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced.

The agent’s decisions at each of these steps are visible in the interactive demo’s dark “od-expert” panels.
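
Step 5 can be pictured with a small plain-Python sketch. It shows one common way to build a rank-normalized consensus: convert each detector's raw scores to ranks in (0, 1], average the ranks per sample, and flag the top fraction given by the contamination rate. This is an assumption about the general technique, not PyOD's actual implementation; the helper names, the tie handling, and the toy scores are invented for illustration.

```python
def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank; tied scores share their average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the group of samples tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def build_consensus(score_lists, contamination):
    """Average rank-normalized scores and flag the top `contamination` fraction."""
    normalized = [rank_normalize(s) for s in score_lists]
    n = len(score_lists[0])
    combined = [sum(col[i] for col in normalized) / len(normalized)
                for i in range(n)]
    n_flag = max(1, round(contamination * n))
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:n_flag]
    flagged = set(top)
    labels = [1 if i in flagged else 0 for i in range(n)]
    return combined, labels


# Toy scores from three detectors on five samples (made-up values).
# The raw scales differ, which is why ranks are averaged instead of raw scores.
knn_scores = [0.10, 0.20, 0.15, 3.00, 0.12]
iforest_scores = [0.45, 0.50, 0.48, 0.90, 0.47]
lof_scores = [1.00, 1.10, 1.05, 9.00, 1.20]

scores, labels = build_consensus(
    [knn_scores, iforest_scores, lof_scores], contamination=0.2
)
print(labels)
```

Working on ranks keeps a detector with a wide score range from dominating the average, which is the usual motivation for rank normalization before combining detectors.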


Activation Paths

PyOD 3 reaches agents through three paths. Pick whichever matches your stack:

Claude Code / Claude Desktop / Codex

The od-expert skill ships as package data inside the pyod wheel. Two install modes are supported:

pip install pyod

# Claude Code / Claude Desktop: user-global install
pyod install skill              # installs to ~/.claude/skills/od-expert/

# Codex: project-local install (Codex reads ./skills/, has no user-global dir)
pyod install skill --project    # installs to ./skills/od-expert/

After installing, run pyod info to confirm the skill is detected and to see which agent stacks are present (pyod info checks both ~/.claude/ and ~/.codex/ and recommends the right install command). The legacy pyod-install-skill command from v3.0.0 still works as an alias for pyod install skill.

MCP-compatible agents

Install the optional MCP extra, then launch the server:

pip install pyod[mcp]
pyod mcp serve              # alias for: python -m pyod.mcp_server

Any MCP-compatible LLM picks up the tools based on intent.

Python apps / custom agents

Import and call directly: from pyod.utils.ad_engine import ADEngine. The agent calls engine.investigate(data) and follows state.next_action at each step.


Conversation Walkthrough

Dataset: Diabetes screening dataset, 768 patients with 8 clinical features (glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age, pregnancies). Shipped with PyOD at examples/data/pima.csv.

Why this dataset? It exercises the skill’s machinery: the feature scale ratio is ~347x (Pitfall 1 fires), and the user’s “medical screening” context fires Trigger 8 (high-stakes domain). On a happy-path dataset the side callouts would be mostly empty; on this data, they catch real problems.
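
The two checks that fire on this dataset can be approximated in a few lines. The sketch below assumes the "feature scale ratio" is the largest per-feature standard deviation divided by the smallest (the walkthrough reports it as a std ratio), and it treats Trigger 8 as a keyword match on the domains named earlier. The function names, the use of the population standard deviation, and the keyword list are assumptions for illustration; the skill's real checks are prose instructions followed by the agent.

```python
import statistics

SCALE_RATIO_THRESHOLD = 100  # threshold quoted in the Pitfall 1 example
HIGH_STAKES_HINTS = ("medical", "fraud", "safety")  # domains named for T8


def std_ratio(rows):
    """Largest per-feature standard deviation divided by the smallest."""
    columns = list(zip(*rows))
    stds = [statistics.pstdev(col) for col in columns]
    nonzero = [s for s in stds if s > 0]  # constant columns carry no scale
    if not nonzero:
        return 1.0
    return max(nonzero) / min(nonzero)


def pre_run_checks(rows, user_message):
    """Return a list of human-readable findings for the report caveats."""
    findings = []
    ratio = std_ratio(rows)
    if ratio > SCALE_RATIO_THRESHOLD:
        findings.append(f"Pitfall 1: std ratio {ratio:.0f}x")
    text = user_message.lower()
    if any(hint in text for hint in HIGH_STAKES_HINTS):
        findings.append("Trigger T8: high-stakes domain")
    return findings


# Made-up rows with two features on very different scales.
toy_rows = [(0.2, 100.0), (0.4, 300.0), (0.6, 500.0), (0.8, 700.0)]
findings = pre_run_checks(toy_rows, "This is a medical screening context.")
for finding in findings:
    print(finding)
```

On well-scaled data with a neutral request, the function returns an empty list, which corresponds to the mostly empty side callouts described above.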

Turn 1. User asks for screening:

User: I have a diabetes screening dataset
      -- 768 patients with 8 clinical features (glucose,
      blood pressure, BMI, insulin, age, etc.). Find
      unusual patient profiles that might need follow-up.
      This is a medical screening context.

[od-expert skill: pre-run decisions]
  Master tree     -> tabular (no timestamps, no graph, no text/image)
  Pitfall #1      -> FIRES: std ratio 347x, note as report caveat
  Pitfalls #2-10  -> all pass
  Trigger T8      -> FIRES: medical screening; dual-detector validation
  Other triggers  -> not applicable
  Plan            -> KNN (0.85), IForest (0.85), LOF (0.80)

Agent: Profiled as tabular, low-dim, n=768 x 8. Running
       KNN + IForest + LOF from the ADBench top-5 for
       low-dim small datasets. Scale mismatch noted for
       the final report.

       Results: 62 flagged (8.1%), agreement 0.59
       (label-free; separation and the quality verdict
       are descriptive only). Top case: patient #13.
       KNN strongest individually.

Behind the scenes:

import numpy as np
from pyod.utils.ad_engine import ADEngine

np.random.seed(42)
data = np.genfromtxt('examples/data/pima.csv',
                     delimiter=',', skip_header=1)
X = data[:, :-1]

engine = ADEngine()
state = engine.investigate(X)
# state.plans[0]['detector_name']          == 'KNN'
# state.analysis['best_detector']          == 'KNN'
# state.analysis['consensus_analysis']['n_anomalies']    == 62
# state.analysis['consensus_analysis']['anomaly_ratio']  == 0.081
# state.quality['separation']              == 0.956
# state.quality['agreement']               == 0.592
# state.quality['stability']               (cutoff-gap value, data-dependent)
# state.quality['overall']                 == 0.787
# state.quality['verdict']                 == 'high'

Turn 2. User asks what is driving the flags:

User: What clinical features are driving the flags?

Agent: Analyzing patient #13 via KNN (best detector).
       engine.explain_findings returns absolute z-score
       magnitudes (|z|); raw values are cross-checked
       manually to show direction:
         insulin       |z|=6.65  (846 uU/mL)
         age           |z|=2.19  (59 yrs)
         glucose       |z|=2.13  (189 mg/dL)
         pregnancies   |z|=0.84  (1)
       Metabolic-syndrome profile: extreme insulin,
       elevated glucose, older age.

Behind the scenes:

best_idx = state.analysis['best_detector_index']
best_result = state.results[best_idx]
explanations = engine.explain_findings(best_result, X=X, top_k=5)
# explanations[0]['contributing_features'][0] ==
#   {'feature': 4, 'z_score': 6.65}  # insulin (absolute magnitude)
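
For intuition about what an absolute z-score ranking looks like, here is a minimal stand-alone sketch. It is not the implementation of engine.explain_findings; it only assumes that each feature is compared against its column mean and population standard deviation, and it reuses the 'feature' / 'z_score' keys shown in the comment above. The function name and the toy rows are invented.

```python
import statistics


def top_abs_z(rows, row_index, top_k=3):
    """Rank the features of one row by |z| against the column mean and std."""
    columns = list(zip(*rows))
    contributions = []
    for feature, col in enumerate(columns):
        std = statistics.pstdev(col)
        if std == 0:
            continue  # a constant feature cannot explain an anomaly
        z = abs(rows[row_index][feature] - statistics.fmean(col)) / std
        contributions.append({"feature": feature, "z_score": round(z, 2)})
    contributions.sort(key=lambda c: c["z_score"], reverse=True)
    return contributions[:top_k]


# Made-up data: the last row is extreme in feature 1 and typical in feature 0.
toy_rows = [(1.0, 10.0), (2.0, 11.0), (3.0, 9.0), (2.0, 50.0)]
explanation = top_abs_z(toy_rows, row_index=3)
print(explanation)
```

Because |z| discards the sign, the raw value still has to be inspected to tell whether a feature is unusually high or unusually low, which is why the agent cross-checks raw values in Turn 2.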

Turn 3. User reveals labels exist. Trigger 5 fires:

User: I should mention -- I actually have diabetes
      diagnosis labels for the full cohort.

[od-expert skill: new turn, re-evaluating triggers]
  Trigger T5 -> FIRES: labels mentioned but the session
                is on the unsupervised path
  Recommended switch -> drop out of ADEngine session and
                        call XGBOD directly (pyod[xgboost])

Agent: That changes the plan. The ADEngine session API is
       unsupervised in v3.2.x. For labelled data, supervised
       detection (XGBOD) typically lifts AUC 10-20 points.
       I would switch to the classic fit / predict path.

Behind the scenes:

# ADEngine session is unsupervised in v3.2.x; for labels,
# use the classic fit / predict path directly:
from pyod.models.xgbod import XGBOD

y_labels = data[:, -1]             # last column of pima.csv, assumed to hold the diagnosis labels
clf = XGBOD()
clf.fit(X, y_labels)               # supervised training
scores = clf.decision_function(X)  # anomaly scores
labels_pred = clf.predict(X)       # binary labels

Turn 4. User asks for the unsupervised report for now:

report = engine.report(state, format='text')
# Includes the quality bars, the selected detectors, the
# best detector, and an explicit "assumptions and caveats"
# block citing the scale mismatch (Pitfall 1), the
# observed anomaly ratio, the high-stakes caveat (Trigger
# 8), and the label-availability note (Trigger 5).

# Or JSON for programmatic consumption:
report_dict = engine.report(state, format='json')

The Session API

The agentic workflow is built on ADEngine’s session API. Each method advances the investigation and sets state.next_action to guide the agent:

Method               Purpose
start(X)             Profile data, return InvestigationState
plan(state)          Select top-N detectors, populate state.plans
run(state)           Execute all detectors, fill state.consensus
analyze(state)       Populate state.quality and state.analysis
iterate(state, fb)   Adjust plan based on structured or natural-language feedback
report(state)        Generate Markdown or JSON report
investigate(X)       One-shot: start + plan + run + analyze

Key state fields the agent reads:

  • state.profile – dict with data_type, n_samples, n_features, has_nan, dtype, dimensionality_class.

  • state.plans – list of plan dicts with detector_name, confidence, reason, evidence, alternatives.

  • state.consensus – dict with scores, labels, n_detectors, agreement.

  • state.quality – dict with separation, agreement, stability, overall, verdict, explanation.

  • state.analysis – dict with consensus_analysis (containing n_anomalies, anomaly_ratio, top_anomalies, etc.), best_detector, best_detector_index, per_detector_analysis.

  • state.next_action – dict with action in {report_to_user, iterate, confirm_with_user}, plus reason, summary, and sometimes suggestion / proposed_change.
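
A custom agent loop might branch on state.next_action roughly as follows. This is a hedged sketch that operates on plain dicts shaped like the description above; the handler name, the returned (step, message) pairs, and the example values are hypothetical, and the real types of suggestion and proposed_change are not specified here.

```python
def handle_next_action(next_action):
    """Turn a next_action dict into a (step, message) pair for the agent."""
    action = next_action["action"]
    reason = next_action.get("reason", "")
    if action == "report_to_user":
        return "report", next_action.get("summary", "")
    if action == "iterate":
        # 'suggestion' is only sometimes present, so fall back to the reason.
        return "iterate", next_action.get("suggestion", reason)
    if action == "confirm_with_user":
        change = next_action.get("proposed_change")
        return "ask", f"Confirm before applying: {change} ({reason})"
    raise ValueError(f"unknown action: {action!r}")


# Made-up example payloads, not real ADEngine output.
done = {"action": "report_to_user", "reason": "quality is high",
        "summary": "62 flagged"}
pause = {"action": "confirm_with_user", "reason": "high-stakes domain",
         "proposed_change": {"action": "adjust_contamination", "value": 0.05}}

step_done, message_done = handle_next_action(done)
step_pause, message_pause = handle_next_action(pause)
print(step_done, message_done)
print(step_pause, message_pause)
```

Raising on an unknown action keeps the loop from silently ignoring an action added in a later release.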

Feedback to iterate(state, fb) can be structured (dict) or natural language (string):

# Structured (executes immediately)
engine.iterate(state, {"action": "adjust_contamination", "value": 0.05})
engine.iterate(state, {"action": "exclude", "detectors": ["IForest"]})
engine.iterate(state, {"action": "include", "detectors": ["ECOD"]})
engine.iterate(state, {"action": "rerun"})

# Natural language (parsed to action, may need confirmation)
engine.iterate(state, "too many false positives")
engine.iterate(state, "try without IForest")
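
Before passing structured feedback to iterate, a caller may want to check its shape. The validator below is a sketch built only from the four actions shown in the examples above; the real engine may accept more actions or different keys. The function name is hypothetical, and the (0, 0.5] contamination bound mirrors the range used by classic PyOD detectors, which is assumed here to apply to the session API as well.

```python
# Required keys per action, inferred from the examples in this section.
REQUIRED_KEYS = {
    "adjust_contamination": {"value"},
    "exclude": {"detectors"},
    "include": {"detectors"},
    "rerun": set(),
}


def classify_feedback(fb):
    """Return 'natural_language' or 'structured'; raise on malformed dicts."""
    if isinstance(fb, str):
        return "natural_language"
    action = fb.get("action")
    if action not in REQUIRED_KEYS:
        raise ValueError(f"unknown action: {action!r}")
    missing = REQUIRED_KEYS[action] - set(fb)
    if missing:
        raise ValueError(f"{action} is missing keys: {sorted(missing)}")
    if action == "adjust_contamination" and not 0 < fb["value"] <= 0.5:
        raise ValueError("contamination is assumed to lie in (0, 0.5]")
    return "structured"


kind_dict = classify_feedback({"action": "exclude", "detectors": ["IForest"]})
kind_text = classify_feedback("try without IForest")
print(kind_dict, kind_text)
```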

Why This Is Different

Without PyOD 3, an AI agent wrapping a library like scikit-learn would:

  1. Pick one detector (probably the wrong one).

  2. Run it once with default parameters.

  3. Return raw scores without quality assessment.

  4. Rely on the LLM to interpret results from first principles.

With PyOD 3 and the v3.2.0 od-expert skill, the agent:

  1. Walks a master decision tree to pick the right modality and detector family.

  2. Walks a top-10 pitfall checklist before running, catching problems like scale mismatch or DL on tiny data.

  3. Walks 11 adaptive escalation triggers to decide when to pause and ask vs. run autonomously.

  4. Selects detectors via benchmark-backed routing (ADBench, TSB-AD, BOND).

  5. Runs top-3 in parallel and builds rank-normalized consensus.

  6. Re-checks quality-related triggers post-run and hedges the report accordingly.

  7. Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.

The agent follows an OD expert’s workflow through the library, not despite it.