Layer 3: Agentic Investigation¶
PyOD 3’s od-expert skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality diagnostics, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.
4-turn agentic conversation on a diabetes screening dataset
(768 patients, 8 clinical features; shipped with PyOD as
examples/data/pima.csv). The dark callouts alongside the agent’s
turns show the od-expert skill’s live decision-making: modality
triage, top-10 pitfall checks, and the 11 adaptive escalation triggers.¶
Runnable script: agentic_example.py
Interactive demo: agentic_demo.html (open in a browser for the full visual walkthrough with skill decision panels)
What the od-expert skill encodes¶
In v3.2.0, the skill grew from a 78-line API reference to roughly 1000 lines of expert content split across an always-loaded SKILL.md and six on-demand reference files. The content encodes:
A master decision tree that routes the user’s data to one of five modalities (tabular, time series, graph, text, image) based on observable properties.
Top-10 critical pitfalls, always in the agent’s working memory: unscaled features for distance-based detectors, contamination mismatches, deep learning on tiny data, missing optional extras like
pyod[graph], raw-score reporting, single-detector runs, and so on. The agent walks each one before callingengine.run.11 adaptive escalation triggers that decide when the agent proceeds autonomously vs. when it pauses to ask the user. Examples: modality ambiguity (T1), contamination uncertainty (T2), detector disagreement post-run (T3), high-stakes domain hints (T8 – medical, fraud, safety), labels mentioned but not provided (T5). If none of the triggers fire, the agent runs end-to-end without interruption.
On-demand reference files for each modality (
references/tabular.md,references/time_series.md,references/graph.md,references/text_image.md, plusreferences/workflow.mdandreferences/pitfalls.md). The agent loads them when the modality decision routes there.A KB-derived detector list for each modality, refreshed from
pyod.utils.knowledgebyscripts/regen_skill.pyat build time. Every detector name in the skill is mechanically validated against the live KB by a CI safety net test (pyod/test/test_skill_kb_consistency.py) so drift fails the build.An API safety net (
pyod/test/test_skill_api_refs.py) that walksADEngineandInvestigationStatevia a live dry run and validates everystate.X/state.X['key']/engine.Xreference in the skill content. Added in v3.2.1 after a regression that shipped invented API names in v3.2.0.
See the skill maintenance methodology guide for the full pattern and for how to add a new skill.
How It Works¶
When a user asks about anomalies in their data, PyOD’s od-expert skill auto-activates based on intent keywords. The agent then:
Walks the master decision tree – timestamps, graph structure, text/image, or tabular? Load the matching
references/<modality>.md.Walks the top-10 pitfall checklist – is any pitfall active for this data? Example: feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors) and the agent recommends a pre-scaling step or flags it in the report.
Walks the 11 escalation triggers – does anything about the request call for a pause? Example: “medical screening” fires Trigger 8 (high-stakes domain) and the agent commits to dual-detector validation and a confidence caveat.
Selects detectors – calls
engine.plan(state)to pick the top-3 from PyOD’s 60-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). The benchmark ranks seed the plan; the agent may override them from its own judgment or the user’s constraints. Each plan entry instate.planshasdetector_name,confidence,reason,evidence.Runs in parallel – executes all selected detectors and builds a rank-normalized consensus in
state.consensus.Re-walks a subset of triggers post-run – detector disagreement (T3), cutoff instability (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates.
Generates a report – Markdown or JSON, always including a “what I assumed and why” block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced.
The agent’s decisions at each of these steps are visible in the interactive demo’s dark “od-expert” panels.
Activation Paths¶
PyOD 3 reaches agents through three paths. Pick whichever matches your stack:
- Claude Code / Claude Desktop / Codex
The
od-expertskill ships as package data inside the pyod wheel. Two install modes are supported:pip install pyod # Claude Code / Claude Desktop: user-global install pyod install skill # installs to ~/.claude/skills/od-expert/ # Codex: project-local install (Codex reads ./skills/, has no user-global dir) pyod install skill --project # installs to ./skills/od-expert/
After installing, run
pyod infoto confirm the skill is detected and to see which agent stacks are present (pyod infochecks both~/.claude/and~/.codex/and recommends the right install command). The legacypyod-install-skillcommand from v3.0.0 still works as an alias forpyod install skill.- MCP-compatible agents
Install the optional MCP extra, then launch the server:
pip install pyod[mcp] pyod mcp serve # alias for ``python -m pyod.mcp_server``
Any MCP-compatible LLM picks up the tools based on intent.
- Python apps / custom agents
Import and call directly:
from pyod.utils.ad_engine import ADEngine. The agent callsengine.investigate(data)and followsstate.next_actionat each step.
Conversation Walkthrough¶
Dataset: Diabetes screening dataset, 768 patients with 8 clinical features (glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age, pregnancies). Shipped with PyOD at examples/data/pima.csv.
Why this dataset? It exercises the skill’s machinery: the feature scale ratio is ~347x (Pitfall 1 fires), and the user’s “medical screening” context fires Trigger 8 (high-stakes domain). On a happy-path dataset the side callouts would be mostly empty; on this data, they catch real problems.
Turn 1. User asks for screening:
User: I have a diabetes screening dataset
-- 768 patients with 8 clinical features (glucose,
blood pressure, BMI, insulin, age, etc.). Find
unusual patient profiles that might need follow-up.
This is a medical screening context.
[od-expert skill: pre-run decisions]
Master tree -> tabular (no timestamps, no graph, no text/image)
Pitfall #1 -> FIRES: std ratio 347x, note as report caveat
Pitfalls #2-10 -> all pass
Trigger T8 -> FIRES: medical screening; dual-detector validation
Other triggers -> not applicable
Plan -> KNN (0.85), IForest (0.85), LOF (0.80)
Agent: Profiled as tabular, low-dim, n=768 x 8. Running
KNN + IForest + LOF from the ADBench top-5 for
low-dim small datasets. Scale mismatch noted for
the final report.
Results: 62 flagged (8.1%), agreement 0.59
(label-free; separation and the quality verdict
are descriptive only). Top case: patient #13.
KNN strongest individually.
Behind the scenes:
import numpy as np
from pyod.utils.ad_engine import ADEngine
np.random.seed(42)
data = np.genfromtxt('examples/data/pima.csv',
delimiter=',', skip_header=1)
X = data[:, :-1]
engine = ADEngine()
state = engine.investigate(X)
# state.plans[0]['detector_name'] == 'KNN'
# state.analysis['best_detector'] == 'KNN'
# state.analysis['consensus_analysis']['n_anomalies'] == 62
# state.analysis['consensus_analysis']['anomaly_ratio'] == 0.081
# state.quality['separation'] == 0.956
# state.quality['agreement'] == 0.592
# state.quality['stability'] (cutoff-gap value, data-dependent)
# state.quality['overall'] == 0.787
# state.quality['verdict'] == 'high'
Turn 2. User asks what is driving the flags:
User: What clinical features are driving the flags?
Agent: Analyzing patient #13 via KNN (best detector).
engine.explain_findings returns absolute z-score
magnitudes (|z|); raw values are cross-checked
manually to show direction:
insulin |z|=6.65 (846 uU/mL)
age |z|=2.19 (59 yrs)
glucose |z|=2.13 (189 mg/dL)
pregnancies |z|=0.84 (1)
Metabolic-syndrome profile: extreme insulin,
elevated glucose, older age.
best_idx = state.analysis['best_detector_index']
best_result = state.results[best_idx]
explanations = engine.explain_findings(best_result, X=X, top_k=5)
# explanations[0]['contributing_features'][0] ==
# {'feature': 4, 'z_score': 6.65} # insulin (absolute magnitude)
Turn 3. User reveals labels exist. Trigger 5 fires:
User: I should mention -- I actually have diabetes
diagnosis labels for the full cohort.
[od-expert skill: new turn, re-evaluating triggers]
Trigger T5 -> FIRES: labels mentioned but the session
is on the unsupervised path
Recommended switch -> drop out of ADEngine session and
call XGBOD directly (pyod[xgboost])
Agent: That changes the plan. The ADEngine session API is
unsupervised in v3.2.x. For labelled data, supervised
detection (XGBOD) typically lifts AUC 10-20 points.
I would switch to the classic fit / predict path.
# ADEngine session is unsupervised in v3.2.x; for labels,
# use the classic fit / predict path directly:
from pyod.models.xgbod import XGBOD
clf = XGBOD()
clf.fit(X, y_labels) # supervised training
scores = clf.decision_function(X) # anomaly scores
labels_pred = clf.predict(X) # binary labels
Turn 4. User asks for the unsupervised report for now:
report = engine.report(state, format='text')
# Includes the quality bars, the selected detectors, the
# best detector, and an explicit "assumptions and caveats"
# block citing the scale mismatch (Pitfall 1), the
# observed anomaly ratio, the high-stakes caveat (Trigger
# 8), and the label-availability note (Trigger 5).
# Or JSON for programmatic consumption:
report_dict = engine.report(state, format='json')
The Session API¶
The agentic workflow is built on ADEngine’s session API. Each method advances the investigation and sets state.next_action to guide the agent:
Method |
Purpose |
|---|---|
|
Profile data, return |
|
Select top-N detectors, populate |
|
Execute all detectors, fill |
|
Populate |
|
Adjust plan based on structured or NL feedback |
|
Generate markdown or JSON report |
|
One-shot: |
Key state fields the agent reads:
state.profile– dict withdata_type,n_samples,n_features,has_nan,dtype,dimensionality_class.state.plans– list of plan dicts withdetector_name,confidence,reason,evidence,alternatives.state.consensus– dict withscores,labels,n_detectors,agreement.state.quality– dict withseparation,agreement,stability,overall,verdict,explanation.state.analysis– dict withconsensus_analysis(containingn_anomalies,anomaly_ratio,top_anomalies, etc.),best_detector,best_detector_index,per_detector_analysis.state.next_action– dict withactionin {report_to_user,iterate,confirm_with_user}, plusreason,summary, and sometimessuggestion/proposed_change.
Feedback to iterate(state, fb) can be structured (dict) or natural language (string):
# Structured (executes immediately)
engine.iterate(state, {"action": "adjust_contamination", "value": 0.05})
engine.iterate(state, {"action": "exclude", "detectors": ["IForest"]})
engine.iterate(state, {"action": "include", "detectors": ["ECOD"]})
engine.iterate(state, {"action": "rerun"})
# Natural language (parsed to action, may need confirmation)
engine.iterate(state, "too many false positives")
engine.iterate(state, "try without IForest")
Why This Is Different¶
Without PyOD 3, an AI agent wrapping a library like scikit-learn would:
Pick one detector (probably the wrong one)
Run it once with default parameters
Return raw scores without quality assessment
Rely on the LLM to interpret results from first principles
With PyOD 3 and the v3.2.0 od-expert skill:
Walks a master decision tree to pick the right modality and detector family.
Walks a top-10 pitfall checklist before running, catching problems like scale mismatch or DL on tiny data.
Walks 11 adaptive escalation triggers to decide when to pause and ask vs. run autonomously.
Selects detectors via benchmark-backed routing (ADBench, TSB-AD, BOND).
Runs top-3 in parallel and builds rank-normalized consensus.
Re-checks quality-related triggers post-run and hedges the report accordingly.
Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.
The agent follows an OD expert’s workflow through the library, not despite it.