Layer 3: Agentic Investigation
===============================

PyOD 3's ``od-expert`` skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality diagnostics, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.

.. figure:: ../figs/agentic-demo.png
   :alt: PyOD 3 agentic investigation demo on a diabetes screening dataset
   :align: center
   :width: 720

   4-turn agentic conversation on a diabetes screening dataset (768 patients, 8 clinical features; shipped with PyOD as ``examples/data/pima.csv``). The dark callouts alongside the agent's turns show the ``od-expert`` skill's live decision-making: modality triage, top-10 pitfall checks, and the 11 adaptive escalation triggers.

* **Runnable script**: `agentic_example.py `_
* **Interactive demo**: `agentic_demo.html `_ (open in a browser for the full visual walkthrough with skill decision panels)

----

What the ``od-expert`` skill encodes
------------------------------------

In v3.2.0, the skill grew from a 78-line API reference to roughly 1000 lines of expert content split across an always-loaded ``SKILL.md`` and six on-demand reference files. The content encodes:

* **A master decision tree** that routes the user's data to one of five modalities (tabular, time series, graph, text, image) based on observable properties.
* **Top-10 critical pitfalls**, always in the agent's working memory: unscaled features for distance-based detectors, contamination mismatches, deep learning on tiny data, missing optional extras like ``pyod[graph]``, raw-score reporting, single-detector runs, and so on. The agent walks each one before calling ``engine.run``.
* **11 adaptive escalation triggers** that decide when the agent proceeds autonomously vs. when it pauses to ask the user. Examples: modality ambiguity (T1), contamination uncertainty (T2), detector disagreement post-run (T3), high-stakes domain hints (T8 -- medical, fraud, safety), labels mentioned but not provided (T5). If none of the triggers fire, the agent runs end-to-end without interruption.
* **On-demand reference files** for each modality (``references/tabular.md``, ``references/time_series.md``, ``references/graph.md``, ``references/text_image.md``, plus ``references/workflow.md`` and ``references/pitfalls.md``). The agent loads them when the modality decision routes there.
* **A KB-derived detector list** for each modality, refreshed from ``pyod.utils.knowledge`` by ``scripts/regen_skill.py`` at build time. Every detector name in the skill is mechanically validated against the live KB by a CI safety net test (``pyod/test/test_skill_kb_consistency.py``) so drift fails the build.
* **An API safety net** (``pyod/test/test_skill_api_refs.py``) that walks ``ADEngine`` and ``InvestigationState`` via a live dry run and validates every ``state.X`` / ``state.X['key']`` / ``engine.X`` reference in the skill content. Added in v3.2.1 after a regression that shipped invented API names in v3.2.0.

See :doc:`the skill maintenance methodology guide <../skill_maintenance>` for the full pattern and for how to add a new skill.

----

How It Works
------------

When a user asks about anomalies in their data, PyOD's ``od-expert`` skill auto-activates based on intent keywords. The agent then:

1. **Walks the master decision tree** -- timestamps, graph structure, text/image, or tabular? Load the matching ``references/.md``.
2. **Walks the top-10 pitfall checklist** -- is any pitfall active for this data? Example: feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors) and the agent recommends a pre-scaling step or flags it in the report.
3. **Walks the 11 escalation triggers** -- does anything about the request call for a pause? Example: "medical screening" fires Trigger 8 (high-stakes domain) and the agent commits to dual-detector validation and a confidence caveat.
4. **Selects detectors** -- calls ``engine.plan(state)`` to pick the top-3 from PyOD's 60-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). The benchmark ranks seed the plan; the agent may override them from its own judgment or the user's constraints. Each plan entry in ``state.plans`` has ``detector_name``, ``confidence``, ``reason``, ``evidence``.
5. **Runs in parallel** -- executes all selected detectors and builds a rank-normalized consensus in ``state.consensus``.
6. **Re-walks a subset of triggers post-run** -- detector disagreement (T3), cutoff instability (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates.
7. **Generates a report** -- Markdown or JSON, always including a "what I assumed and why" block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced.

The agent's decisions at each of these steps are visible in the interactive demo's dark "od-expert" panels.

----

Activation Paths
----------------

PyOD 3 reaches agents through three paths. Pick whichever matches your stack:

**Claude Code / Claude Desktop / Codex**

The ``od-expert`` skill ships as package data inside the pyod wheel. Two install modes are supported:

.. code-block:: bash

   pip install pyod

   # Claude Code / Claude Desktop: user-global install
   pyod install skill            # installs to ~/.claude/skills/od-expert/

   # Codex: project-local install (Codex reads ./skills/, has no user-global dir)
   pyod install skill --project  # installs to ./skills/od-expert/

After installing, run ``pyod info`` to confirm the skill is detected and to see which agent stacks are present (``pyod info`` checks both ``~/.claude/`` and ``~/.codex/`` and recommends the right install command).
The legacy ``pyod-install-skill`` command from v3.0.0 still works as an alias for ``pyod install skill``.

**MCP-compatible agents**

Install the optional MCP extra, then launch the server:

.. code-block:: bash

   pip install pyod[mcp]
   pyod mcp serve          # alias for ``python -m pyod.mcp_server``

Any MCP-compatible LLM picks up the tools based on intent.

**Python apps / custom agents**

Import and call directly: ``from pyod.utils.ad_engine import ADEngine``. The agent calls ``engine.investigate(data)`` and follows ``state.next_action`` at each step.

----

Conversation Walkthrough
------------------------

**Dataset:** Diabetes screening dataset, 768 patients with 8 clinical features (glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age, pregnancies). Shipped with PyOD at ``examples/data/pima.csv``.

Why this dataset? It exercises the skill's machinery: the feature scale ratio is ~347x (Pitfall 1 fires), and the user's "medical screening" context fires Trigger 8 (high-stakes domain). On a happy-path dataset the side callouts would be mostly empty; on this data, they catch real problems.

**Turn 1.** User asks for screening:

.. code-block:: text

   User:  I have a diabetes screening dataset -- 768 patients with
          8 clinical features (glucose, blood pressure, BMI, insulin,
          age, etc.). Find unusual patient profiles that might need
          follow-up. This is a medical screening context.

   [od-expert skill: pre-run decisions]
     Master tree     -> tabular (no timestamps, no graph, no text/image)
     Pitfall #1      -> FIRES: std ratio 347x, note as report caveat
     Pitfalls #2-10  -> all pass
     Trigger T8      -> FIRES: medical screening; dual-detector validation
     Other triggers  -> not applicable
     Plan            -> KNN (0.85), IForest (0.85), LOF (0.80)

   Agent: Profiled as tabular, low-dim, n=768 x 8. Running KNN + IForest
          + LOF from the ADBench top-5 for low-dim small datasets.
          Scale mismatch noted for the final report.
          Results: 62 flagged (8.1%), agreement 0.59 (label-free;
          separation and the quality verdict are descriptive only).
          Top case: patient #13. KNN strongest individually.

Behind the scenes:

.. code-block:: python

   import numpy as np
   from pyod.utils.ad_engine import ADEngine

   np.random.seed(42)
   data = np.genfromtxt('examples/data/pima.csv', delimiter=',', skip_header=1)
   X = data[:, :-1]

   engine = ADEngine()
   state = engine.investigate(X)

   # state.plans[0]['detector_name'] == 'KNN'
   # state.analysis['best_detector'] == 'KNN'
   # state.analysis['consensus_analysis']['n_anomalies'] == 62
   # state.analysis['consensus_analysis']['anomaly_ratio'] == 0.081
   # state.quality['separation'] == 0.956
   # state.quality['agreement'] == 0.592
   # state.quality['stability']  (cutoff-gap value, data-dependent)
   # state.quality['overall'] == 0.787
   # state.quality['verdict'] == 'high'

**Turn 2.** User asks what is driving the flags:

.. code-block:: text

   User:  What clinical features are driving the flags?

   Agent: Analyzing patient #13 via KNN (best detector).
          engine.explain_findings returns absolute z-score magnitudes
          (|z|); raw values are cross-checked manually to show direction:

            insulin      |z|=6.65  (846 uU/mL)
            age          |z|=2.19  (59 yrs)
            glucose      |z|=2.13  (189 mg/dL)
            pregnancies  |z|=0.84  (1)

          Metabolic-syndrome profile: extreme insulin, elevated glucose,
          older age.

.. code-block:: python

   best_idx = state.analysis['best_detector_index']
   best_result = state.results[best_idx]
   explanations = engine.explain_findings(best_result, X=X, top_k=5)
   # explanations[0]['contributing_features'][0] ==
   #     {'feature': 4, 'z_score': 6.65}   # insulin (absolute magnitude)

**Turn 3.** User reveals labels exist. Trigger 5 fires:

.. code-block:: text

   User:  I should mention -- I actually have diabetes diagnosis labels
          for the full cohort.

   [od-expert skill: new turn, re-evaluating triggers]
     Trigger T5         -> FIRES: labels mentioned but the session is on
                           the unsupervised path
     Recommended switch -> drop out of ADEngine session and call XGBOD
                           directly (pyod[xgboost])

   Agent: That changes the plan. The ADEngine session API is unsupervised
          in v3.2.x. For labelled data, supervised detection (XGBOD)
          typically lifts AUC 10-20 points. I would switch to the classic
          fit / predict path.

.. code-block:: python

   # ADEngine session is unsupervised in v3.2.x; for labels,
   # use the classic fit / predict path directly:
   from pyod.models.xgbod import XGBOD

   clf = XGBOD()
   clf.fit(X, y_labels)                 # supervised training
   scores = clf.decision_function(X)    # anomaly scores
   labels_pred = clf.predict(X)         # binary labels

**Turn 4.** User asks for the unsupervised report for now:

.. code-block:: python

   report = engine.report(state, format='text')
   # Includes the quality bars, the selected detectors, the
   # best detector, and an explicit "assumptions and caveats"
   # block citing the scale mismatch (Pitfall 1), the
   # observed anomaly ratio, the high-stakes caveat (Trigger
   # 8), and the label-availability note (Trigger 5).

   # Or JSON for programmatic consumption:
   report_dict = engine.report(state, format='json')

----

The Session API
---------------

The agentic workflow is built on :class:`~pyod.utils.ad_engine.ADEngine`'s session API.
Each method advances the investigation and sets ``state.next_action`` to guide the agent:

========================== ========================================================
Method                     Purpose
========================== ========================================================
``start(X)``               Profile data, return ``InvestigationState``
``plan(state)``            Select top-N detectors, populate ``state.plans``
``run(state)``             Execute all detectors, fill ``state.consensus``
``analyze(state)``         Populate ``state.quality`` and ``state.analysis``
``iterate(state, fb)``     Adjust plan based on structured or NL feedback
``report(state)``          Generate markdown or JSON report
``investigate(X)``         One-shot: ``start`` + ``plan`` + ``run`` + ``analyze``
========================== ========================================================

Key ``state`` fields the agent reads:

* ``state.profile`` -- dict with ``data_type``, ``n_samples``, ``n_features``, ``has_nan``, ``dtype``, ``dimensionality_class``.
* ``state.plans`` -- list of plan dicts with ``detector_name``, ``confidence``, ``reason``, ``evidence``, ``alternatives``.
* ``state.consensus`` -- dict with ``scores``, ``labels``, ``n_detectors``, ``agreement``.
* ``state.quality`` -- dict with ``separation``, ``agreement``, ``stability``, ``overall``, ``verdict``, ``explanation``.
* ``state.analysis`` -- dict with ``consensus_analysis`` (containing ``n_anomalies``, ``anomaly_ratio``, ``top_anomalies``, etc.), ``best_detector``, ``best_detector_index``, ``per_detector_analysis``.
* ``state.next_action`` -- dict with ``action`` in {``report_to_user``, ``iterate``, ``confirm_with_user``}, plus ``reason``, ``summary``, and sometimes ``suggestion`` / ``proposed_change``.

Feedback to ``iterate(state, fb)`` can be structured (dict) or natural language (string):

.. code-block:: python

   # Structured (executes immediately)
   engine.iterate(state, {"action": "adjust_contamination", "value": 0.05})
   engine.iterate(state, {"action": "exclude", "detectors": ["IForest"]})
   engine.iterate(state, {"action": "include", "detectors": ["ECOD"]})
   engine.iterate(state, {"action": "rerun"})

   # Natural language (parsed to action, may need confirmation)
   engine.iterate(state, "too many false positives")
   engine.iterate(state, "try without IForest")

----

Why This Is Different
---------------------

Without PyOD 3, an AI agent wrapping a library like scikit-learn would:

1. Pick one detector (probably the wrong one)
2. Run it once with default parameters
3. Return raw scores without quality assessment
4. Rely on the LLM to interpret results from first principles

With PyOD 3 and the v3.2.0 ``od-expert`` skill:

1. Walks a master decision tree to pick the right modality and detector family.
2. Walks a top-10 pitfall checklist before running, catching problems like scale mismatch or DL on tiny data.
3. Walks 11 adaptive escalation triggers to decide when to pause and ask vs. run autonomously.
4. Selects detectors via benchmark-backed routing (ADBench, TSB-AD, BOND).
5. Runs top-3 in parallel and builds rank-normalized consensus.
6. Re-checks quality-related triggers post-run and hedges the report accordingly.
7. Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.

The agent follows an OD expert's workflow through the library, not despite it.
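
To make the "rank-normalized consensus" idea concrete, here is a minimal NumPy sketch of one common way to combine detectors whose raw scores live on different scales. It is an illustration only, not PyOD's internal implementation: the helper names ``rank_normalize`` and ``consensus`` are hypothetical, and the choices to average the ranks and to flag the top ``contamination`` fraction are assumptions made for this example.

```python
import numpy as np


def rank_normalize(scores):
    """Map raw anomaly scores to (0, 1] by rank; tied scores share a rank."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average the ranks of tied scores so equal scores get equal weight.
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    return ranks / len(scores)


def consensus(score_matrix, contamination=0.1):
    """Average rank-normalized scores over detectors, then flag the top fraction.

    score_matrix has shape (n_detectors, n_samples); higher means more anomalous.
    Ties at the threshold may flag slightly more than the requested fraction.
    """
    score_matrix = np.asarray(score_matrix, dtype=float)
    normalized = np.vstack([rank_normalize(row) for row in score_matrix])
    combined = normalized.mean(axis=0)
    n_flag = max(1, int(round(contamination * combined.size)))
    threshold = np.sort(combined)[-n_flag]
    labels = (combined >= threshold).astype(int)
    return combined, labels


# Three made-up detectors scoring the same five samples on different scales.
raw = np.array([
    [0.10, 0.20, 0.30, 9.00, 0.15],
    [0.50, 0.52, 0.51, 0.90, 0.49],
    [10.0, 12.0, 11.0, 40.0, 13.0],
])
combined, labels = consensus(raw, contamination=0.2)
print(labels)  # sample 3 is ranked highest by every detector
```

Working on ranks rather than raw scores keeps a detector with a wide score range (the first row here) from dominating the average, which is the usual motivation for rank normalization before building a consensus.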