One person. 310+ infrastructure objects across 6 sites. 3 firewalls, 13 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there’s no team to call. There never is. So I built three AI subsystems that handle the detective work — and wait for a thumbs-up before touching anything.

GitHub: papadopouloskyriakos/agentic-chatops

What Makes This Different

Self-Improving Prompts — now with A/B trials

The system evaluates its own performance and auto-patches its instructions. Every session is scored by an LLM-as-a-Judge on 5 quality dimensions — local gemma3:12b via Ollama by default (flipped to local-first on 2026-04-19 after a 60-query calibration against Haiku showed 85% agreement), with Haiku retained for calibration re-runs and flagged sessions re-scored by Opus.

When a dimension trends below threshold, the 2026-04-20 preference-iterating patcher (prompt-patch-trial.py) generates 3 candidate instruction variants (concise / detailed / examples) per low-scoring dimension — plus a no-patch control arm. Future matching sessions are routed deterministically via BLAKE2b(issue_id || trial_id) % (N+1). A daily cron runs a one-sided Welch t-test once every arm has 15+ samples; the winner is promoted to config/prompt-patches.json only if it beats control by ≥ 0.05 points at p < 0.1. Otherwise the trial aborts and the dimension stays unpatched until the candidate pool is edited. Prompt-level policy iteration — no model weights ever fine-tuned, no wasted patches shipped blind.
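The deterministic arm routing is just a keyed hash modulo the arm count; a minimal sketch (the function name and ID formats are illustrative, not the production code):

```python
import hashlib

def route_arm(issue_id: str, trial_id: str, n_candidates: int) -> int:
    """Deterministically route a session to one of N candidate arms
    plus a control arm, via BLAKE2b(issue_id || trial_id) % (N+1).
    Arm 0 is the no-patch control; 1..N are the candidate variants."""
    digest = hashlib.blake2b(f"{issue_id}{trial_id}".encode()).digest()
    return int.from_bytes(digest, "big") % (n_candidates + 1)
```

Because the hash is keyed on the issue and trial IDs only, the same issue always lands in the same arm for the life of a trial, so per-arm samples stay independent without any routing state to persist.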

AI Planner Wired to 41 Proven Ansible Playbooks

Before Claude Code investigates, a Haiku planner generates a 3-5 step investigation plan. The planner queries AWX for matching Ansible playbooks from 41 proven templates — maintenance windows, cert rotation, K8s drain/restore, PVE kernel updates, DMZ deployments. Plans naturally include “Run AWX Template 64 with dry_run=true” as remediation steps. Inspired by microsoft/sre-agent’s “Knowledge Base as runbooks” pattern, but using existing Ansible instead of inventing a new format.

Predictive Alerting

Instead of only reacting after alerts fire, the system queries LibreNMS API daily for trending risk across both site instances. Devices are scored on disk usage trends, alert frequency, and health signals. A daily top-10 risk report posts to Matrix before problems become incidents. 123 devices scanned, 23 at elevated risk in the latest run.
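The scoring itself is a weighted sum over the trend signals; a toy sketch (the weights, field names, and normalization below are illustrative placeholders, not the production values):

```python
def risk_score(device: dict) -> float:
    """Toy weighted risk score over disk trend, alert frequency, and health.
    All weights and field names are illustrative, not the production config."""
    disk_trend = device.get("disk_pct_growth_7d", 0.0)   # % disk growth per week
    alert_freq = device.get("alerts_30d", 0)             # alerts in last 30 days
    health = device.get("health", 1.0)                   # 0..1, 1 = fully healthy
    score = (0.5 * min(disk_trend / 10, 1.0)             # cap each signal at 1.0
             + 0.3 * min(alert_freq / 20, 1.0)
             + 0.2 * (1.0 - health))
    return round(score, 3)

devices = [
    {"host": "a", "disk_pct_growth_7d": 12, "alerts_30d": 5, "health": 0.9},
    {"host": "b", "disk_pct_growth_7d": 1, "alerts_30d": 0, "health": 1.0},
]
top10 = sorted(devices, key=lambda d: -risk_score(d))[:10]  # daily digest input
```

The top-10 slice of the sorted list is what would feed the daily Matrix digest.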

5-Signal RAG + GraphRAG + Cross-Encoder Rerank + Staleness Warnings

Retrieval uses Reciprocal Rank Fusion across 5 signals:

  1. Semantic — nomic-embed-text (768 dims) via Ollama on RTX 3090 Ti, with search_query: / search_document: asymmetric prefixes
  2. Keyword — hostname, error code, resolution text matching
  3. Compiled Wiki — 45 articles from 7+ sources, daily recompilation
  4. Session Transcripts — 838 MemPalace verbatim exchange chunks (weight 0.4)
  5. Chaos Baselines — chaos experiment results by hostname (weight 0.35)
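Weighted Reciprocal Rank Fusion over signals like the five above can be sketched in a few lines (k=60 is the common RRF default, not necessarily the production constant):

```python
from collections import defaultdict

def rrf(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion.
    rankings: {signal_name: [doc_id, ...] in rank order}
    Each signal contributes weight / (k + rank) per document."""
    weights = weights or {}
    scores = defaultdict(float)
    for signal, ranked_docs in rankings.items():
        w = weights.get(signal, 1.0)
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += w / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

fused = rrf(
    {"semantic": ["d1", "d2", "d3"],
     "keyword": ["d2", "d1"],
     "transcripts": ["d3", "d2"]},
    weights={"transcripts": 0.4},   # down-weighted signal, as in the text
)
```

A document ranked moderately by several signals beats one ranked first by a single signal, which is exactly the behavior you want when individual signals are noisy.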

Plus a GraphRAG knowledge graph (360 entities, 193 relationships) for incident→host→alert traversal. A dedicated cross-encoder reranker (BAAI/bge-reranker-v2-m3 served at nllei01gpu01:11436, sqrt-blended to handle bimodal scores) reorders the fused results; when its top score falls below 0.7, a cross-chunk synth step (local qwen2.5:7b via Ollama, SYNTH_BACKEND=qwen by default since 2026-04-19) composes an answer from the top 10 fresh candidates — +13pp judge-graded hit@5 on the 50-query hard-retrieval set. An mtime-sort intent detector bypasses semantic retrieval when the query asks for “files created in the last N hours” (satisfied by source_mtime ordering, not cosine similarity). Age-proportional staleness warnings flag results older than 7 days.
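The rerank-then-gate step can be sketched as follows. The exact blend formula is my assumption (the text only says "sqrt-blended"), so treat it as illustrative; the 0.7 synth threshold is from the text:

```python
import math

SYNTH_THRESHOLD = 0.7  # below this top cross-encoder score, fall back to synth

def blend(rrf_score, ce_score):
    """One plausible sqrt blend: a geometric-style mix that compresses the
    cross-encoder's bimodal (near-0 / near-1) score distribution.
    The production formula is not shown in this document."""
    return math.sqrt(max(rrf_score, 0.0) * max(ce_score, 0.0))

def rerank(candidates):
    """candidates: list of (doc_id, normalized_rrf_score, cross_encoder_score).
    Returns the reordered list plus a flag for the cross-chunk synth fallback."""
    ranked = sorted(candidates, key=lambda c: -blend(c[1], c[2]))
    needs_synth = not ranked or ranked[0][2] < SYNTH_THRESHOLD
    return ranked, needs_synth
```

When needs_synth is true, the pipeline would hand the top fresh candidates to the local synth model instead of returning the reranked list directly.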

Karpathy-Style Knowledge Compilation

Following Andrej Karpathy’s LLM Knowledge Bases pattern : 7+ raw sources compiled into 45 browsable wiki articles with auto-maintained indexes, contradiction detection, and health checks. The compiler runs daily with SHA-256 incremental hashing. All articles embedded into the RAG pipeline as the 3rd fusion signal. Each row carries a source_mtime column so “what changed in the last 48 hours” queries work as a real retrieval mode.
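The SHA-256 incremental hashing amounts to comparing per-source content hashes against a saved state file and recompiling only what changed; a minimal sketch (file names illustrative):

```python
import hashlib
import json
import pathlib

def compile_incremental(sources, state_file="hashes.json"):
    """Return the subset of source paths whose content hash changed
    since the last run; unchanged sources are skipped entirely."""
    state_path = pathlib.Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}
    changed = []
    for path in sources:
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:
            changed.append(path)          # this source needs recompilation
            seen[str(path)] = digest
    state_path.write_text(json.dumps(seen))
    return changed
```

Hashing content rather than trusting mtimes means a touch without a real edit costs nothing, while any byte-level change is always caught.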

88K Tool Calls Instrumented with OTel Tracing

Every tool call (88,474 across 108 types) is logged with name, duration, exit code, and error type. 39K OTel spans exported to OpenObserve (OTLP). OpenObserve added as Grafana datasource (alongside Prometheus + Loki) — all observability in one UI. Per-tool error rates and p50/p95 latency visible in Grafana. 18,220 infrastructure SSH/kubectl commands tracked across 232 devices. 39 credentials monitored with 90-day rotation policy. Per-source token caps prevent context overflow (incident 4K, wiki 4K, lessons 2K, memory 2K, diary 1.5K, transcript 1.5K).
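The per-source caps listed above can be enforced with a simple truncation budget per context section. The chars-per-token proxy below is illustrative; the production code may count real tokens:

```python
# Token caps from the text: incident 4K, wiki 4K, lessons 2K,
# memory 2K, diary 1.5K, transcript 1.5K.
CAPS = {"incident": 4000, "wiki": 4000, "lessons": 2000,
        "memory": 2000, "diary": 1500, "transcript": 1500}

def truncate_to_cap(text, source, chars_per_token=4):
    """Crude char-based proxy for a token cap (~4 chars/token for English)."""
    limit = CAPS.get(source, 1000) * chars_per_token
    return text if len(text) <= limit else text[:limit]

def build_context(sections):
    """Assemble the prompt context with every source held under its cap,
    so no single noisy source can blow the context window."""
    return "\n\n".join(
        f"## {src}\n{truncate_to_cap(body, src)}" for src, body in sections.items()
    )
```

Capping per source (rather than one global cap) guarantees a verbose transcript can never crowd out incident knowledge, and vice versa.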

Hardened RAG Evaluation

The RAGAS golden set was hardened in April 2026 from 18 saturated queries (faithfulness ~0.88 across configs — couldn’t measure pipeline improvements) to 33 queries with 15 hard-eval tagged across 5 categories: multi-hop, temporal, negation, meta, and cross-corpus corroboration. Easy vs hard queries now show a 10× faithfulness differential, so retrieval changes are measurable again. A weekly hard-eval cron (Monday 05:00 UTC) on a 50-query hard-retrieval-v2 set produces judge-graded hit@5 = 0.90 and emits six kb_hard_eval_* Prometheus metrics with absent-guard alerts.

Structured Agentic Substrate — 9 adoptions from the OpenAI Agents SDK (2026-04-20)

The official openai/openai-agents-python SDK (v0.14.2, ~88K LOC) was audited against this system; 11 gaps surfaced, 9 landed as a coherent batch:

  • Schema versioning on 9 session/audit tables + a central registry mirroring the SDK’s RunState.CURRENT_SCHEMA_VERSION / SCHEMA_VERSION_SUMMARIES pattern. Writer/reader shape drift now fails fast instead of silently corrupting replay.
  • 13 typed events in a new event_log table — tool_started/ended, handoff_requested/completed/cycle_detected/compaction, reasoning_item_created, mcp_approval_*, agent_updated, message_output_created, tool_guardrail_rejection, agent_as_tool_call. Replaces free-form Matrix strings with Grafana-queryable telemetry.
  • Per-turn lifecycle hooks — session-start.sh, post-tool-use.sh, user-prompt-submit.sh, and a new session-end.sh (the on_final_output equivalent) feed a session_turns table with per-turn cost, tokens, duration, tool count.
  • 3-behavior rejection taxonomy — allow / reject_content (retry with hint) / deny (hard halt), mirroring the SDK’s ToolGuardrailFunctionOutput. Every rejection is a typed event with non-empty message (audit invariant).
  • HandoffInputData envelope — zlib+b64 payload carrying prior agent’s input_history, pre_handoff_items, new_items, run_context. 176 KB input_history → 752 B on the wire (0.43% ratio). No more re-deriving context via RAG on escalation.
  • Transcript compaction on handoff — opt-in per escalation. Local gemma3:12b compresses long T1 triage to a 1-turn summary; Haiku as circuit-breaker fallback.
  • Agent-as-tool wrapper — wraps the 10 sub-agent definitions as callable tools so the orchestrator LLM can conditionally invoke them in the ambiguous-risk (0.4–0.6) band, complementing the existing deterministic routing.
  • Handoff depth + cycle detection — depth ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent twice in the chain is refused and logged. Atomic BEGIN IMMEDIATE transactions with PRAGMA busy_timeout=10000 for race-free concurrent bumps.
  • Immutable per-turn snapshots — capture BEFORE each mutating tool (Bash, Edit, Write, Task; read-only skipped). rollback_to(snapshot_id) restores any prior sessions row. 7-day retention.

Four new SQLite tables (event_log, handoff_log, session_state_snapshot, session_turns) bring the total from 31 to 35. Migrations 006–011 apply idempotently on both fresh and legacy DBs.
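The HandoffInputData envelope described above is essentially zlib-compressed, base64-encoded JSON; a minimal sketch:

```python
import base64
import json
import zlib

def encode_envelope(payload: dict) -> str:
    """Pack handoff context (input_history, pre_handoff_items, new_items,
    run_context) as zlib-compressed, base64-encoded JSON for the wire."""
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(raw, level=9)).decode()

def decode_envelope(blob: str) -> dict:
    """Exact inverse: base64-decode, decompress, parse."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))
```

Long agent transcripts are highly repetitive (repeated tool names, hostnames, log prefixes), which is why zlib can achieve extreme ratios like the 176 KB → 752 B figure quoted above.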

QA Suite — 411/0 PASS (99.52%), 44 suite files

A pytest-style bash harness (scripts/qa/run-qa-suite.sh, 44 suite files, ~3–5 min under full-suite load) verifies every adoption with a JSON scorecard output. A per-suite QA_PER_SUITE_TIMEOUT wrapper (default 120 s) caps any slow/wedged suite and emits a synthetic FAIL record to the scorecard so the orchestrator never hangs silently. Coverage highlights:

  • Writer stamping — 11 / 11 writers + 5 / 5 n8n workflow INSERT sites asserted.
  • Pattern-by-pattern — 53 deny + 32 reject_content / allow tests.
  • Per-event payload shapes — all 13 event classes round-trip through CLI + Python.
  • Concurrent safety — 8 parallel handoff_depth.bump() asserted no-lost-updates. The test surfaced a real race condition (Python sqlite3’s default isolation_level="" defeating BEGIN IMMEDIATE); fix shipped in the same commit.
  • Mock HTTP server — stdlib-only forking server faking ollama / anthropic for offline compaction happy-path testing.
  • 6 e2e scenarios — happy path (all 9 adoptions in one flow), cycle prevention, crash rollback, schema forward-compat, envelope-to-subagent, compaction in handoff.
  • Benchmarks — p95 latencies: event emit 111 ms, handoff bump 108 ms, envelope encode 76 ms, snapshot capture 86 ms, hook 198 ms. Migration on a 10K-row legacy DB: ~200 ms. Compression ratio 0.43% (23× better than target).

Writing the tests also surfaced — and fixed — four code bugs: legacy hooks emitting JSON that breaks Claude Code validation, five writers hardcoding the prod DB path, the missing on_final_output hook, and schema.sql lacking canonical CREATE TABLEs for the new versioned tables.
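The race fix mentioned above is worth spelling out: Python's sqlite3 module, with its default isolation_level, manages transactions implicitly and so defeats a hand-written BEGIN IMMEDIATE. Opening the connection with isolation_level=None puts it in autocommit mode so the explicit write lock actually takes effect. A sketch with illustrative table and column names:

```python
import sqlite3

def bump_handoff_depth(db_path: str, session_id: str) -> int:
    """Race-free read-modify-write of a per-session counter.
    isolation_level=None disables the driver's implicit transactions,
    so our explicit BEGIN IMMEDIATE acquires the write lock *before*
    the read, preventing lost updates under concurrency."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("PRAGMA busy_timeout=10000")
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT handoff_depth FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        depth = (row[0] if row else 0) + 1
        conn.execute(
            "UPDATE sessions SET handoff_depth = ? WHERE id = ?",
            (depth, session_id),
        )
        conn.execute("COMMIT")
        return depth
    finally:
        conn.close()
```

With the default isolation_level, the driver would open its own deferred transaction on the first statement, and a concurrent bumper could read the same stale depth.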

CLI-Session RAG Capture — interactive sessions flow into RAG too (2026-04-20)

Before this, only YT-backed agentic sessions had their transcripts, tool calls, and extracted knowledge written into the shared RAG tables. Interactive Claude Code CLI sessions — operator typing directly into a terminal, no webhook, no YT ticket — were only captured by a token-counting poller for cost tracking. Their reasoning, tool use, and outcomes were lost to retrieval.

A 3-tier pipeline (IFRNLLEI01PRD-646/-647/-648) closes the gap with a single cron line that chains three idempotent steps over every CLI JSONL:

  1. Archive transcripts — exchange-pair chunks into session_transcripts with nomic-embed-text embeddings; sessions above 5000 assistant chars also get a doc-chain refined summary row.
  2. Parse tool calls — tool_use / tool_result pairs into tool_call_log, tagged issue_id='cli-<uuid>' so both tables join cleanly.
  3. Extract knowledge — gemma3:12b in strict-JSON mode over the summary rows → structured {root_cause, resolution, subsystem, tags, confidence} → incident_knowledge with project='chatops-cli', embedded for retrieval.

Retrieval weights chatops-cli rows at CLI_INCIDENT_WEIGHT=0.75 by default so real infra incidents still win close ties. A byte-offset watermark skips unchanged files so the nightly cron drains the ~2,300-file backlog incrementally without re-chunking settled sessions. Soak test (10 files): 12 chunks + 245 tool calls + 4 summaries + 4 extracted knowledge rows; gemma correctly classified one sample as subsystem=sqlite-schema, tags=[schema, migration, versioning, data], confidence=0.95. 12 QA tests covering flag parsing, watermark roundtrip, path inference, sanitization, and retrieval weighting — all PASS.
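The byte-offset watermark pattern is simple enough to sketch (file names illustrative): record where you stopped reading each JSONL, and on the next run seek past it so settled sessions are never re-chunked:

```python
import json
import pathlib

def drain_new_lines(jsonl_path: str, watermark_file: str = "watermarks.json"):
    """Yield only the JSONL records appended since the last run,
    using a persisted per-file byte offset as the watermark."""
    wm_path = pathlib.Path(watermark_file)
    marks = json.loads(wm_path.read_text()) if wm_path.exists() else {}
    offset = marks.get(jsonl_path, 0)
    records = []
    with open(jsonl_path, "rb") as f:
        f.seek(offset)                 # skip everything already processed
        for line in f:
            if line.strip():
                records.append(json.loads(line))
        marks[jsonl_path] = f.tell()   # advance the watermark to EOF
    wm_path.write_text(json.dumps(marks))
    return records
```

Because JSONL is append-only, a byte offset is a safe watermark; a multi-thousand-file backlog then drains incrementally, a nightly slice at a time.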

Skill Authoring Discipline — 6 dimensions closed vs google/agents-cli (2026-04-23)

Most of the heavy lifting in an agentic system lives in the skills and sub-agents that carry the discipline — when to use a skill, when not to, what to resist doing, how to prove the work is actually done. A 2026-04-23 deep audit compared this platform against google/agents-cli (Google’s own reference implementation of skill-authoring convention for Gemini/ADK). The scorecard ran 16 dimensions; this platform won 9 on raw capability, but trailed on six skill-authoring dimensions that agents-cli treats as first-class: phase-gate choreography, skill discoverability, “when NOT to use” anti-guidance, inline behavioral anti-patterns, governance/versioning, and auto-generated single-source-of-truth indexes.

An 11-commit uplift (YouTrack umbrella IFRNLLEI01PRD-712, Phases A→J, direct-pushed to main, zero reverts) closed every gap:

  • Master phase-gate skill — a new chatops-workflow/SKILL.md codifies the Phase 0→6 incident lifecycle (triage → drift-check → alert-context → propose → approve → execute → post-incident) with explicit exit criteria per phase. The Runner’s Build Prompt node force-injects the full skill body into every session’s system prompt, marker-delimited for surgical removal and with the pre-injection workflow snapshot preserved as a rollback anchor. Proven end-to-end by a real Runner session whose first tool call was grep -i "Phase 0" against its own injected prompt.
  • Auto-generated skill index — scripts/render-skill-index.py walks all SKILL.md + agent frontmatter and emits docs/skills-index.md as the canonical single source of truth. A drift-guard QA test (test-656-skill-index-fresh.sh) fails CI if the committed index would differ from a fresh render. Wired into the daily 04:30 UTC wiki-compile cron so the browsable wiki picks it up automatically.
  • Versioned + machine-audited skills — every SKILL.md and .claude/agents/*.md frontmatter now carries version: 1.x.0 + a requires: {bins, env} block. scripts/audit-skill-requires.sh verifies declared binaries (which) and env vars (test -n); a Prometheus exporter feeds two new alerts (SkillPrereqMissing, SkillMetricsExporterStale). scripts/audit-skill-versions.sh walks git history for skill bodies that changed without a version bump. Per-skill semver convention formalized in docs/runbooks/skill-versioning.md (patch/minor/MAJOR tied to changes in the skill contract — name, description, allowed-tools, requires, output format).
  • “Do NOT use for X” anti-guidance — every primary skill/agent description now ends with an explicit negative-guidance clause pointing to the correct alternative. Measurably reduces over-routing (e.g., security-analyst no longer gets picked for disk-full alerts whose symptom overlaps).
  • 46 Shortcuts-to-Resist rows inlined on 11 agents — each row drawn verbatim from the matching memory/feedback_*.md lesson with source citation. Behavioral inoculation at the surface where the model is about to act, instead of trusting RAG to surface 50+ scattered lesson files on demand.
  • Proving-Your-Work directive + evidence_missing signal — the risk classifier now emits evidence_missing when CONFIDENCE ≥ 0.8 is claimed without any visible tool output or code fence in the reply, forcing [POLL] instead of [AUTO-RESOLVE]. Mirrored in the Runner’s Prepare Result node so [AUTO-RESOLVE] markers are stripped and a GUARDRAIL EVIDENCE-MISSING: banner is prepended to unproven high-confidence replies before they reach Matrix.
  • Operator-vocabulary map — config/user-vocabulary.json (20 entries) disambiguates operator shorthand (“the firewall” → nllei01fw01;grskg01fw01, “xs4all” → “budget” post-rename, etc.). The prompt-submit hook scans every incoming message and logs a typed vocabulary event to event_log on match, fed back into Grafana as a clarification-request reduction metric.

Scorecard delta: 3.94 → 4.94 average across 16 dimensions (13/16 now at 5/5, was 9/16). All 6 targeted gap dimensions closed. +27 new QA tests (test-656/-660/-718/-724/-726/-727), all passing; the QA orchestrator gained a per-suite 120-second timeout guard as part of the same pass so no future wedged suite can silently stall the whole run. Full audit memo: docs/scorecard-post-agents-cli-adoption.md.
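The evidence_missing signal above is a simple in-band check: a high confidence claim with no visible evidence block forces [POLL]. A sketch, where the marker regexes are my illustrative stand-ins (the production signal definition may differ):

```python
import re

# Illustrative markers: a stated confidence value, and "evidence" defined as
# any fenced code block or tool-output block in the reply.
CONF_RE = re.compile(r"CONFIDENCE\s*[:=]?\s*(0?\.\d+|1(?:\.0+)?)", re.I)
EVIDENCE_RE = re.compile(r"```.*?```|<tool_output>.*?</tool_output>", re.S)

def evidence_missing(reply: str, threshold: float = 0.8) -> bool:
    """True when the reply claims CONFIDENCE >= threshold but contains
    no visible tool output or code fence, i.e. an unproven claim that
    should be downgraded from [AUTO-RESOLVE] to [POLL]."""
    m = CONF_RE.search(reply)
    if not m or float(m.group(1)) < threshold:
        return False
    return not EVIDENCE_RE.search(reply)
```

The orchestrator side would then strip any [AUTO-RESOLVE] marker and prepend the guardrail banner before the reply reaches Matrix.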


The 3-Tier Architecture

Alert Source                    Tier 1                 Tier 2                 Tier 3
─────────────                   ──────                 ──────                 ──────
LibreNMS        ┐               OpenClaw               Claude Code            Human
Prometheus      ├──► n8n ──►    (GPT-5.1)         ──►  (Opus 4.6)        ──►  (Matrix)
CrowdSec        ┤               17 skills              10 sub-agents          polls
GitLab CI       ┘               7-21 sec               5-15 min               reactions
  • Tier 1 (OpenClaw / GPT-5.1): Fast triage (7-21s). Queries NetBox CMDB, runs hybrid semantic search, extracts procedural knowledge from 55 CLAUDE.md files + 200+ operational memory rules, SSHes to the host. Creates YouTrack issues. Handles 80%+ of alerts without escalation.
  • Tier 2 (Claude Code / Opus 4.6): Deep analysis (2-15 min). A Haiku planner first generates a 3-5 step investigation plan and queries AWX for matching playbooks from 41 templates; Claude Code then follows the plan, launches AWX jobs (dry_run first), delegates to 10 sub-agents in parallel, and proposes remediation via [POLL]. All agents write diary entries for cross-session learning.
  • Tier 3 (Human): Clicks a poll option in Matrix. The system stops and waits here — it never acts autonomously on infrastructure.

Three Subsystems

| Subsystem | Scope | Alert Sources |
|---|---|---|
| ChatOps | Infrastructure availability, performance, maintenance | LibreNMS, Prometheus, Synology DSM |
| ChatSecOps | Intrusion detection, vulnerability scanning, MITRE ATT&CK mapping | CrowdSec (54 scenarios → 21 ATT&CK techniques), vulnerability scanners |
| ChatDevOps | CI/CD failures, code review, multi-repo refactoring | GitLab CI webhooks |

All three share the same engine: n8n orchestration, Matrix as human-in-the-loop, and the 3-tier agent architecture.

Safety — 7 Layers

Because “the prompt says don’t do that” is not a security boundary:

| Layer | Mechanism | Bypassed by prompt injection? |
|---|---|---|
| Claude Code hooks | 78 blocked patterns (37 destructive + 22 exfil + 7 injection) + 15 protected file patterns + word-boundary precision on single-word commands (passwd/useradd/halt etc.). Now emits the 3-behavior taxonomy — allow (silent exit 0) / reject_content (exit 2 with retry hint) / deny (hard halt) — with every rejection landing as a typed tool_guardrail_rejection event. An evidence_missing risk signal fires in-band when CONFIDENCE ≥ 0.8 is claimed without a visible tool output block, forcing [POLL] and stripping unearned [AUTO-RESOLVE] markers. | No — deterministic, fires first |
| safe-exec.sh | Code-level blocklist, rate limiting (30/min), exfiltration detection | No — code enforcement |
| exec-approvals.json | 36 specific skill patterns, no wildcards | No — config enforcement |
| Input sanitization | 17 injection patterns (encoding obfuscation, role confusion, delimiter injection, social engineering) | No — code enforcement |
| Evaluator-Optimizer | Haiku screening for high-stakes responses (3-node n8n flow) | No — second-pass review |
| Credential / PII scanning | 16 regex patterns + 39 credentials tracked with rotation dates | N/A |
| Approval gates | Infrastructure changes require human thumbs-up or poll vote | No — workflow enforcement |
| Handoff depth + cycle detection | Atomic counter on sessions.handoff_depth; ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent twice in the chain is refused and logged as handoff_cycle_detected | No — transaction-serialised in SQLite |

Additional: €5/session warning, $25/day budget ceiling, confidence gating (< 0.5 = STOP, < 0.7 = escalate), tool call limit of 75 per session. Synth failure handling: when Haiku synth is opted into (SYNTH_BACKEND=haiku), SYNTH_HAIKU_FORCE_FAIL injects 5 failure modes (429 / auth / timeout / network / empty) and the pipeline falls back to local qwen2.5 without breaking the response chain. Every mutating tool call is preceded by an immutable snapshot to session_state_snapshot for mid-session rollback.
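The word-boundary precision on single-word commands can be sketched as a custom-boundary regex: a bare `halt` fires, but substrings inside longer tokens (`halted`, `halt-checker.sh`, `my-passwd-manager`) don't cause false denials. The word list below is an illustrative subset, not the production blocklist:

```python
import re

SINGLE_WORD_BLOCKED = ["passwd", "useradd", "halt", "reboot"]  # illustrative subset

def is_blocked(command: str) -> bool:
    """Word-boundary match for dangerous single-word commands.
    The custom boundaries exclude word chars, '.', '/', and '-', so
    'halt' fires as a standalone token but 'halt-checker.sh' does not."""
    for word in SINGLE_WORD_BLOCKED:
        if re.search(rf"(?<![\w./-]){re.escape(word)}(?![\w./-])", command):
            return True
    return False
```

Plain `\b` boundaries would wrongly fire on `halt-checker.sh` (hyphen is a `\b` boundary), which is exactly the false-positive class the custom character classes avoid.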

Evaluation System

All evals run deterministically (temperature=0, seed=42).

| System | What it measures |
|---|---|
| 161+ Test Scenarios | 58 eval scenarios (3 sets) + 54 adversarial + 23 holistic E2E + 22 mempalace + 22 security-hook + 9 KG-traverse + 17 synth-fallback + 20 qwen-JSON reliability |
| Hard-Retrieval v2 | 50-query weekly eval — judge-graded hit@5 = 0.90, p50 latency 5.7s, p95 13.6s |
| RAGAS Golden Set | 33 queries (15 hard-eval tagged) across multi-hop / temporal / negation / meta / cross-corpus |
| Prompt Scorecard | 19 prompt surfaces graded daily on 6 dimensions |
| LLM-as-a-Judge | Every session scored by local gemma3:12b (default since 2026-04-19, 85% Haiku-agreement on 60-query calibration); flagged sessions re-scored by Opus |
| Agent Trajectory | Per-session step scoring from JSONL transcripts (8 infra / 4 dev steps) |
| Self-Improving Prompts | Low-scoring dimensions auto-generate prompt patches (5 currently active) |
| Eval Flywheel | Monthly: analyze → measure → improve → validate cycle |
| A/B Testing | react_v1 vs react_v2 variants, deterministic by issue hash |
| CI Eval Gate | eval-regression stage between test and review — blocks bad merges |

Tech Stack

| Component | Role |
|---|---|
| n8n | Workflow orchestration — 26 workflows (runner, bridge, poller, session-end, teacher-runner, receivers, etc.) |
| OpenClaw (GPT-5.1) | Tier 1 — fast triage, 17 native skills |
| Claude Code (Opus 4.6) | Tier 2 — deep analysis, 11 sub-agents + master chatops-workflow phase-gate skill, Plan-and-Execute, ReAct reasoning |
| AWX | 41 Ansible playbooks — maintenance, cert sync, K8s drain, updates, deployment |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies |
| YouTrack | Issue tracking, state management, knowledge sink |
| NetBox | CMDB — 310+ devices, 421 IPs, 39 VLANs |
| Prometheus + Grafana | 11 custom dashboards, 64+ panels, 16+ exporters, 4 alert-rule files (all sidecar-provisioned via ConfigMaps) |
| OpenObserve | OTel tracing — OTLP span collection + Prometheus-compatible queries |
| Ollama (RTX 3090 Ti) | Local embeddings (nomic-embed-text) + local judge (gemma3:12b) + local synth (qwen2.5:7b) |
| bge-reranker-v2-m3 | Cross-encoder reranker on nllei01gpu01:11436 — reorders fused RAG results before synth |

Tri-Source Audit — 11/11 A+

Scored against three knowledge sources: Gulli’s Agentic Design Patterns (21/21 patterns), Anthropic Claude Certified Architect Foundations, and 6 industry sources. Result: 11/11 dimensions at A+ (100%). Additionally, an Operational Activation Audit verified all database tables are populated with 150K+ production rows — scoring infrastructure that actually produces data, not just exists.

Holistic Health Check — 96%+

holistic-agentic-health.sh runs 142 automated checks across 37 sections — verifying every feature in this page actually works in production. Not just “does the file exist?” but functional tests (RAG search returns known incidents, Ollama generates 768-dim embeddings, trajectory scoring produces output), cross-site verification (6/6 VTI tunnels READY, 7 BGP peers, ping 42ms), infrastructure health (7/7 K8s nodes, 139/139 Prometheus targets UP, GPU 46°C), and security compliance (both scanners ran within 26h, MITRE Navigator accessible). Results stored in SQLite for trending, exported to Prometheus for Grafana dashboards. Runs in 18 seconds.

Session-Holistic E2E — 23/23

A second suite of 23 end-to-end tests (157s total runtime) covers 18 YouTrack issues with before/after scoring against production data: 23/23 PASS at the last run (2026-04-19). Each test drives a real session through the full pipeline — Tier 1 triage, Tier 2 plan/execute, RAG retrieval, judge scoring — and asserts on output quality, not just exit codes. Output: docs/session-holistic-e2e-*.md.

Status

| Milestone | Status |
|---|---|
| Self-improving prompts (eval → auto-patch → re-eval) | Production |
| Plan-and-Execute + AWX runbooks (41 playbooks) | Production |
| Predictive alerting (LibreNMS trending + daily digest) | Production |
| 5-signal RAG + GraphRAG + cross-encoder rerank + temporal filter (360 entities, 193 rels) | Production |
| Cross-chunk synth (local qwen2.5:7b, +13pp hit@5 on hard eval) + 5-mode failure injection | Production |
| Local-first judge + synth (gemma3:12b + qwen2.5:7b, 2026-04-19 flip) | Production |
| mtime-sort intent detector + list-recent CLI | Production |
| Karpathy-style compiled wiki (45 articles, source_mtime column) | Production |
| OTel tracing (39K spans → OpenObserve) | Production |
| Tool call instrumentation (88K calls, 108 types) | Production |
| 3-tier agent architecture | Production |
| 26 n8n workflows | Production |
| 10 MCP servers (153 tools) | Production |
| 11 sub-agents with diary entries | Production |
| Skill-authoring uplift vs google/agents-cli — scorecard 3.94 → 4.94, 6 gap dimensions closed (Phases A→J, umbrella IFRNLLEI01PRD-712) | Production |
| Master phase-gate skill (chatops-workflow/SKILL.md) force-injected into every Runner session | Production |
| Auto-generated skills index (docs/skills-index.md) drift-gated by test-656 | Production |
| Skill versioning + requires audit — 17/17 SKILL.md with version: 1.x.0 + requires: {bins, env}; 2 new Prom alerts | Production |
| 46 Shortcuts-to-Resist rows inlined across 11 agents (source-cited to memory/feedback_*.md) | Production |
| evidence_missing risk signal — forces [POLL] when CONFIDENCE ≥ 0.8 without a visible tool output block | Production |
| Operator-vocabulary map (config/user-vocabulary.json, 20 entries) — prompt-submit hook logs vocabulary events to event_log | Production |
| 21/21 agentic design patterns | Audited (7 at A+) |
| Tri-source + operational activation audit | 11/11 A+ |
| 7-layer safety + word-boundary precision on hook patterns | Production |
| OpenAI Agents SDK adoption batch (9 / 11 gaps landed, 2026-04-20) | Production |
| Schema versioning on 13 tables + central CURRENT_SCHEMA_VERSION registry | Production |
| 13 typed session events in event_log with Prom exporter | Production |
| Per-turn hooks (session-start.sh / post-tool-use.sh / user-prompt-submit.sh / session-end.sh) + session_turns table | Production |
| 3-behavior rejection taxonomy (allow / reject_content / deny) + event_log audit invariant | Production |
| HandoffInputData envelope (zlib+b64, 0.43% ratio) + handoff_log audit table | Production |
| Handoff transcript compaction (local gemma first, Haiku fallback, circuit-breaker-aware) | Production |
| Agent-as-tool wrapper for 10 sub-agents | Production |
| Handoff depth + cycle detection (atomic IMMEDIATE tx, PRAGMA busy_timeout=10000) | Production |
| Immutable per-turn snapshots + rollback + 7-day retention cron | Production |
| 42 SQLite tables (150K+ rows; +6 from the SDK batch + 3 from the teacher-agent tiers: learning_progress, learning_sessions, teacher_operator_dm) | Production |
| Hardened RAGAS golden set — 33 queries (15 hard-eval), 10× differential | Production |
| Weekly hard-eval cron (50-q) — judge_hit@5 = 0.90 | Production |
| 3 absent-metric alerts guarding the staleness alerts themselves | Production |
| 161+ eval scenarios across 8 test suites | All passing |
| Preference-iterating prompt patcher — N-candidate A/B trials + Welch t-test + auto-promote | Production |
| CLI-session RAG capture — interactive CLI sessions now flow into session_transcripts + tool_call_log + incident_knowledge | Production |
| QA suite — 44 files, 411/0 PASS (2 benign skips), per-suite timeout guard | 99.52% |
| 11-dashboard observability (64+ panels) | Production |
| Weekly chaos cron (self-selecting, preflight gate, Matrix notifications) | Production (CMM L3) |
| A/B prompt testing | Active |
| Holistic health check (142 checks, 37 sections) | 96%+ pass |
| Session-holistic E2E (23 tests covering 18 YouTrack issues) | 100% (23/23) |

Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.