One person. 310+ infrastructure objects across 6 sites. 3 firewalls, 13 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there’s no team to call. There never is. So I built three AI subsystems that handle the detective work — and wait for a thumbs-up before touching anything.
GitHub: papadopouloskyriakos/agentic-chatops
What Makes This Different
Self-Improving Prompts — now with A/B trials
The system evaluates its own performance and auto-patches its instructions. Every session is scored by an LLM-as-a-Judge on 5 quality dimensions — local gemma3:12b via Ollama by default (flipped to local-first on 2026-04-19 after a 60-query calibration against Haiku showed 85% agreement), with Haiku retained for calibration re-runs and flagged sessions re-scored by Opus.
When a dimension trends below threshold, the 2026-04-20 preference-iterating patcher (prompt-patch-trial.py) generates 3 candidate instruction variants (concise / detailed / examples) per low-scoring dimension — plus a no-patch control arm. Future matching sessions are routed deterministically via BLAKE2b(issue_id || trial_id) % (N+1). A daily cron runs a one-sided Welch t-test once every arm has 15+ samples; the winner is promoted to config/prompt-patches.json only if it beats control by ≥ 0.05 points at p < 0.1. Otherwise the trial aborts and the dimension stays unpatched until the candidate pool is edited. Prompt-level policy iteration — no model weights ever fine-tuned, no wasted patches shipped blind.
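The deterministic routing step can be sketched in a few lines; this is a minimal illustration, assuming string IDs, a `||` separator, and the first 8 digest bytes as the hash value (all implementation details not confirmed by the source):

```python
import hashlib

def assign_arm(issue_id: str, trial_id: str, n_candidates: int) -> int:
    """Deterministic trial routing: BLAKE2b(issue_id || trial_id) modulo
    (N candidates + 1). Arm 0 is the no-patch control arm."""
    digest = hashlib.blake2b(f"{issue_id}||{trial_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (n_candidates + 1)

# The same session always lands in the same arm, with no routing state to store:
arm = assign_arm("PRD-646", "trial-completeness-01", 3)
assert arm == assign_arm("PRD-646", "trial-completeness-01", 3)
```

Hash-based assignment gives stable, stateless arm membership, which is what makes the later per-arm sample counting trustworthy.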
AI Planner Wired to 41 Proven Ansible Playbooks
Before Claude Code investigates, a Haiku planner generates a 3-5 step investigation plan. The planner queries AWX for matching Ansible playbooks from 41 proven templates — maintenance windows, cert rotation, K8s drain/restore, PVE kernel updates, DMZ deployments. Plans naturally include “Run AWX Template 64 with dry_run=true” as remediation steps. Inspired by microsoft/sre-agent’s “Knowledge Base as runbooks” pattern, but using existing Ansible instead of inventing a new format.
Predictive Alerting
Instead of only reacting after alerts fire, the system queries LibreNMS API daily for trending risk across both site instances. Devices are scored on disk usage trends, alert frequency, and health signals. A daily top-10 risk report posts to Matrix before problems become incidents. 123 devices scanned, 23 at elevated risk in the latest run.
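The per-device scoring reduces to a weighted blend of trend and health signals; the field names, weights, and saturation points below are illustrative only, not the production values:

```python
def risk_score(device: dict) -> float:
    """Blend disk trend, alert frequency and health into one risk score.
    All weights and field names here are hypothetical placeholders."""
    disk = min(device["disk_growth_pct_per_day"] / 2.0, 1.0)  # 2%/day saturates
    alerts = min(device["alerts_last_7d"] / 10.0, 1.0)        # 10 alerts saturates
    health = 1.0 - device["health"]                           # health in 0..1
    return 0.5 * disk + 0.3 * alerts + 0.2 * health

def top_risks(devices: list[dict], n: int = 10) -> list[dict]:
    """The daily digest is just the top-N devices by score."""
    return sorted(devices, key=risk_score, reverse=True)[:n]
```

The point of saturating each signal before weighting is that one runaway metric cannot drown out the others in the blend.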
5-Signal RAG + GraphRAG + Cross-Encoder Rerank + Staleness Warnings
Retrieval uses Reciprocal Rank Fusion across 5 signals:
- Semantic — nomic-embed-text (768 dims) via Ollama on RTX 3090 Ti, with asymmetric search_query:/search_document: prefixes
- Keyword — hostname, error code, resolution text matching
- Compiled Wiki — 45 articles from 7+ sources, daily recompilation
- Session Transcripts — 838 MemPalace verbatim exchange chunks (weight 0.4)
- Chaos Baselines — chaos experiment results by hostname (weight 0.35)
Plus a GraphRAG knowledge graph (360 entities, 193 relationships) for incident→host→alert traversal. A dedicated cross-encoder reranker (BAAI/bge-reranker-v2-m3 served at nllei01gpu01:11436, sqrt-blended to handle bimodal scores) reorders the fused results; when its top score falls below 0.7, a cross-chunk synth step (local qwen2.5:7b via Ollama, SYNTH_BACKEND=qwen by default since 2026-04-19) composes an answer from the top 10 fresh candidates — +13pp judge-graded hit@5 on the 50-query hard-retrieval set. An mtime-sort intent detector bypasses semantic retrieval when the query asks for “files created in the last N hours” (satisfied by source_mtime ordering, not cosine similarity). Age-proportional staleness warnings flag results older than 7 days.
Karpathy-Style Knowledge Compilation
Following Andrej Karpathy’s LLM Knowledge Bases pattern: 7+ raw sources compiled into 45 browsable wiki articles with auto-maintained indexes, contradiction detection, and health checks. The compiler runs daily with SHA-256 incremental hashing. All articles embedded into the RAG pipeline as the 3rd fusion signal. Each row carries a source_mtime column so “what changed in the last 48 hours” queries work as a real retrieval mode.
88K Tool Calls Instrumented with OTel Tracing
Every tool call (88,474 across 108 types) is logged with name, duration, exit code, and error type. 39K OTel spans exported to OpenObserve (OTLP). OpenObserve added as Grafana datasource (alongside Prometheus + Loki) — all observability in one UI. Per-tool error rates and p50/p95 latency visible in Grafana. 18,220 infrastructure SSH/kubectl commands tracked across 232 devices. 39 credentials monitored with 90-day rotation policy. Per-source token caps prevent context overflow (incident 4K, wiki 4K, lessons 2K, memory 2K, diary 1.5K, transcript 1.5K).
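The per-source token caps above can be pictured as a simple budgeted assembly step; in this sketch whitespace tokenization stands in for the real tokenizer, and the section format is illustrative:

```python
# Caps from the text; values are tokens per source.
CAPS = {"incident": 4000, "wiki": 4000, "lessons": 2000,
        "memory": 2000, "diary": 1500, "transcript": 1500}

def cap_section(text: str, cap: int) -> str:
    """Truncate to cap tokens. Whitespace split is a stand-in tokenizer."""
    return " ".join(text.split()[:cap])

def build_context(sections: dict[str, str]) -> str:
    """Each source is clipped to its own cap before assembly, so no
    single source can crowd the others out of the context window."""
    return "\n\n".join(
        f"### {name}\n{cap_section(body, CAPS.get(name, 1000))}"
        for name, body in sections.items()
    )
```

Capping per source rather than globally is the design choice that matters: a 50K-token transcript cannot evict the incident history it is supposed to complement.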
Hardened RAG Evaluation
The RAGAS golden set was hardened in April 2026 from 18 saturated queries (faithfulness ~0.88 across configs — couldn’t measure pipeline improvements) to 33 queries with 15 hard-eval tagged across 5 categories: multi-hop, temporal, negation, meta, and cross-corpus corroboration. Easy vs hard queries now show a 10× faithfulness differential, so retrieval changes are measurable again. A weekly hard-eval cron (Monday 05:00 UTC) on a 50-query hard-retrieval-v2 set produces judge-graded hit@5 = 0.90 and emits six kb_hard_eval_* Prometheus metrics with absent-guard alerts.
Structured Agentic Substrate — 9 adoptions from the OpenAI Agents SDK (2026-04-20)
The official openai/openai-agents-python SDK (v0.14.2, ~88K LOC) was audited against this system; 11 gaps surfaced, 9 landed as a coherent batch:
- Schema versioning on 9 session/audit tables + a central registry mirroring the SDK’s RunState.CURRENT_SCHEMA_VERSION / SCHEMA_VERSION_SUMMARIES pattern. Writer/reader shape drift now fails fast instead of silently corrupting replay.
- 13 typed events in a new event_log table — tool_started/ended, handoff_requested/completed/cycle_detected/compaction, reasoning_item_created, mcp_approval_*, agent_updated, message_output_created, tool_guardrail_rejection, agent_as_tool_call. Replaces free-form Matrix strings with Grafana-queryable telemetry.
- Per-turn lifecycle hooks — session-start.sh, post-tool-use.sh, user-prompt-submit.sh, and a new session-end.sh (the on_final_output equivalent) feed a session_turns table with per-turn cost, tokens, duration, tool count.
- 3-behavior rejection taxonomy — allow / reject_content (retry with hint) / deny (hard halt), mirroring the SDK’s ToolGuardrailFunctionOutput. Every rejection is a typed event with a non-empty message (audit invariant).
- HandoffInputData envelope — zlib+b64 payload carrying the prior agent’s input_history, pre_handoff_items, new_items, run_context. 176 KB input_history → 752 B on the wire (0.43% ratio). No more re-deriving context via RAG on escalation.
- Transcript compaction on handoff — opt-in per escalation. Local gemma3:12b compresses long T1 triage to a 1-turn summary; Haiku as circuit-breaker fallback.
- Agent-as-tool wrapper — wraps the 10 sub-agent definitions as callable tools so the orchestrator LLM can conditionally invoke them in the ambiguous-risk (0.4–0.6) band, complementing the existing deterministic routing.
- Handoff depth + cycle detection — depth ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent appearing twice in the chain is refused and logged. Atomic BEGIN IMMEDIATE transactions with PRAGMA busy_timeout=10000 for race-free concurrent bumps.
- Immutable per-turn snapshots — captured BEFORE each mutating tool (Bash, Edit, Write, Task; read-only skipped). rollback_to(snapshot_id) restores any prior sessions row. 7-day retention.
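The envelope mechanics are a few lines of stdlib; this sketch uses hypothetical function names, and the extreme compression ratio falls out of how repetitive long transcripts are:

```python
import base64
import json
import zlib

def pack_envelope(payload: dict) -> str:
    """zlib-compress the JSON payload, then base64 so it travels safely
    through text-only channels."""
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(raw, level=9)).decode()

def unpack_envelope(blob: str) -> dict:
    """Exact inverse: base64-decode, decompress, parse."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))
```

Because triage transcripts repeat the same tool banners and prompts over and over, zlib collapses them dramatically, which is why shipping the full history costs less than re-deriving it via RAG.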
Four new SQLite tables (event_log, handoff_log, session_state_snapshot, session_turns) bring the total from 31 to 35. Migrations 006–011 apply idempotently on both fresh and legacy DBs.
QA Suite — 411/0 PASS (99.52%), 44 suite files
A pytest-style bash harness (scripts/qa/run-qa-suite.sh, 44 suite files, ~3–5 min under full-suite load) verifies every adoption with a JSON scorecard output. A per-suite QA_PER_SUITE_TIMEOUT wrapper (default 120 s) caps any slow/wedged suite and emits a synthetic FAIL record to the scorecard so the orchestrator never hangs silently. Coverage highlights:
- Writer stamping — 11 / 11 writers + 5 / 5 n8n workflow INSERT sites asserted.
- Pattern-by-pattern — 53 deny + 32 reject_content / allow tests.
- Per-event payload shapes — all 13 event classes round-trip through CLI + Python.
- Concurrent safety — 8 parallel handoff_depth.bump() calls asserted no-lost-updates. The test surfaced a real race condition (Python sqlite3’s default isolation_level="" defeating BEGIN IMMEDIATE); fix shipped in the same commit.
- Mock HTTP server — stdlib-only forking server faking ollama / anthropic for offline compaction happy-path testing.
- 6 e2e scenarios — happy path (all 9 adoptions in one flow), cycle prevention, crash rollback, schema forward-compat, envelope-to-subagent, compaction in handoff.
- Benchmarks — p95 latencies: event emit 111 ms, handoff bump 108 ms, envelope encode 76 ms, snapshot capture 86 ms, hook 198 ms. Migration on a 10K-row legacy DB: ~200 ms. Compression ratio 0.43% (23× better than target).
Writing the tests also surfaced — and fixed — four code bugs: legacy hooks emitting JSON that breaks Claude Code validation, five writers hardcoding the prod DB path, the missing on_final_output hook, and schema.sql lacking canonical CREATE TABLEs for the new versioned tables.
CLI-Session RAG Capture — interactive sessions flow into RAG too (2026-04-20)
Before this, only YT-backed agentic sessions had their transcripts, tool calls, and extracted knowledge written into the shared RAG tables. Interactive Claude Code CLI sessions — operator typing directly into a terminal, no webhook, no YT ticket — were only captured by a token-counting poller for cost tracking. Their reasoning, tool use, and outcomes were lost to retrieval.
A 3-tier pipeline (IFRNLLEI01PRD-646/-647/-648) closes the gap with a single cron line that chains three idempotent steps over every CLI JSONL:
- Archive transcripts — exchange-pair chunks into session_transcripts with nomic-embed-text embeddings; sessions above 5000 assistant chars also get a doc-chain refined summary row.
- Parse tool calls — tool_use/tool_result pairs into tool_call_log, tagged issue_id='cli-<uuid>' so both tables join cleanly.
- Extract knowledge — gemma3:12b in strict-JSON mode over the summary rows → structured {root_cause, resolution, subsystem, tags, confidence} → incident_knowledge with project='chatops-cli', embedded for retrieval.
Retrieval weights chatops-cli rows at CLI_INCIDENT_WEIGHT=0.75 by default so real infra incidents still win close ties. A byte-offset watermark skips unchanged files so the nightly cron drains the ~2,300-file backlog incrementally without re-chunking settled sessions. Soak test (10 files): 12 chunks + 245 tool calls + 4 summaries + 4 extracted knowledge rows; gemma correctly classified one sample as subsystem=sqlite-schema, tags=[schema, migration, versioning, data], confidence=0.95. 12 QA tests covering flag parsing, watermark roundtrip, path inference, sanitization, and retrieval weighting — all PASS.
Skill Authoring Discipline — 6 dimensions closed vs google/agents-cli (2026-04-23)
Most of the heavy lifting in an agentic system lives in the skills and sub-agents that carry the discipline — when to use a skill, when not to, what to resist doing, how to prove the work is actually done. A 2026-04-23 deep audit compared this platform against google/agents-cli (Google’s own reference implementation of skill-authoring convention for Gemini/ADK). The scorecard ran 16 dimensions; this platform won 9 on raw capability, but trailed on 6 skill-authoring dimensions that agents-cli treats as first-class: phase-gate choreography, skill discoverability, “when NOT to use” anti-guidance, inline behavioral anti-patterns, governance/versioning, and auto-generated single-source-of-truth indexes.
An 11-commit uplift (YouTrack umbrella IFRNLLEI01PRD-712, Phases A→J, direct-pushed to main, zero reverts) closed every gap:
- Master phase-gate skill — a new chatops-workflow/SKILL.md codifies the Phase 0→6 incident lifecycle (triage → drift-check → alert-context → propose → approve → execute → post-incident) with explicit exit criteria per phase. The Runner’s Build Prompt node force-injects the full skill body into every session’s system prompt, marker-delimited for surgical removal and with the pre-injection workflow snapshot preserved as a rollback anchor. Proven end-to-end by a real Runner session whose first tool call was grep -i "Phase 0" against its own injected prompt.
- Auto-generated skill index — scripts/render-skill-index.py walks all SKILL.md + agent frontmatter and emits docs/skills-index.md as the canonical single source of truth. A drift-guard QA test (test-656-skill-index-fresh.sh) fails CI if the committed index would differ from a fresh render. Wired into the daily 04:30 UTC wiki-compile cron so the browsable wiki picks it up automatically.
- Versioned + machine-audited skills — every SKILL.md and .claude/agents/*.md frontmatter now carries version: 1.x.0 + a requires: {bins, env} block. scripts/audit-skill-requires.sh verifies declared binaries (which) and env vars (test -n); a Prometheus exporter feeds two new alerts (SkillPrereqMissing, SkillMetricsExporterStale). scripts/audit-skill-versions.sh walks git history for skill bodies that changed without a version bump. Per-skill semver convention formalized in docs/runbooks/skill-versioning.md (patch/minor/MAJOR tied to changes in the skill contract — name, description, allowed-tools, requires, output format).
- “Do NOT use for X” anti-guidance — every primary skill/agent description now ends with an explicit negative-guidance clause pointing to the correct alternative. Measurably reduces over-routing (e.g., security-analyst no longer gets picked for disk-full alerts whose symptom overlaps).
- 46 Shortcuts-to-Resist rows inlined on 11 agents — each row drawn verbatim from the matching memory/feedback_*.md lesson with source citation. Behavioral inoculation at the surface where the model is about to act, instead of trusting RAG to surface 50+ scattered lesson files on demand.
- Proving-Your-Work directive + evidence_missing signal — the risk classifier now emits evidence_missing when CONFIDENCE ≥ 0.8 is claimed without any visible tool output or code fence in the reply, forcing [POLL] instead of [AUTO-RESOLVE]. Mirrored in the Runner’s Prepare Result node so [AUTO-RESOLVE] markers are stripped and a GUARDRAIL EVIDENCE-MISSING: banner is prepended to unproven high-confidence replies before they reach Matrix.
- Operator-vocabulary map — config/user-vocabulary.json (20 entries) disambiguates operator shorthand (“the firewall” → nllei01fw01 / grskg01fw01, “xs4all” → “budget” post-rename, etc.). The prompt-submit hook scans every incoming message and logs a typed vocabulary event to event_log on match, fed back into Grafana as a clarification-request reduction metric.
Scorecard delta: 3.94 → 4.94 average across 16 dimensions (13/16 now at 5/5, was 9/16). All 6 targeted gap dimensions closed. +27 new QA tests (test-656/-660/-718/-724/-726/-727), all passing; the QA orchestrator gained a per-suite 120-second timeout guard as part of the same pass so no future wedged suite can silently stall the whole run. Full audit memo: docs/scorecard-post-agents-cli-adoption.md.
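The evidence_missing check described above is easy to picture in code. This sketch is a stand-in: the CONFIDENCE marker syntax and the use of a code fence as the sole evidence proxy are assumptions, not the production detector:

```python
import re

FENCE = chr(96) * 3  # a literal triple-backtick without closing this block

def evidence_missing(reply: str) -> bool:
    """Fires when confidence >= 0.8 is claimed but the reply shows no
    code fence (standing in for 'visible tool output'). The marker
    format is illustrative."""
    m = re.search(r"CONFIDENCE[:\s]+([01](?:\.\d+)?)", reply)
    if not m or float(m.group(1)) < 0.8:
        return False
    return FENCE not in reply

def guard(reply: str) -> str:
    """Strip unearned [AUTO-RESOLVE], force [POLL], prepend the banner."""
    if evidence_missing(reply) and "[AUTO-RESOLVE]" in reply:
        return ("GUARDRAIL EVIDENCE-MISSING:\n"
                + reply.replace("[AUTO-RESOLVE]", "[POLL]"))
    return reply
```

The key property is that the check is deterministic and in-band: the model cannot talk its way past it, because confidence claims are only honored when accompanied by visible work.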
The 3-Tier Architecture
```
Alert Source        Tier 1        Tier 2          Tier 3
─────────────       ──────        ──────          ──────
LibreNMS   ┐       OpenClaw      Claude Code     Human
Prometheus ├──► n8n ──► (GPT-5.1) ──► (Opus 4.6) ──► (Matrix)
CrowdSec   ┤       17 skills     10 sub-agents   polls
GitLab CI  ┘       7-21 sec      5-15 min        reactions
```
- Tier 1 (OpenClaw / GPT-5.1): Fast triage (7-21s). Queries NetBox CMDB, runs hybrid semantic search, extracts procedural knowledge from 55 CLAUDE.md files + 200+ operational memory rules, SSHes to the host. Creates YouTrack issues. Handles 80%+ of alerts without escalation.
- Tier 2 (Claude Code / Opus 4.6): Deep analysis (2-15 min). A Haiku planner first generates a 3-5 step investigation plan and queries AWX for matching playbooks from 41 templates; Claude Code then follows the plan, launches AWX jobs (dry_run first), delegates to 10 sub-agents in parallel, and proposes remediation via [POLL]. All agents write diary entries for cross-session learning.
- Tier 3 (Human): Clicks a poll option in Matrix. The system stops and waits here — it never acts autonomously on infrastructure.
Three Subsystems
| Subsystem | Scope | Alert Sources |
|---|---|---|
| ChatOps | Infrastructure availability, performance, maintenance | LibreNMS, Prometheus, Synology DSM |
| ChatSecOps | Intrusion detection, vulnerability scanning, MITRE ATT&CK mapping | CrowdSec (54 scenarios → 21 ATT&CK techniques), vulnerability scanners |
| ChatDevOps | CI/CD failures, code review, multi-repo refactoring | GitLab CI webhooks |
All three share the same engine: n8n orchestration, Matrix as human-in-the-loop, and the 3-tier agent architecture.
Safety — 7 Layers
Because “the prompt says don’t do that” is not a security boundary:
| Layer | Mechanism | Bypassed by prompt injection? |
|---|---|---|
| Claude Code hooks | 78 blocked patterns (37 destructive + 22 exfil + 7 injection) + 15 protected file patterns + word-boundary precision on single-word commands (passwd/useradd/halt etc.). Now emits the 3-behavior taxonomy — allow (silent exit 0) / reject_content (exit 2 with retry hint) / deny (hard halt) — with every rejection landing as a typed tool_guardrail_rejection event. An evidence_missing risk signal fires in-band when CONFIDENCE ≥ 0.8 is claimed without a visible tool output block, forcing [POLL] and stripping unearned [AUTO-RESOLVE] markers. | No — deterministic, fires first |
| safe-exec.sh | Code-level blocklist, rate limiting (30/min), exfiltration detection | No — code enforcement |
| exec-approvals.json | 36 specific skill patterns, no wildcards | No — config enforcement |
| Input sanitization | 17 injection patterns (encoding obfuscation, role confusion, delimiter injection, social engineering) | No — code enforcement |
| Evaluator-Optimizer | Haiku screening for high-stakes responses (3-node n8n flow) | No — second-pass review |
| Credential / PII scanning | 16 regex patterns + 39 credentials tracked with rotation dates | N/A |
| Approval gates | Infrastructure changes require human thumbs-up or poll vote | No — workflow enforcement |
| Handoff depth + cycle detection | Atomic counter on sessions.handoff_depth; ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent twice in the chain is refused and logged as handoff_cycle_detected | No — transaction-serialised in SQLite |
Additional: €5/session warning, $25/day budget ceiling, confidence gating (< 0.5 = STOP, < 0.7 = escalate), tool call limit of 75 per session. Synth failure handling: when Haiku synth is opted into (SYNTH_BACKEND=haiku), SYNTH_HAIKU_FORCE_FAIL injects 5 failure modes (429 / auth / timeout / network / empty) and the pipeline falls back to local qwen2.5 without breaking the response chain. Every mutating tool call is preceded by an immutable snapshot to session_state_snapshot for mid-session rollback.
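The budget and confidence gates above are deterministic thresholds, which makes them easy to express directly; the function shape and return strings here are illustrative, the threshold values are the documented ones:

```python
def session_gate(confidence: float, tool_calls: int,
                 day_spend_usd: float, session_cost_eur: float) -> str:
    """Apply the documented hard limits in priority order:
    budget and tool-call ceilings halt, low confidence stops or
    escalates, the per-session cost only warns."""
    if day_spend_usd >= 25.0:
        return "HALT: daily budget ceiling"
    if tool_calls >= 75:
        return "HALT: tool call limit"
    if confidence < 0.5:
        return "STOP"
    if confidence < 0.7:
        return "ESCALATE"
    if session_cost_eur >= 5.0:
        return "PROCEED (cost warning)"
    return "PROCEED"
```

Ordering matters: spend and tool-call ceilings are checked before confidence so a runaway session halts even when the model is confident.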
Evaluation System
All evals run deterministically (temperature=0, seed=42).
| System | What it measures |
|---|---|
| 161+ Test Scenarios | 58 eval scenarios (3 sets) + 54 adversarial + 23 holistic E2E + 22 mempalace + 22 security-hook + 9 KG-traverse + 17 synth-fallback + 20 qwen-JSON reliability |
| Hard-Retrieval v2 | 50-query weekly eval — judge-graded hit@5 = 0.90, p50 latency 5.7s, p95 13.6s |
| RAGAS Golden Set | 33 queries (15 hard-eval tagged) across multi-hop / temporal / negation / meta / cross-corpus |
| Prompt Scorecard | 19 prompt surfaces graded daily on 6 dimensions |
| LLM-as-a-Judge | Every session scored by local gemma3:12b (default since 2026-04-19, 85% Haiku-agreement on 60-query calibration); flagged sessions re-scored by Opus |
| Agent Trajectory | Per-session step scoring from JSONL transcripts (8 infra / 4 dev steps) |
| Self-Improving Prompts | Low-scoring dimensions auto-generate prompt patches (5 currently active) |
| Eval Flywheel | Monthly: analyze → measure → improve → validate cycle |
| A/B Testing | react_v1 vs react_v2 variants, deterministic by issue hash |
| CI Eval Gate | eval-regression stage between test and review — blocks bad merges |
Tech Stack
| Component | Role |
|---|---|
| n8n | Workflow orchestration — 26 workflows (runner, bridge, poller, session-end, teacher-runner, receivers, etc.) |
| OpenClaw (GPT-5.1) | Tier 1 — fast triage, 17 native skills |
| Claude Code (Opus 4.6) | Tier 2 — deep analysis, 11 sub-agents + master chatops-workflow phase-gate skill, Plan-and-Execute, ReAct reasoning |
| AWX | 41 Ansible playbooks — maintenance, cert sync, K8s drain, updates, deployment |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies |
| YouTrack | Issue tracking, state management, knowledge sink |
| NetBox | CMDB — 310+ devices, 421 IPs, 39 VLANs |
| Prometheus + Grafana | 11 custom dashboards, 64+ panels, 16+ exporters, 4 alert-rule files (all sidecar-provisioned via ConfigMaps) |
| OpenObserve | OTel tracing — OTLP span collection + Prometheus-compatible queries |
| Ollama (RTX 3090 Ti) | Local embeddings (nomic-embed-text) + local judge (gemma3:12b) + local synth (qwen2.5:7b) |
| bge-reranker-v2-m3 | Cross-encoder reranker on nllei01gpu01:11436 — reorders fused RAG results before synth |
Tri-Source Audit — 11/11 A+
Scored against three knowledge sources: Gulli’s Agentic Design Patterns (21/21 patterns), Anthropic Claude Certified Architect Foundations, and 6 industry sources. Result: 11/11 dimensions at A+ (100%). Additionally, an Operational Activation Audit verified all database tables are populated with 150K+ production rows — scoring infrastructure that actually produces data, not just exists.
Holistic Health Check — 96%+
holistic-agentic-health.sh runs 142 automated checks across 37 sections — verifying every feature in this page actually works in production. Not just “does the file exist?” but functional tests (RAG search returns known incidents, Ollama generates 768-dim embeddings, trajectory scoring produces output), cross-site verification (6/6 VTI tunnels READY, 7 BGP peers, ping 42ms), infrastructure health (7/7 K8s nodes, 139/139 Prometheus targets UP, GPU 46°C), and security compliance (both scanners ran within 26h, MITRE Navigator accessible). Results stored in SQLite for trending, exported to Prometheus for Grafana dashboards. Runs in 18 seconds.
Session-Holistic E2E — 23/23
A second suite of 23 end-to-end tests (157s total runtime) covers 18 YouTrack issues with before/after scoring against production data: 23/23 PASS at the last run (2026-04-19). Each test drives a real session through the full pipeline — Tier 1 triage, Tier 2 plan/execute, RAG retrieval, judge scoring — and asserts on output quality, not just exit codes. Output: docs/session-holistic-e2e-*.md.
Status
| Milestone | Status |
|---|---|
| Self-improving prompts (eval → auto-patch → re-eval) | Production |
| Plan-and-Execute + AWX runbooks (41 playbooks) | Production |
| Predictive alerting (LibreNMS trending + daily digest) | Production |
| 5-signal RAG + GraphRAG + cross-encoder rerank + temporal filter (360 entities, 193 rels) | Production |
| Cross-chunk synth (local qwen2.5:7b, +13pp hit@5 on hard eval) + 5-mode failure injection | Production |
| Local-first judge + synth (gemma3:12b + qwen2.5:7b, 2026-04-19 flip) | Production |
| mtime-sort intent detector + list-recent CLI | Production |
| Karpathy-style compiled wiki (45 articles, source_mtime column) | Production |
| OTel tracing (39K spans → OpenObserve) | Production |
| Tool call instrumentation (88K calls, 108 types) | Production |
| 3-tier agent architecture | Production |
| 26 n8n workflows | Production |
| 10 MCP servers (153 tools) | Production |
| 11 sub-agents with diary entries | Production |
| Skill-authoring uplift vs google/agents-cli — scorecard 3.94 → 4.94, 6 gap dimensions closed (Phases A→J, umbrella IFRNLLEI01PRD-712) | Production |
| Master phase-gate skill (chatops-workflow/SKILL.md) force-injected into every Runner session | Production |
| Auto-generated skills index (docs/skills-index.md) drift-gated by test-656 | Production |
| Skill versioning + requires audit — 17/17 SKILL.md with version: 1.x.0 + requires: {bins, env}; 2 new Prom alerts | Production |
| 46 Shortcuts-to-Resist rows inlined across 11 agents (source-cited to memory/feedback_*.md) | Production |
| evidence_missing risk signal — forces [POLL] when CONFIDENCE ≥ 0.8 without a visible tool output block | Production |
| Operator-vocabulary map (config/user-vocabulary.json, 20 entries) — prompt-submit hook logs vocabulary events to event_log | Production |
| 21/21 agentic design patterns | Audited (7 at A+) |
| Tri-source + operational activation audit | 11/11 A+ |
| 7-layer safety + word-boundary precision on hook patterns | Production |
| OpenAI Agents SDK adoption batch (9 / 11 gaps landed, 2026-04-20) | Production |
| Schema versioning on 13 tables + central CURRENT_SCHEMA_VERSION registry | Production |
| 13 typed session events in event_log with Prom exporter | Production |
| Per-turn hooks (session-start.sh / post-tool-use.sh / user-prompt-submit.sh / session-end.sh) + session_turns table | Production |
| 3-behavior rejection taxonomy (allow / reject_content / deny) + event_log audit invariant | Production |
| HandoffInputData envelope (zlib+b64, 0.43% ratio) + handoff_log audit table | Production |
| Handoff transcript compaction (local gemma first, Haiku fallback, circuit-breaker-aware) | Production |
| Agent-as-tool wrapper for 10 sub-agents | Production |
| Handoff depth + cycle detection (atomic IMMEDIATE tx, PRAGMA busy_timeout=10000) | Production |
| Immutable per-turn snapshots + rollback + 7-day retention cron | Production |
| 42 SQLite tables (150K+ rows; +6 from the SDK batch + 3 from the teacher-agent tiers: learning_progress, learning_sessions, teacher_operator_dm) | Production |
| Hardened RAGAS golden set — 33 queries (15 hard-eval), 10× differential | Production |
| Weekly hard-eval cron (50-q) — judge_hit@5 = 0.90 | Production |
| 3 absent-metric alerts guarding the staleness alerts themselves | Production |
| 161+ eval scenarios across 8 test suites | All passing |
| Preference-iterating prompt patcher — N-candidate A/B trials + Welch t-test + auto-promote | Production |
| CLI-session RAG capture — interactive CLI sessions now flow into session_transcripts + tool_call_log + incident_knowledge | Production |
| QA suite — 44 files, 411/0 PASS (2 benign skips), per-suite timeout guard | 99.52% |
| 11-dashboard observability (64+ panels) | Production |
| Weekly chaos cron (self-selecting, preflight gate, Matrix notifications) | Production (CMM L3) |
| A/B prompt testing | Active |
| Holistic health check (142 checks, 37 sections) | 96%+ pass |
| Session-holistic E2E (23 tests covering 18 YouTrack issues) | 100% (23/23) |
Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.
