One person. 310+ infrastructure objects across 6 sites. 3 firewalls, 13 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there’s no team to call. There never is. So I built three AI subsystems that handle the detective work — and wait for a thumbs-up before touching anything.

GitHub: papadopouloskyriakos/agentic-chatops

What Makes This Different

Self-Improving Prompts — now with A/B trials

The system evaluates its own performance and auto-patches its instructions. Every session is scored by an LLM-as-a-Judge on 5 quality dimensions — local gemma3:12b via Ollama by default (flipped to local-first on 2026-04-19 after a 60-query calibration against Haiku showed 85% agreement), with Haiku retained for calibration re-runs and flagged sessions re-scored by Opus.

When a dimension trends below threshold, the 2026-04-20 preference-iterating patcher (prompt-patch-trial.py) generates 3 candidate instruction variants (concise / detailed / examples) per low-scoring dimension — plus a no-patch control arm. Future matching sessions are routed deterministically via BLAKE2b(issue_id || trial_id) % (N+1). A daily cron runs a one-sided Welch t-test once every arm has 15+ samples; the winner is promoted to config/prompt-patches.json only if it beats control by ≥ 0.05 points at p < 0.1. Otherwise the trial aborts and the dimension stays unpatched until the candidate pool is edited. Prompt-level policy iteration — no model weights ever fine-tuned, no wasted patches shipped blind.
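The deterministic arm routing is just a keyed hash modulo the arm count; a minimal sketch (the function name and ID formats are illustrative, not the production code):

```python
import hashlib

def route_arm(issue_id: str, trial_id: str, n_candidates: int) -> int:
    """Deterministically route a session to one of N candidate arms
    plus a control arm, via BLAKE2b(issue_id || trial_id) % (N+1).
    Arm 0 is the no-patch control; 1..N are the candidate variants."""
    digest = hashlib.blake2b(f"{issue_id}{trial_id}".encode()).digest()
    return int.from_bytes(digest, "big") % (n_candidates + 1)
```

Because the hash is keyed on the issue and trial IDs only, the same issue always lands in the same arm for the life of a trial, so per-arm samples stay independent without any routing state to persist.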

AI Planner Wired to 41 Proven Ansible Playbooks

Before Claude Code investigates, a Haiku planner generates a 3-5 step investigation plan. The planner queries AWX for matching Ansible playbooks from 41 proven templates — maintenance windows, cert rotation, K8s drain/restore, PVE kernel updates, DMZ deployments. Plans naturally include “Run AWX Template 64 with dry_run=true” as remediation steps. Inspired by microsoft/sre-agent’s “Knowledge Base as runbooks” pattern, but using existing Ansible instead of inventing a new format.

Predictive Alerting

Instead of only reacting after alerts fire, the system queries LibreNMS API daily for trending risk across both site instances. Devices are scored on disk usage trends, alert frequency, and health signals. A daily top-10 risk report posts to Matrix before problems become incidents. 123 devices scanned, 23 at elevated risk in the latest run.
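The scoring itself is a weighted sum over the trend signals; a toy sketch (the weights, field names, and normalization below are illustrative placeholders, not the production values):

```python
def risk_score(device: dict) -> float:
    """Toy weighted risk score over disk trend, alert frequency, and health.
    All weights and field names are illustrative, not the production config."""
    disk_trend = device.get("disk_pct_growth_7d", 0.0)   # % disk growth per week
    alert_freq = device.get("alerts_30d", 0)             # alerts in last 30 days
    health = device.get("health", 1.0)                   # 0..1, 1 = fully healthy
    score = (0.5 * min(disk_trend / 10, 1.0)             # cap each signal at 1.0
             + 0.3 * min(alert_freq / 20, 1.0)
             + 0.2 * (1.0 - health))
    return round(score, 3)

devices = [
    {"host": "a", "disk_pct_growth_7d": 12, "alerts_30d": 5, "health": 0.9},
    {"host": "b", "disk_pct_growth_7d": 1, "alerts_30d": 0, "health": 1.0},
]
top10 = sorted(devices, key=lambda d: -risk_score(d))[:10]  # daily digest input
```

The top-10 slice of the sorted list is what would feed the daily Matrix digest.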

5-Signal RAG + GraphRAG + Cross-Encoder Rerank + Staleness Warnings

Retrieval uses Reciprocal Rank Fusion across 5 signals:

  1. Semantic — nomic-embed-text (768 dims) via Ollama on RTX 3090 Ti, with search_query: / search_document: asymmetric prefixes
  2. Keyword — hostname, error code, resolution text matching
  3. Compiled Wiki — 45 articles from 7+ sources, daily recompilation
  4. Session Transcripts — 838 MemPalace verbatim exchange chunks (weight 0.4)
  5. Chaos Baselines — chaos experiment results by hostname (weight 0.35)
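Weighted Reciprocal Rank Fusion over signals like the five above can be sketched in a few lines (k=60 is the common RRF default, not necessarily the production constant):

```python
from collections import defaultdict

def rrf(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion.
    rankings: {signal_name: [doc_id, ...] in rank order}
    Each signal contributes weight / (k + rank) per document."""
    weights = weights or {}
    scores = defaultdict(float)
    for signal, ranked_docs in rankings.items():
        w = weights.get(signal, 1.0)
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += w / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

fused = rrf(
    {"semantic": ["d1", "d2", "d3"],
     "keyword": ["d2", "d1"],
     "transcripts": ["d3", "d2"]},
    weights={"transcripts": 0.4},   # down-weighted signal, as in the text
)
```

A document ranked moderately by several signals beats one ranked first by a single signal, which is exactly the behavior you want when individual signals are noisy.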

Plus a GraphRAG knowledge graph (360 entities, 193 relationships) for incident→host→alert traversal. A dedicated cross-encoder reranker (BAAI/bge-reranker-v2-m3 served at nllei01gpu01:11436, sqrt-blended to handle bimodal scores) reorders the fused results; when its top score falls below 0.7, a cross-chunk synth step (local qwen2.5:7b via Ollama, SYNTH_BACKEND=qwen by default since 2026-04-19) composes an answer from the top 10 fresh candidates — +13pp judge-graded hit@5 on the 50-query hard-retrieval set. An mtime-sort intent detector bypasses semantic retrieval when the query asks for “files created in the last N hours” (satisfied by source_mtime ordering, not cosine similarity). Age-proportional staleness warnings flag results older than 7 days.
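The rerank-then-gate step can be sketched as follows. The exact blend formula is my assumption (the text only says "sqrt-blended"), so treat it as illustrative; the 0.7 synth threshold is from the text:

```python
import math

SYNTH_THRESHOLD = 0.7  # below this top cross-encoder score, fall back to synth

def blend(rrf_score, ce_score):
    """One plausible sqrt blend: a geometric-style mix that compresses the
    cross-encoder's bimodal (near-0 / near-1) score distribution.
    The production formula is not shown in this document."""
    return math.sqrt(max(rrf_score, 0.0) * max(ce_score, 0.0))

def rerank(candidates):
    """candidates: list of (doc_id, normalized_rrf_score, cross_encoder_score).
    Returns the reordered list plus a flag for the cross-chunk synth fallback."""
    ranked = sorted(candidates, key=lambda c: -blend(c[1], c[2]))
    needs_synth = not ranked or ranked[0][2] < SYNTH_THRESHOLD
    return ranked, needs_synth
```

When needs_synth is true, the pipeline would hand the top fresh candidates to the local synth model instead of returning the reranked list directly.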

Karpathy-Style Knowledge Compilation

Following Andrej Karpathy’s LLM Knowledge Bases pattern : 7+ raw sources compiled into 45 browsable wiki articles with auto-maintained indexes, contradiction detection, and health checks. The compiler runs daily with SHA-256 incremental hashing. All articles embedded into the RAG pipeline as the 3rd fusion signal. Each row carries a source_mtime column so “what changed in the last 48 hours” queries work as a real retrieval mode.
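The SHA-256 incremental hashing amounts to comparing per-source content hashes against a saved state file and recompiling only what changed; a minimal sketch (file names illustrative):

```python
import hashlib
import json
import pathlib

def compile_incremental(sources, state_file="hashes.json"):
    """Return the subset of source paths whose content hash changed
    since the last run; unchanged sources are skipped entirely."""
    state_path = pathlib.Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}
    changed = []
    for path in sources:
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if seen.get(str(path)) != digest:
            changed.append(path)          # this source needs recompilation
            seen[str(path)] = digest
    state_path.write_text(json.dumps(seen))
    return changed
```

Hashing content rather than trusting mtimes means a touch without a real edit costs nothing, while any byte-level change is always caught.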

88K Tool Calls Instrumented with OTel Tracing

Every tool call (88,474 across 108 types) is logged with name, duration, exit code, and error type. 39K OTel spans exported to OpenObserve (OTLP). OpenObserve added as Grafana datasource (alongside Prometheus + Loki) — all observability in one UI. Per-tool error rates and p50/p95 latency visible in Grafana. 18,220 infrastructure SSH/kubectl commands tracked across 232 devices. 39 credentials monitored with 90-day rotation policy. Per-source token caps prevent context overflow (incident 4K, wiki 4K, lessons 2K, memory 2K, diary 1.5K, transcript 1.5K).
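The per-source caps listed above can be enforced with a simple truncation budget per context section. The chars-per-token proxy below is illustrative; the production code may count real tokens:

```python
# Token caps from the text: incident 4K, wiki 4K, lessons 2K,
# memory 2K, diary 1.5K, transcript 1.5K.
CAPS = {"incident": 4000, "wiki": 4000, "lessons": 2000,
        "memory": 2000, "diary": 1500, "transcript": 1500}

def truncate_to_cap(text, source, chars_per_token=4):
    """Crude char-based proxy for a token cap (~4 chars/token for English)."""
    limit = CAPS.get(source, 1000) * chars_per_token
    return text if len(text) <= limit else text[:limit]

def build_context(sections):
    """Assemble the prompt context with every source held under its cap,
    so no single noisy source can blow the context window."""
    return "\n\n".join(
        f"## {src}\n{truncate_to_cap(body, src)}" for src, body in sections.items()
    )
```

Capping per source (rather than one global cap) guarantees a verbose transcript can never crowd out incident knowledge, and vice versa.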

Hardened RAG Evaluation

The RAGAS golden set was hardened in April 2026 from 18 saturated queries (faithfulness ~0.88 across configs — couldn’t measure pipeline improvements) to 33 queries with 15 hard-eval tagged across 5 categories: multi-hop, temporal, negation, meta, and cross-corpus corroboration. Easy vs hard queries now show a 10× faithfulness differential, so retrieval changes are measurable again. A weekly hard-eval cron (Monday 05:00 UTC) on a 50-query hard-retrieval-v2 set produces judge-graded hit@5 = 0.90 and emits six kb_hard_eval_* Prometheus metrics with absent-guard alerts.

Structured Agentic Substrate — 9 adoptions from the OpenAI Agents SDK (2026-04-20)

The official openai/openai-agents-python SDK (v0.14.2, ~88K LOC) was audited against this system; 11 gaps surfaced, 9 landed as a coherent batch:

  • Schema versioning on 9 session/audit tables + a central registry mirroring the SDK’s RunState.CURRENT_SCHEMA_VERSION / SCHEMA_VERSION_SUMMARIES pattern. Writer/reader shape drift now fails fast instead of silently corrupting replay.
  • 13 typed events in a new event_log table — tool_started/ended, handoff_requested/completed/cycle_detected/compaction, reasoning_item_created, mcp_approval_*, agent_updated, message_output_created, tool_guardrail_rejection, agent_as_tool_call. Replaces free-form Matrix strings with Grafana-queryable telemetry.
  • Per-turn lifecycle hooks — session-start.sh, post-tool-use.sh, user-prompt-submit.sh, and a new session-end.sh (the on_final_output equivalent) feed a session_turns table with per-turn cost, tokens, duration, tool count.
  • 3-behavior rejection taxonomy — allow / reject_content (retry with hint) / deny (hard halt), mirroring the SDK’s ToolGuardrailFunctionOutput. Every rejection is a typed event with non-empty message (audit invariant).
  • HandoffInputData envelope — zlib+b64 payload carrying prior agent’s input_history, pre_handoff_items, new_items, run_context. 176 KB input_history → 752 B on the wire (0.43% ratio). No more re-deriving context via RAG on escalation.
  • Transcript compaction on handoff — opt-in per escalation. Local gemma3:12b compresses long T1 triage to a 1-turn summary; Haiku as circuit-breaker fallback.
  • Agent-as-tool wrapper — wraps the 10 sub-agent definitions as callable tools so the orchestrator LLM can conditionally invoke them in the ambiguous-risk (0.4–0.6) band, complementing the existing deterministic routing.
  • Handoff depth + cycle detection — depth ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent twice in the chain is refused and logged. Atomic BEGIN IMMEDIATE transactions with PRAGMA busy_timeout=10000 for race-free concurrent bumps.
  • Immutable per-turn snapshots — capture BEFORE each mutating tool (Bash, Edit, Write, Task; read-only skipped). rollback_to(snapshot_id) restores any prior sessions row. 7-day retention.

Four new SQLite tables (event_log, handoff_log, session_state_snapshot, session_turns) bring the total from 31 to 35. Migrations 006–011 apply idempotently on both fresh and legacy DBs.
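The HandoffInputData envelope described above is essentially zlib-compressed, base64-encoded JSON; a minimal sketch:

```python
import base64
import json
import zlib

def encode_envelope(payload: dict) -> str:
    """Pack handoff context (input_history, pre_handoff_items, new_items,
    run_context) as zlib-compressed, base64-encoded JSON for the wire."""
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(raw, level=9)).decode()

def decode_envelope(blob: str) -> dict:
    """Exact inverse: base64-decode, decompress, parse."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))
```

Long agent transcripts are highly repetitive (repeated tool names, hostnames, log prefixes), which is why zlib can achieve extreme ratios like the 176 KB → 752 B figure quoted above.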

QA Suite — 411/0 PASS (99.52%), 44 suite files

A pytest-style bash harness (scripts/qa/run-qa-suite.sh, 44 suite files, ~3–5 min under full-suite load) verifies every adoption with a JSON scorecard output. A per-suite QA_PER_SUITE_TIMEOUT wrapper (default 120 s) caps any slow/wedged suite and emits a synthetic FAIL record to the scorecard so the orchestrator never hangs silently. Coverage highlights:

  • Writer stamping — 11 / 11 writers + 5 / 5 n8n workflow INSERT sites asserted.
  • Pattern-by-pattern — 53 deny + 32 reject_content / allow tests.
  • Per-event payload shapes — all 13 event classes round-trip through CLI + Python.
  • Concurrent safety — 8 parallel handoff_depth.bump() asserted no-lost-updates. The test surfaced a real race condition (Python sqlite3’s default isolation_level="" defeating BEGIN IMMEDIATE); fix shipped in the same commit.
  • Mock HTTP server — stdlib-only forking server faking ollama / anthropic for offline compaction happy-path testing.
  • 6 e2e scenarios — happy path (all 9 adoptions in one flow), cycle prevention, crash rollback, schema forward-compat, envelope-to-subagent, compaction in handoff.
  • Benchmarks — p95 latencies: event emit 111 ms, handoff bump 108 ms, envelope encode 76 ms, snapshot capture 86 ms, hook 198 ms. Migration on a 10K-row legacy DB: ~200 ms. Compression ratio 0.43% (23× better than target).

Writing the tests also surfaced — and fixed — four code bugs: legacy hooks emitting JSON that breaks Claude Code validation, five writers hardcoding the prod DB path, the missing on_final_output hook, and schema.sql lacking canonical CREATE TABLEs for the new versioned tables.
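The race fix mentioned above is worth spelling out: Python's sqlite3 module, with its default isolation_level, manages transactions implicitly and so defeats a hand-written BEGIN IMMEDIATE. Opening the connection with isolation_level=None puts it in autocommit mode so the explicit write lock actually takes effect. A sketch with illustrative table and column names:

```python
import sqlite3

def bump_handoff_depth(db_path: str, session_id: str) -> int:
    """Race-free read-modify-write of a per-session counter.
    isolation_level=None disables the driver's implicit transactions,
    so our explicit BEGIN IMMEDIATE acquires the write lock *before*
    the read, preventing lost updates under concurrency."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("PRAGMA busy_timeout=10000")
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT handoff_depth FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        depth = (row[0] if row else 0) + 1
        conn.execute(
            "UPDATE sessions SET handoff_depth = ? WHERE id = ?",
            (depth, session_id),
        )
        conn.execute("COMMIT")
        return depth
    finally:
        conn.close()
```

With the default isolation_level, the driver would open its own deferred transaction on the first statement, and a concurrent bumper could read the same stale depth.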

CLI-Session RAG Capture — interactive sessions flow into RAG too (2026-04-20)

Before this, only YT-backed agentic sessions had their transcripts, tool calls, and extracted knowledge written into the shared RAG tables. Interactive Claude Code CLI sessions — operator typing directly into a terminal, no webhook, no YT ticket — were only captured by a token-counting poller for cost tracking. Their reasoning, tool use, and outcomes were lost to retrieval.

A 3-tier pipeline (IFRNLLEI01PRD-646/-647/-648) closes the gap with a single cron line that chains three idempotent steps over every CLI JSONL:

  1. Archive transcripts — exchange-pair chunks into session_transcripts with nomic-embed-text embeddings; sessions above 5000 assistant chars also get a doc-chain refined summary row.
  2. Parse tool calls — tool_use / tool_result pairs into tool_call_log, tagged issue_id='cli-<uuid>' so both tables join cleanly.
  3. Extract knowledge — gemma3:12b in strict-JSON mode over the summary rows → structured {root_cause, resolution, subsystem, tags, confidence} → incident_knowledge with project='chatops-cli', embedded for retrieval.

Retrieval weights chatops-cli rows at CLI_INCIDENT_WEIGHT=0.75 by default so real infra incidents still win close ties. A byte-offset watermark skips unchanged files so the nightly cron drains the ~2,300-file backlog incrementally without re-chunking settled sessions. Soak test (10 files): 12 chunks + 245 tool calls + 4 summaries + 4 extracted knowledge rows; gemma correctly classified one sample as subsystem=sqlite-schema, tags=[schema, migration, versioning, data], confidence=0.95. 12 QA tests covering flag parsing, watermark roundtrip, path inference, sanitization, and retrieval weighting — all PASS.
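The byte-offset watermark pattern is simple enough to sketch (file names illustrative): record where you stopped reading each JSONL, and on the next run seek past it so settled sessions are never re-chunked:

```python
import json
import pathlib

def drain_new_lines(jsonl_path: str, watermark_file: str = "watermarks.json"):
    """Yield only the JSONL records appended since the last run,
    using a persisted per-file byte offset as the watermark."""
    wm_path = pathlib.Path(watermark_file)
    marks = json.loads(wm_path.read_text()) if wm_path.exists() else {}
    offset = marks.get(jsonl_path, 0)
    records = []
    with open(jsonl_path, "rb") as f:
        f.seek(offset)                 # skip everything already processed
        for line in f:
            if line.strip():
                records.append(json.loads(line))
        marks[jsonl_path] = f.tell()   # advance the watermark to EOF
    wm_path.write_text(json.dumps(marks))
    return records
```

Because JSONL is append-only, a byte offset is a safe watermark; a multi-thousand-file backlog then drains incrementally, a nightly slice at a time.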

Skill Authoring Discipline — 6 dimensions closed vs google/agents-cli (2026-04-23)

Most of the heavy lifting in an agentic system lives in the skills and sub-agents that carry the discipline — when to use a skill, when not to, what to resist doing, how to prove the work is actually done. A 2026-04-23 deep audit compared this platform against google/agents-cli (Google’s own reference implementation of skill-authoring convention for Gemini/ADK). The scorecard ran 16 dimensions; this platform won 9 on raw capability, but trailed on six skill-authoring dimensions that agents-cli treats as first-class: phase-gate choreography, skill discoverability, “when NOT to use” anti-guidance, inline behavioral anti-patterns, governance/versioning, and auto-generated single-source-of-truth indexes.

An 11-commit uplift (YouTrack umbrella IFRNLLEI01PRD-712, Phases A→J, direct-pushed to main, zero reverts) closed every gap:

  • Master phase-gate skill — a new chatops-workflow/SKILL.md codifies the Phase 0→6 incident lifecycle (triage → drift-check → alert-context → propose → approve → execute → post-incident) with explicit exit criteria per phase. The Runner’s Build Prompt node force-injects the full skill body into every session’s system prompt, marker-delimited for surgical removal and with the pre-injection workflow snapshot preserved as a rollback anchor. Proven end-to-end by a real Runner session whose first tool call was grep -i "Phase 0" against its own injected prompt.
  • Auto-generated skill index — scripts/render-skill-index.py walks all SKILL.md + agent frontmatter and emits docs/skills-index.md as the canonical single source of truth. A drift-guard QA test (test-656-skill-index-fresh.sh) fails CI if the committed index would differ from a fresh render. Wired into the daily 04:30 UTC wiki-compile cron so the browsable wiki picks it up automatically.
  • Versioned + machine-audited skills — every SKILL.md and .claude/agents/*.md frontmatter now carries version: 1.x.0 + a requires: {bins, env} block. scripts/audit-skill-requires.sh verifies declared binaries (which) and env vars (test -n); a Prometheus exporter feeds two new alerts (SkillPrereqMissing, SkillMetricsExporterStale). scripts/audit-skill-versions.sh walks git history for skill bodies that changed without a version bump. Per-skill semver convention formalized in docs/runbooks/skill-versioning.md (patch/minor/MAJOR tied to changes in the skill contract — name, description, allowed-tools, requires, output format).
  • “Do NOT use for X” anti-guidance — every primary skill/agent description now ends with an explicit negative-guidance clause pointing to the correct alternative. Measurably reduces over-routing (e.g., security-analyst no longer gets picked for disk-full alerts whose symptom overlaps).
  • 46 Shortcuts-to-Resist rows inlined on 11 agents — each row drawn verbatim from the matching memory/feedback_*.md lesson with source citation. Behavioral inoculation at the surface where the model is about to act, instead of trusting RAG to surface 50+ scattered lesson files on demand.
  • Proving-Your-Work directive + evidence_missing signal — the risk classifier now emits evidence_missing when CONFIDENCE ≥ 0.8 is claimed without any visible tool output or code fence in the reply, forcing [POLL] instead of [AUTO-RESOLVE]. Mirrored in the Runner’s Prepare Result node so [AUTO-RESOLVE] markers are stripped and a GUARDRAIL EVIDENCE-MISSING: banner is prepended to unproven high-confidence replies before they reach Matrix.
  • Operator-vocabulary map — config/user-vocabulary.json (20 entries) disambiguates operator shorthand (“the firewall” → nllei01fw01;grskg01fw01, “xs4all” → “budget” post-rename, etc.). The prompt-submit hook scans every incoming message and logs a typed vocabulary event to event_log on match, fed back into Grafana as a clarification-request reduction metric.

Scorecard delta: 3.94 → 4.94 average across 16 dimensions (13/16 now at 5/5, was 9/16). All 6 targeted gap dimensions closed. +27 new QA tests (test-656/-660/-718/-724/-726/-727), all passing; the QA orchestrator gained a per-suite 120-second timeout guard as part of the same pass so no future wedged suite can silently stall the whole run. Full audit memo: docs/scorecard-post-agents-cli-adoption.md.
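The evidence_missing signal above is a simple in-band check: a high confidence claim with no visible evidence block forces [POLL]. A sketch, where the marker regexes are my illustrative stand-ins (the production signal definition may differ):

```python
import re

# Illustrative markers: a stated confidence value, and "evidence" defined as
# any fenced code block or tool-output block in the reply.
CONF_RE = re.compile(r"CONFIDENCE\s*[:=]?\s*(0?\.\d+|1(?:\.0+)?)", re.I)
EVIDENCE_RE = re.compile(r"```.*?```|<tool_output>.*?</tool_output>", re.S)

def evidence_missing(reply: str, threshold: float = 0.8) -> bool:
    """True when the reply claims CONFIDENCE >= threshold but contains
    no visible tool output or code fence, i.e. an unproven claim that
    should be downgraded from [AUTO-RESOLVE] to [POLL]."""
    m = CONF_RE.search(reply)
    if not m or float(m.group(1)) < threshold:
        return False
    return not EVIDENCE_RE.search(reply)
```

The orchestrator side would then strip any [AUTO-RESOLVE] marker and prepend the guardrail banner before the reply reaches Matrix.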


The 3-Tier Architecture

Alert Source                    Tier 1                 Tier 2                 Tier 3
─────────────                   ──────                 ──────                 ──────
LibreNMS        ┐               OpenClaw               Claude Code            Human
Prometheus      ├──► n8n ──►    (GPT-5.1)         ──►  (Opus 4.6)        ──►  (Matrix)
CrowdSec        ┤               17 skills              10 sub-agents          polls
GitLab CI       ┘               7-21 sec               5-15 min               reactions
  • Tier 1 (OpenClaw / GPT-5.1): Fast triage (7-21s). Queries NetBox CMDB, runs hybrid semantic search, extracts procedural knowledge from 55 CLAUDE.md files + 200+ operational memory rules, SSHes to the host. Creates YouTrack issues. Handles 80%+ of alerts without escalation.
  • Tier 2 (Claude Code / Opus 4.6): Deep analysis (2-15 min). A Haiku planner first generates a 3-5 step investigation plan and queries AWX for matching playbooks from 41 templates; Claude Code then follows the plan, launches AWX jobs (dry_run first), delegates to 10 sub-agents in parallel, and proposes remediation via [POLL]. All agents write diary entries for cross-session learning.
  • Tier 3 (Human): Clicks a poll option in Matrix. The system stops and waits here — it never acts autonomously on infrastructure.

Three Subsystems

| Subsystem | Scope | Alert Sources |
|---|---|---|
| ChatOps | Infrastructure availability, performance, maintenance | LibreNMS, Prometheus, Synology DSM |
| ChatSecOps | Intrusion detection, vulnerability scanning, MITRE ATT&CK mapping | CrowdSec (54 scenarios → 21 ATT&CK techniques), vulnerability scanners |
| ChatDevOps | CI/CD failures, code review, multi-repo refactoring | GitLab CI webhooks |

All three share the same engine: n8n orchestration, Matrix as human-in-the-loop, and the 3-tier agent architecture.

Safety — 7 Layers

Because “the prompt says don’t do that” is not a security boundary:

| Layer | Mechanism | Bypassed by prompt injection? |
|---|---|---|
| Claude Code hooks | 78 blocked patterns (37 destructive + 22 exfil + 7 injection) + 15 protected file patterns + word-boundary precision on single-word commands (passwd/useradd/halt etc.). Now emits the 3-behavior taxonomy — allow (silent exit 0) / reject_content (exit 2 with retry hint) / deny (hard halt) — with every rejection landing as a typed tool_guardrail_rejection event. An evidence_missing risk signal fires in-band when CONFIDENCE ≥ 0.8 is claimed without a visible tool output block, forcing [POLL] and stripping unearned [AUTO-RESOLVE] markers. | No — deterministic, fires first |
| safe-exec.sh | Code-level blocklist, rate limiting (30/min), exfiltration detection | No — code enforcement |
| exec-approvals.json | 36 specific skill patterns, no wildcards | No — config enforcement |
| Input sanitization | 17 injection patterns (encoding obfuscation, role confusion, delimiter injection, social engineering) | No — code enforcement |
| Evaluator-Optimizer | Haiku screening for high-stakes responses (3-node n8n flow) | No — second-pass review |
| Credential / PII scanning | 16 regex patterns + 39 credentials tracked with rotation dates | N/A |
| Approval gates | Infrastructure changes require human thumbs-up or poll vote | No — workflow enforcement |
| Handoff depth + cycle detection | Atomic counter on sessions.handoff_depth; ≥ 5 forces [POLL], ≥ 10 hard-halts, any agent twice in the chain is refused and logged as handoff_cycle_detected | No — transaction-serialised in SQLite |

Additional: €5/session warning, $25/day budget ceiling, confidence gating (< 0.5 = STOP, < 0.7 = escalate), tool call limit of 75 per session. Synth failure handling: when Haiku synth is opted into (SYNTH_BACKEND=haiku), SYNTH_HAIKU_FORCE_FAIL injects 5 failure modes (429 / auth / timeout / network / empty) and the pipeline falls back to local qwen2.5 without breaking the response chain. Every mutating tool call is preceded by an immutable snapshot to session_state_snapshot for mid-session rollback.
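The word-boundary precision on single-word commands can be sketched as a custom-boundary regex: a bare `halt` fires, but substrings inside longer tokens (`halted`, `halt-checker.sh`, `my-passwd-manager`) don't cause false denials. The word list below is an illustrative subset, not the production blocklist:

```python
import re

SINGLE_WORD_BLOCKED = ["passwd", "useradd", "halt", "reboot"]  # illustrative subset

def is_blocked(command: str) -> bool:
    """Word-boundary match for dangerous single-word commands.
    The custom boundaries exclude word chars, '.', '/', and '-', so
    'halt' fires as a standalone token but 'halt-checker.sh' does not."""
    for word in SINGLE_WORD_BLOCKED:
        if re.search(rf"(?<![\w./-]){re.escape(word)}(?![\w./-])", command):
            return True
    return False
```

Plain `\b` boundaries would wrongly fire on `halt-checker.sh` (hyphen is a `\b` boundary), which is exactly the false-positive class the custom character classes avoid.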

Evaluation System

All evals run deterministically (temperature=0, seed=42).

| System | What it measures |
|---|---|
| 161+ Test Scenarios | 58 eval scenarios (3 sets) + 54 adversarial + 23 holistic E2E + 22 mempalace + 22 security-hook + 9 KG-traverse + 17 synth-fallback + 20 qwen-JSON reliability |
| Hard-Retrieval v2 | 50-query weekly eval — judge-graded hit@5 = 0.90, p50 latency 5.7s, p95 13.6s |
| RAGAS Golden Set | 33 queries (15 hard-eval tagged) across multi-hop / temporal / negation / meta / cross-corpus |
| Prompt Scorecard | 19 prompt surfaces graded daily on 6 dimensions |
| LLM-as-a-Judge | Every session scored by local gemma3:12b (default since 2026-04-19, 85% Haiku-agreement on 60-query calibration); flagged sessions re-scored by Opus |
| Agent Trajectory | Per-session step scoring from JSONL transcripts (8 infra / 4 dev steps) |
| Self-Improving Prompts | Low-scoring dimensions auto-generate prompt patches (5 currently active) |
| Eval Flywheel | Monthly: analyze → measure → improve → validate cycle |
| A/B Testing | react_v1 vs react_v2 variants, deterministic by issue hash |
| CI Eval Gate | eval-regression stage between test and review — blocks bad merges |

Tech Stack

| Component | Role |
|---|---|
| n8n | Workflow orchestration — 26 workflows (runner, bridge, poller, session-end, teacher-runner, receivers, etc.) |
| OpenClaw (GPT-5.1) | Tier 1 — fast triage, 17 native skills |
| Claude Code (Opus 4.6) | Tier 2 — deep analysis, 11 sub-agents + master chatops-workflow phase-gate skill, Plan-and-Execute, ReAct reasoning |
| AWX | 41 Ansible playbooks — maintenance, cert sync, K8s drain, updates, deployment |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies |
| YouTrack | Issue tracking, state management, knowledge sink |
| NetBox | CMDB — 310+ devices, 421 IPs, 39 VLANs |
| Prometheus + Grafana | 11 custom dashboards, 64+ panels, 16+ exporters, 4 alert-rule files (all sidecar-provisioned via ConfigMaps) |
| OpenObserve | OTel tracing — OTLP span collection + Prometheus-compatible queries |
| Ollama (RTX 3090 Ti) | Local embeddings (nomic-embed-text) + local judge (gemma3:12b) + local synth (qwen2.5:7b) |
| bge-reranker-v2-m3 | Cross-encoder reranker on nllei01gpu01:11436 — reorders fused RAG results before synth |

Tri-Source Audit — 11/11 A+

Scored against three knowledge sources: Gulli’s Agentic Design Patterns (21/21 patterns), Anthropic Claude Certified Architect Foundations, and 6 industry sources. Result: 11/11 dimensions at A+ (100%). Additionally, an Operational Activation Audit verified all database tables are populated with 150K+ production rows — scoring infrastructure that actually produces data, not just exists.

Holistic Health Check — 96%+

holistic-agentic-health.sh runs 142 automated checks across 37 sections — verifying every feature in this page actually works in production. Not just “does the file exist?” but functional tests (RAG search returns known incidents, Ollama generates 768-dim embeddings, trajectory scoring produces output), cross-site verification (6/6 VTI tunnels READY, 7 BGP peers, ping 42ms), infrastructure health (7/7 K8s nodes, 139/139 Prometheus targets UP, GPU 46°C), and security compliance (both scanners ran within 26h, MITRE Navigator accessible). Results stored in SQLite for trending, exported to Prometheus for Grafana dashboards. Runs in 18 seconds.

Session-Holistic E2E — 23/23

A second suite of 23 end-to-end tests (157s total runtime) covers 18 YouTrack issues with before/after scoring against production data: 23/23 PASS at the last run (2026-04-19). Each test drives a real session through the full pipeline — Tier 1 triage, Tier 2 plan/execute, RAG retrieval, judge scoring — and asserts on output quality, not just exit codes. Output: docs/session-holistic-e2e-*.md.

Status

| Milestone | Status |
|---|---|
| Self-improving prompts (eval → auto-patch → re-eval) | Production |
| Plan-and-Execute + AWX runbooks (41 playbooks) | Production |
| Predictive alerting (LibreNMS trending + daily digest) | Production |
| 5-signal RAG + GraphRAG + cross-encoder rerank + temporal filter (360 entities, 193 rels) | Production |
| Cross-chunk synth (local qwen2.5:7b, +13pp hit@5 on hard eval) + 5-mode failure injection | Production |
| Local-first judge + synth (gemma3:12b + qwen2.5:7b, 2026-04-19 flip) | Production |
| mtime-sort intent detector + list-recent CLI | Production |
| Karpathy-style compiled wiki (45 articles, source_mtime column) | Production |
| OTel tracing (39K spans → OpenObserve) | Production |
| Tool call instrumentation (88K calls, 108 types) | Production |
| 3-tier agent architecture | Production |
| 26 n8n workflows | Production |
| 10 MCP servers (153 tools) | Production |
| 11 sub-agents with diary entries | Production |
| Skill-authoring uplift vs google/agents-cli — scorecard 3.94 → 4.94, 6 gap dimensions closed (Phases A→J, umbrella IFRNLLEI01PRD-712) | Production |
| Master phase-gate skill (chatops-workflow/SKILL.md) force-injected into every Runner session | Production |
| Auto-generated skills index (docs/skills-index.md) drift-gated by test-656 | Production |
| Skill versioning + requires audit — 17/17 SKILL.md with version: 1.x.0 + requires: {bins, env}; 2 new Prom alerts | Production |
| 46 Shortcuts-to-Resist rows inlined across 11 agents (source-cited to memory/feedback_*.md) | Production |
| evidence_missing risk signal — forces [POLL] when CONFIDENCE ≥ 0.8 without a visible tool output block | Production |
| Operator-vocabulary map (config/user-vocabulary.json, 20 entries) — prompt-submit hook logs vocabulary events to event_log | Production |
| 21/21 agentic design patterns | Audited (7 at A+) |
| Tri-source + operational activation audit | 11/11 A+ |
| 7-layer safety + word-boundary precision on hook patterns | Production |
| OpenAI Agents SDK adoption batch (9 / 11 gaps landed, 2026-04-20) | Production |
| Schema versioning on 13 tables + central CURRENT_SCHEMA_VERSION registry | Production |
| 13 typed session events in event_log with Prom exporter | Production |
| Per-turn hooks (session-start.sh / post-tool-use.sh / user-prompt-submit.sh / session-end.sh) + session_turns table | Production |
| 3-behavior rejection taxonomy (allow / reject_content / deny) + event_log audit invariant | Production |
| HandoffInputData envelope (zlib+b64, 0.43% ratio) + handoff_log audit table | Production |
| Handoff transcript compaction (local gemma first, Haiku fallback, circuit-breaker-aware) | Production |
| Agent-as-tool wrapper for 10 sub-agents | Production |
| Handoff depth + cycle detection (atomic IMMEDIATE tx, PRAGMA busy_timeout=10000) | Production |
| Immutable per-turn snapshots + rollback + 7-day retention cron | Production |
| 42 SQLite tables (150K+ rows; +6 from the SDK batch + 3 from the teacher-agent tiers: learning_progress, learning_sessions, teacher_operator_dm) | Production |
| Hardened RAGAS golden set — 33 queries (15 hard-eval), 10× differential | Production |
| Weekly hard-eval cron (50-q) — judge_hit@5 = 0.90 | Production |
| 3 absent-metric alerts guarding the staleness alerts themselves | Production |
| 161+ eval scenarios across 8 test suites | All passing |
| Preference-iterating prompt patcher — N-candidate A/B trials + Welch t-test + auto-promote | Production |
| CLI-session RAG capture — interactive CLI sessions now flow into session_transcripts + tool_call_log + incident_knowledge | Production |
| QA suite — 44 files, 411/0 PASS (2 benign skips), per-suite timeout guard | 99.52% |
| 11-dashboard observability (64+ panels) | Production |
| Weekly chaos cron (self-selecting, preflight gate, Matrix notifications) | Production (CMM L3) |
| A/B prompt testing | Active |
| Holistic health check (142 checks, 37 sections) | 96%+ pass |
| Session-holistic E2E (23 tests covering 18 YouTrack issues) | 100% (23/23) |

Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.