ARGUS: My Approach to Putting an AI Agent in Front of a Metrics Backend
This is a writeup of a prototype I built to explore how a local LLM agent could sit between an engineer and a large Prometheus/Thanos observability stack. It runs on a single MacBook M3 Pro, it's not production, and most of the interesting work is in the plumbing around the model rather than the model itself. I'm calling it ARGUS.
The goal was to find out, hands-on, what it actually takes to make an LLM useful for production monitoring — where the failure modes are, which problems the AI genuinely helps with, and which ones are better solved by boring engineering before the model gets involved.
The problem I wanted to poke at
A realistic observability stack can hold tens of thousands of distinct metrics — transactions per minute per tenant per region, failed requests, session counts, auth latencies, queue depths, saturation levels. An on-call engineer at 3am has three unpleasant options in front of that much surface area: stare at dashboards hoping something jumps out, write ad-hoc PromQL and hope the label filters are right, or trust static alert thresholds that were tuned for weekday daytime traffic and fire false positives every Sunday morning.
The question that actually matters during an incident isn't "is this metric high?" It's "is this metric unusual for this tenant, in this region, at this exact time of the week?" That question is answerable from the data, but not by a human in the middle of an incident.
I wanted to see whether a conversational agent could close that gap.
What the prototype does
ARGUS is a chat interface wired up to a set of tools that talk to Thanos. You type "is TenantAlpha degraded right now?" and it returns a comparison against the same weekday, same hour, one week ago — with trend direction, per-region breakdown, and any registered deployment window that might explain the delta.
The tool set covers seven areas:
- Platform health overview — active users, sessions, transaction rates, checkout failures, auth performance, queue depth, and saturation levels in a single call
- Tenant comparison — current vs 7-day baseline, per tenant, with trend direction and per-region breakdown
- Degradation detection — a batch-optimised scan across roughly 2,400 active transaction metrics, returning in ~1.5 seconds
- Alert management — baseline-aware rules with webhook notifications, embedded runbooks, deployment correlation
- Multi-metric monitoring — transaction rate, failed checkouts, sessions, saturation, auth latency, queue depth
- Anomaly analysis — z-score scan across a 21-day historical window
- Knowledge base — RAG-powered search over tenant metadata, traffic patterns, incident notes
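To make one of those concrete: the anomaly tool's z-score scan reduces to a few lines once the history is in hand. A minimal sketch, assuming each metric's 21-day history arrives as a list of samples (the `series` shape and function name here are illustrative, not the prototype's real interface):

```python
import statistics

def zscore_anomalies(series, threshold=3.0):
    """Flag metrics whose current value deviates strongly from their
    21-day history. `series` maps a metric key to (history, current)."""
    anomalies = []
    for key, (history, current) in series.items():
        if len(history) < 2:
            continue  # not enough history to estimate a stddev
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            continue  # flat series: a z-score would divide by zero
        z = (current - mean) / stdev
        if abs(z) >= threshold:
            anomalies.append((key, round(z, 2)))
    # most extreme deviations first
    return sorted(anomalies, key=lambda kv: -abs(kv[1]))
```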
None of that is shipping anywhere. It's a prototype to learn from.
Architecture
I deliberately kept the design boring. Five components, clear seams between them, nothing clever where clever wasn't needed:
```
User Browser
     |
     v
Flask Web App (session auth, per-user workspaces, SSE streaming)
     |
     v
AI Orchestrator (ReAct reasoning loop)
     |                      |
     v                      v
Ollama + qwen2.5:7b    MCP Tool Layer
                            |
          +-----------+-----+------+-----------+
          |           |            |           |
        Thanos   Alert Engine   ChromaDB   Slack Webhook
        (live)   (background     (RAG)     (notifications)
                   poller)
```
```
The orchestrator runs a ReAct loop: the LLM reasons about the query, picks a tool, gets a structured result back, and either calls another tool or synthesises an answer. Capped at five tool rounds per query, temperature 0.1, typical response time 5–12 seconds.
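The loop itself is short. A sketch of its shape, where `llm` and `tools` stand in for the Ollama client and the tool registry (both names and the reply schema are illustrative, not the real interfaces):

```python
MAX_ROUNDS = 5

def react_loop(llm, tools, query):
    """Reason -> act -> observe, hard-capped at five tool rounds."""
    messages = [{"role": "user", "content": query}]
    for _ in range(MAX_ROUNDS):
        reply = llm.chat(messages, temperature=0.1)
        call = reply.get("tool_call")   # None when the model answers directly
        if call is None:
            return reply["content"]     # final synthesized answer
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "tool", "name": call["name"],
                         "content": str(result)})
    return "Stopped after 5 tool rounds without a final answer."
```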
Stack choices
LLM: qwen2.5:7b via Ollama, running locally. The M3 Pro has 18GB of unified memory, which rules out anything bigger than the 7–8B class. Within that class, qwen2.5:7b had noticeably better tool-calling reliability than Llama 3.1 8B or Mistral 7B in my testing — and tool calling is the entire point. Running inference locally also meant I didn't have to think about data egress for a prototype touching metric data.
RAG: ChromaDB + all-MiniLM-L6-v2. 80MB embeddings model, local persistence, good enough for a small corpus of tenant metadata, traffic patterns, and incident notes.
Web layer: Flask with Server-Sent Events for streaming. Per-user session auth, isolated workspaces under workspaces/{username}/, separate agent memory per user, shared alert rules. The UI streams thinking, tool_call, tool_result, and response events live so you see what the agent is doing instead of watching a spinner.
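The event framing is simple enough to show. A minimal sketch of the SSE side (the event payloads are made up; in the app, Flask serves this by wrapping the generator in `Response(gen, mimetype="text/event-stream")`):

```python
import json

def sse_frame(event_type, payload):
    """One Server-Sent Events frame: named event, JSON data line, blank separator."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

def stream_agent(events):
    """Turn the agent's thinking / tool_call / tool_result / response
    events into an SSE stream the browser's EventSource can consume."""
    for ev in events:
        yield sse_frame(ev["type"], ev)
```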
Metrics backend: Thanos over HTTPS. Anonymous read access, no service account required.
The batching lesson (or: how I went from ~4,800 queries to 2)
My first version of the degradation scanner was correct and unusable. For every one of the ~2,400 active transaction metrics, it ran two Thanos queries — one for current value, one for the baseline. Close to 4,800 serial HTTP round-trips. A full scan took several minutes, which made the whole thing a party trick rather than a tool.
The fix was to stop thinking about it metric-by-metric. Thanos supports regex matching on label selectors, so a single PromQL query can return every transaction metric at once. Pull current values in one query. Pull last-week-same-time values in a second query. Join the two result sets in Python — dictionary keyed by tenant + region, compute the delta, sort by severity.
Two queries. ~1.5 seconds. Works the same whether there are 1,000 alert rules or 10,000, because the query count is constant in rule count.
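The join itself is plain dictionary work. A sketch under the assumption that each Thanos result row carries `tenant` and `region` labels plus a value (the row shape, helper name, and 20% default are mine, for illustration):

```python
def join_and_rank(current_rows, baseline_rows, min_drop_pct=20.0):
    """Join two batched Thanos result sets on (tenant, region) and rank
    degradations by severity versus the same-time-last-week baseline."""
    baseline = {(r["tenant"], r["region"]): r["value"] for r in baseline_rows}
    degraded = []
    for row in current_rows:
        key = (row["tenant"], row["region"])
        base = baseline.get(key)
        if not base:  # no baseline sample, or baseline was zero
            continue
        drop_pct = (base - row["value"]) / base * 100.0
        if drop_pct >= min_drop_pct:
            degraded.append({"tenant": key[0], "region": key[1],
                             "current": row["value"], "baseline": base,
                             "drop_pct": round(drop_pct, 1)})
    return sorted(degraded, key=lambda d: -d["drop_pct"])
```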
I applied the same pattern everywhere there was a loop over PromQL calls — tenant comparison, saturation evaluation, the platform health overview. The saturation levels query in particular went from ~900 HTTP calls to 3 batch queries.
| Operation | Queries | Time |
|---|---|---|
| Thanos fetch (~2,400 current metrics) | 1 | 0.72s |
| Thanos fetch (~2,400 baseline metrics) | 1 | 0.68s |
| Rule evaluation (1,000 rules) | 0 | 6.1ms |
| Full degradation scan | 2 | ~1.5s |
| Tenant comparison | 2 | ~1.5s |
| Platform health overview | ~8 | ~3.2s |
| Saturation levels with context | 3 | ~2.3s |
The poller sits under 0.5% CPU at a 5-minute check interval.
This was the first moment where the lesson clicked: the AI was the easy part. Making the AI's tool calls cheap enough to be worth making was where the actual engineering lived.
Baseline-aware alerting
Static thresholds — "alert if transactions drop below X" — produce either alert fatigue or missed incidents, depending on where X is set. "Low" is meaningless without context, because Sunday 8am has a fraction of the traffic of a weekday evening.
ARGUS compares the current value against the same weekday, same hour, one week ago. The diurnal swing is large enough that any single global threshold is wrong for most hours of the week. A same-time-last-week baseline captures that distinction with no hand-tuned thresholds per tenant.
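In PromQL terms the baseline side is just the same query with `offset 7d`, which lands on the same weekday and hour one week earlier. A sketch of the pair plus the comparison (metric and label names are illustrative, not the real schema; the 25% default is arbitrary):

```python
# Both sides of the comparison, batched across all tenants in one call each.
CURRENT_QUERY = 'sum by (tenant, region) (rate(transactions_total[5m]))'
BASELINE_QUERY = 'sum by (tenant, region) (rate(transactions_total[5m] offset 7d))'

def is_degraded(current, baseline, drop_pct=25.0):
    """Baseline-aware check: alert on a drop relative to last week,
    never on an absolute threshold."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline * 100.0 >= drop_pct
```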
The alert lifecycle:
- Rule is created via chat, including the webhook URL and optionally a runbook
- The agent validates the rule — bare channel names are rejected, and a runbook is prompted for if one is missing
- `check_interval` is derived from the metric resolution (1-minute data gets 1-minute checks, 5-minute data gets 5-minute checks)
- A background poller ticks every 30 seconds and evaluates only rules that are due
- Current and baseline values fetched in two batched queries (~1.5s)
- Rule evaluation happens in Python against the batched result
- Deployment windows are consulted — if the alert falls inside a registered maintenance window, it's tagged accordingly instead of paging
- If a threshold is breached, an alert fires with the runbook embedded in the Slack notification
- If the condition recovers, the alert auto-clears
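The "only rules that are due" step is the part that keeps the poller cheap. A sketch of that tick, assuming rules are plain dicts with a `last_checked` timestamp (the field names and `derive_check_interval` helper are mine):

```python
import time

def derive_check_interval(metric_resolution_s):
    """Match check cadence to metric resolution: polling 1-minute data
    more often than once a minute just re-reads the same sample."""
    return max(metric_resolution_s, 60)

def due_rules(rules, now=None):
    """Called on every 30-second poller tick; returns only rules whose
    check_interval has elapsed since their last evaluation."""
    now = now if now is not None else time.time()
    due = [r for r in rules if now - r["last_checked"] >= r["check_interval"]]
    for r in due:
        r["last_checked"] = now
    return due
```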
The runbook-in-the-notification detail was small but felt disproportionately useful. On-call engineers don't want a wiki link at 3am — they want step one, step two, step three next to the thing that's broken.
What a small local LLM is actually like to work with
The 7B model is capable enough to be useful and unreliable enough to be educational. A few things I learned:
Aggressive response cleaning beats prompt engineering. qwen2.5:7b occasionally emits malformed tool calls — extra whitespace, trailing commas, stray backticks around JSON, half-markdown fences around the payload. Rather than try to prompt-engineer this away, I wrote a layer that extracts the JSON block, strips the noise, and tolerates the model's bad days. Cheaper and more robust than trying to talk the model out of its habits.
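The cleaning layer is unglamorous string surgery. A minimal sketch of the idea (the exact repairs the prototype applies may differ; this handles the fence/backtick/trailing-comma cases named above):

```python
import json
import re

def extract_tool_call(raw):
    """Salvage a tool call from a small model's messy output: strip
    markdown fences and stray backticks, take the outermost JSON object,
    and drop trailing commas before parsing."""
    text = re.sub(r"```(?:json)?", "", raw).strip().strip("`")
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object anywhere in the reply
    candidate = text[start:end + 1]
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)  # trailing commas
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```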
Structured tool descriptions matter more than elegant system prompts. The system prompt is short. The tool registry is verbose. Every tool has a one-line purpose, explicit parameter types, default values, and a note about how long it takes to run. The model routes correctly almost every time when the tool menu is unambiguous.
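For flavour, here is roughly what one registry entry and its rendering look like. The schema is this prototype's own invention, not a standard, and the entry shown is illustrative:

```python
# One registry entry: purpose, typed parameters with defaults, latency hint.
TOOLS = {
    "compare_tenant": {
        "purpose": "Compare a tenant's current traffic to its 7-day baseline.",
        "parameters": {
            "tenant": {"type": "string", "required": True},
            "region": {"type": "string", "required": False, "default": "all"},
        },
        "latency_hint": "~1.5s (two batched Thanos queries)",
    },
}

def render_tool_menu(tools):
    """Render the registry into the flat tool menu embedded in the prompt."""
    lines = []
    for name, spec in tools.items():
        params = ", ".join(
            f"{p}: {s['type']}" + ("" if s["required"] else f" = {s['default']}")
            for p, s in spec["parameters"].items())
        lines.append(f"- {name}({params}) -- {spec['purpose']} [{spec['latency_hint']}]")
    return "\n".join(lines)
```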
RAG needs keyword boosting when entity names overlap. ChromaDB semantic similarity will happily confuse tenants with similar names. I added a keyword-boost layer on top of the vector search for tenant and region terms, which recovered precision without losing recall on the fuzzy queries.
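The boost layer is a small re-ranking pass over the vector hits. A sketch, assuming each hit carries an entity tag from its metadata (the triple shape, names, and 0.15 boost are mine):

```python
def keyword_boost(results, query, known_entities, boost=0.15):
    """Re-rank vector-search hits: when the query names a known tenant or
    region exactly, push documents tagged with that entity up the list.
    `results` are (doc, entity_tag, similarity) triples."""
    query_terms = set(query.lower().split())
    known = {e.lower() for e in known_entities}
    rescored = []
    for doc, entity, score in results:
        if entity.lower() in query_terms and entity.lower() in known:
            score += boost  # exact entity mention beats fuzzy similarity
        rescored.append((doc, entity, score))
    return sorted(rescored, key=lambda r: -r[2])
```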
Cap the tool rounds. Five rounds is enough for any legitimate query and hard-stops the model when it gets confused and tries to "check one more thing" in a loop. Combined with a low temperature, this kept behaviour predictable.
Multi-user as an afterthought
What started as a single-user demo turned into a multi-user prototype once a few other engineers wanted to try it. The retrofit was mercifully small:
- Flask session-based auth with SHA-256 password hashing
- Per-user workspace directories: `workspaces/{username}/chat_history/`
- A separate `Agent` instance per logged-in user, so conversation history doesn't bleed across sessions
- Alert rules stay shared (one source of truth)
- A username badge in the header, clickable to log out
The agent-state isolation mattered more than I expected. Two people investigating different incidents shouldn't see each other's reasoning traces polluting the model's context.
Compliance and blast radius, even for a prototype
Because the prototype touches observability data, the boundaries were scoped hard from day one:
- The LLM has no internet access. Local Ollama, no cloud API, no outbound inference calls.
- The agent has no write access to anything production-adjacent. Read-only to Thanos. Write-only to Slack via webhook. Local filesystem for RAG, alert rules, chat history. No ability to modify monitoring data, execute arbitrary commands, or reach other network services.
- No personal data is processed. Only aggregated counts — transactions, sessions, failures — never anything that identifies an end user.
- Air-gappable after setup. The only internet-dependent steps are the initial Ollama install, the model pull, pip dependencies, and the embeddings model download. After that, it runs on an isolated network.
This is worth thinking about even at the prototype stage: the habits you form on toy projects are the habits that survive into real ones.
What I'd change before anyone relied on it
A prototype-honest list of rough edges:
- Alert rules live in local JSON files. Anything beyond a single-box deployment needs a proper database and a migration path.
- Thanos retention caps the anomaly window at 21 days here. Longer-horizon baseline analysis needs its own storage tier.
- Slack is the only notification sink. PagerDuty and email are obvious additions.
- The model occasionally hallucinates tool names when a query is very far outside the patterns it's seen. A constrained-decoding layer would fix that cleanly.
- There's no audit trail beyond chat history. Real production use would need structured logging of every tool call and every alert decision.
What I actually took away from building it
The thing that surprised me most is how much of the value came from boring engineering rather than the AI. The ReAct loop is twenty lines. The tool definitions are the interesting part. The batching rewrite saved more hypothetical incident-minutes than any cleverness on the LLM side could have. The local model was the right call not because it's smarter but because it erased an entire compliance conversation before it started.
My take, from this prototype: AI-in-the-middle is powerful when the tools around it are sharp. Do the query batching, build the baseline comparison, make the alert lifecycle honest — then put a conversational interface on top. The model doesn't need to be brilliant if the tools it's calling are good.
If I were to pick one thing for anyone else exploring this space: spend the first week on the tools, not the agent. The agent will be fine. The tools are what decide whether the whole thing is actually useful.