Operating Agentic Systems in Production: Lessons from Building Tendwell

Bogdan Moldovan · June 3, 2026 · AgentOps · 6 min read

Tendwell is a self-hosted, local-first AgentOps tool. It watches production signals (metrics and logs) alongside operational knowledge such as runbooks, reasons over them with a local LLM, and reports on production health in plain language. When something is wrong, it can propose a remediation; a human approves it, and a tamper-evident audit log records it. It's open source under Apache-2.0 and written in Python.

This is a writeup of the lessons that actually shaped the design. Most of them are not about the model. They're about the machinery you have to build so that an agent is safe to point at a real system, and so that the people responsible for that system are willing to trust it. I want to be honest about where the engineering really went, because it wasn't where I expected when I started.

The hard part is not the model, it is the plumbing around it

The temptation with an agentic tool is to spend your time on prompts. In practice, prompting was a small slice of the work. The bulk went into guardrails, normalization, and failure handling: turning messy production signals into a stable shape the rest of the system can reason about, deciding what the agent is allowed to touch, and deciding what happens when any step returns something unexpected.

None of that is glamorous, and none of it shows up in a demo. But it's the part that determines whether the tool is trustworthy or merely impressive. A good agentic system is mostly a careful, boring substrate with a model bolted into one corner of it.

Make the useful path independent of model quality

The single most important design decision was to keep correctness out of the model's hands. Tendwell evaluates SLOs deterministically. Before the LLM is involved at all, a pre-fetch reads the signals and computes status against the defined objectives in plain code. That status is correct whether or not a model is even present.

The model's job comes after: interpretation, correlation across signals, and a plain-language narrative that ties the numbers to the runbooks. Even a weak local model that fumbles every tool call still leaves you with a correct status, because the status was never the model's responsibility. Determinism goes where correctness matters; the model goes where judgment and explanation help. This split is what lets the tool degrade gracefully instead of lying confidently.
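To make the split concrete, here is a minimal sketch of the idea, not Tendwell's actual code. The names `Slo`, `SloStatus`, `evaluate_slo`, and `report` are hypothetical, and the SLO is assumed to be a simple good-over-total ratio. The point is only that the verdict is computed before any model runs, and that the narrative is optional.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


@dataclass(frozen=True)
class SloStatus:
    name: str
    observed: float
    target: float
    ok: bool


def evaluate_slo(slo: Slo, good: int, total: int) -> SloStatus:
    """Compute the status in plain code; no model is consulted."""
    # Assumption: with no traffic there is nothing to violate.
    observed = good / total if total > 0 else 1.0
    return SloStatus(slo.name, observed, slo.target, observed >= slo.target)


def report(status: SloStatus,
           narrate: Optional[Callable[[SloStatus], str]] = None) -> dict:
    """Attach an optional narrative; the verdict never depends on it."""
    narrative = None
    if narrate is not None:
        try:
            narrative = narrate(status)
        except Exception:
            # A failing model costs the explanation, not the status.
            narrative = None
    return {"slo": status.name, "ok": status.ok,
            "observed": status.observed, "narrative": narrative}
```

If `narrate` raises or is missing, `report` still returns the same `ok` value, which is the graceful-degradation property described above.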

Small local models are unreliable at native tool calling

If you commit to running on whatever local model an operator already has, you inherit a wide range of tool-calling behavior, and a lot of it is bad in inconsistent, ugly ways. Models emit malformed JSON, wrap arguments in stray markdown fences, invent fields, drop required ones, or simply narrate a tool call instead of making one. The failure modes don't repeat cleanly, so you can't prompt your way out of them reliably.

The fix was a prompt-based ReAct fallback with strict output parsing, bounded retries, and graceful degradation. When native tool calling misbehaves, the system falls back to a parsed ReAct loop; when parsing fails, it retries within a bound and then degrades rather than crashing. A malformed model response never raises an exception out of the agent — it's a normal, handled outcome. The claim “works with any local LLM” is won or lost entirely here. It isn't a marketing line; it's a parsing-and-retry discipline applied consistently to every model interaction.
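The parse-and-retry step of such a fallback might look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: the `{"tool": ..., "args": ...}` reply shape, the function names, and the default of three attempts are all hypothetical, and `ask_model` stands in for whatever callable returns the model's raw text.

```python
import json
from typing import Callable, Optional

FENCE = "`" * 3  # markdown code fence that models often wrap JSON in


def strip_fences(text: str) -> str:
    """Remove a surrounding markdown fence, if there is one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return text.strip()


def parse_action(raw: str, allowed: set) -> Optional[dict]:
    """Strict parse: exact keys, known tool, dict args. Otherwise None."""
    try:
        data = json.loads(strip_fences(raw))
    except ValueError:  # json.JSONDecodeError is a ValueError
        return None
    if not isinstance(data, dict) or set(data) != {"tool", "args"}:
        return None  # invented or missing fields are rejected
    if not isinstance(data["tool"], str) or data["tool"] not in allowed:
        return None
    if not isinstance(data["args"], dict):
        return None
    return data


def next_action(ask_model: Callable[[str], str], prompt: str,
                allowed: set, max_attempts: int = 3) -> dict:
    """Bounded retries, then a degraded result instead of an exception."""
    for attempt in range(1, max_attempts + 1):
        action = parse_action(ask_model(prompt), allowed)
        if action is not None:
            return {"status": "ok", "action": action, "attempts": attempt}
        prompt += "\nReply with only a JSON object with keys 'tool' and 'args'."
    return {"status": "degraded", "action": None, "attempts": max_attempts}
```

A caller treats `"degraded"` as an ordinary result: it skips the model-driven step and carries on with whatever was computed deterministically.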

Local-first has to be a hard default, not a setting

For security-conscious and regulated environments, the only credible default is no egress at all — and that has to include the LLM and the embeddings, not just the obvious telemetry. A tool that runs locally but quietly ships prompts containing production data to a hosted model has not solved the problem the operator cares about; it has hidden it.

So local-first is not a checkbox you can toggle on. It's the default state, and the burden is on configuration to move away from it. If an operator points a backend off-host, Tendwell warns loudly at startup rather than letting it pass silently. The reasoning is simple: trust is the product. A tool that handles incident data earns its place by being verifiably contained, and a default that quietly phones home destroys that the first time someone notices — and in these environments, someone always notices.
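A startup check along these lines can be sketched in a few lines. This is an assumption-laden illustration: the function names are hypothetical, and "local" is taken to mean a loopback address or the literal name `localhost`, which may be narrower or broader than what Tendwell actually checks.

```python
import ipaddress
import logging
from urllib.parse import urlsplit

log = logging.getLogger("egress_check")


def is_local(url: str) -> bool:
    """True only if the URL's host is provably on this machine."""
    host = urlsplit(url).hostname
    if host is None:
        return False  # unparseable URLs are treated as off-host
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname cannot be proven local, so count it as off-host.
        return False


def check_backends(backends: dict) -> list:
    """Warn for every configured backend that would send data off-host."""
    off_host = [name for name, url in backends.items() if not is_local(url)]
    for name in off_host:
        log.warning("EGRESS: backend %r points off-host (%s); "
                    "data will leave this machine", name, backends[name])
    return off_host
```

The check deliberately fails closed: anything it cannot prove local is reported, so a false warning is possible but a silent off-host backend is not.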

The separation between proposing and executing is the product

It would be easy to let the agent “just do it” — detect the problem, pick the fix, run it. That convenience is exactly what forfeits the trust of a regulated buyer, so Tendwell is built the other way around. The model can only propose. Between any proposal and any change sit two things the model cannot influence: deterministic validation, and a human approval gate. The model has no path to approve and no path to execute.

The open core ships no real executor. With nothing wired in, the agent is structurally unable to mutate anything — not by policy, but by construction. That's deliberate. The separation between proposing and executing is not a safety feature attached to the product; it is the product. An operator can read a proposal, understand why it was made, and decide. The system never decides for them.
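One way to picture that structure is the sketch below. It is hypothetical throughout (`Proposal`, `Gate`, the allowlist, and the state names are mine, not the project's), but it shows the shape: validation and approval sit in code the model never calls, and with no executor supplied there is simply nothing to run.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

ALLOWED_ACTIONS = {"restart_service", "scale_up"}  # hypothetical allowlist


@dataclass(frozen=True)
class Proposal:
    """The only thing the model can produce: inert, immutable data."""
    action: str
    targets: Tuple[str, ...]
    rationale: str


def validate(p: Proposal) -> bool:
    """Deterministic check; no model output can change its rules."""
    return p.action in ALLOWED_ACTIONS and len(p.targets) > 0


class Gate:
    def __init__(self, executor: Optional[Callable[[Proposal], dict]] = None):
        # None by default, mirroring an open core that ships no executor.
        self._executor = executor

    def submit(self, p: Proposal, approved_by: Optional[str]) -> dict:
        if not validate(p):
            return {"state": "rejected"}
        if not approved_by:
            return {"state": "awaiting_approval"}
        if self._executor is None:
            # Approved, but there is no code path that mutates anything.
            return {"state": "approved_not_executed", "approved_by": approved_by}
        return {"state": "executed", "approved_by": approved_by,
                "result": self._executor(p)}
```

In this sketch the approver's identity comes from the caller of `submit`, which is assumed to be the human-facing part of the system rather than the agent loop.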

Partial failure is the normal case, not an edge case

Once you do allow real actions, through an executor an operator wires in themselves, the next lesson arrives quickly: actions over multiple targets are not atomic. You roll a change across three hosts; two take it and one doesn't. “2 of 3 succeeded” is not an error to paper over; it's the normal shape of the result.

So per-target outcomes are first-class. Each target has its own result, surfaced as such, never collapsed into a single success or failure flag. There are no silent retries on non-idempotent operations, because a blind retry of something that already partially happened is how you turn a small problem into a large one. And whatever the outcome, the system exits in a state it can describe. The question is never “did it work?” but “what is true now, target by target?”
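As a rough sketch of what "per-target outcomes are first-class" can mean in code, consider the following. The names `TargetResult`, `apply_to_targets`, and `summarize` are hypothetical, and the action is assumed to signal failure by raising.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class TargetResult:
    target: str
    ok: bool
    error: Optional[str] = None


def apply_to_targets(targets: Iterable[str],
                     action: Callable[[str], None]) -> Dict[str, TargetResult]:
    """Run the action once per target and keep every outcome separately."""
    results: Dict[str, TargetResult] = {}
    for target in targets:
        try:
            action(target)  # called exactly once: no silent retry
            results[target] = TargetResult(target, True)
        except Exception as exc:
            # One failure does not abort the rest or hide the successes.
            results[target] = TargetResult(target, False, str(exc))
    return results


def summarize(results: Dict[str, TargetResult]) -> str:
    """Describe the end state instead of collapsing it to one flag."""
    done = [t for t, r in results.items() if r.ok]
    failed = sorted(t for t, r in results.items() if not r.ok)
    return (f"{len(done)} of {len(results)} succeeded; "
            f"failed: {', '.join(failed) or 'none'}")
```

Note that `ok=False` here only means the action did not confirm success on that target; for a non-idempotent operation the target may still have changed, which is why the sketch records the error and stops rather than trying again.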

Auditability is a feature, not overhead

It's tempting to treat logging as plumbing you add at the end. For a tool meant for regulated environments, the audit log is closer to the center of the value. Tendwell keeps an append-only, hash-chained log that cannot be disabled. The chaining makes it tamper-evident: you can tell whether the record was altered after the fact, which is exactly the property a reviewer needs before they'll trust anything the tool reports.

It also answers the question an incident review actually asks, which is not “what did the dashboard say” but “what was attempted versus what actually completed.” A proposal, its approval, the per-target results — all of it is on the chain, in order. That record is what lets a regulated user adopt the tool at all. It isn't overhead bolted onto the feature set; it's the thing that makes the rest defensible.
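For readers who haven't built one, a hash-chained log can be sketched in a few lines. This is a generic illustration of the technique, assuming SHA-256 and an in-memory list; the entry layout and function names are hypothetical and say nothing about Tendwell's on-disk format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> None:
    """Add an entry whose hash covers the event and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash,
                  "hash": _digest(prev_hash, event)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Appending a proposal, its approval, and the per-target results in order, then editing any earlier entry, makes `verify` return `False`. A chain like this detects edits only relative to a trusted copy of the latest hash; someone who can rewrite the whole file can also recompute every link, so that head hash has to be kept somewhere they cannot reach.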

Designing for the reviewer, the auditor, and the failure

The throughline across all of this is that building for high-stakes environments means designing for the reviewer, the auditor, and the failure — not just the happy path. The happy path is the easy ten percent. The other ninety is what determines whether anyone with real exposure will let your agent near their production system.

If you want to look at how this is put together, the code is open. Try the one-command demo, read the source, and tell me where it's wrong.

github.com/bmoldo/tendwell · reops.tech/tendwell