FoxCommand - Decision Records for Consequential AI Workflows

You've probably seen the term "verification debt" used about AI-generated code. It's the gap between how fast agents write it and how fast a human can check it.

Fine. But the version of it that's actually going to hurt companies is less about the code, and more about the audit trail of decisions.

What it looks like from inside

I spent a year working on prior authorization at a large payer. This is what it looks like from inside.

A claim comes in. The system runs. A decision goes downstream.

Most of the time nobody asks why, because the volume is too high, the decision looked defensible, and everyone has a queue.

I asked a medical director once, casually over Slack, what criteria the system was actually checking on a particular claim type.

She said, "Well, it's in the guidelines, but you have to understand the context."

I asked which context.

"You'd have to know the current policy version."

I asked where that lived. There was a pause, the kind that in an office would be accompanied by someone glancing at the ceiling.

"I'd have to check with the vendor."

The system was live, processing thousands of decisions a day. The person closest to it didn't know where the governing criteria were stored.

She was not, by any measure, a careless person.

That's where the debt accrues. Every one of those decisions ships without a record of what actually governed it: which policy version was active that day, what the model had access to, what threshold it was checking against, why it routed to deny instead of escalate.

Each one is a small, clean loan. The system feels fine because the system is fine, operationally speaking. Decisions are flowing.

The problem doesn't announce itself. It just waits.

A policy gets updated. A model gets quietly swapped out for a better one. Write-offs tick up three months later and someone in finance starts asking reasonable questions.

Then a patient files an appeal. Somewhere in a different building a regulator sends a letter requesting documentation of the decision your system made six weeks ago.

And you genuinely cannot produce it. The world that generated that decision doesn't exist anymore, and at the time it felt like there was no particular reason to write it down.

This isn't specific to healthcare

Every team building serious agentic workflows ends up here eventually.

And I want to be clear about something, because I think people outside this work imagine it as scrappier than it is. These teams aren't flying blind.

They're using LangSmith or Arize or Fiddler, or they built something internal that does the same thing. They can see the model version. They can see the system prompt and the step prompts. They have policy version numbers. They have traces of every span and token.

They can pull up an individual case and walk you through it: step one criteria not met, step two criteria met, step three escalated to human review, here's the model's reasoning at the outcome step.

They have a lot.

The drop is visible. The affected decisions are not.

Something moved

production recall: dropped 45% → 18%
escalation volume: increased 3×
model version: unchanged
prompt version: unchanged
policy version: unchanged

The dashboard shows the symptom. It does not show which decisions were affected.

Without the decision record, you don't have reconstruction. You have an investigation.

Here's the kind of thing that still happens.

The model hits 90% precision and 45% recall in staging. You're happy. You ship to production.

Two weeks in, somebody pulls the dashboard and precision has dropped to 65% and recall to 18%. Same model. Same prompts. Same policy version. Branching logic is identical.

Nothing changed.

So you go investigate. You pull individual cases. The traces are clean. You can see exactly what the model saw and how it routed. You re-run a handful of them and the reruns mostly land in the same place.

You compare staging cases to production cases side by side and they look similar. Same documentation structure, same fields, same general shape.

Hours go by.

Then you notice something.

The production cases have equipment codes that don't appear anywhere in your staging corpus, because the historical data you trained and tested on came from a period before that particular type of durable medical equipment was being submitted through this workflow at volume.

The model isn't broken. It's just being asked questions you never tested it on.
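The check that would have shortened those hours is small. Here is a minimal sketch, assuming you can pull the equipment codes out of both corpora as flat lists. The function name and the `EQ-` codes are made up for illustration and are not real billing codes.

```python
from collections import Counter


def unseen_code_counts(staging_codes, production_codes):
    """Count production occurrences of codes that never appear in staging."""
    known = set(staging_codes)
    return Counter(code for code in production_codes if code not in known)


# Hypothetical codes, for illustration only.
staging = ["EQ-100", "EQ-200", "EQ-100"]
production = ["EQ-100", "EQ-900", "EQ-900", "EQ-700"]

gaps = unseen_code_counts(staging, production)
for code, count in gaps.most_common():
    print(code, count)
# EQ-900 2
# EQ-700 1
```

This only tells you that a coverage gap exists. It says nothing yet about which decisions the gap touched.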

That's the discovery. The reconstruction is the next problem and it's the harder one.

A rerun is a new decision wearing the old one's clothes.

Day 0 — decision made

model: evaluator-A
policy: auth_policy_v1.3
retrieved_docs: D1, D2, D3
threshold: $500
outcome: ESCALATE

Decision made under these exact conditions.

Replay the governed decision record, not the model's hidden reasoning.

You now need to go back through the last two weeks of production decisions and figure out which ones were affected.

The trace for each case says "criteria not met, escalated." It does not say "criteria not met because the model didn't recognize the equipment code as relevant evidence" versus "criteria not met because the documentation was actually insufficient."

Both look identical in the trace.

To untangle them, you'd need to pull each case, line up the case-shape against the staging distribution, look at what the model actually weighted, and reconstruct what should have happened under the staging distribution versus what did happen under the production distribution.

For a thousand cases. With the policy version and the retrieved context and the evaluator output frozen at the moment each decision was made.

The tools you have don't preserve that. They preserve traces. Traces are not decision records.
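For contrast, here is roughly what the thousand-case question looks like if the input shape had been frozen with each decision. This is a sketch under assumptions: the record fields (`decision_id`, `equipment_codes`, `outcome`) and the `possibly_affected` helper are hypothetical, and the data is invented.

```python
# Hypothetical frozen records: field names and values are illustrative only.
records = [
    {"decision_id": "d-001", "equipment_codes": ["EQ-100"], "outcome": "ESCALATE"},
    {"decision_id": "d-002", "equipment_codes": ["EQ-900"], "outcome": "ESCALATE"},
    {"decision_id": "d-003", "equipment_codes": ["EQ-100", "EQ-900"], "outcome": "APPROVE"},
]
staging_codes = {"EQ-100", "EQ-200"}


def possibly_affected(records, staging_codes, outcome="ESCALATE"):
    """Return IDs of decisions with the given outcome whose frozen input
    included at least one code the staging corpus never covered."""
    known = set(staging_codes)
    return [
        rec["decision_id"]
        for rec in records
        if rec["outcome"] == outcome
        and not set(rec["equipment_codes"]) <= known
    ]


print(possibly_affected(records, staging_codes))  # ['d-002']
```

A filter like this narrows the list to decisions made on inputs you never tested. It does not prove the untested code caused the outcome, but it turns a manual case-by-case pull into a query.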

The traces will tell you, perfectly faithfully, that the model evaluated the case and reached a conclusion. They will not let you reconstruct, at scale and with confidence, why this conclusion under this distribution — which is the only reconstruction that actually helps you tell the VP what happened and what to do about it.

So you write a doc that says "we believe the drop is driven by a distribution shift in equipment-coded cases, recommend retraining the evaluator and adding equipment-code coverage to the staging corpus."

That's a real conclusion and it's probably right. It is also an educated guess assembled from partial evidence that you would not want to defend line-by-line in front of a regulator, and you know it.

The original is gone. The logs are there.

Every team running these workflows says the same thing: we have observability. We can re-run it.

This is said with confidence, and up until recently it was technically true, which is where the confusion comes from.

When software was deterministic, re-running it worked. The code was the decision, and the code was still there.

But once the thing making decisions is a probabilistic model whose behavior shifts with every input, every temperature setting, every document it retrieves, re-running it doesn't reconstruct the original decision.

It generates a new decision that resembles the old one. Close enough to look right. Different enough to matter in court.

The original is gone. The logs are there. These are not the same thing.

What's missing is a decision record: a preserved artifact of what governed the decision at the moment it was made.

A trace can show the path. It can't always defend the decision.

Execution path

criteria_check → threshold_eval → escalate
model: gpt-4.1
prompt: auth-review-v3
tokens: 1,842
latency: 6.2s
output: ESCALATE

This shows execution. It does not preserve what governed the decision.

Traces show execution activity. Decision records preserve what governed the decision.

Your logs can tell you what ran. They cannot tell you why it decided what it decided. Not because logs are useless. Because execution history and decision authority are different artifacts.

You can pull every trace, every span, every token, every intermediate output from the run, and still be looking at a perfect record of a process whose reasoning you cannot reconstruct.

Observability tells you what happened. It doesn't tell you why. Audit tells you why. Most teams think they have both. They have one.

A very expensive shrug

The clearest version of what this looks like when it comes due is Stephen Hemsley, CEO of UnitedHealth Group, sitting in front of a House Energy and Commerce Health Subcommittee hearing on January 22, 2026.

Congresswoman Robin Kelly, a member of the subcommittee, asked him what should have been a basic question: who, or what, actually determines your denials?

In her own statement afterward, she said she was "deeply disappointed" by his inability to respond, and that Hemsley "could not assert whether AI is still improperly denying healthcare insurance claims."

I don't know whether he was stonewalling.

The more troubling possibility is that the answer didn't appear to exist in a form he could retrieve and defend in that moment. He may have been giving the only answer the organization could produce in that room: a very expensive shrug.

Their system is called nH Predict. It had been running post-acute care authorization decisions at scale for years.

The company's own CEO could not, in that hearing, confirm whether it was still wrongly denying claims.

The answer did not appear to exist in a form anyone could access, because nobody had thought to preserve it in a form designed for that kind of question.

That's verification debt coming due. No bug, no CVE, no outage. Just a question you cannot answer in a room where you are required to answer questions.

Six weeks later, a federal court in Minnesota made the downstream consequences explicit. In Estate of Gene B. Lokken v. UnitedHealth Group, the court granted broad discovery into how nH Predict was developed, deployed, and overseen — with some document categories reaching back to 2017.

The operating implication is not subtle: when an AI system is making consequential decisions, the development and oversight record of that system is discoverable. And if you can't show what governed a specific decision, the AI output is the decision.

Logs are not an audit trail. They're discovery exhibits.

What's actually missing

The shape of the fix isn't complicated, even if the engineering is.

You have to capture what governed each decision at the moment it's made, not as a log entry but as a full artifact: the model version, the prompt, the policy context that was active, the documents it retrieved, the threshold it was checking against, the relevant shape of the input itself.

Preserved. Reconstructable at the decision-record layer without running the workflow through the model again, because running it through a model again doesn't get you back to the original.
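To make the shape concrete, here is a minimal sketch of such an artifact as an immutable Python dataclass with a content digest. This is an illustration, not FoxCommand's actual schema. The class name, the field names, the timestamp, and the `auth_policy_v1.4` value are assumptions; the other example values are borrowed from the cards earlier in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical record of what governed one decision when it was made."""
    decision_id: str
    decided_at: str                  # ISO 8601 timestamp (illustrative value below)
    model: str
    prompt_version: str
    policy: str
    retrieved_docs: Tuple[str, ...]
    threshold: float
    input_shape: Dict[str, str]      # the relevant features of the input
    outcome: str

    def digest(self) -> str:
        """SHA-256 over a canonical JSON form, so later edits are detectable."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = DecisionRecord(
    decision_id="case-001",
    decided_at="2025-01-01T00:00:00+00:00",
    model="evaluator-A",
    prompt_version="auth-review-v3",
    policy="auth_policy_v1.3",
    retrieved_docs=("D1", "D2", "D3"),
    threshold=500.0,
    input_shape={"equipment_code": "EQ-900"},
    outcome="ESCALATE",
)

# A record made under a different policy version hashes differently.
amended = replace(record, policy="auth_policy_v1.4")
print(record.digest() != amended.digest())  # True
```

`frozen=True` blocks attribute reassignment, and the digest gives you something to check against when someone asks whether the record was altered after the fact.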

A record of the decision itself. Not more logs. Not better observability dashboards. The decision, in a form that still means something six months later when someone needs to understand it.
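Reading a decision back then means fetching the stored artifact, not calling the model. A sketch, assuming a write-once store: `DecisionLog` is a hypothetical in-memory stand-in for whatever durable storage you would really use, and the record contents are illustrative.

```python
import hashlib
import json


class DecisionLog:
    """Hypothetical write-once store keyed by decision ID (in-memory only)."""

    def __init__(self):
        self._rows = {}

    def append(self, decision_id, record):
        if decision_id in self._rows:
            raise ValueError(f"{decision_id} already recorded; records are write-once")
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._rows[decision_id] = (payload, digest)

    def replay(self, decision_id):
        """Return the stored record after an integrity check. No model call."""
        payload, digest = self._rows[decision_id]
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
            raise ValueError(f"{decision_id} failed integrity check")
        return json.loads(payload)


log = DecisionLog()
log.append("case-001", {
    "model": "evaluator-A",
    "policy": "auth_policy_v1.3",
    "retrieved_docs": ["D1", "D2", "D3"],
    "threshold": 500,
    "outcome": "ESCALATE",
})

replayed = log.replay("case-001")
print(replayed["policy"], replayed["outcome"])  # auth_policy_v1.3 ESCALATE
```

The write-once check is what matters here: a second write to the same ID is rejected, so what comes back is what was recorded at decision time.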

We're building this. The problem is real either way.

With tech debt, you usually get to pick the moment. You schedule the refactor. You plan the migration.

With verification debt, someone else picks it: a regulator, an auditor, a patient's lawyer, a congressional subcommittee with four more CEOs sitting behind you in the queue waiting their turn.

And when that moment arrives, logs won't be enough. You'll need the record of what governed the decision.

The only question is whether you wrote it down when the decision was made.