Why We Don't Train AI Agents Anymore
92% accuracy without training. From August 2026, the EU AI Act requires explainable individual decisions. Trained models cannot deliver that.
Training Is the New Fax
In 2019 we had to train AI models. They were too limited for anything else. GPT-2 could not write a coherent paragraph. BERT needed thousands of labelled examples and a GPU cluster running for days for every task.
That was six years ago. Six years in which language model capabilities improved by orders of magnitude. Yet the industry still acts as if “training” is the natural first step.
At a Glance - Why Training Is the Wrong Architecture
- An LLM achieves 92% correct decisions in invoice review - without a single training example. Experienced lawyers reach 72%.[1]
- From August 2026, the EU AI Act (Art. 13, 14, 86) requires explainable individual decisions for high-risk systems. Trained models cannot deliver that.[10]
- The alternative: rulebook (versioned), context (per decision), Decision Layer (human / rulebook / AI per Micro-Decision).
- Configured agents are model-agnostic: switch foundation models without changing a single rule. No lock-in, no retraining.
- Over 40% of agentic AI projects will fail by 2027 - mostly due to missing governance, not missing model performance.[9]
If someone says “we train our AI agents” in 2026, it is like saying “we fax our orders” in 2010. It works. But it shows a fundamental misunderstanding of the architecture.
From Training to Configuration
| Period | Stage | Models | Training duration | Cost | Prerequisite |
|---|---|---|---|---|---|
| 2018 - 2020 | Training is required | BERT, GPT-2 (110M - 1.5B parameters) | Weeks | $10,000 - $100,000 | GPU cluster |
| 2021 - 2023 | Training becomes optional | GPT-3/3.5 (175B parameters) | Days | $1,000 - $10,000 | GPU required |
| 2024 | Training or prompting? | GPT-4o, Claude 3.5 (multimodal) | Hours | $10 - $100 | API call |
| 2025 - 2026 | Configuration is enough | GPT-5, Claude Opus 4 (reasoning) | Minutes | $10 - $100 | API call |
Kumar Gauraw puts it clearly: “Most reach for fine-tuning too early.”[5] Not because fine-tuning is bad. Because in 2026, it is no longer necessary for most enterprise tasks.
What a Trained Model Cannot Do: Explain an Individual Decision
A candidate is rejected by your recruiting agent. They ask: Why?
Two answers. Two architectures.
Trained model: “Our model learned from 50,000 historical hiring decisions that your profile has a 34% success probability.”
Configured agent: “Your qualification in mechanical engineering does not meet requirement 3 (electrical engineering or equivalent). Rule: job profile v2026-03. Contestable: Yes. Process: department reviews whether mechanical engineering qualifies as ‘equivalent’.”
The first answer is illegal from August 2026.
EU AI Act, Art. 13 (transparency), Art. 14 (human oversight), Art. 86 (right to explanation).[10] For high-risk systems - and recruiting is high-risk, Annex III(4) - every individual decision must be traceable, explainable and contestable. (US: No federal equivalent exists, but EEOC guidance increasingly demands similar explainability for automated hiring decisions.)
Not the model. The individual decision. For this candidate. With this justification.
A trained model cannot do that. It has no decision record. It has weights. And weights explain nothing to an employee representative body.
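What such a decision record could look like in practice - a minimal sketch; the field names are illustrative, not Gosign's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One record per Micro-Decision: enough to explain, audit and contest it."""
    decision: str        # the outcome communicated to the affected person
    rule_id: str         # which rule was applied
    rule_version: str    # which version was in effect at decision time
    context: dict        # the facts this one decision was based on
    contestable: bool = True
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The recruiting example from above, expressed as a record:
record = DecisionRecord(
    decision="rejected: requirement 3 not met",
    rule_id="job-profile",
    rule_version="v2026-03",
    context={"qualification": "mechanical engineering",
             "requirement_3": "electrical engineering or equivalent"},
)
# The record names a rule, a version and the context - something weights never can.
```

The point is not the data structure; it is that every decision leaves an artifact a reviewer can point at.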
The Compliance Test: Trained vs. Configured
| Question | Architecture A: Trained Model | Architecture B: Configured Agent |
|---|---|---|
| "Why this decision?" | "Model has learned" - black box, not explainable | "Travel expense rule v2026-01, absence 14h15min" - rule, version and context documented |
| "Regulation changes?" | Retrain: 2 - 4 weeks, $5,000 - $20,000. Expensive and slow | Update the rule: effective immediately, $0. Versioned and auditable |
| "Can the affected person contest?" | Contest what? Model weights? Not contestable | "Breakfast was not included." A reviewer checks. Contestable via the decision record |
| "New LLM model available?" | New training required: weeks, vendor lock-in | Rulebook stays: zero effort, model-agnostic |
| "EU AI Act compliant?" | Art. 13: transparency missing. Art. 14: intervention = replace the model. Art. 86: explanation not possible | Decision record per Micro-Decision. Override the rule, not the model |
| Summary | Lock-in: Yes. Audit: Difficult. EU AI Act: Problematic | Lock-in: No. Audit: By design. EU AI Act: Compliant |
The compliance problem is only the surface. Beneath it lies an architecture problem.
92% vs. 72%
Researchers tested in 2025 how well an LLM can review legal invoices against billing guidelines.[1] No fine-tuning. No training. Just the rulebook as context.
The result:

| Metric | LLM (no training) | Humans |
|---|---|---|
| Overall accuracy | 92% | 72% (experienced lawyers) |
| Line item classification (F-Score) | 81% | 43% (best human group) |
| Time per invoice | 3.6 sec | ~250 sec |
| Cost per invoice | < $0.01 | $4.27 |

Source: Better Bill GPT, Whitehouse et al. (April 2025), peer-reviewed. The LLM received the billing guidelines as context - no fine-tuning.[1]
Cost reduction: 99.97%.[4] The mechanism is transferable to any rule-based compliance task.
The LLM was not trained on invoices. It received the billing guidelines as context. And decided immediately.
Why the LLM Performed Better
Not because it is smarter. Because it applies the same rule at 3 PM exactly as it does at 9 AM. Inconsistency is the human problem, not incompetence.[1]
Experienced lawyers make 72% correct decisions - but each lawyer makes different wrong decisions. The errors are not systematic but random. Fatigue, time pressure, personal interpretation. An LLM knows no fatigue.
The Transferable Mechanism
Whether the rulebook is called “billing guideline”, “per diem regulation” or “travel expense policy”: check document against rule, identify deviation, document decision. The mechanism is identical.
| Dimension | Trained Model | Configured Agent |
|---|---|---|
| Rule change | Retraining (weeks, $5k - $20k) | Rulebook update (minutes, $0) |
| Explainability | "Model has learned" (black box) | Rule + version + context (decision record) |
| Contestability | Not possible (no decision record) | Yes (affected person sees rule and can object) |
| Model switch | New training required (lock-in) | Zero effort (model-agnostic) |
| Audit trail | Input + output (no justification) | Input + rule + version + confidence + result |
| EU AI Act (Aug 2026) | Art. 13, 14, 86: Problematic | Art. 13, 14, 86: Compliant by design |
| Break-even fine-tuning | From ~35,000 queries/month[6] | Economical immediately |
A study by Chauhan et al. (2025) puts the break-even point of fine-tuning versus prompting at roughly 35,000 queries per month.[6] Most enterprise HR and finance processes operate well below that.
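The break-even logic can be made concrete with a back-of-the-envelope calculation. The cost figures below are illustrative assumptions, not numbers from the study:

```python
# Illustrative assumption: fine-tuning has a fixed up-front cost, while
# prompting with the rulebook in context costs more per query.
finetune_fixed = 8_000.00      # one-off training + evaluation run ($, assumed)
finetuned_per_query = 0.002    # shorter prompt, cheaper inference ($, assumed)
prompted_per_query = 0.012     # rulebook in context on every call ($, assumed)

# Fine-tuning pays off once the saved per-query cost covers the fixed cost:
break_even_queries = finetune_fixed / (prompted_per_query - finetuned_per_query)
months = break_even_queries / 35_000   # at the study's 35,000 queries/month

print(f"Break-even after {break_even_queries:,.0f} queries "
      f"(~{months:.1f} months at 35,000 queries/month)")
```

And every rule change before that point restarts the clock, because the fine-tuned weights encode the old rule.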
Three Things Instead of Training
If not training, then what? Three components replace what fine-tuning promises but structurally cannot deliver.
1. Rulebook
Everything an agent needs to know is in a regulation, a directive, a collective agreement or a company policy. These rules change. Tax law changes annually. Per diem rates change annually. EU regulations change.
A trained model must be retrained with every change. A rulebook is updated. Effective immediately, versioned, auditable. No GPU cluster, no evaluation cycle, no regression risks.
RAG (Retrieval Augmented Generation) reduces factual errors by up to 50%.[11] Not because the model gets smarter. Because it sees the current rule instead of retrieving an outdated weight.
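How "versioned and effective immediately" could look in code - a minimal sketch with hypothetical rule IDs and dates; a real rulebook would live in a store, not a dict:

```python
from datetime import date

# Each rule version carries an effective date; a change is a new version,
# not a retrain. The entries here are placeholders.
RULEBOOK = {
    "per-diem-domestic": [
        {"version": "v2025-01", "effective": date(2025, 1, 1), "text": "..."},
        {"version": "v2026-01", "effective": date(2026, 1, 1), "text": "..."},
    ],
}

def rule_in_effect(rule_id: str, on: date) -> dict:
    """Return the newest version whose effective date is not after `on`."""
    versions = [v for v in RULEBOOK[rule_id] if v["effective"] <= on]
    return max(versions, key=lambda v: v["effective"])

# An audit question like "which version applied on 2025-06-15?" has a direct answer:
print(rule_in_effect("per-diem-domestic", date(2025, 6, 15))["version"])  # v2025-01
```

Appending a version to the list is the entire deployment. The history stays queryable, which is exactly what an audit trail needs.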
2. Context
The agent does not need 10,000 historical expense reports. It needs this one report: travel date, departure, return, hotel, breakfast included or not. That is the context of this decision.
It is supplied through structured inputs or RAG, not trained in. When the context changes - different trip, different employee - the decision changes. Not the model.
A concrete example: the travel expense engine checks per diem allowances against the applicable tax regulation. In Germany, this is Section 9 of the Income Tax Act (EStG). The context is the individual trip. The rulebook is the current tax law. The foundation model is interchangeable.
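A sketch of such a deterministic rulebook step. The rates and thresholds are placeholders for illustration, not the actual figures from Section 9 EStG:

```python
from datetime import datetime

# Illustrative per diem rates - assumed values, not the statutory amounts.
RATES = {"over_8h": 14.00, "full_day": 28.00}  # EUR

def per_diem(departure: datetime, return_: datetime) -> float:
    """Deterministic rulebook step: allowance follows from absence duration alone."""
    hours = (return_ - departure).total_seconds() / 3600
    if hours >= 24:
        return RATES["full_day"]
    if hours > 8:
        return RATES["over_8h"]
    return 0.0

# The context of this one decision: departure and return - nothing historical.
allowance = per_diem(datetime(2026, 3, 2, 7, 0), datetime(2026, 3, 2, 21, 15))
print(allowance)  # 14.0 - an absence of 14h15min
```

A different trip produces a different result through the same rule. No model is involved in this step at all.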
3. Decision Framework
Who decides what? Not every decision in a process is equal.
The per diem allowance is rulebook: tax regulation, deterministic, 100% confidence. The question of whether an entertainment expense is “reasonable” is judgement: human. The classification of an illegible receipt is AI: LLM extraction, probabilistic.
This decomposition into Micro-Decisions with assignment to human / rulebook / AI is the real architecture work. Not training. The Decision Layer formalises exactly this decomposition; the architecture details are covered in "Decision Layer explained".
Micro-Decision in Practice
Travel expense report: 8-hour day, domestic trip, hotel with breakfast
Each step has a fixed type: Rulebook (deterministic), AI (probabilistic, with confidence threshold) or Human (judgement). When the tax regulation changes, the rule is updated. No retraining. No new model.
The Three Layers: Architecture Instead of Training
The architecture behind a configured agent consists of three layers. Each layer is independently replaceable.
Everything above Layer 1 remains when the model changes. Rulebook, Decision Layer, decision records, audit trail - all model-agnostic. No retraining. No lock-in.
Why three layers? Because each has a different responsibility.
The foundation model provides language understanding and reasoning. It understands context, extracts information from documents, classifies inputs. It does not need to know what a specific tax regulation says. It needs to understand what a regulatory text is.
The rulebook contains the business logic. Regulations, directives, collective agreements, company policies. Every rule has a version. Every version has an effective date. When the regulation changes, the rule is updated. Not the model.
The Decision Layer governs who may decide what. It decomposes processes into decision steps. Defines for each: human, rulebook or AI. Documents every decision with rule, version, context and result.
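What "each layer is independently replaceable" means for Layer 1, sketched as an interface. The class names and stub returns are illustrative assumptions:

```python
from typing import Protocol

class FoundationModel(Protocol):
    """Layer 1 behind an interface: the layers above never name a vendor."""
    def extract(self, document: str) -> dict: ...

class ClaudeModel:
    def extract(self, document: str) -> dict:
        return {"provider": "anthropic"}  # stub; a real API call would go here

class OpenSourceModel:
    def extract(self, document: str) -> dict:
        return {"provider": "local"}      # stub

def decide(model: FoundationModel, document: str, rule_version: str) -> dict:
    """Layers 2 and 3 - rulebook and decision record - stay model-agnostic."""
    context = model.extract(document)
    return {"rule_version": rule_version, "context": context}

# Same rulebook version, different Layer 1 - nothing above Layer 1 changes:
print(decide(ClaudeModel(), "receipt.pdf", "v2026-01")["rule_version"])
print(decide(OpenSourceModel(), "receipt.pdf", "v2026-01")["rule_version"])
```

Swapping the model is a one-line change at the call site; the rulebook and the decision records do not know it happened.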
What Training Really Costs
Not in dollars. In dependencies.
Lock-in
A fine-tuned model ties you to that vendor. The training set, the weights, the evaluation pipeline: all proprietary. Model switch = new training = new costs = new time loss.
A configured agent switches the foundation model without changing a single rule. Claude today, GPT tomorrow, an open-source model next week. The rulebook stays. The Decision Layer stays. The decision records stay.
Maintenance
Every regulatory change requires retraining. In finance, tax law, treasury guidance and social security contribution rates change annually. In HR, collective agreements, framework agreements and EU regulation change.
A trained agent needs continuous maintenance that looks like a software project. A configured agent needs a rulebook editor.
MIT and Stanford (Choi & Xie, 2025) show: AI reduces the monthly close by 7.5 days.[7] But 62% of accountants worry about AI errors.[8] The concern is justified - with trained models. With configured agents that have decision records and contestability, every error is identifiable and correctable.
Explainability
A trained model can tell you what it decided. It cannot tell you why.
“The model has learned” is not a justification a tax auditor accepts. No employee representative body accepts it. No rejected candidate accepts it.
“Travel expense rule v2026-01, applied to absence of 14h15min” is a justification.
If you cannot explain the decision, you cannot let it be contested. And if it cannot be contested, it is no longer legally compliant in the EU from August 2026.[10]
Does Fine-Tuning Have Its Place?
Yes. From roughly 35,000 queries per month with a stable rulebook, fine-tuning becomes economical.[6] Language adaptation, domain-specific jargon, latency optimisation: there are good reasons for it.
But where the industry sells it today - enterprise HR and finance with annually changing regulations - it is the wrong architecture decision. Gartner predicts that over 40% of agentic AI projects will fail by 2027.[9] Not because of model performance. Because of governance.
The Question Your Board Should Ask
Not: “What data was your agent trained on?”
But:
1. Which rulebook underlies the decision? Which version was in effect at the time of the decision?
If the answer is “that is in the model”, there is no version. No change history. No audit trail.
2. What happens when the rule changes? Retraining or update?
If the answer is “we retrain”, you are paying for maintenance that is unnecessary.
3. Can the affected person see and contest the individual decision?
If there is no answer, you have a compliance problem from August 2026. Art. 86 EU AI Act: right to explanation. Not optional.[10]
Gosign’s Approach
Gosign’s Decision Layer is an implementation of this architecture. It decomposes processes into decision steps. Defines for each: human, rulebook or AI. Rulebooks are versioned. Decisions are auditable. Results are contestable.
48 HR agents and 49 finance agents, each with a Micro-Decision table. No fine-tuning. No lock-in. No retraining when regulations change.
References
- Better Bill GPT, Whitehouse et al. (April 2025). Legal Invoice Review: LLM achieves 92% accuracy reviewing legal invoices against billing guidelines. Peer-reviewed.
- Better Bill GPT, Whitehouse et al. (April 2025). F-Score for line item classification: LLM 81% vs. best human group 43%.
- Better Bill GPT, Whitehouse et al. (April 2025). Processing time per invoice: LLM 3.6 seconds vs. experienced lawyers 194 to 316 seconds.
- Better Bill GPT, Whitehouse et al. (April 2025). Cost reduction in legal invoice review: 99.97% ($4.27 vs. <$0.01 per invoice).
- Kumar Gauraw (March 2026). "Most reach for fine-tuning too early."
- Chauhan et al., Journal of Information Systems Engineering (2025). Break-even fine-tuning vs. prompting: ~35,000 queries per month.
- MIT/Stanford, Choi & Xie (August 2025). AI reduces the monthly close by an average of 7.5 days.
- MIT/Stanford, Choi & Xie (August 2025). 62% of accountants express concerns about AI errors in financial processes.
- Gartner (June 2025). Prediction: Over 40% of agentic AI projects will fail by 2027.
- EU AI Act (Regulation 2024/1689), Crowell & Moring (February 2026). High-risk obligations from August 2026: Art. 13 (transparency), Art. 14 (human oversight), Art. 86 (right to explanation). Annex III(4): recruiting as high-risk system.
- IBM (2024). RAG reduces factual errors in LLM outputs by up to 50%.

Bert Gogolin
CEO & Founder, Gosign