In 2023, a large recruitment system ran for over a year before anyone checked whether the model favored specific demographic groups. The review revealed systematic bias. None of the automated tests had caught it, because they measured prediction accuracy, not fairness. A human in the loop, regularly reviewing results on real samples, would have cost less than the remediation.
What the model actually does—and what it doesn’t
Large language models excel at recognizing patterns in the data they were trained on. They struggle with several specific things:
- Novelty — If a situation has no precedent in the training data, the model interpolates and often makes confident mistakes.
- Distribution shift — A model deployed six months ago doesn’t know your company changed pricing, the law changed, or a customer relationship has a history outside the knowledge corpus.
- Confidence calibration — A model that responds with a confident tone isn’t always correct. Confidence in tone and probability of correctness are different variables.
- Ethics not reflected in data — Training data mirrors historical patterns, including historical inequalities. The model doesn’t correct them on its own.
These limitations don’t disqualify AI as a tool—they dictate where humans must remain in the loop.
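The calibration point can be checked with a small script. What follows is a minimal sketch, assuming you log a numeric confidence score for each answer and later record whether the answer turned out to be correct. The function names and the five-bin layout are illustrative assumptions, not part of any specific library.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width bins and
    compare mean stated confidence with observed accuracy per bin."""
    bins = defaultdict(list)
    for confidence, was_correct in records:
        # Clamp so that confidence == 1.0 lands in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, was_correct))
    table = []
    for index in sorted(bins):
        items = bins[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table.append((index, len(items), mean_confidence, accuracy))
    return table


def expected_calibration_error(records, n_bins=5):
    """Weighted average gap between stated confidence and accuracy."""
    total = len(records)
    return sum(
        count / total * abs(mean_confidence - accuracy)
        for _, count, mean_confidence, accuracy in calibration_table(records, n_bins)
    )


# Synthetic example: the model claims 0.9 confidence but is right half the time.
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5
print(expected_calibration_error(overconfident))
```

A large gap in the high-confidence bins is a signal that confident-sounding answers deserve the same human scrutiny as hesitant ones.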
When human intuition is a technical asset
The word “intuition” sounds soft, but behind it lies a concrete ability: combining contextual knowledge not captured in any document with real-time situational assessment. An experienced credit analyst sees something in the background of an application that isn’t in any form field. A doctor connects test results with what they heard from the patient five minutes ago. A recruiter reads between the lines of a candidate’s application through the lens of company culture, which can’t be described in a hundred words.
None of these insights are “unmeasurable” in the sense of being indescribable after the fact. They’re unmeasurable before the decision, in real time—and that’s what makes them irreplaceable in situations where the consequences of error are asymmetric.
Good AI architecture doesn’t remove these situations. It detects them and routes them to a human before the model takes irreversible action.
Human-gate: where humans enter the loop
Human-gate is an architectural mechanism, not a policy note. In our deployments, it works like this:
- The agent classifies intent and assesses action weight.
- For actions marked as irreversible or high-risk, it generates an HMAC-signed confirmation token.
- The confirmation goes to a human (email, dashboard, push) with context: what, why, and the impact.
- The human approves or rejects. The model’s assertion that “the action is OK” isn’t enough.
- The token expires after 24 hours—no response = no action (fail-closed).
This pattern is more expensive than full automation, but it costs a fraction of an incident that would occur without it. We use it wherever errors are hard to reverse: external communications, customer data changes, financial decisions, publishing.
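The token flow described above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the code from the deployments mentioned: the function names, the payload fields, and the hard-coded `SECRET_KEY` are hypothetical, and a real system would load the key from a secrets manager and record each decision.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: comes from a secrets manager
TOKEN_TTL_SECONDS = 24 * 60 * 60  # the 24-hour expiry from the list above


def issue_token(action_id, summary, now=None):
    """Create a signed confirmation token for one pending action."""
    now = time.time() if now is None else now
    payload = {
        "action_id": action_id,
        "summary": summary,
        "expires_at": now + TOKEN_TTL_SECONDS,
    }
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + signature


def verify_token(token, now=None):
    """Return the payload if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature prefixes via timing.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    if now >= payload["expires_at"]:
        return None
    return payload


def may_execute(token, human_decision, now=None):
    """Fail-closed: run only on a valid token plus an explicit approval."""
    return verify_token(token, now) is not None and human_decision == "approve"
```

The design choice worth noting is that `may_execute` has no default path to `True`: a missing response, an expired token, or a tampered signature all produce the same result as a rejection.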
Explainability: humans must know what to comment on
Human oversight is worthless if the person only sees the outcome but doesn’t understand the process. In high-risk systems under the AI Act, explainability isn’t optional; it’s a documentation requirement.
In practice, this means at least three things:
| What must be visible | Why it matters to humans |
|---|---|
| Which documents or data feed the response | Assessing whether the source is current and relevant |
| What logical steps the model took (chain-of-thought) | Detecting flawed reasoning before action |
| How confident the model is and where its knowledge limits lie | Calibrating trust—when to ask further |
In RAG architecture, source tracing is natural: every response includes citations pointing to documents. This is basic explainability that simultaneously limits hallucinations and gives humans a foothold for verification.
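One way to make the three items in the table reviewable is to carry them alongside every answer. The sketch below assumes a hypothetical `ExplainedAnswer` record and illustrative thresholds (`max_age_days`, `min_confidence`). None of these names come from a specific RAG framework.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Source:
    doc_id: str
    last_updated: date


@dataclass
class ExplainedAnswer:
    text: str
    sources: list[Source]
    reasoning_steps: list[str]
    confidence: float


def review_flags(answer, today, max_age_days=180, min_confidence=0.7):
    """List the reasons a human reviewer should look more closely."""
    flags = []
    if not answer.sources:
        flags.append("no_sources")
    for source in answer.sources:
        if (today - source.last_updated).days > max_age_days:
            flags.append(f"stale_source:{source.doc_id}")
    if not answer.reasoning_steps:
        flags.append("no_reasoning_trace")
    if answer.confidence < min_confidence:
        flags.append("low_confidence")
    return flags


answer = ExplainedAnswer(
    text="Refunds are possible within 30 days.",
    sources=[Source("refund-policy", date(2023, 1, 1))],
    reasoning_steps=["Matched the question to the refund policy document."],
    confidence=0.9,
)
print(review_flags(answer, today=date(2024, 1, 1)))
```

An empty list does not mean the answer is correct. It only means none of the basic reviewability checks failed.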
Bias and AI Act: human oversight as a legal obligation
Systems that profile people, evaluate them, or make automated decisions about them fall under the AI Act’s high-risk category. The obligations are concrete:
- Technical documentation describing how the system works and what it was trained on.
- Risk assessment accounting for potential discriminatory impact.
- Human oversight with the ability to override any automated decision.
- Logs allowing reconstruction of why the system made a given decision.
Standard accuracy tests don’t catch discrimination. A model can have 93% accuracy while systematically favoring one group—because that accuracy isn’t measured separately for each subgroup. Human oversight here means: someone regularly reviews results not globally, but in demographic cross-sections, looking for deviations that aggregate statistics don’t show.
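The gap between aggregate and per-group metrics is easy to reproduce. The sketch below uses synthetic data constructed so that overall accuracy is 93% while one group fares far worse. The record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict


def overall_accuracy(records):
    """records: list of (group, predicted_positive, actual_positive)."""
    return sum(1 for _, predicted, actual in records if predicted == actual) / len(records)


def rates_by_group(records):
    """Accuracy and selection rate computed separately for each group."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "selected": 0})
    for group, predicted, actual in records:
        entry = counts[group]
        entry["n"] += 1
        entry["correct"] += int(predicted == actual)
        entry["selected"] += int(predicted)
    return {
        group: {
            "n": entry["n"],
            "accuracy": entry["correct"] / entry["n"],
            "selection_rate": entry["selected"] / entry["n"],
        }
        for group, entry in counts.items()
    }


# Synthetic data: 90 cases in group A, all predicted correctly;
# 10 cases in group B, where 7 qualified candidates are rejected.
records = (
    [("A", True, True)] * 45
    + [("A", False, False)] * 45
    + [("B", True, True)] * 3
    + [("B", False, True)] * 7
)

print(overall_accuracy(records))  # 0.93 overall
print(rates_by_group(records))    # group B accuracy is 0.3
```

In this example the smaller group contributes too few cases to move the aggregate, which is why the per-group view has to be computed explicitly.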
In our high-risk pilot deployments, we use shadow mode: the system runs in parallel with human decisions for the first few weeks. Only when comparisons show alignment and no systematic deviations does automation expand its scope. Not the other way around.
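A shadow-mode comparison can be as simple as pairing each model decision with the human decision on the same case. The following is a sketch under assumptions: the thresholds `min_cases` and `min_agreement` are placeholders to be set per deployment, and the `reviewer_signed_off` flag stands in for a human judgment that the disagreements show no systematic pattern.

```python
from collections import Counter


def shadow_report(pairs):
    """pairs: list of (case_id, model_decision, human_decision)."""
    total = len(pairs)
    if total == 0:
        raise ValueError("no shadow-mode cases recorded")
    # Count disagreements by direction, e.g. ("reject", "approve").
    disagreements = Counter(
        (model, human) for _, model, human in pairs if model != human
    )
    agreement = 1 - sum(disagreements.values()) / total
    return {"cases": total, "agreement": agreement, "disagreements": disagreements}


def ready_to_expand(report, reviewer_signed_off, min_cases=200, min_agreement=0.95):
    """Expand automation only with enough cases, high agreement,
    and a human confirming the disagreements are not systematic."""
    return (
        report["cases"] >= min_cases
        and report["agreement"] >= min_agreement
        and reviewer_signed_off
    )


pairs = [(i, "approve", "approve") for i in range(196)]
pairs += [(196 + i, "reject", "approve") for i in range(4)]
report = shadow_report(pairs)
print(report["agreement"], report["disagreements"])
```

The direction counts matter as much as the agreement rate: four disagreements that all point the same way are a prompt for review, not noise.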
Four layers of oversight in practice
Human oversight isn’t a single point. It’s several layers with different granularity:
Layer 1 — Design. Before the system is deployed, humans decide which actions fall within the agent’s scope and which absolutely require confirmation. This is an allow-list, not a blocklist.
Layer 2 — Operational. Human-gate for irreversible actions, as described above. Works in real time for every decision above the threshold.
Layer 3 — Review. Regular sampling of results by a domain expert—not technical logs, but actual decisions and their consequences. This is where model drift and systematic errors are detected.
Layer 4 — Structural. Compliance audit against the AI Act, GDPR, and company policy. Typically quarterly for high-risk systems, annually for others.
Each layer has a different owner. Without this structure, human oversight exists formally but doesn’t function.
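The allow-list from Layer 1 and the gate from Layer 2 can be expressed as one classification step. This is a minimal sketch; the action names and the two sets are hypothetical examples, and the key property is that anything not listed is refused rather than allowed.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_CONFIRMATION = "needs_confirmation"
    REFUSE = "refuse"


# Hypothetical scope, decided by humans at design time.
AUTONOMOUS_ACTIONS = frozenset({"draft_reply", "summarize_ticket"})
CONFIRMED_ACTIONS = frozenset({"send_external_email", "update_customer_record"})


def classify_action(action):
    """Allow-list semantics: unknown actions are refused by default."""
    if action in AUTONOMOUS_ACTIONS:
        return Decision.EXECUTE
    if action in CONFIRMED_ACTIONS:
        return Decision.NEEDS_CONFIRMATION
    return Decision.REFUSE


print(classify_action("draft_reply"))
print(classify_action("delete_account"))
```

With a blocklist, a newly added capability would run unchecked until someone remembered to block it. With an allow-list, it stays inert until someone explicitly grants it.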
When less oversight is appropriate
The above doesn’t mean every action requires confirmation. Excessive oversight destroys the value of automation and leads to “alarm fatigue”: people stop reading notifications because there are too many.
The right level of oversight depends on three variables:
- Reversibility of action — Actions that can be undone in minutes tolerate more automation than those with external consequences.
- Stake of error — The difference between the cost of an automated error and the cost of delaying a decision for human review.
- Model maturity — A system after 3 months of shadow mode with documented alignment can have a wider scope of autonomy than a new system.
These three variables should be formally assessed before each deployment—and revisited every few months, because the model isn’t static.
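To make the assessment repeatable, the three variables can be recorded per action and mapped to an oversight level. The mapping below is one possible rule set, written as an assumption for illustration: the field names, the three-month maturity threshold, and the ordering of the rules would all need to be agreed for a specific deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionProfile:
    reversible_within_minutes: bool
    error_cost: float   # estimated cost of an automated error
    delay_cost: float   # estimated cost of waiting for human review
    shadow_months: int  # months of documented shadow-mode alignment


def oversight_level(profile, min_shadow_months=3):
    """Map reversibility, stake, and maturity to an oversight level."""
    if not profile.reversible_within_minutes:
        return "human_gate"
    mature = profile.shadow_months >= min_shadow_months
    high_stake = profile.error_cost > profile.delay_cost
    if high_stake and not mature:
        return "human_gate"
    if high_stake or not mature:
        return "sampled_review"
    return "autonomous"


print(oversight_level(ActionProfile(True, error_cost=5, delay_cost=50, shadow_months=6)))
```

Storing the profile next to the decision also gives the periodic reassessment something concrete to compare against.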
FAQ
What is human-in-the-loop and when is it required?
Human-in-the-loop is an architectural pattern where a human approves or corrects a system’s actions before or after specific steps. It’s required wherever a model’s error is hard to reverse, the stakes are high, or the AI Act classifies the system as high-risk. In practice: not for every action, but for every irreversible one and for those that directly affect people’s rights or circumstances.
Doesn’t human oversight negate the point of automation?
No. Automation handles volume and consistency: tasks humans would do the same way, just slower and less reliably. Human oversight reserves humans for exceptions, situations unknown to the model, and decisions with asymmetric consequences. Good design minimizes the number of required approvals while maximizing their relevance.
How does the AI Act regulate human oversight in high-risk systems?
For high-risk systems, the AI Act requires operators to ensure effective human oversight enabling at least: observation of system operation, understanding of capabilities and limitations, anomaly detection, and the ability to override or stop the system. Logging alone, without someone regularly reviewing logs, doesn’t meet this requirement.
How can I check if my model is discriminating?
Standard accuracy metrics aren’t enough. You need to measure results separately for demographic subgroups and look for systematic deviations. In high-risk systems, the AI Act requires documentation of this analysis. In practice, we recommend shadow mode before full deployment and quarterly reviews of results in cross-sections, not just globally. For details, see our approach to high-risk systems.
Where do I start building human oversight into an existing system?
First, inventory the actions the system takes and divide them into reversible and irreversible. Irreversible actions get human-gate as a priority. Then implement result sampling: someone reviews 5-10% of decisions weekly and documents anomalies. This is the minimum that provides a foundation for later optimization. To assess your company’s readiness, see the AI readiness assessment.
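The weekly sampling step can be made deterministic, so that an auditor can later confirm which decisions were due for review. A minimal sketch, assuming each decision has a stable identifier. The function name, the week label format, and the 5% default are illustrative choices.

```python
import hashlib


def in_review_sample(decision_id, week, rate=0.05):
    """Deterministically pick roughly `rate` of decisions for human review.

    Hashing the week together with the id gives a different but
    reproducible sample each week, with no stored random state.
    """
    digest = hashlib.sha256(f"{week}:{decision_id}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return value < int(rate * 2**64)


decision_ids = [f"decision-{i}" for i in range(10_000)]
sample = [d for d in decision_ids if in_review_sample(d, "2024-W10")]
print(len(sample))  # close to 5% of 10,000
```

Because selection depends only on the identifier and the week, the system under review cannot influence which of its decisions a human will see.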