Design in the Loop
“Human in the loop” is a governance phrase that is useful in the media but disappears the moment technical specification and tactical product work begin. What replaces it is a set of unresolved questions about the fundamental human-AI experience for the operator in the box.
TL;DR
“Human in the loop” alludes to participation while avoiding themes of comprehension, authority, or accountability.
The unresolved design problem in autonomy today is comprehension and, looking ahead, supervision at scale.
DoD programs fantasize about operators managing “2,000+” autonomous agents with no demonstrable interaction model, or even mental model, for doing so.
As systems shift from automation toward mission autonomy, operators become orchestrators of intent rather than direct executors of action.
When autonomy platforms are poorly designed, the burden transfers to training and operator resilience.
Governance Language
Every program briefing I’ve been in describes the human as being in the loop. A pilot retains authority for lethal engagement decisions. An air defender has the ability to intervene. An operator sets objectives and monitors behavior. The governance framing is consistent. The human is there.
Two months later, you are sitting with the engineering team actually building the system. Defining what the interface shows. What the operator sees when the agent is making a decision. What happens when the system encounters a scenario outside its parameters. In those conversations, nobody says “human in the loop.” The phrase disappears because it does not mean anything specific enough to design to.
That gap matters. A term that shapes policy and satisfies oversight requirements evaporates the moment a team has to decide what information belongs on the screen. Not because the phrase is wrong, exactly, but because it is a starting point rather than a specification. It says a human should remain involved in some meaningful way. Everything after that is left undefined: what they perceive, what they understand, what they can influence, and who remains accountable when the system acts.
DoD Directive 3000.09, the governing policy on autonomy in weapon systems, does not actually require a human in the loop. It requires that systems allow commanders and operators to exercise “appropriate levels of human judgment over the use of force.” Functionally, I think those phrases occupy the same space. Both provide assurance that a human remains involved without defining what meaningful oversight looks like once systems begin operating faster than human cognition can realistically follow.
The actual design questions underneath that language are comprehension, monitoring, and intervention. Does the operator understand what the system is doing and why, in terms they can act on rather than simply observe? Can they track behavior as conditions change? Can they intervene before a bad decision creates operational or ethical consequences? None of those questions are answered through governance language alone.
Intervention is usually where defense teams begin. The kill switch. The ability to stop the system. Safety and control receive serious attention. Monitoring remains manageable at current system scales. Comprehension is where I consistently see the gap, and it is the problem that becomes harder as systems become larger, faster, and more autonomous.
Think about it in terms of the OODA loop (observe, orient, decide, act). If the autonomous system is observing and orienting faster than the human can cognitively follow, then the human’s decision and action occur against incomplete understanding. The oversight gap becomes temporal before it becomes ethical. Policy cannot close that gap. Interface design, information architecture, and interaction models determine whether meaningful supervision remains possible at machine speed.
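The temporal gap can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name, the cycle time, and the comprehension time are hypothetical values chosen for the example, not measurements from any program. It counts how many full system decision cycles complete while an operator is still making sense of a single earlier state.

```python
import math


def cycles_behind(system_cycle_s: float, human_comprehension_s: float) -> int:
    """Return how many full system decision cycles complete while the
    operator is still comprehending one earlier system state.

    Both arguments are hypothetical inputs in seconds.
    """
    if system_cycle_s <= 0:
        raise ValueError("system_cycle_s must be positive")
    if human_comprehension_s < 0:
        raise ValueError("human_comprehension_s must not be negative")
    return math.floor(human_comprehension_s / system_cycle_s)


# Assumed numbers: the system re-plans every 0.5 s, and the operator needs
# 4 s to understand what they are looking at.
stale_cycles = cycles_behind(system_cycle_s=0.5, human_comprehension_s=4.0)
print(stale_cycles)  # 8
```

Under those assumed numbers, the operator's decision is made against a picture that is eight cycles old. No policy wording changes that ratio. Only the interface can, by shortening comprehension time or by presenting information that stays valid across cycles.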
The Scaling Problem
Right now, most of the systems I work on are relatively small. Six to twelve agents. When I think about what operators can realistically command and control under actual mission pressure, within a single domain, the number is probably closer to six to eight. Most of these systems are mobile, which means the interaction model is still heavily map-driven: agents, routes, tasks, telemetry, and status displayed through top-down command and control interfaces.
DoD stakeholders consistently ask a very different question. Across programs and clients, I repeatedly hear some version of the same prompt: how can a single operator effectively command and control 2,000 autonomous agents?
I do not know where that number originated, but it appears constantly, usually with a level of confidence that does not match anything I have actually seen demonstrated in practice.
At 2,000 tracks on a map, the map view becomes meaningless. Density overwhelms comprehension. The display stops communicating useful information because no human can cognitively process that volume of simultaneous activity in a way that supports timely decision-making. The interaction model that works for eight agents does not scale to 2,000.
What replaces it, I do not think anyone actually knows yet.
That uncertainty reveals something more significant than a missing interface pattern. At the scale many DoD programs are imagining, a human is not going to remain directly involved with every individual agent. The level of autonomy required increases dramatically. The oversight model itself changes.
“Human in the loop” was written for a world where operators make decisions about discrete systems. It was not written for a world where operators supervise thousands of autonomous agents executing their own plans simultaneously. The governance framing persists, but the underlying cognitive model fundamentally changes.
The unresolved question is whether oversight, as currently imagined, remains cognitively achievable at that scale.
Automation vs Mission Autonomy
There is an important distinction that gets blurred in many autonomy conversations. Most of the systems I encounter today are closer to automation than autonomy. They are not generating novel plans or responding to evolving conditions with meaningful agency. They are executing defined procedures: sensor fusion, deterrent inventory checks, weapon pairing, route execution, rule-based responses. That is automation.
Mission autonomy is something different. Mission autonomy implies systems receiving goals, constraints, and intent while determining execution dynamically in response to changing operational conditions. That transition matters because the interface for a system executing instructions looks nothing like the interface for a system generating its own plans.
I think about the Collaborative Combat Aircraft (CCA) program frequently in this context. Right now, most concepts still revolve around relatively constrained teaming and supervisory control. The future vision many programs discuss is significantly more autonomous than that. Operators define intent, constraints, and objectives while autonomous systems determine execution. That is not simply a more advanced version of today’s interaction model. It is a fundamentally different relationship between human and machine.
Today, the relationship is still relatively direct. The operator tasks the system. The system executes and reports back. The cognitive model remains close to supervision.
At scale, that relationship becomes something closer to orchestration.
The operator defines objectives, priorities, constraints, and rules of engagement. Autonomous systems determine execution. The human monitors behavior, adjusts boundaries, reallocates intent, and intervenes when behavior diverges from operational expectations. The operator is no longer directly executing actions through the system. They are supervising distributed autonomous behavior across compressed timelines and with incomplete information.
I have never liked the term “puppet mastering” in this context. Puppets imply direct physical manipulation. That is not the relationship emerging inside autonomous systems, and it certainly is not achievable at scale. Conducting is closer. A conductor does not play every instrument. The conductor establishes coordination, timing, intent, and cohesion across systems operating simultaneously.
That cognitive shift, from direct control toward orchestration, is real. It requires a genuinely different interface model than the tasking-and-reporting systems most defense programs still rely on today.
The challenge is that these interfaces are entering environments that are already cognitively saturated. A fighter pilot is simultaneously managing sensor fusion, tactical coordination, navigation, communications, electronic warfare, threat interpretation, weapons systems, and mission execution under compressed timelines and extreme pressure. Autonomous wingman supervision does not arrive inside a clean workflow. It arrives inside one of the most cognitively dense operational environments that exists.
Trust Is Contextual
Trust calibration shifts across these workflows too, and not uniformly.
Last year I spent significant time working in air defense environments. What became clear very quickly is that trust is not a single dial operators turn up or down across the system. Trust changes depending on the phase of the workflow.
Trust is low during identification. That is the moment an operator decides whether a track represents a legitimate threat. The consequences of getting that wrong are enormous, and I have yet to see an interface that communicates system confidence in a way that consistently gives operators enough understanding to act without additional investigation.
Trust becomes considerably higher during deterrent selection and weapon pairing, where the logic is more constrained and procedural. Inventory checks, targeting sequences, availability verification, and rule-based pairing are places where operators are far more comfortable deferring to automation.
The oversight relationship changes across the kill chain itself. “Human in the loop” applies the same governance framing to all of it. A designed oversight model would need to account for those differences explicitly.
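One way to picture what “explicitly” could mean is a per-phase oversight table that the interface reads from, rather than a single system-wide setting. The sketch below is a minimal illustration, not a recommendation for any real system. The phase names follow the workflow described above, but the type names, the two oversight modes, and the values assigned to each phase are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    IDENTIFICATION = "identification"
    DETERRENT_SELECTION = "deterrent_selection"
    WEAPON_PAIRING = "weapon_pairing"


class Oversight(Enum):
    # Hypothetical modes, named for illustration only.
    HUMAN_DECIDES = "human_decides"    # system recommends, operator must confirm
    HUMAN_CAN_VETO = "human_can_veto"  # system proceeds unless operator intervenes


@dataclass(frozen=True)
class PhasePolicy:
    oversight: Oversight
    show_confidence_detail: bool  # whether the UI expands the system's reasoning


# Assumed assignments that mirror the trust pattern described in the text:
# low trust at identification, higher trust in constrained, procedural phases.
POLICY = {
    Phase.IDENTIFICATION: PhasePolicy(Oversight.HUMAN_DECIDES, True),
    Phase.DETERRENT_SELECTION: PhasePolicy(Oversight.HUMAN_CAN_VETO, False),
    Phase.WEAPON_PAIRING: PhasePolicy(Oversight.HUMAN_CAN_VETO, False),
}


def requires_confirmation(phase: Phase) -> bool:
    """True when the operator must actively confirm before the workflow advances."""
    return POLICY[phase].oversight is Oversight.HUMAN_DECIDES


print(requires_confirmation(Phase.IDENTIFICATION))  # True
print(requires_confirmation(Phase.WEAPON_PAIRING))  # False
```

The point of the table is not the specific values. It is that the oversight relationship becomes something a team has to write down phase by phase, which is exactly the decision the governance phrase leaves open.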
Complexity Transfers
When design is not involved early in defining these systems, the burden transfers somewhere else. Usually, it transfers to training.
Training cycles get longer. Operators are asked to adapt themselves to interfaces that were never designed around how they actually reason, prioritize, or make decisions under pressure. The assumption is often that autonomy reduces workload. In reality, autonomy frequently reduces execution workload while increasing supervisory workload. That is a fundamentally different cognitive burden, and a much harder one to compensate for through training alone.
The complexity does not disappear when design is absent. It transfers to the operator and to the organization responsible for preparing them.
Resilience should not become justification for avoidable complexity.
There is also an honest limitation inside the research itself that I think the industry needs to acknowledge more openly. Much of what defense teams observe happens during demonstrations rather than in genuine research environments. Those are not the same thing. Demonstrations reveal whether systems function. They reveal much less about how operators actually reason under stress, uncertainty, fatigue, and mission pressure.
The conditions that expose comprehension failures, trust breakdowns, cognitive overload, or degraded decision-making are often the exact conditions most design teams never receive meaningful access to. That does not invalidate the work. It does mean the confidence placed in those observations should remain proportional to the environment in which they were gathered.
The explainability problem does not have a mature answer yet either. The most useful approaches I have seen are relatively simple: dashboards and interface layers that expose a handful of operationally meaningful data points about system behavior in real time. But explainability is highly contextual. What an air defender needs to understand about autonomous behavior is different from what a logistics operator or special operations forces (SOF) operator needs. A universal explainability layer is probably the wrong objective.
What seems more realistic is a principled framework for building context-specific explainability models across domains. I do not think that framework exists yet.
Designing Oversight
The oversight model for autonomous systems is being written right now in policy documents, acquisition strategies, and program plans, before the design work has validated whether the assumptions behind it are cognitively achievable in practice.
By the time design teams are brought in, if they are brought in at all, the architecture is often already fixed. The interface becomes an expression of the oversight model rather than an opportunity to meaningfully shape it.
The policy debate around autonomy will continue for years. The design problem needs answers much sooner than that.
The question is no longer whether a human remains involved. Everyone already agrees they should. The real question is whether oversight remains cognitively achievable once autonomous systems begin operating at machine speed and organizational scale.
DoD policy currently assumes that supervision naturally scales alongside autonomy. Interface design is where that assumption will either hold or fail.
Further Reading
Air Force Collaborative Combat Aircraft (CCA) Program – Congressional Research Service
Autonomy in Weapon Systems (PDF) – Department of Defense, Directive 3000.09
Bias, Explainability, and Trust for AI-Enabled Military Systems – SPIE
AI-Driven Human-Autonomy Teaming in Tactical Operations – arXiv