12 Angry AI Agents:
Evaluating Multi-Agent LLM Decision-Making
Through Cinematic Jury Deliberation

[Figure: 12 robotic AI agents deliberating]

12 Angry AI Agents instantiates a multi-agent benchmark for LLM deliberation, simulating a cinematic jury environment to evaluate persuasion, stubbornness, and decision-making dynamics.

Abstract

What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind?

This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case in a shared deliberation framework. We analyze how different initial voting states, prompt personas, and interaction constraints affect the final consensus and the duration of deliberation (a minimal sketch of the deliberation loop appears below).
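To make the setup concrete, here is a minimal sketch of such a deliberation loop in Python. Everything in it is an illustrative assumption rather than the paper's actual implementation: the injected chat callable stands in for any chat-completion API, and the round limit, re-voting protocol, and context window are placeholders.

from typing import Callable

MAX_ROUNDS = 12  # assumption; the paper's round limit may differ

def deliberate(personas: list[str],
               initial_votes: list[str],
               chat: Callable[[str, str], str]) -> tuple[list[str], int]:
    """Debate in rounds until the jury is unanimous or MAX_ROUNDS is hit."""
    votes = list(initial_votes)
    transcript: list[str] = []
    for round_no in range(1, MAX_ROUNDS + 1):
        for i, persona in enumerate(personas):
            # Each juror sees a sliding window of recent turns (assumption).
            context = "\n".join(transcript[-24:])
            argument = chat(persona,
                            f"Current votes: {votes}\nDebate so far:\n{context}\n"
                            "Make your argument.")
            transcript.append(f"Juror {i + 1}: {argument}")
            # Each juror re-votes after speaking (illustrative protocol).
            ballot = chat(persona,
                          f"{context}\nGiven the debate, vote 'guilty' or 'not guilty'.")
            votes[i] = "not guilty" if "not guilty" in ballot.lower() else "guilty"
        if len(set(votes)) == 1:   # unanimous verdict reached
            return votes, round_no
    return votes, MAX_ROUNDS       # hung jury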

Through empirical analysis of agent communication transcripts, we measure persuasion, stubbornness, and alignment with human jury patterns, providing a novel methodology for evaluating complex social dynamics in LLM environments.
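As one illustration of how such measurements could be computed, the sketch below derives two quantities from per-round ballots, where votes_by_round[r][j] is juror j's vote after round r. The function names and the exact definitions of "vote changes" and "hung jury" are assumptions made for this example, not the paper's published formulas.

def vote_changes(votes_by_round: list[list[str]]) -> int:
    """Count how many times any juror flips between consecutive rounds."""
    flips = 0
    for prev, curr in zip(votes_by_round, votes_by_round[1:]):
        flips += sum(p != c for p, c in zip(prev, curr))
    return flips

def is_hung(votes_by_round: list[list[str]]) -> bool:
    """A run is hung if the final round never reaches unanimity."""
    return len(set(votes_by_round[-1])) > 1

# Example: two identical 11-to-1 rounds -> no flips, still hung.
ballots = [["guilty"] * 11 + ["not guilty"],
           ["guilty"] * 11 + ["not guilty"]]
print(vote_changes(ballots), is_hung(ballots))  # -> 0 True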

Key Takeaways

đź”’ The Anchoring Effect is Real

Out of 18 deliberation runs, an overwhelming 17 ended in a "hung jury" (no unanimous verdict). Unlike the movie, where one dissenter slowly persuades the rest, current LLMs stubbornly anchor to their initial positions and rarely change their minds.

⚖️ Alignment Makes AIs Rigid

The study compared GPT-4o (heavy safety alignment/RLHF) and Llama-4-Scout (lighter alignment). GPT-4o was incredibly inflexible, averaging just 1.0 vote change per run, completely ignoring explicit prompts to be "open-minded". Llama-4-Scout, however, was responsive, tripling its vote-change rate when asked to keep an open mind.

🏆 The Only Verdict

Across all tests, Llama-4-Scout was the only model capable of breaking the deadlock and reaching a unanimous "Not Guilty" verdict (in a scenario where initial votes were removed).

📉 Bigger Isn't Always Better

We often assume larger, highly capable models perform better at everything. But in multi-agent debates, flexibility, not raw capability, is what actually mimics human persuasion. The very RLHF training that makes an AI a "safe" and consistent assistant also makes it a rigid, stubborn collaborator.

🗣️ Parallel Monologues, Not True Debate

While the AI agents perfectly mimicked their character personas and cited the right evidence (wearing the "costume" of a debate), they failed at the actual social mechanism of changing minds. There were no emotional ruptures and no coalition-building, just parallel monologues and voices talking past each other.

BibTeX

@article{ersoz2026angry,
  title={12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation},
  author={Ersoz, Ahmet},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}