1 Introduction
Coding agents have crossed the threshold where the bottleneck is no longer "can the model write the change?" but "can anyone trust it unattended?" The common workflow, in which one chat session plans, writes, self-reviews and declares victory, concentrates every role in a single context window of a single model. When that workflow fails, it fails silently: the author is also the judge, and the judge is predisposed to approve.
This paper examines a specific architectural decision: separating the generator (the agent that writes code) from the verifier (the agent that reviews, tests and debugs it), and running the two as distinct pipeline nodes — with distinct contexts, distinct objectives, and preferably distinct base models. The question is not whether review helps; nobody disputes that. The question is why the review must not come from the same agent that produced the code, and what the published record says about it.
2 Verification is the cheaper half
The oldest result in this literature is also the most underused. Cobbe et al. trained small models to verify solutions sampled from a larger generator and reported that on GSM8K, “6B verification slightly outperforms a finetuned 175B model, thereby offering a boost approximately equivalent to a 30× model size increase”[3]. Reading candidate work and judging it required far less capability than producing it. Lightman et al. pushed the same lever further: a process-supervised reward model — a verifier that judges each step — solved 78% of a representative MATH subset via best-of-N selection, with process supervision significantly outperforming outcome-level supervision[4].
Song et al. formalized the underlying quantity as the generation–verification gap — how much better a model verifies than it generates — and found that it scales monotonically with pre-training compute across model families[15]. The practical reading for pipeline builders: verification capacity is abundant and cheap relative to generation capacity. An architecture that fails to spend a second, often smaller, model on checking is leaving the best-documented quality multiplier on the table.
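The reranking setup described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited paper: `best_of_n` and `toy_verifier` are hypothetical names, and the toy verifier merely stands in for a trained verifier model by checking a candidate answer to the fixed problem 17 + 25 with a subtraction.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              verifier_score: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest.

    Ties go to the earliest candidate, which is how max() behaves.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier_score)


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a trained verifier.

    Scores an answer to '17 + 25' by checking it backwards (answer - 25 == 17).
    A real verifier would be a model that outputs a correctness estimate.
    """
    try:
        return 1.0 if int(solution) - 25 == 17 else 0.0
    except ValueError:
        return 0.0  # unparseable answers get the lowest score


# Samples as a generator might produce them; only one is correct.
samples = ["41", "42", "52", "forty-two"]
chosen = best_of_n(samples, toy_verifier)
```

The generator only has to produce the right answer somewhere among its samples; the selection step is a separate function that can run on a different, smaller model.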
3 Why self-review under-delivers
If verification is cheap, why not let the generator verify itself? Because the evidence says intrinsic self-correction (a model revising its own output with no external signal) is unreliable. Huang et al. conclude that “LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction”[1]. Kamoi et al.’s critical survey reaches the same verdict across the wider literature: successful self-correction with feedback from prompted LLMs is essentially undemonstrated outside tasks unusually suited to it, and several celebrated positive results relied on evaluation setups that leaked oracle information; self-correction works well precisely “in tasks that can use reliable external feedback”[11].
Code is the domain where this bites hardest. Olausson et al. studied self-repair on HumanEval and APPS and found that once you account for the cost of the repair loop, gains “are often modest, vary a lot between subsets of the data, and are sometimes not present at all” — in several settings, simply sampling fresh solutions i.i.d. matched or beat self-repair[2]. Self-Refine reported sizable average improvements (~20% absolute across seven tasks), but the average is dominated by open-ended preference tasks rather than correctness-critical ones[10] — the later, more adversarial evaluations[1],[11] are the ones that survived scrutiny.
4 What changes with a second pair of weights
Three findings mark the boundary between self-review and independent review. First, in the same self-repair study, replacing the generator’s own feedback with feedback from a stronger, different model produced substantially larger gains — GPT-4 critiquing GPT-3.5’s code helped where GPT-3.5 critiquing itself did not, and human feedback on GPT-4’s code beat GPT-4’s self-feedback[2]. The repair step was never the problem; the source of the critique was.
Second, evaluator models demonstrably favor their own generations. Panickssery et al. show that LLM evaluators score their own outputs higher than equal-quality outputs of others (as judged by humans), find a linear correlation between a model’s ability to recognize its own text and the strength of this self-preference, and present controlled evidence that the link is causal rather than confounded[6]. Zheng et al. had earlier catalogued self-enhancement among LLM-judge biases — suggestive win-rate deltas, though their data could not settle it — while establishing the enabling result that a strong judge model agrees with human preferences at human–human levels (≈80%+)[5]. Judging is automatable; judging yourself is where the bias concentrates.
Third, dedicated critics work. OpenAI’s CriticGPT — a model trained specifically to find bugs in model-written code — caught more inserted and naturally occurring bugs than paid human reviewers, and its critiques were preferred over human critiques in 63% of cases on real LLM errors; human–critic teams were more comprehensive than humans alone while hallucinating less than the critic alone[8]. Debate-style setups, where multiple model instances propose and criticize answers across rounds, improve mathematical reasoning and factual validity over a single instance[7] — evidence that even fresh contexts of the same weights help, before any cross-model diversity is added.
5 Six mechanisms
The findings above are not coincidences. Six mechanisms produce them:
- M1 — Asymmetry. Judging an artifact against tests, types, and a spec is an easier task than synthesizing it, so verifier quality per token is high[3],[4],[15] — the economic basis for spending a second agent.
- M2 — Correlated blind spots. A model re-reading its own output samples from the same distribution that produced the bug. Whatever prior made the mistake plausible the first time makes it plausible on re-read. Ensemble theory has said this for decades: combining helps when members are accurate and diverse, that is, when they make different errors (Hansen & Salamon, via Dietterich)[12]. Same weights, same errors, no ensemble.
- M3 — Self-preference. Evaluators recognize and favor their own generations[6]. A reviewer that cannot favor “its own” code — because none of the code is its own — removes the bias at the source rather than prompting against it.
- M4 — Context contamination. The generator’s context holds its plan, its assumptions, its sunk reasoning. A separate reviewer starts from the artifact and the spec — the same reason debate across fresh instances helps even without changing models[7], and the reason human code review is done by someone who didn’t write the diff.
- M5 — Objective separation. “Make it work” and “find why it doesn’t” are different optimization targets. Critic models trained explicitly on the second objective outperform generalists and paid humans at it[8]; in a pipeline, the reviewer node’s instruction can be purely adversarial without degrading the generator’s constructive instruction.
- M6 — Process structure. Separation forces artifacts: a review has to be written down, a gate has to pass or fail. Staged, verification-heavy flows lifted GPT-4 from 19% to 44% pass@5 on CodeContests[13]; role-separated multi-agent frameworks report large drops in human revision cost against less-structured baselines[14]. Structure, not magic, accounts for much of the gain — and structure is exactly what a one-window chat lacks.
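The correlated-blind-spots argument (M2) can be made concrete with a small probability model. The sketch below is an illustration, not a result from the cited papers: it assumes two reviewers that each miss a given bug with the same probability `p_miss`, with a correlation between their misses. For two Bernoulli events with equal probability p and correlation ρ, the chance that both occur is p² + ρ·p·(1 − p). The function name and the example numbers are hypothetical.

```python
def joint_miss_probability(p_miss: float, correlation: float) -> float:
    """Probability that two reviewers BOTH miss the same bug.

    Assumes each reviewer misses with probability p_miss and that the two
    miss events have the given correlation (0 = independent, 1 = identical).
    Uses P(both) = p^2 + rho * p * (1 - p) for equal-mean Bernoulli events.
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be in [0, 1]")
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be in [0, 1]")
    return p_miss ** 2 + correlation * p_miss * (1.0 - p_miss)


# Illustrative numbers only: each reviewer misses 20% of bugs.
independent = joint_miss_probability(0.2, 0.0)  # decorrelated second reviewer
identical = joint_miss_probability(0.2, 1.0)    # same blind spots as the author
```

Under these assumptions an independent second reviewer cuts the joint miss rate from p to p², while a perfectly correlated one leaves it at p: the second pass costs tokens and catches nothing new.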
6 Evidence summary
| Study | Setting | Finding |
|---|---|---|
| Cobbe 2021[3] | GSM8K, trained verifier reranks samples | 6B verifier ≈ 30× model-size boost (authors’ estimate, GSM8K) |
| Lightman 2023[4] | MATH, process-supervised reward model | 78% of a representative subset solved via verifier best-of-N; process > outcome supervision |
| Huang 2023[1] | Reasoning, intrinsic self-correction | No reliable gains without external feedback; sometimes degrades |
| Olausson 2023[2] | Code self-repair, HumanEval/APPS | Self-repair often ≤ i.i.d. resampling; stronger-model feedback yields substantially larger gains |
| Panickssery 2024[6] | LLM evaluators, controlled | Self-preference bias, causally linked to self-recognition |
| McAleese 2024[8] | Dedicated critic model on real code bugs | Critiques preferred over human critiques in 63% of cases; catches more bugs than paid reviewers |
| Ridnik 2024[13] | CodeContests, staged generate–test–fix flow | GPT-4 pass@5 19% → 44% from flow structure alone |
| Hong 2023[14] | Role-separated multi-agent SWE framework | Human revision cost 0.83 vs 2.5 rounds against a less-structured multi-agent baseline |
| Song 2024[15] | Self-improvement theory, cross-family | Generation–verification gap scales monotonically with pre-training compute |
7 Design implications for pipelines
The mechanisms translate into five concrete rules, which Futsu implements as graph primitives rather than conventions:
- Separate the reviewer node. The agent that writes a change never approves it. In a Futsu canvas this is one edge: generator → reviewer, each with its own instruction and context (M3, M4, M5).
- Cross the provider line when you can. Different base models decorrelate errors (M2); a Claude coder with a Codex reviewer — or the reverse — is one node setting, not an integration project. Same-model review from a fresh context is the documented fallback[7], not the goal.
- Give the verifier external signal. Self-debugging becomes effective when execution results and unit tests are in the loop[9],[11] — so the reviewer node should see test output and run artifacts, not just the diff.
- Spend cheap tokens on judging. The asymmetry (M1) means the reviewer can run on a faster, cheaper model than the generator without giving up most of the value[3],[5] — in Futsu, an alias like @fast on the review node and hard cost caps on the loop.
- End at a human gate, with artifacts. Critics hallucinate too[8]; the pipeline’s last reviewer stays human, and every node’s output persists as plain files (state.json, events.ndjson) so the review trail can be grepped, diffed and replayed (M6).
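The rules above can be sketched as a plain control loop. This is not Futsu's API, which this paper does not specify; every name here (`Review`, `PipelineResult`, `run_pipeline`, the `toy_*` stand-ins) is hypothetical. The sketch assumes the generator, the test runner and the reviewer are separate callables, that the reviewer sees test output as well as the diff, and that a round limit stands in for a cost cap. The loop never approves a merge itself: it ends by handing the change to a human.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    findings: List[str] = field(default_factory=list)


@dataclass
class PipelineResult:
    diff: str
    rounds: int
    status: str  # "awaiting_human_gate" or "escalated_to_human"
    open_findings: List[str] = field(default_factory=list)


def run_pipeline(task: str,
                 generate: Callable[[str, List[str]], str],
                 run_tests: Callable[[str], str],
                 review: Callable[[str, str], Review],
                 max_rounds: int = 3) -> PipelineResult:
    """Generator -> tests -> separate reviewer, repeated up to max_rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    findings: List[str] = []
    diff = ""
    for round_no in range(1, max_rounds + 1):
        diff = generate(task, findings)        # generator sees prior findings
        test_output = run_tests(diff)          # external signal for the reviewer
        verdict = review(diff, test_output)    # reviewer never wrote the diff
        if verdict.approved:
            return PipelineResult(diff, round_no, "awaiting_human_gate")
        findings = verdict.findings
    # Round cap reached: stop spending and hand the open findings to a human.
    return PipelineResult(diff, max_rounds, "escalated_to_human", findings)


# Toy stand-ins so the sketch runs; real nodes would call two different models.
def toy_generate(task: str, findings: List[str]) -> str:
    return "patch (fixed)" if findings else "patch (off-by-one bug)"


def toy_run_tests(diff: str) -> str:
    return "FAIL: test_bounds" if "bug" in diff else "PASS"


def toy_review(diff: str, test_output: str) -> Review:
    if test_output.startswith("PASS"):
        return Review(approved=True)
    return Review(approved=False, findings=[f"tests failed: {test_output}"])


result = run_pipeline("fix bounds check", toy_generate, toy_run_tests, toy_review)
```

Keeping the three roles as separate callables is the design point: swapping the reviewer for a different base model changes one argument, not the loop.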
8 An open evaluation protocol
The honest status of the cross-model claim: directionally supported by published evidence[2],[6],[8],[12], not yet quantified on your codebase — or, in controlled form, on ours. Position papers that end there are marketing; here is the experiment instead. It runs on any repository in an afternoon, and because every Futsu run is a folder of plain files, the raw data outlives the conclusion:
- Pick N ≥ 20 bounded tasks (bug fixes, small features) with a runnable test suite.
- Pipeline A (self-review): generator writes the change, the same model in the same session reviews and revises, tests run, human gate.
- Pipeline B (separated): identical generator; the review node runs on a different base model with an adversarial instruction and access to test output; same human gate.
- Hold constant: task order, base generator, prompts, test suite, cost caps. Randomize task→pipeline assignment.
- Measure per task, straight from events.ndjson: first-pass test rate; defects found in review and confirmed by tests; defects that survived to the human gate; revision rounds; total tokens and dollars per merged change.
- Decision rule, pre-registered: B wins if it reduces gate-surviving defects without raising cost per merged change by more than the review node’s own spend.
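Two steps of the protocol are mechanical enough to sketch: the randomized task-to-pipeline assignment and the pre-registered decision rule. The code below is one possible reading, not a reference implementation. It assumes a balanced split between the two pipelines, a fixed seed for reproducibility, and per-pipeline averages as inputs (mean gate-surviving defects per task, mean cost per merged change). The function and parameter names are hypothetical, and the layout of events.ndjson is not assumed here.

```python
import random
from typing import Dict, Sequence


def assign_pipelines(task_ids: Sequence[str], seed: int = 0) -> Dict[str, str]:
    """Randomly split tasks between pipeline "A" and pipeline "B".

    Assumption: a balanced split (half each, B gets the extra task if the
    count is odd). The seed makes the assignment reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {task: ("A" if i < half else "B") for i, task in enumerate(shuffled)}


def b_wins(defects_a: float, defects_b: float,
           cost_a: float, cost_b: float,
           review_spend_b: float) -> bool:
    """Decision rule: B must reduce gate-surviving defects, and its extra
    cost per merged change must not exceed the review node's own spend."""
    fewer_defects = defects_b < defects_a
    cost_within_bound = (cost_b - cost_a) <= review_spend_b
    return fewer_defects and cost_within_bound


tasks = [f"task-{i:02d}" for i in range(20)]
assignment = assign_pipelines(tasks, seed=1)
```

Fixing the seed and the rule before any run is what makes the rule pre-registered: the comparison cannot be tuned after the results are in.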
We are running this protocol on our own development (Futsu builds Futsu — §7’s rules produced this paper’s codebase) and will publish the runs, not a summary of them. If you run it first, send us your artifacts: hello@futsu.cloud.
9 Limitations and honest caveats
- Separation is not free: a reviewer node spends tokens and wall-clock on every change, and on small or formulaic tasks staged same-model flows already capture much of the value[13].
- Verifiers are fallible: critic models hallucinate bugs and nitpick[8] — which is why the protocol in §8 counts test-confirmed findings only, and why the last gate is human.
- Much of the multi-agent evidence (debate, role frameworks) uses instances of the same base model[7],[14]; the cross-provider increment over fresh-context same-model review is exactly what §8 isolates, and we treat its size as an open question.
- Self-correction is not uniformly hopeless: with reliable external feedback — execution traces, unit tests — it works[9],[11]. Separation complements tests; it does not replace them.
- This paper synthesizes others’ experiments and argues mechanisms; it reports no proprietary benchmark. We consider that a feature, but it bounds the strength of the claim until §8 data lands.
10 Conclusion
The published record supports a simple architecture rule: the agent that writes the code should not be the only agent that judges it. Verification is the cheap half of the loop[3],[15]; self-review is the unreliable half[1],[2],[11]; self-preference is measurable[6]; dedicated critics beat both generalists and paid humans at finding bugs[8]; and structured, role-separated flows convert these effects into shipped-quality deltas[13],[14]. A pipeline that wires a generator into an independent reviewer — different context, different objective, ideally different weights — with execution feedback and a human gate is not a style preference. It is what the evidence, mechanism by mechanism, keeps pointing at.
Cite as: Futsu Research (2026). Write with one agent, verify with another: the case for generator–verifier separation in agentic coding pipelines. WP-001, v1.0. futsu.cloud/research/generator-verifier-separation
References
1. Huang, J. et al. (2023). “Large Language Models Cannot Self-Correct Reasoning Yet”. ICLR 2024 · arXiv:2310.01798. https://arxiv.org/abs/2310.01798
2. Olausson, T. X. et al. (2023). “Is Self-Repair a Silver Bullet for Code Generation?”. ICLR 2024 · arXiv:2306.09896. https://arxiv.org/abs/2306.09896
3. Cobbe, K. et al. (2021). “Training Verifiers to Solve Math Word Problems”. arXiv preprint · arXiv:2110.14168. https://arxiv.org/abs/2110.14168
4. Lightman, H. et al. (2023). “Let’s Verify Step by Step”. ICLR 2024 · arXiv:2305.20050. https://arxiv.org/abs/2305.20050
5. Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. NeurIPS 2023 Datasets & Benchmarks · arXiv:2306.05685. https://arxiv.org/abs/2306.05685
6. Panickssery, A. et al. (2024). “LLM Evaluators Recognize and Favor Their Own Generations”. NeurIPS 2024 · arXiv:2404.13076. https://arxiv.org/abs/2404.13076
7. Du, Y. et al. (2023). “Improving Factuality and Reasoning in Language Models through Multiagent Debate”. ICML 2024 · arXiv:2305.14325. https://arxiv.org/abs/2305.14325
8. McAleese, N. et al. (OpenAI) (2024). “LLM Critics Help Catch LLM Bugs”. arXiv preprint · arXiv:2407.00215. https://arxiv.org/abs/2407.00215
9. Chen, X. et al. (2023). “Teaching Large Language Models to Self-Debug”. ICLR 2024 · arXiv:2304.05128. https://arxiv.org/abs/2304.05128
10. Madaan, A. et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback”. NeurIPS 2023 · arXiv:2303.17651. https://arxiv.org/abs/2303.17651
11. Kamoi, R. et al. (2024). “When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs”. TACL vol. 12 · arXiv:2406.01297. https://arxiv.org/abs/2406.01297
12. Dietterich, T. G. (2000). “Ensemble Methods in Machine Learning”. Multiple Classifier Systems (MCS 2000), LNCS 1857, Springer. https://link.springer.com/chapter/10.1007/3-540-45014-9_1
13. Ridnik, T. et al. (2024). “Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering”. arXiv preprint · arXiv:2401.08500. https://arxiv.org/abs/2401.08500
14. Hong, S. et al. (2023). “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”. ICLR 2024 (oral) · arXiv:2308.00352. https://arxiv.org/abs/2308.00352
15. Song, Y. et al. (2024). “Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models”. ICLR 2025 · arXiv:2412.02674. https://arxiv.org/abs/2412.02674