WP-001 · Write with one agent, verify with another

Abstract

Agentic coding pipelines must decide who checks the work. The default — asking the model that wrote the code to also review and debug it — reuses the same weights, the same context, and therefore the same blind spots. We synthesize published evidence across three lines of research: (i) intrinsic self-correction without external feedback is unreliable and can degrade performance^[1],[11]; (ii) verification is systematically easier to obtain than generation, making a dedicated verifier one of the cheapest known quality multipliers^[3],[4],[15]; and (iii) evaluator models measurably favor their own generations, an effect linked to self-recognition^[6]. We argue from six mechanisms that the reviewer-debugger should be a separate agent — ideally on a different base model — and map the argument onto concrete pipeline design: a generator node, an independent reviewer node, an execution-feedback loop, and a human gate. Because we sell software built on this claim, we do not ask for trust: §8 specifies a falsifiable A/B protocol that any team can run on its own repository, with every artifact persisted as plain files.

1 Introduction

Coding agents have crossed the threshold where the bottleneck is no longer “can the model write the change?” but “can anyone trust it unattended?” The common workflow — one chat session that plans, writes, self-reviews and declares victory — concentrates every role in a single context window of a single model. When that workflow fails, it fails silently: the author is also the judge, and the judge is predisposed to approve.

This paper examines a specific architectural decision: separating the generator (the agent that writes code) from the verifier (the agent that reviews, tests and debugs it), and running the two as distinct pipeline nodes — with distinct contexts, distinct objectives, and preferably distinct base models. The question is not whether review helps; nobody disputes that. The question is why the review must not come from the same agent that produced the code, and what the published record says about it.

2 Verification is the cheaper half

The oldest result in this literature is also the most underused. Cobbe et al. trained small models to verify solutions sampled from a larger generator and reported that on GSM8K, “6B verification slightly outperforms a finetuned 175B model, thereby offering a boost approximately equivalent to a 30× model size increase”^[3]. Reading candidate work and judging it required far less capability than producing it. Lightman et al. pushed the same lever further: a process-supervised reward model — a verifier that judges each step — solved 78% of a representative MATH subset via best-of-N selection, with process supervision significantly outperforming outcome-level supervision^[4].

Song et al. formalized the underlying quantity as the generation–verification gap — how much better a model verifies than it generates — and found that it scales monotonically with pre-training compute across model families^[15]. The practical reading for pipeline builders: verification capacity is abundant and cheap relative to generation capacity. An architecture that fails to spend a second, often smaller, model on checking is leaving the best-documented quality multiplier on the table.
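
The verifier-based selection described above (sample several candidates, let a verifier pick one) can be sketched in a few lines. This is a minimal illustration, not the setup of any cited paper: `best_of_n`, `verifier_score`, and `toy_verifier` are hypothetical names, and the toy verifier checks arithmetic instead of calling a model.

```python
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[str],
    verifier_score: Callable[[str], float],
) -> str:
    """Return the candidate the verifier rates highest.

    `verifier_score` is a hypothetical stand-in for a trained verifier or
    reward model: any callable mapping a candidate to a number works here.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate on ties, so selection is deterministic.
    return max(candidates, key=verifier_score)


# Toy usage: the "verifier" checks a claimed sum instead of calling a model.
def toy_verifier(candidate: str) -> float:
    return 1.0 if candidate.strip() == str(17 + 25) else 0.0


samples = ["41", "42", "43", "42"]
chosen = best_of_n(samples, toy_verifier)
```

The generator only has to land one good sample among N; the verifier only has to rank, which is the cheaper job the section describes.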

Verification is the cheapest known way to convert extra inference into extra reliability — and it does not need to come from the biggest model in the pipeline.

3 Why self-review under-delivers

If verification is cheap, why not let the generator verify itself? Because the evidence says intrinsic self-correction — a model revising its own output with no external signal — is unreliable. Huang et al. conclude that “LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction”^[1]. Kamoi et al.’s critical survey reaches the same verdict across the wider literature: successful self-correction with feedback from prompted LLMs is essentially undemonstrated outside tasks unusually suited to it, and several celebrated positive results relied on evaluation setups that leaked oracle information; self-correction works well precisely “in tasks that can use reliable external feedback”^[11].

Code is the domain where this bites hardest. Olausson et al. studied self-repair on HumanEval and APPS and found that once you account for the cost of the repair loop, gains “are often modest, vary a lot between subsets of the data, and are sometimes not present at all” — in several settings, simply sampling fresh solutions i.i.d. matched or beat self-repair^[2]. Self-Refine reported sizable average improvements (~20% absolute across seven tasks), but the average is dominated by open-ended preference tasks rather than correctness-critical ones^[10] — the later, more adversarial evaluations^[1],[11] are the ones that survived scrutiny.
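
Comparing a repair loop against fresh i.i.d. sampling only makes sense at an equal sample budget, which is what pass@k measures. The sketch below uses the commonly used unbiased pass@k estimator; it is an illustration of the metric, not code from the cited study, and the function name is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples drawn for the task, c: how many of them passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the fraction of size-k subsets of
    the n samples that contain at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must include a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

To put a repair loop and plain resampling on the same footing, count every program the loop generates (initial drafts plus repairs) against the same k.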

4 What changes with a second pair of weights

Three findings mark the boundary between self-review and independent review. First, in the same self-repair study, replacing the generator’s own feedback with feedback from a stronger, different model produced substantially larger gains — GPT-4 critiquing GPT-3.5’s code helped where GPT-3.5 critiquing itself did not, and human feedback on GPT-4’s code beat GPT-4’s self-feedback^[2]. The repair step was never the problem; the source of the critique was.

Second, evaluator models demonstrably favor their own generations. Panickssery et al. show that LLM evaluators score their own outputs higher than equal-quality outputs of others (as judged by humans), find a linear correlation between a model’s ability to recognize its own text and the strength of this self-preference, and present controlled evidence that the link is causal rather than confounded^[6]. Zheng et al. had earlier catalogued self-enhancement among LLM-judge biases — suggestive win-rate deltas, though their data could not settle it — while establishing the enabling result that a strong judge model agrees with human preferences at human–human levels (≈80%+)^[5]. Judging is automatable; judging yourself is where the bias concentrates.

Third, dedicated critics work. OpenAI’s CriticGPT — a model trained specifically to find bugs in model-written code — caught more inserted and naturally occurring bugs than paid human reviewers, and its critiques were preferred over human critiques in 63% of cases on real LLM errors; human–critic teams were more comprehensive than humans alone while hallucinating less than the critic alone^[8]. Debate-style setups, where multiple model instances propose and criticize answers across rounds, improve mathematical reasoning and factual validity over a single instance^[7] — evidence that even fresh contexts of the same weights help, before any cross-model diversity is added.

5 Six mechanisms

The findings above are not coincidences. Six mechanisms produce them:

M1 — Asymmetry. Judging an artifact against tests, types, and a spec is an easier task than synthesizing it, so verifier quality per token is high^[3],[4],[15] — the economic basis for spending a second agent.
M2 — Correlated blind spots. A model re-reading its own output samples from the same distribution that produced the bug. Whatever prior made the mistake plausible the first time makes it plausible on re-read. Ensemble theory has said this for decades: combining helps when members are accurate and diverse — when they make different errors (Hansen & Salamon, via Dietterich)^[12]. Same weights, same errors, no ensemble.
M3 — Self-preference. Evaluators recognize and favor their own generations^[6]. A reviewer that cannot favor “its own” code — because none of the code is its own — removes the bias at the source rather than prompting against it.
M4 — Context contamination. The generator’s context holds its plan, its assumptions, its sunk reasoning. A separate reviewer starts from the artifact and the spec — the same reason debate across fresh instances helps even without changing models^[7], and the reason human code review is done by someone who didn’t write the diff.
M5 — Objective separation. “Make it work” and “find why it doesn’t” are different optimization targets. Critic models trained explicitly on the second objective outperform generalists and paid humans at it^[8]; in a pipeline, the reviewer node’s instruction can be purely adversarial without degrading the generator’s constructive instruction.
M6 — Process structure. Separation forces artifacts: a review has to be written down, a gate has to pass or fail. Staged, verification-heavy flows lifted GPT-4 from 19% to 44% pass@5 on CodeContests^[13]; role-separated multi-agent frameworks report large drops in human revision cost against less-structured baselines^[14]. Structure, not magic, accounts for much of the gain — and structure is exactly what a one-window chat lacks.
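
The arithmetic behind M2 can be made concrete with a toy model. Treat "the author writes a given bug" and "the reviewer fails to see it" as two Bernoulli events with correlation rho; the probabilities below are invented for illustration and are not measurements from any cited study.

```python
from math import sqrt


def joint_miss_probability(p_author: float, p_reviewer: float, rho: float) -> float:
    """P(author introduces the bug AND reviewer misses it).

    Models both events as Bernoulli variables with Pearson correlation rho.
    For such variables P(both) = p*q + rho * sqrt(p(1-p) q(1-q)).
    """
    for p in (p_author, p_reviewer):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    joint = p_author * p_reviewer + rho * sqrt(
        p_author * (1 - p_author) * p_reviewer * (1 - p_reviewer)
    )
    # Not every rho is feasible for given marginals; reject impossible values.
    lower = max(0.0, p_author + p_reviewer - 1.0)
    upper = min(p_author, p_reviewer)
    if not lower - 1e-12 <= joint <= upper + 1e-12:
        raise ValueError("rho is not achievable for these marginals")
    return min(max(joint, lower), upper)


# Illustrative numbers only: each side errs on 20% of cases.
independent = joint_miss_probability(0.2, 0.2, rho=0.0)      # about 0.04
same_blind_spot = joint_miss_probability(0.2, 0.2, rho=1.0)  # about 0.2
```

With uncorrelated errors the bug escapes in roughly 4% of cases; with perfectly correlated errors the review step removes nothing and the escape rate stays at the author's 20%.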

6 Evidence summary

Study	Setting	Finding
Cobbe 2021^[3]	GSM8K, trained verifier reranks samples	6B verifier ≈ 30× model-size boost (authors’ estimate, GSM8K)
Lightman 2023^[4]	MATH, process-supervised reward model	78% of a representative subset solved via verifier best-of-N; process > outcome supervision
Huang 2023^[1]	Reasoning, intrinsic self-correction	No reliable gains without external feedback; sometimes degrades
Olausson 2023^[2]	Code self-repair, HumanEval/APPS	Self-repair often ≤ i.i.d. resampling; stronger-model feedback yields substantially larger gains
Panickssery 2024^[6]	LLM evaluators, controlled	Self-preference bias, causally linked to self-recognition
McAleese 2024^[8]	Dedicated critic model on real code bugs	Critiques preferred over human critiques in 63% of cases; catches more bugs than paid reviewers
Ridnik 2024^[13]	CodeContests, staged generate–test–fix flow	GPT-4 pass@5 19% → 44% from flow structure alone
Hong 2023^[14]	Role-separated multi-agent SWE framework	Human revision cost 0.83 vs 2.5 rounds against a less-structured multi-agent baseline
Song 2024^[15]	Self-improvement theory, cross-family	Generation–verification gap scales monotonically with pre-training compute

7 Design implications for pipelines

The mechanisms translate into five concrete rules, which Futsu implements as graph primitives rather than conventions:

Separate the reviewer node. The agent that writes a change never approves it. In a Futsu canvas this is one edge: generator → reviewer, each with its own instruction and context (M3, M4, M5).
Cross the provider line when you can. Different base models decorrelate errors (M2); a Claude coder with a Codex reviewer — or the reverse — is one node setting, not an integration project. Same-model review from a fresh context is the documented fallback^[7], not the goal.
Give the verifier external signal. Self-debugging becomes effective when execution results and unit tests are in the loop^[9],[11] — so the reviewer node should see test output and run artifacts, not just the diff.
Spend cheap tokens on judging. The asymmetry (M1) means the reviewer can run on a faster, cheaper model than the generator without giving up most of the value^[3],[5] — in Futsu, an alias like @fast on the review node and hard cost caps on the loop.
End at a human gate, with artifacts. Critics hallucinate too^[8]; the pipeline’s last reviewer stays human, and every node’s output persists as plain files (state.json, events.ndjson) so the review trail can be grepped, diffed and replayed (M6).
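
The five rules above can be sketched as a small control loop. This is a hypothetical illustration, not Futsu's API: `Pipeline`, `Review`, and every callable signature are assumptions, the nodes are plain functions so the sketch runs without any model, and artifact persistence is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    findings: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    # All four nodes are hypothetical callables; in practice the generator
    # and reviewer would be backed by different models.
    generate: Callable[[str, List[str]], str]   # (task, findings) -> patch
    run_tests: Callable[[str], str]             # patch -> test output
    review: Callable[[str, str], Review]        # (patch, test output) -> Review
    human_gate: Callable[[str, Review], bool]   # final approval stays human
    max_rounds: int = 3                         # hard cap on the loop

    def run(self, task: str) -> bool:
        findings: List[str] = []
        for _ in range(self.max_rounds):
            patch = self.generate(task, findings)
            test_output = self.run_tests(patch)         # external signal
            verdict = self.review(patch, test_output)   # separate node
            if verdict.approved:
                return self.human_gate(patch, verdict)
            findings = verdict.findings                 # feed back, retry
        return False  # cap reached without an approved patch


# Toy stubs standing in for real agents and a real test runner.
def toy_generate(task: str, findings: List[str]) -> str:
    return "fixed" if findings else "buggy"


def toy_tests(patch: str) -> str:
    return "PASS" if patch == "fixed" else "FAIL: test_sum"


def toy_review(patch: str, test_output: str) -> Review:
    ok = test_output == "PASS"
    return Review(approved=ok, findings=[] if ok else [test_output])


def toy_gate(patch: str, verdict: Review) -> bool:
    return True


merged = Pipeline(toy_generate, toy_tests, toy_review, toy_gate).run("fix sum")
```

The generator never sees an approval path of its own: the only way out of the loop is through the reviewer and then the human gate, and the round cap bounds spend.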

8 An open evaluation protocol

The honest status of the cross-model claim: directionally supported by published evidence^[2],[6],[8],[12], not yet quantified on your codebase — or, in controlled form, on ours. Position papers that end there are marketing; here is the experiment instead. It runs on any repository in an afternoon, and because every Futsu run is a folder of plain files, the raw data outlives the conclusion:

Pick N ≥ 20 bounded tasks (bug fixes, small features) with a runnable test suite.
Pipeline A (self-review): generator writes the change, the same model in the same session reviews and revises, tests run, human gate.
Pipeline B (separated): identical generator; the review node runs on a different base model with an adversarial instruction and access to test output; same human gate.
Hold constant: task order, base generator, prompts, test suite, cost caps. Randomize task→pipeline assignment.
Measure per task, straight from events.ndjson: first-pass test rate; defects found in review and confirmed by tests; defects that survived to the human gate; revision rounds; total tokens and dollars per merged change.
Decision rule, pre-registered: B wins if it reduces gate-surviving defects without raising cost per merged change by more than the review node’s own spend.
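
Two steps of the protocol are mechanical enough to sketch: the randomized task→pipeline assignment and the pre-registered decision rule. The names below (`assign_tasks`, `ArmResult`, `b_wins`) are hypothetical, and the reading of the decision rule is an assumption: defects are compared per task, and B's extra cost per merged change may not exceed B's review-node spend per merged change.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


def assign_tasks(task_ids: List[str], seed: int) -> Dict[str, str]:
    """Randomly split tasks between pipelines A and B, as evenly as possible."""
    shuffled = list(task_ids)
    # A seeded generator makes the assignment reproducible and auditable.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {t: ("A" if i < half else "B") for i, t in enumerate(shuffled)}


@dataclass
class ArmResult:
    tasks: int                    # tasks assigned to this pipeline
    gate_surviving_defects: int   # defects first caught at the human gate
    merged_changes: int           # changes that were actually merged
    total_cost: float             # dollars spent by the whole pipeline
    review_node_cost: float       # dollars spent by the review node alone

    def cost_per_merge(self) -> float:
        if self.merged_changes == 0:
            raise ValueError("no merged changes to average over")
        return self.total_cost / self.merged_changes


def b_wins(a: ArmResult, b: ArmResult) -> bool:
    """Assumed reading of the rule: fewer surviving defects, bounded extra cost."""
    if a.tasks == 0 or b.tasks == 0:
        raise ValueError("both arms need at least one task")
    fewer_defects = (
        b.gate_surviving_defects / b.tasks < a.gate_surviving_defects / a.tasks
    )
    extra_cost = b.cost_per_merge() - a.cost_per_merge()
    allowance = b.review_node_cost / b.merged_changes
    return fewer_defects and extra_cost <= allowance
```

Fixing the seed and the rule before any run is what makes the comparison pre-registered: neither the split nor the verdict can be adjusted after the results are in.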

We are running this protocol on our own development (Futsu builds Futsu — §7’s rules produced this paper’s codebase) and will publish the runs, not a summary of them. If you run it first, send us your artifacts: hello@futsu.cloud.

9 Limitations and honest caveats

Separation is not free: a reviewer node spends tokens and wall-clock on every change, and on small or formulaic tasks staged same-model flows already capture much of the value^[13].
Verifiers are fallible: critic models hallucinate bugs and nitpick^[8] — which is why the protocol in §8 counts test-confirmed findings only, and why the last gate is human.
Much of the multi-agent evidence (debate, role frameworks) uses instances of the same base model^[7],[14]; the cross-provider increment over fresh-context same-model review is exactly what §8 isolates, and we treat its size as an open question.
Self-correction is not uniformly hopeless: with reliable external feedback — execution traces, unit tests — it works^[9],[11]. Separation complements tests; it does not replace them.
This paper synthesizes others’ experiments and argues mechanisms; it reports no proprietary benchmark. We consider that a feature, but it bounds the strength of the claim until §8 data lands.

10 Conclusion

The published record supports a simple architecture rule: the agent that writes the code should not be the only agent that judges it. Verification is the cheap half of the loop^[3],[15]; self-review is the unreliable half^[1],[2],[11]; self-preference is measurable^[6]; dedicated critics beat both generalists and paid humans at finding bugs^[8]; and structured, role-separated flows convert these effects into shipped-quality deltas^[13],[14]. A pipeline that wires a generator into an independent reviewer — different context, different objective, ideally different weights — with execution feedback and a human gate is not a style preference. It is what the evidence, mechanism by mechanism, keeps pointing at.

Cite as: Futsu Research (2026). Write with one agent, verify with another: the case for generator–verifier separation in agentic coding pipelines. WP-001, v1.0. futsu.cloud/research/generator-verifier-separation

References

[1] Huang, J. et al. (2023). “Large Language Models Cannot Self-Correct Reasoning Yet”. ICLR 2024 · arXiv:2310.01798. https://arxiv.org/abs/2310.01798
[2] Olausson, T. X. et al. (2023). “Is Self-Repair a Silver Bullet for Code Generation?”. ICLR 2024 · arXiv:2306.09896. https://arxiv.org/abs/2306.09896
[3] Cobbe, K. et al. (2021). “Training Verifiers to Solve Math Word Problems”. arXiv preprint · arXiv:2110.14168. https://arxiv.org/abs/2110.14168
[4] Lightman, H. et al. (2023). “Let’s Verify Step by Step”. ICLR 2024 · arXiv:2305.20050. https://arxiv.org/abs/2305.20050
[5] Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. NeurIPS 2023 Datasets & Benchmarks · arXiv:2306.05685. https://arxiv.org/abs/2306.05685
[6] Panickssery, A. et al. (2024). “LLM Evaluators Recognize and Favor Their Own Generations”. NeurIPS 2024 · arXiv:2404.13076. https://arxiv.org/abs/2404.13076
[7] Du, Y. et al. (2023). “Improving Factuality and Reasoning in Language Models through Multiagent Debate”. ICML 2024 · arXiv:2305.14325. https://arxiv.org/abs/2305.14325
[8] McAleese, N. et al. (OpenAI) (2024). “LLM Critics Help Catch LLM Bugs”. arXiv preprint · arXiv:2407.00215. https://arxiv.org/abs/2407.00215
[9] Chen, X. et al. (2023). “Teaching Large Language Models to Self-Debug”. ICLR 2024 · arXiv:2304.05128. https://arxiv.org/abs/2304.05128
[10] Madaan, A. et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback”. NeurIPS 2023 · arXiv:2303.17651. https://arxiv.org/abs/2303.17651
[11] Kamoi, R. et al. (2024). “When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs”. TACL vol. 12 · arXiv:2406.01297. https://arxiv.org/abs/2406.01297
[12] Dietterich, T. G. (2000). “Ensemble Methods in Machine Learning”. Multiple Classifier Systems (MCS 2000), LNCS 1857, Springer. https://link.springer.com/chapter/10.1007/3-540-45014-9_1
[13] Ridnik, T. et al. (2024). “Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering”. arXiv preprint · arXiv:2401.08500. https://arxiv.org/abs/2401.08500
[14] Hong, S. et al. (2023). “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”. ICLR 2024 (oral) · arXiv:2308.00352. https://arxiv.org/abs/2308.00352
[15] Song, Y. et al. (2024). “Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models”. ICLR 2025 · arXiv:2412.02674. https://arxiv.org/abs/2412.02674
