How hundreds of AI agents migrate a codebase safely — and why it isn't about trusting the model
TL;DR: When a team runs a swarm of AI agents to migrate a large codebase, what keeps it safe isn't a smarter model or more human reviewers — it's an architecture where no single agent decides whether its own work is correct. A deterministic check (does it compile, do the tests pass, does the behavior match the spec) makes that call, the same way on the first file and the eight-hundredth. Below: why one agent fails, the three-role pattern that works, the four-step loop, and where a human still belongs.
Why does one agent fail on a big codebase?
One agent loses track of context, invents APIs, and produces code that won't compile across a large project. The fix is splitting the work into roles, not using a bigger model.
Google's AI and Infrastructure team reported roughly a 6x speedup — 6.4 to 8x on real production models — migrating machine-learning code from TensorFlow to JAX. The headline number traveled; the interesting part didn't.
They tried the obvious thing first: point one capable agent at the repository, and it failed outright. That failure is worth sitting with, because every part of the working pattern exists to patch one of its three causes: context that overflows as the job grows, files migrated in the wrong order, and an agent grading its own homework with the same blind spots that caused the mistake. A bigger model doesn't fix any of those — a different architecture does.
What does the working architecture look like?
Three roles with genuinely different jobs, not one generalist agent wearing three hats.
A planner that doesn't use the model to plan. Deciding which file to migrate first isn't a creative question; it's a dependency graph with a correct answer. Google's planner uses compiler-based static analysis to build that graph and work upward from the leaf files (the ones that depend on nothing else), so each file is migrated only after the files it relies on. The rule underneath it: if a problem has an algorithmic answer, use the algorithm, and save the language model for the part that actually needs judgment.
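As a minimal sketch of that ordering step, the standard-library `graphlib` module can topologically sort a dependency map so that dependencies come before the files that use them. The file names and the hand-written `deps` map are hypothetical; a real planner would derive the map from static analysis, and this is not Google's implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each file -> the set of files it imports.
# In practice this would be produced by static analysis, not typed by hand.
deps = {
    "train.py": {"model.py", "data.py"},
    "model.py": {"layers.py"},
    "data.py": set(),
    "layers.py": set(),
}


def migration_order(deps):
    """Return files ordered so each one appears after everything it imports."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = migration_order(deps)
print(order)
```

Files that import nothing come out first and `train.py` comes out last. A cycle raises an error rather than producing a silently wrong order, which is the behavior a planner wants.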
Migration agents that only transform code. Each one gets a single scoped task — ideally one file or one tight module — its own checkout, and a brief describing exactly what to change. It does the edits and hands back a patch — and that's where its authority ends. Whether the patch is actually correct is not its call to make. The architecture is designed so it can't quietly take that decision on.
A verification layer the agents can't talk their way around. The migrated code gets built, run, and tested in an isolated environment before anything is accepted. "Done" becomes a mechanical fact — it compiled, the tests passed, the behavior matches the spec — instead of an agent's opinion that it worked.
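One way to picture "done as a mechanical fact" is a gate that runs a list of check commands and accepts only if every one exits with status 0. The sketch below assumes the checks can be expressed as command lines. The `verify` function and the stand-in command are illustrative, and on its own this does not provide the isolated environment described above.

```python
import subprocess
import sys


def verify(commands, cwd=None, timeout=600):
    """Run each check command in order; pass only if all exit with status 0.

    Returns (passed, log). On failure, log holds the failing command's output.
    """
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out: {cmd}"
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""


# Stand-in check; a real gate would run the project's build and test commands.
ok, log = verify([[sys.executable, "-c", "assert 1 + 1 == 2"]])
print(ok)
```

The agent's own report never enters the decision. Only exit codes do, so the same patch gets the same verdict every time.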
Why does each agent get its own sandbox?
So one agent's mistake breaks an isolated copy instead of the shared codebase — which is also what makes running hundreds of them in parallel safe.
Run hundreds of agents against one repo and the risk is obvious: one agent edits a file that another agent's half-finished change depends on, and you get a broken in-between state.
The fix is isolation by default. Each task runs in its own sandbox — often its own git worktree, a separate checkout scoped to that one task. If an agent breaks something, it breaks a copy, not the shared codebase and not another agent's work. That isolation is also what makes the whole thing safely parallel: there's no shared state for hundreds of agents to step on, so they can run at once.
The orchestration layer that hands out tasks and collects patches stays outside every sandbox. The trusted process with repo access and credentials is kept separate from the untrusted place where generated code runs. So even a bad patch — or a prompt-injection attempt inside a sandbox — has no path to anything outside it.
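A rough sketch of the per-task checkout: the orchestrator can build one `git worktree add` command per task, each on its own branch and path. The paths, the `migrate/` branch prefix, and the task ID format are hypothetical. A worktree only separates the working files, so running generated code safely still needs a real sandbox around it.

```python
from pathlib import Path


def worktree_add_command(repo, task_id, base="HEAD", root="/tmp/migration-worktrees"):
    """Build the git command that creates a separate checkout for one task."""
    path = Path(root) / task_id    # hypothetical layout: one directory per task
    branch = f"migrate/{task_id}"  # hypothetical branch naming scheme
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


cmd = worktree_add_command("/srv/repo", "task-0042")
print(" ".join(cmd))
```

The function only builds the argument list. Passing it to `subprocess.run` from the orchestrator keeps repository access in the trusted process rather than in the agent.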
What does the migration loop actually look like?
Each task repeats one four-step cycle until it passes or hits a hard limit:
- Plan the change. The migration agent takes its one file and target spec and produces a candidate patch. This is the only step where the model's judgment is doing the real work.
- Build the environment. A separate step stands up the dependencies and config so the patch can actually be compiled and run, not judged in the abstract.
- Run the tests. The code runs against the real test suite (or a generated equivalent if none exists). Pass or fail — no room to argue.
- Refine on failure. The specific failures and logs go back to the agent that owns the patch as targeted input, not a vague "try again."
One published version of this loop sets a stop condition — "no files match the old pattern anymore" — plus a ceiling of 200 iterations, past which it halts on its own. That cap matters: a task that's structurally impossible should surface as a bounded failure a person can look at, not burn resources forever or, worse, quietly report success it didn't earn.
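The cycle and its cap can be sketched as a bounded loop in which failure logs feed the next attempt and a task that never passes is handed to a person instead of reporting success. The `propose` and `verify` callables and the `Outcome` type are hypothetical stand-ins for the agent and the build-and-test step; only the default of 200 comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    status: str      # "passed" or "needs_human"
    iterations: int  # how many attempts were made
    last_log: str    # failure output from the final attempt, if any


def run_task(propose, verify, max_iterations=200):
    """Repeat propose -> verify until the check passes or the cap is hit."""
    feedback = None  # the first attempt has no failure log to work from
    log = ""
    for i in range(1, max_iterations + 1):
        patch = propose(feedback)     # the agent produces a candidate patch
        passed, log = verify(patch)   # the deterministic check decides
        if passed:
            return Outcome("passed", i, "")
        feedback = log                # specific failures go back to the agent
    return Outcome("needs_human", max_iterations, log)


# Toy stand-ins: the third attempt is the first one that passes.
attempts = []


def fake_propose(feedback):
    attempts.append(feedback)
    return len(attempts)


def fake_verify(patch):
    return patch >= 3, f"attempt {patch} failed"


result = run_task(fake_propose, fake_verify)
print(result)
```

The loop has exactly two exits: a verified pass, or a bounded failure that carries its last log with it.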
Does this really run with no human review?
No — and that's the part that gets oversold. No human reads every diff, but humans still review the changes flagged risky. "No reviewer per diff" is not "no humans."
To be exact: the claim isn't that humans are removed. It's that no human reads every individual diff — because asking one person to genuinely scrutinize hundreds of repetitive changes at the speed a swarm produces them doesn't get you safety, it gets you rubber-stamping.
What replaces that isn't less checking — it's a different kind that scales. Deterministic build-and-test verification applies the same rigor to diff 1 and diff 800, with no attention fatigue. The safety comes from removing the single point of judgment ("the agent says it works"), not from trusting the agents more.
And humans stay exactly where they add value the machine can't. In practice it's tiered: the bulk of small, well-isolated files flow through the automated gate untouched, while files flagged complex, deeply connected, or business-critical route to a person before merging. Studies of agent-driven migrations find the same limit — agents reliably handle the mechanical API changes but struggle to preserve behavior in the genuinely complex cases, which is precisely the subset worth a human's time.
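The tiering can be written as a small routing rule applied after a patch passes the automated checks. The field names and the `max_dependents` threshold below are assumptions made for illustration; the source does not say how files get flagged.

```python
def route(task, max_dependents=10):
    """Decide whether a passing patch merges automatically or goes to a person."""
    risky = (
        task.get("business_critical", False)
        or task.get("complex", False)
        or task.get("dependents", 0) > max_dependents  # "deeply connected"
    )
    return "human_review" if risky else "automated_gate"


print(route({"path": "utils/strings.py", "dependents": 2}))
print(route({"path": "billing/ledger.py", "business_critical": True}))
```

Keeping the rule explicit means the decision about who reviews what is itself deterministic and easy to audit.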
How do I build a smaller version of this?
Use the same three roles and the same four-step loop, just with fewer agents. The pattern scales down cleanly, whether it's 800 parallel agents or a handful of sequential tasks:
- Start with the planner, and don't make it the model. If your migration has any dependency structure, build the graph with real static-analysis tooling for your language. Every later step inherits its mistakes, so this is the highest-leverage decision you'll make.
- Scope each task as narrowly as the graph allows. One file or one bounded module, with an explicit brief. Narrow scope is what makes the pass/fail judgment unambiguous.
- Isolate every task. A fresh git worktree per task at minimum. This is what makes a mistake contained instead of contagious.
- Build the verification before you run a single agent. Compilation, tests, behavioral checks where you can write them. This is the piece most homemade attempts skip, and the reason they fall over at scale.
- Cap the retry loop with a number, not a vibe. Decide what "stuck" means up front and auto-route stuck tasks to a human.
It scales down even further than that. You don't need a swarm, or even a migration, for the lesson to pay off. In a single Claude Code session, the same rule applies: don't accept the model's own "done"; give it a step that actually runs the tests or builds the code, and let that be the judge. The verifier is the part that makes any agent loop trustworthy, whether it's one agent or eight hundred.
So what's the actual lesson?
The headcount — 8 agents or 800 — is the least interesting part. What makes it work is that none of those agents is trusted to judge its own output, and the system is designed around that distrust rather than in spite of it. The checking didn't disappear; it moved into compiler checks, test suites, and dependency graphs that give the same answer every time. That's the whole move: not bolder trust in a bigger model, but a structural rule that nothing — person or agent — gets to sign off on its own output.
FAQ
Can AI agents migrate a codebase with no human involved at all? No. A swarm can handle the bulk of mechanical, low-risk changes through an automated build-and-test gate, but files flagged complex or business-critical still route to a human before merging. "No reviewer per diff" is not "no humans."
Why not just use one large-context model instead of many agents? A single agent on a large, interdependent codebase loses context as the work accumulates, can hallucinate APIs, and grades its own output with the same blind spots that caused any mistakes. Splitting the job into planning, transformation, and verification roles fixes that; a bigger model alone doesn't.
What stops the agents from breaking each other's work? Each task runs in its own sandbox — typically a separate git worktree. A mistake breaks an isolated copy, not the shared repo, which is also what makes running many agents in parallel safe.
What decides when a migration is "done"? A deterministic verification step: the code compiles, the test suite passes, and behavior matches the spec. The agent that wrote the patch doesn't get to decide it's correct.
Does this only work at Google's scale? No. The same three roles (planner, migration agent, verification), with environment setup as a step in the loop, work for a handful of sequential tasks against a single repo. Start with a static-analysis planner, scope tasks narrowly, isolate each one, and build the verification first.
Sources: Google Cloud blog — 6x faster migration from TensorFlow to JAX; arXiv 2603.27296 — A Multi-agent AI System for Deep Learning Model Migration. The "dark factory" label from the source article is one writer's coinage, not Google's term — left out here. Facts verified 2026-06-23; figures and the 200-iteration example are as reported.