Agent Evals
When we started building i3ml, the obvious move was one giant model in a loop. It worked — for toys. The moment we shipped something a real team would touch, the cracks showed: lost context, half-written tests, and pull requests no human wanted to review.
Splitting the work across a crew changed everything. The Architect plans, the Builder writes, QA tests, and the Reviewer signs off. Each agent runs on the model that's best for its job: not the cheapest, not the trendiest, just the right one.
The hardest part wasn't the agents. It was the seams between them. We rewrote the message bus three times before it felt right. The breakthrough was treating every handoff like a code review: typed, diffable, and rejectable.
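To make "typed, diffable, and rejectable" concrete, here is a minimal sketch of what such a handoff could look like. It is not i3ml's actual message bus. The names `Handoff`, `Verdict`, `diff_handoffs`, and `review`, along with the payload fields, are hypothetical and chosen only for illustration.

```python
import difflib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Handoff:
    """Typed: every message between agents has the same explicit shape."""
    sender: str      # e.g. "Builder" (hypothetical field)
    receiver: str    # e.g. "QA" (hypothetical field)
    payload: dict    # the work product being handed over


@dataclass(frozen=True)
class Verdict:
    """Rejectable: the receiver answers with accept or reject plus a reason."""
    accepted: bool
    reason: str = ""


def diff_handoffs(old: Handoff, new: Handoff) -> list[str]:
    """Diffable: compare two handoffs as a unified diff of sorted JSON."""
    before = json.dumps(asdict(old), indent=2, sort_keys=True).splitlines()
    after = json.dumps(asdict(new), indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(before, after, "previous", "revised", lineterm="")
    )


def review(handoff: Handoff, required_keys: set[str]) -> Verdict:
    """Reject a handoff whose payload lacks any field the receiver needs."""
    missing = required_keys.difference(handoff.payload)
    if missing:
        return Verdict(False, f"missing fields: {sorted(missing)}")
    return Verdict(True)


if __name__ == "__main__":
    draft = Handoff("Builder", "QA", {"patch": "fix parser"})
    print(review(draft, {"patch", "tests"}))   # rejected: no "tests" field

    revised = Handoff("Builder", "QA", {"patch": "fix parser", "tests": "test_parser.py"})
    print(review(revised, {"patch", "tests"}))  # accepted
    print("\n".join(diff_handoffs(draft, revised)))
```

The design choice this illustrates: a rejected handoff goes back to the sender with a reason, and the revised version can be compared against the previous one line by line, the same way a reviewer reads a second push to a pull request.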
Today, the crew ships 400+ production builds a day. We still find edge cases. We still fix them. But the foundation finally feels like one a team can trust.
"Multi-agent isn't a feature. It's a structural choice that compounds."