Engineering | May 14, 2026 | 6 min read

Agent Evals

i3ml Team

i3ml.ai

When we started building i3ml, the obvious move was one giant model in a loop. It worked — for toys. The moment we shipped something a real team would touch, the cracks showed: lost context, half-written tests, and pull requests no human wanted to review.

Splitting the work across a crew changed everything. The Architect plans, the Builder writes, QA tests, and the Reviewer signs off. Each agent runs on the model that's actually best for its job — not the cheapest, not the trendiest. Just the right one.

The hardest part wasn't the agents. It was the seams between them. We rewrote the message bus three times before it felt right. The breakthrough was treating every handoff like a code review: typed, diffable, and rejectable.
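
To make "typed, diffable, and rejectable" concrete, here is a minimal Python sketch of what such a handoff could look like. It is an illustration under assumptions, not i3ml's actual message bus: the names `Role`, `Handoff`, `Verdict`, and `review` are hypothetical, and the acceptance rule is a deliberately simple stand-in.

```python
from dataclasses import dataclass
from difflib import unified_diff
from enum import Enum


class Role(Enum):
    """The four crew roles named in the post."""
    ARCHITECT = "architect"
    BUILDER = "builder"
    QA = "qa"
    REVIEWER = "reviewer"


@dataclass(frozen=True)
class Handoff:
    """Typed: a fixed, immutable schema (hypothetical) for one handoff."""
    sender: Role
    receiver: Role
    before: str  # artifact as the sender received it
    after: str   # artifact as the sender hands it on

    def diff(self) -> str:
        """Diffable: render the change as a unified diff."""
        return "".join(
            unified_diff(
                self.before.splitlines(keepends=True),
                self.after.splitlines(keepends=True),
                fromfile="before",
                tofile="after",
            )
        )


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    reason: str = ""


def review(handoff: Handoff) -> Verdict:
    """Rejectable: the receiver may refuse a handoff (toy rule, assumed)."""
    if not handoff.after.strip():
        return Verdict(False, "empty artifact")
    if handoff.before == handoff.after:
        return Verdict(False, "no changes")
    return Verdict(True)


if __name__ == "__main__":
    h = Handoff(
        sender=Role.BUILDER,
        receiver=Role.QA,
        before="def add(a, b):\n    pass\n",
        after="def add(a, b):\n    return a + b\n",
    )
    print(h.diff())
    print(review(h))
```

In this sketch the receiver never gets a free-form blob. It gets a fixed schema it can validate, a diff it can read, and a verdict it can send back with a reason, which is the same contract a human reviewer has on a pull request.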

Today, the crew ships 400+ production builds a day. We still find edge cases. We still fix them. But the foundation finally feels like one a team can trust.

"Multi-agent isn't a feature. It's a structural choice that compounds."
