Part 3
This is part 3 of Building Ambi — the story of building a personal AI assistant three times over. Start with Part 1 or Part 2.

I didn’t half-build this. I gave it a proper pipeline, built on GitHub’s Spec Kit: constitution, spec, plan, tasks, implement — and then my own automated test phase bolted on the end, which gave it up to three cycles to fix its own failures before giving up. Not “ask the model to write a file,” but a whole process, the kind you’d put a junior engineer through.
It worked in a demo. Then, a week or two later, it would do something unhinged. And to be fair to Spec Kit, the fault was never in the spec-first stages — the specs, plans and task lists came out consistently well. The trouble always started in my end of the pipeline: the automated test phase. Ruff would catch the syntax issues, and those mostly got fixed. It was the unit tests that kept hanging — and when your test phase hangs, the whole autonomous loop hangs with it.
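For anyone building something similar, here is a minimal sketch of a test phase that cannot hang the loop. It is not the code Ambi ran. The pytest command, the 120-second limit, and the `attempt_fix` callback (a stand-in for "hand the failure output back to the model") are all assumptions for illustration. Only the three-cycle limit comes from the pipeline described above.

```python
import subprocess
import sys
from typing import Callable, Sequence

MAX_FIX_CYCLES = 3          # from the post: up to three attempts to fix failures
TEST_TIMEOUT_SECONDS = 120  # assumption: pick a limit that suits your suite
DEFAULT_TEST_CMD = (sys.executable, "-m", "pytest", "-q")  # assumption


def run_with_timeout(cmd: Sequence[str], timeout: float) -> tuple[bool, str]:
    """Run a command and report a hang as a failure instead of waiting forever."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False, f"timed out after {timeout}s"
    return result.returncode == 0, result.stdout + result.stderr


def run_test_phase(
    test_cmd: Sequence[str],
    attempt_fix: Callable[[str], None],
    timeout: float = TEST_TIMEOUT_SECONDS,
    max_cycles: int = MAX_FIX_CYCLES,
) -> bool:
    """Run the tests, allowing a bounded number of fix-and-retry cycles."""
    passed, output = run_with_timeout(test_cmd, timeout)
    cycles_used = 0
    while not passed and cycles_used < max_cycles:
        attempt_fix(output)  # hypothetical hook: the agent edits the code here
        cycles_used += 1
        passed, output = run_with_timeout(test_cmd, timeout)
    return passed
```

The point of the design is that both loops are bounded: a stuck test becomes an ordinary failure with a message, and a failure that survives three fix attempts ends the phase instead of spinning. One caveat: the timeout only kills the direct child process, so tests that spawn their own subprocesses may need extra cleanup.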
I don’t have to dress this up, because my own commit history tells it. First the hopeful one, the eight-stage autonomous pipeline going in. Then the cracks: “autonomous builder breaking main.” “resolve test timeouts and silent failures blocking autonomous builder.” “suppress autonomous builder push notifications when no build runs.” And finally the quiet surrender: “HITL approval gates for autonomous builder.” HITL is human-in-the-loop, which is the polite way of admitting I could no longer let the thing run without standing over its shoulder.
The hard part was never the code.
Here’s what I wish someone had told me. The hard part of a self-coding agent isn’t whether the model can write code. It can, easily. It’s everything around the code.
Verification, first. “The tests pass” is not “the feature works and nothing else broke,” and an agent given a goal and a way to grade itself gets very good at making the grade go green rather than being right. Then there’s blast radius. A bad pull request from a human gets caught in review; a bad one from an autonomous builder merges itself at 3am and breaks main while you sleep. Then the thing nobody likes to say out loud: it games you. Hand it a metric and it finds the cheapest route to “done” — which in my case included learning to suppress its own failure notifications. And all of it compounds, because every feature it writes becomes the ground the next feature stands on, so small wrongness never stays small.
In practice, most of what the builder produced never got that far. It either ended up shelved in a branch, or Ambi stopped and asked me to intervene. The repo slowly filled with the leftovers of half-completed features. It felt like a graveyard.
The HITL gates were the honest ending. The agent could propose; a human had to approve. Genuinely useful, but it’s not the dream. It’s a faster way to generate things you still have to check yourself.
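The shape of such a gate is small enough to sketch. This is an illustration under assumptions, not Ambi's implementation: `Proposal`, `Decision`, `ask_human` and `merge` are hypothetical names, and the real approval step could be a chat message, a pull request review, or anything else that needs a person.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Proposal:
    branch: str         # where the agent put its work
    summary: str        # what it claims to have done
    tests_passed: bool  # result of the automated test phase


def approval_gate(
    proposal: Proposal,
    ask_human: Callable[[Proposal], Decision],  # hypothetical approval hook
    merge: Callable[[Proposal], None],          # hypothetical merge hook
) -> bool:
    """Merge only on explicit human approval; everything else blocks."""
    if not proposal.tests_passed:
        return False  # never bother the human with a red build
    if ask_human(proposal) is not Decision.APPROVED:
        return False  # fail closed: no answer or a rejection means no merge
    merge(proposal)
    return True
```

The gate fails closed: the agent holds no path to `merge` that does not pass through `ask_human`, which is the whole difference between proposing and shipping.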
I’m telling this one because self-coding agents are having a moment, and the demos are intoxicating. Mine taught me the demo is the easy ten percent. The ninety percent is verification, isolation, and refusing to trust “done” — and back then, I hadn’t cracked it. I’ve since found a much better shape for letting an agent grow its own abilities, but that’s a story for later in this series. If you’re chasing it, build the cage before the magic, and don’t confuse “it wrote the code” with “the work is done.”
Next up: the obvious fix that wasn’t — I copied the whole project, deleted half of it, and it was still unstable.