James Wallace: we can now do evals on agent software engineering processes. (link)
1. Come up with a software engineering task
2. Set up 3 different engineering processes
3. Run the task with agents (let's say 10 times each)
4. Compare the outputs
For the first time in human history we can run real experiments on software engineering processes: the same project, implemented by the same team of agents, with only the process differing, to see which software engineering techniques actually work 🤯
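A minimal sketch of what such an experiment harness might look like, assuming you have some way to run one agent attempt and score it (e.g. by test pass rate). None of these names come from the original post; `run_attempt` and the process names are purely illustrative:

```python
import statistics
from typing import Callable

def compare_processes(
    run_attempt: Callable[[str], float],   # process name -> score for one attempt, e.g. test pass rate
    processes: list[str],
    runs_per_process: int = 10,
) -> dict[str, tuple[float, float]]:
    """Run each process `runs_per_process` times and report mean and stdev of scores."""
    summary = {}
    for process in processes:
        scores = [run_attempt(process) for _ in range(runs_per_process)]
        summary[process] = (statistics.mean(scores), statistics.stdev(scores))
    return summary

if __name__ == "__main__":
    import random
    # Stand-in for real agent runs: random scores, just to show the shape of the output.
    demo = compare_processes(lambda p: random.random(), ["tdd", "plan-first", "yolo"], 10)
    for name, (mean, stdev) in demo.items():
        print(f"{name}: mean={mean:.2f} stdev={stdev:.2f}")
```

The interesting part is everything hidden behind `run_attempt`: same task, same agent setup, only the process instructions differ between runs.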
Related: how do we do evals of the Claude instructions we build (the instruction scaffolding around our agents)? I know that for Laravel Boost, at least, they set up a bunch of test projects and did manual evals.
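A hypothetical sketch of that "bunch of test projects, manual evals" approach: run every instruction variant against every test project, keep the transcripts, and dump a grading sheet for humans to fill in. All file and project names here are made up for illustration, not taken from Laravel Boost:

```python
import csv
import itertools
from pathlib import Path

# Hypothetical instruction variants and test projects.
INSTRUCTION_VARIANTS = ["baseline.md", "with-conventions.md", "with-examples.md"]
TEST_PROJECTS = ["blog-crud", "api-auth", "queue-refactor"]

def run_agent_on(project: str, instructions: str) -> str:
    """Run the agent on a test project with a given instruction file; return a transcript path."""
    # Placeholder for however you actually invoke your agent.
    return f"transcripts/{project}--{Path(instructions).stem}.txt"

def build_grading_sheet(out_path: str = "grading.csv") -> None:
    """Write one row per (project, variant) pair with empty grade/notes columns for manual review."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["project", "instructions", "transcript", "grade", "notes"])
        for project, instructions in itertools.product(TEST_PROJECTS, INSTRUCTION_VARIANTS):
            writer.writerow([project, instructions, run_agent_on(project, instructions), "", ""])

if __name__ == "__main__":
    build_grading_sheet()
```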
# Aug 30, 2025