James Wallace: we can now do evals on agent software engineering processes. (link)
Come up with a software engineering task
Set up 3 different engineering processes
Run the task with agents (let's say 10 times each)
Compare output
For the first time in human history we can run real experiments on software engineering processes: the same project, implemented by the same team of agents, with only the process differing, to see which software engineering techniques actually work 🤯
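To make the shape concrete, here's a minimal sketch of such a harness, assuming a task repo with a TASK.md and a test suite, three hypothetical process description files, and the claude CLI's non-interactive print mode (-p). Treat it as a sketch of the idea, not a working eval setup.

```php
<?php
// Sketch: run the same task under three different engineering processes,
// N times each, and score each run by whether the test suite passes.
// File names, the phpunit path, and the prompt wording are all assumptions.

$processes = ['process-a.md', 'process-b.md', 'process-c.md'];
$runsPerProcess = 10;
$results = [];

foreach ($processes as $process) {
    $passes = 0;
    for ($run = 1; $run <= $runsPerProcess; $run++) {
        exec('git checkout -- . && git clean -fd');   // reset the task repo between runs

        $prompt = "Implement the task in TASK.md following this process:\n"
            . file_get_contents($process);

        // Run the agent non-interactively (in practice you'd also grant it
        // the tool permissions it needs to edit files and run commands).
        exec('claude -p ' . escapeshellarg($prompt));

        // Evaluate the output: here, simply "does the test suite pass".
        $output = [];
        exec('./vendor/bin/phpunit', $output, $exitCode);
        if ($exitCode === 0) {
            $passes++;
        }
    }
    $results[$process] = "$passes/$runsPerProcess runs passed";
}

print_r($results);
```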
Related: how do we do evals of the Claude instructions that we build (the instruction scaffolding around our agents)? I know at least for Laravel Boost, they set up a bunch of test projects and did manual evals.
Writing and reading specs was always too hard for humans. I've written long technical specs. Nobody reads them. Nobody likes to read them. And they are out of date immediately.
LLMs do love to write and read though.
With today's models and coding agents, workflow > model
As in, a strong workflow beats a stronger model.
There is no way Anthropic isn't doubling down on Claude Code for non-coders. It's by far the most effective agent right now, their model works really well with it, and it's just screaming to be adapted for other use cases.
Which frameworks have an official MCP server?
Laravel. Laravel Boost just came out, and it is very good. (link)
You install it.
It picks up the versions of all the Laravel ecosystem tools and packages you use from your composer.json file (Composer is the PHP package manager).
Then it provides an MCP server that gives your agent (any of them) access to documentation search (search instructions included) for the correct versions of all the packages you use.
That's just incredibly useful. Which other frameworks have MCP servers for their docs?
The pattern I've landed on with Claude Code for larger features is: first write a spec together, then have it go through the spec and implement.
Specs are easy to read ("keep it concise"), easy to edit, and it's easy to see what they missed.
I ask it to put them in /claude/specs/
If there are database fields to be created, I do think that through myself and tell it ("create a model Post with fields title, published (boolean) and text")
Plus, you have the spec to reference if you want to build on this in the future.
It works incredibly well
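For reference, the boring bit that last prompt saves me from typing is just a standard Laravel migration; a sketch of roughly what the agent generates for that Post example (the published default is my assumption):

```php
<?php
// Sketch of a Laravel migration for "create a model Post with fields
// title, published (boolean) and text". The published default is assumed.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('posts', function (Blueprint $table) {
            $table->id();
            $table->string('title');
            $table->boolean('published')->default(false);
            $table->text('text');
            $table->timestamps();
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('posts');
    }
};
```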
Gradient text is a fun trick, just added some to my upcoming Model Context Experience site. Claude was, as usual, very helpful.
Claude Code system prompt
The Claude Code system prompt here is (as usual) fascinating.
There are a lot of instructions on things it got wrong out of the box, like these:
"Do what has been asked; nothing more, nothing less."
"NEVER create files unless they're absolutely necessary for achieving your goal."
"ALWAYS prefer editing an existing file to creating a new one."
but mostly it's surprising how few instructions there really are.
40% of the prompt are tool use instructions.
They're pretty standard, so let's skip those. There are some interesting bits like
"ALWAYS prefer editing an existing file to creating a new one".
20% are development workflows.
There's a lot of stuff about Git workflows and testing.
"VERY IMPORTANT: run the lint and typecheck commands if they were provided"
Another 25% are behavioral instructions. Clearly developed while using it. Things like
"You are allowed to be proactive, but only when the user asks you to do something"
Then there are some I found particularly interesting:
"You MUST answer concisely with fewer than 4 lines of text"
"One word answers are best" <- HA!
"Avoid introductions, conclusions, and explanations"
"DO NOT ADD ***ANY*** COMMENTS unless asked"
"You should NOT answer with unnecessary preamble or postamble"
"Do not add additional code explanation summary unless requested"
And finally, a ton of examples.
What I'm thinking:
Reading system prompts is a great way to develop intuition on how to better use these tools.
It's also a great way to get better at writing prompts. (For example, add examples.)
But mostly, I'd LOVE to learn how the evals for this are set up.
Selecting a tech stack when LLMs are part of your team
I'm hearing from quite a few people that tech stack selection (which technology you build your new thing on) is being influenced by AI. If a specific technology (say, Python or React) is popular and widely used, LLMs have been trained on it extensively and are therefore likely to be good at it. Teams are starting to take that into account as a tech stack selection criterion.
It kind of makes sense. Selecting a tech stack is typically based on both your needs and the existing skillset of your team (the people who have to use and maintain the tech). If you think of LLM agents as a growing part of your team, it makes sense to take their skills into account too.
I wonder when that starts bleeding over into non-engineering use cases?
Gemini 2.5 Pro is really good at transcribing large and complex PDFs (it transcribes tables beautifully etc), but it only does the first 10 or so pages. Then it says "Due to the extensive length and dense data in the remaining 400+ pages, a complete verbatim transcription is not feasible here." Is there a better way?
Now that we're all figuring out how to use Claude Code, Conductor built a UI that makes it easy to run multiple Claude Code instances in parallel. I can totally see multiple UIs being built on top of Claude Code - it's basically the best agent experience out there right now, but pretty geeky. (link)
I do believe Simon coined "vibe scraping" (and he's coined quite a few AI-y words) - vibe coding something that scrapes a website for data. (link)
More than the year of agents, this feels like the year of evals.
SQLite feels like the perfect working-memory container for agents. Small, self-contained in one file, powerful, well understood by LLMs. LLMs can save stuff in there in between sessions, it's structured, it's powerful, you can take it with you.
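As a minimal sketch of the idea, using PHP's built-in PDO SQLite driver (the file name and schema are made up for illustration):

```php
<?php
// Sketch: SQLite as persistent working memory for an agent.
// One file on disk, one table of notes the agent writes for its future self.

$db = new PDO('sqlite:' . __DIR__ . '/agent-memory.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS memory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    note TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)');

// End of a session: persist what was learned.
$insert = $db->prepare('INSERT INTO memory (topic, note) VALUES (?, ?)');
$insert->execute(['deploy', 'Staging deploys require restarting the queue worker.']);

// Start of the next session: pull the relevant notes back into context.
$select = $db->prepare('SELECT note FROM memory WHERE topic = ? ORDER BY created_at DESC');
$select->execute(['deploy']);
foreach ($select->fetchAll(PDO::FETCH_COLUMN) as $note) {
    echo "- $note\n";
}
```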
Reinforcement Learning from Human Feedback, a free online book by Nathan Lambert, is a treasure trove of information. As an example, how are chatbots trained on personality?
Simon: "My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect." Right now, in coding, that is very true. Then again, engineers are used to investing in their own productivity. They call it DX, developer experience. And they will spend as much time as available customizing their work setup, so customizing Claude Code, as an example, is a natural fit. (link)
I wonder what other professional groups are similar in that way - used to investing lots of time in optimizing their own productivity.
Living in Spain, I get a lot of official emails in Spanish, and I'm really enjoying the Gmail Gemini summaries at the top. My Spanish is good but those email threads are often a lot. Same for long emails from the kids' schools. The summaries help me be confident that I didn't miss anything. (link)
Simon Willison has been saying (and showing) that this might be a great time to start blogging again, so after 10 years or so, I've revived my blog. I expect I'll write mostly about AI and climate, but we'll see. I had (of course) to build my own blogging software, which took a day or so with a little help from Claude. It's really custom to what I like. (link)
As an example of how I'm using Claude, it's roughly in the "talk to a junior engineer" style that works really well right now. It's fairly specific, but I don't have to write the boring bits myself. For example, I just wrote this:
Create a migration that inserts the first category in the db called "Other", and adjust the livewire component for creating new posts to select the first (id=1) category as default.
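What comes back is roughly a small data migration plus a one-line default in the Livewire component; a sketch of the migration (table and column names are my guesses, not the actual generated code):

```php
<?php
// Sketch of a data migration that seeds the first category, "Other".

use Illuminate\Database\Migrations\Migration;
use Illuminate\Support\Facades\DB;

return new class extends Migration
{
    public function up(): void
    {
        DB::table('categories')->insert(['name' => 'Other']);
    }

    public function down(): void
    {
        DB::table('categories')->where('name', 'Other')->delete();
    }
};
```

The Livewire side is then just defaulting the component's category property to 1, for example in mount().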