My Agentic Coding Process, Summer 2025
Around June 2025 I had time on my hands, so I tackled what I viewed as the next set of important problems with agentic coding. In my view, the problems were:
- To make good use of my time, by having the agent run and only involve me when there's a problem.
- To elicit good (...good enough) performance from models, without continuous monitoring.
- To avoid "collapse."
Make Good Use of My Time
Once a model is a much better engineer than you, sure, you should let it churn out source code and trust it to judge whether that code is good. In 2025, models got better and more marginal engineers embraced the tools, so it follows that people increasingly handed the reins to coding agents. I'm an AI positivist prepared to do that at some point. But for strong engineers, it doesn't make sense to delegate everything to agents yet.
If the human is going to be involved, the question is: how do you use their time effectively?
For me, circa northern hemisphere summer 2025, that means getting out of the inner loop without getting out of the loop entirely. To work on the parts of the problem that complement the model's capabilities, then be free to walk away. Not waiting for the model to churn.
Elicit Good (Enough) Performance from Models
Models are steadily improving. The context systems around them are fitfully improving. But we can't throw arbitrary problems at these systems and expect good results. So what are we going to do while we wait for models to improve?
Classifying errors, then understanding and attacking their root causes, sounds dangerously like real work. Work which may be eclipsed by model improvements anyway.
Instead, we should work on this key problem: How do you pick tasks that are "just right" for your chosen model's capacity for attention and reasoning? By the way, you can err on both sides. If you pick a task that's too hard, performance is bad, and it is easy to see that. But if you pick a task that's too easy, model performance is good, but you have subtly wasted human time dealing with a relatively trivial task.
Avoid "Collapse"
By "collapse", I mean these specific failure modes:
- The agent f*cks up the codebase. Progress stops.
- The agent doesn't explosively f*ck things up, but it goes in circles (break it, put it back, break it differently... often subtly degrading quality as it does so), or it gets distracted by other problems.
- The agent exudes a can-do attitude and claims problems are solved, but they are not.
My Meta-Coding System
Here's how I tackled these problems. I'll start with the what... a set of prompts and a tiny bit of Python driving the Claude Code SDK in the inner loop. The prompts explain three roles:
- Operator. Defines goals. Played by the human.
- Engineer. Realizes goals. Played by the agent.
- Compliance. Checks this process is being adhered to. A separate invocation of the agent.
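Concretely, the inner loop is just a couple of agent invocations, one per role, each with its own system prompt. Here's a minimal sketch of that shape, assuming the Python claude_code_sdk package's query() and ClaudeCodeOptions interface; the prompt file paths, the prompt wording, and the turn limit are placeholders, not my real prompts.

```python
# Minimal sketch of the inner loop: each role is a separate invocation of the
# agent with its own system prompt. Assumes the claude_code_sdk Python package
# (query / ClaudeCodeOptions); prompt paths and wording are placeholders.
import asyncio
from pathlib import Path

from claude_code_sdk import ClaudeCodeOptions, query

ENGINEER_PROMPT = Path("prompts/engineer.md").read_text()
COMPLIANCE_PROMPT = Path("prompts/compliance.md").read_text()


async def run_role(system_prompt: str, task: str, repo: Path) -> str:
    """Run one agent invocation inside the repo and return its transcript."""
    transcript: list[str] = []
    async for message in query(
        prompt=task,
        options=ClaudeCodeOptions(
            system_prompt=system_prompt,
            cwd=repo,
            max_turns=50,  # keep a single attempt bounded
        ),
    ):
        transcript.append(str(message))
    return "\n".join(transcript)


async def one_attempt(repo: Path) -> tuple[str, str]:
    """One engineer attempt, then a separate compliance review of the result."""
    engineering = await run_role(
        ENGINEER_PROMPT,
        "Read GOALS.md, pick one unmet goal, create a branch named after it, "
        "and pursue the goal.",
        repo,
    )
    verdict = await run_role(
        COMPLIANCE_PROMPT,
        "Read the latest commit message, the goal it names in GOALS.md, and the "
        "relevant tests. Decide whether the goal is genuinely achieved.",
        repo,
    )
    return engineering, verdict


if __name__ == "__main__":
    asyncio.run(one_attempt(Path(".")))
```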
The operator authors and checks in a file, GOALS.md. This is a bulleted list of requirements, in a tree structure. Requirements are named. They cover content ranging from something suitable for a product requirements document (PRD) that your product manager might produce, to technical requirements you might find in a design doc.
Here's an example of ONE goal related to an interactive visualization of a path planning system for an autonomous ship:
G-STAGE There is a 3D viewport which can be rotated using the mouse or touch and zoomed with pinch or the scrollwheel. The z=0 plane is suggested by subtle grid lines.
The view is centered on our ship in the simulation. Specifically, the camera is behind and above the ship, and is relative to the ship. This means as the ship moves and changes direction on its path, the camera makes compensatory pans and swings.
If the user reorients the camera while the simulation is playing, the simulation does not stop and there is no jank when they stop manipulating the view.
The engineer examines the codebase and the goals and picks a goal to pursue. This happens in a structured way:
- At the start of a goal attempt, create a git branch named after the goal.
- Solve the goal adhering to an engineering philosophy explained in the engineer's prompt. This philosophy is, roughly, the Google way of software engineering (at least as it was in the late 2000s and early 2010s when I worked there). That should be another blog post, but I'll briefly characterize it as objective, technical, critical, preferring cleanliness and consistency, and heavily emphasizing tests. There are some additions, though. For example, an added emphasis on making results visible to the engineer and the operator. This encourages practical, performance-enhancing tool use, like looking at screenshots, and ergonomic improvements for the operator, like storybooks and demo modes.
- Using objective measures of success--typically a claim that a unit test adequately captures a requirement--compliance scrutinizes whether a goal has been achieved. It literally reads the commit message to discover what goal is being attempted, reads the relevant goal, reads the relevant test to decide if it covers the relevant goal, and can run tests and examine output. It makes a judgement based on these facts.
When a goal is achieved, the engineer continues by selecting a new goal.
When a goal is not achieved, the engineer makes a new attempt at the goal. However, critically, they do this by creating a new git branch that resets them to the departure point. (The failed work remains in git as an evolutionary dead-end. Often interesting to inspect when improving the meta-process.)
After a failed attempt, the engineer may edit GOALS.md to sub-divide a goal into simpler parts. Given the context of an attempt that failed and the judgement of the compliance agent about why the effort failed, these goal decompositions turn out to be feasible.
If all goals are achieved, the project is done. Alternatively, after three successive failed attempts, the process halts and escalates to the operator.
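To make the outer control flow concrete, here is a sketch of the branch-per-attempt loop and the three-strike halt. It assumes the engineer invocation reports which goal it pursued (returning None once every goal is met) and that compliance returns a pass/fail verdict; both callables stand in for the agent calls sketched earlier, and the branch handling is simplified.

```python
# Sketch of the outer loop: remember the departure point, let the engineer
# attempt a goal on its own branch, ask compliance for a verdict, and on
# failure leave the dead-end branch in git and reset to the departure point.
# Three consecutive failures halt the process and escalate to the operator.
import subprocess
from typing import Callable, Optional


def git(*args: str, repo: str = ".") -> str:
    out = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return out.stdout.strip()


def run_project(
    attempt_goal: Callable[[str], Optional[str]],  # engineer: goal name, or None if all goals are met
    check_compliance: Callable[[str], bool],       # compliance: did the attempt meet the goal?
    repo: str = ".",
    max_consecutive_failures: int = 3,
) -> None:
    failures = 0
    while True:
        departure = git("rev-parse", "HEAD", repo=repo)  # reset point for failed attempts
        goal = attempt_goal(repo)  # may also subdivide goals in GOALS.md after a failure
        if goal is None:
            print("All goals achieved; project done.")
            return
        if check_compliance(repo):
            failures = 0  # success: keep the branch and carry on from its tip
            continue
        failures += 1
        if failures >= max_consecutive_failures:
            raise RuntimeError(
                f"Three failed attempts in a row (last goal: {goal}); "
                "halting and escalating to the operator."
            )
        # The failed branch stays in git as an evolutionary dead-end. Detached
        # HEAD is fine here: the next attempt starts on a fresh branch anyway.
        git("checkout", departure, repo=repo)
```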
The Good Parts
I find this system effective and a better use of my time than the coding agents I left behind. There's no more constant hand-holding. Write some goals, leave it alone, and come back much later to be delighted at how much progress it has made.
It is easy to incorporate feedback by writing new goals, or changing goals.
Interestingly, there's no checklist of completed goals. This is a deliberate choice. It avoids the Pollyanna problem where the agent does some clowny sh*t, adds a green tick emoji to the task list, and stops thinking about that goal. Claude is surprisingly attentive to when goals are out of alignment with the code and will deal with updated requirements where goals have been refined.
The best goals are orthogonal, or at least not openly contradictory, and the software architecture is stronger as a result. Because goals are revised continuously, the goals document is a living document and not a historical task list. (If you want history, that's what git log is for.)
The process of resetting branches, and subdividing goals, is roughly adapted from The Mikado Method. This slays the problem of the agent always wanting to just fix one more broken thing... oh you're right, let me just change that... oh I need to just fix... and going in circles.
Similarly, three failed attempts in a row stop the agent from burning tokens when it is really stuck. It prevents endless hopping between problems, because any three consecutive failures stop the agent. And it prevents the agent from wasting time on your poorly specified goals. (If the sub-sub-goal is too hard, you need to ask for help.)
Next Steps
For focus, I deliberately left some opportunities for later. I've been thinking about them, though.
One is parallelism. My system is single-threaded. There's nothing inherently single-threaded about this process, but exploiting parallelism well will benefit from effectively merging related subgoals, and possibly flagging conflicting ones, as they arise from the concurrent agents. Interesting topic, but not one I needed to solve this summer. Maybe when it is colder outside.
Another is learning more effectively from mistakes. I generated this process using this process, but the approach was hacky. Doing it "live" confused the process-improving agent, which tried to incorporate learnings from first-order projects that were too domain-specific. Waiting until projects were done to reflect and make improvements to the process went better, but Claude's overzealousness about preventing any errors led it to try to compromise the inherent flexibility of the feedback loop from failures.
(It is prosaic, but I believe simply collecting all of the tweaks over the course of several projects, and using them for classic fine tuning, could be useful to try. The hard part there is just collecting the data.)
This system burns a lot of tokens (...which was kinda the point. Unblock machine time from the human attention bottleneck.) There's plenty of easy efficiency work, like caching the process documentation. Using cheaper models is appealing: high-capacity models can do goal decompositions for cheaper models to attempt. (Unblock cheap inference from the expensive inference bottleneck.)
Ultimately, I found this a useful point in the solution space. One I believe avoids a lot of problems I hear other senior engineers are having. At the same time, there's a lot of juice left to be squeezed out of these models. Let's keep sharing what we find that works (and doesn't work) when using AI to invent new ways of building software. Peace and love to you all, my fellow hackers.
Software engineer. This blog does not represent my employer.