Over the last couple of weeks I have been building out a set of personal skills for both Claude and Codex in a dotfile-style repo. I do not really think of these as clever prompts. I think of them as a way to make parts of my engineering process explicit: how I want planning to work, when I want review to happen, what should be delegated, what should be kept local, and where I want the agent to show a bit more discipline than “just have a go”.

This post is a short walkthrough of how that setup has evolved and what I have learnt from it so far.

Why I started doing this

The basic problem was pretty simple. Blank-slate prompting gets old quickly.

If I sat down with Claude or Codex and asked for help on a non-trivial task, I found myself repeatedly restating the same preferences:

  • Investigate first
  • Do not ask lazy questions
  • Be explicit about tradeoffs
  • Separate planning from implementation when the task is risky
  • Review changes properly before pushing

That is not especially interesting work for me to repeat, and it is not a very good interface either. If I already know the shape of the workflow I want, I would rather encode it once and reuse it.

The other driver was consistency. If I ask an agent to review infrastructure changes one day and to plan a multi-step feature the next, I do not want both requests to be interpreted as the same kind of job. Skills give me a way to make that intent clearer up front.

The first pass: one plugin and a lot of structure

The initial version was very opinionated very quickly. It introduced a personal Claude plugin and a cluster of larger workflow skills such as:

  • /lets-work for deep planning
  • /implement-plan for orchestrated execution
  • /review for a proper review and push gate
  • /monitor-deploy for a fix and redeploy loop
  • /nix-project-init for bootstrapping new projects
  • /update-step-progress for explicit progress tracking

Looking back, the interesting part is not that I made a lot of skills at once. It is that I was trying to encode a full delivery loop rather than one-off prompts. Planning was not just “give me a list”. It had self-validation, user questions in a batch, and an explicit wave structure so work could be parallelised later. Execution was not just “write the code”. It had progress files, wave boundaries and verification requirements.
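To make the wave idea concrete, here is a guessed shape for that kind of plan artifact: steps grouped into ordered waves so independent work can be parallelised later, with user questions batched up front. The field names and helper are invented for illustration, not the actual plan format.

```python
# Hypothetical plan structure: questions batched up front, steps grouped
# into waves that must complete in order, so work within a wave can run
# in parallel. All names here are illustrative.
PLAN = {
    "questions": ["Which environments does this touch?"],  # asked in one batch
    "waves": [
        {"id": 1, "steps": ["investigate current config", "draft schema"]},
        {"id": 2, "steps": ["implement change", "update docs"]},  # after wave 1
    ],
}

def next_wave(plan, completed_wave_ids):
    """Return the first wave not yet completed, or None when done."""
    for wave in plan["waves"]:
        if wave["id"] not in completed_wave_ids:
            return wave
    return None
```

The point of the structure is that execution only ever needs to ask “what is the next wave?”, which is also what makes on-disk progress tracking straightforward.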

That structure has been useful, but it also taught me the first real lesson: if you only build heavyweight workflows, everything starts to look like a heavyweight workflow.

Splitting heavyweight and lightweight work

A few days later I added /quick-work and /security-review.

That was a useful correction.

/lets-work, now renamed /technical-plan, is good when the task is genuinely large, ambiguous or risky. It is overkill when the work is local and well scoped, and you mostly just need disciplined investigation followed by implementation. That is exactly the gap /quick-work fills. It keeps the same quality bar around investigation and self-checking, but drops the ceremony of writing a plan file, managing waves and tracking progress on disk.

/security-review was another good lesson. Security concerns were previously mixed into broader review behaviour, which sounds tidy until you actually want an explicit gate for IAM, exposure, encryption and audit risks. Pulling that into its own skill made the intent much sharper.

This was probably the point where the overall strategy started to feel more solid. Instead of one giant “be a great engineering assistant” instruction set, I had a smaller set of modes with clearer entry points:

  • Deep planning
  • Light but rigorous task execution
  • Implementation from an existing plan
  • Review
  • Security review
  • Deploy and monitor flows

That separation has made the system much easier to trust.

Making the approach work across Claude and Codex

The next step was less about adding more workflows and more about making the workflows portable.

After working with these for a few more tasks, a few changes landed in quick succession:

  • The planning skill was refactored to pull shared rules into reference files
  • A routing reference was added for subagent roles and model tiers
  • Separate model mapping references were added for Claude and Codex

This is the part of the evolution I find most useful.

The current plugin structure reflects that shift as well. I now have a shared plugin manifest which renders both Claude and Codex plugin files from one source. That is the right direction. If the workflow is the product, I do not want two drifting copies of it.
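The single-source idea can be sketched in a few lines. This is not the actual manifest format; the skill names, fields and output shape are assumptions used only to show the render-both-from-one-source pattern.

```python
import json

# Hypothetical shared manifest: one definition of the skill set,
# rendered into a per-runtime plugin file so the two copies cannot drift.
MANIFEST = {
    "name": "personal-skills",
    "skills": ["technical-plan", "quick-work", "review", "security-review"],
}

def render(runtime: str) -> str:
    """Produce a runtime-specific plugin file from the shared manifest."""
    doc = dict(MANIFEST)
    doc["runtime"] = runtime  # the only runtime-specific field in this sketch
    return json.dumps(doc, indent=2)

claude_plugin = render("claude")
codex_plugin = render("codex")
```

In practice the per-runtime differences are bigger than one field, but the principle is the same: edit the workflow once, regenerate both outputs.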

Being able to reuse the same workflows across models, even when they do not produce exactly the same results, is useful in itself. Some models are plainly better than others at particular kinds of work, and model limits have a habit of running out at different times. Having skills that can move across runtimes gives me a bit more resilience there.

Different models for different subagents

Earlier versions were still heavily shaped by one runtime. Within a few weeks the newer setup had become much more explicit about the abstraction boundary. The skill decides things like:

  • What kind of subtask this is
  • What evidence depth it needs
  • What role should handle it
  • What abstract model tier makes sense

Only after that does it translate the decision into runtime-specific choices for Claude or Codex.
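That two-step decision can be sketched as a small lookup: the skill picks an abstract tier, and only the final translation knows about vendors. The tier names and model identifiers below are illustrative assumptions, not my actual mapping files.

```python
# Illustrative only: abstract tiers chosen by the skill, translated into
# runtime-specific model names at the last moment. Real mappings live in
# separate per-runtime reference files.
TIER_MAP = {
    "claude": {"light": "haiku", "standard": "sonnet", "deep": "opus"},
    "codex": {"light": "mini", "standard": "default", "deep": "high-reasoning"},
}

def resolve_model(runtime: str, tier: str) -> str:
    """Translate an abstract model tier into a runtime-specific choice."""
    return TIER_MAP[runtime][tier]
```

Keeping the mapping this thin is the point: the routing decision ("this subtask needs the light tier") is portable, and only this table changes per runtime.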

The intention here is pretty practical. I want the skill to pick the right model for the task often enough that I get a better balance of speed, cost and output quality. In theory that means preferring a cheaper lighter model when the work is narrow or the evidence is simple, then upgrading when the first pass comes back weak or uncertain. I am not convinced that balance is really solved yet, but it is at least explicit now rather than accidental.

My experience so far has shown some promise. There have been a few good sessions where Haiku has been used for bounded fact discovery in subagents and done exactly what I wanted. That is nice to see because it suggests the routing can sometimes keep the expensive reasoning for where it is actually needed instead of spending it everywhere. The harder bit is deciding when the first output is not good enough and should be escalated. I have tried to encourage that behaviour in the skill, but I do not think it is fully reliable yet.
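The escalation loop I am aiming for looks roughly like this. Both `run_subtask` and the confidence check are hypothetical stand-ins for however the skill actually judges a first pass; the threshold is arbitrary.

```python
def run_with_escalation(task, run_subtask, tiers=("light", "standard", "deep")):
    """Try the cheapest tier first; escalate only if the result looks weak.

    `run_subtask` and the confidence field are invented stand-ins for
    whatever signal the skill uses to decide a pass was good enough.
    """
    result = None
    for tier in tiers:
        result = run_subtask(task, tier=tier)
        if result.get("confidence", 0) >= 0.7:  # arbitrary acceptance threshold
            return result
    return result  # best effort from the strongest tier
```

The hard part, as noted above, is not this loop but producing a trustworthy "good enough" signal to drive it.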

I also added a guardrails layer on top of the routing. The guardrails file is per project, so depending on the repo or environment I can block particular models entirely and define fallbacks. That gives me another lever when a project has cost constraints, environment-specific limits or just a model that I do not want used there for some reason. I like this because it keeps the routing policy mostly stable while still allowing local constraints to win.
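A minimal sketch of that guardrails idea, assuming an invented file shape: a per-project block list plus a fallback table, applied after routing has picked a model.

```python
# Hypothetical per-project guardrails: block some models outright and
# substitute an allowed fallback. Field names and model names are invented.
GUARDRAILS = {
    "blocked": {"opus"},              # e.g. a cost-constrained repo
    "fallbacks": {"opus": "sonnet"},  # what to use instead
}

def apply_guardrails(model: str, guardrails: dict) -> str:
    """Return an allowed model, swapping in the fallback if blocked."""
    if model in guardrails["blocked"]:
        return guardrails["fallbacks"][model]
    return model
```

Because this runs after routing, the routing policy itself stays stable; local constraints simply get the last word.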

I like this overall direction because it keeps the reasoning about the work separate from the quirks of the model vendor. The routing policy is the thing I actually care about. Model mapping should be an implementation detail.

What is working well

A few things are working particularly well at the moment.

Repeatability

The biggest win is that I no longer need to restate the same expectations in every session. The agent starts much closer to how I actually want to work.

Clearer task framing

Choosing between /technical-plan, /quick-work, /review or /security-review forces me to be clearer about the job itself. That sounds small, but it has a real effect on output quality.

Better decomposition

The subagent routing work has improved how larger tasks get broken down. Even when I disagree with the output, the structure makes it much easier to see why the agent made a choice and where the decision should be adjusted.

Runtime portability

Having shared workflow definitions and separate runtime mappings feels much healthier than rewriting the same intent for Claude and Codex independently. It reduces prompt drift, gives me one place to refine the actual method, and makes it easier to move between models when one is a better fit for the task or simply unavailable.

What still needs work

There are still a few obvious problems.

It can still get too ceremonial

I like structure, but there is a point where structure becomes friction. I have improved this with /quick-work, but there are still cases where the system wants to act like a mini operating model when a sharp local change would do.

The skills themselves are becoming a system to maintain

This is now real software, even if it is written in markdown and shell scripts. There are dependencies between skills, shared references, hooks, manifests and runtime mappings. That is powerful, but it also means the maintenance burden is real. If I am not careful, I end up needing tooling to manage the tooling.

Claude and Codex are not actually identical

Shared manifests and mapping files help a lot, but runtime parity is not free. The two environments have different strengths, different tool surfaces and different rough edges. A skill can abstract some of that, but not all of it.

I still need better feedback loops

At the moment a lot of my judgement is qualitative. I can say a workflow feels better, or that it reduced prompt repetition, or that a review was more thorough. What I do not yet have is a very good lightweight way of measuring which skills genuinely improve outcomes and which ones mostly make me feel organised.

Summary

The main thing I have learnt is that skills are most useful when they encode judgement, constraints and workflow boundaries, not when they just wrap a fancy prompt around a generic request.

More recently I have also started adding narrower skills like project-knowledge, and that has reinforced another lesson: the most valuable skills are usually the ones that encode specific judgement in a repeatable setting, not the ones that try to be universally smart.

The best parts of this setup so far are the parts that make expectations explicit: investigate first, separate heavy and light work, make review a real gate, route subtasks deliberately, and keep the workflow portable across runtimes where possible.

The part to watch is complexity. Once you start building a proper system around skills, it is very easy to create a second job for yourself maintaining the meta-layer. So that is the balance I am trying to keep now: more reusable judgement, less unnecessary ceremony.

Get in contact

If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via LinkedIn or email.