29/05/2026
Microsoft has open-sourced SkillOpt: a framework that trains agent skills the way you'd train a neural network, without ever touching model weights.
The idea is to treat a plain markdown skill file as the trainable parameter of a frozen LLM agent, then optimize it with the same discipline you'd use on weights: learning rate, validation gate, batch size, and epoch schedule.
Here's how it works:
▪️The skill document is the parameter
▪️Trajectory-derived edits are the gradient
▪️The edit budget is the learning rate
▪️A held-out split is the validation check
How the loop runs:
▪️ A frozen model runs tasks with the current skill and logs scored trajectories
▪️A separate optimizer model analyzes failures in minibatches, proposes structured add/delete/replace edits, and ranks them under a budget cap
▪️An edit is accepted only if it improves the held-out split; rejected edits are stored so the optimizer stops repeating them
Deployment is a single best_skill.md, ~300-2,000 tokens. No weight changes, no extra inference-time calls.
The numbers hold up: best or tied on all 52 (model, benchmark, harness) cells, beating human, one-shot, Trace2Skill, TextGrad, GEPA, and EvoSkill skills.
On GPT-5.5, it adds +23.5 points in direct chat, +24.8 in the Codex loop, and +19.1 in Claude Code over the no-skill baseline.
What got me: the best skills landed with just 1-4 accepted edits across the entire run. The output reads like rules a careful engineer would write after a day with the benchmark, except they were found automatically.
SkillOpt isn't alone here. Hermes Agent reached the same idea independently through skill_manage, Curator, and a GEPA optimization loop that scores, mutates, and promotes skill docs across runs.
Two teams, different architectures, one conclusion: in a frozen-model agent, the skill file is the highest-leverage thing to optimize.