A Skill Is a Directory, Not a File
2026-03-23
Most skills I see are a single markdown file with everything dumped into it. Instructions, examples, edge cases, all of it — one long scroll of text that the agent loads every time the skill triggers.
That works. Mostly. But it's not how skills are supposed to work, and over time the cracks start to show. The skill gets slower as the context balloons. Updating it means wading through a monolith to find the one paragraph that needs changing. And the agent loads instructions for six scenarios when the current task only needs one.
What a skill actually is
A skill isn't just a markdown file. It's a directory.
my-skill/
├── SKILL.md            ← the entry point
├── references/
│   ├── reference-a.md
│   └── reference-b.md
├── scripts/
│   └── script.py
└── assets/
    └── template.docx
The SKILL.md is the tip of the iceberg. It contains a name, a description, and enough instructions to get started. The real depth lives in the files alongside it.
Anthropic calls this progressive disclosure. Like a well-organised manual — table of contents first, then specific chapters, then appendices — skills let the agent load information only as it's needed.
Three levels of detail
The design is deliberately layered:
Level 1: Metadata — just the name and description from the YAML frontmatter. This is always in context, always loaded. Every skill you have installed sits here. It's how the agent decides whether a skill is relevant at all. Keep it tight.
Level 2: The SKILL.md body — when the agent decides the skill is relevant, it reads the full SKILL.md. This is the main body of instructions. Aim for under 500 lines. If you're approaching that limit, it's a sign you need the third level.
Level 3: Bundled files — reference documents, scripts, assets. The agent only reads these when it decides it needs them. You point to them from SKILL.md and explain when to reach for them.
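Levels 1 and 2 both live in SKILL.md itself: the frontmatter is always loaded, the body only on demand. A minimal sketch of what that file looks like — the skill name and wording here are hypothetical, though the name/description frontmatter layout follows Anthropic's documented skill format:

```markdown
---
name: pdf-form-filler
description: Fills out PDF forms from structured data. Use when the user provides a PDF form to complete.
---

# PDF Form Filler

Workflow:
1. Inspect the form fields.
2. For AcroForm PDFs, run scripts/fill_fields.py.
3. For flat (scanned) PDFs, read references/flat-pdfs.md first.
```

The description carries the level-1 weight: it's the only text the agent sees before deciding whether to load the rest.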
A skill that keeps the SKILL.md lean and routes detail into references will perform better than one that front-loads everything. The agent isn't overwhelmed on entry, and it discovers specifics only when the task actually demands them.
How it actually works under the hood
It helps to understand the mechanism, because it makes the whole design click.
All installed skill metadata — every name and description across every skill you have — is baked into the agent's initial system prompt. The agent starts every session already aware of what skills exist. That's level 1, and it's always there.
When the agent decides a skill is relevant — based on matching the task at hand against those descriptions — it calls a tool, essentially get_skill("skill-name"), which returns the contents of SKILL.md. That's level 2, loaded on demand.
From there, the agent is on its own. If SKILL.md says "read references/aws.md when deploying to AWS", the agent uses its built-in file reading to go and fetch it. If it says "run scripts/package.py", the agent executes it directly. Those files are never loaded unless the agent decides it needs them.
So the layering isn't just a design philosophy — it's the literal call sequence. Level 1 is in the prompt. Level 2 is a tool call. Level 3 is whatever the agent chooses to reach for next. The directory structure is the API.
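The call sequence above can be sketched in a few lines of Python. This is a toy model of the mechanism, not Anthropic's actual runtime — the function names (`get_skill`, `parse_frontmatter`) and the `skills/` layout are illustrative assumptions:

```python
from pathlib import Path

SKILLS_DIR = Path("skills")

def parse_frontmatter(text: str) -> dict:
    """Minimal YAML-frontmatter reader; assumes simple `key: value` lines."""
    _, frontmatter, _ = text.split("---", 2)
    return dict(line.split(": ", 1) for line in frontmatter.strip().splitlines())

def system_prompt_metadata() -> str:
    """Level 1: every installed skill's name and description, always in context."""
    lines = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_md.read_text())
        lines.append(f"- {meta['name']}: {meta['description']}")
    return "\n".join(lines)

def get_skill(name: str) -> str:
    """Level 2: the full SKILL.md body, returned by a tool call on demand."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()

# Level 3 is just the agent's ordinary tools from here:
#   read references/aws.md with its file-reading tool, or
#   execute scripts/package.py in its sandbox.
```

Nothing below level 1 exists in context until the agent asks for it, which is the whole point.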
The context window problem
Here's the thing about dumping everything into a single file: context is expensive.
Every token loaded into context costs money and competes with the actual task. A bloated skill that loads detailed instructions for six different scenarios — when the current task only needs one of them — is burning tokens on things that don't matter.
This is the same problem I wrote about with MCP versus CLI tools. MCP loads complete JSON schemas for every tool at session start. CLI tools let the agent discover capabilities on-demand via --help. The pattern is identical: progressive disclosure dramatically reduces the context footprint.
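The arithmetic is easy to sketch. All the token counts here are invented for illustration, but the shape of the saving is real:

```python
# Back-of-envelope comparison; every number below is made up.
scenarios = 6
tokens_per_scenario = 2_000  # detailed instructions for one scenario
core = 1_500                 # lean SKILL.md: workflow plus pointers

monolith = core + scenarios * tokens_per_scenario  # everything loaded up front
layered = core + tokens_per_scenario               # only the scenario you need

print(monolith, layered)  # 13500 3500: roughly 4x fewer tokens per run
```

And that cost recurs on every invocation of the skill, not once.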
A well-structured skill is the same idea applied to instructions.
What goes where
Some practical guidance:
In SKILL.md — the workflow, the decision points, and clear pointers to references. "When doing X, read references/x.md. When doing Y, use scripts/extract.py." The agent should be able to orient itself from SKILL.md alone, then reach for detail as needed.
In references/ — domain-specific content that only applies sometimes. Different output formats. Platform-specific instructions. Things that are mutually exclusive — if you're deploying to AWS you don't need the GCP reference loaded.
In scripts/ — code the agent can execute without reading into context. Parsing, extraction, transformation — anything deterministic. Code is more reliable than asking the agent to reason through it, and running a script doesn't cost context.
In assets/ — templates, icons, example files. Things the agent produces output from.
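As a concrete illustration of the scripts/ idea, here is the kind of deterministic helper the agent can execute rather than read. The filename and behaviour are my invention, not from any real skill:

```python
#!/usr/bin/env python3
"""Hypothetical scripts/extract.py: pull headings out of a markdown file.

The agent runs this and consumes only its JSON output; the code itself
never enters the context window.
"""
import json
import re
import sys

def extract_headings(markdown_text: str) -> list[dict]:
    """Return every markdown heading with its level, deterministically."""
    pattern = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)
    return [{"level": len(hashes), "text": text.strip()}
            for hashes, text in pattern.findall(markdown_text)]

if __name__ == "__main__":
    print(json.dumps(extract_headings(sys.stdin.read()), indent=2))
```

A regex like this would be tedious and error-prone for the agent to simulate token by token; as a script it's a single tool call with a predictable answer.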
A real example
Anthropic's own skill-creator skill — the skill for building skills — is the clearest example of this in the wild:
skill-creator/
├── SKILL.md                   ← workflow overview, under 500 lines
├── agents/
│   ├── grader.md              ← read only when spawning a grader subagent
│   ├── comparator.md          ← read only for blind A/B comparisons
│   └── analyzer.md            ← read only when analysing benchmark results
├── assets/
│   └── eval_review.html       ← HTML template, used but not read into context
├── eval-viewer/
│   └── generate_review.py     ← executed directly, never read
├── references/
│   └── schemas.md             ← read only when writing evals JSON
└── scripts/
    ├── aggregate_benchmark.py
    ├── run_loop.py
    └── package_skill.py       ← all executed directly, never read
The SKILL.md covers the core loop — draft, test, evaluate, iterate — and at each decision point tells the agent exactly which file to reach for next. Need to grade outputs? Read agents/grader.md. Need the eval JSON schema? Read references/schemas.md. Need to package the finished skill? Run scripts/package_skill.py.
None of those files are in context until that moment in the workflow. If you're just drafting a skill, you're not paying for the grader instructions or the benchmark schemas. They don't exist as far as the agent is concerned — until they do.
There's also a pleasing meta quality to it. This is a skill for building skills. It is itself a well-structured skill. The authors ate their own cooking.
The common mistake
The instinct when writing a skill is to ask "what does the agent need to know?" and then write it all down. That's a good start, but the second question — "when does it need to know each thing?" — is where most skills fall short.
The answer to that second question is your directory structure. It's your table of references. It's the decision logic in SKILL.md that routes the agent to the right file.
Anthropic's own documentation describes skills as being like an onboarding guide for a new hire. A good onboarding guide doesn't hand you everything on day one. It tells you where things are, introduces you to the most important bits, and points you to the right resources when you need more.
Skills work the same way. The SKILL.md is the orientation. The rest of the directory is the resource library.
Worth the structure
Setting up a skill directory properly takes a bit more thought than writing a single file. But that structure pays off. Skills that follow progressive disclosure are faster, cheaper to run, and easier to maintain. When something needs updating, you change the relevant reference file, not a monolith.
The file is just the entry point. The skill is the whole directory.