Operating Extensions at Scale

TL;DR Skills are the cheap scalable primitive — 30–50 tokens until invoked versus 50k+ for a five-server MCP setup [4]. The description field in SKILL.md is the routing layer, not documentation [5]. For headless CI use claude -p --bare; the Agent SDK brings the same loop into Python/TypeScript [8] [9]. Budget ~$150–250/dev/month at enterprise scale and set workspace spend limits before rollout [2].

Extension taxonomy

Claude Code ⭐ 130k [17] has four distinct extension primitives with different loading costs and scopes [1] [2]:

Primitive	Context cost at startup	How it loads	Best for
Skills	~100 tokens (name + description only)	Body loads on demand when relevant	Procedural knowledge, reusable workflows
MCP	Up to 18k tokens/server (deferred in Jan 2026)	Full schema unless Tool Search active	External system connectivity
Agents	Proportional to spawn prompt	Spawned on request; own context window	Isolated subtasks, parallel work
Hooks	Near zero (runs outside context window)	Shell commands at lifecycle events	Pre/post-tool validation, audit logging

The operative framing: MCP is connectivity, Skills are procedural knowledge [4]. When you find yourself re-explaining the same workflow across sessions, that’s a Skill. When Claude needs to reach a system it can’t access independently, that’s MCP.

Context budget reality

MCP token overhead

MCP injects full tool schema — name, description, all parameter definitions — into every message [3]. Measured startup costs per server:

MCP Server	Tools	Startup token cost
SQLite (minimal)	6	~385
Gmail	7	~2,640
Playwright	22	~3,442
GitHub MCP	—	8k–12,000
Full SQLite suite	—	~13,400
mcp-omnisearch	—	~14,100
Jira	—	~17,000

The most expensive single tool in the Gmail server (gmail_create_draft) costs 820 tokens alone; the cheapest (browser_close) costs 59 [3]. A five-server developer setup reaches 50k–66k tokens before the first user prompt [15] [3].

Tool Search deferral (Jan 2026)

MCP Tool Search auto-enables when tool definitions exceed 10% of the context window. Only tool names enter context; full schemas load on demand via search [2]. Users report up to 95% reduction in startup token cost [15]. Set ENABLE_TOOL_SEARCH=auto:5 to lower the activation threshold to 5% for more aggressive deferral.

In one measured session: 5,900 tokens recovered from MCP tool deferral + 7,300 from system tool deferral = 13,200 tokens saved [3].

Skills stay cheap at scale

Skills load only ~100 tokens at startup (name + description). Body content enters context only when Claude deems the skill relevant. Teams routinely run 20–50 Skills simultaneously with negligible overhead [4]. Keep SKILL.md under 500 lines; overflow content lives in linked reference files [5].

Skill authoring craft

The description is the routing layer

Claude reads only name and description when deciding whether to load a skill. This makes the description the most operationally significant token in your skillset [5]:

# Good: specific triggers; what AND when
description: >
  Extracts text and tables from PDF files, fills forms, merges documents.
  Use when working with PDF files or when the user mentions PDFs, forms,
  or document extraction.

# Bad: too vague to route correctly
description: Helps with documents

Rules enforced by the spec [5]:

Always third person (“Processes Excel files…”, not “I can help…”)
Include both what (capability) and when (concrete trigger phrases, file types)
Max 64-char name (lowercase, hyphens only); max 1,024-char description; no XML tags

If a skill isn’t triggering, the root cause is almost always a vague description. Verify registration with /plugin [4].

Invocation mode

Add an invocation key to frontmatter to control who fires the skill [1]:

auto (default): Claude decides when the skill is relevant
user: skill only runs when explicitly invoked via /skill-name

Long reference skills or rarely-needed playbooks should use user to prevent spurious loads.

Progressive disclosure structure

my-skill/
├── SKILL.md            ← loaded on trigger (≤500 lines)
├── reference.md        ← read only when Claude follows the link
├── examples.md         ← read only when needed
└── scripts/
    └── validate.py     ← executed, not loaded — only output consumes tokens

SKILL.md acts as a table of contents. Scripts execute without loading source; only their output costs tokens. Keep all reference links one level deep from SKILL.md — nested chains (A → B → C) cause partial reads [5].

Degrees of freedom

Match instruction specificity to task fragility [5]:

Specificity	When to use	Example
High freedom	Multiple valid approaches; context drives choice	Code review, analysis
Medium freedom	Preferred pattern exists; some variation OK	Report generation
Low freedom	Fragile sequence; consistency critical	Database migrations, exact scripts

Team deployment patterns

Scenario	Skill encodes	MCP provides
Code review	House style, guardrails	GitHub (diffs, comments)
Database analysis	Safety constraints, schema context	Postgres
Incident response	Regression identification, PR draft flow	Sentry + GitHub
Support triage	Classification logic	Slack + Linear

MCP beyond `tools/call`

Most builders only use the tool primitive. The spec defines six primitives across server- and client-side [6]:

Server-side (server exposes to client)

Tools — actions with explicit user approval per call. The one everyone knows.

Resources — read-only data identified by URIs and MIME types (travel://activities/{city}/{category}). Pulled by the client; never trigger operations. Treat them like structured files the server makes queryable.

Prompts — version-controlled instruction templates with typed variables. Servers define them; clients fill variables and build workflows. Enables centralized prompt governance: update once, propagate everywhere.

Client-side (client exposes to server)

These three unlock human-in-the-loop patterns that tool primitives alone can’t produce [6] [7]:

Sampling — server requests an LLM completion from the client’s model without needing its own API key. Users must explicitly approve the prompt before execution. Primary use: intent routing before tool dispatch — a 25-tool customer support server using Haiku to classify user intent adds ~200–800ms but removes 500 tokens of routing logic from the main agent prompt [7].

Server → sampling/create → Client (user approves) → LLM → Client → Server

Skip sampling for voice pipelines requiring <300ms response.

Roots — client declares filesystem URI boundaries the server may access during handshake. Prevents path guessing; enforces security perimeters when agents process customer-uploaded files or local knowledge bases [7]. Listen for RootsListChangedNotification when workspaces switch.

Elicitation — server pauses execution and requests schema-validated structured input from the user via the client’s UI (added in spec 2025-06-18) [6]. Replaces fragile multi-turn loops: define an enum schema, receive typed JSON back.

{ "action": { "enum": ["accept", "decline", "cancel"] } }

⚠ Check clientCapabilities?.elicitation before calling and fall back gracefully when the connected client doesn’t support it [7].

Human-in-the-loop pattern matrix

Pattern	Use case	Latency cost
Elicitation	User confirmations, structured input	<100ms
Sampling	Intent routing, classification	200–800ms
Roots	File scoping, security boundary	None
Hook system	Ops supervisor approval workflows	Async

Headless operation & Agent SDK

`claude -p --bare`

--bare skips auto-discovery of hooks, skills, plugins, MCP servers, and CLAUDE.md [8]. This is the recommended CI mode: identical results on any machine regardless of local developer configuration. It will become the default -p behavior in a future release.

# Pipe data in, text out
cat build-error.txt | claude --bare -p "explain the root cause" > output.txt

# Schema-enforced JSON output
claude --bare -p "Extract function names from auth.ts" \
  --output-format json \
  --json-schema '{"type":"object","properties":{"functions":{"type":"array","items":{"type":"string"}}},"required":["functions"]}'

# Session continuity across invocations
session_id=$(claude -p "Start review" --output-format json | jq -r '.session_id')
claude -p "Continue with database layer" --resume "$session_id"

stdin is capped at 10MB; for larger inputs pass a file path in the prompt instead of piping [8].

Agent SDK

The Agent SDK brings the same loop — tools, context, hooks, subagents, skills, MCP — into Python and TypeScript [9]. Repos: claude-agent-sdk-python ⭐ 7.2k [10] · claude-agent-sdk-typescript ⭐ 1.5k [11]

from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(
    prompt="Find and fix the bug in auth.py",
    options=ClaudeAgentOptions(allowed_tools=["Read", "Edit", "Bash"]),
):
    print(message)

import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Find and fix the bug in auth.ts",
  options: { allowedTools: ["Read", "Edit", "Bash"] }
})) {
  if ("result" in message) console.log(message.result);
}

SDK vs Managed Agents vs CLI [9]:

Dimension	CLI (`claude -p`)	Agent SDK	Managed Agents
Interface	Shell	Python / TypeScript library	REST API
Runs in	Your machine	Your process	Anthropic-managed sandbox
Session state	JSONL on filesystem	JSONL on filesystem	Anthropic-hosted event log
Custom tools	Via MCP config	In-process callbacks	You execute, return results
Best for	One-off scripts, local dev	Production automation, CI/CD	Long-running async sessions

June 15 2026 billing change: Agent SDK and claude -p usage migrates from subscription quota to a separate monthly Agent SDK credit [18]. Plan for this now if you bill agent usage against team quotas.

Hooks in the SDK

SDK hooks use callback functions rather than shell commands. PostToolUse for audit logging, PreToolUse for validation or output filtering [9]:

const logFileChange: HookCallback = async (input) => {
  const path = (input as any).tool_input?.file_path ?? "unknown";
  await appendFile("audit.log", `${new Date().toISOString()}: ${path}\n`);
  return {};
};

options: {
  permissionMode: "acceptEdits",
  hooks: { PostToolUse: [{ matcher: "Edit|Write", hooks: [logFileChange] }] }
}

A PreToolUse hook can filter a 10,000-line test log down to failure lines before it ever enters context — turning tens of thousands of tokens into hundreds [2].

Operating at scale

Cost baselines

Enterprise deployments average ~$13/dev/active day, ~$150–250/dev/month; 90% of users stay below $30/active day [2]. Model choice and context size drive most variance — Sonnet for teammates and subagents, Opus for architectural decisions.

Agent teams (parallel Claude instances) use ~7× more tokens than standard sessions. Keep spawn prompts focused; clean up teams when work completes [2].

Rate limit allocation

Concurrent users are a fraction of total, so per-user TPM decreases as org size grows [2]:

Team size	TPM / user	RPM / user
1–5	200k–300k	5–7
5–20	100k–150k	2.5–3.5
20–50	50k–75k	1.25–1.75
50–100	25k–35k	0.62–0.87
100–500	15k–20k	0.37–0.47
500+	10k–15k	0.25–0.35

⚠ Live training events (large groups online simultaneously) need temporary higher allocations — the table assumes normal daily concurrency patterns.

Spend controls

API plans: workspace spend limits in the Claude Console; set a workspace rate limit to cap Claude Code’s share and protect other production workloads
Pro/Max: /usage-credits monthly cap via CLI
Bedrock/Vertex/Foundry: LiteLLM for per-key cost tracking (unaffiliated, not audited by Anthropic) [2]
/usage command shows 24h/7d token attribution per MCP server, skill, and subagent

Token reduction without removing capability

Move CLAUDE.md procedures into Skills — CLAUDE.md loads unconditionally; Skills load on demand [2]. Keep CLAUDE.md under 200 lines.
Prefer CLI tools (gh, aws, gcloud) over MCP where available — no per-tool schema overhead [2].
Filter verbose tool outputs in PreToolUse hooks before they reach context.
Delegate log processing and test runs to subagents — verbose output stays isolated; only summary returns to main conversation.
Always use --bare in CI to prevent local developer hooks and MCPs from bleeding into pipeline runs [8].

MCP production practices

The MCP servers registry ⭐ 87k [14] lists community implementations. For production deployments [13] [16]:

Single domain per server — one clear purpose per MCP; no monolithic servers
Strict schemas — typed inputs/outputs with documented side effects; schema drift is the primary cause of production tool failures
Circuit breakers — graceful degradation when downstream APIs are unavailable
Enterprise auth — OAuth 2.1 with re-authentication per tool call (not session-level token); route through an MCP Gateway for centralized policy enforcement and per-team audit trails [16]
Targets [13]: >1,000 req/s per instance · P95 <100ms · error rate <0.1% · >99.9% uptime

The 2026 MCP roadmap priorities: stateless horizontal scaling via HTTP transport evolution and .well-known capability discovery; Tasks primitive for agent-to-agent communication; governance via Working Groups [12].

Debugging reference

Symptom	First check	Fix
Skill never triggers automatically	`description` vague; missing “Use when…” phrase	Rewrite with specific trigger phrases; verify `/plugin`
Skill not in slash-command list	Not in watched skill directories	Check `~/.claude/skills/`, `.claude/skills/`
MCP tool “not found” error	Short tool name instead of `Server:tool_name`	Prefix with MCP server name
50k+ startup token burn	Many MCPs without Tool Search deferral	Enable Tool Search; disable unused servers via `/mcp`
CI builds inconsistent vs local	Local hooks/MCPs bleeding into `claude -p`	Add `--bare` flag
Skill content partially loaded	Deep reference nesting (`A → B → C`)	Flatten all references to one level from SKILL.md
Agent SDK auth failure in CI	OAuth blocked in bare mode; API key missing	Set `ANTHROPIC_API_KEY`; pass via `--settings`
Elicitation request silently ignored	Client doesn’t support elicitation	Check `clientCapabilities?.elicitation` first