Anthropic Engineering, 2025-09-11

Designing tools for agents, with agents

Ken Aizawa

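Tools are not traditional functions. They are a contract between deterministic systems and non-deterministic agents. This post covers how to improve tools through a loop of prototyping, evaluation, and collaboration with agents, along with practical principles for tool names, descriptions, and response structure.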

Writing effective tools for agents — with agents

Key ideas

Tools are a new contract between the deterministic and the non-deterministic

In computing, deterministic systems produce the same output every time given identical inputs, while non-deterministic systems—like agents—can generate varied responses even with the same starting conditions.

Tools are a new kind of software which reflects a contract between deterministic systems and non-deterministic agents.

Software for agents has to be written differently

instead of writing tools and MCP servers the way we'd write functions and APIs for other developers or systems, we need to design them for agents.

the tools that are most "ergonomic" for agents also end up being surprisingly intuitive to grasp as humans.

Prototype first: move fast with a local MCP server or DXT

Wrapping your tools in a local MCP server or Desktop extension (DXT) will allow you to connect and test your tools in Claude Code or the Claude Desktop app.

Test the tools yourself to identify any rough edges. Collect feedback from your users to build an intuition around the use-cases and prompts you expect your tools to enable.

Strong vs. weak evaluation tasks

Strong evaluation tasks might require multiple tool calls—potentially dozens.

Strong tasks:

Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room.

Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they're leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.

Weaker tasks:

Schedule a meeting with jane@acme.corp next week.

Search the payment logs for purchase_complete and customer_id=9182.

Keep verifiers lenient: do not reject answers over trivial differences

Avoid overly strict verifiers that reject correct responses due to spurious differences like formatting, punctuation, or valid alternative phrasings.
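
One way to make a verifier tolerant of such spurious differences is to normalize both sides before comparing. The sketch below is illustrative only: `normalize`, `verify`, and the containment check are assumptions, not the method used in the original post.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def verify(response: str, accepted_answers: list[str]) -> bool:
    """Pass if the normalized response contains any normalized accepted answer.

    Several accepted answers can be listed to cover valid alternative phrasings.
    """
    normalized_response = normalize(response)
    for answer in accepted_answers:
        normalized_answer = normalize(answer)
        # Skip answers that normalize to nothing; they would match everything.
        if normalized_answer and normalized_answer in normalized_response:
            return True
    return False
```

For example, `verify("Booked: Conference Room B.", ["conference room b"])` passes even though casing and punctuation differ. Substring matching is deliberately loose; a stricter task may need a structured comparison instead.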

Have evaluation agents write reasoning and feedback blocks first

Instructing agents to output these before tool call and response blocks may increase LLMs' effective intelligence by triggering chain-of-thought (CoT) behaviors.

Read the agent's silences

LLMs don't always say what they mean.

Observe where your agents get stumped or confused. Read through your evaluation agents' reasoning and feedback (or CoT) to identify rough edges. Read between the lines; remember that your evaluation agents don't necessarily know the correct answers and strategies.

The case of Claude repeatedly appending '2025' to web search queries

When we launched Claude's web search tool, we identified that Claude was needlessly appending 2025 to the tool's query parameter, biasing search results and degrading performance (we steered Claude in the right direction by improving the tool description).

Let agents fix your tools

You can even let agents analyze your results and improve your tools for you. Simply concatenate the transcripts from your evaluation agents and paste them into Claude Code.

most of the advice in this post came from repeatedly optimizing our internal tool implementations with Claude Code.

More tools are not always better: the difference in affordances

agents have distinct "affordances" to traditional software—that is, they have different ways of perceiving the potential actions they can take with those tools

LLM agents have limited "context" ... whereas computer memory is cheap and abundant.

The address book brute-force analogy

if an LLM agent uses a tool that returns ALL contacts and then has to read through each one token-by-token, it's wasting its limited context space on irrelevant information (imagine searching for a contact in your address book by reading each page from top-to-bottom—that is, via brute-force search). The better and more natural approach (for agents and humans alike) is to skip to the relevant page first (perhaps finding it alphabetically).
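
The address book analogy can be sketched as two tools over the same data. Everything here is hypothetical (the `Contact` type, the sample data, and both function names); the point is only that the search variant returns a small, relevant slice instead of the whole book.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    email: str


# Hypothetical sample data standing in for a real address book.
CONTACTS = [
    Contact("Jane Doe", "jane@example.com"),
    Contact("John Smith", "john@example.com"),
    Contact("Sarah Chen", "sarah@example.com"),
]


def list_contacts() -> list[Contact]:
    """Brute-force style: returns every contact, so the agent must read them all."""
    return list(CONTACTS)


def search_contacts(query: str, limit: int = 5) -> list[Contact]:
    """Targeted style: returns at most `limit` contacts whose name or email matches."""
    needle = query.casefold()
    matches = [
        contact
        for contact in CONTACTS
        if needle in contact.name.casefold() or needle in contact.email.casefold()
    ]
    return matches[:limit]
```

With a real address book of thousands of entries, `search_contacts("jane")` spends context on a handful of rows, while `list_contacts()` spends it on all of them.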

Consolidate tools: one meaningful unit instead of several list_* tools

Instead of implementing a list_users, list_events, and create_event tools, consider implementing a schedule_event tool which finds availability and schedules an event.
Instead of implementing a read_logs tool, consider implementing a search_logs tool which only returns relevant log lines and some surrounding context.
Instead of implementing get_customer_by_id, list_transactions, and list_notes tools, implement a get_customer_context tool which compiles all of a customer's recent & relevant information all at once.
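
As a rough illustration of the search_logs idea above, the following sketch returns only matching lines plus a little surrounding context. The signature, the `context` and `max_matches` parameters, and the plain substring match are assumptions made for the example.

```python
def search_logs(
    log_lines: list[str],
    query: str,
    context: int = 1,
    max_matches: int = 20,
) -> list[str]:
    """Return lines containing `query`, plus `context` lines before and after each.

    Overlapping context windows are merged, and original line order is preserved.
    """
    keep: set[int] = set()
    matches = 0
    for index, line in enumerate(log_lines):
        if query not in line:
            continue
        matches += 1
        if matches > max_matches:
            break  # cap the response size instead of returning every hit
        start = max(0, index - context)
        end = min(len(log_lines), index + context + 1)
        keep.update(range(start, end))
    return [log_lines[i] for i in sorted(keep)]
```

A read_logs tool would hand the agent every line; this variant hands back only the neighborhood of each hit.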

Namespacing: draw boundaries with prefixes

namespacing tools by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search), can help agents select the right tools at the right time.
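
To compare naming schemes in an evaluation, it can help to generate tool names from their parts rather than hand-writing them. This is a small sketch under assumptions: the `tool_name` helper is hypothetical, and the "suffix" ordering shown (action first, service last) is just one possible reading of suffix-based namespacing.

```python
def tool_name(service: str, resource: str, action: str, style: str = "prefix") -> str:
    """Build a namespaced tool name from its parts.

    style="prefix" puts the service first, e.g. asana_projects_search.
    style="suffix" puts the service last (an assumed convention for illustration).
    """
    if style == "prefix":
        parts = [service, resource, action]
    elif style == "suffix":
        parts = [action, resource, service]
    else:
        raise ValueError(f"Unknown style '{style}'; expected 'prefix' or 'suffix'.")
    return "_".join(parts)
```

Generating both variants from the same parts makes it straightforward to run the same evaluation twice and see which scheme the agent handles better.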

We have found selecting between prefix- and suffix-based namespacing to have non-trivial effects on our tool-use evaluations.

Return only meaningful context: natural language over UUIDs

tool implementations should take care to return only high signal information back to agents. They should prioritize contextual relevance over flexibility, and eschew low-level technical identifiers (for example: uuid, 256px_image_url, mime_type).

merely resolving arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language (or even a 0-indexed ID scheme) significantly improves Claude's precision in retrieval tasks by reducing hallucinations.
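
A 0-indexed ID scheme like the one mentioned above could be layered over existing records roughly as follows. The `alias_ids` helper and the record fields are hypothetical; the tool keeps the alias-to-UUID mapping on its side so follow-up calls can translate back.

```python
def alias_ids(
    records: list[dict], id_key: str = "uuid"
) -> tuple[list[dict], dict[int, str]]:
    """Replace opaque identifiers with small 0-indexed IDs.

    Returns the agent-facing records and a mapping from each alias back to
    the original identifier, which stays inside the tool implementation.
    """
    alias_to_original: dict[int, str] = {}
    aliased: list[dict] = []
    for index, record in enumerate(records):
        alias_to_original[index] = record[id_key]
        visible_fields = {k: v for k, v in record.items() if k != id_key}
        aliased.append({"id": index, **visible_fields})
    return aliased, alias_to_original
```

The agent then refers to "id 1" rather than a 36-character string, and the tool resolves it through the returned mapping.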

response_format enum: expose both concise and detailed

You can enable both by exposing a simple response_format enum parameter in your tool, allowing your agent to control whether tools return "concise" or "detailed" responses.
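
A minimal sketch of such a parameter, assuming a hypothetical `get_thread` tool and made-up record fields: the concise form returns only the content, while the detailed form also includes identifiers the agent may need for follow-up calls.

```python
from enum import Enum


class ResponseFormat(str, Enum):
    CONCISE = "concise"
    DETAILED = "detailed"


# Hypothetical record; the field names are illustrative only.
THREAD = {
    "thread_id": "T-001",
    "author_id": "U-042",
    "created_at": "2025-09-01T10:00:00Z",
    "content": "Launch moved to Friday.",
}


def get_thread(thread: dict, response_format: str = "concise") -> dict:
    """Return a thread in the requested level of detail."""
    # Enum lookup raises ValueError for anything other than the two allowed values.
    chosen = ResponseFormat(response_format)
    if chosen is ResponseFormat.DETAILED:
        return dict(thread)
    return {"content": thread["content"]}
```

Defaulting to "concise" keeps routine calls cheap; the agent opts into "detailed" only when it needs the extra fields.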

Response structure is also a performance variable: LLMs follow their training distribution

Even your tool response structure—for example XML, JSON, or Markdown—can have an impact on evaluation performance: there is no one-size-fits-all solution. This is because LLMs are trained on next-token prediction and tend to perform better with formats that match their training data.

Token efficiency: Claude Code's default 25K-token limit

For Claude Code, we restrict tool responses to 25,000 tokens by default. We expect the effective context length of agents to grow over time, but the need for context-efficient tools to remain.
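
A response cap of this kind might be enforced along these lines. This is a sketch, not how Claude Code does it: whitespace-separated words stand in for real tokens (an assumption that undercounts compared with an actual tokenizer), and `truncate_response` is a hypothetical helper.

```python
MAX_TOKENS = 25_000  # the default limit quoted above


def truncate_response(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Cap a tool response and tell the agent how to get the rest.

    Assumption: words split on whitespace approximate tokens. Note that the
    kept portion is re-joined with single spaces, so original line breaks
    inside it are not preserved.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return text
    kept = " ".join(words[:max_tokens])
    notice = (
        f"[Truncated: showing {max_tokens} of {len(words)} tokens. "
        "Narrow the query or request a later page to see more.]"
    )
    return f"{kept}\n{notice}"
```

The trailing notice matters as much as the cap: it steers the agent toward a narrower follow-up call instead of leaving it with a silently incomplete result.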

Error messages are prompt engineering too

if a tool call raises an error (for example, during input validation), you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks.
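
As an illustration, input validation can return a message that states what was wrong and what to send instead. The `search_events` tool, its parameters, and the message wording below are all hypothetical.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def search_events(user_id: str, start_date: str) -> dict:
    """Validate inputs and return actionable errors instead of raw tracebacks."""
    if not user_id.strip():
        return {
            "error": "user_id is empty. Pass the ID of the user whose calendar "
            "should be searched, then retry."
        }
    if not DATE_PATTERN.match(start_date):
        return {
            "error": f"start_date must use the YYYY-MM-DD format (for example "
            f"2025-09-15), but got '{start_date}'. Retry with a corrected date."
        }
    # A real implementation would query a calendar here; this sketch returns nothing.
    return {"events": []}
```

Compare this with returning `{"error": "400"}`: the agent learns the expected format, sees its own bad input echoed back, and knows the next step is to retry.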

Tool descriptions: write as if explaining to a new hire

When writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team. Consider the context that you might implicitly bring—specialized query formats, definitions of niche terminology, relationships between underlying resources—and make it explicit.

input parameters should be unambiguously named: instead of a parameter named user, try a parameter named user_id.
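
A tool definition written in that spirit might look like the sketch below. The tool itself, its wording, and the `undocumented_parameters` helper are hypothetical; the name/description/input_schema layout follows the common JSON Schema style for tool definitions.

```python
# Hypothetical tool definition: implicit context (ID source, date format,
# result ordering) is spelled out in the descriptions.
SEARCH_EVENTS_TOOL = {
    "name": "calendar_search_events",
    "description": (
        "Search one user's calendar for events on or after a start date. "
        "Returns events sorted by start time, earliest first."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {
                "type": "string",
                "description": "ID of the calendar owner, not their name or email.",
            },
            "start_date": {
                "type": "string",
                "description": "Earliest event date, in YYYY-MM-DD format.",
            },
        },
        "required": ["user_id", "start_date"],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """List parameters that lack a non-empty description."""
    properties = tool["input_schema"]["properties"]
    return [
        name
        for name, spec in properties.items()
        if not spec.get("description", "").strip()
    ]
```

A check like `undocumented_parameters` can run in tests so that a newly added parameter never ships without an explanation.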

Small description edits, dramatic performance differences

Even small refinements to tool descriptions can yield dramatic improvements. Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation after we made precise refinements to tool descriptions, dramatically reducing error rates and improving task completion.

Conclusion: from deterministic to non-deterministic

To build effective tools for agents, we need to re-orient our software development practices from predictable, deterministic patterns to non-deterministic ones.

Effective tools are intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows, and enable agents to intuitively solve real-world tasks.
