Hamel Husain, 2025-03-24

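A Field Guide to Rapidly Improving AI Products

An AI product improvement loop observed across more than 30 companies: measurement over tools, data viewers over dashboards, prompts handed to domain experts instead of engineers, and roadmaps built around the number of experiments rather than features.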

Key Ideas

The most common scene: showing off architecture without measurement

AI TEAM: Here's our agent architecture – we've got RAG here, a router there, and we're using this new framework for…

ME: [Holding up my hand to pause the enthusiastic tech lead.] "Can you show me how you're measuring if any of this actually works?"

… Room goes quiet

This scene has played out dozens of times over the last two years. Teams invest weeks building complex AI systems, but can't tell me if their changes are helping or hurting.

With new tools and frameworks emerging weekly, it's natural to focus on tangible things we can control – which vector database to use, which LLM provider to choose, which agent framework to adopt. But after helping 30+ companies build AI products, I've discovered the teams who succeed barely talk about tools at all. Instead, they obsess over measurement and iteration.

Tools Trap: generic metrics impede progress

This is the "tools trap" – the belief that adopting the right tools or frameworks (in this case, generic metrics) will solve your AI problems. Generic metrics are worse than useless – they actively impede progress in two ways:

First, they create a false sense of measurement and progress. Teams think they're data-driven because they have dashboards, but they're tracking vanity metrics that don't correlate with real user problems. I've seen teams celebrate improving their "helpfulness score" by 10% while their actual users were still struggling with basic tasks. It's like optimizing your website's load time while your checkout process is broken – you're getting better at the wrong thing.

Second, too many metrics fragment your attention. Instead of focusing on the few metrics that matter for your specific use case, you're trying to optimize multiple dimensions simultaneously. When everything is important, nothing is.

Error Analysis: the highest-ROI activity

The alternative? Error analysis: the single most valuable activity in AI development, and consistently the one with the highest ROI.

When Jacob, the founder of Nurture Boss, needed to improve their apartment-industry AI assistant, his team built a simple viewer to examine conversations between their AI and users. Next to each conversation was a space for open-ended notes about failure modes.

After annotating dozens of conversations, clear patterns emerged. Their AI was struggling with date handling – failing 66% of the time when users said things like "let's schedule a tour two weeks from now."

Instead of reaching for new tools, they:

1. Looked at actual conversation logs

2. Categorized the types of date-handling failures

3. Built specific tests to catch these issues

4. Measured improvement on these metrics

The result? Their date handling success rate improved from 33% to 95%.
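The "specific tests" step can be sketched as a small table of phrases with expected dates, scored as a pass rate. This is an illustration only: `resolve_relative_date` is a hypothetical stand-in for whatever component resolves dates in the assistant, and the cases and reference date are made up, not Nurture Boss's actual tests.

```python
from datetime import date, timedelta

NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3, "four": 4}
UNIT_DAYS = {"day": 1, "days": 1, "week": 7, "weeks": 7}

def resolve_relative_date(phrase, today):
    """Hypothetical stand-in for the system under test; returns None if nothing parses."""
    tokens = phrase.lower().replace(",", " ").split()
    if "tomorrow" in tokens:
        return today + timedelta(days=1)
    # Look for adjacent "<count> <unit>" pairs such as "two weeks" or "3 days".
    for count_tok, unit_tok in zip(tokens, tokens[1:]):
        count = NUMBER_WORDS.get(count_tok)
        if count is None and count_tok.isdigit():
            count = int(count_tok)
        if count is not None and unit_tok in UNIT_DAYS:
            return today + timedelta(days=count * UNIT_DAYS[unit_tok])
    return None

TODAY = date(2025, 3, 24)  # fixed reference date so the tests are deterministic
CASES = [
    ("let's schedule a tour two weeks from now", date(2025, 4, 7)),
    ("can I come by tomorrow", date(2025, 3, 25)),
    ("in 3 days works for me", date(2025, 3, 27)),
    # The stub treats this as plain "tomorrow", so the test catches the failure.
    ("the day after tomorrow", date(2025, 3, 26)),
]

def pass_rate(cases, today):
    """Fraction of cases where the resolved date matches the expected date."""
    passed = sum(resolve_relative_date(text, today) == expected for text, expected in cases)
    return passed / len(cases)

rate = pass_rate(CASES, TODAY)
print(f"date handling pass rate: {rate:.0%}")
```

Each new failure pattern found during error analysis becomes another row in `CASES`, so the pass rate tracks the specific problem rather than a generic score.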

Bottom-Up vs Top-Down: metrics grow out of the data

The top-down approach starts with common metrics like "hallucination" or "toxicity" plus metrics unique to your task. While convenient, it often misses domain-specific issues.

The more effective bottom-up approach forces you to look at actual data and let metrics naturally emerge. At Nurture Boss, we started with a spreadsheet where each row represented a conversation. We wrote open-ended notes on any undesired behavior. Then we used an LLM to build a taxonomy of common failure modes. Finally, we mapped each row to specific failure mode labels and counted the frequency of each issue.

The results were striking - just three issues accounted for over 60% of all problems:

Conversation flow issues (missing context, awkward responses)

Handoff failures (not recognizing when to transfer to humans)

Rescheduling problems (struggling with date handling)
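The final counting step of the bottom-up process might look like the sketch below. The rows and label names are invented for illustration; in practice they would come from the annotated spreadsheet after the taxonomy has been built.

```python
from collections import Counter

# Hypothetical annotated rows: one per conversation, with zero or more failure labels.
rows = [
    {"id": 1, "labels": ["conversation_flow"]},
    {"id": 2, "labels": ["handoff_failure", "conversation_flow"]},
    {"id": 3, "labels": ["rescheduling"]},
    {"id": 4, "labels": []},  # no issue found
    {"id": 5, "labels": ["rescheduling"]},
    {"id": 6, "labels": ["formatting"]},
]

def failure_mode_shares(rows):
    """Return (label, count, share of all labeled issues), most frequent first."""
    counts = Counter(label for row in rows for label in row["labels"])
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

shares = failure_mode_shares(rows)
for label, n, share in shares:
    print(f"{label:20s} {n:3d} {share:6.1%}")

# Cumulative share of the top three failure modes.
top_three = sum(share for _, _, share in shares[:3])
```

Sorting by frequency is what turns a pile of notes into a priority list: the few labels at the top are the metrics worth tracking.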

The most important investment: a custom data viewer

The single most impactful investment I've seen AI teams make isn't a fancy evaluation dashboard – it's building a customized interface that lets anyone examine what their AI is actually doing.

I've watched teams struggle with generic labeling interfaces, hunting through multiple systems just to understand a single interaction. The friction adds up: clicking through to different systems to see context, copying error descriptions into separate tracking sheets, switching between tools to verify information. This friction doesn't just slow teams down – it actively discourages the kind of systematic analysis that catches subtle issues.

Teams with thoughtfully designed data viewers iterate 10x faster than those without them. And here's the thing: these tools can be built in hours using AI-assisted development (like Cursor or Lovable). The investment is minimal compared to the returns.

What makes a good data annotation tool

Show all context in one place. Don't make users hunt through different systems to understand what happened.

Make feedback trivial to capture. One-click correct/incorrect buttons beat lengthy forms.

Capture open-ended feedback. This lets you capture nuanced issues that don't fit into a pre-defined taxonomy.

Enable quick filtering and sorting.

Have hotkeys that allow users to navigate between data examples and annotate without clicking.
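As a rough sketch of the data model behind such a tool, the code below keeps all context on one record, records a verdict with a single key, stores open-ended notes, and supports simple filtering. The `Trace` class, the hotkey bindings, and the sample traces are all assumptions made for illustration; a real viewer would put a UI on top of something like this.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Trace:
    trace_id: str
    context: dict                     # conversation plus any retrieved data, kept together
    passed: Optional[bool] = None     # None means not yet reviewed
    notes: List[str] = field(default_factory=list)

# Hypothetical hotkey bindings: one keypress records a verdict.
HOTKEYS = {"p": True, "f": False}

def annotate(trace, key, note=""):
    """Record a one-key verdict and an optional open-ended note."""
    if key in HOTKEYS:
        trace.passed = HOTKEYS[key]
    if note:
        trace.notes.append(note)
    return trace

def filter_traces(traces, passed=None, note_contains=""):
    """Return traces with the given review status whose notes mention a phrase."""
    needle = note_contains.lower()
    return [
        t for t in traces
        if t.passed is passed and needle in " ".join(t.notes).lower()
    ]

traces = [
    Trace("t1", {"messages": ["Can I tour on Friday?"]}),
    Trace("t2", {"messages": ["Talk to a human please"]}),
    Trace("t3", {"messages": ["What is the rent?"]}),
]
annotate(traces[0], "f", "Misread the requested date")
annotate(traces[1], "f", "Did not hand off to a human")
annotate(traces[2], "p")

date_failures = filter_traces(traces, passed=False, note_contains="date")
unreviewed = filter_traces(traces, passed=None)
```

Keeping the verdict binary and the note free-form means reviewers can move quickly while still capturing issues that no predefined category anticipates.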

Prompts are English: domain experts write them directly

I recently worked with an education startup building an interactive learning platform with LLMs. Their product manager, a learning design expert, would create detailed PowerPoint decks explaining pedagogical principles and example dialogues. She'd present these to the engineering team, who would then translate her expertise into prompts.

But here's the thing: prompts are just English. Having a learning expert communicate teaching principles through PowerPoint, only for engineers to translate that back into English prompts, created unnecessary friction. The most successful teams flip this model by giving domain experts tools to write and iterate on prompts directly.

Integrated Prompt Environments: the step after the playground

But there's a crucial next step that many teams miss: integrating prompt development into their application context. Most AI applications aren't just prompts – they commonly involve RAG systems pulling from your knowledge base, agent orchestration coordinating multiple steps, and application-specific business logic. The most effective teams I've worked with go beyond standalone playgrounds. They build what I call integrated prompt environments – essentially admin versions of their actual user interface that expose prompt editing.

Drop the jargon: say "give the model the right context" instead of "implement RAG"

This happens everywhere. I've seen it with lawyers at legal tech companies, psychologists at mental health startups, and doctors at healthcare firms. The magic of LLMs is that they make AI accessible through natural language, but we often destroy that advantage by wrapping everything in technical terminology.

Instead of saying "We're implementing a RAG approach," say "We're making sure the model has the right context to answer questions."

Instead of "We need to prevent prompt injection," say "We need to make sure users can't trick the AI into ignoring our rules."

Instead of "Our model suffers from hallucination issues," say "Sometimes the AI makes things up, so we need to check its answers."

This doesn't mean dumbing things down – it means being precise about what you're actually doing. When you say "we're building an agent," what specific capability are you adding? Is it function calling? Tool use? Or just a better prompt? Being specific helps everyone understand what's actually happening.

Synthetic data: evaluation is possible even with zero users

One of the most common roadblocks I hear from teams is: "We can't do proper evaluation because we don't have enough real user data yet." This creates a chicken-and-egg problem – you need data to improve your AI, but you need a decent AI to get users who generate that data.

"LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it." — Bryan Bischof

The three axes of synthetic data: Features × Scenarios × Personas

The key to effective synthetic data is choosing the right dimensions to test. While these dimensions will vary based on your specific needs, I find it helpful to think about three broad categories:

Features: What capabilities does your AI need to support?

Scenarios: What situations will it encounter?

User Personas: Who will be using it and how?

The key to useful synthetic data is grounding it in real system constraints. For the real-estate AI assistant, this means:

Using real listing IDs and addresses from their database

Incorporating actual agent schedules and availability windows

Respecting business rules like showing restrictions and notice periods

Including market-specific details like HOA requirements or local regulations
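One way to cover the three dimensions is to enumerate every combination and turn each into an instruction for an LLM to write a realistic user message. The sketch below only builds those instructions; the dimension values, the listing IDs, and the prompt wording are hypothetical, and the real IDs would come from the product's own database.

```python
from itertools import product

# Hypothetical dimension values for a real-estate assistant.
features = ["listing search", "tour scheduling"]
scenarios = ["no matching listings", "requested time conflicts with the agent's schedule"]
personas = ["first-time buyer", "out-of-state investor"]

# Placeholder IDs; grounding means drawing these from real records.
listing_ids = ["LST-1001", "LST-1002", "LST-1003"]

PROMPT_TEMPLATE = (
    "Write one realistic message a {persona} might send to a real-estate assistant.\n"
    "Feature being exercised: {feature}\n"
    "Scenario: {scenario}\n"
    "Refer to listing {listing_id} where it is natural to do so."
)

def build_generation_prompts(features, scenarios, personas, listing_ids):
    """Create one generation prompt per (feature, scenario, persona) combination."""
    prompts = []
    for i, (feature, scenario, persona) in enumerate(product(features, scenarios, personas)):
        prompts.append(PROMPT_TEMPLATE.format(
            persona=persona,
            feature=feature,
            scenario=scenario,
            listing_id=listing_ids[i % len(listing_ids)],  # cycle through known IDs
        ))
    return prompts

prompts = build_generation_prompts(features, scenarios, personas, listing_ids)
print(len(prompts), "prompts")
print(prompts[0])
```

Enumerating the grid makes coverage explicit: a combination that never appears in the test set is visible as a gap rather than discovered later in production.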

Criteria Drift: criteria only take shape as you look at outputs

One of the most insidious problems in AI evaluation is "criteria drift" – a phenomenon where evaluation criteria evolve as you observe more model outputs. In their paper "Who Validates the Validators?", Shankar et al. describe this phenomenon:

"To grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria."

This creates a paradox: you can't fully define your evaluation criteria until you've seen a wide range of outputs, but you need criteria to evaluate those outputs in the first place. In other words, it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.

"Seeing how the LLM breaks down its reasoning made me realize I wasn't being consistent about how I judged certain edge cases." — Phillip Carter (Honeycomb)

The teams that maintain trust in their evaluation systems embrace this reality rather than fighting it. They treat evaluation criteria as living documents that evolve alongside their understanding of the problem space.

Binary judgments + detailed critiques: clarity and nuance together

When faced with a 1-5 scale, evaluators frequently struggle with the difference between a 3 and a 4, introducing inconsistency and subjectivity. What exactly distinguishes "somewhat helpful" from "helpful"? These boundary cases consume disproportionate mental energy and create noise in your evaluation data. And even when businesses use a 1-5 scale, they inevitably ask where to draw the line for "good enough" or to trigger intervention, forcing a binary decision anyway.

In contrast, a binary pass/fail forces evaluators to make a clear judgment: did this output achieve its purpose or not? This clarity extends to measuring progress – a 10% increase in passing outputs is immediately meaningful, while a 0.5-point improvement on a 5-point scale requires interpretation.

I've found that teams who resist binary evaluation often do so because they want to capture nuance. But nuance isn't lost – it's just moved to the qualitative critique that accompanies the judgment.

When included as few-shot examples in judge prompts, these critiques improve the LLM's ability to reason about complex edge cases. I've found this approach often yields 15-20% higher agreement rates between human and LLM evaluations compared to prompts without example critiques.
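A judge prompt that carries critiques as few-shot examples could be assembled roughly as follows. The example replies, critiques, and prompt wording are invented for illustration, and the call to an actual model is left out.

```python
# Hypothetical human-written examples: each pairs a binary verdict with a critique.
FEW_SHOT_EXAMPLES = [
    {
        "output": "Your tour is booked for tomorrow at 3pm.",
        "critique": "The user asked for a date two weeks out, but the reply books "
                    "tomorrow. The date is wrong, so the request is not fulfilled.",
        "verdict": "fail",
    },
    {
        "output": "I can't change bookings myself, so I'm connecting you with a leasing agent now.",
        "critique": "The assistant recognizes its limit and hands off to a human instead of guessing.",
        "verdict": "pass",
    },
]

def build_judge_prompt(task, examples, candidate):
    """Assemble a judge prompt: instructions, critiqued examples, then the reply to grade."""
    lines = [
        f"You are judging whether an AI assistant's reply achieves this goal: {task}",
        "Write a short critique, then a verdict of exactly 'pass' or 'fail'.",
        "",
    ]
    for ex in examples:
        lines += [
            f"Reply: {ex['output']}",
            f"Critique: {ex['critique']}",
            f"Verdict: {ex['verdict']}",
            "",
        ]
    # Leave the critique open so the model reasons before giving its verdict.
    lines += [f"Reply: {candidate}", "Critique:"]
    return "\n".join(lines)

prompt = build_judge_prompt(
    "help the user schedule or reschedule an apartment tour",
    FEW_SHOT_EXAMPLES,
    "Sure, I moved your tour to next month.",
)
print(prompt)
```

Asking for the critique before the verdict mirrors how the human examples are written, so the judge's reasoning can be compared directly against a reviewer's.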

Question the instinct to over-trust AI

Research shows people tend to over-rely and over-trust AI systems. For instance, in one high-profile incident, researchers from MIT posted a pre-print on arXiv claiming that GPT-4 could ace the MIT EECS exam. Within hours, [the] work [was] debunked … citing problems arising from over-reliance on GPT-4 to grade itself.

This over-trust problem extends beyond self-evaluation. Research has shown that LLMs can be biased by simple factors like the ordering of options in a set, or even seemingly innocuous formatting changes in prompts. Without rigorous human validation, these biases can silently undermine your evaluation system.

When working with Honeycomb, we tracked agreement rates between our LLM-as-a-judge and Phillip's evaluations... It took three iterations to achieve >90% agreement, but this investment paid off in a system the team could trust. Without this validation step, automated evaluations often drift from human expectations over time, especially as the distribution of inputs changes.
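Tracking agreement comes down to comparing two lists of binary verdicts over the same examples. The sketch below uses made-up verdicts, not Honeycomb's data, and also surfaces the disagreements, since those are the cases worth reading before revising the judge prompt.

```python
def agreement_rate(human, judge):
    """Fraction of examples where the human and LLM judge gave the same verdict."""
    if not human or len(human) != len(judge):
        raise ValueError("verdict lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def disagreements(human, judge):
    """Indices of examples to review by hand."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]

# Hypothetical pass/fail verdicts on the same ten outputs.
human_verdicts = [True, True, False, True, False, True, True, False, True, True]
judge_verdicts = [True, True, False, False, False, True, True, False, True, True]

rate = agreement_rate(human_verdicts, judge_verdicts)
to_review = disagreements(human_verdicts, judge_verdicts)
print(f"agreement: {rate:.0%}, review examples: {to_review}")
```

Re-running this check on fresh samples over time is one way to notice the judge drifting from human expectations as inputs change.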

Build roadmaps around experiment counts, not features: the Capability Funnel

I've watched teams commit to roadmaps like "Launch sentiment analysis by Q2" or "Deploy agent-based customer support by end of year," only to discover that the technology simply isn't ready to meet their quality bar. They either ship something subpar to hit the deadline or miss the deadline entirely. Either way, trust erodes.

The fundamental problem is that traditional roadmaps assume we know what's possible. With conventional software, that's often true – given enough time and resources, you can build most features reliably. With AI, especially at the cutting edge, you're constantly testing the boundaries of what's feasible.

Bryan Bischof, former Head of AI at Hex, introduced me to what he calls a "capability funnel" approach to AI roadmaps... For example, in a query assistant, the capability funnel might look like:

1. Can generate syntactically valid queries (basic functionality)

2. Can generate queries that execute without errors

3. Can generate queries that return relevant results

4. Can generate queries that match user intent

5. Can generate optimal queries that solve the user's problem (complete solution)
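Progress through a funnel like this can be reported as the share of test cases that reach each stage. The sketch below is one possible way to compute that, with stage names and results invented for illustration; it counts a stage only when every earlier stage also passed.

```python
# Hypothetical stage names matching the five funnel steps, in order.
STAGES = [
    "syntactically_valid",
    "executes",
    "relevant_results",
    "matches_intent",
    "optimal",
]

def funnel_rates(results):
    """Share of test cases reaching each stage, stopping at the first failed stage."""
    reached = [0] * len(STAGES)
    for result in results:
        for i, stage in enumerate(STAGES):
            if not result.get(stage, False):
                break
            reached[i] += 1
    return {stage: reached[i] / len(results) for i, stage in enumerate(STAGES)}

# Made-up per-query evaluation results.
results = [
    {"syntactically_valid": True, "executes": True, "relevant_results": True,
     "matches_intent": True, "optimal": True},
    {"syntactically_valid": True, "executes": True, "relevant_results": False},
    {"syntactically_valid": True, "executes": False},
    {"syntactically_valid": False},
]

rates = funnel_rates(results)
for stage, rate in rates.items():
    print(f"{stage:22s} {rate:.0%}")
```

Reporting stage by stage shows where the largest drop is, which is a more useful roadmap input than a single end-to-end success number.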

Timeboxes: creating decision points under uncertainty

"Here's a common timeline. First, I take two weeks to do a data feasibility analysis, i.e "do I have the right data?" […] Then I take an additional month to do a technical feasibility analysis, i.e "can AI solve this?" After that, if it still works I'll spend six weeks building a prototype we can A/B test." — Eugene Yan

"I try to reassure leadership with timeboxes. At the end of three months, if it works out, then we move it to production. At any step of the way, if it doesn't work out, we pivot."

What GitHub Copilot proved: invest in evaluation infrastructure first

I saw this firsthand during the early development of GitHub Copilot. What most people don't realize is that the team invested heavily in building sophisticated offline evaluation infrastructure. They created systems that could test code completions against a very large corpus of repositories on GitHub, leveraging unit tests that already existed in high-quality codebases as an automated way to verify completion correctness. This was a massive engineering undertaking – they had to build systems that could clone repositories at scale, set up their environments, run their test suites, and analyze the results, all while handling the incredible diversity of programming languages, frameworks, and testing approaches.

This wasn't wasted time – it was the foundation that accelerated everything. With solid evaluation in place, the team ran thousands of experiments, quickly identified what worked, and could say with confidence "this change improved quality by X%" instead of relying on gut feelings.

Institutionalize sharing failures: Fifteen-Fives

"In my fifteen-fives, I document my failures and my successes. Within our team, we also have weekly 'no-prep sharing sessions' where we discuss what we've been working on and what we've learned. When I do this, I go out of my way to share failures." — Eugene Yan

This practice normalizes failure as part of the learning process. It shows that even experienced practitioners encounter dead ends, and it accelerates team learning by sharing those experiences openly.

Breakthroughs come after long plateaus

"I was asked to do content moderation. I said, 'It's uncertain whether we'll meet that goal. It's uncertain even if that goal is feasible with our data, or what machine learning techniques would work. But here's my experimentation roadmap. Here are the techniques I'm gonna try, and I'm gonna update you at a two-week cadence.'"

"For the first two to three months, nothing worked. […] And then [a breakthrough] came out. […] Within a month, that problem was solved. So you can see that in the first quarter or even four months, it was going nowhere. […] But then you can also see that all of a sudden, some new technology comes along, some new paradigm, some new reframing comes along that just [solves] 80% of [the problem]."

This pattern – long periods of apparent failure followed by breakthroughs – is common in AI development. Traditional feature-based roadmaps would have killed the project after months of "failure," missing the eventual breakthrough.

Summary of core principles

Look at your data. Nothing replaces the insight gained from examining real examples. Error analysis consistently reveals the highest-ROI improvements.

Build simple tools that remove friction. Custom data viewers that make it easy to examine AI outputs yield more insights than complex dashboards with generic metrics.

Empower domain experts. The people who understand your domain best are often the ones who can most effectively improve your AI, regardless of their technical background.

Use synthetic data strategically.

Maintain trust in your evaluations. Binary judgments with detailed critiques create clarity while preserving nuance.

Structure roadmaps around experiments, not features. Commit to a cadence of experimentation and learning rather than specific outcomes by specific dates.

The key metric for AI roadmaps isn't features shipped – it's experiments run. The teams that win are those that can run more experiments, learn faster, and iterate more quickly than their competitors. And the foundation for this rapid experimentation is always the same: robust, trusted evaluation infrastructure that gives everyone confidence in the results.
