Hamel Husain, 2024-03-29

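Your AI Product Needs Evals

Using Lucy, a real estate AI assistant, as a case study: how to design a three-level evaluation system (Unit Tests → Human/Model Eval → A/B Testing) that moves an LLM product beyond the demo stage. The evaluation infrastructure doubles as the engine for fine-tuning, debugging, and synthetic data.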

Key ideas

The common cause of failing LLM products: no evaluation system

I've found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.

I'm currently an independent consultant who helps companies build domain-specific AI products. I hope companies can save thousands of dollars in consulting fees by reading this post carefully. As much as I love making money, I hate seeing folks make the same mistake repeatedly.

Iterating quickly == success: why you need to do all three

Like software engineering, success with AI hinges on how fast you can iterate. You must have processes and tools for:

1. Evaluating quality (ex: tests).

2. Debugging issues (ex: logging & inspecting data).

3. Changing the behavior or the system (prompt eng, fine-tuning, writing code).

Many people focus exclusively on #3 above, which prevents them from improving their LLM products beyond a demo. Doing all three activities well creates a virtuous cycle differentiating great from mediocre AI products.

If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment.

Lucy's plateau: the wall that successful AI products run into

During Lucy's beginning stages, rapid progress was made with prompt engineering. However, as Lucy's surface area expanded, the performance of the AI plateaued. Symptoms of this were:

Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.

There was limited visibility into the AI system's effectiveness across tasks beyond vibe checks.

Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.

The three levels of evaluation: Unit Tests → Human/Model Eval → A/B Testing

There are three levels of evaluation to consider:

Level 1: Unit Tests

Level 2: Model & Human Eval (this includes debugging)

Level 3: A/B testing

The cost of Level 3 > Level 2 > Level 1. This dictates the cadence and manner in which you execute them. For example, I often run Level 1 evals on every code change, Level 2 on a set cadence, and Level 3 only after significant product changes. It's also helpful to conquer a good portion of your Level 1 tests before you move into model-based tests, as the latter require more work and time to execute.

Level 1 unit tests: assertions per feature × scenario

Unit tests for LLMs are assertions (like you would write in pytest). Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference. The important part is that these assertions should run fast and cheaply as you develop your application so that you can run them every time your code changes.
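As a minimal sketch of that reuse, the same assertion function can back a test and also drive retries at inference time. Everything named here (the placeholder check, the `generate` callable, the wording of the retry feedback) is a hypothetical illustration, not Rechat's code.

```python
from typing import Callable, List, Optional

# A check returns None on success, or a human-readable error message on failure.
Check = Callable[[str], Optional[str]]


def check_no_unfilled_placeholder(output: str) -> Optional[str]:
    """Hypothetical assertion: the reply must not leak template placeholders."""
    if "{{" in output or "}}" in output:
        return "The reply contains an unfilled template placeholder."
    return None


def run_checks(output: str, checks: List[Check]) -> List[str]:
    """Run every check and collect the messages of those that fail."""
    errors = []
    for check in checks:
        message = check(output)
        if message is not None:
            errors.append(message)
    return errors


def generate_with_retries(
    generate: Callable[[str], str],
    prompt: str,
    checks: List[Check],
    max_attempts: int = 3,
) -> str:
    """Call the model, feeding assertion errors back into the prompt on failure."""
    errors: List[str] = []
    current_prompt = prompt
    for _ in range(max_attempts):
        output = generate(current_prompt)
        errors = run_checks(output, checks)
        if not errors:
            return output
        # Use the assertion errors to course-correct the next attempt.
        current_prompt = prompt + "\n\nFix these problems: " + " ".join(errors)
    raise ValueError(f"No valid output after {max_attempts} attempts: {errors}")
```

Because `run_checks` is a plain function over strings, it can also be applied to stored outputs for data cleaning.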

The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios. For example, one feature of Lucy is the ability to find real estate listings, which we can break down into scenarios like so:

Feature: Listing Finder. Example query: "Please find listings with more than 3 bedrooms less than $2M in San Jose, CA"

Scenario	Assertions
Only one listing matches user query	len(listing_array) == 1
Multiple listings match user query	len(listing_array) > 1
No listings match user query	len(listing_array) == 0
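
The scenario table can be written as plain assertion functions. This is a sketch, assuming the listing finder returns a list called `listing_array`; the scenario keys and the `check_listing_scenario` helper are illustrative names rather than anything from Rechat's codebase.

```python
from typing import Callable, Dict, List

# One assertion per scenario, mirroring the scenario table.
SCENARIO_ASSERTIONS: Dict[str, Callable[[List[dict]], bool]] = {
    "only_one_match": lambda listing_array: len(listing_array) == 1,
    "multiple_matches": lambda listing_array: len(listing_array) > 1,
    "no_matches": lambda listing_array: len(listing_array) == 0,
}


def check_listing_scenario(scenario: str, listing_array: List[dict]) -> None:
    """Raise AssertionError when the output does not fit the expected scenario."""
    passed = SCENARIO_ASSERTIONS[scenario](listing_array)
    assert passed, f"{scenario}: got {len(listing_array)} listing(s)"
```

In a pytest suite, each test input would be tagged with its expected scenario and the model's output passed through `check_listing_scenario`.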

Rechat has hundreds of these unit tests. We continuously update them based on new failures we observe in the data as users challenge the AI or the product evolves.

Synthetic test cases: using an LLM to build tests for an LLM

To test these assertions, you must generate test cases or inputs that will trigger all scenarios you wish to test. I often utilize an LLM to generate these inputs synthetically; for example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts:

"Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM... For each of the instructions, you need to generate a second instruction which can be used to look up the created contact."

You don't need to wait for production data to test your system. You can make educated guesses about how users will use your product and generate synthetic data. You can also let a small set of users use your product and let their usage refine your synthetic data generation strategy. One signal you are writing good tests and assertions is when the model struggles to pass them - these failure modes become problems you can solve with techniques like fine-tuning later on.
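
One way to wire such a prompt into a test pipeline is sketched below. The prompt wording only loosely follows the quoted example, and `build_synthetic_prompt`, `parse_numbered_list`, and the `llm` callable (any function from prompt string to completion string) are assumptions made for illustration.

```python
import re
from typing import Callable, List


def build_synthetic_prompt(feature_description: str, n: int = 50) -> str:
    """Build a prompt asking an LLM for n varied user instructions."""
    return (
        f"Write {n} different instructions that a real estate agent can give "
        f"to an assistant to {feature_description}. "
        "Return one instruction per line as a numbered list."
    )


def parse_numbered_list(text: str) -> List[str]:
    """Extract items from lines such as '1. ...' or '2) ...', skipping other lines."""
    items = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items


def generate_test_inputs(
    llm: Callable[[str], str], feature_description: str, n: int = 50
) -> List[str]:
    """Ask the LLM for synthetic inputs and return them as a clean list."""
    return parse_numbered_list(llm(build_synthetic_prompt(feature_description, n)))
```

Keeping the parser separate from the model call means the generated inputs can be reviewed and stored as ordinary test fixtures.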

On a related note, unlike traditional unit tests, you don't necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.
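
Treating the pass rate as a product decision could look like the following sketch, where each feature gets its own threshold. The feature names and threshold values are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per feature from (feature, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for feature, passed in results:
        totals[feature] += 1
        passes[feature] += int(passed)
    return {feature: passes[feature] / totals[feature] for feature in totals}


def below_threshold(
    rates: Dict[str, float], thresholds: Dict[str, float], default: float = 1.0
) -> List[str]:
    """Return the features whose pass rate is under the threshold chosen for them."""
    return sorted(f for f, rate in rates.items() if rate < thresholds.get(f, default))


# Hypothetical results: the listing finder passes 1 of 2 tests, contacts 1 of 1.
results = [("listing_finder", True), ("listing_finder", False), ("contacts", True)]
rates = pass_rates(results)
failing = below_threshold(rates, {"listing_finder": 0.9, "contacts": 0.8})
```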

A must before Level 2: trace logging

A prerequisite to performing human and model-based eval is to log your traces.

A trace is a concept that has been around for a while in software engineering: a log of a sequence of events, such as a user session or a request flowing through a distributed system. In other words, tracing is a logical grouping of logs. In the context of LLMs, traces often refer to conversations you have with an LLM.
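
A minimal sketch of what "a logical grouping of logs" might look like in code, assuming one trace per conversation written as a line of JSON. The `Event` and `Trace` classes and the JSONL layout are illustrative choices, not a prescribed schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Event:
    role: str       # e.g. "user", "assistant", "tool"
    content: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    """A logical grouping of events, e.g. one conversation with the LLM."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: List[Event] = field(default_factory=list)

    def log(self, role: str, content: str) -> None:
        self.events.append(Event(role, content))

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def append_trace(path: str, trace: Trace) -> None:
    """Append one trace per line (JSONL) so traces can be searched later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(trace.to_json() + "\n")
```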

Remove all friction from looking at data

You must remove all friction from the process of looking at data. This means rendering your traces in domain-specific ways. I've often found that it's better to build my own data viewing & labeling tool so I can gather all the information I need onto one screen. In Lucy's case, we needed to look at many sources of information (trace log, the CRM, etc) to understand what the AI did. This is precisely the type of friction that needs to be eliminated.

These tools can be built with lightweight front-end frameworks like Gradio, Streamlit, Panel, or Shiny in less than a day.

I often start by labeling examples as good or bad. I've found that assigning scores or more granular ratings is more onerous to manage than binary ratings. There are advanced techniques you can use to make human evaluation more efficient or accurate (e.g., active learning, consensus voting, etc.), but I recommend starting with something simple.

How much data should you look at? There is no free lunch

I often get asked how much data to examine. When starting, you should examine as much data as possible. I usually read traces generated from ALL test cases and user-generated traces at a minimum. You can never stop looking at data; no free lunch exists. However, you can sample your data more over time, lessening the burden.

(footnote) A reasonable heuristic is to keep reading logs until you feel like you aren't learning anything new.

Automated evaluation: correctness is subjective, so align with a human

Many vendors want to sell you tools that claim to eliminate the need for a human to look at the data. Having humans periodically evaluate at least a sample of traces is a good idea. I often find that "correctness" is somewhat subjective, and you must align the model with a human.

You should track the correlation between model-based and human evaluation to decide how much you can rely on automatic evaluation. Furthermore, by collecting critiques from labelers explaining why they are making a decision, you can iterate on the evaluator model to align it with humans through prompt engineering or fine-tuning.

Aligning with an Excel spreadsheet: the power of low-tech

I love using low-tech solutions like Excel to iterate on aligning model-based eval with humans. For example, I sent my colleague Phillip the following spreadsheet every few days to grade for a different use-case involving a natural language query generator. This spreadsheet would contain the following information:

model response: this is the prediction made by the LLM.

model critique: this is a critique written by a (usually more powerful) LLM about your original LLM's prediction.

model outcome: this is a binary label the critique model assigns to the model response as being "good" or "bad."

This information allowed me to iterate on the prompt of the critique model to make it sufficiently aligned with Phillip over time.
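
A sheet like this can be produced with the standard `csv` module, as in the sketch below. The three model columns come from the list above; the two empty human columns are an assumption about where the grader would write, and `build_grading_sheet` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List

COLUMNS = [
    "model response",   # prediction made by the LLM
    "model critique",   # critique written by the critique model
    "model outcome",    # "good" or "bad", assigned by the critique model
    "human critique",   # assumed column: left empty for the human grader
    "human outcome",    # assumed column: left empty for the human grader
]


def build_grading_sheet(rows: List[Dict[str, str]]) -> str:
    """Render rows as CSV text that a spreadsheet application can open."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        if row["model outcome"] not in ("good", "bad"):
            raise ValueError("model outcome must be 'good' or 'bad'")
        writer.writerow(row)
    return buffer.getvalue()
```

Once the grader fills in the human columns, rows where the two outcomes differ are the ones worth reading when revising the critique prompt.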

Principles for running a judge model

Use the most powerful model you can afford. It often takes advanced reasoning capabilities to critique something well. You can often get away with a slower, more powerful model for critiquing outputs relative to what you use in production.

Model-based evaluation is a meta-problem within your larger problem. You must maintain a mini-evaluation system to track its quality. I have sometimes fine-tuned a model at this stage (but I try not to).

After bringing the model-based evaluator in line with the human, you must continue doing periodic exercises to monitor the model and human agreement.

The pitfall of the agreement metric: it misleads on imbalanced classes

In this example, we used agreement between the model and human evaluator because our dataset was roughly balanced (about 50% of instances were failures). However, using raw agreement is generally not recommended and can be misleading when classes are imbalanced. Instead, you should typically measure precision and recall separately to get a more accurate picture of your judge's alignment.
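
To see why raw agreement misleads, here is a small worked sketch. It assumes binary labels where `True` means "this output is a failure"; the function name and the example numbers are illustrative.

```python
from typing import Dict, List


def judge_alignment(human: List[bool], model: List[bool]) -> Dict[str, float]:
    """Compare judge labels to human labels; True means 'this output is a failure'."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(h and m for h, m in zip(human, model))
    fp = sum((not h) and m for h, m in zip(human, model))
    fn = sum(h and (not m) for h, m in zip(human, model))
    tn = len(human) - tp - fp - fn
    return {
        "agreement": (tp + tn) / len(human),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# 100 traces with only 5 real failures, and a judge that never flags anything.
human_labels = [True] * 5 + [False] * 95
judge_labels = [False] * 100
metrics = judge_alignment(human_labels, judge_labels)
# metrics == {"agreement": 0.95, "precision": 0.0, "recall": 0.0}
```

The judge agrees with the human 95% of the time yet catches none of the failures, which is exactly what precision and recall expose.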

Fine-tuning is a by-product your evaluation system has already produced

Through fine-tuning, Rechat resolved many failure modes that prompt engineering alone could not fix. Fine-tuning is best for learning syntax, style, and rules, whereas techniques like RAG supply the model with context or up-to-date facts.

99% of the labor involved with fine-tuning is assembling high-quality data that covers your AI product's surface area. However, if you have a solid evaluation system like Rechat's, you already have a robust data generation and curation engine!

Evaluation infrastructure = debugging infrastructure

When you get a complaint or see an error related to your AI product, you should be able to debug this quickly. If you have a robust evaluation system, you already have:

A database of traces that you can search and filter.

A set of mechanisms (assertions, tests, etc) that can help you flag errors and bad behaviors.

Log searching & navigation tools that can help you find the root cause of the error. For example, the error could be RAG, a bug in the code, or a model performing poorly.

The ability to make changes in response to the error and quickly test its efficacy.

In short, there is an incredibly large overlap between the infrastructure needed for evaluation and that for debugging.
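
The overlap can be sketched by running evaluation-style checks over stored traces to flag the ones worth inspecting. The trace shape (a dict with `id` and `output` keys) and the `check_not_empty` check are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional

# A check returns an error message, or None if the output is acceptable.
Check = Callable[[str], Optional[str]]


def check_not_empty(output: str) -> Optional[str]:
    """Hypothetical check: the assistant must return a non-blank reply."""
    return None if output.strip() else "Empty reply."


def flag_traces(traces: List[dict], checks: List[Check]) -> Dict[str, List[str]]:
    """Map trace id to error messages, for traces that fail at least one check."""
    flagged: Dict[str, List[str]] = {}
    for trace in traces:
        errors = [msg for check in checks if (msg := check(trace["output"])) is not None]
        if errors:
            flagged[trace["id"]] = errors
    return flagged
```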

Core principles

Remove ALL friction from looking at data.

Keep it simple. Don't buy fancy LLM tools. Use what you have first.

You are doing it wrong if you aren't looking at lots of data.

Don't rely on generic evaluation frameworks to measure the quality of your AI. Instead, create an evaluation system specific to your problem.

Write lots of tests and frequently update them.

LLMs can be used to unblock the creation of an eval system. Examples include using an LLM to generate test cases and write assertions, generate synthetic data, and critique and label data.

Re-use your eval infrastructure for debugging and fine-tuning.
