Phil Hetzel (Braintrust)

Why building eval platforms is hard

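A maturity-stage model for eval platforms, laid out by the Solutions Engineering lead at Braintrust. Spreadsheet → vibe-coded UI → playground → observability integration → data systems problem. Evals are fundamentally a systems problem rather than a UI/UX problem, and the core is a flywheel: discover failure modes from production traces, then close the loop back into offline evals.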

Key ideas

LLMs have extreme variability: that is why you need evals

Evals are important because, and this sounds obvious, LLMs have extreme variability. We love LLMs because they're highly variable. There are so many different types of problems that LLMs can reason to solve. That's why we're so attracted to them as a technology.

Agents are becoming the norm in how customers are interacting with companies. People expect an agentic experience now. So if you combine both of those things together, you really need to be confident in how your agent is going to perform once it is in production.

The spreadsheet floor: no shame in where you start

How many people are doing evals right now, but it's just on a Google Sheet or some spreadsheet? There's no shame in that, my friend. Raise that hand high.

All I need to know is how to loop through my agent with a couple of different inputs and be able to display some handwritten notes and scores about that agent.
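The spreadsheet stage described here can be sketched in a few lines. This is a minimal illustration only, not anything shown in the talk: `toy_agent`, `run_spreadsheet_eval`, and the column names are hypothetical, and the scores and notes stand in for what a human reviewer would type by hand.

```python
import csv
import io

def toy_agent(question: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    canned = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
    }
    return canned.get(question, "I'm not sure.")

def run_spreadsheet_eval(agent, cases):
    """Loop over inputs, call the agent, and keep handwritten scores and notes."""
    rows = []
    for case in cases:
        rows.append({
            "input": case["input"],
            "output": agent(case["input"]),
            "score": case.get("score", ""),  # filled in by a human reviewer
            "notes": case.get("notes", ""),  # free-form handwritten notes
        })
    return rows

def to_csv(rows) -> str:
    """Render the rows as CSV text that any spreadsheet tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["input", "output", "score", "notes"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

cases = [
    {"input": "What is your refund window?", "score": 1, "notes": "correct"},
    {"input": "Do you ship overseas?", "score": 0, "notes": "dodged the question"},
]
rows = run_spreadsheet_eval(toy_agent, cases)
sheet = to_csv(rows)
print(sheet)
```

Every tweak to the agent means rerunning the loop and saving another sheet, which is where this approach starts to strain.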

The iceberg: evals are more than just those three things

It would be a really short presentation if this is all evals were. I would thank you for your time and I would walk out of the room, but that's not what you're here for. There is a whole other part of the iceberg.

There are a lot of things that you end up having to build when you're really serious about evals.

Eval is a multi-persona problem

It's also a multi-persona problem, building these agents. It's not just something that engineers do in isolation. It's something that involves engineers, whether they're product engineers or AI engineers or both, and systems engineers to get the thing running.

And then lastly, evals themselves become a systems problem.

Documenting vs experimenting: the limits of the spreadsheet

While you have this spreadsheet of a bunch of input examples, and maybe you keep track, each time you tweak your agent, of the different outputs it emitted, that can become cumbersome to manage over time.

I would call this documenting. It's not really experimenting.

Evals are a team sport

Evals are a team sport, kind of what I was talking about before. We want to make sure that we're bringing a ton of people into the fold, not just technical folks but also nontechnical folks. They can add a lot of value to your agent because of their unique domain expertise and proximity to users. They're probably not coming into the spreadsheet, is my point.

"I can just vibe code Braintrust": the build-your-own-UI stage

Probably one of the most fun conversations that I have in my job is when a very proud product engineer gets on a call with me, and they puff their chest out and they smirk at me and they say, "Well, I can just vibe code Braintrust. It's no problem."

So now you have a better story around persistence of evals. And because of this you're bringing more people into the fold. You are making UIs that are a little bit more bespoke for your specific users. The problem here is that you're still not really iterating yet.

Playground: a sandbox that non-technical users can work with too

Experimentation to me means that you can give a user access to an agent, a configuration of an agent, and a sandbox, and you allow them to tweak certain parameters within that agent.

This is where the rubber starts meeting the road, because the best way to perform evals is to really think about the failure modes that your agent can fall into and build scoring functions around those failure modes.
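As a rough sketch of what "scoring functions around failure modes" could look like, each scorer below targets one named failure mode and returns a number between 0 and 1. The failure modes, function names, and thresholds are all illustrative assumptions, not ones named in the talk.

```python
from typing import Callable, Dict

# A scorer takes the user input and the agent output and returns a score in [0, 1].
Scorer = Callable[[str, str], float]

def no_refusal(user_input: str, output: str) -> float:
    """Assumed failure mode: the agent dodges instead of answering."""
    dodges = ("i'm not sure", "i cannot help", "i don't know")
    return 0.0 if any(d in output.lower() for d in dodges) else 1.0

def within_length(user_input: str, output: str, max_words: int = 50) -> float:
    """Assumed failure mode: the agent rambles past a word budget."""
    return 1.0 if len(output.split()) <= max_words else 0.0

# One entry per failure mode; new failure modes become new entries.
SCORERS: Dict[str, Scorer] = {
    "no_refusal": no_refusal,
    "within_length": within_length,
}

def score_output(user_input: str, output: str) -> Dict[str, float]:
    """Run every registered scorer against a single agent output."""
    return {name: scorer(user_input, output) for name, scorer in SCORERS.items()}

scores = score_output("Do you ship overseas?", "I'm not sure.")
print(scores)
```

Keeping one scorer per failure mode makes it easy to see which specific behavior regressed when a score drops.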

Finding failure modes starts with production traces

The best way to find those failure modes in the first place is to have access to production trace data, i.e. your agent in front of real users and real usage.

The flywheel: observability and evals are the same problem

We want to make sure that we can connect what, at least internally, we call the flywheel. Observability and evals, to us, are actually the same problem from a systems perspective.

Funny story: three years ago when we started, we were only an eval platform, and then we noticed one of our customers was running this massive eval every hour of every day. So we reached out to this person and they said, "Oh yeah, I'm just piping all of my production traffic into this database and I'm running an eval against it."

This is a loop, so it's not just a process. You're going to be performing this loop, hopefully, for the lifetime of the agent that you're pushing to production.
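One turn of that loop might look like the sketch below: score production traces, then promote the failing inputs into the offline eval dataset so the next experiment covers them. The trace shape, `flywheel_step`, and `refusal_score` are hypothetical names chosen for illustration; a real platform would have its own schema.

```python
def refusal_score(output: str) -> float:
    """Toy online scorer: 0.0 when the agent dodges, 1.0 otherwise."""
    return 0.0 if "not sure" in output.lower() else 1.0

def flywheel_step(production_traces, offline_dataset, scorer, threshold=0.5):
    """Add inputs from failing production traces to the offline dataset, skipping duplicates."""
    known = {case["input"] for case in offline_dataset}
    added = []
    for trace in production_traces:
        failed = scorer(trace["output"]) < threshold
        if failed and trace["input"] not in known:
            case = {"input": trace["input"], "source_trace": trace["id"]}
            offline_dataset.append(case)
            known.add(trace["input"])
            added.append(case)
    return added

production_traces = [
    {"id": "t1", "input": "Do you ship overseas?", "output": "I'm not sure."},
    {"id": "t2", "input": "What is your refund window?", "output": "30 days."},
    {"id": "t3", "input": "Do you ship overseas?", "output": "I'm not sure."},
]
offline_dataset = [{"input": "What is your refund window?", "source_trace": None}]

added = flywheel_step(production_traces, offline_dataset, refusal_score)
print(added)
```

Running the same step on every new batch of traces is what turns a one-off process into a loop.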

"If you build it, you have to manage it"

The bad here: if you build it, you have to manage it. So just because you've vibe coded a platform, guess what? You might get a promotion for it, but that's also going to be your job now.

Agent traces are nasty: they are not ordinary application traces

Agent traces specifically, if you look on the screen, are really nasty. They're not like normal application traces. They are really semi-structured. A lot of times they're unstructured. There's just a ton of text inherent to the LLM problems that we're solving.

If you're trying to cram a one gigabyte trace into a Postgres row, that can lead to a lot of performance problems. And they're numerous. It's high velocity because there's so much usage happening in production.

BTQL: the DSL we built and hated ourselves

We used to use an open-source data warehouse for this. And we used to stitch these two sources together through a domain-specific language that we created called BTQL that no one liked, including us. We hated it. And then we would perform a third level of aggregation using DuckDB in the browser.

A customer like Notion, as an example, was sending us a ton of unstructured data. They wanted to be able to perform things like full text search across a trace. None of these technologies are really equipped to perform text-style analytics, which is a challenge with the LLM domain because there's just so much text.
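To make "full text search across a trace" concrete, here is a toy inverted index over span text. It is only a sketch of the general technique under assumed data shapes (`span_id`, `text`); it says nothing about how Braintrust implements search, and it ignores the scale problems the talk is describing.

```python
import re
from collections import defaultdict

def tokenize(text: str):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(spans):
    """Map each token to the set of span ids whose text contains it."""
    index = defaultdict(set)
    for span in spans:
        for token in tokenize(span["text"]):
            index[token].add(span["span_id"])
    return index

def search(index, query: str):
    """Return ids of spans that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

spans = [
    {"span_id": "s1", "text": "User asked about the refund policy for annual plans"},
    {"span_id": "s2", "text": "Tool result: refund window is 30 days"},
    {"span_id": "s3", "text": "Assistant summarized shipping options"},
]
index = build_index(spans)
print(search(index, "refund window"))
```

An in-memory dictionary like this works for a handful of spans; the point in the talk is that the same query over production-scale trace text needs a purpose-built data layer.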

Eval is a systems problem, not a UI/UX problem

Measuring agent quality, performing evals, performing observability: it's actually a systems problem. It's not just a UI/UX problem. We recognize that it's quite easy to vibe code the UI of evals. But it's way, way, way more challenging to create that data layer of running a successful eval and observability platform.

Huge spans: a single span can be 10 to 20 MB

The data comes in really fast. The data are also really large when they come in. A span is just one part of a trace, and a traditional span would be a couple of kilobytes. Here we've seen spans that are 10 or 20 megabytes in size, with just so much context within those spans.

Letting a coding agent loose on the eval platform

Let's say that you want to let a coding agent loose on your evals platform so that you can be a little bit more self-healing: grabbing data in aggregate from your evals platform, using a coding agent to pull that into context, and changing your agent within a coding agent session. That's going to be really challenging to do if you can't run a lot of just pure SQL on the data back end of your evals platform.
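The kind of "pure SQL" aggregate a coding agent might pull into context can be sketched with an in-memory SQLite database. The `eval_results` table, its columns, and the `worst_scorers` helper are assumptions made for this example, not the schema of any real platform.

```python
import sqlite3

# Hypothetical eval results table; a real platform would define its own schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eval_results (experiment TEXT, scorer TEXT, input TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
    [
        ("prompt-v2", "no_refusal", "Do you ship overseas?", 0.0),
        ("prompt-v2", "no_refusal", "What is your refund window?", 1.0),
        ("prompt-v2", "within_length", "Do you ship overseas?", 1.0),
        ("prompt-v2", "within_length", "What is your refund window?", 1.0),
    ],
)

def worst_scorers(conn, experiment: str, limit: int = 3):
    """Return (scorer, average score, row count), lowest average first."""
    query = """
        SELECT scorer, AVG(score) AS avg_score, COUNT(*) AS n
        FROM eval_results
        WHERE experiment = ?
        GROUP BY scorer
        ORDER BY avg_score ASC
        LIMIT ?
    """
    return conn.execute(query, (experiment, limit)).fetchall()

summary = worst_scorers(conn, "prompt-v2")
print(summary)
```

A compact summary like this is small enough to hand to a coding agent as context, instead of asking it to read raw traces.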

Headless evals: users who don't care about the UI

We've actually noticed a lot of these headless-style use cases come up where people aren't interested in the UI at all. The only thing that they're interested in is: how can I perform evals in a way where I can use a Codex or a Claude Code to help increase the quality of my agent for me?

Unknown unknowns: discovering usage patterns through topic modeling

What you can expect to build into your evals platform is the ability to tell folks the unknown unknowns of your agent, i.e. don't make me look across a whole bunch of traces, just tell me how people are using our agent. So you want to be able to uncover those unknown unknowns through topic modeling techniques so that you know where to spend your engineering time.
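The talk does not say which topic modeling technique is used. As a deliberately crude stand-in, the sketch below groups user inputs by their most common keywords, just to show the shape of "tell me how people are using the agent". It is keyword frequency grouping, not real topic modeling, and the stopword list and function names are made up for this example.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real system would use something more complete.
STOPWORDS = {
    "a", "an", "the", "i", "my", "to", "is", "how", "do", "can", "for",
    "of", "what", "me", "on", "in", "it", "you", "your", "get", "with", "as",
}

def keywords(text: str):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def group_by_topic(inputs, num_topics: int = 2):
    """Pick the keywords that occur in the most inputs and bucket inputs under them."""
    doc_freq = Counter()
    for text in inputs:
        doc_freq.update(sorted(set(keywords(text))))  # count each keyword once per input
    topics = [term for term, _ in doc_freq.most_common(num_topics)]
    groups = defaultdict(list)
    for text in inputs:
        words = set(keywords(text))
        label = next((t for t in topics if t in words), "other")
        groups[label].append(text)
    return dict(groups)

inputs = [
    "How do I get a refund for my order?",
    "Refund status for a cancelled subscription",
    "Can you resend my invoice?",
    "Is the refund window 30 days?",
    "Download invoice as PDF",
    "Reset my password",
]
groups = group_by_topic(inputs)
for topic, members in groups.items():
    print(topic, len(members))
```

Even a rough grouping like this points engineering time at the largest buckets first; embedding-based clustering or a proper topic model would replace the keyword step in practice.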

Build for agents, not just humans

You want to make sure that you are building your platform not just for humans but also for agents, because that's one of the main media for how people are creating technology now. We didn't even talk about the non-functional requirements that go into building these platforms, like role-based access control and data masking.

AI proxy: governance through automatic tracing

A consideration is adding automatic tracing through some type of AI proxy or gateway so that people don't even have a choice but to trace their LLMs. You can govern very centrally by adding tracing automatically to your eval platform.
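The idea of tracing by default can be sketched as a thin wrapper that every model call has to pass through. `TracingProxy`, `fake_model`, and the span fields are hypothetical; a real gateway would sit at the network layer and forward to an actual provider.

```python
import time
import uuid

class TracingProxy:
    """Hypothetical gateway: every model call goes through it, so every call is traced."""

    def __init__(self, model_call, sink):
        self._model_call = model_call  # the underlying model function
        self._sink = sink              # any object with .append(), e.g. a list

    def complete(self, prompt, **params):
        start = time.perf_counter()
        span = {"span_id": uuid.uuid4().hex, "input": prompt, "params": params}
        try:
            output = self._model_call(prompt, **params)
            span["output"] = output
            return output
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call escapes tracing.
            span["duration_s"] = time.perf_counter() - start
            self._sink.append(span)

def fake_model(prompt, **params):
    """Stand-in for a real LLM call."""
    return prompt.upper()

trace_log = []
proxy = TracingProxy(fake_model, trace_log)
reply = proxy.complete("hello", temperature=0.0)
print(reply, len(trace_log))
```

Because callers only ever see the proxy, tracing becomes a central policy rather than something each team has to remember to add.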
