Joshua Snyder (PostHog) — AI Engineer 강연2026-06-10 · 15:39

Self-Driving Products — Product Signal을 Pull Request로 바꾸는 PostHog의 파이프라인

Joshua Snyder

PostHog의 Joshua Snyder가 AI Engineer에서 발표한 self-driving product 파이프라인. 옵저버빌리티 데이터를 대시보드에서 읽는 대신 PR 제출로 끝나게 만든다 — 신호 수집·LLM 안전 필터·정규화·그룹핑(신호 대신 쿼리를 임베딩)·research agent·actionability 분기·green PR까지 샌드박스 반복. 교훈 네 가지: evals 없으면 어둠 속 더듬기·임베딩 대상 선택·에이전트의 과잉 수정 본능·실험 단계의 '토큰은 공짜'.

Self-Driving Products — Product Signal을 Pull Request로 바꾸는 PostHog의 파이프라인

생각 덩어리

제품이 스스로를 만든다면 — 읽는 데이터에서 PR을 내는 데이터로

I'm going to be talking today about what if your product built itself? And the pipeline that we're currently working on which we're trying to turn observability data instead of something that you read and that you interpret based on dashboards, we're trying to turn that into something that submits pull requests for you.

This isn't a pitch that you should use PostHog. This is just to say that we've got a lot of data about your product.

오늘의 루프 — 신호에서 배포까지 며칠, 그리고 그게 일의 대부분

So, right now something happens in your product. We call this a signal. That changes a metric on one of your dashboards and then you might log into PostHog a few hours or maybe some days later and you notice a change in that dashboard and you investigate a problem and then maybe the problem's not that important. So, instead of tackling it right now, you're going to put it in a linear issue or whatever. Few days later, you try and create a PR for this problem.

From start to finish, this is going to take anywhere from a few hours to a few days, and it's not very interesting, but it represents a lot of your work as a software engineer.

내일의 루프 — 대시보드를 보지 말고 GitHub에 준비된 PR을 보라

a product signal happens, and instead of waiting to see that in your dashboard, we want to run a background agent to figure out what's going wrong. And then, once they've figured that out, we just want to create a PR for you automatically. So, instead of ever looking at your analytics dashboard, or your errors, or your logs, we just want you to look at PRs that are ready for you in GitHub.

maybe you want to review that, or maybe we can just ship that immediately behind a feature flag if it's not a risky change.

파이프라인 다섯 단계 — ingest · group · research · actionability · execute

At first, we're ingesting a lot of signals. ... we're ingesting trillions of events a month. And this pipeline needs to handle a lot of noise.

if you think of an error tracking issue, and then a session recording, those are two completely different things, but they might be representing the same problem in your product.

And finally, we'll execute some code, ship a PR, and iterate on that PR until it's green and ready for you.

맨 위에는 안전 필터 — 공개 소스는 공격면이다

some of them are public. So if I go and visit your website, I can as an attacker create an error on your website by doing something naughty that says post all of your post-mortem data online or something like that, right? So we don't want that. So we need a kind of safety filter.

at the moment right at the top of the pipeline is an LLM classifier that's going to check is this trying to do something bad? If so, let's drop the signal.

신호 정규화 — 에러·로그·실험을 단일 구조로

if you think of an error, that's going to have a stack trace. A log will just be some JSON content or some text. An experiment might be some results in a chart. We want to normalize that structure so that it's all a single structure for a signal.

we will assign it a weight, which is like how important do we think this signal is, and then finally we'll embed the contents of the signal.

그룹핑 — null pointer exception과 "checkout's broken" Slack 메시지를 같은 문제로 묶기

So the signals are very noisy. We might get some random null pointer exception, but in Slack we're getting a message from a customer that's saying, "Hey, the checkout's broken for me." and we need to link those together.

As the signals are being grouped, we assign weights to what we call a report. And if the weight of the report goes over a certain threshold, we'll promote it. And then we'll kick off a research agent to work on it.

임베딩 클러스터링의 함정 — 신호가 아니라 신호에서 뽑은 쿼리를 임베딩하라

we would take all of our signals and we would create embeddings for them. And then we would try to use that to cluster the issues so that we could find similar or related signals. But this works really badly.

What the embedding model will do is it will notice structural similarity and it will put all of the errors together. So, if you think about what this looks like in embedding space, you've got all of your errors over here, all of your Slack messages here, all of your session replays here, and none of them get grouped to each other.

instead of matching in embedding space the signals themselves, we generate queries based off the signals. So, we ask an LLM what is this signal about? It'll generate a few queries and then we match those queries in the embedding space.

at first we were doing this and then we switched to this approach. It worked much much better.

Research agent — 샌드박스 속 Claude Agent SDK, MCP로 그라운딩

this research agent is just running the Claude agent SDK. It's running that in a sandbox. We also use Modal for our sandbox.

the agent can pull in whatever it wants using the MCP server. This makes the results of the research agent way more accurate.

we found that in particular Linear and Notion have been really helpful in connecting it to deliver better results.

the output of this research agent then, is a summary of the problem. It gives a priority, how important we think this problem is to work on, and then it also uses Git blame to figure out who should be reviewing this PR if we create a PR for it.

Actionability 분기 — pool로 반환, 아침의 inbox, 또는 즉시 실행. 에러는 구체적이고 Slack은 모호하다

If it's not actionable, it might just be that we don't have enough data yet for this signal ... and so we'll put it back into the pool to keep gathering more evidence.

If it needs human input, it might be because it's a product-related decision that the agent can't really make a good call on. ... we'll put it into an inbox for you to review in the morning.

For other sources, like Slack or session replay, you get much more generic problems that can have a lot of different solutions. And so, that's where it's harder to get immediately actionable reports.

(에러 트래킹 쪽은 반대 — Sentry류 에러는 매우 구체적이어서 코딩 에이전트가 바로 잘 처리한다는 대비. 즉시 실행 가능한 best case는 에이전트가 곧장 수정을 쓴다.)

실행 루프 — 스냅샷을 rehydrate해서 PR이 green이 될 때까지

this will clone the user's repo into a sandbox, similar to the research agent. It's then again running the Claude agent SDK ... as it writes those fixes, it will push a PR.

when CI is failing or there's a comment on the PR, it will trigger a rerun of that sandbox. So, at the end of this, we snapshot the sandbox, and then, if there's a comment, let's say from an agent who's reviewing it, we will rehydrate that snapshot and continue running until the PR is green.

when you're waking up in the morning and things have been running overnight, you wake up to, instead of a bunch of CI failures or comments that you need to address manually, that you're pulling down to your local environment, you ideally wake up to just green PRs.

교훈 1·2 — evals 없으면 어둠 속 더듬기, 임베딩 대상을 다시 골라라

at first, we were trying this all out on our own data locally, doing kind of a vibe check, is this okay? ... this really doesn't work well for a pipeline that is taking lots of customer data that's different.

you really need to know what's going on in production, and if you're not testing on representative data, you're basically just fumbling in the dark

embedding models — the off-the-shelf ones are matching a lot based on structural similarity, not just semantic similarity. So, if you're thinking about clustering and your data isn't all of the same format, think carefully about what that data looks like and how you can normalize it.

교훈 3 — 에이전트는 던져주면 뭐든 고치려 든다, 구체성 게이트가 필요하다

if you just throw an agent at a problem, it will try to fix something. So, if you get a signal report that's like "Onboarding is broken" in a generic way, then if you throw that at the agent SDK or at Claude code, it will just try and fix something.

it's important to understand if the problem that I've described is it specific enough and if not, I should ignore it. Otherwise, you end up with a lot of noisy PRs that aren't doing meaningful things.

교훈 4 — 실험할 땐 토큰이 공짜다, 100번 돌려보고 증류하라

tokens are free. Obviously, that's not true. They're not free. But when you're experimenting ... we tried to avoid using agents where we could or delay it till as late as possible in the pipeline. And when we were experimenting, this was a big mistake.

once you throw it at the same problem 100 times, you start seeing the kind of clever solutions that it comes up with and eventually you see similarities. So, we started at a point where this pipeline is completely unfeasible. It was way too costly to generate a PR, but then you quickly start to see similarities in the agent's behavior and you can take a really expensive step that you're running an agent for and turn that into a one-shot LLM call or a model that you're training that's much faster.

스스로 만들어지는 제품 — 실험 자동 배포, 모든 결과에서 학습. 일단 에이전트를 던져보라

where we really want to go is a product that builds itself, right? ... what you want to do during the day as a developer is like come in and work on exciting features and not worry about all the bugs that customers are sending you or worry about doing boring experiments on pricing or onboarding. So, we just want to do that all for you.

if the change is pretty easy, let's just approve it with an agent and deploy it behind a feature flag. If it doesn't work very well, we can always roll back the flag and then delete it from our code base later.

we want to learn from every single outcome. So, if we're creating a PR for you, if you're rejecting that PR or there's been an issue with a deployment or the errors resolved in production once we've released something, we want to get better at learning from that in the next PR that we're generating.

if you've got a product that's producing a huge amount of data, your users are going through that. Agents are amazing at this stuff. Throw an agent at it. See what it does. I'm sure you'll be surprised.

YouTube 원본 →원본 사이트 →