Latent Space

Extreme Harness Engineering for Token Billionaires: Ryan Lopopolo (OpenAI Frontier)

swyx · Vibhu

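An experiment in which the OpenAI Frontier team shipped a 1M-LOC Electron app over five months with zero lines of human-written code. The essence of harness engineering is to ask not 'what should I prompt better when the model fails?' but 'which capability, context, or structure is missing?' Topics: Symphony, Ghost Libraries, one-minute builds, agent-legible software, and the realization that humans are now the bottleneck.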

Extreme Harness Engineering for Token Billionaires

Thought chunks

"I don't write the code" as a constraint: the agent as an isomorphic engineer

Starting with this constraint of I can't write the code meant that the only way I could do my job was to get the agent to do my job.

If we're trying to make agents that can be deployed into enterprises, they should be able to do all the things that I do. And having worked with these coding models, these coding harnesses over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough where they're isomorphic to me in capability and the ability to do the job.

The first month and a half was 10x slower: paying that cost is what made everything after it fast

Whenever the model just cannot, you always pop open the task, double click into it, and build smaller building blocks that then you can reassemble into the broader objective.

Honestly, the first month and a half was 10 times slower than I would be. But because we paid that cost, we ended up getting to something much more productive than any one engineer could be because we built the tools, the assembly station for the agent to do the whole thing.

Overhauling the build system with every model generation: Make → Bazel → Turbo → Nx

With 5.3 and background shells, it became less patient, less willing to block. So we had to retool the entire build system to complete in under a minute.

This is not a thing I would expect to be able to do in a code base where people have opinions. But because the only goal was to make the agent productive over the course of a week, we went from a bespoke makefile build to Bazel, to Turbo to Nx and just left it there because builds were fast at that point.

One-minute builds: a discipline that is impossible on a human team, possible because tokens are cheap

The hill we were climbing, right, was make it fast.

We want the inner loop to be as fast as possible. One minute was just a nice round number and we were able to hit it.

Because tokens are so cheap, and we're so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means there's way less dispersion in the code and the SDLC.
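
An invariant like the one-minute build is easier to maintain when something checks it on every run. A minimal sketch, assuming a gate script wrapped around the build command; `timed_run`, `gate`, and `BUDGET_SECONDS` are hypothetical names, and only the one-minute figure comes from the talk:

```python
import subprocess
import time

BUDGET_SECONDS = 60.0  # the "nice round number" from the talk


def timed_run(cmd: list[str]) -> tuple[int, float]:
    """Run cmd and return (exit code, wall-clock seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd, check=False)
    return result.returncode, time.monotonic() - start


def gate(exit_code: int, elapsed: float, budget: float = BUDGET_SECONDS) -> int:
    """Return the exit code to report: a green build over budget still fails."""
    if exit_code != 0:
        return exit_code
    return 0 if elapsed <= budget else 1
```

A CI step or agent script could call `timed_run` on whatever the build command is and exit with `gate(code, elapsed)`, so a slow build shows up as a failure the agent has to fix.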

Humans are the bottleneck: the only scarce resource is the team's synchronous attention

We've moved beyond even the humans reviewing the code as well. Most of the human review is post merge at this point.

The model is trivially parallelizable, right? As many GPUs and tokens as I am willing to spend, I can have capacity to work with my code base. The only fundamentally scarce thing is the synchronous human attention of my team.

You have to step back, right? Like you need to take a systems thinking mindset to things and constantly be asking where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward?

Don't drop Codex into an environment; let Codex boot the environment

One neat thing here is we have tried to invert things as much as possible, which is instead of setting up an environment to spawn the coding agent into, instead we spawn the coding agent, like that's the entry point. It's just Codex. And then we give Codex via skills and scripts the ability to boot the stack if it chooses to.

This I think is like the fundamental difference between reasoning models and the 4.1s and 4os of the past, where these models could not think so you had to put them in boxes with a predefined set of state transitions. Whereas here we have the model, the harness be the whole box. And give it a bunch of options for how to proceed with enough context for it to make intelligent choices.

Models crave text: skills, tech tracker, quality score

The models fundamentally crave text. So a lot of what we have done here is figure out ways to inject text into the system.

When we get a page, because we're missing a timeout, for example, I can just add Codex in Slack on that page and say, I'm gonna fix this by adding a timeout. Please update our reliability documentation to require that all network calls have timeouts. So I have not only made a point in time fix, but also like durably encoded this process knowledge around what good looks like.
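
A written rule such as "all network calls have timeouts" can also be backed by a mechanical check. The sketch below is an illustration only, not something described in the talk: it scans Python source (chosen for brevity, not the team's stack) with the standard `ast` module. The `NETWORK_CALLS` name list is an assumption, and the name-based matching is crude enough to also flag unrelated calls such as `dict.get`.

```python
import ast

# Hypothetical rule: calls with these names must pass a timeout= keyword.
NETWORK_CALLS = {"get", "post", "put", "delete", "urlopen"}


def calls_missing_timeout(source: str) -> list[int]:
    """Return line numbers of network-style calls that pass no timeout keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # obj.get(...) exposes the name as .attr, a bare urlopen(...) as .id
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        if name in NETWORK_CALLS and not has_timeout:
            missing.append(node.lineno)
    return sorted(missing)
```

Run against a file, the returned line numbers give the agent a short, text-only list of places that violate the documented rule.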

Don't get bullied by the reviewer: give both sides permission to push back

Initially the Codex driving the code author was willing to be bullied by the PR reviewer, which meant you could end up in a situation where things were not converging.

The reviewer agents were instructed to bias toward merging the thing to not surface anything greater than a P2 in priority... We gave it a framework within which to score its output.

On the code authoring agent side, we also gave it the flexibility to either defer or push back against review feedback... without the context that this is permissible, the coding agents are gonna bias toward what they do, which is following instructions.
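
The two halves of that arrangement can be sketched as a pair of small policies. This is a hypothetical model, not the team's prompts: it assumes "P2" means priority level 2 with P0 as the most severe, that the reviewer stays quiet about anything less severe than P2, and that the author's own judgment arrives as an `author_agrees` flag.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    FIX = "fix"
    DEFER = "defer"
    PUSH_BACK = "push_back"


@dataclass(frozen=True)
class Finding:
    priority: int          # 0 means P0 (most severe); larger is less severe
    message: str
    in_scope: bool = True  # hypothetical: does the fix belong in this PR?


def surface(findings: list[Finding], threshold: int = 2) -> list[Finding]:
    """Reviewer side: stay quiet about anything less severe than the threshold."""
    return [f for f in findings if f.priority <= threshold]


def respond(finding: Finding, author_agrees: bool) -> Response:
    """Author side: P0s are always fixed; otherwise deferring or pushing back is allowed."""
    if finding.priority == 0:
        return Response.FIX
    if not author_agrees:
        return Response.PUSH_BACK
    return Response.FIX if finding.in_scope else Response.DEFER
```

Capping what the reviewer raises and letting the author decline are both needed for the loop to converge; either one alone still lets the two agents trade comments indefinitely.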

Full delegation: the $land skill handles the entire PR lifecycle

We invoke a $land skill and that coaches Codex to push the PR, wait for human and agent reviewers, wait for CI to be green. Fix the flakes if there are any, merge upstream. If the PR comes into conflict, wait for everything to pass. Put it in the merge queue. Deal with flakes until it's in main.

This is what it means to delegate fully... in a very large monorepo, probably a significant tax on humans to get PRs merged, but the agent is more than capable of doing this and I really don't have to think about it other than keep my laptop open.
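
The lifecycle described for `$land` reads like a small state machine. The sketch below is one possible reading of the quote, not the skill itself (which is instructions for Codex rather than code); the `Step` names and the `approved`, `ci_green`, and `conflict` observations are assumptions made for illustration.

```python
from enum import Enum, auto


class Step(Enum):
    PUSH = auto()
    AWAIT_REVIEW = auto()
    AWAIT_CI = auto()
    FIX_FLAKE = auto()
    MERGE_UPSTREAM = auto()
    MERGE_QUEUE = auto()
    LANDED = auto()


def next_step(step: Step, *, approved: bool = True, ci_green: bool = True,
              conflict: bool = False) -> Step:
    """Pick the next step from the current one plus what was just observed."""
    if step is Step.PUSH:
        return Step.AWAIT_REVIEW
    if step is Step.AWAIT_REVIEW:
        return Step.AWAIT_CI if approved else Step.AWAIT_REVIEW
    if step is Step.AWAIT_CI:
        if conflict:
            return Step.MERGE_UPSTREAM
        return Step.MERGE_QUEUE if ci_green else Step.FIX_FLAKE
    if step in (Step.FIX_FLAKE, Step.MERGE_UPSTREAM):
        return Step.AWAIT_CI  # re-run CI after any change
    if step is Step.MERGE_QUEUE:
        return Step.LANDED if ci_green else Step.FIX_FLAKE
    return Step.LANDED


def land(observations: list[dict]) -> list[Step]:
    """Replay a list of observations and return the steps visited."""
    trace = [Step.PUSH]
    for seen in observations:
        trace.append(next_step(trace[-1], **seen))
        if trace[-1] is Step.LANDED:
            break
    return trace
```

Every failure path loops back to `AWAIT_CI`, which matches the "deal with flakes until it's in main" behavior in the quote.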

In-house your dependencies: debugging and security fixes become much easier

Software dependencies are going away. Basically they can just be vendored.

A couple thousand line dependency is a thing that we could in-house no problem. ... most of that code you don't even need. Like by in-housing an abstraction, you can strip away all the generic parts of it and only focus on what you need to enable the specific thing you're building.

When we deploy Codex Security on the repo, it is able to deeply review and change the internalized dependencies in a much lower friction way than it would be to push patches upstream, wait for them to be released, pull them down, make sure that's compatible with all the transitive.

Human legibility vs. agent legibility: building a local viewer was actually the wrong move

We asked them to export a trace for us... our on-call engineer did a fantastic job of working with Codex to build this beautiful local devrel tool, a Next.js app where you drag and drop the tarball in, and it visualizes the entire trace. It's fantastic. Took an afternoon, but none of this was necessary. Because you could just spin up Codex and give it the tarball and ask the same thing and get the response immediately.

So in a way, optimizing for human legibility of that debugging process was wrong. It kept him in the loop unnecessarily when instead he could have just let Codex cook for five minutes and gotten the same thing.

Ghost Libraries: distributing software as a spec instead of source

The way we distribute it as a spec, which I think folks are calling Ghost Libraries on Twitter. ... It does mean it becomes much cheaper to share software with the world, right? You define a spec, how you could build your own.

Spin up a team of Codex to implement the spec. Wait for it to be done. Spawn another Codex and another team to review the implementation compared to upstream and update the spec so it diverges less. And then you just loop over and over Ralph style until you get a spec that is with high fidelity able to reproduce the system as it is.
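
The loop in that quote has a simple shape: implement from the spec, compare against upstream, revise the spec, repeat. A minimal sketch, assuming the three agent steps are passed in as plain callables; `refine_spec`, its parameter names, and the `max_rounds` cap are all hypothetical:

```python
from typing import Callable


def refine_spec(
    spec: str,
    implement: Callable[[str], str],
    compare: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 10,
) -> tuple[str, int]:
    """Loop implement -> compare -> revise until the rebuild matches upstream.

    Returns the final spec and the number of rounds that were run.
    """
    for round_no in range(1, max_rounds + 1):
        implementation = implement(spec)        # one team builds from the spec
        divergences = compare(implementation)   # another team diffs it against upstream
        if not divergences:
            return spec, round_no
        spec = revise(spec, divergences)        # fold the gaps back into the spec
    return spec, max_rounds
```

In practice each callable would be a fresh agent run, which is why the loop passes only the spec and the list of divergences between rounds and keeps no other state.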

Hard × new: the only quadrant humans still have to drive

There are these axes where things are easy or hard, and established or new, right? And I think things that are hard and new are still something that the models need humans to drive.

Those other quadrants are largely solved. Given the right scaffold and the right thing that's gonna drive the agent to completion.

The humans, the ones with limited time and attention get to work on the hardest stuff, like the problems where it's pure white space out in front. Or like the deepest refactorings where you don't know what the proper shape of the interfaces are.

Symphony: Elixir orchestration that removes the human from the terminal

At the end of December we were at about three and a half PRs per engineer per day. This was before 5.2 came out... Everyone gets back from holiday with 5.2 and no other work on the repository. We were up in the five to 10 PRs per day per engineer. And I don't know about y'all, but like it's very taxing to constantly be switching like that.

In Symphony, there's this like rework state where once the PR is proposed and it's escalated to the human for review, it should be a cheap review. It is either mergeable or it is not. And if it's not, you move it to rework. The Elixir service will completely trash the entire work tree and PR and start it again from scratch.

This is that opportunity again to say, why was it trash? What did the agent do that was bad. Fix that before moving the ticket to in progress again.
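
The rework flow can be pictured as a handful of ticket transitions. The sketch below is written in Python for illustration even though Symphony is described as an Elixir service. The state names, the `Ticket` fields, and the rule that a restart must record a lesson are assumptions layered on the quote.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    IN_PROGRESS = "in_progress"
    HUMAN_REVIEW = "human_review"
    REWORK = "rework"
    MERGED = "merged"


@dataclass
class Ticket:
    title: str
    state: State = State.IN_PROGRESS
    attempt: int = 1
    worktree: list[str] = field(default_factory=list)  # stand-in for files on disk
    lessons: list[str] = field(default_factory=list)   # what was fixed before retrying


def propose(ticket: Ticket) -> None:
    """The agent opens a PR and escalates to a human."""
    ticket.state = State.HUMAN_REVIEW


def review(ticket: Ticket, mergeable: bool) -> None:
    """The cheap human review: mergeable or not, nothing in between."""
    ticket.state = State.MERGED if mergeable else State.REWORK


def restart(ticket: Ticket, lesson: str) -> None:
    """Throw the attempt away and start over, recording what was fixed first."""
    if ticket.state is not State.REWORK:
        raise ValueError("only tickets in rework can be restarted")
    ticket.lessons.append(lesson)
    ticket.worktree.clear()
    ticket.attempt += 1
    ticket.state = State.IN_PROGRESS
```

Requiring a `lesson` argument on `restart` mirrors the point of the quote: the ticket does not go back to in progress until someone has said why the attempt was thrown away.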

Architecture for a 10,000-person org: a seven-person team, but 500 packages

We have such a rigid, like 10,000 engineer level architecture in the app because we have to find ways to carve up the space so people are not trampling on each other.

The structure of the repository is like 500 NPM packages. It's architecture to the excess for what you would consider, I think normal for a seven person team. But if every person is actually like 10 to 50, then like numbers on being super deep into decomposition and sharding and proper interface boundaries make a lot more sense.

Skill distillation: have the agent study its own session logs

There's one neat thing you can do with Codex, which is just point it at its own session logs to ask it to tell you how you can use the tool better.

We're actually slurping these up for the entire team into blob storage and running agent loops over them every day to figure out where as a team can we do better and how do we reflect that back into the repositories? So that everybody benefits from everybody else's behavior for free.

These are all feedback. That means the code as written, deviated from what was good, a PR comment, a failed build. These are all signals that mean at some point the agent was missing context.
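
Before any agent loop runs over the logs, the feedback signals can be tallied cheaply. A minimal sketch, assuming the logs are JSON lines with `kind` and `topic` fields; that format and the field names are hypothetical, while PR comments and failed builds are the signals named in the quote:

```python
import json
from collections import Counter

# Hypothetical event kinds, named after the signals mentioned in the talk.
FEEDBACK_KINDS = {"pr_comment", "failed_build"}


def tally_feedback(jsonl: str) -> Counter:
    """Count feedback events per (kind, topic) across a JSON-lines session log."""
    counts: Counter = Counter()
    for raw in jsonl.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines between events
        event = json.loads(raw)
        if event.get("kind") in FEEDBACK_KINDS:
            counts[(event["kind"], event.get("topic", "unknown"))] += 1
    return counts
```

The most frequent pairs point at where context is missing most often, which is a reasonable place to start writing a new skill or doc.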

Skepticism about MCP: a structure that force-injects tokens

MCPs I'm pretty bearish on because the harness forcibly injects all those tokens in the context, and I don't really get a say over it. They mess with auto compaction. The agent can forget how to use the tool. There's probably only, what, three calls in Playwright that I actually ever want to use. So I pay the cost for a ton of things.

Somebody vibed a local daemon that boots Playwright and exposes a tiny little shim CLI to drive it. And I had zero idea that this had occurred because to me, I run Codex and it's like, oh, it's better.

Agent-friendly CLI output: token efficiency is friendliness

CLIs are nice because they're super token efficient and they can be made more token efficient really easily. ... in our pnpm distributed script runner, when you do --recursive, it produces an absolute mountain of text. But all of that is for passing test suites. So we ended up wrapping all of this in another script which outputs only the failing parts of the tests.

You're gonna want to patch --silent to prettier because the agent doesn't care that every file was already formatted. Just wants to know it's either formatted or not.
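
A wrapper of the kind described for the test runner can be very small. The sketch below assumes a made-up output format in which every line is prefixed `<package>: ` and failures contain the marker `FAIL`. Real runner output differs, so the parsing is illustrative only.

```python
from collections import defaultdict


def failing_only(lines: list[str], marker: str = "FAIL") -> list[str]:
    """Keep output only for packages whose lines contain the failure marker."""
    by_package: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        package, _, text = line.partition(": ")
        by_package[package].append(text)
    kept = []
    for package, texts in by_package.items():
        if any(marker in text for text in texts):
            kept.extend(f"{package}: {text}" for text in texts)
    # One short line replaces the mountain of passing output.
    return kept or ["all packages passed"]
```

The whole output of a failing package is kept so the agent still sees the assertion message next to the failing test name.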

UI as text, too: rasterize, then OCR

We want the agent to be able to see the UI. Agents do not perceive visually in the same way that we do. They don't see a red box, they see red box button, right? They see these things in latent space.

If we wanna actually make it see the layout, it's almost easier to rasterize that image to ask OCR and feed it in to the agent.

What current models can't do: zero-to-one and gnarly refactoring

They're definitely not there on being able to go from new product idea to prototype in a single shot.

The gnarliest refactorings are the ones that I spend my most time with, right? The ones where I'm interrupting the most, the ones where I am now double clicking to build tooling to help decompose monoliths.

The only thing that's different is figuring out how to get what's in here into context for the model and for these white space sort of projects. I, myself, I'm just not good at it. Which means that often over the agent trajectory, I realize the bits that we're missing, which is why I find I need to have this synchronous interaction.

The tension between harness and training: build on-policy

There's this fundamental tension between whether or not we invest deeper into the harness or we invest deeper into the training process to get the model to do more of this by default. And I think success for the way we are operating here means the model gets better taste because we can point the way there and none of the things we have built actively degrade agent performance.

If we were building an entire separate rust scaffold around Codex to restrict its output, that I think would be like additional harness that would be prone to being scrapped. But... if instead we can build all the guardrails in a way that's just native to the output that Codex is already producing, which is code, I think no friction with how the model continues to advance.

The RL equivalent is on policy versus off policy. And you're basically saying that you should build an on policy harness, which is already within distribution and you modify from there. But if you build it off policy, it's not that useful.

core_beliefs.md: injecting even memes and Slack culture into the agent

One thing that's in core_beliefs.md is who's on the team, what product we're building, who our end customers are, who our pilot customers are, and what the full vision of what we want to achieve over the next 12 months is. These are all bits of context that inform how we would go about building the software. So we have to give it to the agent too.

One thing that I think is gonna break your mind even more is we have skills for how to properly generate deep fried memes and have Jira culture and Slack. Because with the Slack ChatGPT app that you're able to use in Codex, like I can get the agent to shit post on my behalf.
