Harness Engineering: How to Build Software When Humans Steer, Agents Execute
Ryan Lopopolo
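Keynote and Q&A on Harness Engineering, delivered by Ryan Lopopolo of OpenAI Frontier at the AI Engineer conference. A nine-month experiment in which the team was "banned from even touching their editors": the models are already engineers isomorphic to us, implementation is no longer scarce, and what is scarce is human time, attention, and the context window. Treat code as a disposable build artifact, and rather than spawning Codex into an environment, let Codex bring the environment up.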
Thought chunks
Becoming a token billionaire: the constraint of making the models do the full job
For the last nine months I have had the privilege of building software exclusively with agents. I am a token billionaire and I believe that in order for us to get into our AGI future, we want everybody to be token billionaires to use the models to do the full job.
I've lived that experience by banning my team from even touching their editors, to have to work through the models in order to get the job done.
Implementation is no longer the scarce resource: code is free
Implementation is no longer the scarce resource of what it means to do the job of software engineering. Code is free.
Hiring the hands on the keyboards as part of our teams is only constrained by GPU capacity and token budgets. And each engineer today in this room has access to five, 50, or 5,000 engineers' worth of capacity 24/7, every day of the year.
GPT 5.2: the magic moment when the full job became possible
For me, the magic moment was GPT 5.2, which when it came out was able to do the full job of a software engineer. The models at this point are good enough that they're isomorphic to you and me in terms of the ability to produce high-quality code that solves real user problems in real codebases.
Code is free, but synchronous human attention is not
We think of code as burden because it's a synchronous attention drain on the human engineers on our team. But the models are incredibly patient. They are infinitely parallel.
The scarce resources in this world that we see today are three things: human time, human and model attention, and model context window.
A world where P3s disappear: four attempts in parallel, pick one
In a world where human time is scarce and human time is required to produce code, we have a stack rank. Things are either P0s or P2s. Those P3s will never get done. However, in a world where code is free and infinitely abundant, all those P3s get kicked off immediately, maybe 4x in parallel. We pick one that solves the problem and in it goes.
Every internal tool gets i18n from day one: quality without a trade-off cost
When code is free, all these internal tools can have good localization and internationalization from day one. I can make tools that my colleagues in London, Dublin, Paris, Brussels, Zurich, and Munich are able to experience in their native languages without really having to trade against any of my other teams' capacity.
Humans need not care about implementation: what matters is the prompt and the guardrails
Humans no longer need to concern themselves with implementation. The important thing is not the code but the prompt and the guardrails that got you there. This is why leaving breadcrumbs matters: documentation, ADRs, persona-oriented documentation around what a good job looks like.
The migration that has sat for six months can now be finished
Large-scale refactoring in this world is free. So making things the same is something that you are all able to do. There's never going to be a migration that hangs open for six months because you can't get the last parts of the codebase done, because you can just fire off 15 agents to drive that work to completion.
What it means to do a good job: 500 little decisions
To do a single patch well probably requires 500 little decisions along the way around the underspecified non-functional requirements that go into producing good code. The agents, the models during their training have seen trillions of lines of code that make every possible choice of those non-functional requirements that you could ever imagine.
You can just simply say do not produce slop. Don't accept slop. You won't get slop in your codebase.
One person's expertise in every agent trajectory: persona-based documentation
I have a diverse full-stack team of experts in front-end architecture, backend scalability, being product-minded. ... Getting teammates to write those down actually means that every engineer driving agents gets the best of every single person on my team. I don't need to block on low-signal code review in order to learn what it means to write a good QA plan.
Context gets paged out: refresh it continually with reviewer agents
Auto compaction... GPT 5.4 and Codex is fantastic at auto compaction. I essentially never have to write /new anymore. ... in this world, you have to kind of build for that expectation that context will get paged out over time. We need to be continually refreshing context as the agent goes about doing a task.
We have security and reliability review agents in our codebase that run continually as part of every push and CI, that look at that documentation and the proposed patch and do simple things like ask: are there timeouts and retries on this bit of network code?
Admit that you are an unreliable reviewer, then make the fix permanent with a lint
I'm sure everyone here has been paged at some point for network code that failed in production... I am not a reliable reviewer or author of code with respect to this non-functional requirement. However, taking the time to write some docs, write a lint that is bespoke to my codebase that is going to look at every time I call fetch to make sure that there's a retry and a timeout wrapped around it means I've durably solved this problem.
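A bespoke lint of this kind could look roughly like the sketch below. It is a text-based heuristic, not the speaker's actual lint: the wrapper name `fetchWithRetry`, the `Violation` shape, and the function `lintBareFetch` are all hypothetical, and a production rule would more likely walk the AST than scan lines.

```typescript
// Assumption: `fetchWithRetry` is a hypothetical project wrapper that adds a
// timeout and retries; every network call is expected to go through it.
type Violation = { file: string; line: number; message: string };

// Matches `fetch(` when it is not the tail of a longer identifier, so
// `fetchWithRetry(` and `prefetch(` pass. Being a text heuristic, it also
// flags member calls such as `client.fetch(` and matches inside comments.
const BARE_FETCH = /(^|[^\w$])fetch\s*\(/;

function lintBareFetch(file: string, source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    if (BARE_FETCH.test(text)) {
      violations.push({
        file,
        line: index + 1, // report 1-based line numbers
        message: "Bare fetch call: wrap it in fetchWithRetry so it has a timeout and retries.",
      });
    }
  });
  return violations;
}
```

Run over every source file in CI, a non-empty result fails the build, so the requirement no longer depends on a reviewer remembering it.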
Tests about the source code: even a 350-line limit is possible
One really neat trick I use here is that you can write tests about the source code as well that are separate from lints. If we know that context is limited, we can write a test that asserts that files are no longer than 350 lines. We're adapting our codebase to the harness and to the models, doing a little bit of engineering to be context efficient.
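As a rough illustration of a test about the source code, the sketch below checks file lengths against the 350-line limit mentioned in the talk. The function names and the in-memory `files` map are assumptions made to keep the example self-contained; a real test would read the files from disk inside whatever test runner the repository uses.

```typescript
const MAX_LINES = 350; // the limit quoted in the talk

// Counts lines, treating a trailing newline as ending the last line
// rather than starting a new, empty one.
function countLines(source: string): number {
  if (source.length === 0) return 0;
  const endsWithNewline = source.charAt(source.length - 1) === "\n";
  const body = endsWithNewline ? source.slice(0, -1) : source;
  return body.split("\n").length;
}

// `files` maps a path to its contents (hypothetical input shape).
// Returns one message per file that exceeds the limit.
function oversizedFiles(
  files: Record<string, string>,
  maxLines: number = MAX_LINES,
): string[] {
  return Object.keys(files)
    .filter((path) => countLines(files[path]) > maxLines)
    .map(
      (path) =>
        `${path}: ${countLines(files[path])} lines exceeds the ${maxLines}-line limit; split it into smaller modules.`,
    );
}
```

The test passes when `oversizedFiles` returns an empty list; otherwise the returned messages tell the agent which files to split.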
Error messages are prompts: Zod is load-bearing infrastructure for our AI future
Provide a prompt via a lint or a test failure that says: no, no, no, you shouldn't have an unknown here at all, because we parse, don't validate, at the edge, and you certainly have a type here which was derived from Zod. Zod: load-bearing infrastructure for our AI future.
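One way to read "the error message is the prompt" is that each failure should carry the reason and the fix, not just the location. The sketch below assumes a hypothetical `Rule` shape and rule id, uses a plain regular expression instead of a real type-aware lint, and does not depend on Zod itself.

```typescript
// Hypothetical rule shape: the failure text states the rule, why it exists,
// and what to do instead, so it can be fed straight back to the agent.
type Rule = { id: string; pattern: RegExp; why: string; fix: string };

const noUnknownPastTheEdge: Rule = {
  id: "no-unknown-past-the-edge",
  pattern: /:\s*unknown\b/, // crude text match for an `unknown` annotation
  why: "We parse, don't validate, at the edge, so values past the boundary already have a parsed type.",
  fix: "Use the type derived from the schema instead of `unknown`.",
};

function failureMessage(rule: Rule, file: string, line: number): string {
  return [
    `${file}:${line} violates ${rule.id}.`,
    `Why: ${rule.why}`,
    `Fix: ${rule.fix}`,
  ].join("\n");
}

function check(rule: Rule, file: string, source: string): string[] {
  const messages: string[] = [];
  source.split("\n").forEach((text, index) => {
    if (rule.pattern.test(text)) {
      messages.push(failureMessage(rule, file, index + 1));
    }
  });
  return messages;
}
```

The design choice is that the message is written for the agent that will read it on failure: it names the convention and the remedy, so no human has to restate them in review.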
Everywhere is a prompt injection point: prompts, powers, rules files, and skills are all prompts
Each advancement we've had in the complexity of the way we write code to interact with these models comes from both increasing capability in the models and increasingly niche ways of injecting prompts into those models. Prompts, I'm sure you're aware, are prompts. Powers: prompts. Rules files: prompts. Skills: prompts. These lint error messages... prompts. Review agents that inject comments onto the PR.
Prompts that write prompts that write prompts: recursive leverage
If I find myself spending a ton of time writing prompts, we can actually shell out to the agent for that as well. I've pointed Codex at all of the prompting cookbooks we have on the OpenAI developer guide and told it to synthesize a skill out of them for how to write prompts. Which means when I find a need to write prompts in order to improve my agent performance locally in the code, I use the skill to write prompts that I wrote with the agent looking at the prompts to write the prompts.
The leverage of a QA plan: what one person writes down reaches every agent
A single product-minded engineer on my team was able to give us a big lift. They know what it means to write a good QA plan. ... Once you write those down on how to write a good QA plan with the expectation that all user-facing work has a QA plan, now a review agent is able to assert expectations around what it means to prove that you have effectively written the feature.
Just go build: take yourself out of the loop
Do not hesitate to remove yourselves from the loop by getting the agents to do the full job because they can.
Q&A: live conference talk
A setup with no laptop: Beach, Margarita, Linear
Can you show us your actual working setup with no laptop? — Beach, Margarita, Linear, right?
Don't spawn Codex into an environment; let Codex bring the environment up
We want the entry point to the development process to be Codex, not an environment which we build around it. So we kind of do things outside in. Codex is the entry point the same way you would be, and we give it tools and instructions on how to cook. So rather than, like, creating a shell that our app and Codex get spawned into, we have a skill that teaches Codex how to launch the app and how to spin up that local observability stack to give it logging and telemetry.
Concentrate leverage in 5 to 10 skills: deep rather than wide
We kind of centralize our leverage around five to 10 skills. We don't go super wide on skills preferring to make the existing skills better because at least I find that the infrastructure within the repository all the local developer tools change super frequently and I don't really have the bandwidth to keep track of this. So we hide all that complexity beneath the skills.
When we moved from using Chrome DevTools protocol directly to having this like Daemon thing — I didn't know that had happened for like three weeks. It was like totally fine because Codex was able to do the thing.
A harness the bitter lesson does not absorb: context never becomes obsolete
How do I make sure the work that I do isn't completely obsoleted by an increase in model capability... Context is a thing that I don't think will ever be obsoleted. The models must be told the requirements of the task and which guardrails to pay attention to. So a good harness is really operationalized around giving the model text at the right time so it can look at the work it has done and the information around what a good job looks like.
Just-in-time prompts: don't front-load
You don't want to front-load all those instructions, because then you kind of overwhelm the agent, but all of these sort of requirements around what a good job is do need to be paid attention to over the entire course of a PR. So figuring out ways to either defer or just-in-time surface those instructions is kind of what a good harness should do.
You should kind of let the agent cook and prototype and experiment with the UI you want to build and then at lint or test time say, "Okay, you've done the work. In order to finish it, you have to break this apart so that your components are small and as stateless as possible."
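The idea of deferring instructions to the stage where they apply can be sketched as a small lookup. The stage names, the `Instruction` shape, and the first instruction text are hypothetical; the component rule paraphrases the example from the talk.

```typescript
// Hypothetical stages at which a harness can surface text to the agent.
type Stage = "start" | "lint" | "test" | "review";
type Instruction = { stage: Stage; text: string };

const instructions: Instruction[] = [
  { stage: "start", text: "Prototype and experiment with the UI you want to build." },
  {
    stage: "lint",
    text: "To finish, break this apart so components are small and as stateless as possible.",
  },
];

// Returns only the instructions that belong to the current stage, so the
// agent is not handed every requirement up front.
function instructionsFor(stage: Stage, all: Instruction[]): string[] {
  return all.filter((item) => item.stage === stage).map((item) => item.text);
}
```

At the start of a task the agent would see only the first instruction; the component rule appears later, as part of the lint failure output.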
The labs' first-party harnesses: ride the post-training wave
The labs are not just post-training the models but post-training the models in the context of the harness in which they are primarily deployed. The apply patch tool or the specific quoting semantics of how to invoke the bash tool are in the loop for the post-training process for the harnesses from the labs, which means there is leverage to be had by depending on these sort of first-party harnesses directly.
The PR is the collaboration hub: don't block, optimize for throughput
In this world it has largely been just markdown files in the repository and GitHub that have been the primary sort of hub and spoke. ... a PR kind of has a similar purpose. So we kind of treat that as a big hub and spoke broadcast domain where all of the agents and humans collaborate together. Because we optimize for throughput, we don't block on any sort of contribution to that.
Being super prescriptive around like every bit of feedback must be addressed can kind of have this catastrophic failure mode of your coding agent being bullied by all of the reviewers when really we want to bias toward code being accepted, not perfect, not drowning in minutia.
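One possible way to encode a bias toward acceptance is to let only a small class of comments gate the merge. The severity labels and the `triage` function below are assumptions for illustration; the talk does not say how feedback is classified.

```typescript
// Hypothetical severity labels attached by reviewers (human or agent).
type Severity = "blocking" | "suggestion" | "nit";
type ReviewComment = { reviewer: string; severity: Severity; body: string };

type Triage = {
  mergeable: boolean;
  mustFix: ReviewComment[];
  optional: ReviewComment[];
};

// Only blocking comments gate the merge; everything else is advisory, so the
// coding agent is not required to address every piece of feedback.
function triage(comments: ReviewComment[]): Triage {
  const mustFix = comments.filter((c) => c.severity === "blocking");
  const optional = comments.filter((c) => c.severity !== "blocking");
  return { mergeable: mustFix.length === 0, mustFix, optional };
}
```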
How to get started: build confidence with more tests
The agents are super good at looking at the existing code with some context around how it is meant to be used and writing tests that assert that behavior. So kind of using this to improve your confidence in the quality of the code will also increase the agents' ability to successfully navigate it, which means you don't have to worry as much around doing super detailed review.
Commuting with the laptop buckled into the car: 30 minutes of token saturation
Usually what I'll do is kick off a task right before I leave the office. Tether my laptop to my phone, buckle it into the back seat, and kind of let it cook in the 30 minutes it takes me to get home. ... The dream here is that I actually have 50 agents running 24/7 and I don't have to interact with them at all.
Every time I have to type continue to the agent is like a failure of the harness to provide enough context around what it means to continue to completion.
750 packages for a team of seven: slicing it like a 10,000-person organization
Eventually ended up with a mess... we ended up going like full 10,000 engineer organization heavy on the architecture. 750 packages in the PNPM workspace isolated by business logic domain or layer of the stack.
Code in the file system is also text which means it's effectively prompts that you're giving to your coding agent. Making the code as much the same as possible kind of makes it so that regardless of where in the repository your agent is looking, it develops a ton of transferable context.
Garbage Collection Day: every Friday is for eradicating slop
I essentially asked every engineer on the team to take one day a week, Fridays, we called it garbage collection day, where our entire job was to take every bit of slop we had observed over the course of the week that was making a PR difficult to merge and figure out ways to categorically eliminate it from ever happening in the first place.
The feedback that humans were giving on the PR indicates some context failure on behalf of the agent, getting that into the repository and then figuring out ways to automatically prompt inject the agent so that it would self-heal when it produced this bad behavior.
A reviewer agent per persona: running on every push
We kind of asked folks to basically bucket the types of review feedback they were giving into the persona they were operating as — front-end architect, reliability engineer, scalability sort of thing. And then basically for each of those personas, we spun up a review agent that gets triggered on every push that says, is this code good?
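A minimal sketch of fanning one push out to per-persona reviewers follows. The persona names come from the talk and the checklist items echo its examples; the types, the prompt wording, and `buildReviewPrompts` are hypothetical, and how the prompts are dispatched to an agent is left out.

```typescript
// Hypothetical shapes: a persona bundles the written-down review
// expectations of one kind of expert on the team.
type Persona = { name: string; checklist: string[] };
type ReviewRequest = { persona: string; prompt: string };

const personas: Persona[] = [
  {
    name: "front-end architect",
    checklist: ["Are components small and as stateless as possible?"],
  },
  {
    name: "reliability engineer",
    checklist: ["Are there timeouts and retries on this bit of network code?"],
  },
];

// Builds one review prompt per persona for the pushed patch.
function buildReviewPrompts(patch: string, reviewers: Persona[]): ReviewRequest[] {
  return reviewers.map((persona) => {
    const lines = [`Review this patch as a ${persona.name}.`, "Checklist:"]
      .concat(persona.checklist.map((item) => `- ${item}`))
      .concat(["Patch:", patch]);
    return { persona: persona.name, prompt: lines.join("\n") };
  });
}
```

Adding a persona means adding one entry to the list; every later push is then reviewed against that person's written expectations.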
Token usage split: 1/3 planning, 1/3 implementation, 1/3 CI
Probably a third, a third, a third between, like, planning, ticket curation, and documentation; implementation; and stuff that runs in CI.
Skeptical of plan mode: approving a plan without reading it amounts to encoding bad instructions
If you do use a plan and you approve it without reading it at all, you're actually encoding a bunch of instructions that you don't necessarily want followed. So if you are going to use plans, my recommendation is to push those up as single PRs with just the plan, where you actually have a human review every line of it and, like, block on human approval before they get merged and then kicked off.
Code is a disposable build artifact: the LLM as fuzzy compiler
Is code a disposable build artifact? — Yes. Yes.
All of the context that we're putting in the codebase for harness engineering is effectively like constraints and optimization passes on which code is acceptable to build in the first place. This is pretty similar to the static analysis and optimization passes that something like LLVM would do in the process of compiling Rust code. Swapping out one model for another is sort of like changing your code generation backend from LLVM to Cranelift in the Rust compiler.
The future I want to build: hand over a token budget and quarterly goals, and it runs on its own
The future I want to build toward is where I'm able to take a token budget and a quarter, a half, or a year's worth of work, take the human input to rank what is most important (success metrics, reliability metrics), give it to the machines, and have them continually work and advance my product forward without my hands explicitly on the wheel at all.
Software engineering outside of writing code: triage, operations, Twitter vibes
There's a whole universe of software engineering outside of writing code. I am triaging user feedback. I'm triaging pages. I am making sure that we don't have any PII leaking in the logs in production. I'm making sure that the Twitter vibes are good and people are enjoying my software. ... as I no longer have to produce code my mind can shift to these other higher level or more squishy activities but the agents are good enough to do these things too.