Retrieval After RAG: Hybrid Search, Agents, and Database Design (Simon Hørup Eskildsen, Turbopuffer)
swyx · Alessio
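A model cannot compress the world's knowledge into a few terabytes of weights, so an external system has to hold the truth in full fidelity. Turbopuffer began the moment a $30k recommendation engine bill was set against Readwise's $5k of infrastructure. S3 consistency (December 2020), NVMe SSDs (2017), compare-and-swap (late 2024): a window that opens about once every 15 years. Agentic workloads demand not a single retrieval but dozens of parallel queries.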
Chunks of thought
Not a search engine: a store of truth that cannot be compressed into weights
We can take all of the world's knowledge, all of the exabytes and exabytes of data that there is, and we can use those tokens to train a model, but we can't compress all of that into a few terabytes of weights, right? We can compress into a few terabytes of weights how to reason with the world, how to make sense of the knowledge.
But we have to somehow connect it to something externally that actually holds that like in full fidelity and truth. ... being the search engine for unstructured data is the focus of turbopuffer at this point in time.
Three conditions for building a big database company
If you wanna build a really big database company, sort of, you need a couple of ingredients to be in the air... which only happens roughly every 15 years. You need a new workload. You basically need the ambition that every single company on earth is gonna have data in your database.
The second condition to build a big database company is that you need some new underlying change in the storage architecture that is not possible from the databases that have come before you.
The third thing you need to do to build a big database company is that over time you have to implement more or less every query plan on the data. ... when someone has data in the database, they over time expect to be able to ask it more or less every question.
Readwise napkin math: a $30k recommendation engine on top of $5k of infrastructure
We were like, oh, maybe we should build a little recommendation engine and some features to try to hook in the LLMs. ... I found out that I got articles about having a child. I'm like, oh my God, I didn't, I didn't know that they were having a child.
This was a company that was spending maybe five grand a month in total on all their infrastructure and... when I did the napkin math on running the embeddings of all the articles, putting them into a vector index, putting it in prod, it's gonna be like 30 grand a month. That just wasn't tenable.
That haunted me. I couldn't stop thinking about it. I was like, okay, there's clearly some latent demand here. If the cost had been a 10th, we would've shipped it.
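The napkin math in this passage can be written out as a few lines of arithmetic. This is only a sketch using the round numbers quoted above; the variable names are illustrative, and "a 10th" is taken literally as one tenth of the $30k estimate.

```python
# Round numbers from the quotes above (USD per month).
infra_monthly = 5_000    # "maybe five grand a month in total" on infrastructure
vector_monthly = 30_000  # napkin estimate for embeddings plus a vector index in prod


def cost_ratio(feature_cost: float, baseline: float) -> float:
    """How many times the existing infrastructure bill the new feature would cost."""
    return feature_cost / baseline


# At the quoted estimate the feature costs several times the whole existing bill.
ratio_quoted = cost_ratio(vector_monthly, infra_monthly)

# "If the cost had been a 10th": the feature becomes a fraction of the existing bill.
ratio_tenth = cost_ratio(vector_monthly / 10, infra_monthly)

print(f"quoted estimate: {ratio_quoted:.1f}x existing infrastructure")
print(f"at one tenth:    {ratio_tenth:.1f}x existing infrastructure")
```

At the quoted price the feature would have multiplied the bill; at a tenth of it, the feature costs less than the infrastructure already in place, which is the gap the "latent demand" remark points at.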
Why nobody had built it: a database on S3 that puffs data into NVMe
Why hasn't anyone built a database where you just put everything on object storage and then you puff it into NVMe when you use the data and you puff it into DRAM if you're querying it alive?
The only real downside to that is that if you go all in on object storage, every write will take a couple hundred milliseconds of latency, but from there it's really all upside.
You just couldn't have done that 10 years ago... you really have to build a database where you have as few round trips as possible... no databases were designed that way. With NVMe SSDs you can drive like within a very low multiple of DRAM bandwidth if you use it that way. And same with S3.
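The "puff it into NVMe, then into DRAM" idea can be pictured as a read-through cache over three tiers. The sketch below is a toy model under stated assumptions: the class name `TieredStore` is hypothetical, plain dictionaries stand in for object storage, NVMe, and DRAM, and there is no eviction or write path. It is not a description of how Turbopuffer is actually implemented.

```python
class TieredStore:
    """Toy read-through cache: object storage -> NVMe -> DRAM (all dicts here)."""

    def __init__(self, object_storage: dict):
        self.object_storage = object_storage  # cold tier and only source of truth
        self.nvme = {}                        # warm tier, filled on first use
        self.dram = {}                        # hot tier, filled on repeated use
        self.hits = {"dram": 0, "nvme": 0, "object_storage": 0}

    def read(self, key: str) -> bytes:
        # Hot path: data that is queried repeatedly is served from memory.
        if key in self.dram:
            self.hits["dram"] += 1
            return self.dram[key]
        # Warm path: data used before is on local disk; promote it to memory.
        if key in self.nvme:
            self.hits["nvme"] += 1
            value = self.nvme[key]
            self.dram[key] = value
            return value
        # Cold path: one trip to object storage, then keep a local copy.
        value = self.object_storage[key]
        self.hits["object_storage"] += 1
        self.nvme[key] = value
        return value


store = TieredStore({"namespace/segment-0": b"vectors"})
for _ in range(3):
    store.read("namespace/segment-0")
print(store.hits)
```

Because the caches hold only copies, dropping them loses nothing; the object store remains the single place where state lives.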
Three windows that opened after 15 years: NVMe (2017), S3 consistency (2020), CAS (2024)
NVMe SSDs were also not in the cloud until around 2017... And then the second thing is like S3 becomes consistent in 2020. So now it means you don't have to have this like big FoundationDB or like ZooKeeper or whatever sitting there contending with the keys.
Compare and swap allows you to... download the file, you make the modifications, and then you write it only if it hasn't changed while you did the modification and if not you retry. ... That primitive was not available in S3. It wasn't available in S3 until late 2024, but it was available in GCP.
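The download, modify, conditional-write, retry loop described here can be sketched against an in-memory stand-in for an object store. Everything named below (`VersionedStore`, `put_if_version`, `update_with_cas`) is hypothetical and for illustration only; the integer version plays the role of whatever token a real object store exposes for conditional writes.

```python
class VersionedStore:
    """In-memory stand-in for an object store that supports conditional writes."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        # A missing key reads as version 0 with no value.
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value) -> bool:
        """Write only if the object has not changed since it was read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            return False  # someone else wrote in between
        self._data[key] = (current_version + 1, value)
        return True


def update_with_cas(store, key, modify, max_retries=5):
    """Download, modify, write if unchanged, otherwise retry from a fresh read."""
    for _ in range(max_retries):
        version, value = store.get(key)
        new_value = modify(value)
        if store.put_if_version(key, version, new_value):
            return new_value
    raise RuntimeError(f"gave up on {key!r} after {max_retries} conflicting writes")


store = VersionedStore()
update_with_cas(store, "manifest", lambda v: (v or []) + ["segment-1"])
print(store.get("manifest"))
```

With this primitive the object store itself arbitrates between concurrent writers, which is why no separate coordination service has to hold state.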
Buying dark fiber for Notion: the multi-cloud latency hurdle
We're closing deals with Notion actually that was running in AWS and we're like, trust us. You, you really want us to run this in GCP? And they're like, no, I don't know about that. ... the latency across the cloud were so big and we had so much conviction that we bought like, you know, dark fiber between the AWS regions in Oregon.
We had so high conviction in not doing like a metadata layer on S3.
GCP is like, we've never seen a startup like do like, what's going on here? And we're just like, no, we don't wanna do this. We were tuning like TCP windows, like everything to get the latency down.
No state in two places: a principle born from on-call pain
It doesn't have state, I don't want state in two systems. I think all that is just informed by Justine, my co-founder and I had just been on call for so long. And the worst outages are the ones where you have state in multiple places that's not syncing up.
It really came from a very pure source of pain, of just imagining what we would be okay being woken up at 3:00 AM about and having something in ZooKeeper was not one of them.
The Cursor story: an email the day after launch, a 4 AM call, a 95% cost reduction
The day after we launched... one of the Cursor co-founders Arvid reached out and he just, you know, the Cursor team are like all IOI, IMO contenders, right? So they just speak in bullet points and facts. It was like this amazing email exchange just of, this is how many QPS we have, this is what we're paying, this is where we're going.
The way that I remember it is that Postgres was down when I showed up at the office... I was trying my best to see if I could help in any way. Like I knew a little bit about databases back to tuning auto vacuum.
We reduced their cost by 95%, which I think like kind of fixed their per user economics.
Security for code search: in-house embeddings, obfuscated file paths, customer-key encryption
Cursor's security posture into Turbopuffer is exceptional, right? They have their own embedding model, which makes it very difficult to reverse engineer. They obfuscate the file paths. ... And the other thing they do too is that for their customers, they encrypt it with their encryption keys in turbopuffer's bucket.
RAG is not dead: hybrid is the default
We see lots of demand from the coding company ... I like case studies. I don't like just doing thought pieces on this is where it's going. And trying to be all macroeconomic about AI, that has turned out to be a giant waste of time because no one can really predict any of this. So I just collect case studies.
All workloads are hybrid. Like you want the semantic, you want the text, you want the regex, you want SQL. ... it's silly to like be all in on like one particular query pattern.
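One way to picture a hybrid query is as several ranked lists, one per retrieval method, merged into a single ranking. The sketch below uses reciprocal rank fusion, a common general-purpose merging technique. That choice is an assumption made for illustration; the source does not say how Turbopuffer or its customers combine results, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; k=60 is a conventional default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["d1", "d2", "d3"]   # e.g. from a vector search
full_text_hits = ["d3", "d1", "d4"]  # e.g. from a keyword search
print(reciprocal_rank_fusion([semantic_hits, full_text_hits]))
```

Documents found by both methods rise to the top, while documents found by only one still appear, which is the practical argument for not going all in on a single query pattern.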
Embeddings as cached compute: the swyx/Sualeh framing
The way [Sualeh at Cursor] describes is that this is just like cached compute, right? It's like you have a point in time where you're looking at some particular context and focused on some chunk and you say, this is the layer of the neural net at this point in time. That seems fundamentally really useful to do cached compute like that.
Agentic workloads flip the query pattern: not one query but dozens in parallel
When I think of RAG, I think of, Hey, there's an 8,000 token context window and you better make it count. ... Now, everything is moving towards the just let the agent do its thing. The LLM is very good at reasoning with the data, and so we're just the tool call.
What we're seeing more demand from our customers now is to do a lot of concurrency, right? Like Notion does a ridiculous amount of queries in every round trip... When I use the Cursor agent, I also see them doing more concurrency than I've ever seen before.
It means just an enormous amount of queries all at once to the dataset while it's warm in as few turns as possible.
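From the client side, "an enormous amount of queries all at once" in "as few turns as possible" looks like a fan-out within a single agent turn. The sketch below uses only Python's standard `asyncio`; `run_query` is a hypothetical placeholder that sleeps instead of calling a real search API, and the concurrency cap is an arbitrary illustrative value.

```python
import asyncio


async def run_query(query: str, latency_s: float = 0.01) -> str:
    """Placeholder for one search request; the sleep stands in for a round trip."""
    await asyncio.sleep(latency_s)
    return f"results for {query!r}"


async def fan_out(queries, max_in_flight: int = 16):
    """Issue every query in one turn, with a cap on how many are in flight."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(query: str) -> str:
        async with semaphore:
            return await run_query(query)

    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*(bounded(q) for q in queries))


if __name__ == "__main__":
    answers = asyncio.run(fan_out([f"query {i}" for i in range(32)]))
    print(len(answers), answers[0])
```

Because the requests overlap, the turn takes roughly as long as the slowest batch rather than the sum of all queries, which matters when the agent wants the data while it is still warm.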
Query pricing cut 5x to match agentic volume
So we've like tried to reduce query, we've reduced query pricing. ... this is probably the first time actually I'm saying that, but the query pricing is being reduced, like five x. Um, and we'll probably try to reduce it even more to accommodate some of these workloads.
Started with vibe pricing, then pushed all the way to the credit card limit
It was vibe pricing. ... Turbopuffer wasn't at the first principle pricing... I didn't know any VCs... I just saw that my GCP bill was a lot higher than the Cursor bill. So Justine and I were just like, well, we have to optimize it.
My liability and my credit limit was like actively like calling my bank. It was like, I need a bigger credit.
Playing with open cards: why he chose Lachy Groom
I just called Lachy and was like, look, Lachy. Like if this doesn't have PMF by the end of the year, we'll just like return all the money to you... Lachy was the only person that didn't freak out. He was like, I've never heard anyone say that before.
Someone just gave me the advice at the time of just choose the person where you just feel like you can just pick up the phone and not prepare anything. And just be completely honest.
The other people we were talking to at the time were database experts. ... What we needed was just someone who didn't know a lot about databases, didn't pretend to know a lot about databases, and just wanted to help us with candidates and customers.
Build vs buy in the AI era: not "can we build it" but "do we have time"
One of the Notion engineers told me that they'd sat and probably on a napkin, like drawn out like, why hasn't anyone built this? And then they saw Turbopuffer. It was like, well, it's literally that.
AI has also changed the buy versus build equation in terms of, it's not really about can we build it, it's about do we have time to build it? ... if this is a team that can do that and they feel enough like an extension of our team, well then we can go a lot faster.
The P99 engineer: if nobody fights for the candidate, reject is the default
In every recap we end with some version of I'm gonna reject this candidate completely regardless of what the discourse was, because I wanna see people fight for this person because the default should not be, we're gonna hire this person. The default should be, we're definitely not hiring this person.
Someone needs to have both fists up and be like, I'd fight. And if one person says that, then okay, let's do it.
Bending the trajectory to your will: crossing the napkin-math line
The P99 engineer has some history of having bent like their trajectory or something to their will. Right? Some moment where it was just, they just made the computer do what it needed to do.
If you sit down and do the napkin math... you can sit down and do that, and then you observe the real system and you see, oh, we're off by like 10x. Bending trajectory to your will is like just making the software get closer and closer to that first principle line. The P99 might even be able to cross the line by finding even more optimizations.
We launched this thing called ANNv3. ... can search a hundred billion vectors with a P50 of around 40 milliseconds and a P99 of 200 milliseconds. ... That was an engineer, the chief architect of Turbopuffer, Nathan... he just made it capable for a very particular workload in like a six to eight week period.
The ability to articulate trade-offs clearly
I think the P99 is very good at articulating the trade-offs in every decision.
The P99 thinks in trade-offs, right? The P99 is very clear about, hey, turbopuffer, you can't run a high transaction workload on turbopuffer. ... the write latency is a hundred milliseconds. That's a clear trade-off.
Maps and trains: the P99's hobbies are surprisingly obsessive
The P99 spends a lot of time looking at maps. Generally it's their preferred UX. They just love looking at maps.
The origin of this was that at some point I read an interview with some IOI gold medalist... it's like, what do you do in your spare time? I was just like, I like looking at maps.
A startup's only moat: focus
If you wanna build a big database company, the database over time has to implement more or less every query plan. Because when you have your data in a database, you expect it to over time, not just search, but also, hey, I want to aggregate this column, I want to join this data.
But when you're a startup, your only moat is really just focus. You have to lay out the vaccine and you have to not get overeager. And I think we've seen some of our peers get very overeager and overextend themselves. ... what we're most likely to regret at the end of the year is having tried to do too much.
Starting to beat Lucene at full-text search
Turbopuffer today has a fairly state-of-the-art full text search engine. We beat Lucene on some queries, in particular very long queries that we've optimized for because those are the text search queries we see today. They're generated by LLMs or augmented by LLMs.
If you go in and you press Command K and you search for "si"... embedding based search might be like, oh, this is something agreeable because "si", that's yes in Spanish. But in full-text search, that's the prefix of maybe a document of like "these are all the reasons I hate Simon."
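The Command K example comes down to prefix matching: a search-as-you-type box wants every term that starts with "si", whatever "si" might mean semantically. A minimal sketch of prefix lookup over a sorted term list follows, using Python's standard `bisect` module. The term list is invented, terms are assumed to be lowercased already, and a production full-text engine would use a far more elaborate index than this.

```python
import bisect


def prefix_search(sorted_terms, prefix):
    """Return every term in a sorted list that starts with the given prefix."""
    # First position whose term is >= the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    # First position past every term that begins with the prefix; "\uffff"
    # is used as an upper bound on the character that can follow it.
    end = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[start:end]


terms = sorted(["search", "si", "similar", "simon", "size", "so"])
print(prefix_search(terms, "si"))
```

An embedding of the two-letter query carries no notion that the user is halfway through typing "simon", which is why the text path has to exist alongside the semantic one.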