Most of my career has been spent making one big backend stay up and go fast. My new job description is harder: keep a product fast when half the latency budget lives inside somebody else’s GPU cluster, every model provider degrades differently, and the honest answer to “what is our time to first token?” is “which of the five numbers do you want?”
So, what does building a reliable and performant product on top of notoriously unreliable and under-performant AI models look like at Cursor? We’ll walk through the time-to-first-token pipeline from a user’s keystroke through client, network, agent server, inference proxy, and model provider, including the (sometimes comical) ways each layer lies to you about where the time went. We’ll learn why you can’t skimp on the basics (good observability), how to set aggressive-but-achievable goals, and how sometimes just a handful of relatively simple changes can make all the difference.
If you work on a product that sits downstream of dependencies you don’t own, providers you can’t fire mid-request, and users who think your product is slow when really it’s the weather, this one’s for you.