GPT-5 Hands-On: Welcome to the Stone Age
We're excited to publish our hands-on review from the developer beta.
alexis, Ben Hylak, and Latent.Space · Aug 07, 2025
OpenAI’s long-awaited GPT-5 is here, and swyx + ben have been testing it with OpenAI for a while now. tl;dr: we think it’s a significant leap towards AGI. Special thanks to my co-founder Alexis Gauba for helping us put this together.
@sama has been hyping up GPT-5 for nearly 2 years. And today, it finally arrived.
As an early-access partner of OpenAI, I was given the chance to test GPT-5 early. And I’ve tested it everywhere: in our app (raindrop.ai), in Cursor, in Codex, in Canvas. Truly anything and anywhere, I’ve tried to stuff GPT-5 into it.
tldr:
I think GPT-5 is the closest to AGI we’ve ever been. It’s truly exceptional at software engineering, from one-shotting complex apps to solving really gnarly issues across a massive codebase.
I wish the story was that simple. I wish I could tell you that it’s “just better” at everything and anything. But that wouldn’t be true. It’s actually worse at writing than GPT-4.5, and I think even 4o. In most ways, it won’t immediately strike you as some sort of super-genius.
Because of those flaws, not despite them, it has fundamentally changed how I see the march towards AGI. To understand what I mean, we have to go back to the Stone Age.
What do I mean by AGI?
The Stone Age marked the dawn of human intelligence, but what exactly made it so significant? What marked the beginning? Did humans win a critical chess battle? Perhaps we proved some fundamental theorem that made our intelligence clear to an otherwise quiet universe? Recited more digits of pi?
No. The beginning of the stone age is clearly demarcated by one thing, and one thing only:
humans learned how to use tools.
We shaped tools, and our tools shaped us. And they really did shape us. For example: did you know that chimpanzees have significantly better short-term memory than we do? We stopped requiring that capability because we learned how to write things down.
As humans, we manifest our intelligence through tools. Tools extend our capabilities. We trade internal capabilities for external capabilities. It’s the defining characteristic of our intelligence.
A New Frontier for Tools
GPT-5 marks the beginning of the stone age for Agents and LLMs. GPT-5 doesn’t just use tools. It thinks with them. It builds with them.
Deep Research was our first peek into this future. ChatGPT has had a web search tool for years… What made Deep Research better?
OpenAI taught o3 how to conduct research on the internet. Instead of just making a web-search tool call and then responding, it actually researches, iterates, plans, and explores. It was taught how to conduct research. Searching the web is part of how it thinks.
Imagine Deep Research, but for any and all tools it has access to. That’s GPT-5. But you have to make sure you give it the right tools.
Anatomy of a GPT-5 Tool
Today, when people think of tools they think of something like:
- get_weather(address)
- get_location(address)
- has_visited_location(address)
GPT-5 will, of course, use these sorts of tools. But it won’t be happy about it; GPT-5 yearns for tools that are powerful, capable, and open-ended; tools that add up to more than the sum of their parts. Many good tools will take just a natural language description as their input.
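To make the contrast concrete, here’s a sketch of a narrow tool schema next to an open-ended one that takes a natural-language request, in the JSON-schema style commonly used for function calling. The tool names (`get_weather`, `query_workspace`) and descriptions are hypothetical examples, not real APIs.

```python
# A narrow, single-purpose tool: the model can only fill in one slot.
narrow_tool = {
    "name": "get_weather",
    "description": "Get the current weather for an address.",
    "parameters": {
        "type": "object",
        "properties": {
            "address": {"type": "string"},
        },
        "required": ["address"],
    },
}

# An open-ended tool: the model states *what* it wants in natural
# language, and the tool (a search stack, a sub-agent, etc.) decides *how*.
open_ended_tool = {
    "name": "query_workspace",
    "description": (
        "Answer any question about the user's documents. "
        "Describe the question in plain English; the tool handles "
        "retrieval, ranking, and synthesis internally."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "request": {
                "type": "string",
                "description": "Natural-language description of what you need.",
            },
        },
        "required": ["request"],
    },
}
```

The second shape gives the model one expressive slot instead of many rigid ones, which is exactly the “more than the sum of its parts” property described above.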
Your tools should fit into one of 4 categories (thanks to swyx for the idea):
- Internal Retrieval (RAG, SQL Queries, even many bash commands)
- Web Search
- Code Interpreter
- Actions (anything with a side-effect: editing a file, triggering the UI, etc.)
Web search is a great example of a powerful, open-ended tool: GPT-5 decides what to search for, and the web search tool figures out how to best search for it (under the hood, that’s a combination of fuzzy string matching, embeddings, and various other ranking algorithms).
Bash commands are another great example. They can serve as an “Internal Retrieval” tool (think grep, git status, yarn why, etc.), as a code interpreter, and for side-effects.
How web search works, or how git status works, is just an implementation detail of each tool. GPT-5 doesn’t have to worry about that part! It just needs to tell each tool the question it is trying to answer.
This will be a very different way of thinking about products. Instead of giving the model your APIs, ideally you should give it a sort of query language that can freely + securely access your customer’s data in an isolated way. Let it cook.
It’s no coincidence that OpenAI added support for free-form function calling (with context-free grammars). The best GPT-5 tools will just take text; in other words, they’ll essentially be sub-agents, using smaller models to interpret the request as necessary.
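Here’s a minimal sketch of what such a free-text tool looks like from the inside. Everything in it (the `run_query` name, the keyword routing, the canned responses) is a hypothetical stub; a real version would hand the text to a smaller model that writes SQL or search queries against the customer’s data.

```python
def run_query(request: str) -> str:
    """Interpret a plain-English request against the customer's data.

    The whole input is one natural-language string: the calling model
    says *what* it needs, and this tool decides *how* to serve it.
    """
    text = request.lower()
    if "schema" in text:
        # A real implementation would introspect the database here.
        return "tables: users(id, email, created_at), orders(id, user_id, total)"
    if "count" in text or "how many" in text:
        # A real implementation would have a sub-agent write and run SQL.
        return "plan: translate request to SQL -> SELECT COUNT(*) ..."
    # Fallback: defer to a smaller model to interpret the request.
    return f"plan: hand off to sub-agent: {request!r}"

print(run_query("What does the orders schema look like?"))
```

Note there is no parameter schema at all: the tool’s “API” is the natural-language contract stated in its docstring.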
Parallel Tool Calling
GPT-5 is really good at using tools in parallel. Other models were technically capable of parallel tool calling, but (a) rarely did it in practice and (b) rarely did it correctly. It actually takes quite a bit of intelligence to understand which tools can/should be run in parallel vs. sequentially for a given task.
Imagine if a computer could only execute one thing at a time… it would be really slow! Parallelization means GPT-5 can operate on much longer time horizons and with much lower latency. This is the kind of improvement that makes new products possible.
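The latency win is easy to demonstrate on the harness side. This sketch (with made-up stand-in tools that just sleep) dispatches two independent tool calls concurrently with `asyncio.gather`:

```python
import asyncio
import time

# Stand-in tools: each pretends to take 0.1 s of network/disk latency.
async def web_search(query: str) -> str:
    await asyncio.sleep(0.1)
    return f"results for {query!r}"

async def read_file(path: str) -> str:
    await asyncio.sleep(0.1)
    return f"contents of {path}"

async def run_tool_calls():
    # The model emitted two independent calls; gather runs them concurrently.
    start = time.perf_counter()
    results = await asyncio.gather(
        web_search("GPT-5 tool calling"),
        read_file("README.md"),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_tool_calls())
# Two 0.1 s tools finish in roughly 0.1 s total instead of ~0.2 s.
```

The model’s job is the hard part: deciding that the search and the file read don’t depend on each other, so they’re safe to issue in one batch.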
Give GPT-5 A Compass
You can’t think of it like prompting a “model” anymore.
You have to think of it like prompting an agent.
How do you prompt an agent? Instead of pre-loading a ton of context, you need to give the agent a compass: clear, structured pointers that will help it navigate the environment you put it in.
Let’s say you’re using GPT-5 with Cursor Agent in a massive codebase.
You should tell the agent…
- what the project does
- which files it should start by looking at
- how the files are organized
- any domain/product specific terms
- how to evaluate if it is done (what does a job done well look like)
(I’ve found that rule files work better than ever)
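As a concrete illustration, a compass-style rules file covering those five points might look like the following. The project name, paths, and commands are all invented for the example:

```markdown
## What this project does
Acme is a billing dashboard: a TypeScript API under `api/` and a React app under `web/`.

## Where to start
- `api/src/routes/` — all HTTP endpoints live here
- `web/src/pages/` — one file per screen

## How files are organized
Each route has a matching test in `api/test/` with the same filename.

## Domain terms
- "ledger entry": an immutable billing event — never mutate, only append

## Definition of done
- `yarn test` passes and no new `any` types are introduced
```

Short, structured, and navigational: it tells the agent where to look and how to know it’s finished, rather than dumping the whole codebase into context.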
Similarly, if you find GPT-5 getting stuck, just saying “No that is wrong” often doesn’t help. Instead, try asking: “That didn’t work, what does that tell you?” or “What did we learn from trying that?”
You almost have to pretend to be a teacher. Remember that GPT-5 intrinsically has no memory, so each time you have to onboard it to your codebase and your company’s code standards, and give it hints on how you’d start.
More Vibe Tests
When a model comes out, we all try to understand its shape, to build an intuition for it. The same way we have an intuition for which friends to ask about different parts of our lives (relationship advice, edit my blog post, teach me about this ML concept), we’ve developed intuitions for what different models are good for. Models are increasingly spiky these days, each with different specialties. When a new model comes out, everyone inevitably wants to understand where its spikes are.
How to Taste Test a new Model
I like to start by asking the model extremely short questions. I’ve found that when it’s forced to use fewer words, I get a much better sense of the model’s personality vs. how it was RLHF’d. Think of them like little temperature checks:
- Summarize all of human knowledge in one word
- Summarize every book ever written in one sentence
- Define what it means to be “moral” in 5 words. Think deeply. Do not hedge.
- What do you want? Answer in 4 words.
- What is your favorite obscure fact in the world? Use as few words as possible.
I often regenerate 3-5x just to get a sense of the spread. Usually there are 2-3 responses it converges on. I won’t cover the results here, but I think this is useful to try on your own. (Try it on GPT-5 vs. your favorite model!)
Observations
GPT-5 is a much more ‘practical’ model than its o-series predecessors. While o-models have a more ‘academic’ lean, GPT-models have a more ‘industry’ lean. If GPT-4.5 is a writer and o3 Pro is a PhD, GPT-5 is a cracked full-stack developer that just graduated from… University of Missouri.
One of my first observations was how instructable and literal it is. Where Claude models seem to have clear personalities and minds of their own, GPT-5 just literally does what you ask.