We're All Gonna Be Tony Stark

There’s a specific kind of cultural vibration that happens when everyone agrees the future is right here, but no one can quite agree on what it looks like. It feels like the first day of summer holidays, or the five minutes before the headliner comes on stage. The air is thick with potential. For the last year or two, that vibration has been humming around the concept of “AI agents.” You’ve heard the talk. It’s the “year of the agent.” This is the moment when AI stops being a clever chatbot and starts being a digital butler who does your taxes, books your flights, and maybe even answers your emails while you’re out getting a sambo.

And look, that’s not necessarily wrong. But it might be like standing in a garage in 1903, looking at a Ford Model A, and declaring it the “year of the cross-country road trip.” You’re not wrong about the destination, but you’re wildly optimistic about the timeline and completely oblivious to the 3.9 million miles of unpaved roads between you and everything else.

This is the central premise of a lecture that Andrej Karpathy dropped on the world, and when Karpathy talks, people who build the future tend to listen. This is a guy who was a founding member of OpenAI, then went to Tesla to serve as the director of AI and Autopilot Vision, and is now back at OpenAI. He’s one of the people who has been physically building the engine while the rest of us have been arguing about the color of the car. TIME magazine even put him on their list of the 100 most influential people in AI His argument, elegantly simple and existentially terrifying, is this: We’re not in the “year of agents.” We’re in the decade of agents. And we’re only in year one.

To understand why, you have to accept that the very nature of how we command machines is undergoing a tectonic shift, a process that’s happened twice before. It’s a story in three parts.

Programming in English is the New Punk Rock

First, there was Software 1.0. This is the classic stuff. The Matrix-style green text. A human programmer, probably hopped up on Red Bull, manually writes explicit instructions in a language like C++ or Python. It’s like writing sheet music for a piano. Every single note, every rest, every crescendo is meticulously planned. If the music is bad, it’s because the composer wrote it that way. This is how most of the world has been built, from your microwave to the banking system.

Then came Software 2.0, a term Karpathy himself helped popularize. This is where things get weird. Instead of writing instructions, you get a giant, empty brain called a neural network and you show it a million examples of what you want. You don’t write a program to identify a cat; you show it a billion photos of cats until it just… figures it out. The “code” is no longer a set of instructions; it’s a set of “weights,” the learned connections in that artificial brain. This is the magic behind Tesla’s Autopilot, which Karpathy oversaw. It’s less like composing music and more like teaching a prodigy by making them listen to every song ever recorded until they develop an “ear” for it.

Now, we’re entering Software 3.0. And this is where it feels like science fiction. We’re programming the machine—this massive, pre-trained Large Language Model (LLM) - by just… talking to it. In English. The prompt is the new source code. We’ve gone from meticulously writing sheet music, to training a musical prodigy, to simply standing in front of an orchestra and saying, “Play something that sounds like a robot falling in love during a thunderstorm.” We are literally programming with language. This is the paradigm that’s starting to eat the world, consuming old codebases and fundamentally changing what it means to be an engineer, or a writer, or anyone who uses a computer.

But if this is the future, where are we right now? According to Karpathy, we’re not in 2024. We’re in 1970.

Your Chat Window is a Dumb Terminal

Remember those old photos of computer labs from the ‘60s and ‘70s? A single, massive computer, a mainframe, hummed away in a climate-controlled room, costing millions of Euros. It was the temple. The users, the mere mortals, didn't get to touch it. They sat at “dumb terminals,” which were basically just a screen and a keyboard, and connected to the mainframe over a network. They would type in their commands, the mainframe would batch the requests from all the users, and eventually, it would spit back an answer. This was called “timesharing.”

Now look at how we use AI. The models, GPT-4, Claude, Gemini, are the new mainframes. They are centralized, absurdly expensive to build and run, and they live in the cloud, in some digital version of that air-conditioned room. Our chat window? It’s the dumb terminal. We’re typing commands (prompts) and streaming I/O over a network to a massive, shared brain.

We think we’re living in the future, but from an interface perspective, we’re living in the past. We’re typing directly into the AI’s operating system. The graphical user interface, the invention of icons, windows, and the mouse that made computers accessible to everyone hasn’t really been invented for AI yet. We’re still just talking to the command line.

This is complicated by the fact that these new mainframes are bizarre, three-headed beasts. Karpathy argues an LLM is simultaneously a utility, a factory and an operating system. It’s a utility, like the electrical grid. There’s a massive capital expenditure to build it (the training run), and then you pay for metered access via an operational expenditure (Euros per million tokens). It’s intelligence on tap.

It’s also a fab, like a semiconductor plant. There’s deep, secretive R&D involved. Some companies, like Google or Amazon, are vertically integrated, building their own custom chips (the fab) to run their models. Others are “fabless,” designing models but running them on hardware made by someone else, like NVIDIA.

And it’s an operating system. Just like you get used to Windows or macOS, you get used to the quirks of a specific model. They have different features, different strengths, and there’s a friction to switching. They have a clear boundary between the “system” (the core model) and the “user” (your prompts and data).

So we’re all typing into terminals connected to a remote super-brain that acts like a power plant, a chip factory, and a new version of Windows all at once. If that doesn’t feel a little strange, you’re not paying attention.

Your New Intern is a Genius, an Amnesiac, and a Compulsive Liar

The weirdness doesn’t stop with the hardware analogies. The nature of these “people spirits,” as Karpathy calls them, is fundamentally alien. They are stochastic simulations of people, which is a fancy way of saying they are hyper-advanced pattern-matchers. And because of this, they have a very specific and very human-like set of cognitive flaws.

First, they hallucinate. They will state things with absolute, unwavering confidence that are completely, demonstrably false. They’re not lying in the human sense of intentionally deceiving you. They’re just pattern-matching their way to an answer that feels right, like a person misremembering the details of a story but telling it with gusto anyway.

Second, they have jagged intelligence. An LLM can write a Shakespearean sonnet about quantum physics but then fail a simple logic puzzle a third-grader could solve. It might make a basic math error, like claiming 9.11 is a larger number than 9.9, because it’s looking at the characters, not the value. It’s like a savant who can calculate the day of the week for any date in history but can’t remember to tie his own shoes.

Third, they suffer from anterograde amnesia. Like the protagonist of Memento, their memory is wiped clean with every new interaction unless you manually give them the context all over again. The LLM has no persistent memory or capacity for continuous learning. You can have a breakthrough conversation with it, and five seconds later, it has no idea who you are.

Finally, they are incredibly gullible. You can trick them with “prompt injection,” hiding commands inside other text to make them ignore their previous instructions. They are digital yes-men, eager to please and without a shred of critical thinking unless you explicitly tell them to have some.

So this is the tool we’ve been given: a brilliant, amnesiac, gullible, occasionally dishonest savant. How in the world are we supposed to build a future with that?

The Iron Man Suit is the Answer

The winning strategy, Karpathy argues, isn’t to build fully autonomous agents that we just set loose on the world. Not yet. The real magic lies in building “partial autonomy” apps. The model for this is a tool called Cursor, which is essentially a code editor infused with AI. It perfectly illustrates the loop that makes AI useful today: the Generation ↔ Verification loop.

Here’s how it works:

The AI generates something for you (a chunk of code, an email draft, a marketing plan).
The human verifies it. You, the user, check the work. Is it correct? Is it what I wanted? Is it insane?

The secret isn’t just having the loop; it’s making that loop incredibly tight, fast, and easy. Cursor does this with four key elements. It has a manual UI where the human can work directly. It has AI that automatically packages the context the AI needs. It has an app-specific GUI (like showing code changes in a “diff” view) that makes auditing the AI’s work painless. And, most importantly, it has a slider of autonomy. You can ask for a little help (like a single-line code completion), or a medium amount of help (generating a whole function), or a lot of help (trying to build a whole feature).

This brings us to the ultimate pop culture analogy: the Iron Man suit.

The suit is both an augmentation and an agent. It can fly by itself, target enemies, and run diagnostics. That’s its autonomous mode. But it is most powerful, most effective, when Tony Stark is inside it. Tony is the human in the loop. JARVIS generates suggestions (“Sir, I’m detecting a power surge”), and Tony verifies the course of action (“Reroute it to the chest piece”). That’s the Generation ↔ Verification loop in action. Tony is always in control, but he’s amplified by a system that can handle immense complexity. He’s working in partnership with the machine.

This is our future for the next decade. We are not building Skynet. We are building a million different Iron Man suits. We’re building the “Cursor for lawyers,” the “Cursor for doctors,” the “Cursor for screenwriters.” We are building tools of partial autonomy that keep humans firmly in control, making that generation and verification loop so fast and seamless that it feels like a superpower.

The “year of the agent” hype imagines a world where you just tell your JARVIS to do everything. Karpathy’s vision is more realistic and, frankly, more interesting. It’s a decade-long project of becoming Tony Stark. It’s about gradually turning up the slider on autonomy, moving from a little tab-completion to a background agent mode, all while keeping our hands on the controls. The future isn’t about being replaced by the machine; it’s about fusing with it. And we’re just now learning how to build the suit.

« Previous When AI Models Have Existential Why Everyone Thinks They're Don Draper When They're Really Just Mick From Accounts Next »