The AI Future We Build
Why AI’s real bottleneck may be organizational design — and what that means for the people inside
Last week I was driving my son to one of his shows. He’s fifteen, plays guitar in a band, and wants to do what one of his guitar teachers did: go to the University of Washington, study electrical engineering, and build guitar pedals. We’ll see where this goes, but I have to say, I sleep better at night knowing he wants to study electrical engineering rather than ride around in a van with a rock band.
Then, somewhere on the drive, between Rage Against the Machine songs — he’s a fan, I’m not — he told me he’s worried about AI. He wasn’t dramatic about it, but he’s worried about what opportunities will be left by the time he graduates. He doesn’t think he can have the kind of career, or the kind of life, my generation got to build.
So much for my sleep.
Whether they work in tech or not, lots of parents are wondering about their kids’ future — if not their own. I gave him the honest answer: I don’t know. It’s still too early to tell. If you work with this technology every day, the transformation feels inevitable. But when you look around, it’s hard to find many real-world examples to point to.
MIT Project NANDA’s State of AI in Business 2025 study captures the contradiction about as well as anything I’ve seen. According to their research — 300+ initiative reviews, 52 interviews, and 153 survey responses — enterprises have already spent an estimated $30–40 billion on generative AI. Yet among organizations deploying custom, embedded AI solutions — the kind designed to actually change how work gets done — 95% have seen no measurable lift to the P&L. Companies are spending aggressively on AI, but the tools built to transform how work actually operates are almost universally failing to move the needle.
The AI naysayers will tell you it’s a bubble — leaders on a billion-dollar FOMO spending spree racing ahead of practical value. Some of that must be true. But from what I’m seeing, and for anyone who uses these technologies regularly, it’s hard not to feel that disruption is imminent with each new model. The more compelling explanation may not be the technology at all — it’s that many organizations are trying to deploy AI inside structures built for the way people work. That misapplication may be exactly why the returns remain so elusive.
This pattern has well-known precedents. When factories adopted electric motors, most simply replaced their steam engines one-for-one and kept the same floor layouts. Productivity gains were modest for decades — until factories were redesigned around what electricity actually made possible. The technology didn’t change. The structure did. Many organizations seem to be at the same early stage with AI.
The report finds that company leaders want solutions that fit into their current workflows. As one procurement leader put it: “If it doesn’t plug into Salesforce or our internal systems, no one’s going to use it.” Leaders are under pressure from investors and markets to embrace AI as a disruptive force — while simultaneously protecting the workflows and structures that keep the business running today. The result is AI deployed at the edges, where it can improve individual productivity without threatening anything that matters. The report bears this out: while 90% of workers report personal AI use, only about 40% of companies have purchased enterprise LLM subscriptions. The problem is not primarily the tools. It is that most organizational workflows were not designed to get the best out of machine intelligence. They were designed to coordinate humans — and that is exactly what they keep doing.
A Four-Phase Model
If all AI does is accelerate our existing workflows, it will be no more impactful than word processors or Google search — helpful, more gets done, but not the cause of sweeping economic disruption. So my kid will get to make guitar pedals.
The deeper promise of AI is not simply that individuals can produce more. It is that context can be assembled faster, execution can be delegated more cheaply, and one capable person can operate across a wider span of work with the support of agents and tools. Once that starts to happen, the constraint begins to move. The problem is no longer just execution speed. It becomes coordination. Then judgment. Then the organization’s willingness to redesign itself around a different model of work.
I increasingly think the most useful way to understand that change is as four broad phases. These are not stages every company will pass through in order. They are a way of describing where the bottlenecks seem to move when organizations start redesigning around AI seriously.
Phase 1 is individual adoption. People start using AI on their own before the organization itself really changes. A developer writes code faster. A product manager prototypes without waiting on engineering. An analyst moves more quickly from raw information to a first draft of insight. The gains are real, and in some cases dramatic. But they taper off faster than most people expect — and they taper off at the same place almost every time.
That place is the context wall. Once the work depends on institutional memory, fragmented systems, prior decisions, or tacit knowledge about how work actually moves through the organization, the gains stall. The model can only work with the context it is given. Assembling the right context often becomes a job in itself. At that point, what looked like a tool problem is really a context problem. That is where many organizations seem to be now: genuine productivity gains, and very little structural change.
Phase 2 is coordination collapse. If the real bottleneck is fragmented organizational context rather than raw model capability, the opportunity is not just better tools. It is redesigning workflows so the right context moves more directly to the people and agents doing the work — and in doing so, compressing or eliminating the coordination overhead that currently sits between them.
When that starts to happen, the logic behind highly segmented roles begins to weaken. Work that used to move across multiple specialists, approvals, and hand-offs can increasingly be handled by fewer people with broader context and better tools. The roles that come under the most pressure are not only execution roles. They are the coordination, translation, and hand-off roles that exist because context doesn’t travel cleanly between specialists. When one person can hold the full context — domain knowledge, product thinking, implementation — the hand-off cost approaches zero.
No more TPS reports.
Phase 3 is leadership redesign. Once coordination overhead starts to fall, a different bottleneck becomes visible. It is the management structure sitting above the work.
Here the argument requires a distinction that usually gets collapsed. Leadership does two things that are not the same. The first is execution oversight — the reviews, approvals, and synchronization that keep work aligned. Before capable AI, this added real value. After it, the same interventions increasingly add latency without adding judgment, becoming interference in a system that no longer needs them. The second is allocation — making resource and priority decisions across domains that no individual operator can see from inside any one of them. AI compresses the value of the first function significantly. It leaves the second largely intact.
The leaders who navigate this phase well are the ones who can make that distinction about their own role — recognizing which of their interventions are genuine allocation and which are oversight that AI can now handle, alerting them only when they genuinely need to step in. As AI-supported execution speeds up, the organizations that benefit most may be the ones whose leaders can see more while getting in the way less.
Phase 4 is partial autonomy. I do not think most organizations are close to this today, and I do not think it will arrive cleanly. But the direction matters. As models improve and systems accumulate more reliable context, more execution and coordination may move into semi-autonomous loops. Humans increasingly set direction, define constraints, and intervene at the edges rather than managing every step.
At the outer edge of that trajectory is the idea that still sounds extreme but no longer sounds absurd: the solo operator with a real agent workforce. One person with the right judgment, the right systems, and the right models may be able to build and run things that previously required much larger organizations. That will not define every industry. It does not remove the hard limits of trust, reliability, or the physical world. But for startups and small teams it may be the most important frontier available.
The human layer does not disappear even as execution and coordination become increasingly autonomous — and not only because current models are unreliable, though they are. It is that the most consequential decisions organizations make do not have objectively correct answers. Deciding what to build, who to serve, what tradeoffs are acceptable — these are questions where truth either cannot be measured or may not exist in any form a model can be trained toward. The human layer remains not because someone has to be accountable, but because the questions that matter most require judgment as an irreducible feature of what those questions actually are.
This is not a prediction that every organization will move neatly through these phases, or that all of them should. Different functions will move at different speeds. Some industries will be constrained by regulation, trust, or the physical world in ways software teams are not. But as a way of understanding the current gap between AI spending and AI returns, this progression helps explain why so many companies seem stuck: investing as if transformation is coming, while operating in ways that keep AI mostly trapped behind the context wall.
Pressure-Testing the Model
The framework didn’t come from reading McKinsey decks. (I could do worse than a framework — I haven’t mentioned synergy.)
It came from trying to build applications that approach the Phase 4 solo-operator scenario — experiments to see how close I could actually get. What I keep finding is that the models are more capable than skeptics assume and far less plug-and-play than enthusiasts promise.
The first experiment ran directly into the context wall. My wife and I don’t see eye to eye on our spending, and I couldn’t find a budgeting app I liked off the shelf, so I decided to build one. Two birds, one stone: I could experiment with building a full app solo and show my wife how buying clothes for the kids was pushing out my move to Boca Raton — synergy! I built it from scratch using Vercel, Supabase, and AI. I call it AskBudgie — the mascot is a budgie, which is a parakeet.
Beyond budgeting, I added calorie tracking and weight training — three domains that none of my existing apps could reason across together. The cross-domain reasoning worked. But the model reasoned confidently from gaps it didn’t know were there. One weekly suggestion captured both the promise and the flaw:
“Over your last 7 logged days, you’re actually trending slightly above your 2,000-calorie target — averaging 2,187 calories — and your monthly average protein is 137g, just 3g short of your 140g daily goal. That’s a meaningful improvement from the 19–26g gaps flagged in earlier weeks. The catch: you’ve only logged 14 of 30 days this month. On unlogged days, there’s no visibility into whether you’re hitting these numbers or falling short — and with $1,786 invested in strength and personal training, consistency in tracking matters as much as the workouts themselves.”
The cross-domain reasoning is exactly what the tool promises — connecting calorie trends, protein progress, and training investment in a way no individual app could. But the suggestion contains two reasoning errors worth naming honestly.
The first is incomplete context. The app interpreted 14 logged days as inconsistency when in reality I had only been using it for 14 days. The other 16 days weren’t missing logs — they didn’t exist yet. The model reasoned confidently from a gap it didn’t know was there.
The second is irrelevant context. The $1,786 training investment belongs to my wife and son, not me — and even if it were mine, training expenses have no logical bearing on whether I should log my meals more consistently. The connection sounded meaningful because both data points lived in the same context window. Proximity in context is not the same as relevance. The model treated available context as meaningful context, which is not the same thing.
That is the context wall in miniature. The problem is not just whether the model has enough information. It is whether the system knows what information is relevant, how to structure it, and when human judgment has to override machine confidence.
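To make that concrete, here is a minimal sketch of the kind of deterministic guard that would have prevented the first error — computing logging coverage from the date of the first entry rather than from the calendar month, so the model is never handed a gap it will misread. The data shapes and function names are illustrative, not AskBudgie’s actual code.

```typescript
// Hypothetical sketch: measure logging coverage from the first entry onward,
// so days before the user started tracking never read as missed logs.
interface CalorieLog {
  date: string;     // ISO date, e.g. "2025-01-14"
  calories: number;
}

function loggingCoverage(logs: CalorieLog[], today: Date) {
  if (logs.length === 0) return { trackedDays: 0, loggedDays: 0, coverage: 0 };

  // The tracking window starts at the first log, not the first of the month.
  const firstLogDate = new Date(logs.map((l) => l.date).sort()[0]);
  const msPerDay = 24 * 60 * 60 * 1000;
  const trackedDays =
    Math.floor((today.getTime() - firstLogDate.getTime()) / msPerDay) + 1;
  const loggedDays = new Set(logs.map((l) => l.date)).size;

  return { trackedDays, loggedDays, coverage: loggedDays / trackedDays };
}

// Only describe "missing days" to the model when they fall inside the window
// the user has actually been tracking.
function coverageNoteForPrompt(logs: CalorieLog[], today: Date): string {
  const { trackedDays, loggedDays, coverage } = loggingCoverage(logs, today);
  if (trackedDays === 0) return "The user has not started logging yet.";
  return `The user began logging ${trackedDays} days ago and has logged ` +
    `${loggedDays} of those days (${Math.round(coverage * 100)}% coverage). ` +
    `Days before the first entry are out of scope.`;
}
```

The point is not the arithmetic. It is that this kind of framing belongs in deterministic code upstream of the model, not in the fine print of a prompt the model is free to ignore.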
The second experiment pushed on a different phase. I own a short-term rental property and wanted better evaluation tools than anything I could find off the shelf — something that could pull together local regulations, HOA rules, tax rates, market occupancy, and revenue assumptions for a single address into one coherent report. The kind of analysis that would normally require a market analyst, a legal researcher, a financial modeler, and at least two specialist tools — and even then would take days.
The first version scored around 50 out of 100 on an evaluation script I built to measure output quality. It broke in exactly the way a lot of enterprise AI does: inconsistent reasoning, weak reliability, too much dependence on the model improvising over messy inputs. The reason I knew the score was 50 — rather than shipping the plausible-looking output and calling it done — is that I brought the domain knowledge of an actual property owner. I knew what a real evaluation required. That knowledge is what made the failure visible. Someone without it wouldn’t have known the score was 50. They would have shipped it.
After breaking the system into more specialized components, adding deterministic steps where the model should not be guessing, and building a tighter evaluation loop, the score rose to 86. The lesson was not that the model improved. It was that useful AI systems depend on structure: better scaffolding, clearer separation between reasoning and computation, and deliberate control over what context the model sees and when. The jump from novelty to reliability looked much less like “adopt the tool” and much more like “redesign the system around how the model actually works.”
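For readers who want the shape of that evaluation loop, here is a stripped-down sketch: deterministic checks that score each section of a generated report, so regressions show up as a number instead of a feeling. The checks, field names, and point values are hypothetical — they illustrate the approach rather than reproduce the actual script.

```typescript
// Hypothetical sketch of the evaluation-loop shape for the STR tool:
// deterministic checks score each generated report, and the total makes
// quality changes visible between pipeline revisions.
interface StrReport {
  regulationsCited: string[];   // e.g. ordinance identifiers found for the address
  taxRate?: number;             // combined lodging tax rate, as a fraction
  occupancyEstimate?: number;   // projected occupancy, 0..1
  grossRevenueEstimate?: number;
}

interface Check {
  name: string;
  points: number;
  passes: (r: StrReport) => boolean;
}

// The model should never be "guessing" whether a tax rate is plausible
// or an occupancy figure is in range; those are deterministic checks.
const checks: Check[] = [
  { name: "cites at least one local regulation", points: 25,
    passes: (r) => r.regulationsCited.length > 0 },
  { name: "tax rate present and plausible", points: 25,
    passes: (r) => r.taxRate !== undefined && r.taxRate > 0 && r.taxRate < 0.3 },
  { name: "occupancy within 0..1", points: 25,
    passes: (r) => r.occupancyEstimate !== undefined &&
      r.occupancyEstimate >= 0 && r.occupancyEstimate <= 1 },
  { name: "revenue estimate present and positive", points: 25,
    passes: (r) => r.grossRevenueEstimate !== undefined &&
      r.grossRevenueEstimate > 0 },
];

function scoreReport(report: StrReport): number {
  return checks.reduce(
    (total, c) => total + (c.passes(report) ? c.points : 0), 0);
}
```

Once a score like this exists, “breaking the system into more specialized components” stops being a matter of taste: every structural change either moves the number or it doesn’t.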
Building both tools also surfaced something the phase model predicts but that I didn’t fully appreciate until I was living it. Across both projects, I was simultaneously the domain expert, the product manager, the designer, and the engineer. The coordination overhead that would normally move work across those functions didn’t compress — it disappeared entirely. Not because I am unusually capable across all four domains, but because AI made the cost of crossing those boundaries low enough that one person with sufficient judgment and context could hold the whole thing. That is Phase 2 made visible. And the reason it worked is the same reason the 50-to-86 improvement was possible: domain knowledge is what allows you to see where the system is failing. Without it, you cannot evaluate what you have built.
There is one more thing these experiments made visible that the phase model only partially captures. As a solo operator across multiple projects — the STR tool, the budget app, this essay — I am also constantly making allocation decisions that no model can make for me. Which project deserves the next weekend? Where does my attention create the most leverage? What does writing this essay open up that improving an eval script does not? That judgment requires someone who can see the whole portfolio and has skin in all of it. What AI does is strip away enough execution noise that the allocation function becomes legible — visible as a distinct activity rather than buried under coordination overhead. That, I think, is what Phase 3 actually looks like from the inside: not leaders stepping back, but leaders finally being able to see which of their interventions are genuinely irreplaceable and which ones have been filling space that AI can now occupy.
These experiments do not prove every part of the phase model. But they have made one thing hard to ignore: for much of what organizations actually need to do, model quality is no longer the primary bottleneck. The constraint has started to move. It now lives in the system around the model — how context is structured, where judgment sits, and how much of the existing workflow is allowed to remain untouched. The models are ready. The question is whether the organizations around them are.
The Human Stakes
If the phase model is even directionally right, the organizations that realize the largest returns from AI will be the ones willing to compress coordination, concentrate judgment, and redesign work around a very different operating logic. That will have a human cost, and it won’t be evenly distributed.
The roles most exposed are not the ones people first imagine. It is not only routine execution that comes under pressure. It is also the layers of coordination, oversight, and information flow that have long provided stable, well-compensated careers for millions of people. The people who will thrive are those with deep judgment in domains that sit on the other side of the context wall — principal engineers, senior legal partners, enterprise sales leaders, senior product managers — people whose value was always in the judgment, not the coordination surrounding it. For everyone else, this transition may feel less like liberation than compression.
And it won’t happen cleanly, because the people with the most influence over whether these changes occur are often the people with the most to lose from them. Firms have compensation structures, reporting lines, and status hierarchies built around the very coordination layers this model suggests need to shrink. Phase 2 puts pressure on coordination-heavy functions. Phase 3 asks leaders to give up interventions that have simply accumulated around them. Even when leaders can see the direction of travel, the organizational consequences of following it are genuinely threatening. That is a political economy problem, not a strategy problem — and it is one reason I expect large organizations to move more slowly than the technology itself would allow.
Yes, I’m fun at parties.
But the same forces that create that resistance may also open something new. When the economics of running a business change, the economics of starting one change too. The capabilities that compress coordination inside large organizations also lower the threshold for building highly specific tools for narrow markets that enterprise software has never been able to serve economically. The domain expert who deeply understands a specific workflow, a specific customer, or a specific local market may be able to build something useful and viable for the first time — and they only have to pay for their own salary.
The Cottage Software Company: serving pizza parlors in Midtown Manhattan, better than Salesforce can. Succeeding not despite the narrowness of the opportunity, but because of it. I’m testing the idea with the short-term rental project, and it deserves more space than I can give it here. It’s one of the areas I plan to explore next.
Conclusion
I still do not know if guitar pedals are in my son’s future or if the right economic call is to skip college and go on tour. I do not know which parts of the current AI boom are hype, which are durable, or how quickly the more disruptive parts of this model will play out. But I have become less persuaded by the idea that the weak returns we are seeing today mean the technology itself is overrated.
My current view is that the gap between spending and return may say less about the limits of the models than about the limits of the organizations trying to adopt them. Companies are investing as if AI is transformative, but most are still asking it to fit inside structures built for a different kind of work. That may be exactly why the returns have been so hard to realize.
That is why this does not feel like a narrow technology question to me anymore. It feels like a question about institutions, power, and what kind of economic future we are creating for the people coming up behind us.
I am more confident in the direction than in the timetable, and more confident in the organizational implications than in the exact form they will take. But these are the questions I find myself returning to — both in the systems I build and in the conversations I have with my son. I don’t have a clean answer for him. I’m not sure anyone does.
What are you telling your kids?
The views expressed in this essay are my own and do not represent the views of my employer.

