The Context Wall
tl;dr: We’re building a grounded, source-cited reasoning tool. The naive approach (one long chat thread) hits four walls in sequence: latency, cost, prompt-following degradation, and then a hard context limit. Solving that forced us to rethink what a “conversation” actually is. Turns out it’s just a tree, and a surprisingly malleable one at that. We’re building AI that proves it’s right. We’re hiring.
When we started building Kepler, the core promise was simple: every claim the product makes should be traceable to a source. Not “Nvidia is profitable” but “Nvidia is profitable, per this line in this SEC filing.” The difference sounds subtle. Technically, it’s the difference between a product that earns trust and one that quietly erodes it.
The natural way to build that is a linear chat with smarter prompting. Plant the seed, tend it carefully, and trust it to grow straight. This works fine in demos. It falls apart in production.
The first thing you notice is latency. Spreadsheets are central to what our users do. They're not a nice-to-have; they're the deliverable. We generate code that builds them rather than constructing them cell by cell, which is the right call for reliability and correctness. But for any analysis involving a company's full history, that pipeline routinely adds 30 to 60 seconds to a response. Sometimes more. Users noticed immediately.¹
You can reason about why: bigger context, more tokens, more to process before the first line of output. But knowing the reason doesn’t make it hurt less when you’re watching a spinner.
The second thing you notice is cost. Longer contexts mean more tokens processed on every single call, including all the previous turns you’re re-sending. You start doing the math and it doesn’t look good.
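The math is worth making concrete. Because every call re-sends the whole transcript, total input tokens grow quadratically with conversation length. A back-of-envelope sketch (the numbers are illustrative assumptions, not our actual figures):

```python
# Why linear-chat cost grows quadratically: each call re-sends
# every prior turn, so input tokens are a triangular sum.
# Turn counts and sizes below are illustrative, not Kepler's.

def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Cumulative input tokens across a conversation where call t
    re-sends all t prior turns: tokens_per_turn * (1 + 2 + ... + turns)."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

# 40 turns at ~4k tokens each: the final call alone sends 160k tokens,
# and cumulative input across the conversation is ~3.3M tokens.
print(total_input_tokens(40, 4_000))  # 3280000
```

Doubling the conversation length roughly quadruples what you pay in input tokens, which is why "just keep chatting" stops being viable.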
Both of those are, in principle, tractable. Latency and cost are engineering problems. You can throw compute at them, optimize prompts, cache aggressively. So you do.
What you can’t throw compute at is the third thing: the model starts drifting. You’ve given it careful instructions about how to cite sources, how to ground its claims, when to say it doesn’t know. But as the context grows, those instructions get diluted.
Early in a conversation, the model will flatly refuse to speculate. Push it for a view on where a stock is heading and it won't engage, exactly as instructed.² By the time you're approaching 200k tokens, the model capitulates. In between, there's a slow gradient: it starts padding citations with "common sense" reasoning, filling gaps with things it simply knows rather than things it can point to. The instructions are still there in the context. They're just losing the argument.
For a product whose entire value proposition is trustworthy, sourced reasoning, this is not a performance regression. It’s an existential one.
And then you hit the wall. Context limits are finite. At some point there are no more workarounds. You’ve simply run out of room.
What the wall teaches you
When you’re forced to redesign around a hard constraint, you end up thinking more carefully about what you actually needed in the first place.
The linear chat model carries an implicit assumption: that the model needs to hold everything in one place, in one context, all at once. Every prior turn, every retrieved document, every tool result: all of it present, all of the time.
But that’s not actually how good reasoning works. A skilled analyst working on a complex report doesn’t hold every source document fully in mind simultaneously. They work in focused bursts, pass summaries to themselves and colleagues, keep the central argument clear while the supporting detail lives elsewhere. The full context is accessible when needed, but not all active at once.
LLMs are, at the level of the computation, surprisingly flexible. Given a sequence of tokens, predict the next one. That’s it. Which means the shape of “a conversation” is a design choice, not a constraint. Nothing says it has to be a single linear thread. It can branch. It can fork. It can grow.
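A minimal sketch of what "conversation as a tree" means in practice. The names here are illustrative, not our production schema; the key property is that the context a model sees for any branch is just the root-to-leaf path, not the whole tree:

```python
from dataclasses import dataclass, field

# Illustrative sketch: a conversation as a tree of turns rather
# than one linear transcript. Node/fork/context_of are hypothetical
# names, not a real library's API.

@dataclass
class Node:
    content: str                          # one turn: message or tool result
    children: list["Node"] = field(default_factory=list)

    def fork(self, content: str) -> "Node":
        """Branch off this point in the conversation."""
        child = Node(content)
        self.children.append(child)
        return child

def context_of(path: list[Node]) -> str:
    """What a model actually sees for one branch: only the
    root-to-leaf path, never the sibling branches."""
    return "\n".join(n.content for n in path)

root = Node("system: cite every claim to a source")
docs = root.fork("tool: retrieved 10-K excerpts")
a = docs.fork("task: analyze revenue")
b = docs.fork("task: analyze margins")  # sibling branch, same shared prefix
```

Nothing about next-token prediction cares whether `a` and `b` came from the same transcript; each branch is just its own token sequence with a shared prefix.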
Forking: parallel work from a shared root
The first thing this unlocked for us was forking.
Consider a task like building a report with multiple sections, each one analyzing a different aspect of the same company. The naive approach runs these sequentially, one after another, each carrying the full weight of everything that came before. It’s slow, expensive, and by the time you’re on section three, the model’s context is already congested.
But all three sections share a common starting point: the same source documents, the same initial analysis, the same instructions. That shared prefix can be cached. And from that cached point, you can branch: spawn multiple completions in parallel, each working on its own section, each paying only for the tokens it generates beyond the fork point.
The result is faster, cheaper, and more reliable, because each branch is working in a cleaner context, closer to the instructions that matter, without the noise of every other branch’s work.
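The fork pattern can be sketched in a few lines. Here `complete` is a stand-in for any model call (real providers expose prefix caching through their own APIs, with different mechanics); the point is the shape, where branches share a prefix and run concurrently:

```python
import asyncio

# Sketch of fork-from-cached-prefix. `complete` is a placeholder
# for a real LLM call; assume the provider caches `shared_prefix`
# so each branch pays mainly for its own new tokens.

async def complete(shared_prefix: str, task: str) -> str:
    # Stand-in for a model call continuing from the cached prefix.
    await asyncio.sleep(0.01)
    return f"[section on {task}, grounded in shared prefix]"

async def write_report(shared_prefix: str, sections: list[str]) -> list[str]:
    # All branches fork from the same cached point and run in
    # parallel, instead of one long thread carrying every prior
    # section's output along with it.
    return await asyncio.gather(*(complete(shared_prefix, s) for s in sections))

sections = ["revenue", "margins", "cash flow"]
results = asyncio.run(write_report("source docs + instructions", sections))
```

Each branch's context is the shared prefix plus its own work, so the sourcing instructions stay close to the tokens being generated.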
Communication: keeping the primary context clear
Forking handles parallel work off a common base. The harder problem is what happens when agents need to talk to each other or share state.
The obvious approach is to merge contexts. When agent B returns from a research task, dump everything it found into agent A's context. But this recreates exactly the problem you just fought your way out of: one long context, all the noise, degraded prompt-following, escalating cost.
What works better is treating inter-agent communication the way a good team treats it: not full context dumps, but structured summaries. Agent A sends B a specific question. B does its work, potentially running its own sub-tasks and accumulating its own context, then returns the relevant findings compressed. If A needs more, it asks. B’s full working context never floods A’s window.
This keeps the primary reasoning thread clean. The model doing the high-level synthesis isn’t distracted by the raw details of every retrieval task that fed into it. Of course, those details are accessible: B still has them, A can ask follow-up questions, but they’re not actively consuming context unless they’re needed.
We know what happens when they are. That’s how we got here. The Context Wall is unforgiving.
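The summary-passing boundary can be sketched like this. The `Summary` shape and agent names are illustrative assumptions, not our actual protocol; what matters is that only the compressed result crosses into A's context, while B keeps the raw detail:

```python
from dataclasses import dataclass

# Sketch of summary-passing between agents instead of context
# merging. All names and data here are illustrative.

@dataclass
class Summary:
    answer: str             # compressed findings returned to the caller
    source_ids: list[str]   # pointers to evidence that stays in B's context

class ResearchAgent:
    """Agent B: accumulates its own working context, returns summaries."""

    def __init__(self) -> None:
        self.context: list[str] = []  # B's full working context lives here

    def research(self, question: str) -> Summary:
        self.context.append(f"question: {question}")
        self.context.append("retrieved: 40 pages of filing text")  # noisy detail
        # Only the compressed, cited result crosses the boundary:
        return Summary(answer="revenue grew, per the latest filing",
                       source_ids=["filing-p4"])

b = ResearchAgent()
finding = b.research("What happened to revenue?")
# Agent A's context grows by one small Summary, not by B's 40 pages.
# If A needs more, it calls b.research() again with a follow-up.
```

The pointers in `source_ids` are what preserve the sourcing promise: A can cite evidence without ever holding the raw documents in its own window.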
What this means for the product
These aren’t abstract architectural improvements. They’re what makes our core promise achievable.
When the context is clean and well-structured, the model follows its sourcing instructions. When the model follows its sourcing instructions, every claim gets traced to a document. When every claim gets traced to a document, the product does what it said it would do.
The Context Wall turned out to be useful. It forced us to stop treating conversations as sacred linear transcripts and start treating them as something more malleable: data structures to be designed, shaped, and managed. Threads that can be forked, agents that communicate through compression rather than flooding, histories that can be edited when the scaffolding has served its purpose.
The chat interface was a reasonable starting point for the industry. For what we’re building, it’s a constraint we had to think our way out of.
If this is the kind of problem you want to work on, we’re hiring.
Footnotes
1. There's something funny about this. A user will happily spend three days building a spreadsheet by hand, then grow visibly impatient when the computer takes 45 seconds to do the same thing. Expectations don't scale with capability; they invert.
2. This is a simplification. A closer example: a user is extracting MD&A (Management Discussion & Analysis) from a filing and asking about battery production schedules. If the document doesn't cover it, we should say so and not fill the gap with things we happen to know. That discipline is what degrades as the context grows.