
Agent Ontology: What It Actually Takes to Build AI You Can Trust


Early on in our Kepler journey, we sat down with a senior MD at one of the big banks. We weren’t demoing anything, just doing user research, trying to understand how he actually worked and where AI might fit. He said something I keep coming back to: “I don’t trust a template I didn’t build myself. How am I supposed to trust something I can’t audit?”

We pushed on this a little. The templates his team used were well-constructed. Real formulas, real logic, linked to real data. He knew that. But his point wasn’t that the templates were wrong. It was that he couldn’t defend them. He didn’t know what assumptions were embedded in the structure. He didn’t know which edge cases were handled and which were silently ignored. The output might be right. But if someone challenged a number in a meeting or in court, he couldn’t walk them through the reasoning step by step. And in his world, a number you can’t defend is a number you can’t use.

The most obvious version of this problem is the hardcoded cell. Every analyst has inherited a model where someone typed a value instead of linking it to a source. The number sits there, looks correct, and when the underlying data changes it doesn’t update.

But this MD was describing something deeper. Even a well-built template with live formulas and proper sources is suspect if you can’t trace the logic yourself. The question isn’t just “is this number right?” It’s “can I prove it’s right, and can I explain how it’s right?”

That’s the state of most AI today. The outputs look polished. The numbers are plausible. But there’s no chain of reasoning you can follow. No way to distinguish a right answer arrived at correctly from a right answer arrived at by accident. And if the process that produced the answer is wrong, the next answer might not be accidentally right.

Making AI trustworthy isn’t a prompting problem or a model selection problem. It’s an architecture problem. It requires data and APIs modeled to the specific domain. It requires an agentic system with parallel execution, specialized sub-agents, and careful tool selection. It requires workflow design, visualization, provenance at every layer, and a deep understanding of how your users actually work. We’ve started calling the sum of these design decisions an agent ontology. Here’s what building one has taught us.

Why Showing Your Work Requires an Architecture

The obvious response to “AI makes things up” is to give the model access to real documents. Ground it. Let it read the 10-K and pull the number directly, rather than generating one from memory.

This helps. But it doesn’t solve the problem.

Even when a model is reading from the right source, there’s no structural guarantee it pulled from the right table, the right column, the right reporting period. Financial filings are dense. A 10-K might contain three different revenue figures on the same page (GAAP revenue, adjusted revenue, revenue from a specific segment) and the model has to select the right one for the question being asked. Did it pull the annual number or the quarterly number? Did it grab the restated figure or the original? Is it reporting in millions or thousands? These aren’t hallucinations in the traditional sense. The model is looking at the right document. It might even have found the right page. But somewhere between source and output, a subtle error creeps in, and because you can’t trace the path, you can’t catch it.

At Kepler, we enforce a hard separation. The AI layer’s job is to reason: understand what you’re asking, decompose it into sub-tasks, plan the execution, and orchestrate the workflow. Deterministic code retrieves the data and performs every calculation.

When an analyst asks “What’s Nike’s inventory days outstanding trend over the last 8 quarters?”, here’s what doesn’t happen: the model doesn’t generate a number. The AI layer parses intent. This is a time-series query on a calculated metric for a specific entity. That structured intent passes to deterministic code that retrieves inventory and COGS figures from verified 10-Q extractions, computes DIO (Days Inventory Outstanding) using an explicit formula (average inventory / COGS × days), and returns results with pointers to the source filings, down to the specific page, table, and line item. The model never touched arithmetic. It reasoned about what was needed. Code moved the data.
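In code, the boundary looks something like this. This is a minimal sketch, not our actual schema: the `SourcedValue` type and its field names are illustrative. The point is that the function is pure arithmetic with provenance attached; the model decides to call it, but never computes inside it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcedValue:
    """A number that carries a pointer back to its source filing."""
    value: float
    source: str  # e.g. "NKE 10-Q FY2024 Q2, p.37, table 2, line 4"

def days_inventory_outstanding(
    inventory_begin: SourcedValue,
    inventory_end: SourcedValue,
    cogs: SourcedValue,
    days: int = 91,
) -> tuple[float, list[str]]:
    """DIO = average inventory / COGS * days in period.

    Deterministic code: no model output enters this function, and the
    provenance of every input travels with the result.
    """
    avg_inventory = (inventory_begin.value + inventory_end.value) / 2
    dio = avg_inventory / cogs.value * days
    sources = [inventory_begin.source, inventory_end.source, cogs.source]
    return round(dio, 1), sources
```

The returned source list is what lets the output layer link every computed metric back to the filings it came from.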

This separation sounds simple. Maintaining it across hundreds of query types and data sources, while the AI layer is actively planning, routing, and recovering from errors, takes constant architectural discipline.

The Discovery Problem

Every hard problem in software comes back to data. But modeling data for an AI agent is a fundamentally different challenge from traditional data engineering, because the consumer isn’t a dashboard or a BI tool. It’s a reasoning system making decisions in real time about what to retrieve, how to compute it, and whether the results make sense.

Before an agent can answer a complex question, it needs to understand the landscape. We call this the discovery layer, and skipping it is the failure mode we see most often in AI systems that work in demos and fall apart on real queries.

Financial data is deceptively heterogeneous. “Gross margin” means one thing at Nike and another at a SaaS company. Fiscal Q3 at one company is calendar Q3 at another. Segment reporting changes year over year. Before any agent can reason about this data, you need a semantic layer that maps natural language concepts to precise definitions: what metrics exist, how they’re calculated, which line items feed them, and how those line items align across companies with different reporting structures.

When data comes from structured SEC filings, we need to efficiently discover schema across multiple filings. A 10-K for Nike and a 10-K for Adidas share a common structure, but the line items don’t use the same labels. The discovery layer maps those equivalences so downstream agents can query across entities without the orchestrator needing to know those details.
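A toy slice of what that mapping might look like. The tickers and labels here are illustrative, not our production map; the idea is simply that filer-specific wording resolves to one canonical concept, and an unmapped label fails loudly rather than silently guessing.

```python
# Hypothetical slice of a discovery-layer label map: each filer's own
# line-item labels resolve to one canonical concept, so downstream
# agents can query "revenue" without knowing filer-specific wording.
CANONICAL_LABELS = {
    "NKE": {"Revenues": "revenue", "Cost of sales": "cogs"},
    "ADS": {"Net sales": "revenue", "Cost of sales": "cogs"},
}

def resolve(filer: str, raw_label: str) -> str:
    """Map a filer-specific line-item label to its canonical concept."""
    try:
        return CANONICAL_LABELS[filer][raw_label]
    except KeyError:
        # A specific failure beats a silent guess: the agent can act on it.
        raise KeyError(f"No canonical mapping for {filer!r} label {raw_label!r}")
```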

When the question targets unstructured content, like discussion sections in earnings calls, management commentary, or risk factors, the system routes to entirely separate agents with their own tool sets and instructions. A transcript search agent doesn’t need access to structured financial data, and giving it that access would create noise. The orchestrator knows which agent handles which domain and routes accordingly.

We also use what we call Skills: domain-specific knowledge packages that the system detects automatically based on the query context. When computing enterprise value, a skill provides the exact calculation steps, the specific line items to use, and the order of operations. When building a comp table, a formatting skill ensures consistent structure. Skills encode the domain expertise that would otherwise live only in a senior analyst’s head — the kind of knowledge that’s never in the documentation but always in the workflow.
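At its simplest, a skill is a named bundle of trigger phrases and ordered steps, detected from the query text. This sketch is hypothetical and heavily simplified; the real packages carry far more structure than a few strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]   # phrases that activate the skill
    steps: tuple[str, ...]      # explicit, ordered instructions

# Hypothetical skill: the exact steps a model would otherwise improvise.
ENTERPRISE_VALUE = Skill(
    name="enterprise_value",
    triggers=("enterprise value", "ev/ebitda", "ev calculation"),
    steps=(
        "market_cap = shares_outstanding * share_price",
        "ev = market_cap + total_debt - cash_and_equivalents",
        "use the most recent balance sheet for debt and cash",
    ),
)

def detect_skills(query: str, skills: list[Skill]) -> list[Skill]:
    """Return every skill whose trigger phrases appear in the query."""
    q = query.lower()
    return [s for s in skills if any(t in q for t in s.triggers)]
```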

Why Less Tooling Beats More

The dominant pattern in AI right now is to give an agent access to everything and let it figure out what to use. Every demo works this way: here are your tools, here’s a question, go. In practice, that’s how you get confident, polished, wrong outputs.

A single agent with access to web search, a database, a code interpreter, and a document store will use whichever tool pattern-matches to the query, even when that’s the wrong tool for the job.

But there’s a real tradeoff between narrowly scoped agents and agents that can recover when their assumptions are wrong.

Here’s a concrete example. An analyst asks for quarterly segment revenue trends across a company’s business units. The orchestrator identifies the 10-K as the primary source, since it contains segment disclosures. The retrieval agent pulls the annual filing and starts extracting. But the 10-K only reports segment revenue annually. To get quarterly granularity, you need the 10-Qs. If the retrieval agent has already handed off and no longer has access to document search tools, the workflow stalls.

On the flip side, consider a user who asks for a spreadsheet built from financial data. If the agent has both data retrieval and sheet-building tools, the failure mode looks like this: it jumps straight to constructing the spreadsheet before validating that it has complete, correct data. You get a polished deliverable with gaps or wrong numbers, which is worse than an error, because it looks finished.

The key design question isn’t “broad or narrow” but “where are the right validation checkpoints?” We combine search and data extraction tooling and let the agent validate completeness at that stage. Once the data is verified, we hand off to output construction with a separate tool set. The agent that builds the deliverable doesn’t need to search for data, and the agent that searches for data doesn’t get tempted to start building too early.
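The checkpoint between those two stages can be sketched as a hard gate. This is a simplified illustration with hypothetical shapes for the extracted data, but it captures the rule: the builder never sees partial data.

```python
def handoff_to_builder(extracted: dict[str, dict[str, float]],
                       required_metrics: set[str]) -> dict:
    """Gate between the extraction stage and output construction.

    If any entity is missing a required metric, the handoff fails with
    a specific, actionable gap list instead of letting the builder
    produce a polished deliverable with holes in it.
    """
    gaps = {
        entity: sorted(required_metrics - set(metrics))
        for entity, metrics in extracted.items()
        if required_metrics - set(metrics)
    }
    if gaps:
        raise ValueError(f"Incomplete data, cannot build output: {gaps}")
    return extracted
```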

Designing for Recovery

A trustworthy system isn’t one that never makes mistakes internally. It’s one where mistakes are caught and corrected before they reach the user. This means designing every layer of the platform to produce useful feedback when something goes wrong, feedback that’s specific enough for the AI to act on.

This is a design philosophy that shapes how we build everything at Kepler. Our APIs use strong typing and structured error responses, so when an agent writes an incorrect call (wrong argument type, missing required field, invalid enum), the error message is specific enough to function as a correction. The agent rewrites, the system validates again, and each iteration converges toward correct output. A vague runtime error gives the model nothing to work with. A precise type error is essentially an instruction.
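The correction loop that behavior enables looks roughly like this. It is a schematic sketch: `validate` and `rewrite` stand in for the real API validator and the model revising its own call, and the names are ours for illustration.

```python
def call_until_valid(draft_call, validate, rewrite, max_attempts=3):
    """Iterate: validate a drafted call, feed the specific error back,
    let the agent rewrite, and stop once validation passes.

    `validate` returns None on success or a precise, actionable error
    string; `rewrite` stands in for the model revising its own call.
    """
    call = draft_call
    for _ in range(max_attempts):
        error = validate(call)
        if error is None:
            return call
        call = rewrite(call, error)  # the error text *is* the instruction
    raise RuntimeError("Agent failed to converge on a valid call")
```

The loop only converges if the error strings are specific; with a vague runtime error, `rewrite` has nothing to work with and the attempts are wasted.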

But the same principle extends well beyond API calls. When a retrieval agent pulls data from a filing and a validation checkpoint detects that the numbers don’t sum correctly, that’s a feedback signal: the agent can re-extract or escalate rather than passing bad data downstream. When a metric comes back in thousands but the rest of the comp table is in millions, the system flags the unit mismatch rather than silently mixing scales. When a company’s filing contains both original and restated figures, the validation layer catches the ambiguity and forces a resolution before the number reaches the output.
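Checks like these are deliberately boring code. Two illustrative examples, with hypothetical data shapes; each returns a specific flag the agent can act on rather than a bare pass/fail.

```python
def check_units(table: dict[str, tuple[float, str]]) -> list[str]:
    """Flag unit mismatches across a comp table instead of silently
    mixing scales. Each value carries its reported unit.
    """
    units = {unit for _, unit in table.values()}
    if len(units) <= 1:
        return []
    return [f"unit mismatch: {sorted(units)} across {sorted(table)}"]

def check_segment_sum(segments: dict[str, float], total: float,
                      tolerance: float = 0.01) -> list[str]:
    """Flag segment figures that do not sum to the reported total."""
    s = sum(segments.values())
    if abs(s - total) <= tolerance * abs(total):
        return []
    return [f"segments sum to {s}, reported total is {total}"]
```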

The underlying principle is that every interface in the system — between agents, between agents and tools, between the orchestration layer and the data platform — needs to be designed with AI as the consumer. Humans read stack traces and infer what went wrong. AI needs the error to contain the correction. This means investing heavily in the surfaces where things go wrong: typed APIs, structured error responses, validation rules that return specific failures rather than boolean pass/fail. The quality of your error surfaces determines how reliably your agents self-correct, which determines how much you can trust the output.

Orchestration Is the Product

All of the above — discovery, tool routing, validation checkpoints, recovery mechanisms, skills — has to be orchestrated. And orchestration, done right, is where most of the value lives.

Consider what happens when an analyst asks for a trading comp table across ten companies. A naive implementation processes them sequentially: retrieve data for company one, extract, validate, move to company two. That’s ten round trips. In our system, the orchestrator fans out ten parallel sub-agents, each one simultaneously:

  • Pulling the relevant filings for its assigned entity
  • Extracting the target metrics
  • Validating the data against source

The results converge, get normalized across reporting structures and fiscal calendars, and assemble into a single output. What would take minutes sequentially happens in seconds.
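The fan-out itself is ordinary concurrent code. A minimal sketch using Python's asyncio, with a stubbed sub-agent standing in for the real retrieve-extract-validate pipeline; the function names are ours for illustration.

```python
import asyncio

async def analyze_entity(ticker: str) -> dict:
    """Stand-in for one sub-agent: pull filings, extract, validate."""
    await asyncio.sleep(0)  # placeholder for real I/O
    return {"ticker": ticker, "status": "validated"}

async def build_comp_table(tickers: list[str]) -> list[dict]:
    """Fan out one sub-agent per entity and converge the results."""
    results = await asyncio.gather(*(analyze_entity(t) for t in tickers))
    return sorted(results, key=lambda r: r["ticker"])
```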

Or consider a question like “How has management’s tone on pricing power changed across the last four earnings calls?” The orchestrator spins up parallel agents to analyze each transcript independently, then hands the results to a synthesis agent that identifies the trend. The agents doing the analysis don’t need to know about each other. The orchestrator manages the fan-out, the convergence, and the synthesis.

This kind of orchestration requires tight integration with everything underneath it. The orchestrator needs to:

  • Know which data sources require which authentication
  • Understand which sub-tasks can safely run in parallel and which have dependencies
  • Access domain-specific validation rules — not generic “does this look right” checks, but precise constraints like “these two line items should sum to this total” or “this metric should fall within this range for this industry”
  • Route failures intelligently, redirecting sub-agents to alternate data sources when the primary source doesn’t have what’s needed, without restarting the entire workflow
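Knowing which sub-tasks can safely run in parallel reduces to grouping them by their dependencies. A simplified sketch: tasks with no unmet dependencies form a wave, every task inside a wave can run concurrently, and each wave depends only on tasks from earlier waves.

```python
def parallel_waves(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group sub-tasks into waves of safely parallel work.

    `deps` maps each task to the set of tasks it depends on.
    """
    waves, done = [], set()
    remaining = dict(deps)
    while remaining:
        ready = {t for t, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("cycle in task dependencies")
        waves.append(ready)
        done |= ready
        remaining = {t: d for t, d in remaining.items() if t not in ready}
    return waves
```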

General-purpose orchestration tools are getting remarkably capable. But they’re general-purpose by design. They don’t know your data model, your auth boundaries, your preferences, your domain’s validation rules, or which of your fifty data sources is the right one for a specific question. The gap between general orchestration and domain-integrated orchestration isn’t closing as models improve. It’s becoming more visible. The more capable the base model, the more obvious it becomes that the bottleneck is the integration layer: the auth, the data access, the parallel execution, the domain constraints, and the ability to recover gracefully when a sub-agent hits a wall.

Kepler learned this the expensive way early on. We assumed much of the required domain knowledge was already in the LLM: it should understand SEC filings, how to find them, and how metrics relate to one another. And it does, broadly. But broad knowledge is precisely the problem. The model has seen so many patterns that it takes plausible shortcuts. It conflates similar-but-different concepts. We could rely on the LLM for general reasoning and instruction-following, but we had to provide specific guidance, structured directions, and hard boundaries. The systems we built with too much reliance on raw model capability failed on exactly the complex, multi-step queries that matter most.

The best way to understand why this matters is to look at what the user actually sees.

What You Can Verify

Everything we’ve described — the separation of AI and code, the discovery layer, the validation checkpoints, the orchestration — only matters if the user can see it working. Provenance isn’t a feature we bolt on at the end. It’s a design constraint that flows backward through the entire system. The orchestration layer plans with traceability in mind from the first step. Every data retrieval captures the source at extraction time. Every calculation records its formula and inputs. By the time results reach the user, the full chain of evidence is already there.

Here’s what that looks like in practice. When Kepler builds a comp table, the output is a spreadsheet where every number is interactive. Click a gross margin figure and we open the source filing, the actual 10-Q, with the exact line item highlighted. You’re not trusting a citation that says “Nike 10-Q, FY2024.” You’re looking at page 37, table 2, line 4, with the number highlighted in context. The formula that computed the metric is visible: gross profit / revenue. Each input in that formula links to its own source. Every filing used in the analysis, every company included, every metric computed is tracked and navigable in the UI.

This is what separates real provenance from gestures toward traceability. “This came from Nike’s 10-Q” is a label. Clicking a number and landing on the highlighted source in the original filing is proof. It’s the answer to the MD’s question: you trust it because you can audit it. Every number, every formula, every source, one click away.

The AI layer makes this possible because it’s not just reasoning about what data to retrieve. It’s reasoning about how to present and link that data so the user can verify it. The orchestration and the presentation layer are tightly coupled. The same system that retrieves, computes, and validates is the system that builds the traceable output. There’s no hand-off to a separate rendering step that might lose the thread. The provenance survives all the way from source filing to the cell in your spreadsheet because the architecture was designed to carry it end to end.

When you can verify any number in under ten seconds, not by re-doing the work, but by clicking through to the source, you stop wondering whether to trust the system. You just check.

The Thesis

We started in finance because it’s the most unforgiving domain for this problem. An analyst can’t put a number in front of a client without knowing where it came from. A model that’s usually right is worthless when you can’t distinguish the exceptions. If you can build AI that survives that level of scrutiny, in actual workflows with actual stakes, you’ve proven something about the architecture that transfers.

Models will keep getting better. General-purpose orchestration tools will keep getting more capable. But the gap between “impressive demo” and “trustworthy system” isn’t a model gap. It’s an integration gap: data modeling, domain-specific orchestration, verifiable outputs, and the tight coupling between all three. The model is an ingredient. The product is the recipe.

If I had to place one bet: within the next few years, “AI platform” will stop meaning “model wrapper with a chat interface” and start meaning the full stack of data modeling, agent orchestration, and trust infrastructure that makes AI reliable on hard problems. That’s the agent ontology.

That’s what we’re building at Kepler.


Read more: Context Is the Easy Part explores why context engineering is really an infrastructure problem. Trust in the Age of AI examines how forcing AI to cite sources changes the nature of the output itself.