Anthropic's Agentic Analytics Stack: What 95% Accuracy Actually Requires

Anthropic published one of the most honest posts I have read on agentic analytics. They now run most internal business analysis through Claude at roughly 95% accuracy. When they first pointed the same model at the warehouse with no context, accuracy was about 21%. Same model, different infrastructure.

TLDR:



  • Anthropic automates ~95% of internal business analytics queries at ~95% aggregate accuracy with the same Claude models, not a bigger one.

  • The jump from 21% to 95% came from an agentic analytics stack: canonical data, sources of truth, skills, and validation loops.

  • Most failures are concept-to-entity ambiguity, stale docs, and retrieval failure across huge warehouses.

  • Skills (markdown procedural guides colocated with dbt models) were the biggest accuracy lever.

  • Teams without dedicated data engineers can still copy the loop: define metrics, force clarifying questions, show SQL, and maintain docs like code.

Why the 95% headline is not about SQL

Anthropic's post is worth reading because it names the real problem. Analytics agents fail when they treat data like software. Code has tests and multiple valid solutions. Business questions often have one correct answer tied to one governed definition of "revenue," "active user," or "churn."

The hard part is not writing SQL. It is mapping a vague question to the right entity in a model that changes every week.

Anthropic breaks most errors into three buckets:

  1. Concept-to-entity ambiguity. "Active users" could mean login, purchase, or return within seven days. Without a single definition, the agent picks one confidently.

  2. Data staleness. Schemas, business rules, and skill docs drift. Accuracy can fall from ~95% to ~65% in a month without maintenance.

  3. Retrieval failure. The answer exists somewhere in the warehouse, but the agent never finds the right table or metric among millions of fields.

That framing matches what we see with B2B SaaS teams trying to stand up an AI data analyst. The model is rarely the bottleneck. Context and verification are.
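The first failure mode, concept-to-entity ambiguity, can be illustrated with a small gate that refuses to guess. This is a minimal sketch, not Anthropic's implementation: the registry contents, the metric names, and the `resolve_metric` helper are all hypothetical, and a real registry would be loaded from your own governed definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricDefinition:
    name: str         # governed metric name (hypothetical)
    description: str  # plain-language definition shown to the user


# Hypothetical registry: one business phrase mapped to every governed
# definition it could mean. Real entries would come from your own metrics.
REGISTRY = {
    "active users": [
        MetricDefinition("active_users_login_7d", "logged in within the last 7 days"),
        MetricDefinition("active_users_purchase_7d", "purchased within the last 7 days"),
    ],
    "revenue": [
        MetricDefinition("net_revenue", "recognized revenue net of refunds"),
    ],
}


def resolve_metric(phrase: str) -> tuple[Optional[MetricDefinition], Optional[str]]:
    """Return (definition, None) when unambiguous, else (None, clarifying question)."""
    candidates = REGISTRY.get(phrase.strip().lower(), [])
    if len(candidates) == 1:
        return candidates[0], None
    if not candidates:
        return None, f"No governed definition found for '{phrase}'. Which metric do you mean?"
    options = "; ".join(f"{c.name} ({c.description})" for c in candidates)
    return None, f"'{phrase}' is ambiguous. Which definition do you mean: {options}?"


definition, question = resolve_metric("active users")
print(question)
```

The point of the design is that ambiguity produces a question rather than a query: the agent only proceeds to SQL once exactly one definition is left.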

The four-layer agentic analytics stack

Anthropic's fix is a stack, not a prompt tweak.

Data foundations

Fewer, governed canonical datasets beat a sprawl of near-duplicate tables. When an agent searches for "revenue for product X," it should find one consumption-ready model, not twelve plausible candidates with slightly different filters.

Sources of truth

A semantic layer, lineage graph, query corpus, and business context sit in priority order. The semantic layer wins when it covers the question. Lineage helps when it does not. The goal is to collapse ambiguity before SQL runs.
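The priority ordering described above can be sketched as a simple fallback chain. This is an illustration under assumptions: the two lookup functions, their contents, and the `resolve_context` helper are hypothetical stand-ins, and only the first two sources are shown (a query corpus and business context would follow in the same list).

```python
from typing import Callable, Optional

# Each source takes a question and returns context, or None if it cannot help.
Source = Callable[[str], Optional[str]]


def semantic_layer(question: str) -> Optional[str]:
    # Hypothetical lookup: covers only questions about governed metrics.
    governed = {"revenue": "metric: net_revenue (semantic layer)"}
    for term, context in governed.items():
        if term in question.lower():
            return context
    return None


def lineage_graph(question: str) -> Optional[str]:
    # Hypothetical fallback: maps a term to the upstream model that produces it.
    upstream = {"signups": "model: fct_signups (lineage graph)"}
    for term, context in upstream.items():
        if term in question.lower():
            return context
    return None


# Order matters: earlier sources win when they cover the question.
SOURCES_BY_PRIORITY: list[tuple[str, Source]] = [
    ("semantic_layer", semantic_layer),
    ("lineage_graph", lineage_graph),
]


def resolve_context(question: str) -> Optional[tuple[str, str]]:
    """Return (source name, context) from the highest-priority source that answers."""
    for name, source in SOURCES_BY_PRIORITY:
        context = source(question)
        if context is not None:
            return name, context
    return None


print(resolve_context("What was revenue last month?"))
```

Returning the source name alongside the context keeps provenance visible, so the final answer can say which layer the definition came from.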

Skills

Skills are markdown folders the agent reads on demand. They encode how a senior analyst would clarify the question, pick sources, run the query, and format the answer with freshness and provenance.

Without skills, Anthropic saw accuracy of about 21% on internal evals. With skills, it climbed to roughly 95% in aggregate and near 99% in some domains.

Validation

Offline evals, adversarial review sub-agents, and online checks catch stale or wrong answers before stakeholders act on them. Maintenance is first-class: skill docs live in the same repo as transformation code, and PRs that change models must update skills.
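The rule that model changes must ship with skill changes can be enforced mechanically. The sketch below is one possible check, not Anthropic's tooling: it assumes skills are `.md` files sitting in the same folder as the `.sql` models they describe, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def folders_missing_skill_update(changed_files: list[str]) -> set[str]:
    """Folders where a .sql model changed but no .md skill file did."""
    model_folders = set()
    skill_folders = set()
    for path in changed_files:
        p = PurePosixPath(path)
        folder = str(p.parent)
        if p.suffix == ".sql":
            model_folders.add(folder)
        elif p.suffix == ".md":
            skill_folders.add(folder)
    return model_folders - skill_folders


# Example file list, shaped like the output of `git diff --name-only`.
changed = [
    "models/finance/revenue.sql",
    "models/finance/SKILL.md",
    "models/growth/signups.sql",
]
print(folders_missing_skill_update(changed))
```

In CI, a non-empty result would fail the build or request a reviewer, which turns "update the docs" from a convention into a gate.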

What most teams miss

Anthropic's setup works because senior data engineers treat skill maintenance like production code. Roughly 90% of their data-model PRs include skill updates in the same diff.

Most growth-stage SaaS teams do not have someone whose job is keeping agent context fresh. They have a warehouse, a BI tool, and a backlog of ad hoc requests. Pointing Claude or any LLM at that stack without governed metrics and living documentation recreates the 21% world fast.

The honest takeaway: an AI data analyst is a diagnostic for how mature your analytics org already is. Good definitions and docs make the agent fast. Messy metrics make the agent confidently wrong.

The loop we recommend (and how we built Mora)

If you are trying to get your own data analyst working, start with operational habits, not model selection:

  • Write down your top five metrics and how you actually count them today.

  • Make the agent ask clarifying questions before it runs a query.

  • Show the SQL on every answer so someone can spot-check it.

  • Save corrected queries and feed them back as context.

  • Watch for stale tables and docs that have not been updated in months.

That loop is how we think about Mora as an AI data analyst for B2B SaaS teams. Connect your warehouse, teach Mora your business context, and get plain-English answers with the query logic visible so operators can trust the output.

You do not need Anthropic's headcount to adopt the pattern. You do need the discipline: one definition per metric, procedural context the agent can load, and a habit of updating docs when the model changes.
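The last habit in the loop, watching for stale docs, is easy to automate. Here is a minimal sketch, assuming you can collect a last-updated date per doc (for example from version control); the `stale_docs` helper, the file names, and the 90-day threshold are illustrative choices, not recommendations from Anthropic's post.

```python
from datetime import date, timedelta


def stale_docs(last_updated: dict[str, date], today: date, max_age_days: int = 90) -> list[str]:
    """Return doc names not updated within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [name for name, updated in last_updated.items() if updated < cutoff]
    return sorted(stale, key=lambda name: last_updated[name])


# Hypothetical last-updated dates for three context docs.
docs = {
    "revenue.md": date(2024, 1, 10),
    "churn.md": date(2024, 5, 20),
    "active_users.md": date(2023, 11, 1),
}
print(stale_docs(docs, today=date(2024, 6, 1)))
```

Passing `today` explicitly keeps the check deterministic and easy to test; a scheduled job could call it with `date.today()` and post the result to whoever owns the docs.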

What this means for conversational analytics

Static dashboards freeze assumptions at design time. Agentic analytics inverts that: users start with a question, follow up naturally, and get answers tied to governed definitions when the stack is built right.

The shift is not "replace your data team." It is to move analysts from writing the same SQL repeatedly to owning definitions, evals, and the context layer the agent depends on.

Teams that skip that layer will get impressive demos and painful production mistakes. Teams that invest in context first get the 95% story without needing a research lab.

Bringing this to your stack

Mora connects to BigQuery, Snowflake, Postgres, and other warehouses your B2B SaaS team already uses. Ask questions in plain English, see the SQL, and build on answers with follow-ups. No separate semantic layer project required on day one, but the same principles apply: define metrics clearly, keep context current, and validate before you act.

Book a demo to see how Mora handles your schema and business logic without months of setup.

FAQs

Why did Anthropic's accuracy jump from 21% to 95%?

They added procedural skills, canonical datasets, sources of truth, and validation loops around the same Claude models. The model did not change. The context and verification stack did.

Do I need a semantic layer before deploying an AI data analyst?

Not always on day one, but you need governed metric definitions somewhere. Whether that lives in LookML, dbt metrics, or documented SQL views, the agent needs a single answer for "what is revenue" and "what counts as active."

What is the highest-leverage habit for small teams?

Document your top metrics in plain language and force the agent to clarify ambiguous questions before querying. That alone prevents a large share of wrong-but-confident answers.

How does Mora differ from pointing ChatGPT at my warehouse?

Mora is built as an AI data analyst for B2B SaaS: warehouse connections, business context, visible SQL, and conversational follow-ups in one product. The Anthropic lesson applies to any agent, but Mora packages the loop for teams that do not want to build it from scratch.

How often should skill or context docs be updated?

Whenever the underlying data model changes. Anthropic colocates skill markdown with transformation code so docs and tables stay in sync. Treat stale context as a production bug, not a documentation nice-to-have.

Author

Xavier Pladevall

Co-founder & CEO
