The Question to Ask Any GRC Vendor's AI: Based on What?
Your GRC vendor added an AI feature that generates compliance recommendations, but you can't trace any output back to specific evidence or controls.
Your GRC vendor just shipped an AI feature. It generates recommendations. It summarizes your compliance posture. It produces a paragraph that says: "Your access control posture is 78% compliant with ISO 27001 Annex A.9."
That sentence sounds authoritative. It has a number in it. Numbers feel precise. But ask the follow-up question that separates governance from theatre: based on what?
Which controls were evaluated? What evidence was examined for each? How fresh is that evidence? What scoring methodology produced 78%? What would need to change to reach 85%? If the answer to any of these is "the model determined this," you don't have an AI governance tool. You have a language model generating plausible-sounding compliance commentary. Those are different things.
The fluency trap
Large language models are trained to produce coherent, confident text. That's their function. They don't hedge naturally. They don't say "I'm uncertain about this" unless specifically prompted to. They generate the next probable token, and probable tokens in the compliance domain sound like authoritative assessments.
This is fine for summarizing meeting notes. It's dangerous for governance.
When an AI output says "your organization demonstrates strong alignment with NIST CSF Protect function," that sentence is indistinguishable from a sentence produced by a model that has no access to your actual control evidence. Fluency is not evidence of accuracy. A model can produce a grammatically perfect, professionally toned compliance assessment that's entirely fabricated. Not because it's malicious. Because generating plausible text is literally what it's trained to do.
The problem compounds when nobody asks the follow-up question. The output enters a dashboard. The dashboard goes to the board. The board sees 78% and feels reassured. Nobody traces the number back to its source, because the system doesn't expose the source.
This is the state of "AI in GRC" at most vendors today. As we covered in "Legacy GRC Is Structurally Failing," the underlying data architecture doesn't support the reasoning chain that AI outputs claim to represent.
What "based on what" actually requires
Answering "based on what?" demands four things from an AI system:
1. Specific control references. Not "your access controls are strong." Which controls? By what identifier? Mapped to which framework clauses? If the AI says you're 78% compliant with ISO 27001 A.9, it should name the specific controls in your library that map to A.9, identify which ones passed and which didn't, and explain the weighting.
2. Linked evidence. Each control assessment should trace to a specific evidence artifact. Not "based on your policies" but "based on Evidence ID EV-2847, collected from Entra ID Conditional Access policies on 2026-05-08, showing 14 of 16 access policies enforce MFA." The evidence has a source, a collection date, and a verifiable content.
3. Scoring methodology. How did 14/16 become part of 78%? What's the weighting model? Is a control with documented policy but no operating evidence weighted the same as a control with continuous automated evidence collection? If the methodology isn't exposed, the number is decoration.
4. Freshness and currency. Evidence ages. A penetration test from 11 months ago tells you something different than one from last week. A policy document approved in 2023 doesn't prove the policy is enforced today. Any AI system generating compliance scores without factoring evidence freshness is generating a fiction that degrades daily.
Most GRC AI features answer none of these. They produce text that sounds like it could be grounded in your data, because the model was trained on compliance language. But sounding grounded and being grounded are different states.
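To make the four requirements concrete, here is a minimal Python sketch of a score that carries its own answer to "based on what?". The class names, the control identifiers (AC-07, AC-12, AC-15), the weights, and the pass/fail weighting scheme are all hypothetical. Only the EV-2847 example comes from the text above, and a real scoring methodology will differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    evidence_id: str
    source: str
    collected_on: date
    summary: str


@dataclass(frozen=True)
class ControlAssessment:
    control_id: str        # hypothetical internal identifier
    framework_clause: str
    passed: bool
    weight: float          # hypothetical weighting, exposed rather than hidden
    evidence: tuple = ()   # tuple of Evidence records


def weighted_score(assessments):
    """Share of total weight carried by passing controls."""
    total = sum(a.weight for a in assessments)
    if total == 0:
        raise ValueError("no weighted controls to score")
    return sum(a.weight for a in assessments if a.passed) / total


def based_on_what(assessments, as_of):
    """List every control behind the score, with its evidence and evidence age."""
    trail = []
    for a in assessments:
        status = "pass" if a.passed else "fail"
        prefix = f"{a.control_id} [{a.framework_clause}] {status} weight={a.weight}"
        if not a.evidence:
            trail.append(f"{prefix}: no evidence on file")
        for e in a.evidence:
            age_days = (as_of - e.collected_on).days
            trail.append(f"{prefix}: {e.evidence_id} from {e.source}, {age_days} days old")
    return trail


# Illustrative data: one evidence record from the text, three made-up controls.
mfa = Evidence("EV-2847", "Entra ID Conditional Access", date(2026, 5, 8),
               "14 of 16 access policies enforce MFA")
assessments = [
    ControlAssessment("AC-07", "A.9", True, 2.0, (mfa,)),
    ControlAssessment("AC-12", "A.9", True, 1.0, (mfa,)),
    ControlAssessment("AC-15", "A.9", False, 1.0),
]
score = weighted_score(assessments)
trail = based_on_what(assessments, as_of=date(2026, 5, 18))
```

The point of the sketch is the shape, not the formula: the number and the trail come from the same records, so anyone can recompute one from the other.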
The confidence problem
There's a subtler issue. When humans write compliance assessments, uncertainty is visible. An analyst hedges. They write "approximately" or "based on available evidence, which may be incomplete." Readers calibrate their confidence accordingly.
Language models don't naturally hedge. They produce text at the same confidence level regardless of whether the underlying data is strong or weak. A response based on 47 pieces of fresh evidence reads identically to a response based on two stale screenshots and a policy document from 2022.
Without an explicit confidence signal, the consumer of AI-generated compliance output has no way to gauge reliability. Every output looks equally authoritative. That's worse than traditional manual assessments, where at least the analyst's caveats gave you calibration signals.
This is why confidence scoring isn't a nice-to-have feature. It's a structural requirement for any AI system operating in a governance context. The output must carry a numerical signal indicating how well-supported it is. And that signal must be computed, not generated. A language model writing "I am 85% confident" is not the same as a system calculating confidence based on evidence completeness, freshness, and consistency.
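The difference between computed and generated confidence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's actual formula: the function name, the linear one-year freshness decay, and the choice to multiply the three factors are all hypothetical.

```python
from datetime import date


def computed_confidence(expected_items, collected_dates, agreeing_items,
                        as_of, max_age_days=365):
    """Confidence in [0, 1] derived from evidence metadata, not from model text.

    expected_items:  how many evidence items the control set calls for
    collected_dates: collection date of each item actually on file
    agreeing_items:  how many of those items are consistent with each other
    """
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not collected_dates:
        return 0.0  # nothing on file: no support at all

    completeness = min(len(collected_dates) / expected_items, 1.0)

    # Each item decays linearly from 1.0 (collected today) to 0.0 at max_age_days.
    ages = [(as_of - d).days for d in collected_dates]
    freshness = sum(
        min(1.0, max(0.0, 1.0 - age / max_age_days)) for age in ages
    ) / len(ages)

    consistency = min(agreeing_items / len(collected_dates), 1.0)

    # Multiplying means one weak factor drags the whole signal down.
    return completeness * freshness * consistency


today = date(2026, 5, 18)
# Four fresh, mutually consistent items out of four expected.
strong = computed_confidence(4, [today] * 4, 4, as_of=today)
# Two items out of four expected, one six months old and one a year old.
weak = computed_confidence(4, [date(2025, 11, 17), date(2025, 5, 18)], 2, as_of=today)
```

Here the well-evidenced case scores 1.0 and the thin, stale case scores well below it, without a language model writing a single word about how sure it is.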
Two layers, not one
The architectural answer is to separate operations that don't need AI from operations that benefit from it, and then constrain the AI operations with explicit traceability.
Kyudo's Two-Layer Trust architecture makes this separation explicit.
Layer 1: Deterministic. Control scoring, evidence validation, framework crosswalk mapping, and maturity assessment run on structured logic over the Compliance Graph. No language model involved. When the platform says a control scores 73/100, that number comes from a defined calculation: evidence count, evidence freshness, maturity level, enforcement status. You can audit the formula. You can verify the inputs. There's no probability involved and no hallucination risk.
This covers the majority of operations that governance professionals actually need: Is this control operating? Does the evidence support it? How does it map to my frameworks? What's my aggregate posture? These questions have deterministic answers derivable from structured data. Using a language model to answer them introduces risk for zero benefit.
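A deterministic score of this kind might look like the Python sketch below. The four inputs are the ones named above; the point allocation (25/25/30/20), the five-item cap, and the 90-day freshness window are hypothetical choices for illustration, not Kyudo's actual formula.

```python
def control_score(evidence_count, newest_evidence_age_days, maturity_level, enforced):
    """Return (score out of 100, per-factor breakdown) using integer arithmetic only."""
    if not 1 <= maturity_level <= 5:
        raise ValueError("maturity_level must be between 1 and 5")
    if evidence_count < 0 or newest_evidence_age_days < 0:
        raise ValueError("counts and ages must be non-negative")

    if evidence_count == 0:
        freshness = 0  # no evidence on file, so there is nothing to be fresh
    else:
        # Up to 25 points: newest item decays linearly to zero over 90 days.
        freshness = 25 * max(0, 90 - newest_evidence_age_days) // 90

    breakdown = {
        # Up to 25 points: evidence items on file, capped at 5.
        "evidence_count": 25 * min(evidence_count, 5) // 5,
        "freshness": freshness,
        # Up to 30 points: maturity on a 1-5 scale.
        "maturity": 30 * maturity_level // 5,
        # 20 points: the control is technically enforced, not just documented.
        "enforcement": 20 if enforced else 0,
    }
    return sum(breakdown.values()), breakdown


score, breakdown = control_score(
    evidence_count=3, newest_evidence_age_days=30, maturity_level=4, enforced=True
)
```

The same inputs always produce the same score, and the breakdown shows how each factor contributed, which is what makes the number auditable.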
Layer 2: Advisory. Policy drafting, gap analysis narratives, risk summaries, and natural-language explanations use a generative language model. But every output carries three properties:
- A confidence score between 0 and 1, calculated from the completeness and freshness of the source data
- Citations to specific controls, evidence artifacts, and framework clauses in the Compliance Graph
- Provenance metadata showing which data sources informed the output
Outputs scoring below 0.7 are flagged for mandatory human review. Not optional review. Mandatory. The system won't let an advisory output below that threshold propagate without a human explicitly accepting it.
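As a rough sketch of how such a gate could be enforced in code rather than by convention, the Python below refuses to release a low-confidence output until a named reviewer has accepted it. The 0.7 threshold comes from the text; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

REVIEW_THRESHOLD = 0.7  # from the text: below this, human review is mandatory


class ReviewRequired(Exception):
    """Raised when a low-confidence output has not been accepted by a human."""


@dataclass
class AdvisoryOutput:
    text: str
    confidence: float                    # computed elsewhere, between 0 and 1
    citations: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    accepted_by: Optional[str] = None    # reviewer who explicitly accepted it


def needs_review(output):
    return output.confidence < REVIEW_THRESHOLD


def publish(output):
    """Release an output only if it clears the threshold or a human accepted it."""
    if not 0.0 <= output.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if needs_review(output) and output.accepted_by is None:
        raise ReviewRequired(
            f"confidence {output.confidence:.2f} is below {REVIEW_THRESHOLD}"
        )
    return output.text


draft = AdvisoryOutput("Data classification controls are under-evidenced.", 0.55)
try:
    publish(draft)
    blocked = False
except ReviewRequired:
    blocked = True

draft.accepted_by = "analyst@example.com"  # explicit human acceptance
released = publish(draft)
```

Raising an exception instead of logging a warning is the design choice that matters: the low-confidence path cannot complete unless someone takes responsibility for it.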
This means when the Tensei Copilot generates a gap analysis, you can trace every claim back to specific nodes in the Compliance Graph. "Your data classification controls are under-evidenced" links to the specific controls, the specific evidence (or absence thereof), and the specific framework clauses that require stronger coverage. You can verify it. You can disagree with it. You can follow the chain.
Why the Compliance Graph eliminates hallucination for Layer 1
A common objection: "Can't you just prompt the AI to cite sources?" You can. It will happily generate citations. They might be real. They might not be. Prompt engineering doesn't prevent hallucination. It makes hallucination harder to detect because the hallucinated text now includes fabricated citations that look legitimate.
The Compliance Graph solves this differently. Layer 1 operations don't generate text at all. They traverse a graph structure: controls linked to evidence linked to frameworks linked to policies. When you ask "what controls map to ISO 27001 A.9.4?", the system follows relationship edges in the graph. It returns actual nodes. It can't hallucinate a control that doesn't exist because it's not generating, it's querying.
This is the same principle that makes a database query trustworthy while a language model's answer to the same question isn't. SELECT * FROM controls WHERE framework_clause = 'A.9.4' either returns real records or returns nothing. It doesn't invent records that sound plausible. Graph queries have the same property: they return what exists in the graph, period.
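The querying-versus-generating distinction can be illustrated with a toy graph in Python. This is a sketch only: the edge list, the relation names, and the control identifiers are hypothetical, and a production graph store would expose its own query interface.

```python
# Each edge is (source node, relation, destination node).
EDGES = {
    ("A.9.4", "satisfied_by", "AC-07"),
    ("A.9.4", "satisfied_by", "AC-12"),
    ("AC-07", "evidenced_by", "EV-2847"),
}


def neighbors(node, relation):
    """Follow existing edges only; nothing is synthesized."""
    return sorted(dst for src, rel, dst in EDGES if src == node and rel == relation)


def controls_with_evidence(clause):
    """Map each control linked to a framework clause to its evidence artifacts."""
    return {
        control: neighbors(control, "evidenced_by")
        for control in neighbors(clause, "satisfied_by")
    }


known = controls_with_evidence("A.9.4")
unknown = controls_with_evidence("A.99.9")  # clause with no edges in the graph
```

For the known clause the result lists both controls and shows that AC-12 has no evidence attached; for the unknown clause the result is empty. In neither case can the function return a node that is not in the edge set.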
Layer 2 can still hallucinate, which is why it operates under the confidence-score-and-citation regime. But Layer 1 is architecturally incapable of hallucination because it doesn't use a generative model. The distinction between what's computed and what's generated is visible to the user at all times.
The vendor evaluation framework
Next time a GRC vendor demos their AI capability, bring these five questions:
1. "Show me the reasoning chain for this output." Not an explanation generated by the AI. The actual data trail: which inputs, which logic, which outputs. If the vendor can't show you the intermediate steps, they can't trace their own system's reasoning.
2. "What happens when the AI is wrong?" Every AI system produces errors. The question is whether the architecture makes errors detectable. Confidence scores, citations, and human review thresholds are the mechanisms. "Our AI is very accurate" is not a mechanism.
3. "Which operations use AI and which don't?" If the vendor can't clearly delineate what's deterministic and what's generated, they haven't thought carefully about trust boundaries. Every operation should be classifiable as either "computed from structured data" or "generated with citations and confidence."
4. "What's the confidence distribution?" Ask to see the distribution of confidence scores across their AI outputs. If everything is above 0.9, they're either measuring wrong or not measuring at all. Real advisory systems produce a range. Some outputs are well-supported. Some aren't. A system that's always confident is a system that doesn't model uncertainty.
5. "Can I verify this output without asking the AI?" Take an AI-generated claim and try to verify it manually from the underlying data. If you can't, the output is unfalsifiable, and unfalsifiable claims have no place in governance.
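Question 4 lends itself to a quick check. Assuming the vendor can export a list of confidence scores, a few lines of Python are enough to see the distribution. The band names and the 0.9 cut-off for "high" are hypothetical; the 0.7 review threshold is the one discussed earlier.

```python
from collections import Counter


def band(score):
    """Place a confidence score into one of three coarse bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if score < 0.7:
        return "needs_review"
    if score <= 0.9:
        return "supported"
    return "high"


def distribution(scores):
    return Counter(band(s) for s in scores)


def looks_uncalibrated(scores):
    """True when every exported score sits above 0.9."""
    return len(scores) > 0 and all(s > 0.9 for s in scores)


sample = [0.95, 0.92, 0.81, 0.64, 0.40]
spread = distribution(sample)
suspicious = looks_uncalibrated([0.95, 0.97, 0.99])
```

A spread across all three bands is what a system that models uncertainty would be expected to produce; a pile-up above 0.9 is the pattern worth questioning.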
What this means for AI governance specifically
The irony isn't lost: organizations deploying AI governance programs are often using AI tools that can't explain their own reasoning. You're trying to ensure your AI systems are transparent while using a non-transparent AI tool to track that transparency.
If your AI governance program is built on outputs you can't trace, you have a circular problem. You're governing AI transparency with opaque AI. Article 13 of the EU AI Act sets transparency obligations for high-risk AI systems. ISO/IEC 42001 Clause 9 covers performance evaluation. The Measure function of the voluntary NIST AI RMF calls for quantitative or qualitative assessment of AI risk. Hold your governance tooling to the same standard, not just the AI systems you're governing.
The tool that measures transparency should itself be transparent. The tool that assesses AI risk should itself have a documented, auditable reasoning process. "Based on what?" isn't just a question for your vendors' AI features. It's the question that defines whether your governance program rests on foundations or on fluency.
As we explore in "AI Governance Beyond the Policy PDF," the gap between documenting a policy and proving it operates in practice is where most programs fail. The same gap applies to AI tooling: the difference between a tool that generates compliance-sounding text and a tool that computes verifiable compliance assessments from structured data.
The single question, restated
Your GRC vendor's AI says something about your compliance posture. It sounds authoritative. It has numbers. It references frameworks.
Based on what?
If the answer involves specific controls, specific evidence, a defined methodology, freshness data, and a confidence signal, you're looking at a governance tool. If the answer is "the AI analyzed your data," you're looking at a language model that generates compliance-adjacent text.
One of those is auditable. The other will embarrass you in front of a regulator.
Want to see the reasoning chain? Book a demo of the Tensei Copilot. We'll generate a compliance assessment live and show you every control, evidence artifact, and citation that informed it. Bring your auditor.
