Confidence Scores, Citations, and Provenance: The AI Governance Stack
You can't present AI-generated compliance outputs to your auditor because there's no way to verify the reasoning chain or source data.
An auditor sits across from your compliance lead. They're reviewing the AI-generated gap analysis your team used to prioritize remediation work for the past quarter. The auditor asks three questions:
"How confident is this output?" Your team doesn't know. The AI didn't say.
"What specific controls and evidence informed this recommendation?" Your team can't trace it. The AI produced a narrative without citations.
"Where did the underlying data come from, and how current is it?" Your team can't answer. The system doesn't track data lineage for AI outputs.
Three questions. Three blank stares. Three months of remediation work that may have been directed by an unreliable assessment.
This is the state of AI in governance at most organizations. The outputs look professional. They read as authoritative. But they lack the three properties that make any AI output usable in an audit context: confidence scores, citations, and provenance.
Property 1: Confidence scores
A confidence score is a numerical signal (0 to 1) indicating how well-supported an AI output is by the available data. Not the model's self-assessment of its own accuracy. A computed metric based on structural properties of the source data.
The distinction matters. A language model asked "how confident are you?" will generate a number. That number is another token prediction, not a measurement. A model can generate "confidence: 0.92" for a completely fabricated claim, because "0.92" is a probable token in training data about AI systems.
Real confidence scoring requires a defined formula applied to measurable inputs:
Evidence completeness. What percentage of the controls relevant to this output have linked evidence? An advisory output referencing 12 controls where 9 have fresh evidence scores higher than an output where only 3 have evidence.
Evidence freshness. How old is the evidence? Artifacts up to 7 days old contribute full weight. Artifacts 8-30 days old contribute reduced weight. Artifacts over 30 days old contribute nothing. A gap analysis based on 6-month-old evidence isn't reliable regardless of how sophisticated the AI is.
Data consistency. Do the evidence artifacts tell a coherent story? If one artifact shows a control operating and another shows it failing, that inconsistency reduces confidence in any generated narrative about that control's status.
Coverage ratio. What fraction of the relevant domain is represented in the Compliance Graph? If you're asking about ISO 42001 readiness but only 40% of the relevant controls have been configured and evidenced, the output can't be confident about the other 60%.
These inputs produce a score that means something specific. 0.85 says the relevant controls have fresh, consistent evidence. 0.4 says large data gaps exist and the output is more speculation than analysis.
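As a rough illustration, the four inputs can be combined as a weighted sum. This is a minimal sketch, not Kyudo's formula: the weights, the 0.5 value for "reduced weight", and the names `ControlState`, `freshness_weight`, and `confidence` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical weights: the article names the four inputs, not how they combine.
WEIGHTS = {"completeness": 0.3, "freshness": 0.3, "consistency": 0.2, "coverage": 0.2}


@dataclass
class ControlState:
    control_id: str
    evidence_collected: Optional[date]  # None means no evidence is linked
    consistent: bool = True             # False if this control's artifacts disagree


def freshness_weight(collected: date, as_of: date) -> float:
    """Full weight up to 7 days, reduced weight up to 30 days, nothing after."""
    age_days = (as_of - collected).days
    if age_days <= 7:    # day 7 is treated as full weight here
        return 1.0
    if age_days <= 30:
        return 0.5       # assumed value for "reduced weight"
    return 0.0


def confidence(controls: List[ControlState], total_relevant: int, as_of: date) -> float:
    """Weighted sum of the four inputs, each normalized to the range 0..1."""
    if not controls or total_relevant <= 0:
        return 0.0
    evidenced = [c for c in controls if c.evidence_collected is not None]
    completeness = len(evidenced) / len(controls)
    freshness = (
        sum(freshness_weight(c.evidence_collected, as_of) for c in evidenced) / len(evidenced)
        if evidenced
        else 0.0
    )
    consistency = sum(1 for c in controls if c.consistent) / len(controls)
    coverage = min(1.0, len(controls) / total_relevant)
    parts = {
        "completeness": completeness,
        "freshness": freshness,
        "consistency": consistency,
        "coverage": coverage,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Illustrative data only: four configured controls out of five relevant ones.
as_of = date(2026, 6, 1)
controls = [
    ControlState("CTRL-A", date(2026, 5, 30)),         # 2 days old
    ControlState("CTRL-B", date(2026, 5, 10), False),  # 22 days old, artifacts disagree
    ControlState("CTRL-C", None),                      # no evidence linked
    ControlState("CTRL-D", date(2026, 1, 1)),          # older than 30 days
]
score = confidence(controls, total_relevant=5, as_of=as_of)
print(f"confidence: {score:.3f}")
```

The point of the sketch is that every term is arithmetic over stored data, so two people running it against the same evidence get the same number.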
The 0.7 threshold
In Kyudo's architecture, 0.7 is the line between "advisory output that can propagate to dashboards and reports" and "advisory output that requires mandatory human review before use."
Below 0.7, the data foundation is insufficient for the AI to produce reliable guidance. The output might still be useful as a starting point for human analysis. But it shouldn't travel through the system as though it were a verified assessment.
This threshold creates a natural feedback loop. When users see outputs flagged for review, they investigate why. Usually it's because evidence is stale or controls are un-evidenced. They fix the data problem, which improves the next output's confidence. The threshold drives data hygiene without requiring a separate data quality program.
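The gate itself is simple to express. The sketch below assumes a hypothetical `route_output` function and `Route` enum; only the 0.7 value and the two outcomes come from the description above. Attaching the reasons for review is one way the feedback loop could be surfaced to users.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

REVIEW_THRESHOLD = 0.7  # the line described above


class Route(Enum):
    PROPAGATE = "propagate"        # may flow to dashboards and reports
    HUMAN_REVIEW = "human_review"  # must be reviewed before use


@dataclass
class RoutedOutput:
    confidence: float
    route: Route
    reasons: List[str] = field(default_factory=list)  # why review was triggered


def route_output(confidence: float, stale: List[str], unevidenced: List[str]) -> RoutedOutput:
    """Route an advisory output and, if flagged, list the data problems behind it."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= REVIEW_THRESHOLD:
        return RoutedOutput(confidence, Route.PROPAGATE)
    reasons = [f"stale evidence: {cid}" for cid in stale]
    reasons += [f"no evidence linked: {cid}" for cid in unevidenced]
    return RoutedOutput(confidence, Route.HUMAN_REVIEW, reasons)


# Hypothetical control IDs, used only to show the shape of the result.
flagged = route_output(0.4, stale=["CTRL-X-001"], unevidenced=["CTRL-X-002"])
print(flagged.route.value, flagged.reasons)
```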
Property 2: Citations
A citation in governance AI is a reference from a specific claim in an AI output to a specific node in the structured data that supports it. Not "based on your compliance data" at the end of a response. Per-claim traceability.
Consider a gap analysis output: "Your data classification controls show partial implementation. Three of seven mapped controls lack operating evidence, with the primary gap in automated classification enforcement."
Without citations, this is a paragraph. With citations, it becomes:
- "Three of seven mapped controls" cites: [CTRL-DC-001, CTRL-DC-002, CTRL-DC-003, CTRL-DC-004, CTRL-DC-005, CTRL-DC-006, CTRL-DC-007] with status annotations showing which three lack evidence
- "Lack operating evidence" cites: the evidence state for each control, showing collection dates (or absence thereof)
- "Automated classification enforcement" cites: CTRL-DC-005 specifically, showing its current maturity at Level 1 (Documented) versus the Level 3 (Operating) required for the framework obligation
Now the auditor can verify. They follow the citation links. They confirm the controls exist. They confirm the evidence state matches what the AI claims. They can disagree with the AI's interpretation while still trusting the factual basis. That's how governance works. Interpretations are debatable. Facts are verifiable.
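To make "facts are verifiable" concrete, here is a minimal sketch of checking the "three of seven" claim against the cited nodes without involving the AI. The evidence dates, and which three controls lack evidence, are invented for illustration; `verify_count_claim` is a hypothetical helper, not a platform API.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical snapshot of the seven cited control nodes.
# Value is the last evidence collection date, or None if no evidence is linked.
CITED_CONTROLS: Dict[str, Optional[str]] = {
    "CTRL-DC-001": "2026-05-18",
    "CTRL-DC-002": None,
    "CTRL-DC-003": "2026-05-20",
    "CTRL-DC-004": "2026-05-11",
    "CTRL-DC-005": None,
    "CTRL-DC-006": None,
    "CTRL-DC-007": "2026-05-19",
}


def verify_count_claim(
    claimed_missing: int, claimed_total: int, nodes: Dict[str, Optional[str]]
) -> Tuple[bool, List[str]]:
    """Recount from the cited nodes and compare with what the output claimed."""
    missing = sorted(cid for cid, collected in nodes.items() if collected is None)
    matches = len(nodes) == claimed_total and len(missing) == claimed_missing
    return matches, missing


# "Three of seven mapped controls lack operating evidence"
ok, missing = verify_count_claim(3, 7, CITED_CONTROLS)
print(ok, missing)
```

If the recount disagrees with the narrative, the auditor has found a factual error rather than a difference of interpretation.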
Citation depth
Not all citations are equal. There's a hierarchy:
Surface citations reference an entity. "Based on Control AC-7." This tells you the AI looked at something but not what it found.
Attribute citations reference specific properties. "Control AC-7, current maturity: Level 2 (Implemented), evidence freshness: aging (last collected 22 days ago)." Now you know what the AI saw.
Relationship citations reference how entities connect. "Control AC-7 maps to ISO 27001 A.9.4.2 via STRM equivalence relationship, and evidence artifact EV-3291 (Entra ID Conditional Access export, collected 2026-05-20) validates enforcement." Now you see the full reasoning chain.
Kyudo's Tensei Copilot generates relationship-level citations. Every claim traces through the Compliance Graph from assertion to source nodes, through relationship edges, to evidence artifacts. That depth is what makes an AI output auditable.
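One way to picture the hierarchy is a single citation record whose depth depends on how much it carries. The `Citation` class and the relation names below are assumptions for illustration and do not describe Tensei Copilot's actual data model; the example values are taken from the three levels above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Citation:
    entity_id: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # Each edge is (source node, relation, target node).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def depth(self) -> str:
        """Classify the citation by the richest information it contains."""
        if self.edges:
            return "relationship"
        if self.attributes:
            return "attribute"
        return "surface"


surface = Citation("AC-7")
attribute = Citation(
    "AC-7",
    attributes={"maturity": "Level 2 (Implemented)", "evidence_freshness": "aging (22 days)"},
)
relationship = Citation(
    "AC-7",
    attributes={"maturity": "Level 2 (Implemented)"},
    edges=[
        ("AC-7", "strm_equivalent_to", "ISO 27001 A.9.4.2"),  # hypothetical relation name
        ("AC-7", "validated_by", "EV-3291"),                  # hypothetical relation name
    ],
)
print(surface.depth(), attribute.depth(), relationship.depth())
```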
Property 3: Provenance
Provenance answers: where did the data come from, how did it get here, and how current is it?
Citations tell you what data informed the output. Provenance tells you about the data itself. It's the metadata layer that establishes whether the cited data is trustworthy.
Every evidence artifact in the Compliance Graph carries provenance metadata:
Source system. Microsoft Defender XDR, Entra ID, Azure Policy, manual upload, or API integration. The source tells you about the artifact's authority. An automated export from Entra ID is higher-authority than a manually created attestation document.
Collection method. Automated API collection (scheduled, reproducible, tamper-resistant), manual upload (human-mediated, potentially selective), or integration webhook (event-driven, real-time). The method affects how much you trust completeness.
Collection timestamp. When the underlying data was extracted from the source system, not when it was uploaded to the GRC platform.
Cryptographic hash. A content hash computed at collection time. Recomputing it later shows whether the artifact has been modified since it was collected.
Integration path. The full route from source to storage: API endpoints, service principal, data transformations.
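The five fields above fit naturally into a small record. The following is a sketch under stated assumptions: the `Provenance` class, the `collect` and `verify_integrity` helpers, and the example path strings are hypothetical, and SHA-256 is chosen here only because the article does not name a hash algorithm.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class Provenance:
    source_system: str                 # e.g. "Entra ID"
    collection_method: str             # e.g. "automated_api", "manual_upload", "webhook"
    collected_at: datetime             # when data left the source system
    sha256: str                        # content hash taken at collection time
    integration_path: Tuple[str, ...]  # route from source to storage


def collect(content: bytes, source_system: str, method: str,
            collected_at: datetime, path: Tuple[str, ...]) -> Provenance:
    """Hash the artifact at collection time and record where it came from."""
    digest = hashlib.sha256(content).hexdigest()
    return Provenance(source_system, method, collected_at, digest, path)


def verify_integrity(content: bytes, provenance: Provenance) -> bool:
    """Recompute the hash; a mismatch means the artifact changed after collection."""
    return hashlib.sha256(content).hexdigest() == provenance.sha256


artifact = b'{"policy": "conditional-access", "state": "enabled"}'  # illustrative payload
record = collect(
    artifact,
    source_system="Entra ID",
    method="automated_api",
    collected_at=datetime(2026, 5, 20, 9, 0, tzinfo=timezone.utc),
    path=("source API", "service principal", "evidence store"),  # placeholder route
)
print(verify_integrity(artifact, record))
```

Storing the collection timestamp separately from any upload time keeps the freshness calculation tied to the source system rather than to when someone got around to filing the artifact.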
Why provenance matters for AI outputs
When an AI output cites evidence artifact EV-3291, provenance lets the auditor assess: Is this artifact authoritative (came from Entra ID, not a manual upload)? Is it fresh (collected yesterday, not three months ago)? Is it tamper-evident (hash matches, collected via authenticated API)?
Without provenance, citations are just pointers. With provenance, they're verifiable chains of custody. This matters for AI governance programs specifically: if your governance AI produces an output and a human makes a decision based on it, the provenance chain is the kind of transparency record that Article 13 of the EU AI Act calls for in high-risk systems.
The three properties together
Individually, each property adds value. Together, they create something qualitatively different: an AI output that functions as an audit artifact.
Consider a Tensei Copilot output about ISO 42001 readiness:
Confidence: 0.82
Your AI governance controls demonstrate Level 3 (Operating) maturity across 9 of 14 mapped controls. Primary gaps exist in AI system lifecycle documentation (CTRL-AAT-044, Level 1) and automated bias testing (CTRL-AAT-067, Level 2). Three controls show aging evidence requiring refresh within 8 days (CTRL-AAT-012, CTRL-AAT-023, CTRL-AAT-031).
Citations: CTRL-AAT-012 [EV-4102, Entra ID, 2026-05-02], CTRL-AAT-023 [EV-4188, Azure AI Studio, 2026-05-04], CTRL-AAT-031 [EV-4201, manual attestation, 2026-05-01], CTRL-AAT-044 [no evidence linked], CTRL-AAT-067 [EV-3998, bias test results, 2026-04-28]...
This output is:
- Quantified (0.82 confidence, 9/14 controls, specific maturity levels)
- Traceable (every claim cites specific controls and evidence artifacts)
- Verifiable (follow any citation to the Compliance Graph node and check the facts)
- Time-bound (evidence dates are explicit, freshness is calculable)
- Actionable (specific gaps identified with specific remediation targets)
An auditor can work with this. They can verify the claims independently. They can assess whether the confidence score seems appropriate given the data. They can follow the citation chain and confirm the evidence artifacts exist and say what the AI claims they say. They might disagree with the interpretation, but they can see the basis for it.
Compare that to: "Your organization shows good progress toward ISO 42001 readiness, with some gaps in documentation and testing that should be addressed before your next assessment." This sounds fine. It's unfalsifiable. No auditor can work with it.
Regulatory alignment
These properties map directly to regulatory requirements:
EU AI Act Article 13 requires that high-risk AI systems be transparent enough "to enable deployers to interpret a system's output and use it appropriately." Confidence scores enable interpretation. Citations enable appropriate use. Provenance shows what data the output rests on and how it was obtained.
ISO 42001 Clause 9 requires monitoring, measuring, and evaluating AI management system performance. Confidence is a measurable metric. Citation completeness is auditable. Provenance chain integrity is verifiable.
NIST AI RMF MEASURE 2.5 and 2.6 call for evaluating outputs for trustworthiness and measurable performance. The framework is voluntary, but all three properties directly address these subcategories.
If your governance AI can't produce these properties, it can't satisfy the transparency requirements that apply to your broader AI governance program. The tool itself becomes a compliance gap.
Implementation: the Compliance Graph foundation
The three properties are possible because of the Compliance Graph architecture. A language model can't produce real citations, compute confidence from data properties, or attach verified provenance. It can only generate text that mimics these things.
The Compliance Graph makes them structural:
- Citations are graph traversals. Actual edges traversed from output claims to source nodes. They can't be fabricated because they're derived from graph structure, not generated from text.
- Confidence is calculated from node properties. Evidence freshness, completeness ratios, and consistency measures are arithmetic operations on graph data. Not generation.
- Provenance is stored on evidence nodes. Source, method, timestamp, and hash travel with citations automatically.
This is why you can't retrofit these properties through prompt engineering. "Please cite your sources" produces generated text that looks like citations. The Controls Hub architecture produces actual citations derived from structured data relationships. Architecturally different operations with different reliability guarantees.
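A toy example shows what "citations are graph traversals" could mean in practice. This is a sketch only: the adjacency-list graph, the node and relation names, and the `cite` function are assumptions, not the Compliance Graph's real schema. A claim that has no edges in the graph simply yields no citations, which is the property that generated text cannot offer.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, relation, target node)

# Hypothetical graph: node -> list of (relation, target).
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "CLAIM-1": [("asserts_about", "AC-7")],
    "AC-7": [("maps_to", "ISO 27001 A.9.4.2"), ("validated_by", "EV-3291")],
    "ISO 27001 A.9.4.2": [],
    "EV-3291": [],
}


def cite(graph: Dict[str, List[Tuple[str, str]]], claim_id: str,
         evidence_prefix: str = "EV-") -> List[List[Edge]]:
    """Breadth-first search from a claim; return the edge path to each evidence node."""
    paths: List[List[Edge]] = []
    seen = {claim_id}
    queue = deque([(claim_id, [])])
    while queue:
        node, path = queue.popleft()
        if node.startswith(evidence_prefix):
            paths.append(path)  # reached an evidence artifact: the path is the citation
            continue
        for relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return paths


citations = cite(GRAPH, "CLAIM-1")
for path in citations:
    print(" -> ".join(f"{src} [{rel}] {dst}" for src, rel, dst in path))
```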
What to ask your vendor
If your current GRC platform offers AI features, apply this framework:
- Ask for the confidence formula. Not "our AI is accurate." The specific formula. What inputs? What weights? What threshold triggers human review? If there's no formula, there's no real confidence scoring.
- Follow a citation. Pick any AI-generated claim. Follow the citation to its source data. Confirm the source data exists. Confirm it says what the AI claims. If you can't do this in under 60 seconds, the citations aren't functional.
- Check provenance depth. For any cited evidence, can you see source system, collection method, timestamp, and integrity verification? If the evidence is just "a document in the system," provenance is absent.
- Test the threshold. Ask the AI a question where you know the underlying data is thin. Does the confidence score drop? Does the system flag it for review? Or does it generate a confident-sounding answer regardless of data quality?
- Verify independence. Can you verify an AI output without asking the AI? If the only way to check whether the AI is right is to ask it again, you have a circularity problem.
These aren't exotic requirements. They're the minimum bar for AI outputs to function as evidence in a governance context. Any AI system that can't meet them is generating prose, not governance.
See the three properties in action. Try the AI risk assessment with your own compliance data. You'll see confidence scores computed in real time, citations linking to specific controls and evidence, and provenance metadata on every artifact. Bring your hardest question.
