Deploying GRC in Customer Tenants: 40 Install Lessons

It's 3 PM on a Thursday. We're on a screen-share with a customer's infrastructure team. The Kyudo platform is deployed, the AKS cluster is running, the Compliance Graph is initialized. But the evidence pipeline from Defender XDR is returning 403s. The service principal has the right API permissions. The Graph API scopes are correct. The issue, which takes 90 minutes to isolate, is a Conditional Access policy that blocks service principal authentication from any IP not on the customer's corporate allow-list. The AKS cluster's egress IP wasn't added.

This is install number 23, and we've seen this exact failure pattern before. By install 34, it is on the pre-flight checklist. By install 40, we catch it before deployment starts.

Forty tenant deployments teach you things architecture diagrams don't. Every Azure tenant is theoretically identical. In practice, each one has accumulated years of policy decisions and network configurations that create unique deployment surfaces.

The deployment model

Kyudo deploys inside the customer's Azure tenant. Not adjacent to it. Inside it. AKS in their subscription, data in their resources, service principals in their Entra ID. Nothing egresses to Kyudo infrastructure. The customer's data stays in their data boundary.

This model exists because regulated organizations told us they won't send compliance data to a vendor's tenant. We wrote about this trade-off in "GRC Implementation in 30 Days."

The trade-off: we get data sovereignty and customer trust. We also get every configuration quirk that tenant has accumulated over its lifetime.

The three misconfigurations that actually break things

These three cause deployment failures or functional degradation. Everything else resolves quickly.

1. Network segmentation that blocks the evidence pipeline

Frequency: 14 of 40 deployments (35%)

The evidence pipeline calls Defender XDR, Sentinel, Purview, Entra ID, and Azure Policy APIs from the AKS cluster inside the customer's virtual network.

What breaks: Evidence collection returns 403 or timeout errors. The platform deploys fine, the UI works, but no evidence flows. The Compliance Graph stays empty.

The fix: Pre-deployment network topology review. We need outbound HTTPS (443) from the AKS subnet to graph.microsoft.com, api.securitycenter.microsoft.com, management.azure.com, and Sentinel workspace endpoints. Azure Firewall needs explicit application rules. NSGs need service tags AzureActiveDirectory and AzureResourceManager allowed.

Network Config	Risk Level	Pre-Flight Check
Open outbound (default)	Low	Verify no restrictive NSGs on AKS subnet
Azure Firewall with allow-list	High	Require application rules for Graph API, Defender API, ARM
Private Link for Graph API	Medium	Verify private DNS resolution from AKS VNET
NVA with URL filtering	High	Allow-list Microsoft service FQDNs, not just IPs
Hub-spoke with forced tunneling	High	Verify that Microsoft service traffic doesn't route through on-prem

The hub-spoke with forced tunneling case is the worst. API calls that should take 200ms take 3 seconds because they're routing through an on-prem datacenter in another geography. That breaks evidence collection timeouts.
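
A minimal sketch of a first-pass connectivity probe, meant to run from a pod or VM in the AKS subnet. It times a TCP connection on port 443 to the hosts named in the fix above and labels each one ok, slow, or blocked. The one-second "slow" threshold and the function names are assumptions for illustration, not Kyudo settings. A successful TCP handshake does not prove that an application-rule firewall will pass the actual HTTPS request, so treat a clean result as necessary rather than sufficient.

```python
import socket
import time

# Hosts named in the fix above. Sentinel workspace endpoints vary per tenant
# and are left out of this sketch.
REQUIRED_HOSTS = [
    "graph.microsoft.com",
    "api.securitycenter.microsoft.com",
    "management.azure.com",
]

SLOW_CONNECT_SECONDS = 1.0  # assumed threshold for "probably hairpinning"


def tcp_connect_seconds(host, port=443, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers DNS failures, refusals, and timeouts
        return None


def probe(hosts, connect=tcp_connect_seconds):
    """Classify each host as 'ok', 'slow', or 'blocked'."""
    results = {}
    for host in hosts:
        elapsed = connect(host)
        if elapsed is None:
            results[host] = "blocked"
        elif elapsed > SLOW_CONNECT_SECONDS:
            results[host] = "slow"
        else:
            results[host] = "ok"
    return results
```

Calling probe(REQUIRED_HOSTS) returns a dictionary keyed by host. The connect parameter is injectable so the classification logic can be exercised without network access.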

2. Conditional Access policies that block service principals

Frequency: 11 of 40 deployments (28%)

CA policies are designed for human users. They often catch service principals as unintended collateral. Common triggers: location-based policies blocking sign-ins from IPs not on the corporate allow-list (the AKS egress IP isn't listed), policies requiring compliant devices (service principals don't have devices), or policies requiring MFA (service principals can't complete MFA interactively).

What breaks: API calls fail with AADSTS53003: Access has been blocked by Conditional Access policies. Evidence collection halts. If the CA policy is intermittent (risk-based), the evidence pipeline fails sporadically, which is harder to diagnose than a consistent failure.

The fix: Exclude the Kyudo service principal from CA policies that target user sign-in properties the service principal can't satisfy. This requires coordination with the customer's identity team, who are often (reasonably) reluctant to create exclusions without understanding the implications.

What usually resolves that reluctance is a plain explanation: "The service principal authenticates using a client certificate from within your own VNET. The location restriction doesn't add security value for a workload identity that never leaves your tenant." But that conversation needs to happen before deployment, not during a troubleshooting call at 3 PM on a Thursday.
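
For the sporadic case, it helps to separate Conditional Access blocks from other authentication failures in the pipeline's logs. The sketch below assumes the failed token request's JSON error body has been captured, and that it carries the error_codes and error_description fields that Entra ID token-endpoint errors normally include. The function names are hypothetical.

```python
import json
import re

CA_BLOCK_CODE = 53003  # AADSTS53003, the code quoted above


def aadsts_codes(payload):
    """Collect AADSTS codes from a token-endpoint error payload (a dict)."""
    codes = set(payload.get("error_codes") or [])
    # Fall back to the description text in case error_codes is missing.
    description = payload.get("error_description") or ""
    codes.update(int(m) for m in re.findall(r"AADSTS(\d+)", description))
    return codes


def is_conditional_access_block(raw_body):
    """True if a JSON error body indicates a Conditional Access block."""
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError):
        return False  # not JSON, so not a token-endpoint error we recognize
    if not isinstance(payload, dict):
        return False
    return CA_BLOCK_CODE in aadsts_codes(payload)
```

Counting these separately over time makes an intermittent, risk-based policy visible as a pattern instead of a scattering of unexplained failures.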

3. Azure Policy assignments that prevent platform updates

Frequency: 8 of 40 deployments (20%)

Azure Policy assignments with a deny effect block resource operations that violate them. The initial deployment succeeds, but subsequent platform updates fail because a new configuration requirement conflicts with a policy added after deployment.

What breaks: Platform updates fail with cryptic ARM errors. The platform stays on the older version, accumulating version drift.

The fix: Document minimum Azure Policy exemptions for the Kyudo resource group at deployment time. Review policy compatibility before each platform update.

Common Azure Policy	Conflict	Resolution
Require specific AKS version	Blocks cluster upgrades	Exempt Kyudo RG or align update cadence
Deny public IP creation	Blocks ingress controller	Use internal load balancer (preferred) or exempt
Require specific VM SKU	Blocks node pool scaling	Pre-approve SKU list for Kyudo nodes
Enforce tag inheritance	Fails on auto-generated resources	Exempt system-managed resources in Kyudo RG
Deny extensions on VMs	Blocks monitoring agents	Exempt or pre-install required extensions
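
As a rough illustration of the pre-update review, the sketch below takes policy assignments exported as JSON (for example from az policy assignment list) and returns the enforced ones whose scope covers a given resource group. It assumes each record carries displayName, enforcementMode, and scope fields. Management-group assignments are treated as applicable without resolving the hierarchy, which is a simplification, and the function names are hypothetical.

```python
MG_PREFIX = "/providers/microsoft.management/managementgroups/"


def applies_to(scope, resource_group_id):
    """True if an assignment scope is at or above the resource group."""
    scope = (scope or "").rstrip("/").lower()
    target = resource_group_id.rstrip("/").lower()
    if not scope:
        return False
    if scope.startswith(MG_PREFIX):
        return True  # assumed inherited; hierarchy is not resolved here
    return target == scope or target.startswith(scope + "/")


def enforced_assignments(assignments, resource_group_id):
    """Names of enforced assignments that can affect the resource group."""
    hits = []
    for assignment in assignments:
        # Assignments set to "DoNotEnforce" evaluate but do not deny.
        if assignment.get("enforcementMode", "Default") != "Default":
            continue
        if applies_to(assignment.get("scope"), resource_group_id):
            hits.append(assignment.get("displayName") or assignment.get("name"))
    return hits
```

The output is only a shortlist. Whether a given assignment actually denies an operation depends on the effect in its policy definition, which still has to be checked by hand or with a further lookup.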

The ones nobody fixes (but should)

These don't break the deployment. They create security debt that compounds over time. Across 40 deployments, we saw at least one of them in nearly every tenant. Most customers acknowledge them and never remediate.

Expired certificates on service principals

In 28 of 40 tenants, we found at least one service principal with an expired credential still "in use." Most organizations don't have systematic credential lifecycle management for workload identities.

What we do now: We configure certificate expiry alerting through Azure Monitor during deployment and document the rotation procedure. Certificate rotation is a 15-minute task. Discovering your evidence pipeline has been down for a week because of expiry is a much worse conversation.
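
Separately from the Azure Monitor alerting, an offline check can be sketched in a few lines. The example below is not the alerting we configure; it assumes credentials exported as JSON (for example from az ad app credential list) with keyId and endDateTime fields in UTC. Older CLI versions may name the expiry field differently. The 90-day window is an assumed default.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(value):
    """Parse 'YYYY-MM-DDTHH:MM:SS...' and assume the timestamp is UTC."""
    # Fractional seconds and the trailing 'Z' are dropped by the slice.
    parsed = datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def credential_status(credentials, now=None, warn_days=90):
    """Return (keyId, status) pairs: 'expired', 'rotate soon', or 'ok'."""
    now = now or datetime.now(timezone.utc)
    report = []
    for cred in credentials:
        end = parse_utc(cred["endDateTime"])
        if end <= now:
            status = "expired"
        elif end - now <= timedelta(days=warn_days):
            status = "rotate soon"
        else:
            status = "ok"
        report.append((cred.get("keyId"), status))
    return report
```

Run on a schedule against the service principals the evidence pipeline depends on, a report like this turns a surprise outage into a planned rotation.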

Overly broad RBAC assignments from initial setup

12 of 40 customers granted broader permissions than requested. Contributor instead of Reader. Security Admin instead of Security Reader. The broader grants were faster than troubleshooting which specific role resolved an error. Nobody scopes them down afterward.

What we do now: Post-deployment RBAC review at 30 days. We audit role assignments against our minimum-required matrix and provide specific commands to reduce scope. Adoption rate: about 60%.
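
The review itself is a simple comparison. The sketch below assumes role assignments exported as JSON (for example from az role assignment list) with principalId, roleDefinitionName, and scope fields. The minimum-role set and the downgrade map are hypothetical stand-ins built from the examples above, not our actual matrix.

```python
# Hypothetical minimum-required roles, for illustration only.
MINIMUM_ROLES = {"Reader", "Security Reader"}

# Broader grant -> narrower role it usually stands in for (from the text above).
NARROWER_ROLE = {
    "Contributor": "Reader",
    "Security Admin": "Security Reader",
}


def excess_assignments(assignments, principal_id):
    """List role assignments for a principal that exceed the minimum set."""
    findings = []
    for assignment in assignments:
        if assignment.get("principalId") != principal_id:
            continue
        role = assignment.get("roleDefinitionName")
        if role in MINIMUM_ROLES:
            continue
        findings.append({
            "role": role,
            "scope": assignment.get("scope"),
            "suggested": NARROWER_ROLE.get(role),  # None if no known downgrade
        })
    return findings
```

Each finding carries the scope, so the remediation can be expressed as one removal and one narrower grant at the same scope.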

Resource group tagging inconsistencies

In 22 of 40 deployments, tagging wasn't applied correctly. Costs were either unattributed or charged to the wrong cost center. Three customers discovered this during annual budgeting. The conversation was never about money. It was: "How did infrastructure get deployed without proper tagging, and what else did we miss?"

What we do now: Tagging requirements are a pre-flight item. We collect the schema before deployment and apply it during provisioning.
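
Checking the result after provisioning can be sketched the same way. The example assumes resources exported as JSON (for example from az resource list) with name and tags fields, where tags may be null. The required keys passed in are whatever the customer's schema specifies; the ones in the usage note are made up. Tag names are compared case-insensitively, matching how Azure treats them.

```python
def missing_tags(resources, required_keys):
    """Map resource name -> required tag keys that are absent."""
    gaps = {}
    for resource in resources:
        # 'tags' can be missing or null on untagged resources.
        present = {key.lower() for key in (resource.get("tags") or {})}
        missing = [key for key in required_keys if key.lower() not in present]
        if missing:
            gaps[resource.get("name", "<unnamed>")] = missing
    return gaps
```

For a hypothetical schema requiring CostCenter and Owner, missing_tags(resources, ["CostCenter", "Owner"]) returns only the resources that need attention, which keeps the follow-up list short.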

The three pre-flight checks that catch 80% of issues

Run before deployment day, these three checks catch approximately 80% of the issues described above.

Check 1: Tenant readiness assessment

Duration: 2-3 hours with the customer's infrastructure team.

Assessment Area	What We Check	Why
Entra ID configuration	CA policies targeting all cloud apps, workload identity policies, service principal restrictions	Identifies authentication blockers
Network topology	VNET structure, hub-spoke vs. flat, firewall/NVA presence, forced tunneling, Private Link usage	Identifies evidence pipeline blockers
Azure Policy inventory	Active policy assignments at subscription and management group scope	Identifies deployment and update blockers
Subscription limits	AKS quota, compute quota, storage quota in target region	Identifies capacity constraints
Existing service principals	Credential hygiene, naming conventions, lifecycle processes	Identifies operational maturity for SP management

Check 2: Integration surface validation

Duration: 1-2 hours with the customer's security operations team.

Which Microsoft security products are deployed and producing data? We validate:

Defender XDR: Active and generating alerts? Which workloads covered (Endpoint, Office 365, Identity, Cloud Apps)?
Sentinel: Deployed? Which data connectors active? Log retention period?
Purview: Information Protection deployed? DLP policies active? Sensitivity labels in use?
Entra ID: P1 or P2 licensing? Conditional Access policies active? Identity Protection enabled?
Azure Policy: Compliance state populated? Custom policy definitions in use?

The answers matter because the Compliance Graph's coverage map reflects actual integration availability, not a theoretical maximum.

Check 3: Data boundary confirmation

Duration: 30 minutes. Usually a conversation with the customer's security architect or DPO.

We confirm:

All platform components run inside the customer's subscription. Verified by resource group inventory.
All data storage resides in customer Azure resources. Verified by connection strings.
No telemetry or compliance data egresses to Kyudo infrastructure. Verified by network flow logs.
Platform updates are pulled (customer cluster pulls images from our registry), never pushed.

For financial services, this produces a written attestation. For defense customers, it feeds their ATO package. For healthcare, it supports BAA scoping.

The counter-argument: "SaaS is simpler"

It is. A SaaS deployment takes hours. No pre-flight assessment, no CA policy conversations. If your data classification allows compliance data in a vendor's infrastructure and your regulators accept "the vendor is SOC 2 certified" as sufficient, SaaS is the right choice.

The forty organizations we deployed inside chose tenant-hosted for specific reasons: financial regulators requiring data residency (9 of 40), defense contracts prohibiting commercial multi-tenant (7 of 40), healthcare ePHI boundary requirements (6 of 40), enterprise policies prohibiting identity data in third-party tenants (11 of 40), and procurement rejecting SaaS on sovereignty grounds (7 of 40).

The pre-flight process adds 1-2 weeks compared to SaaS. For organizations where data sovereignty isn't optional, that's the cost of the requirement. Our job is reducing it to a predictable, repeatable process.

Deployment timeline: what "30 days" actually means

Week	Activities	Outputs
Week 1	Pre-flight checks 1-3. Network assessment. CA policy review. Azure Policy compatibility check.	Go/no-go decision. Remediation items list.
Week 2	Remediation (customer-side): firewall rules, CA exclusions, RBAC grants, tagging. Kyudo: deployment automation configuration.	Environment ready for deployment.
Week 3	Platform deployment. AKS provisioning. Compliance Graph initialization. Service principal configuration. Integration activation.	Platform running. Evidence pipeline active.
Week 4	Framework loading (STRM Engine imports customer's applicable frameworks). Initial evidence collection. CMCAE baseline assessment. User onboarding.	First compliance posture visible.

Week 1 catches problems. Week 2 fixes them. Week 3 deploys. Week 4 delivers value.

Monday morning checklist

1. Know your network topology. Map outbound connectivity from the target VNET. Can workloads reach Microsoft Graph API and Azure management endpoints? If you use forced tunneling or Azure Firewall, explicitly verify.

2. Audit your CA policies for service principal impact. For each policy scoped to "All cloud apps," check whether it applies to workload identities. Understand what conditions would block a service principal authenticating from within your own VNET.

3. Export your Azure Policy assignments. Run az policy assignment list --scope /subscriptions/{id}. Know what's enforced and what it prevents.

4. Check your service principal credential expiry. Run az ad app credential list --id {app-id} for critical service principals. Anything expiring within 90 days needs a rotation plan.

5. Define your data boundary position. "Can compliance data reside outside our tenant?" If no, tenant-deployed solutions are your only option. Know that before the sales process, not during procurement review.

Kyudo deploys inside your Azure tenant. Your data stays in your subscription. No egress. Forty deployments taught us the pre-flight checks that prevent Thursday afternoon troubleshooting calls.

Book a demo to walk through the tenant readiness assessment for your environment.