Kyūdō

Deploying Inside a Customer Tenant: What We Learned From Forty Installs

The misconfigurations that actually matter, the ones nobody fixes, and the three pre-flight checks that catch 80% of issues.

Kyudo Editorial · February 25, 2026 · 11 min read

It's 3 PM on a Thursday. We're on a screen-share with a customer's infrastructure team. The Kyudo platform is deployed, the AKS cluster is running, the Compliance Graph is initialized. But the evidence pipeline from Defender XDR is returning 403s. The service principal has the right API permissions. The Graph API scopes are correct. The issue, which takes 90 minutes to isolate, is a Conditional Access policy that blocks service principal authentication from any IP not on the customer's corporate allow-list. The AKS cluster's egress IP wasn't added.

This is install number 23. We've seen this exact failure pattern before. By install 34, it's on the pre-flight checklist. By install 40, we catch it before deployment starts.

Forty tenant deployments teach you things architecture diagrams don't. Every Azure tenant is theoretically identical. In practice, each one has accumulated years of policy decisions and network configurations that create unique deployment surfaces.

The deployment model

Kyudo deploys inside the customer's Azure tenant. Not adjacent to it. Inside it. AKS in their subscription, data in their resources, service principals in their Entra ID. Nothing egresses to Kyudo infrastructure. The customer's data stays in their data boundary.

This model exists because regulated organizations told us they won't send compliance data to a vendor's tenant. We wrote about this trade-off in "GRC Implementation in 30 Days."

The trade-off: we get data sovereignty and customer trust. We also get every configuration quirk that tenant has accumulated over its lifetime.

The three misconfigurations that actually break things

These three cause deployment failures or functional degradation. Everything else resolves quickly.

1. Network segmentation that blocks the evidence pipeline

Frequency: 14 of 40 deployments (35%)

The evidence pipeline calls Defender XDR, Sentinel, Purview, Entra ID, and Azure Policy APIs from the AKS cluster inside the customer's virtual network.

What breaks: Evidence collection returns 403 or timeout errors. The platform deploys fine, the UI works, but no evidence flows. The Compliance Graph stays empty.

The fix: Pre-deployment network topology review. We need outbound HTTPS (443) from the AKS subnet to graph.microsoft.com, api.securitycenter.microsoft.com, management.azure.com, and Sentinel workspace endpoints. Azure Firewall needs explicit application rules. NSGs need service tags AzureActiveDirectory and AzureResourceManager allowed.
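A minimal sketch of that connectivity check, meant to be run from a pod or VM on the AKS subnet. The function name and the injectable `connect` parameter are our own illustration, not part of the Kyudo tooling. Sentinel workspace endpoints vary per tenant, so they are left out of the host list here.

```python
import socket
from typing import Callable, Dict, Iterable

# Fixed FQDNs named in the text. Sentinel workspace endpoints are
# tenant-specific and must be added by the operator.
REQUIRED_HOSTS = (
    "graph.microsoft.com",
    "api.securitycenter.microsoft.com",
    "management.azure.com",
)


def check_outbound_https(
    hosts: Iterable[str],
    connect: Callable = socket.create_connection,
    timeout: float = 5.0,
) -> Dict[str, bool]:
    """Return {host: reachable} based on a TCP connect to port 443."""
    results = {}
    for host in hosts:
        try:
            conn = connect((host, 443), timeout=timeout)
            conn.close()
            results[host] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[host] = False
    return results
```

Calling `check_outbound_https(REQUIRED_HOSTS)` returns a per-host map. A successful TCP connect only shows that the route and NSGs allow port 443. A firewall or NVA that filters on FQDN can still reject the TLS session, so treat a pass here as necessary rather than sufficient.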

| Network Config | Risk Level | Pre-Flight Check |
|---|---|---|
| Open outbound (default) | Low | Verify no restrictive NSGs on AKS subnet |
| Azure Firewall with allow-list | High | Require application rules for Graph API, Defender API, ARM |
| Private Link for Graph API | Medium | Verify private DNS resolution from AKS VNET |
| NVA with URL filtering | High | Whitelist Microsoft service FQDNs, not just IPs |
| Hub-spoke with forced tunneling | High | Verify that Microsoft service traffic doesn't route through on-prem |

The hub-spoke with forced tunneling case is the worst. API calls that should take 200ms take 3 seconds because they're routing through an on-prem datacenter in another geography. That breaks evidence collection timeouts.

2. Conditional Access policies that block service principals

Frequency: 11 of 40 deployments (28%)

CA policies are designed for human users. They often catch service principals as unintended collateral. Common triggers: location-based policies blocking sign-ins from IPs not on the corporate allow-list (the AKS egress IP isn't listed), policies requiring compliant devices (service principals don't have devices), or policies requiring MFA (service principals can't complete MFA interactively).

What breaks: API calls fail with AADSTS53003: Access has been blocked by Conditional Access policies. Evidence collection halts. If the CA policy is intermittent (risk-based), the evidence pipeline fails sporadically, which is harder to diagnose than a consistent failure.
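One way to keep a sporadic Conditional Access block from being mistaken for a generic auth failure is to classify token errors by their AADSTS code. The sketch below assumes the error body has already been parsed into a dict with the `error_description` and optional `error_codes` fields that Entra ID token error responses carry. The function names are hypothetical.

```python
import re
from typing import List

AADSTS_PATTERN = re.compile(r"AADSTS(\d+)")
CA_BLOCKED = 53003  # "Access has been blocked by Conditional Access policies"


def aadsts_codes(error_description: str) -> List[int]:
    """Extract every AADSTS error number mentioned in a description string."""
    return [int(code) for code in AADSTS_PATTERN.findall(error_description)]


def is_conditional_access_block(token_error: dict) -> bool:
    """True if the token error body points at a Conditional Access block."""
    codes = list(token_error.get("error_codes") or [])
    codes += aadsts_codes(token_error.get("error_description", ""))
    return CA_BLOCKED in codes
```

Logging this flag alongside each failed evidence-collection call makes an intermittent, risk-based block show up as a countable pattern instead of scattered failures.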

The fix: Exclude the Kyudo service principal from CA policies that target user sign-in properties the service principal can't satisfy. This requires coordination with the customer's identity team, who are often (reasonably) reluctant to create exclusions without understanding the implications.

The argument that usually lands: "The service principal authenticates using a client certificate from within your own VNET. The location restriction doesn't add security value for a workload identity that never leaves your tenant." That resolves it. But the conversation needs to happen before deployment, not during a troubleshooting call at 3 PM on a Thursday.

3. Azure Policy assignments that prevent platform updates

Frequency: 8 of 40 deployments (20%)

Azure Policy assignments with a deny effect block resource operations that violate them. The initial deployment succeeds, but a later platform update fails because its new configuration requirement conflicts with a policy assigned after deployment.

What breaks: Platform updates fail with cryptic ARM errors. The platform stays on the older version, accumulating version drift.

The fix: Document minimum Azure Policy exemptions for the Kyudo resource group at deployment time. Review policy compatibility before each platform update.

| Common Azure Policy | Conflict | Resolution |
|---|---|---|
| Require specific AKS version | Blocks cluster upgrades | Exempt Kyudo RG or align update cadence |
| Deny public IP creation | Blocks ingress controller | Use internal load balancer (preferred) or exempt |
| Require specific VM SKU | Blocks node pool scaling | Pre-approve SKU list for Kyudo nodes |
| Enforce tag inheritance | Fails on auto-generated resources | Exempt system-managed resources in Kyudo RG |
| Deny extensions on VMs | Blocks monitoring agents | Exempt or pre-install required extensions |
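A pre-update compatibility review can start from an export of the policy assignments. The sketch below assumes a list of assignment objects as produced by `az policy assignment list` (with `displayName` and `enforcementMode` fields) and flags names that resemble the conflicts listed above. The keyword map is purely illustrative. Display names differ per tenant, so this is a first-pass heuristic, not a substitute for reading the policy definitions.

```python
from typing import Dict, List, Tuple

# Illustrative keyword -> conflict notes based on the conflicts listed above.
# Matching on display names is a heuristic and will miss renamed policies.
CONFLICT_HINTS: Dict[str, str] = {
    "kubernetes version": "can block cluster upgrades",
    "public ip": "can block the ingress controller",
    "sku": "can block node pool scaling",
    "tag": "can fail on auto-generated resources",
    "extension": "can block monitoring agents",
}


def flag_policy_conflicts(assignments: List[dict]) -> List[Tuple[str, str]]:
    """Return (display name, note) pairs for assignments worth a closer look."""
    findings = []
    for assignment in assignments:
        # Assignments in DoNotEnforce mode are evaluated but do not block requests.
        if assignment.get("enforcementMode") == "DoNotEnforce":
            continue
        name = assignment.get("displayName") or ""
        for keyword, note in CONFLICT_HINTS.items():
            if keyword in name.lower():
                findings.append((name, note))
    return findings
```

Running this against the export before each platform update gives the customer's platform team a short list to confirm or exempt, rather than a failed update to debug.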

The ones nobody fixes (but should)

These don't break the deployment. They create security debt that compounds over time. In 40 deployments, we see these in nearly every tenant. Most customers acknowledge them and never remediate.

Expired certificates on service principals

In 28 of 40 tenants, we found at least one service principal with an expired credential still "in use." Most organizations don't have systematic credential lifecycle management for workload identities.

What we do now: We configure certificate expiry alerting through Azure Monitor during deployment and document the rotation procedure. Certificate rotation is a 15-minute task. Discovering your evidence pipeline has been down for a week because of expiry is a much worse conversation.
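Alongside alerting, a quick scripted check over exported credentials is cheap to run. This sketch assumes credential objects shaped like the JSON from `az ad app credential list` (add `--cert` to list certificate credentials), with `keyId` and `endDateTime` fields. Older CLI versions used different field names, so treat the keys as an assumption. The 90-day default mirrors the rotation window suggested in the checklist at the end of this article.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def parse_end(timestamp: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS...' as UTC, ignoring fractional seconds."""
    return datetime.strptime(timestamp[:19], "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc
    )


def expiring_credentials(
    credentials: List[dict], now: datetime, window_days: int = 90
) -> List[Tuple[str, str, str]]:
    """Return (keyId, status, end date) for expired or soon-to-expire credentials."""
    cutoff = now + timedelta(days=window_days)
    flagged = []
    for cred in credentials:
        end = parse_end(cred["endDateTime"])
        if end <= cutoff:
            status = "expired" if end <= now else "expiring"
            flagged.append((cred.get("keyId", "?"), status, end.date().isoformat()))
    return flagged
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the function deterministic and easy to test.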

Overly broad RBAC assignments from initial setup

12 of 40 customers granted broader permissions than requested. Contributor instead of Reader. Security Admin instead of Security Reader. The broader grants were faster than troubleshooting which specific role resolved an error. Nobody scopes them down afterward.

What we do now: Post-deployment RBAC review at 30 days. We audit role assignments against our minimum-required matrix and provide specific commands to reduce scope. Adoption rate: about 60%.
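The 30-day review boils down to comparing what was granted against what was requested. The sketch below assumes role assignment objects like those from `az role assignment list` (with `roleDefinitionName` and `scope`). The mapping of broad roles to narrower suggestions is a hypothetical stand-in for a real minimum-required matrix.

```python
from typing import Dict, List

# Hypothetical mapping: over-broad role -> role that was actually requested.
NARROWER_ROLE: Dict[str, str] = {
    "Owner": "Reader",
    "Contributor": "Reader",
    "Security Admin": "Security Reader",
}


def over_scoped(assignments: List[dict]) -> List[dict]:
    """List assignments whose role is broader than the requested minimum."""
    findings = []
    for assignment in assignments:
        role = assignment["roleDefinitionName"]
        if role in NARROWER_ROLE:
            findings.append(
                {
                    "scope": assignment["scope"],
                    "granted": role,
                    "suggested": NARROWER_ROLE[role],
                }
            )
    return findings
```

Each finding carries the scope and a suggested replacement role, which is the information a customer needs to turn the review into a concrete change request.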

Resource group tagging inconsistencies

In 22 of 40 deployments, tagging wasn't applied correctly. Platform costs ended up either unattributed or charged to the wrong cost center. Three customers discovered this during annual budgeting. The conversation was never about money. It was: "How did infrastructure get deployed without proper tagging, and what else did we miss?"

What we do now: Tagging requirements are a pre-flight item. We collect the schema before deployment and apply it during provisioning.
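Once the schema is collected, checking provisioned resources against it is mechanical. A minimal sketch, assuming resource objects with `name` and `tags` fields as in `az resource list` output. The tag keys in the usage note are placeholders, not a recommended schema.

```python
from typing import Dict, Iterable, List


def missing_tags(resources: List[dict], required_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Map resource name -> required tag keys that are absent or empty."""
    required = list(required_keys)
    report = {}
    for resource in resources:
        tags = resource.get("tags") or {}  # 'tags' can be null in CLI output
        missing = sorted(key for key in required if not tags.get(key))
        if missing:
            report[resource["name"]] = missing
    return report
```

For example, `missing_tags(resources, ["CostCenter", "Owner"])` returns an empty dict when every resource in the Kyudo resource group carries both tags.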

The three pre-flight checks that catch 80% of issues

Run before deployment day, these three checks catch approximately 80% of the issues described above.

Check 1: Tenant readiness assessment

Duration: 2-3 hours with the customer's infrastructure team.

| Assessment Area | What We Check | Why |
|---|---|---|
| Entra ID configuration | CA policies targeting all cloud apps, workload identity policies, service principal restrictions | Identifies authentication blockers |
| Network topology | VNET structure, hub-spoke vs. flat, firewall/NVA presence, forced tunneling, Private Link usage | Identifies evidence pipeline blockers |
| Azure Policy inventory | Active policy assignments at subscription and management group scope | Identifies deployment and update blockers |
| Subscription limits | AKS quota, compute quota, storage quota in target region | Identifies capacity constraints |
| Existing service principals | Credential hygiene, naming conventions, lifecycle processes | Identifies operational maturity for SP management |

Check 2: Integration surface validation

Duration: 1-2 hours with the customer's security operations team.

Which Microsoft security products are deployed and producing data? We validate:

  • Defender XDR: Active and generating alerts? Which workloads covered (Endpoint, Office 365, Identity, Cloud Apps)?
  • Sentinel: Deployed? Which data connectors active? Log retention period?
  • Purview: Information Protection deployed? DLP policies active? Sensitivity labels in use?
  • Entra ID: P1 or P2 licensing? Conditional Access policies active? Identity Protection enabled?
  • Azure Policy: Compliance state populated? Custom policy definitions in use?

This way, the Compliance Graph's coverage map reflects actual integration availability, not the theoretical maximum.

Check 3: Data boundary confirmation

Duration: 30 minutes. Usually a conversation with the customer's security architect or DPO.

We confirm:

  • All platform components run inside the customer's subscription. Verified by resource group inventory.
  • All data storage resides in customer Azure resources. Verified by connection strings.
  • No telemetry or compliance data egresses to Kyudo infrastructure. Verified by network flow logs.
  • Platform updates are pulled (customer cluster pulls images from our registry), never pushed.
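The flow-log verification in the list above can be approximated with a short script. The sketch assumes flow records have already been reduced to dicts with a `dest_ip` field (real flow log formats are more involved) and uses documentation-only address ranges as stand-ins for vendor infrastructure. The registry exception reflects the pull-based update model.

```python
import ipaddress
from typing import Iterable, List


def egress_violations(
    flows: List[dict],
    vendor_cidrs: Iterable[str],
    registry_ips: Iterable[str] = (),
) -> List[dict]:
    """Return flows whose destination is vendor infrastructure, excluding
    the image registry that the cluster is expected to pull from."""
    networks = [ipaddress.ip_network(cidr) for cidr in vendor_cidrs]
    allowed = {ipaddress.ip_address(ip) for ip in registry_ips}
    hits = []
    for flow in flows:
        dest = ipaddress.ip_address(flow["dest_ip"])
        if dest in allowed:
            continue  # image pulls are part of the documented update path
        if any(dest in network for network in networks):
            hits.append(flow)
    return hits
```

An empty result over a representative window of flow records is the evidence that backs the "no egress" statement. Any hit is a finding to investigate before signing an attestation.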

For financial services, this produces a written attestation. For defense customers, it feeds their ATO package. For healthcare, it supports BAA scoping.

The counter-argument: "SaaS is simpler"

It is. A SaaS deployment takes hours. No pre-flight assessment, no CA policy conversations. If your data classification allows compliance data in a vendor's infrastructure and your regulators accept "the vendor is SOC 2 certified" as sufficient, SaaS is the right choice.

The forty organizations we deployed inside chose tenant-hosted for specific reasons: financial regulators requiring data residency (9 of 40), defense contracts prohibiting commercial multi-tenant (7 of 40), healthcare ePHI boundary requirements (6 of 40), enterprise policies prohibiting identity data in third-party tenants (11 of 40), and procurement rejecting SaaS on sovereignty grounds (7 of 40).

The pre-flight process adds 1-2 weeks compared to SaaS. For organizations where data sovereignty isn't optional, that's the cost of the requirement. Our job is reducing it to a predictable, repeatable process.

Deployment timeline: what "30 days" actually means

| Week | Activities | Outputs |
|---|---|---|
| Week 1 | Pre-flight checks 1-3. Network assessment. CA policy review. Azure Policy compatibility check. | Go/no-go decision. Remediation items list. |
| Week 2 | Remediation (customer-side): firewall rules, CA exclusions, RBAC grants, tagging. Kyudo: deployment automation configuration. | Environment ready for deployment. |
| Week 3 | Platform deployment. AKS provisioning. Compliance Graph initialization. Service principal configuration. Integration activation. | Platform running. Evidence pipeline active. |
| Week 4 | Framework loading (STRM Engine imports customer's applicable frameworks). Initial evidence collection. CMCAE baseline assessment. User onboarding. | First compliance posture visible. |

Week 1 catches problems. Week 2 fixes them. Week 3 deploys. Week 4 delivers value.

Monday morning checklist

1. Know your network topology. Map outbound connectivity from the target VNET. Can workloads reach Microsoft Graph API and Azure management endpoints? If you use forced tunneling or Azure Firewall, explicitly verify.

2. Audit your CA policies for service principal impact. For each policy scoped to "All cloud apps," check whether it applies to workload identities. Understand what conditions would block a service principal authenticating from within your own VNET.

3. Export your Azure Policy assignments. Run az policy assignment list --scope /subscriptions/{id}. Know what's enforced and what it prevents.

4. Check your service principal credential expiry. Run az ad app credential list --id {app-id} for critical service principals. Anything expiring within 90 days needs a rotation plan.

5. Define your data boundary position. "Can compliance data reside outside our tenant?" If no, tenant-deployed solutions are your only option. Know that before the sales process, not during procurement review.


Kyudo deploys inside your Azure tenant. Your data stays in your subscription. No egress. Forty deployments taught us the pre-flight checks that prevent Thursday afternoon troubleshooting calls.

Book a demo to walk through the tenant readiness assessment for your environment.
