
From Provisioning to Control Plane: Designing a Hybrid Terraform + Crossplane Architecture at Scale


1. Overview

What I Designed

I designed a hybrid infrastructure architecture:

  • Terraform → Foundation Layer

  • Crossplane → Dynamic Lifecycle Layer

  • ArgoCD → GitOps Enforcement

This created a continuously reconciling cloud control plane inside Kubernetes.


Why It Was Required

Our platform crossed:

  • 50+ microservices

  • Multiple engineering teams

  • Multi-region expansion

  • PR-driven infrastructure workflows

  • Feature branch–based short-lived environments

Terraform workflows became operationally slow due to:

  • State contention

  • Long plan times

  • PR bottlenecks

  • Manual drift detection

Provisioning was working.
Lifecycle control was missing.


Constraint

  • No full rewrite

  • No infrastructure instability

  • Zero data loss

  • Minimal migration risk


Engineering Principle Followed

Separate provisioning from lifecycle reconciliation.

Provision once.
Reconcile continuously.


2. Problem Statement

Existing Architecture

Terraform model:

Plan → Apply → Exit

After apply:

  • No continuous reconciliation

  • Drift detection only on next plan

  • Manual console changes remain undetected


Failure Risk

If an engineer modified:

  • RDS storage encryption

  • Deletion protection

  • Security groups

  • IAM policies

Terraform would not react until the next plan/apply cycle.

Drift became a silent operational risk.


What Would Break

  • Compliance posture

  • Backup guarantees

  • Encryption enforcement

  • Network boundaries

  • Incident recovery confidence


Why It Was Unacceptable

At scale:

Manual governance does not work.

Infrastructure must enforce its declared state.


3. Architecture After Implementation

Control plane flow:

Developer commits YAML
→ ArgoCD syncs to cluster
→ Kubernetes API stores desired state
→ Crossplane controller watches the resource
→ Crossplane calls the AWS API
→ Cloud resource created/updated
→ Continuous reconciliation loop
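The GitOps entry point to that flow can be a single ArgoCD Application watching the claims repository. A minimal sketch — the repository URL and path are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-infra.git  # hypothetical repo
    targetRevision: main
    path: claims/
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert out-of-band changes to the cluster objects
```

With `selfHeal` enabled, ArgoCD corrects drift at the Kubernetes object level, while Crossplane corrects drift at the cloud resource level.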

Terraform foundation layer:

  • VPC

  • Subnets

  • EKS Control Plane

  • Core Networking

Clear separation of responsibilities.
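A minimal sketch of that foundation layer, assuming the community terraform-aws-modules (inputs simplified for illustration):

```hcl
# Foundation layer: rarely-changing resources stay in Terraform.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name            = "platform-vpc"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name = "platform-eks"
  vpc_id       = module.vpc.vpc_id
  subnet_ids   = module.vpc.private_subnets
}
```

Everything below this line changes rarely and goes through a tightly controlled plan/apply pipeline; everything above it moves to the control plane.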


4. Design Decisions


4.1 Core Component Choice

Terraform for foundation

Why I chose it:

  • Mature state handling

  • Strong bootstrap ecosystem

  • Clear isolation of foundational infrastructure

Multi-account governance was enforced via separate state isolation and account-factory patterns; Terraform itself does not natively provide org-level governance.

Trade-off:

  • No continuous reconciliation

Risk accepted:

  • Foundation changes are rare and tightly controlled

Crossplane for dynamic infrastructure

Why I chose it:

  • Kubernetes-native control loop

  • GitOps-friendly

  • CRD-based lifecycle management

Trade-off:

  • Adds API server load

  • Adds controller complexity

Risk accepted:

  • Infrastructure lifecycle now depends on cluster health

4.2 Failure Detection

Reconciliation loop ensures:

Actual State == Desired State

Manual console change → Crossplane reconciles.

Trade-off:

  • AWS API throttling possible

  • Eventual consistency delays

Mitigation:

  • Tuned provider/controller concurrency and AWS API backoff settings to reduce throttling
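That tuning can be expressed declaratively. A sketch using a ControllerConfig — note this kind is deprecated in newer Crossplane releases in favor of DeploymentRuntimeConfig, and exact flag names vary by provider version:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-throttle-tuning
spec:
  args:
    # Cap concurrent reconciles to stay under AWS API rate limits;
    # verify the supported flag name for your provider version.
    - --max-reconcile-rate=5
```

The Provider object picks this up through its `controllerConfigRef`.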

4.3 Event Routing

Git Commit
→ ArgoCD
→ Kubernetes API
→ Crossplane Controller
→ AWS API

Rollback = git revert.
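Because desired state lives in Git, rollback is an ordinary revert. A self-contained sketch in a throwaway repository — the claim file and values are hypothetical:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "platform@example.com"
git config user.name "platform"

# Desired state: a claim file tracked in Git
echo "storage: 20" > claim.yaml
git add claim.yaml
git commit -qm "add db claim"

# A bad change lands...
echo "storage: 500" > claim.yaml
git commit -qam "oversize db"

# ...and rollback is a plain revert; ArgoCD would sync the restored state
git revert --no-edit HEAD >/dev/null
cat claim.yaml  # prints "storage: 20"
```

No imperative infrastructure surgery: the revert commit becomes the new desired state, and the reconciliation loop does the rest.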

Trade-off:

  • Git becomes critical dependency

Mitigation:

  • Strong repository governance and PR controls

4.4 Automation Logic

Platform team defined Compositions.

Example:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgres
spec:
  compositeTypeRef:
    apiVersion: platform.io/v1alpha1
    kind: XPostgres
  resources:
    - name: database
      base:
        apiVersion: database.aws.crossplane.io/v1beta1
        kind: RDSInstance
        spec:
          deletionPolicy: Orphan
          forProvider:
            storageEncrypted: true
            deletionProtection: true

Application teams used a Claim:

apiVersion: platform.io/v1alpha1
kind: PostgresClaim
metadata:
  name: app-db
spec:
  parameters:
    storage: 20

Why I chose this:

  • Central policy enforcement

  • Developer abstraction

  • Clear ownership boundary

Trade-off:

  • Composition update blast radius

  • Requires versioning discipline

Mitigation:

  • Versioned Compositions per environment
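The Composition and Claim above are glued together by a CompositeResourceDefinition (XRD) that declares the `XPostgres` composite type and its `PostgresClaim` claim name. A minimal sketch of the schema behind the `storage` parameter:

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgres.platform.io   # must be <plural>.<group>
spec:
  group: platform.io
  names:
    kind: XPostgres
    plural: xpostgres
  claimNames:
    kind: PostgresClaim
    plural: postgresclaims
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                parameters:
                  type: object
                  properties:
                    storage:
                      type: integer
                  required:
                    - storage
```

The XRD is the contract: platform teams version it alongside Compositions; application teams only ever see the Claim surface.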

4.5 DNS / Networking

DNS and core networking remained Terraform-managed.

Reason:

  • High blast radius

  • Low change frequency

  • Complex dependency graph

Control plane expansion was phased deliberately.


5. Implementation Snippet

Install Crossplane:

helm repo add crossplane-stable https://charts.crossplane.io/stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace

Install AWS Provider:

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2

Configure ProviderConfig (IRSA recommended):

apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws
spec:
  credentials:
    source: IRSA

ProviderConfig was configured using IRSA to avoid static credentials.
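With IRSA, the provider's service account also needs the IAM role annotation. With provider-aws this is typically supplied through a ControllerConfig referenced from the Provider object — the role ARN below is hypothetical:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-irsa
  annotations:
    # Hypothetical role ARN; the role must trust the cluster's OIDC provider
    # and grant the permissions for every resource type the provider manages.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/crossplane-provider-aws
```

The Provider object references it via `spec.controllerConfigRef`.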

RDS Example:

apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: platform-db
spec:
  deletionPolicy: Orphan
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.micro
    allocatedStorage: 20
    engine: postgres
    storageEncrypted: true
    deletionProtection: true
  providerConfigRef:
    name: aws

This assumes a default VPC and DB subnet group already exist; in hardened environments, explicit networking fields (such as `dbSubnetGroupName` and VPC security group IDs) must be specified.


6. Traffic / DNS Consideration

Database endpoints remained AWS-managed.

No DNS switching was automated through Crossplane.

Reason:

  • Database endpoints are stable

  • DNS manipulation has high blast radius

  • Networking remained Terraform-owned


7. Validation Process

Step 1 – Manual Drift Simulation

Modified RDS parameter in AWS Console.

Observed:

  • Crossplane detected change

  • Reconciliation restored desired state


Step 2 – Deletion Policy Test

Deleted CR with:

deletionPolicy: Orphan

Observed:

  • Cloud resource retained

  • CR removed


Step 3 – Delete Mode Test

Changed to:

deletionPolicy: Delete

Deleted CR.

Observed:

  • Cloud resource removed

Lifecycle behavior verified.


Step 4 – API Throttling Simulation

Created multiple resources in parallel.

Observed:

  • AWS API throttling errors

  • Provider retry with exponential backoff

This validated the need for concurrency and backoff tuning.


8. Cost & Trade-offs

Infrastructure Cost

  • Additional Crossplane controller pods

  • Increased etcd object count

  • Higher AWS API call volume

Cost impact: Moderate.


Operational Complexity

Increased:

  • Controller debugging

  • Composition versioning

  • CRD lifecycle management

Reduced:

  • Manual drift remediation

  • Terraform PR bottlenecks

  • Apply-time surprises


RTO

Improved.

Drift auto-corrected without manual intervention.


RPO

No direct change.

Depends on AWS-native backup policies.


Scaling Impact

Pros:

  • Safe self-service for app teams

  • Git-auditable infrastructure

  • Continuous compliance enforcement

Cons:

  • API server load increases

  • AWS rate limit sensitivity

  • Composition update blast radius


9. When This Design Makes Sense

✔️ 50+ services
✔️ Dedicated platform team
✔️ GitOps maturity
✔️ Kubernetes-native organization
✔️ High infrastructure churn


When NOT to Use It

  • Small teams

  • Low churn infrastructure

  • No Kubernetes maturity

  • Multi-account bootstrap phase

  • Extremely complex networking requirements


10. Final Takeaway

Terraform builds infrastructure.

Crossplane manages the infrastructure lifecycle.

GitOps enforces declared intent.

This was not a tool replacement exercise.

It was an architectural shift from:

Provisioning mindset → Control plane mindset

At scale, lifecycle enforcement matters more than provisioning speed.

Hybrid architecture made lifecycle enforcement operationally viable at scale.