# From Provisioning to Control Plane: Designing a Hybrid Terraform + Crossplane Architecture at Scale

## 1\. Overview

### What I Designed

I designed a **hybrid infrastructure architecture**:

*   **Terraform → Foundation Layer**
    
*   **Crossplane → Dynamic Lifecycle Layer**
    
*   **ArgoCD → GitOps Enforcement**
    

This created a continuously reconciling cloud control plane inside Kubernetes.

* * *

### Why It Was Required

Our platform crossed:

*   50+ microservices
    
*   Multiple engineering teams
    
*   Multi-region expansion
    
*   PR-driven infrastructure workflows
    
*   Feature branch–based short-lived environments
    

Terraform workflows became operationally slow due to:

*   State contention
    
*   Long plan times
    
*   PR bottlenecks
    
*   Manual drift detection
    

Provisioning was working.  
Lifecycle control was missing.

* * *

### Constraint

*   No full rewrite
    
*   No infrastructure instability
    
*   Zero data loss
    
*   Minimal migration risk
    

* * *

### Engineering Principle Followed

**Separate provisioning from lifecycle reconciliation.**

Provision once.  
Reconcile continuously.

* * *

## 2\. Problem Statement

### Existing Architecture

Terraform model:

Plan → Apply → Exit

After apply:

*   No continuous reconciliation
    
*   Drift detection only on next plan
    
*   Manual console changes remain undetected
    

* * *

### Failure Risk

If an engineer modified:

*   RDS storage encryption
    
*   Deletion protection
    
*   Security groups
    
*   IAM policies
    

Terraform would not react until the next plan/apply cycle.

Drift became silent operational risk.

* * *

### What Would Break

*   Compliance posture
    
*   Backup guarantees
    
*   Encryption enforcement
    
*   Network boundaries
    
*   Incident recovery confidence
    

* * *

### Why It Was Unacceptable

At scale:

Manual governance does not work.

Infrastructure must enforce its declared state.

* * *

## 3\. Architecture After Implementation

Control plane flow:

Developer commits YAML  
↓  
ArgoCD syncs to cluster  
↓  
Kubernetes API stores desired state  
↓  
Crossplane controller watches resource  
↓  
Crossplane calls AWS API  
↓  
Cloud resource created/updated  
↑  
Continuous reconciliation loop

Terraform foundation layer:

Terraform  
↓  
VPC  
Subnets  
EKS Control Plane  
Core Networking

Clear separation of responsibilities.

* * *

## 4\. Design Decisions

* * *

### 4.1 Core Component Choice

**Terraform for foundation**

Why I chose it:

*   Mature state handling
    
*   Strong bootstrap ecosystem
    
*   Clear isolation of foundational infrastructure
    

Multi-account governance was enforced via separate state isolation and account-factory patterns; Terraform itself does not natively provide org-level governance.

Trade-off:

*   No continuous reconciliation
    

Risk accepted:

*   Foundation changes are rare and tightly controlled
    

* * *

**Crossplane for dynamic infrastructure**

Why I chose it:

*   Kubernetes-native control loop
    
*   GitOps-friendly
    
*   CRD-based lifecycle management
    

Trade-off:

*   Adds API server load
    
*   Adds controller complexity
    

Risk accepted:

*   Infrastructure lifecycle now depends on cluster health
    

* * *

### 4.2 Failure Detection

Reconciliation loop ensures:

Actual State == Desired State

Manual console change → Crossplane reconciles.

Trade-off:

*   AWS API throttling possible
    
*   Eventual consistency delays
    

Risk accepted:

*   Tuned provider/controller concurrency and AWS API backoff settings to mitigate throttling
    

* * *

### 4.3 Event Routing

Git Commit  
→ ArgoCD  
→ Kubernetes API  
→ Crossplane Controller  
→ AWS API

Rollback = git revert.

Trade-off:

*   Git becomes critical dependency
    

Risk accepted:

*   Strong repository governance and PR controls
    

* * *

### 4.4 Automation Logic

Platform team defined Compositions.

Example:

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgres
spec:
  compositeTypeRef:
    apiVersion: platform.io/v1alpha1
    kind: XPostgres
  resources:
    - name: database
      base:
        apiVersion: database.aws.crossplane.io/v1beta1
        kind: RDSInstance
        spec:
          deletionPolicy: Orphan
          forProvider:
            storageEncrypted: true
            deletionProtection: true
```

Application team used Claim:

```yaml
apiVersion: platform.io/v1alpha1
kind: PostgresClaim
metadata:
  name: app-db
spec:
  parameters:
    storage: 20
```

Why I chose this:

*   Central policy enforcement
    
*   Developer abstraction
    
*   Clear ownership boundary
    

Trade-off:

*   Composition update blast radius
    
*   Requires versioning discipline
    

Risk accepted:

*   Versioned compositions per environment
    

* * *

### 4.5 DNS / Networking

DNS and core networking remained Terraform-managed.

Reason:

*   High blast radius
    
*   Low change frequency
    
*   Complex dependency graph
    

Control plane expansion was phased deliberately.

* * *

## 5\. Implementation Snippet

Install Crossplane:

```bash
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace
```

Install AWS Provider:

```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2
```

Configure ProviderConfig (IRSA recommended):

```yaml
apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws
spec:
  credentials:
    source: IRSA
```

ProviderConfig was configured using IRSA to avoid static credentials.

RDS Example:

```yaml
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: platform-db
spec:
  deletionPolicy: Orphan
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.micro
    allocatedStorage: 20
    engine: postgres
    storageEncrypted: true
    deletionProtection: true
  providerConfigRef:
    name: aws
```

Assumes a default VPC and subnet group already exist; in hardened environments, explicit `subnetGroupName` and `securityGroupIds` must be specified.

* * *

## 6\. Traffic / DNS Consideration

Database endpoints remained AWS-managed.

No DNS switching automated through Crossplane.

Reason:

*   Database endpoints are stable
    
*   DNS manipulation has high blast radius
    
*   Networking remained Terraform-owned
    

* * *

## 7\. Validation Process

### Step 1 – Manual Drift Simulation

Modified RDS parameter in AWS Console.

Observed:

*   Crossplane detected change
    
*   Reconciliation restored desired state
    

* * *

### Step 2 – Deletion Policy Test

Deleted CR with:

`deletionPolicy: Orphan`

Observed:

*   Cloud resource retained
    
*   CR removed
    

* * *

### Step 3 – Delete Mode Test

Changed to:

`deletionPolicy: Delete`

Deleted CR.

Observed:

*   Cloud resource removed
    

Lifecycle behavior verified.

* * *

### Step 4 – API Throttling Simulation

Created multiple resources in parallel.

Observed:

*   AWS API throttling errors
    
*   Provider retry with exponential backoff
    

Validated concurrency and backoff tuning necessity.

* * *

## 8\. Cost & Trade-offs

### Infrastructure Cost

*   Additional Crossplane controller pods
    
*   Increased etcd object count
    
*   Higher AWS API call volume
    

Cost impact: Moderate.

* * *

### Operational Complexity

Increased:

*   Controller debugging
    
*   Composition versioning
    
*   CRD lifecycle management
    

Reduced:

*   Manual drift remediation
    
*   Terraform PR bottlenecks
    
*   Apply-time surprises
    

* * *

### RTO

Improved.

Drift auto-corrected without manual intervention.

* * *

### RPO

No direct change.

Depends on AWS-native backup policies.

* * *

### Scaling Impact

Pros:

*   Safe self-service for app teams
    
*   Git-auditable infrastructure
    
*   Continuous compliance enforcement
    

Cons:

*   API server load increases
    
*   AWS rate limit sensitivity
    
*   Composition update blast radius
    

* * *

## 9\. When This Design Makes Sense

✔️ 50+ services  
✔️ Dedicated platform team  
✔️ GitOps maturity  
✔️ Kubernetes-native organization  
✔️ High infrastructure churn

* * *

### When NOT to Use It

*   Small teams
    
*   Low churn infrastructure
    
*   No Kubernetes maturity
    
*   Multi-account bootstrap phase
    
*   Extremely complex networking requirements
    

* * *

## 10\. Final Takeaway

Terraform builds infrastructure.

Crossplane manages the infrastructure lifecycle.

GitOps enforces declared intent.

This was not a tool replacement exercise.

It was an architectural shift from:

Provisioning mindset → Control plane mindset

At scale, lifecycle enforcement matters more than provisioning speed.

Hybrid architecture made lifecycle enforcement operationally viable at scale.
