Zero-Downtime Migration from NGINX Ingress to Gateway API on EKS

A Zero-Downtime, Step-by-Step Implementation Guide

1. Overview

In this post, we walk through a real production migration of a Kubernetes workload from NGINX Ingress Controller to Kubernetes Gateway API, implemented using Envoy Gateway, on Amazon EKS.

The key objectives were to:

Migrate safely with zero downtime
Avoid introducing unnecessary cloud-specific complexity
Align the platform with Kubernetes’ future networking direction

This guide is written from a platform ownership perspective, not a lab or demo setup.

2. Problem Statement

The application was already running in production and exposed using NGINX Ingress Controller.

While the setup was stable, the following risks were identified:

The NGINX Ingress Controller project (ingress-nginx) has shifted to a reduced-maintenance posture, increasing uncertainty about future support.
There were no long-term guarantees for:
- Security patches
- CVE fixes
- Compatibility with future Kubernetes versions
Ingress sits at the cluster edge, making it a high-blast-radius component

Although there was no immediate outage, continuing with an edge component under reduced maintenance posed long-term operational and security risks.

3. Existing Production Architecture (Before Migration)

User
  ↓
AWS LoadBalancer (auto-created for the NGINX controller's Service of type LoadBalancer)
  ↓
NGINX Ingress Controller
  ↓
Application Service (ClusterIP)
  ↓
Application Pods

Characteristics of the existing setup

Stable and functional
Easy to operate
Tightly coupled to controller-specific annotations
Limited separation between platform and application ownership

4. Why Gateway API?

Kubernetes Gateway API is positioned as the successor to Ingress, designed to solve long-standing limitations.

Key improvements over Ingress

| Ingress | Gateway API |
|---|---|
| Single resource | Role-oriented resources |
| Annotation-driven | Spec-defined configuration |
| Weak ownership boundaries | Clear infra vs. app separation |
| Controller-specific behavior | Standardized API |

Gateway API introduces:

GatewayClass – defines platform capability
Gateway – infrastructure-level entry point
HTTPRoute – application-level routing rules

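Splitting these roles lets the platform team own the GatewayClass and Gateway while application teams own their HTTPRoutes, so every change has a clear owner and is easier to review and audit.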

5. Why Envoy Gateway in This Case?

The cluster did not have AWS Load Balancer Controller installed.

Installing it mid-migration would have required:

IAM and IRSA setup
Additional operational complexity
Increased blast radius during a live migration

Instead, we chose Envoy Gateway, because it:

Is a first-class Gateway API implementation
Does not depend on AWS-specific controllers
Creates and manages its own dataplane
Is vendor-neutral and portable
Allows parallel validation with minimal risk

This decision was intentional, not a workaround: keeping AWS Load Balancer Controller out of the migration avoided IAM, IRSA, and cloud-controller changes that would have widened the blast radius. The goal was to change one edge component at a time.

6. Migration Strategy (Zero Downtime)

A direct replacement was not acceptable.

Chosen strategy

NGINX Ingress LoadBalancer  → continues serving production traffic
Envoy Gateway LoadBalancer → used for validation

Traffic was cut over only after validation succeeded.

The existing Ingress resource was left untouched to prevent configuration drift and unintended side effects during migration.

This ensured:

No user impact
Easy rollback
Controlled blast radius

7. Step-by-Step Implementation

Step 1: Application Deployment (Already in Place)

The application was deployed with:

Kubernetes Deployment
Service of type ClusterIP

No changes were required at the application level.

Step 2: NGINX Ingress (Existing Production Entry)

NGINX Ingress Controller was already installed and exposed the application via an AWS LoadBalancer.

This remained untouched during the migration.

Step 3: Install Gateway API CRDs

The Gateway API CRDs must be installed before any Gateway API controller can reconcile GatewayClass, Gateway, or HTTPRoute resources.

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
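Before moving on, it can help to confirm the CRDs are actually usable and to record which Gateway API version they came from, since a CRD/controller version mismatch is one of the failure scenarios this migration guarded against. The upstream install manifests stamp a bundle-version annotation on each Gateway API CRD. Recent Envoy Gateway Helm charts can also ship their own copy of these CRDs, and Helm skips CRDs that already exist, so the installed version is worth checking explicitly. This is a minimal sketch assuming kubectl already targets the EKS cluster; the function names are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Minimal sketch: verify the Gateway API CRDs installed in Step 3.
# Assumes kubectl is configured for the target EKS cluster.
# Function names are hypothetical helpers for this guide.

GATEWAY_API_CRDS="gatewayclasses.gateway.networking.k8s.io
gateways.gateway.networking.k8s.io
httproutes.gateway.networking.k8s.io"

# Wait for each core CRD to report the Established condition.
wait_for_gateway_crds() {
  for crd in $GATEWAY_API_CRDS; do
    kubectl wait --for=condition=Established --timeout=60s "crd/$crd" || return 1
  done
}

# Print the Gateway API bundle version recorded on the Gateway CRD
# (the standard-install manifest sets this annotation, e.g. v1.0.0).
gateway_api_bundle_version() {
  kubectl get crd gateways.gateway.networking.k8s.io \
    -o jsonpath='{.metadata.annotations.gateway\.networking\.k8s\.io/bundle-version}'
}

# Usage: wait_for_gateway_crds && gateway_api_bundle_version
```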

Step 4: Install Envoy Gateway

Envoy Gateway was installed using Helm via OCI registry.

helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.7.0 \
  -n envoy-gateway-system \
  --create-namespace

The Envoy Gateway version was pinned to v1.7.0 after verifying its compatibility with Gateway API v1.0.0 and the EKS cluster version.
Pinning keeps deployments reproducible and provides a known-good version to roll back to.
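Before creating any Gateway API objects, it is worth waiting for the controller to become Available and confirming that the pinned version actually landed. The wait command mirrors the one in the Envoy Gateway quickstart. The image-tag check is an assumption-based shortcut: it assumes the chart's default controller image tag matches the chart version and that no image override or digest is in use. Function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: confirm the Envoy Gateway controller installed in Step 4
# is running the pinned version. Namespace and version mirror the helm
# command above; the function names are hypothetical.

EG_NAMESPACE="envoy-gateway-system"
EG_EXPECTED_VERSION="v1.7.0"

# Block until the controller Deployment reports Available.
wait_for_envoy_gateway() {
  kubectl wait --timeout=5m -n "$EG_NAMESPACE" \
    deployment/envoy-gateway --for=condition=Available
}

# Compare the controller image tag with the pinned version.
# Assumes the chart's default image tag equals the chart version.
check_envoy_gateway_image() {
  image=$(kubectl get deployment/envoy-gateway -n "$EG_NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  case "$image" in
    *":$EG_EXPECTED_VERSION") echo "ok: $image" ;;
    *) echo "unexpected controller image: $image" >&2; return 1 ;;
  esac
}

# Usage: wait_for_envoy_gateway && check_envoy_gateway_image
```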

Step 5: Create GatewayClass (Platform Ownership)

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

This registered Envoy Gateway, through its controllerName, as the implementation for any Gateway that references the envoy GatewayClass.
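If the controllerName does not match a running controller, the GatewayClass simply stays unaccepted and no Gateway that references it is ever programmed, so acceptance is a useful gate before Step 6. A minimal check, assuming kubectl access; the function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: a GatewayClass is only usable after its controller
# accepts it. Class name defaults to "envoy" from the manifest above;
# the function names are hypothetical.

# Wait up to two minutes for the Accepted condition to become True.
wait_for_gatewayclass() {
  kubectl wait --for=condition=Accepted --timeout=2m "gatewayclass/${1:-envoy}"
}

# Print the raw status of the Accepted condition ("True", "False" or "Unknown").
gatewayclass_accepted_status() {
  kubectl get gatewayclass "${1:-envoy}" \
    -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}'
}
```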

Step 6: Create Gateway (Infrastructure Entry Point)

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: app-gateway
  namespace: default
spec:
  gatewayClassName: envoy
  listeners:
  - name: http
    protocol: HTTP
    port: 80

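Envoy Gateway reconciled this Gateway into an Envoy proxy deployment fronted by a Service of type LoadBalancer, which produced a new AWS LoadBalancer, separate from the existing NGINX Ingress LB.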
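The new load balancer takes a few minutes to provision, and its address is published in the Gateway's status once it is ready. A minimal sketch for waiting on the Programmed condition and reading that address, assuming kubectl access; the function names are hypothetical. On EKS the address is typically the load balancer's DNS hostname, which is what validation traffic should target.

```shell
#!/bin/sh
# Minimal sketch: wait for the Gateway from Step 6 to be Programmed, then
# print the address published in its status (typically the load balancer's
# DNS hostname on EKS). Names default to the manifest above; the function
# names are hypothetical.

wait_for_gateway() {
  kubectl wait --for=condition=Programmed --timeout=10m \
    -n "${2:-default}" "gateway/${1:-app-gateway}"
}

gateway_address() {
  kubectl get gateway "${1:-app-gateway}" -n "${2:-default}" \
    -o jsonpath='{.status.addresses[0].value}'
}

# Usage: wait_for_gateway && GATEWAY_HOST=$(gateway_address)
#        curl -i "http://$GATEWAY_HOST/"
```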

Step 7: Create HTTPRoute (Application Routing)

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app
  namespace: default
spec:
  parentRefs:
  - name: app-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: app
      port: 8088

This reproduced the Ingress routing rule with Gateway API primitives: every request path under / is forwarded to the app Service on port 8088.
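Route attachment problems (a wrong parentRef, a namespace the listener does not allow, or a backend Service that cannot be resolved) show up in the HTTPRoute's status rather than as errors when the manifest is applied. A minimal sketch that reads those conditions, assuming kubectl access; the function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: an HTTPRoute only carries traffic once the parent Gateway
# has accepted it and its backendRefs resolve. Route name and namespace
# default to the manifest above; the function names are hypothetical.

# Print one "<type>=<status>" line per condition reported by each parent.
httproute_conditions() {
  kubectl get httproute "${1:-app}" -n "${2:-default}" \
    -o jsonpath='{range .status.parents[*].conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Succeed only when both Accepted and ResolvedRefs are True.
httproute_healthy() {
  conds=$(httproute_conditions "$@")
  printf '%s\n' "$conds" | grep -qx 'Accepted=True' &&
    printf '%s\n' "$conds" | grep -qx 'ResolvedRefs=True'
}

# Usage: httproute_healthy app default || httproute_conditions app default
```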

8. Validation

At this stage:

NGINX LB → Production users
Gateway LB → Validation traffic

Validation was performed at multiple levels:

Application Layer

Verified HTTP 200 responses using curl
Tested authentication flows
Executed critical user workflows
Confirmed session persistence behavior

Infrastructure Layer

Checked LoadBalancer health check status
Verified readiness and liveness probes
Monitored pod logs for errors or unexpected restarts
Confirmed correct backend service port mapping
Reviewed Envoy Gateway metrics and controller logs for reconciliation errors or route attachment failures

Traffic & Stability

Compared response latency between both entry points
Monitored 4xx and 5xx error rates
Verified no increase in backend CPU or memory usage

Only after all validation checkpoints passed was production cutover approved.
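The side-by-side checks above (same request through both entry points, compare status and latency) can be scripted with curl. This is a minimal sketch, not the exact tooling used here: NGINX_HOST, GATEWAY_HOST, and APP_HOST are placeholder variables for the two load balancer hostnames and the public host name, and the function names are hypothetical. Sending the production Host header matters because the NGINX Ingress rule may be host-based, while the HTTPRoute above matches any host.

```shell
#!/bin/sh
# Minimal sketch: send the same request through both entry points and compare
# status code and latency. NGINX_HOST, GATEWAY_HOST (the two load balancer
# hostnames) and APP_HOST (the public host name) are placeholders to set.
# Function names are hypothetical.

# Print "<http_code> <time_total>" for one request through a load balancer.
probe() {
  # $1 = load balancer hostname, $2 = Host header value, $3 = request path
  curl -s -o /dev/null --max-time 10 -H "Host: $2" \
    -w '%{http_code} %{time_total}\n' "http://$1$3"
}

# Compare one path across both entry points; fail if status codes differ.
compare_entry_points() {
  req_path="${1:-/}"
  nginx=$(probe "$NGINX_HOST" "$APP_HOST" "$req_path")
  gateway=$(probe "$GATEWAY_HOST" "$APP_HOST" "$req_path")
  echo "nginx   $req_path -> $nginx"
  echo "gateway $req_path -> $gateway"
  [ "${nginx%% *}" = "${gateway%% *}" ]
}

# Usage (paths are hypothetical): for p in / /login; do compare_entry_points "$p"; done
```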

9. Cost Considerations During Migration

Running NGINX Ingress and Envoy Gateway in parallel resulted in two active AWS LoadBalancers during the validation window, temporarily increasing infrastructure cost.

However:

The overlap period was intentionally short.
The additional cost was justified to eliminate downtime risk.
The parallel approach reduced blast radius during migration.

Cost was intentionally traded for reliability and controlled risk.

10. Cutover and Cleanup

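After all validation checks passed, production traffic was moved to the Gateway LoadBalancer (for example, by updating the application's DNS record to point at it). Deleting the Ingress resource does not shift traffic on its own: clients keep reaching whichever load balancer DNS resolves to, and NGINX answers with 404 for hosts it no longer has rules for.

Traffic shift was verified by checking active connections on the Gateway LoadBalancer and confirming healthy backend responses.

Only after stable traffic flow through the Gateway LoadBalancer was confirmed was the legacy Ingress removed:

kubectl delete ingress app-ingress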
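When the cutover is done by repointing DNS, resolvers keep returning the old answer until the record's TTL expires, so it is worth checking which load balancer the public name currently resolves to before deleting anything. This is a rough heuristic sketch, assuming dig is installed; APP_HOST and GATEWAY_HOST are placeholder variables and the function names are hypothetical. Load balancer IP sets can change over time, so treat a mismatch as a prompt to investigate rather than proof of a problem.

```shell
#!/bin/sh
# Rough heuristic sketch: after repointing DNS, check whether the public host
# name resolves to the same A records as the Gateway load balancer.
# APP_HOST and GATEWAY_HOST are placeholders; requires dig.
# Function names are hypothetical.

# Resolve a name to its sorted IPv4 addresses, dropping CNAME lines.
resolve_sorted() {
  dig +short A "$1" | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort
}

# Succeed when APP_HOST and GATEWAY_HOST resolve to the same address set.
dns_points_at_gateway() {
  gw=$(resolve_sorted "$GATEWAY_HOST")
  [ -n "$gw" ] && [ "$(resolve_sorted "$APP_HOST")" = "$gw" ]
}
```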

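Rollback plan:

Re-apply the Ingress resource from its saved manifest if needed
Point DNS back at the NGINX LoadBalancer if the cutover involved a DNS change

The migration remained reversible throughout the validation window, and for as long as the NGINX Ingress Controller and its LoadBalancer stayed in place.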
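Re-applying the Ingress assumes a clean copy of its manifest still exists after deletion. A plain kubectl get -o yaml export includes server-set fields such as status, uid, and resourceVersion that can get in the way when re-creating the object, so this minimal sketch uses kubectl apply view-last-applied instead. It assumes the Ingress was created with client-side kubectl apply (which records the last-applied-configuration annotation); the backup file name and function names are hypothetical. If the annotation is missing, the manifest from source control is the safer rollback artifact.

```shell
#!/bin/sh
# Minimal sketch: save a re-appliable copy of the Ingress before running
# "kubectl delete ingress app-ingress", so rollback is a single apply.
# Assumes the Ingress was created with client-side "kubectl apply".
# The backup file name and function names are hypothetical.

backup_ingress() {
  name="${1:-app-ingress}"
  out="${2:-app-ingress.backup.yaml}"
  # view-last-applied prints the manifest as it was applied, without the
  # server-populated status and metadata fields.
  if kubectl apply view-last-applied "ingress/$name" -o yaml > "$out" && [ -s "$out" ]; then
    echo "saved ingress/$name to $out"
  else
    echo "no last-applied configuration for ingress/$name; use the manifest from source control" >&2
    return 1
  fi
}

# Rollback: re-create the Ingress from the saved copy.
restore_ingress() {
  kubectl apply -f "${1:-app-ingress.backup.yaml}"
}
```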

Optionally, after a stability window:

helm uninstall ingress-nginx -n ingress-nginx
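Uninstalling the chart also removes the controller's LoadBalancer Service, so any other Ingress still served by NGINX would lose traffic. A minimal pre-uninstall check, assuming the IngressClass is named nginx (the ingress-nginx chart default); Ingresses that select the controller through the legacy kubernetes.io/ingress.class annotation are not detected by this sketch, and the function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: before "helm uninstall ingress-nginx", make sure no other
# Ingress in the cluster still uses the NGINX controller. Assumes the class
# is named "nginx"; legacy annotation-based Ingresses are not detected.
# Function names are hypothetical.

# List namespace/name of every Ingress whose ingressClassName is nginx.
nginx_ingresses() {
  kubectl get ingress -A \
    -o jsonpath='{range .items[?(@.spec.ingressClassName=="nginx")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

safe_to_uninstall_nginx() {
  left=$(nginx_ingresses)
  if [ -n "$left" ]; then
    echo "Ingresses still using class nginx:" >&2
    printf '%s\n' "$left" >&2
    return 1
  fi
  echo "no Ingress objects reference class nginx"
}

# Usage: safe_to_uninstall_nginx && helm uninstall ingress-nginx -n ingress-nginx
```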

The Gateway API entry point became the sole production path.

11. Final Architecture (After Migration)

User
  ↓
AWS LoadBalancer
  ↓
Envoy Gateway (Gateway API)
  ↓
Application Service
  ↓
Application Pods

12. Key Learnings

A Gateway without an HTTPRoute routes no traffic; infrastructure and routing are intentionally separated
Gateway API enforces clearer ownership boundaries than Ingress
Parallel migration is the safest approach for production workloads
Envoy Gateway is an effective bridge when cloud-native controllers are not yet in place

13. When Would AWS Load Balancer Controller Be Used?

In a later phase, once the platform is stable on the Gateway API.

Typical evolution:

NGINX Ingress
→ Envoy Gateway (Gateway API adoption)
→ AWS Load Balancer Controller (cloud-native optimization)

14. Failure Scenarios Considered

The following risks were evaluated before migration:

Gateway created without HTTPRoute (no traffic routing)
Incorrect backend service port reference
Namespace mismatch between Gateway and HTTPRoute
LoadBalancer health check failures
Controller crash or misconfiguration
Gateway API CRD and controller version mismatch
DNS TTL delays during traffic switch

By running both entry points in parallel, these risks were isolated and mitigated.
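Two of these scenarios can also be caught before any traffic is sent. The HTTPRoute backendRef port must be a port the Service exposes (spec.ports[].port, not the container's targetPort), and when a listener does not set allowedRoutes, Gateway API defaults to admitting only routes from the Gateway's own namespace (Same). A minimal preflight sketch, assuming kubectl access; the function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch of preflight checks for two failure scenarios listed above.
# Names and the port mirror the manifests in Step 6 and Step 7; the function
# names are hypothetical.

# Succeed if Service $1 in namespace $2 exposes port $3.
service_has_port() {
  for p in $(kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.ports[*].port}'); do
    [ "$p" = "$3" ] && return 0
  done
  return 1
}

# Decide whether a listener's allowedRoutes policy admits a route.
# $1 = Gateway namespace, $2 = HTTPRoute namespace, $3 = "from" value.
# An empty value is treated as "Same", the Gateway API default.
route_namespace_allowed() {
  case "$3" in
    All) return 0 ;;
    Same|"") [ "$1" = "$2" ] ;;
    *) return 1 ;;  # Selector needs label checks; treated as unverified here
  esac
}

# Usage: service_has_port app default 8088 && route_namespace_allowed default default Same
```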

15. Final Takeaway

I designed and executed a zero-downtime migration from NGINX Ingress to Gateway API by running both entry points in parallel.

I validated routing behavior, health checks, infrastructure readiness, and traffic stability before shifting production traffic.

This approach reduced blast radius, preserved service availability, and aligned the platform with Kubernetes’ evolving networking model.

Ops Fix Hub
