
Zero-Downtime Migration from NGINX Ingress to Gateway API on Amazon EKS (Production Case Study)


A Zero-Downtime, Step-by-Step Implementation Guide


1. Overview

In this post, we walk through a real production migration of a Kubernetes workload from NGINX Ingress Controller to Kubernetes Gateway API, implemented using Envoy Gateway, on Amazon EKS.

The key objectives were to:

  • Migrate safely with zero downtime

  • Avoid introducing unnecessary cloud-specific complexity

  • Align the platform with Kubernetes’ future networking direction

This guide is written from a platform ownership perspective, not a lab or demo setup.


2. Problem Statement

The application was already running in production and exposed using NGINX Ingress Controller.

While the setup was stable, the following risks were identified:

  • The NGINX Ingress Controller project has shifted toward reduced long-term maintenance, increasing uncertainty around future support guarantees.

  • No long-term guarantees for:

    • Security patches

    • CVE fixes

    • Compatibility with future Kubernetes versions

  • Ingress sits at the cluster edge, making it a high-blast-radius component

Although there was no immediate outage, continuing with an edge component under reduced maintenance posed long-term operational and security risks.


3. Existing Production Architecture (Before Migration)

User
  ↓
AWS LoadBalancer (auto-created by Service)
  ↓
NGINX Ingress Controller
  ↓
Application Service (ClusterIP)
  ↓
Application Pods

Characteristics of the existing setup

  • Stable and functional

  • Easy to operate

  • Tightly coupled to controller-specific annotations

  • Limited separation between platform and application ownership


4. Why Gateway API?

Kubernetes Gateway API is positioned as the successor to Ingress, designed to solve long-standing limitations.

Key improvements over Ingress

Ingress                        | Gateway API
Single resource                | Role-oriented resources
Annotation-driven              | Spec-defined configuration
Weak ownership boundaries      | Clear infra vs app separation
Controller-specific behavior   | Standardized API

Gateway API introduces:

  • GatewayClass – defines platform capability

  • Gateway – infrastructure-level entry point

  • HTTPRoute – application-level routing rules

This model is more scalable, auditable, and production-safe.


5. Why Envoy Gateway in This Case?

The cluster did not have AWS Load Balancer Controller installed.

Installing it mid-migration would have required:

  • IAM and IRSA setup

  • Additional operational complexity

  • Increased blast radius during a live migration

Instead, we chose Envoy Gateway, because it:

  • Is a first-class Gateway API implementation

  • Does not depend on AWS-specific controllers

  • Creates and manages its own dataplane

  • Is vendor-neutral and portable

  • Allows parallel validation with minimal risk

This decision was intentional, not a workaround.

I intentionally avoided introducing AWS Load Balancer Controller during migration to prevent IAM, IRSA, and cloud-controller changes from increasing the migration blast radius. The goal was to change one edge component at a time.


6. Migration Strategy (Zero Downtime)

A direct replacement was not acceptable.

Chosen strategy

NGINX Ingress LoadBalancer  → continues serving production traffic
Envoy Gateway LoadBalancer → used for validation

Traffic was cut over only after validation completed successfully.

The existing Ingress resource was left untouched to prevent configuration drift and unintended side effects during migration.

This ensured:

  • No user impact

  • Easy rollback

  • Controlled blast radius


7. Step-by-Step Implementation

Step 1: Application Deployment (Already in Place)

The application was deployed with:

  • Kubernetes Deployment

  • Service of type ClusterIP

No changes were required at the application level.


Step 2: NGINX Ingress (Existing Production Entry)

NGINX Ingress Controller was already installed and exposed the application via an AWS LoadBalancer.

This remained untouched during the migration.


Step 3: Install Gateway API CRDs

Gateway API resources must exist before any controller can operate.

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
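Before moving on, it is worth confirming the CRDs actually registered. A quick sanity check (not part of the original steps) could look like:

```shell
# List the Gateway API CRDs installed by the standard channel
kubectl get crd | grep gateway.networking.k8s.io

# Wait for the core resources to be Established before installing a controller
kubectl wait --for=condition=Established \
  crd/gatewayclasses.gateway.networking.k8s.io \
  crd/gateways.gateway.networking.k8s.io \
  crd/httproutes.gateway.networking.k8s.io
```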

Step 4: Install Envoy Gateway

Envoy Gateway was installed using Helm via OCI registry.

helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.7.0 \
  -n envoy-gateway-system \
  --create-namespace

The Envoy Gateway version was explicitly pinned to v1.7.0 after verifying compatibility with Gateway API v1.0.0 and the EKS cluster version.
Version pinning ensures deterministic deployments, reproducibility, and safe rollback capability in production environments.
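After the install, the control plane should be confirmed healthy before creating any Gateway resources. Assuming the chart's default deployment name, a check could look like:

```shell
# Wait for the Envoy Gateway control plane to become Available
kubectl wait --timeout=120s -n envoy-gateway-system \
  deployment/envoy-gateway --for=condition=Available

# Inspect the controller pods for crash loops or restarts
kubectl get pods -n envoy-gateway-system
```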


Step 5: Create GatewayClass (Platform Ownership)

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

This explicitly defined Envoy Gateway as the cluster’s Gateway API implementation.
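The controller signals that it has picked up the GatewayClass via its status conditions; the Accepted condition should report True once reconciled:

```shell
# The controller should mark the GatewayClass as Accepted
kubectl get gatewayclass envoy \
  -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}'
```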


Step 6: Create Gateway (Infrastructure Entry Point)

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: app-gateway
  namespace: default
spec:
  gatewayClassName: envoy
  listeners:
  - name: http
    protocol: HTTP
    port: 80

This created a new AWS LoadBalancer, separate from the existing NGINX Ingress LB.
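The hostname of the new LoadBalancer is published on the Gateway's status, which is what the later validation steps point at:

```shell
# Confirm the Gateway reconciled and has an address assigned
kubectl get gateway app-gateway -n default

# Extract the LB hostname for use in validation
kubectl get gateway app-gateway -n default \
  -o jsonpath='{.status.addresses[0].value}'
```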


Step 7: Create HTTPRoute (Application Routing)

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app
  namespace: default
spec:
  parentRefs:
  - name: app-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: app
      port: 8088

This replaced the Ingress routing logic using Gateway API primitives.
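Route attachment is also reflected in status: each parent Gateway reports whether it accepted the route, which is worth checking before any traffic testing:

```shell
# Verify the HTTPRoute was accepted by its parent Gateway
kubectl get httproute app -n default \
  -o jsonpath='{.status.parents[0].conditions[?(@.type=="Accepted")].status}'
```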


8. Validation

At this stage:

NGINX LB → Production users
Gateway LB → Validation traffic

Validation was performed at multiple levels:

Application Layer

  • Verified HTTP 200 responses using curl

  • Tested authentication flows

  • Executed critical user workflows

  • Confirmed session persistence behavior
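The application-layer checks above can be sketched with curl. `GATEWAY_LB` is read from the Gateway's status, and `app.example.com` stands in for the real production host header (both are placeholders for this post):

```shell
GATEWAY_LB=$(kubectl get gateway app-gateway -n default \
  -o jsonpath='{.status.addresses[0].value}')

# Verify an HTTP 200 through the new entry point, sending the
# production Host header without touching DNS
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Host: app.example.com' "http://${GATEWAY_LB}/"
```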

Infrastructure Layer

  • Checked LoadBalancer health check status

  • Verified readiness and liveness probes

  • Monitored pod logs for errors or unexpected restarts

  • Confirmed correct backend service port mapping

  • Reviewed Envoy Gateway metrics and controller logs to ensure no reconciliation errors or route attachment failures were present.

Traffic & Stability

  • Compared response latency between both entry points

  • Monitored 4xx and 5xx error rates

  • Verified no increase in backend CPU or memory usage

Only after all validation checkpoints passed was production cutover approved.

9. Cost Considerations During Migration

Running NGINX Ingress and Envoy Gateway in parallel resulted in two active AWS LoadBalancers during the validation window, temporarily increasing infrastructure cost.

However:

  • The overlap period was intentionally short.

  • The additional cost was justified to eliminate downtime risk.

  • The parallel approach reduced blast radius during migration.

Cost was intentionally traded for reliability and controlled risk.

10. Cutover and Cleanup

After all validation checks passed:

kubectl delete ingress app-ingress

Traffic shift was verified immediately after deletion by validating active connections on the Gateway LoadBalancer and confirming healthy backend responses.

The legacy NGINX Ingress was removed only after confirming stable traffic flow through the Gateway LoadBalancer.

Rollback plan:

  • Re-apply the Ingress resource if needed

  • Restore DNS if traffic switch involved domain update

The migration was reversible during the validation window.
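Had rollback been needed during that window, it would have amounted to re-applying the Ingress manifest exported before deletion (file name illustrative, assuming the resource was saved beforehand):

```shell
# Restore the legacy entry point from the manifest saved before cutover
kubectl apply -f app-ingress.yaml

# Confirm the Ingress is back and serving
kubectl get ingress app-ingress -n default
```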

Optionally, after a stability window:

helm uninstall ingress-nginx -n ingress-nginx

The Gateway API entry point became the sole production path.


11. Final Architecture (After Migration)

User
  ↓
AWS LoadBalancer
  ↓
Envoy Gateway (Gateway API)
  ↓
Application Service
  ↓
Application Pods

12. Key Learnings

  1. Gateway without HTTPRoute does nothing — infrastructure and routing are intentionally separated

  2. Gateway API enforces clearer ownership boundaries than Ingress

  3. Parallel migration is the safest approach for production workloads

  4. Envoy Gateway is an effective bridge when cloud-native controllers are not yet in place


13. When Would AWS Load Balancer Controller Be Used?

In a later phase, once the platform is stable on Gateway API.

Typical evolution:

NGINX Ingress
→ Envoy Gateway (Gateway API adoption)
→ AWS Load Balancer Controller (cloud-native optimization)

14. Failure Scenarios Considered

The following risks were evaluated before migration:

  • Gateway created without HTTPRoute (no traffic routing)

  • Incorrect backend service port reference

  • Namespace mismatch between Gateway and HTTPRoute

  • LoadBalancer health check failures

  • Controller crash or misconfiguration

  • Gateway API CRD and controller version mismatch

  • DNS TTL delays during traffic switch

By running both entry points in parallel, these risks were isolated and mitigated.
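Several of these scenarios can be caught with declarative checks before cutover. A minimal preflight sketch, using the resource names from earlier in this post:

```shell
# Gateway and HTTPRoute must exist in the expected namespace
kubectl get gateway app-gateway -n default >/dev/null || echo "Gateway missing"
kubectl get httproute app -n default >/dev/null || echo "HTTPRoute missing"

# Backend service must expose the port referenced by the route
kubectl get svc app -n default \
  -o jsonpath='{.spec.ports[*].port}' | grep -qw 8088 \
  || echo "Service port 8088 not found"
```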

15. Final Takeaway

I designed and executed a zero-downtime migration from NGINX Ingress to Gateway API by running both entry points in parallel.

I validated routing behavior, health checks, infrastructure readiness, and traffic stability before shifting production traffic.

This approach reduced blast radius, preserved service availability, and aligned the platform with Kubernetes’ evolving networking model.
