<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[DevOps and Cloud Mastery Online - DevOps' World]]></title><description><![CDATA[Unlock DevOps and Cloud excellence with devopsofworld.com - your online resource for mastering modern IT practices.]]></description><link>https://devopsofworld.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1748360762256/5f40942b-8729-4e8b-a35b-c6698260c2a1.png</url><title>DevOps and Cloud Mastery Online - DevOps&apos; World</title><link>https://devopsofworld.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 19 Apr 2026 08:54:12 GMT</lastBuildDate><atom:link href="https://devopsofworld.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cost Optimization with Planned Downtime
Migrating an EBS-Backed StatefulSet from Multi-AZ to Single-AZ in Amazon EKS (Production Pattern)]]></title><description><![CDATA[1. Overview
This article documents a real production-style migration of a Kubernetes StatefulSet backed by Amazon EBS from Multi-AZ to Single-AZ in Amazon EKS.
The migration was executed at the storag]]></description><link>https://devopsofworld.com/cost-optimization-with-planned-downtime-migrating-an-ebs-backed-statefulset-from-multi-az-to-single-az-in-amazon-eks-production-pattern</link><guid isPermaLink="true">https://devopsofworld.com/cost-optimization-with-planned-downtime-migrating-an-ebs-backed-statefulset-from-multi-az-to-single-az-in-amazon-eks-production-pattern</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[statefulsets]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[cloud cost optimization  ]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 31 Mar 2026 03:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f9b973c160b374c81c2b0e/47b5c93d-9fe8-4f09-85d4-bde5fda2363e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. Overview</h3>
<p>This article documents a real production-style migration of a Kubernetes StatefulSet backed by Amazon EBS from Multi-AZ to Single-AZ in Amazon EKS.</p>
<p>The migration was executed at the <strong>storage layer</strong> using EBS snapshots during a controlled maintenance window.</p>
<p>Objectives:</p>
<ul>
<li><p>Reduce cross-AZ replication and inter-AZ data transfer cost</p>
</li>
<li><p>Preserve existing EBS-backed data</p>
</li>
<li><p>Avoid provisioning a parallel Kafka cluster</p>
</li>
<li><p>Maintain deterministic storage recovery</p>
</li>
<li><p>Ensure logical RPO = 0 with verified clean shutdown and completed snapshot.</p>
</li>
</ul>
<p>This was a cost-first architectural decision with acknowledged availability trade-offs.</p>
<hr />
<h3>2. Business Context</h3>
<p>Kafka (with ZooKeeper) was deployed across:</p>
<ul>
<li><p><code>ap-south-1b</code></p>
</li>
<li><p><code>ap-south-1c</code></p>
</li>
</ul>
<p>Multi-AZ improved resilience, but introduced:</p>
<ul>
<li><p>Continuous cross-AZ replication traffic</p>
</li>
<li><p>Inter-AZ data transfer billing</p>
</li>
<li><p>Increased recurring infrastructure cost</p>
</li>
</ul>
<p>Analysis of recurring billing showed that inter-AZ data transfer and cross-AZ replication traffic accounted for a significant percentage of the monthly Kafka infrastructure spend, making cross-AZ traffic a major cost driver.</p>
<p>While Multi-AZ improved resilience, the business determined that the availability gain did not justify the recurring transfer cost for this workload profile.</p>
<p>Business decision:</p>
<blockquote>
<p>Consolidate into single AZ<br />Downtime acceptable<br />No data loss allowed</p>
</blockquote>
<hr />
<h3>3. Technical Constraint — Why This Is Not a Simple Scheduler Change</h3>
<p>The workload runs as a StatefulSet using <code>volumeClaimTemplates</code>.</p>
<p>Storage backend: Amazon EBS via EBS CSI driver.</p>
<p>Important constraints:</p>
<ul>
<li><p>EBS volumes are strictly AZ-scoped</p>
</li>
<li><p>PVs include <code>topology.kubernetes.io/zone</code> nodeAffinity</p>
</li>
<li><p>PVCs are tightly bound to PVs</p>
</li>
<li><p>EBS cannot attach across AZ</p>
</li>
</ul>
<p>If we restrict nodeAffinity without moving storage:</p>
<ul>
<li><p>Pod schedules successfully</p>
</li>
<li><p>Volume attach fails</p>
</li>
<li><p>Pod stuck in <code>ContainerCreating</code></p>
</li>
</ul>
<p>This is fundamentally a storage locality constraint.</p>
<blockquote>
<p>Storage must move before pods move.</p>
</blockquote>
<hr />
<h3>4. Architecture Before Migration</h3>
<img src="https://cdn.hashnode.com/uploads/covers/64f9b973c160b374c81c2b0e/fbe735d4-b52d-454a-8846-d91dc7bd13bc.png" alt="" style="display:block;margin:0 auto" />

<p><code>Multi-AZ StatefulSet with cross-AZ replication traffic between replicas.</code></p>
<h3>Characteristics</h3>
<ul>
<li><p>Replica-0 in <code>ap-south-1b</code></p>
</li>
<li><p>Replica-1 in <code>ap-south-1c</code></p>
</li>
<li><p>Independent EBS volumes per replica</p>
</li>
<li><p>Continuous cross-AZ replication</p>
</li>
<li><p>Higher availability</p>
</li>
<li><p>Higher recurring cost</p>
</li>
</ul>
<hr />
<h3>5. Migration Options Evaluated</h3>
<p><strong>Option 1 — Snapshot-Based Storage Migration (Chosen)</strong></p>
<p>Flow:</p>
<ol>
<li><p>Validate Kafka stability</p>
</li>
<li><p>Scale StatefulSet to 0</p>
</li>
<li><p>Snapshot EBS volumes</p>
</li>
<li><p>Restore volumes in the target AZ</p>
</li>
<li><p>Rebind PVCs via static PVs</p>
</li>
<li><p>Restrict scheduling</p>
</li>
<li><p>Safe staged bring-up</p>
</li>
</ol>
<p>Properties:</p>
<ul>
<li><p>Logical RPO = 0, assuming:</p>
<ul>
<li><p>No under-replicated partitions</p>
</li>
<li><p>Clean shutdown</p>
</li>
<li><p>Snapshot completion verified</p>
</li>
</ul>
</li>
<li><p>Planned downtime required</p>
</li>
<li><p>No duplicate cluster</p>
</li>
<li><p>Lowest infrastructure cost</p>
</li>
<li><p>Requires operational precision</p>
</li>
</ul>
<hr />
<p><strong>Option 2 — Dual Cluster + MirrorMaker2</strong></p>
<p>Flow:</p>
<ol>
<li><p>Deploy new Kafka cluster in single AZ</p>
</li>
<li><p>Configure MirrorMaker2</p>
</li>
<li><p>Replicate topics</p>
</li>
<li><p>Validate offsets</p>
</li>
<li><p>Cut traffic</p>
</li>
<li><p>Decommission old cluster</p>
</li>
</ol>
<p>Properties:</p>
<ul>
<li><p>Near-zero downtime</p>
</li>
<li><p>Higher infrastructure cost</p>
</li>
<li><p>More operational complexity</p>
</li>
<li><p>Easier rollback</p>
</li>
</ul>
<p>Because downtime was acceptable and cost reduction was urgent, <strong>Option 1 was selected</strong>.</p>
<hr />
<h3>6. Multi-AZ Deployment YAML (Reproducible Lab Setup)</h3>
<pre><code class="language-plaintext"># multi-az-zk.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: tools
---
apiVersion: v1
kind: Service
metadata:
  name: zk-staging-headless
  namespace: tools
  labels:
    app: cp-zookeeper
    release: kafka-staging
spec:
  clusterIP: None
  selector:
    app: cp-zookeeper
    release: kafka-staging
  ports:
    - name: client
      port: 2181
      targetPort: 2181
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk-staging
  namespace: tools
spec:
  serviceName: zk-staging-headless
  replicas: 2
  selector:
    matchLabels:
      app: cp-zookeeper
      release: kafka-staging
  template:
    metadata:
      labels:
        app: cp-zookeeper
        release: kafka-staging
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - ap-south-1b
                      - ap-south-1c
      containers:
        - name: cp-zookeeper
          image: docker.io/confluentinc/cp-zookeeper:5.5.6
          volumeMounts:
            - name: datadir
              mountPath: /var/lib/zookeeper/data
            - name: datalogdir
              mountPath: /var/lib/zookeeper/log
  volumeClaimTemplates:
    - metadata:
        name: datadir
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: datalogdir
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3
        resources:
          requests:
            storage: 10Gi
</code></pre>
<p>Apply:</p>
<pre><code class="language-plaintext">kubectl apply -f multi-az-zk.yaml
</code></pre>
<p>Validate zone distribution:</p>
<pre><code class="language-plaintext">kubectl get nodes -L topology.kubernetes.io/zone
kubectl get pods -o wide -n tools
kubectl describe pv &lt;pv-name&gt;
</code></pre>
<hr />
<h3>7. Production Migration — Snapshot-Based Execution</h3>
<hr />
<h3><strong>Step 1 — Validate Kafka Stability</strong></h3>
<p>Before shutdown, ensure:</p>
<ul>
<li><p>No under-replicated partitions</p>
</li>
<li><p>No leader elections are ongoing</p>
</li>
<li><p>ISR stable</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="language-plaintext">kubectl logs &lt;pod-name&gt; -n tools
</code></pre>
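<p>Under-replicated partitions can also be checked directly with the Kafka CLI; a minimal sketch, assuming a broker pod named <code>kafka-0</code> with the standard tooling on board:</p>
<pre><code class="language-plaintext">kubectl exec kafka-0 -n tools -- \
  kafka-topics --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
# Empty output = no under-replicated partitions
</code></pre>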
<hr />
<h3><strong>Step 2 — Protect Reclaim Policy</strong></h3>
<p>Before deleting PVCs:</p>
<pre><code class="language-plaintext">kubectl get pv
</code></pre>
<p>Ensure <code>persistentVolumeReclaimPolicy: Retain</code>.</p>
<p>If not:</p>
<pre><code class="language-plaintext">kubectl patch pv &lt;pv-name&gt; \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
</code></pre>
<p>This prevents accidental EBS deletion.</p>
<hr />
<h3><strong>Step 3 — Scale Down</strong></h3>
<pre><code class="language-plaintext">kubectl scale statefulset zk-staging -n tools --replicas=0
kubectl get pods -n tools
</code></pre>
<hr />
<h3>Pre-Snapshot Validation — StatefulSet Ordinal Mapping (Critical)</h3>
<p>Before taking snapshots, validate which EBS volume belongs to which StatefulSet ordinal.</p>
<p>StatefulSet PVC naming convention:</p>
<pre><code class="language-plaintext">&lt;claim-name&gt;-&lt;statefulset-name&gt;-&lt;ordinal&gt;
</code></pre>
<p>Example:</p>
<pre><code class="language-plaintext">datadir-zk-staging-0
datadir-zk-staging-1
</code></pre>
<p>Validate volume mapping:</p>
<pre><code class="language-plaintext">kubectl get pvc -n tools
kubectl describe pvc datadir-zk-staging-1 -n tools
</code></pre>
<p>Extract:</p>
<pre><code class="language-plaintext">Volume: pvc-xxxx
</code></pre>
<p>Then map to AWS volume:</p>
<pre><code class="language-plaintext">aws ec2 describe-volumes \
  --filters Name=tag:KubernetesCluster,Values=&lt;cluster-name&gt;
</code></pre>
<p>Or inspect via:</p>
<pre><code class="language-plaintext">kubectl describe pv &lt;pv-name&gt;
</code></pre>
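<p>The PVC-to-volume mapping can also be scripted; a sketch using <code>jsonpath</code> (the PVC name matches the example above):</p>
<pre><code class="language-plaintext"># Resolve the bound PV, then read the EBS volume ID from the CSI volume handle
PV=$(kubectl get pvc datadir-zk-staging-1 -n tools -o jsonpath='{.spec.volumeName}')
kubectl get pv "$PV" -o jsonpath='{.spec.csi.volumeHandle}'
</code></pre>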
<p>Confirm:</p>
<ul>
<li><p>Correct ordinal</p>
</li>
<li><p>Correct AZ</p>
</li>
<li><p>Correct volume ID</p>
</li>
</ul>
<p>If you restore the wrong ordinal volume to the wrong replica, data corruption or cluster quorum failure can occur.</p>
<p>Never assume volume ordering.</p>
<p>Validate explicitly.</p>
<h3>Step 4 — Snapshot Volumes</h3>
<pre><code class="language-plaintext">aws ec2 create-snapshot \
  --volume-id vol-xxxx \
  --description "zk-migration"
</code></pre>
<p>Wait until the snapshot state = completed before proceeding.</p>
<p>Do not scale up or delete any additional resources until snapshot completion is verified via AWS CLI or console.</p>
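<p>The AWS CLI ships a built-in waiter for exactly this check; a minimal sketch (the snapshot ID is a placeholder):</p>
<pre><code class="language-plaintext"># Blocks until the snapshot reaches the "completed" state
aws ec2 wait snapshot-completed --snapshot-ids snap-xxxx
aws ec2 describe-snapshots --snapshot-ids snap-xxxx \
  --query 'Snapshots[0].State' --output text
</code></pre>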
<hr />
<h3><strong>Step 5 — Restore With Matching Performance</strong></h3>
<p>Check original performance:</p>
<pre><code class="language-plaintext">aws ec2 describe-volumes --volume-ids &lt;volume-id&gt;
</code></pre>
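<p>To read just the parameters that must be preserved, a sketch using a JMESPath query:</p>
<pre><code class="language-plaintext">aws ec2 describe-volumes --volume-ids &lt;volume-id&gt; \
  --query 'Volumes[0].[VolumeType,Iops,Throughput]' --output text
</code></pre>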
<p>Restore:</p>
<pre><code class="language-plaintext">aws ec2 create-volume \
  --snapshot-id snap-xxxx \
  --availability-zone ap-south-1b \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125
</code></pre>
<p>Do not proceed until the restored volume state is <code>available</code>. Note that volumes restored from snapshots are lazily initialized (blocks are loaded from S3 on first read), so enable Fast Snapshot Restore or pre-warm the volume if first-read latency matters.</p>
<p>The restored volume must match the original volume type, IOPS, and throughput.</p>
<p>If performance parameters are reduced during restore:</p>
<ul>
<li><p>Kafka disk flush latency may increase</p>
</li>
<li><p>Log segment recovery may slow</p>
</li>
<li><p>Consumer lag may spike</p>
</li>
<li><p>ZooKeeper session instability may occur</p>
</li>
</ul>
<p>Storage migration must preserve performance characteristics, not just data.</p>
<hr />
<h3>Step 6 — Delete Replica PVCs</h3>
<pre><code class="language-plaintext">kubectl delete pvc datadir-zk-staging-1 -n tools
kubectl delete pvc datalogdir-zk-staging-1 -n tools
</code></pre>
<p>Because the reclaim policy is set to <code>Retain</code>, the underlying EBS volumes will not be deleted.</p>
<h3>Step 7 — Static PV Restore YAML</h3>
<pre><code class="language-plaintext"># restore-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-restore-datadir-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-RESTORED-DATADIR
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - ap-south-1b
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-restore-datalogdir-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-RESTORED-DATALOGDIR
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - ap-south-1b
</code></pre>
<p>Apply:</p>
<pre><code class="language-plaintext">kubectl apply -f restore-pv.yaml
kubectl get pv
</code></pre>
<hr />
<h3>Step 8 — Restore PVC YAML</h3>
<pre><code class="language-plaintext"># restore-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-zk-staging-1
  namespace: tools
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi
  volumeName: pv-restore-datadir-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datalogdir-zk-staging-1
  namespace: tools
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi
  volumeName: pv-restore-datalogdir-1
</code></pre>
<p>Apply and verify:</p>
<pre><code class="language-plaintext">kubectl apply -f restore-pvc.yaml
kubectl get pv,pvc -n tools
</code></pre>
<hr />
<h3>Step 9 — Restrict Scheduling to Single AZ</h3>
<p>Update nodeAffinity to:</p>
<pre><code class="language-plaintext">ap-south-1b
</code></pre>
<p>Apply updated StatefulSet.</p>
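<p>One way to apply this without editing the full manifest is a JSON patch against the pod template; a sketch (the path assumes the affinity structure from the lab YAML above):</p>
<pre><code class="language-plaintext">kubectl patch statefulset zk-staging -n tools --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/0/values",
   "value": ["ap-south-1b"]}
]'
</code></pre>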
<hr />
<h3>Step 10 — Safe Staged Bring-Up</h3>
<pre><code class="language-plaintext">kubectl scale statefulset zk-staging -n tools --replicas=1
</code></pre>
<p>Validate stability.</p>
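<p>For ZooKeeper, a quick liveness probe uses the four-letter command interface; a sketch, assuming <code>ruok</code> is whitelisted and <code>nc</code> exists in the image:</p>
<pre><code class="language-plaintext">kubectl exec zk-staging-0 -n tools -- \
  bash -c "echo ruok | nc localhost 2181"
# Expected response: imok
</code></pre>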
<p>Then:</p>
<pre><code class="language-plaintext">kubectl scale statefulset zk-staging -n tools --replicas=2
</code></pre>
<p>Migration complete.</p>
<hr />
<h3><strong>8. Architecture After Migration</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/64f9b973c160b374c81c2b0e/6ca4bfae-351b-433c-ab8c-5f701f0bd60b.png" alt="" style="display:block;margin:0 auto" />

<p><code>Inter-AZ data transfer cost is eliminated because both replicas now operate within ap-south-1b, but AZ-level redundancy is removed.</code></p>
<h3>Capacity Validation Before Consolidation</h3>
<p>Before consolidating both replicas into a single Availability Zone, validate:</p>
<ul>
<li><p>Worker node CPU headroom</p>
</li>
<li><p>Available memory capacity</p>
</li>
<li><p>EBS volume attachment limits per node</p>
</li>
<li><p>Network bandwidth availability</p>
</li>
</ul>
<p>Note that a failure of <code>ap-south-1b</code> will now result in a full service outage until recovery.</p>
<p>Single-AZ consolidation increases resource contention risk and expands the blast radius.</p>
<p>Cost optimization must not introduce saturation instability.</p>
<h3><strong>Trade-Off</strong></h3>
<p>Single AZ failure = Full outage.</p>
<p>Availability ↓</p>
<p>Cost ↓</p>
<p>Intentional architectural decision.</p>
<hr />
<h3><strong>Rollback Strategy</strong></h3>
<p>If issues occur:</p>
<ol>
<li><p>Scale down</p>
</li>
<li><p>Restore original snapshots in their original Availability Zones.</p>
</li>
<li><p>Recreate original PV bindings</p>
</li>
<li><p>Revert nodeAffinity</p>
</li>
<li><p>Scale up gradually</p>
</li>
</ol>
<hr />
<h3><strong>Final Takeaway</strong></h3>
<p>This migration was not a scheduler tweak.</p>
<p>It was a storage topology redesign.</p>
<blockquote>
<p>Stateful workloads are constrained by storage locality.</p>
<p>Storage must move before pods move.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[From Provisioning to Control Plane: Designing a Hybrid Terraform + Crossplane Architecture at Scale]]></title><description><![CDATA[1. Overview
What I Designed
I designed a hybrid infrastructure architecture:

Terraform → Foundation Layer

Crossplane → Dynamic Lifecycle Layer

ArgoCD → GitOps Enforcement


This created a continuou]]></description><link>https://devopsofworld.com/terraform-crossplane-hybrid-control-plane-architecture</link><guid isPermaLink="true">https://devopsofworld.com/terraform-crossplane-hybrid-control-plane-architecture</guid><category><![CDATA[Devops]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[crossplane]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Fri, 20 Mar 2026 03:34:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/64f9b973c160b374c81c2b0e/0570919e-1985-4854-b98c-3c83907ac875.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Overview</h2>
<h3>What I Designed</h3>
<p>I designed a <strong>hybrid infrastructure architecture</strong>:</p>
<ul>
<li><p><strong>Terraform → Foundation Layer</strong></p>
</li>
<li><p><strong>Crossplane → Dynamic Lifecycle Layer</strong></p>
</li>
<li><p><strong>ArgoCD → GitOps Enforcement</strong></p>
</li>
</ul>
<p>This created a continuously reconciling cloud control plane inside Kubernetes.</p>
<hr />
<h3>Why It Was Required</h3>
<p>Our platform had grown past:</p>
<ul>
<li><p>50+ microservices</p>
</li>
<li><p>Multiple engineering teams</p>
</li>
<li><p>Multi-region expansion</p>
</li>
<li><p>PR-driven infrastructure workflows</p>
</li>
<li><p>Feature branch–based short-lived environments</p>
</li>
</ul>
<p>Terraform workflows became operationally slow due to:</p>
<ul>
<li><p>State contention</p>
</li>
<li><p>Long plan times</p>
</li>
<li><p>PR bottlenecks</p>
</li>
<li><p>Manual drift detection</p>
</li>
</ul>
<p>Provisioning was working.<br />Lifecycle control was missing.</p>
<hr />
<h3>Constraint</h3>
<ul>
<li><p>No full rewrite</p>
</li>
<li><p>No infrastructure instability</p>
</li>
<li><p>Zero data loss</p>
</li>
<li><p>Minimal migration risk</p>
</li>
</ul>
<hr />
<h3>Engineering Principle Followed</h3>
<p><strong>Separate provisioning from lifecycle reconciliation.</strong></p>
<p>Provision once.<br />Reconcile continuously.</p>
<hr />
<h2>2. Problem Statement</h2>
<h3>Existing Architecture</h3>
<p>Terraform model:</p>
<p>Plan → Apply → Exit</p>
<p>After apply:</p>
<ul>
<li><p>No continuous reconciliation</p>
</li>
<li><p>Drift detection only on next plan</p>
</li>
<li><p>Manual console changes remain undetected</p>
</li>
</ul>
<hr />
<h3>Failure Risk</h3>
<p>If an engineer modified:</p>
<ul>
<li><p>RDS storage encryption</p>
</li>
<li><p>Deletion protection</p>
</li>
<li><p>Security groups</p>
</li>
<li><p>IAM policies</p>
</li>
</ul>
<p>Terraform would not react until the next plan/apply cycle.</p>
<p>Drift became silent operational risk.</p>
<hr />
<h3>What Would Break</h3>
<ul>
<li><p>Compliance posture</p>
</li>
<li><p>Backup guarantees</p>
</li>
<li><p>Encryption enforcement</p>
</li>
<li><p>Network boundaries</p>
</li>
<li><p>Incident recovery confidence</p>
</li>
</ul>
<hr />
<h3>Why It Was Unacceptable</h3>
<p>At scale:</p>
<p>Manual governance does not work.</p>
<p>Infrastructure must enforce its declared state.</p>
<hr />
<h2>3. Architecture After Implementation</h2>
<p>Control plane flow:</p>
<p>Developer commits YAML<br />↓<br />ArgoCD syncs to cluster<br />↓<br />Kubernetes API stores desired state<br />↓<br />Crossplane controller watches resource<br />↓<br />Crossplane calls AWS API<br />↓<br />Cloud resource created/updated<br />↑<br />Continuous reconciliation loop</p>
<p>Terraform foundation layer:</p>
<p>Terraform<br />↓<br />VPC<br />Subnets<br />EKS Control Plane<br />Core Networking</p>
<p>Clear separation of responsibilities.</p>
<hr />
<h2>4. Design Decisions</h2>
<hr />
<h3>4.1 Core Component Choice</h3>
<p><strong>Terraform for foundation</strong></p>
<p>Why I chose it:</p>
<ul>
<li><p>Mature state handling</p>
</li>
<li><p>Strong bootstrap ecosystem</p>
</li>
<li><p>Clear isolation of foundational infrastructure</p>
</li>
</ul>
<p>Multi-account governance was enforced via separate state isolation and account-factory patterns; Terraform itself does not natively provide org-level governance.</p>
<p>Trade-off:</p>
<ul>
<li>No continuous reconciliation</li>
</ul>
<p>Risk accepted:</p>
<ul>
<li>Foundation changes are rare and tightly controlled</li>
</ul>
<hr />
<p><strong>Crossplane for dynamic infrastructure</strong></p>
<p>Why I chose it:</p>
<ul>
<li><p>Kubernetes-native control loop</p>
</li>
<li><p>GitOps-friendly</p>
</li>
<li><p>CRD-based lifecycle management</p>
</li>
</ul>
<p>Trade-off:</p>
<ul>
<li><p>Adds API server load</p>
</li>
<li><p>Adds controller complexity</p>
</li>
</ul>
<p>Risk accepted:</p>
<ul>
<li>Infrastructure lifecycle now depends on cluster health</li>
</ul>
<hr />
<h3>4.2 Failure Detection</h3>
<p>Reconciliation loop ensures:</p>
<p>Actual State == Desired State</p>
<p>Manual console change → Crossplane reconciles.</p>
<p>Trade-off:</p>
<ul>
<li><p>AWS API throttling possible</p>
</li>
<li><p>Eventual consistency delays</p>
</li>
</ul>
<p>Risk accepted:</p>
<ul>
<li>Tuned provider/controller concurrency and AWS API backoff settings to mitigate throttling</li>
</ul>
<hr />
<h3>4.3 Event Routing</h3>
<p>Git Commit<br />→ ArgoCD<br />→ Kubernetes API<br />→ Crossplane Controller<br />→ AWS API</p>
<p>Rollback = git revert.</p>
<p>Trade-off:</p>
<ul>
<li>Git becomes critical dependency</li>
</ul>
<p>Risk accepted:</p>
<ul>
<li>Strong repository governance and PR controls</li>
</ul>
<hr />
<h3>4.4 Automation Logic</h3>
<p>Platform team defined Compositions.</p>
<p>Example:</p>
<pre><code class="language-yaml">apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgres
spec:
  compositeTypeRef:
    apiVersion: platform.io/v1alpha1
    kind: XPostgres
  resources:
    - name: database
      base:
        apiVersion: database.aws.crossplane.io/v1beta1
        kind: RDSInstance
        spec:
          deletionPolicy: Orphan
          forProvider:
            storageEncrypted: true
            deletionProtection: true
</code></pre>
<p>Application team used Claim:</p>
<pre><code class="language-yaml">apiVersion: platform.io/v1alpha1
kind: PostgresClaim
metadata:
  name: app-db
spec:
  parameters:
    storage: 20
</code></pre>
<p>Why I chose this:</p>
<ul>
<li><p>Central policy enforcement</p>
</li>
<li><p>Developer abstraction</p>
</li>
<li><p>Clear ownership boundary</p>
</li>
</ul>
<p>Trade-off:</p>
<ul>
<li><p>Composition update blast radius</p>
</li>
<li><p>Requires versioning discipline</p>
</li>
</ul>
<p>Risk accepted:</p>
<ul>
<li>Versioned compositions per environment</li>
</ul>
<hr />
<h3>4.5 DNS / Networking</h3>
<p>DNS and core networking remained Terraform-managed.</p>
<p>Reason:</p>
<ul>
<li><p>High blast radius</p>
</li>
<li><p>Low change frequency</p>
</li>
<li><p>Complex dependency graph</p>
</li>
</ul>
<p>Control plane expansion was phased deliberately.</p>
<hr />
<h2>5. Implementation Snippet</h2>
<p>Install Crossplane:</p>
<pre><code class="language-bash">helm repo add crossplane-stable https://charts.crossplane.io/stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace
</code></pre>
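<p>A quick post-install health check (a sketch; pod names vary by chart version):</p>
<pre><code class="language-bash"># Core controller and RBAC manager pods should be Running
kubectl get pods -n crossplane-system
kubectl api-resources | grep crossplane
</code></pre>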
<p>Install AWS Provider:</p>
<pre><code class="language-yaml">apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2
</code></pre>
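<p>Provider health should be confirmed before creating a ProviderConfig; a minimal sketch:</p>
<pre><code class="language-bash"># INSTALLED and HEALTHY must both be True before proceeding
kubectl get providers.pkg.crossplane.io
</code></pre>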
<p>Configure ProviderConfig (IRSA recommended):</p>
<pre><code class="language-yaml">apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws
spec:
  credentials:
    source: IRSA
</code></pre>
<p>ProviderConfig was configured using IRSA to avoid static credentials.</p>
<p>RDS Example:</p>
<pre><code class="language-yaml">apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: platform-db
spec:
  deletionPolicy: Orphan
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.micro
    allocatedStorage: 20
    engine: postgres
    storageEncrypted: true
    deletionProtection: true
  providerConfigRef:
    name: aws
</code></pre>
<p>Assumes a default VPC and subnet group already exist; in hardened environments, explicit <code>subnetGroupName</code> and <code>securityGroupIds</code> must be specified.</p>
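<p>Reconciliation state is visible on the managed resource itself; a minimal sketch for observing it:</p>
<pre><code class="language-bash"># SYNCED/READY columns reflect the controller's reconciliation loop
kubectl get rdsinstances.database.aws.crossplane.io platform-db
kubectl get events --field-selector involvedObject.name=platform-db
</code></pre>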
<hr />
<h2>6. Traffic / DNS Consideration</h2>
<p>Database endpoints remained AWS-managed.</p>
<p>No DNS switching automated through Crossplane.</p>
<p>Reason:</p>
<ul>
<li><p>Database endpoints are stable</p>
</li>
<li><p>DNS manipulation has high blast radius</p>
</li>
<li><p>Networking remained Terraform-owned</p>
</li>
</ul>
<hr />
<h2>7. Validation Process</h2>
<h3>Step 1 – Manual Drift Simulation</h3>
<p>Modified RDS parameter in AWS Console.</p>
<p>Observed:</p>
<ul>
<li><p>Crossplane detected change</p>
</li>
<li><p>Reconciliation restored desired state</p>
</li>
</ul>
<hr />
<h3>Step 2 – Deletion Policy Test</h3>
<p>Deleted CR with:</p>
<p><code>deletionPolicy: Orphan</code></p>
<p>Observed:</p>
<ul>
<li><p>Cloud resource retained</p>
</li>
<li><p>CR removed</p>
</li>
</ul>
<hr />
<h3>Step 3 – Delete Mode Test</h3>
<p>Changed to:</p>
<p><code>deletionPolicy: Delete</code></p>
<p>Deleted CR.</p>
<p>Observed:</p>
<ul>
<li>Cloud resource removed</li>
</ul>
<p>Lifecycle behavior verified.</p>
<hr />
<h3>Step 4 – API Throttling Simulation</h3>
<p>Created multiple resources in parallel.</p>
<p>Observed:</p>
<ul>
<li><p>AWS API throttling errors</p>
</li>
<li><p>Provider retry with exponential backoff</p>
</li>
</ul>
<p>This validated the need for concurrency and backoff tuning.</p>
<hr />
<h2>8. Cost &amp; Trade-offs</h2>
<h3>Infrastructure Cost</h3>
<ul>
<li><p>Additional Crossplane controller pods</p>
</li>
<li><p>Increased etcd object count</p>
</li>
<li><p>Higher AWS API call volume</p>
</li>
</ul>
<p>Cost impact: Moderate.</p>
<hr />
<h3>Operational Complexity</h3>
<p>Increased:</p>
<ul>
<li><p>Controller debugging</p>
</li>
<li><p>Composition versioning</p>
</li>
<li><p>CRD lifecycle management</p>
</li>
</ul>
<p>Reduced:</p>
<ul>
<li><p>Manual drift remediation</p>
</li>
<li><p>Terraform PR bottlenecks</p>
</li>
<li><p>Apply-time surprises</p>
</li>
</ul>
<hr />
<h3>RTO</h3>
<p>Improved.</p>
<p>Drift auto-corrected without manual intervention.</p>
<hr />
<h3>RPO</h3>
<p>No direct change.</p>
<p>Depends on AWS-native backup policies.</p>
<hr />
<h3>Scaling Impact</h3>
<p>Pros:</p>
<ul>
<li><p>Safe self-service for app teams</p>
</li>
<li><p>Git-auditable infrastructure</p>
</li>
<li><p>Continuous compliance enforcement</p>
</li>
</ul>
<p>Cons:</p>
<ul>
<li><p>API server load increases</p>
</li>
<li><p>AWS rate limit sensitivity</p>
</li>
<li><p>Composition update blast radius</p>
</li>
</ul>
<hr />
<h2>9. When This Design Makes Sense</h2>
<p>✔️ 50+ services<br />✔️ Dedicated platform team<br />✔️ GitOps maturity<br />✔️ Kubernetes-native organization<br />✔️ High infrastructure churn</p>
<hr />
<h3>When NOT to Use It</h3>
<ul>
<li><p>Small teams</p>
</li>
<li><p>Low churn infrastructure</p>
</li>
<li><p>No Kubernetes maturity</p>
</li>
<li><p>Multi-account bootstrap phase</p>
</li>
<li><p>Extremely complex networking requirements</p>
</li>
</ul>
<hr />
<h2>10. Final Takeaway</h2>
<p>Terraform builds infrastructure.</p>
<p>Crossplane manages the infrastructure lifecycle.</p>
<p>GitOps enforces declared intent.</p>
<p>This was not a tool replacement exercise.</p>
<p>It was an architectural shift from:</p>
<p>Provisioning mindset → Control plane mindset</p>
<p>At scale, lifecycle enforcement matters more than provisioning speed.</p>
<p>Hybrid architecture made lifecycle enforcement operationally viable at scale.</p>
]]></content:encoded></item><item><title><![CDATA[Cross-Cloud VM Migration: GCP → AWS Using AWS Application Migration Service (MGN)]]></title><description><![CDATA[Cross-cloud VM migration is not a disk copy task.
It is:

An access model transformation

A replication lifecycle management exercise

A downtime control operation

A cost boundary decision


We execu]]></description><link>https://devopsofworld.com/gcp-to-aws-vm-migration-using-aws-mgn</link><guid isPermaLink="true">https://devopsofworld.com/gcp-to-aws-vm-migration-using-aws-mgn</guid><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Migration]]></category><category><![CDATA[aws-mgn]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 17 Mar 2026 03:15:00 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/64f9b973c160b374c81c2b0e/f865652a-db15-41b4-8202-16fb66161a5d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cross-cloud VM migration is not a disk copy task.</p>
<p>It is:</p>
<ul>
<li><p>An access model transformation</p>
</li>
<li><p>A replication lifecycle management exercise</p>
</li>
<li><p>A downtime control operation</p>
</li>
<li><p>A cost boundary decision</p>
</li>
</ul>
<p>We executed a production-grade migration from Google Compute Engine to Amazon EC2 using AWS Application Migration Service (MGN) under strict operational constraints:</p>
<ul>
<li><p>Target RPO ≈ 0 at cutover (achieved through application write freeze, replication validation, and successful final synchronization)</p>
</li>
<li><p>Downtime &lt; 30 minutes</p>
</li>
<li><p>No SSH/RDP lockout</p>
</li>
<li><p>Controlled replication cost</p>
</li>
<li><p>Preserved rollback capability</p>
</li>
</ul>
<p>This article documents a technically precise, enterprise-review-safe execution model.</p>
<hr />
<h2>1. Why We Chose AWS MGN</h2>
<p>We evaluated:</p>
<ul>
<li><p>Manual snapshot export/import</p>
</li>
<li><p>Image-based migration</p>
</li>
<li><p>Backup/restore workflows</p>
</li>
<li><p>Application rebuild</p>
</li>
</ul>
<p>We selected AWS MGN because:</p>
<ul>
<li><p>Continuous block-level replication minimizes downtime</p>
</li>
<li><p>Test migrations do not impact production</p>
</li>
<li><p>Cutover is controlled and repeatable</p>
</li>
<li><p>No application redesign required (pure rehost model)</p>
</li>
</ul>
<p>This was lift-and-shift, not modernization.</p>
<hr />
<h2>2. What AWS MGN Actually Does</h2>
<p>AWS MGN is an agent-based block-level replication service.</p>
<p>It:</p>
<ul>
<li><p>Installs a replication agent on the source VM</p>
</li>
<li><p>Reads disk blocks at the operating system level</p>
</li>
<li><p>Continuously replicates encrypted data to AWS</p>
</li>
<li><p>Uses staging replication servers and EBS volumes</p>
</li>
<li><p>Launches test and cutover EC2 instances</p>
</li>
</ul>
<h3>Critical Clarification</h3>
<p>MGN replicates disk state.</p>
<p>It migrates:</p>
<ul>
<li><p>Local OS users stored on disk</p>
</li>
<li><p>SSH configuration files stored on disk</p>
</li>
<li><p>Application binaries</p>
</li>
<li><p>System configuration</p>
</li>
</ul>
<p>It does not migrate:</p>
<ul>
<li><p>GCP OS Login (IAM-based access)</p>
</li>
<li><p>Metadata-injected SSH keys</p>
</li>
<li><p>DNS records</p>
</li>
<li><p>Load balancers</p>
</li>
<li><p>Managed services (Cloud SQL, etc.)</p>
</li>
<li><p>IAM identities</p>
</li>
</ul>
<p>Any identity mechanism tied to GCP metadata or IAM must be replaced with persistent OS-level users or SSM-based access prior to cutover.</p>
<hr />
<h2>3. Migration Architecture</h2>
<pre><code class="language-plaintext">GCP VM
↓
MGN Replication Agent
↓
Encrypted TLS Transfer
↓
AWS Staging Replication Server
↓
EBS Replication Volumes
↓
Launch Template
↓
Test EC2 Instance
↓
Cutover EC2 Instance
</code></pre>
<p>Replication continues until cutover.</p>
<ul>
<li><p>Data encrypted in transit (TLS)</p>
</li>
<li><p>EBS encryption configurable at rest</p>
</li>
<li><p>Replication throughput depends on bandwidth, write churn, and staging configuration</p>
</li>
</ul>
<hr />
<h2>4. MGN Lifecycle States (Operational Meaning)</h2>
<h3>Not Ready</h3>
<p>Initial full replication in progress.</p>
<h3>Ready for Testing</h3>
<p>Initial synchronization complete. Continuous delta replication active.</p>
<h3>Test in Progress</h3>
<p>Test EC2 launched. Source VM continues running.</p>
<h3>Ready for Cutover</h3>
<p>Replication functioning correctly with minimal lag.</p>
<h3>Production Cutover Gate</h3>
<p>Proceed only if:</p>
<ul>
<li><p>Migration lifecycle = Ready for cutover</p>
</li>
<li><p>Replication status = Healthy</p>
</li>
<li><p>Replication lag is negligible or zero</p>
</li>
</ul>
<p>Never execute cutover if replication is not Healthy.</p>
<h3>Cutover in Progress</h3>
<p>Final synchronization executing and EC2 launching.</p>
<h3>Cutover Complete</h3>
<p>AWS EC2 becomes the active production workload.</p>
<hr />
<h2>5. Primary Risk: Access Model Mismatch</h2>
<p>Default GCP access model:</p>
<ul>
<li><p>OS Login (IAM-based)</p>
</li>
<li><p>Metadata-injected SSH</p>
</li>
<li><p>Console SSH</p>
</li>
</ul>
<p>Default AWS access model:</p>
<ul>
<li><p>OS-level local users</p>
</li>
<li><p>SSH key pairs</p>
</li>
<li><p>SSM Session Manager</p>
</li>
</ul>
<p>These models are incompatible.</p>
<p>If a VM relies exclusively on GCP OS Login or metadata-based SSH, administrative access will be lost after migration.</p>
<hr />
<h2>6. Mandatory Pre-Migration Access Preparation</h2>
<h3>Linux</h3>
<p>Create a persistent administrative user:</p>
<pre><code class="language-plaintext"># Create the user with a home directory
sudo useradd -m awsadmin
sudo mkdir -p /home/awsadmin/.ssh
sudo chmod 700 /home/awsadmin/.ssh
# Add the public key non-interactively (key material is a placeholder)
echo "ssh-ed25519 AAAA..." | sudo tee /home/awsadmin/.ssh/authorized_keys
sudo chmod 600 /home/awsadmin/.ssh/authorized_keys
sudo chown -R awsadmin:awsadmin /home/awsadmin
sudo usermod -aG sudo awsadmin   # use "wheel" on RHEL/Amazon Linux
</code></pre>
<p>Validate SSH access before migration.</p>
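<p>A minimal validation sketch from a workstation holding the matching private key (host and key path are placeholders):</p>
<pre><code class="language-plaintext">ssh -i ~/.ssh/awsadmin awsadmin@&lt;source-vm-ip&gt; "id"
# Confirm login succeeds and the sudo group appears in the output
</code></pre>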
<p>Attach an IAM role (via instance profile) that includes the managed policy:</p>
<pre><code class="language-plaintext">AmazonSSMManagedInstanceCore
</code></pre>
<p>This ensures emergency fallback access via Session Manager.</p>
<hr />
<h3>Windows</h3>
<ul>
<li><p>Enable local Administrator</p>
</li>
<li><p>Enable RDP</p>
</li>
<li><p>Allow TCP 3389</p>
</li>
<li><p>Validate login before migration</p>
</li>
</ul>
<hr />
<h2>7. Network Requirements</h2>
<h3>On GCP VM (Outbound Required)</h3>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Port</p></th><th><p>Protocol</p></th><th><p>Purpose</p></th></tr><tr><td><p>443</p></td><td><p>TCP</p></td><td><p>AWS control plane communication</p></td></tr><tr><td><p>1500</p></td><td><p>TCP</p></td><td><p>Replication data channel</p></td></tr><tr><td><p>53</p></td><td><p>TCP/UDP</p></td><td><p>DNS (system prerequisite)</p></td></tr><tr><td><p>123</p></td><td><p>UDP</p></td><td><p>NTP (system prerequisite)</p></td></tr></tbody></table>

<p>No inbound firewall rules are required on the source VM for MGN.</p>
<h3>On AWS</h3>
<p>If staging and target instances reside in private subnets, outbound connectivity must be provided via:</p>
<ul>
<li><p>NAT Gateway</p>
</li>
<li><p>VPC Endpoints</p>
</li>
<li><p>Direct Connect</p>
</li>
<li><p>VPN</p>
</li>
</ul>
<p>A NAT Gateway is required only when no other outbound path (public internet routing or VPC endpoints) is configured.</p>
<hr />
<h2>8. Replication Consistency Model</h2>
<p>AWS MGN provides crash-consistent replication.</p>
<p>To achieve near-zero RPO:</p>
<ol>
<li><p>Freeze application writes</p>
</li>
<li><p>Confirm replication status = Healthy</p>
</li>
<li><p>Confirm replication lag ≈ 0</p>
</li>
<li><p>Execute cutover</p>
</li>
</ol>
<p>For database workloads, application quiescing is mandatory.</p>
<p>MGN does not provide automatic application-consistent snapshots.</p>
<hr />
<h2>9. Replication Timing Reality</h2>
<p>Initial synchronization depends on:</p>
<ul>
<li><p>Disk size</p>
</li>
<li><p>Network bandwidth</p>
</li>
<li><p>IOPS</p>
</li>
<li><p>Write churn</p>
</li>
</ul>
<p>Example (environment dependent):</p>
<ul>
<li><p>50 GB → 20–40 minutes</p>
</li>
<li><p>200 GB → 1–2 hours</p>
</li>
<li><p>1 TB → Dependent on available bandwidth</p>
</li>
</ul>
<p>Downtime is not determined by disk size.</p>
<p>Downtime = write freeze duration + final synchronization time.</p>
<hr />
<h2>10. Cutover Execution Model</h2>
<p>Validated production sequence:</p>
<ol>
<li><p>Freeze application writes</p>
</li>
<li><p>Confirm replication status = Healthy</p>
</li>
<li><p>Confirm lag negligible</p>
</li>
<li><p>Launch cutover instance</p>
</li>
<li><p>Stop GCP VM</p>
</li>
<li><p>Update DNS or routing</p>
</li>
</ol>
<p>Observed downtime: 5–30 minutes.</p>
<hr />
<h2>11. Rollback Strategy</h2>
<p>If validation fails:</p>
<ul>
<li><p>Do not delete source VM</p>
</li>
<li><p>Restore DNS to GCP</p>
</li>
<li><p>Maintain replication configuration</p>
</li>
<li><p>Reattempt cutover</p>
</li>
</ul>
<p>Never decommission the source environment until fully validated.</p>
<hr />
<h2>12. Cost and Scale Considerations</h2>
<h3>AWS Costs</h3>
<ul>
<li><p>Staging replication servers</p>
</li>
<li><p>EBS volumes</p>
</li>
<li><p>Snapshots</p>
</li>
<li><p>NAT (if used)</p>
</li>
<li><p>Data transfer</p>
</li>
</ul>
<h3>GCP Cost</h3>
<ul>
<li>Network egress (primary cost driver)</li>
</ul>
<p>Large parallel migrations may:</p>
<ul>
<li><p>Increase staging cost</p>
</li>
<li><p>Increase EBS consumption</p>
</li>
<li><p>Saturate bandwidth</p>
</li>
<li><p>Trigger API throttling</p>
</li>
</ul>
<p>Recommendation: Execute migrations in controlled waves.</p>
<hr />
<h2>13. When This Approach Makes Sense</h2>
<p>Suitable for:</p>
<ul>
<li><p>Lift-and-shift</p>
</li>
<li><p>Data center exit</p>
</li>
<li><p>Legacy workload relocation</p>
</li>
<li><p>Time-constrained transitions</p>
</li>
</ul>
<p>Not suitable for:</p>
<ul>
<li><p>Cloud-native redesign</p>
</li>
<li><p>Container migrations</p>
</li>
<li><p>Managed database modernization</p>
</li>
<li><p>Application refactoring</p>
</li>
</ul>
<hr />
<h2>Final Engineering Conclusion</h2>
<p>AWS Application Migration Service is predictable when:</p>
<ul>
<li><p>Access models are reconciled</p>
</li>
<li><p>OS compatibility is validated</p>
</li>
<li><p>Replication health and lag are monitored</p>
</li>
<li><p>Cutover is controlled</p>
</li>
<li><p>Rollback capability is preserved</p>
</li>
</ul>
<p>Most migration failures are not replication failures.</p>
<p>They are planning failures.</p>
<p>Cross-cloud VM migration is an operational engineering discipline — not a file transfer task.</p>
]]></content:encoded></item><item><title><![CDATA[AWS DevOps Agent: Real Testing, Architecture & Practical Insights]]></title><description><![CDATA[When AWS introduced AWS DevOps Agent, I was less interested in feature lists and more interested in one practical question.
Can it actually reduce investigation time during real production-style failu]]></description><link>https://devopsofworld.com/aws-devops-agent-real-testing-architecture-insights</link><guid isPermaLink="true">https://devopsofworld.com/aws-devops-agent-real-testing-architecture-insights</guid><category><![CDATA[#CloudWatch]]></category><category><![CDATA[cloud architecture]]></category><category><![CDATA[incident management]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Thu, 12 Mar 2026 03:15:00 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/64f9b973c160b374c81c2b0e/3ac1918e-8e35-4c57-8504-0d38dad36daf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When AWS introduced AWS DevOps Agent, I was less interested in feature lists and more interested in one practical question.</p>
<p>Can it actually reduce investigation time during real production-style failures?</p>
<p>To answer that, I tested it using controlled failure scenarios instead of relying purely on documentation.</p>
<hr />
<h3><strong>What This Article Covers</strong></h3>
<ul>
<li><p>What it actually does</p>
</li>
<li><p>How it works architecturally</p>
</li>
<li><p>What I observed during testing</p>
</li>
<li><p>Where it helps</p>
</li>
<li><p>Where it does not</p>
</li>
</ul>
<p>No marketing. Only practical evaluation.</p>
<hr />
<h3><strong>What Is AWS DevOps Agent?</strong></h3>
<p>AWS DevOps Agent is an AI-powered investigation capability that analyzes AWS telemetry and generates structured incident timelines with evidence-backed probable causes.</p>
<p>It is <strong>not</strong>:</p>
<ul>
<li><p>A chatbot</p>
</li>
<li><p>An auto-remediation engine</p>
</li>
<li><p>A monitoring replacement</p>
</li>
</ul>
<p>It does not generate telemetry.</p>
<p>It consumes existing signals and correlates them.</p>
<hr />
<h3><strong>High-Level Architecture</strong></h3>
<pre><code class="language-plaintext">Application Workload
        ↓
AWS Services (EC2 / RDS / EKS / ALB / Lambda)
        ↓
CloudWatch Metrics + Logs + Events
        ↓
AWS DevOps Agent Correlation Engine
        ↓
Incident Timeline + Evidence + Root Cause Hypothesis
</code></pre>
<p>It acts as a reasoning layer on top of CloudWatch telemetry.</p>
<p>Monitoring detects the problem.  </p>
<p>The DevOps Agent explains it.</p>
<p>It does not independently detect incidents — it analyzes them after alerts are triggered.</p>
<hr />
<h3><strong>Where It Gets Its Data</strong></h3>
<p>The DevOps Agent analyzes signals primarily from:</p>
<ul>
<li><p>Amazon CloudWatch (metrics &amp; logs)</p>
</li>
<li><p>AWS resource configuration events</p>
</li>
<li><p>Control plane activity</p>
</li>
<li><p>Deployment-related changes</p>
</li>
</ul>
<p>Common services involved during investigation include:</p>
<ul>
<li><p>Amazon EC2</p>
</li>
<li><p>Amazon RDS</p>
</li>
<li><p>Amazon EKS</p>
</li>
<li><p>Elastic Load Balancing</p>
</li>
<li><p>AWS Lambda</p>
</li>
</ul>
<p>If metrics and logs are incomplete, investigation quality drops.</p>
<p>Observability maturity directly affects output quality.</p>
<hr />
<h3><strong>Testing Scenario 1: High CPU on Burstable EC2</strong></h3>
<p><strong>Setup</strong></p>
<ul>
<li><p>Burstable EC2 instance</p>
</li>
<li><p>Sustained workload applied</p>
</li>
<li><p>Manual SSH session before spike</p>
</li>
</ul>
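<p>For reproducibility, the sustained load can be generated with a standard stress tool; a minimal sketch, assuming <code>stress-ng</code> is installed (duration is illustrative):</p>
<pre><code class="language-plaintext"># Load all vCPUs long enough to drain CPU credits on a burstable instance
stress-ng --cpu 0 --timeout 30m
</code></pre>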
<h3>Symptoms</h3>
<ul>
<li><p>High CPU alarm</p>
</li>
<li><p>Increased latency</p>
</li>
</ul>
<p><strong>What the Agent Correlated</strong></p>
<ul>
<li><p>CPUUtilization spike</p>
</li>
<li><p>CPUCreditBalance drop</p>
</li>
<li><p>Increased NetworkIn and NetworkOut metrics</p>
</li>
<li><p>SSH login event</p>
</li>
</ul>
<p><strong>Conclusion</strong></p>
<p>Sustained workload exhausted burst credits.</p>
<p>This was expected behavior for a burstable instance — not infrastructure failure.</p>
<p>Instead of manually checking multiple dashboards, the agent produced a structured investigation timeline.</p>
<hr />
<h3><strong>Testing Scenario 2: Application Down (Nginx Configuration Error)</strong></h3>
<p><strong>Setup</strong></p>
<ul>
<li><p>Manual Nginx configuration change</p>
</li>
<li><p>Introduced an invalid directive</p>
</li>
<li><p>Restarted service</p>
</li>
</ul>
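<p>A minimal sketch of how this failure mode can be reproduced (the directive name is intentionally invalid):</p>
<pre><code class="language-plaintext">echo "invalid_directive on;" | sudo tee -a /etc/nginx/nginx.conf
sudo nginx -t                  # fails: unknown directive "invalid_directive"
sudo systemctl restart nginx   # restart fails; the site goes down
</code></pre>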
<p><strong>Symptoms</strong></p>
<ul>
<li><p>Website inaccessible</p>
</li>
<li><p>Instance healthy</p>
</li>
<li><p>No CPU or memory pressure</p>
</li>
</ul>
<p><strong>What the Agent Correlated</strong></p>
<ul>
<li><p>Service restart failure</p>
</li>
<li><p>Configuration change event</p>
</li>
<li><p>Application log error</p>
</li>
<li><p>No correlated resource exhaustion</p>
</li>
</ul>
<p><strong>Conclusion</strong></p>
<p>Application configuration error.</p>
<p>Not a scaling or capacity issue.</p>
<p>The agent correctly separated this from the earlier CPU incident.</p>
<hr />
<h3><strong>Operational Impact</strong></h3>
<p>The biggest benefit was not detection — it was compression.</p>
<p>What normally requires:</p>
<ul>
<li><p>Checking multiple dashboards</p>
</li>
<li><p>Reviewing deployment history</p>
</li>
<li><p>Inspecting logs manually</p>
</li>
</ul>
<p>Was presented as a structured investigation narrative.</p>
<p>That compression directly impacts MTTR.</p>
<p>In short, DevOps Agent improves explanation quality — not detection capability.</p>
<hr />
<h3><strong>What Worked Well</strong></h3>
<p><strong>Timeline Clarity</strong></p>
<p>It clearly shows:</p>
<ul>
<li><p>What changed</p>
</li>
<li><p>When it changed</p>
</li>
<li><p>What metrics moved</p>
</li>
<li><p>What correlated</p>
</li>
</ul>
<p>This reduces guesswork during incidents.</p>
<hr />
<p><strong>Multi-Signal Correlation</strong></p>
<p>It combines:</p>
<ul>
<li><p>Metrics</p>
</li>
<li><p>Logs</p>
</li>
<li><p>Configuration changes</p>
</li>
<li><p>Access events</p>
</li>
</ul>
<p>This cross-signal reasoning improves investigation speed.</p>
<hr />
<p><strong>Issue Isolation</strong></p>
<p>Multiple issues can overlap in production.</p>
<p>The DevOps Agent attempts to isolate causal chains instead of merging everything into one root cause.</p>
<p>That improves RCA accuracy.</p>
<hr />
<h3><strong>Limitations</strong></h3>
<p><strong>No Deep Application Debugging</strong></p>
<p>It cannot analyze:</p>
<ul>
<li><p>Business logic bugs</p>
</li>
<li><p>Runtime memory leaks</p>
</li>
<li><p>Thread-level behavior</p>
</li>
</ul>
<p>Unless those signals are exposed through telemetry.</p>
<hr />
<p><strong>SSH Blind Spot</strong></p>
<p>Commands executed via SSH are invisible unless command logging is enabled.</p>
<p>Proper logging discipline is required.</p>
<hr />
<p><strong>Observability Dependency</strong></p>
<p>Insight quality depends on:</p>
<ul>
<li><p>Log completeness</p>
</li>
<li><p>Metric granularity</p>
</li>
<li><p>Tagging consistency</p>
</li>
<li><p>Retention strategy</p>
</li>
</ul>
<p>Weak telemetry produces weak conclusions.</p>
<hr />
<h3>Cost Considerations</h3>
<p>Because it operates on CloudWatch telemetry, overall cost is tied to observability depth.</p>
<p>Cost drivers include:</p>
<ul>
<li><p>Log ingestion volume</p>
</li>
<li><p>Metric storage</p>
</li>
<li><p>Retention duration</p>
</li>
</ul>
<p>The DevOps Agent itself is not the primary cost driver.</p>
<p>CloudWatch log ingestion, retention policies, and metric granularity determine overall observability spend.</p>
<p>Organizations must balance investigation visibility with cost control.</p>
<hr />
<h3><strong>When It Makes Sense</strong></h3>
<p>Best suited for:</p>
<ul>
<li><p>Multi-service AWS architectures</p>
</li>
<li><p>Teams handling frequent incidents</p>
</li>
<li><p>Organizations aiming to reduce MTTR</p>
</li>
<li><p>Standardizing RCA processes</p>
</li>
</ul>
<p>Less useful when:</p>
<ul>
<li><p>Infrastructure is extremely simple</p>
</li>
<li><p>Logging is minimal</p>
</li>
<li><p>Systems are mostly outside AWS</p>
</li>
</ul>
<hr />
<h3><strong>Final Thoughts</strong></h3>
<p>AWS DevOps Agent should be positioned as:</p>
<ul>
<li><p>An investigation accelerator</p>
</li>
<li><p>A structured reasoning layer over telemetry</p>
</li>
<li><p>An MTTR reduction enabler</p>
</li>
<li><p>A standardization tool for incident analysis</p>
</li>
</ul>
<p>It does not replace engineers.</p>
<p>It amplifies the quality of your existing observability.</p>
<p>It is most effective in mature environments with structured logging and tagging standards.</p>
<p>Strong telemetry in → structured reasoning out.  </p>
<p>Weak telemetry in → weak conclusions out.</p>
]]></content:encoded></item><item><title><![CDATA[Production-Grade GCS to S3 Migration: Secure, Private, and Zero-Egress Architecture]]></title><description><![CDATA[Migrating object storage across cloud providers is not a copy task. It is a cost, network, and security boundary problem.
We migrated 10+ TB of object data from Google Cloud Storage to Amazon S3 under ]]></description><link>https://devopsofworld.com/production-grade-gcs-to-s3-migration-secure-private-and-zero-egress-architecture</link><guid isPermaLink="true">https://devopsofworld.com/production-grade-gcs-to-s3-migration-secure-private-and-zero-egress-architecture</guid><category><![CDATA[Devops]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[AWS]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[S3]]></category><category><![CDATA[#gcs]]></category><category><![CDATA[cloud architecture]]></category><category><![CDATA[infrastructure]]></category><category><![CDATA[multi-cloud]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 10 Mar 2026 03:00:00 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/64f9b973c160b374c81c2b0e/a7227ee3-a136-48ef-bcf4-618794bfb5d1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Migrating object storage across cloud providers is not a copy task.<br />It is a cost, network, and security boundary problem.</p>
<p>We migrated 10+ TB of object data from Google Cloud Storage to Amazon S3 under strict enterprise constraints:</p>
<ul>
<li><p>Zero data loss</p>
</li>
<li><p>No public exposure</p>
</li>
<li><p>No long-lived credentials</p>
</li>
<li><p>Private networking compatibility</p>
</li>
<li><p>Full audit traceability</p>
</li>
<li><p>Controlled and predictable cost</p>
</li>
</ul>
<p>This document describes the architecture and execution model validated for production use.</p>
<hr />
<h2>1. The Engineering Constraint</h2>
<p>In cross-cloud migration, the execution location determines financial and security risk.</p>
<p>If migration runs outside GCP:</p>
<ul>
<li><p>GCS internet egress charges apply</p>
</li>
<li><p>Traffic traverses public endpoints</p>
</li>
<li><p>Credential exposure surface increases</p>
</li>
<li><p>Cost becomes unpredictable at scale</p>
</li>
</ul>
<p>At a multi-terabyte volume, this is unacceptable.</p>
<p>The objective was clear:</p>
<blockquote>
<p>Eliminate public egress from GCS while maintaining integrity and operational control.</p>
</blockquote>
<hr />
<h2>2. Architectural Design</h2>
<p>Migration was executed inside GCP using rclone on a Google Compute Engine (GCE) VM.</p>
<h3>Data Flow</h3>
<pre><code class="language-plaintext">GCS Bucket
   ↓
GCE VM (rclone)
   ↓
HTTPS
   ↓
S3 Bucket
</code></pre>
<p>Impact of this design:</p>
<ul>
<li><p>GCS → VM traffic remains internal</p>
</li>
<li><p>No GCS public egress billing</p>
</li>
<li><p>AWS inbound transfer remains free</p>
</li>
<li><p>Execution remains fully controlled</p>
</li>
</ul>
<p>The primary cost risk was removed architecturally, not operationally.</p>
<hr />
<h2>3. Deployment Modes Validated</h2>
<p>The design was tested under:</p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Component</p></th><th><p>Supported</p></th></tr><tr><td><p>Public GCP VM</p></td><td><p>Yes</p></td></tr><tr><td><p>Private GCP VM (No External IP)</p></td><td><p>Yes</p></td></tr><tr><td><p>Public S3 Endpoint</p></td><td><p>Yes</p></td></tr><tr><td><p>S3 via VPC Endpoint</p></td><td><p>Yes</p></td></tr><tr><td><p>VPN / Interconnect</p></td><td><p>Yes</p></td></tr></tbody></table>

<p>This ensures compatibility with both PoC and hardened enterprise environments.</p>
<hr />
<h2>4. Network &amp; Security Model</h2>
<h3>GCP Side</h3>
<ul>
<li><p>Private VM (no external IP)</p>
</li>
<li><p>Private Google Access enabled</p>
</li>
<li><p>Cloud NAT or VPN for outbound traffic</p>
</li>
<li><p>No inbound exposure</p>
</li>
<li><p>Metadata-based IAM authentication</p>
</li>
</ul>
<p>No service account JSON keys were used.</p>
<p>Access to GCS was restricted to the VM-attached service account.</p>
<hr />
<h3>AWS Side</h3>
<p>Two supported access patterns:</p>
<p><strong>Public S3 (PoC only)</strong>  </p>
<p>Standard endpoint with IAM control.</p>
<p><strong>Private S3 (Production)</strong></p>
<ul>
<li><p>S3 Gateway VPC Endpoint</p>
</li>
<li><p>Bucket policy restricted using <code>aws:SourceVpce</code></p>
</li>
<li><p>No public S3 exposure</p>
</li>
</ul>
<p>Example condition:</p>
<pre><code class="language-plaintext">{
  "Condition": {
    "StringEquals": {
      "aws:SourceVpce": "vpce-xxxxxxxx"
    }
  }
}
</code></pre>
<p>Traffic remains private across environments.</p>
<hr />
<h2>5. Authentication Strategy</h2>
<h3>GCS</h3>
<pre><code class="language-plaintext">[gcs]
type = google cloud storage
env_auth = true
</code></pre>
<ul>
<li><p>VM-attached service account</p>
</li>
<li><p>Metadata server authentication</p>
</li>
<li><p>No static credential storage</p>
</li>
</ul>
<h3>AWS</h3>
<ul>
<li><p>PoC: Temporary access key</p>
</li>
<li><p>Production: IAM Role / STS</p>
</li>
</ul>
<p>No long-lived credentials were introduced.</p>
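<p>For completeness, the S3 side of the rclone configuration follows the same environment-based pattern; a sketch (region is a placeholder):</p>
<pre><code class="language-plaintext">[s3]
type = s3
provider = AWS
env_auth = true
region = &lt;target-region&gt;
</code></pre>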
<hr />
<h2>6. Migration Execution Model</h2>
<p>Execution followed controlled phases.</p>
<h3>Phase 1 — Pre-Flight Validation</h3>
<ul>
<li><p>IAM verification</p>
</li>
<li><p>DNS resolution check</p>
</li>
<li><p>Connectivity confirmation</p>
</li>
<li><p>Source and destination listing</p>
</li>
</ul>
<p>Mandatory dry-run:</p>
<pre><code class="language-plaintext">rclone copy gcs:&lt;bucket&gt; s3:&lt;bucket&gt; \
  --dry-run --checksum
</code></pre>
<p>No migration proceeded without validation.</p>
<hr />
<h3>Phase 2 — Controlled Transfer</h3>
<pre><code class="language-plaintext">rclone copy gcs:&lt;bucket&gt; s3:&lt;bucket&gt; \
  --checksum \
  --fast-list \
  --transfers=8 \
  --checkers=8 \
  --progress
</code></pre>
<p>Controls enforced:</p>
<ul>
<li><p>Checksum validation</p>
</li>
<li><p>Resume-safe execution</p>
</li>
<li><p>Tuned concurrency</p>
</li>
<li><p>Encrypted HTTPS transport</p>
</li>
</ul>
<p>Concurrency was deliberately limited to prevent throttling.</p>
<hr />
<h3>Phase 3 — Integrity Verification</h3>
<pre><code class="language-plaintext">rclone check gcs:&lt;bucket&gt; s3:&lt;bucket&gt;
</code></pre>
<p>This ensured:</p>
<ul>
<li><p>No checksum mismatches</p>
</li>
<li><p>No partial transfers</p>
</li>
<li><p>No silent corruption</p>
</li>
</ul>
<p>Logs and artifacts were archived for audit compliance.</p>
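<p>Log capture for the audit trail can use rclone's global logging flags; a sketch (the log path is a placeholder):</p>
<pre><code class="language-plaintext">rclone check gcs:&lt;bucket&gt; s3:&lt;bucket&gt; \
  --log-file=/var/log/rclone-check.log \
  --log-level INFO
</code></pre>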
<hr />
<h2>7. Cost Model (5 TB Reference)</h2>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Component</p></th><th><p>Cost</p></th></tr><tr><td><p>GCS Egress</p></td><td><p>$0</p></td></tr><tr><td><p>GCP VM</p></td><td><p>Minimal runtime cost</p></td></tr><tr><td><p>Cloud NAT</p></td><td><p>Predictable usage cost</p></td></tr><tr><td><p>AWS Transfer IN</p></td><td><p>$0</p></td></tr><tr><td><p>rclone</p></td><td><p>Free</p></td></tr><tr><td><p><strong>Total</strong></p></td><td><p>Infrastructure-level only</p></td></tr></tbody></table>

<p>If executed externally, GCS internet egress charges alone would exceed the entire infrastructure cost of this approach.</p>
<p>Cost predictability was achieved through architectural control.</p>
<hr />
<h2>8. Alternatives Considered</h2>
<p>Managed services were evaluated.</p>
<p>Observed trade-offs:</p>
<ul>
<li><p>Per-GB transfer charges</p>
</li>
<li><p>Reduced retry visibility</p>
</li>
<li><p>Limited private networking control</p>
</li>
<li><p>Additional service dependency</p>
</li>
</ul>
<p>For small migrations, managed services are acceptable.</p>
<p>For enterprise-scale workloads requiring cost governance and auditability, direct execution provides stronger guarantees.</p>
<hr />
<h2>9. Outcome</h2>
<ul>
<li><p>Zero data loss</p>
</li>
<li><p>No public exposure</p>
</li>
<li><p>IAM-based authentication</p>
</li>
<li><p>Private networking compatibility</p>
</li>
<li><p>Full checksum validation</p>
</li>
<li><p>Resume-safe execution</p>
</li>
<li><p>Audit-ready logs</p>
</li>
<li><p>Production approval</p>
</li>
</ul>
<hr />
<h2>10. Conclusion</h2>
<p>Cross-cloud storage migration is not about moving objects.</p>
<p>It is about defining the correct execution boundary.</p>
<p>By executing the migration inside GCP, we eliminated public egress cost, preserved private networking, reduced credential risk, and maintained deterministic control over the entire process.</p>
<p>When execution placement is correct, migration risk becomes controlled, cost becomes predictable, and integrity becomes measurable.</p>
]]></content:encoded></item><item><title><![CDATA[Migrating Redis OSS Across AWS Accounts — Real Issues Faced and the Production-Safe Solution]]></title><description><![CDATA[Migrating Redis OSS data across AWS accounts sounds simple:
Export snapshot → Restore in another account.
In practice, it is not that straightforward.
In this case, we migrated Redis OSS from Account ]]></description><link>https://devopsofworld.com/migrating-redis-oss-across-aws-accounts-real-issues-faced-and-the-production-safe-solution</link><guid isPermaLink="true">https://devopsofworld.com/migrating-redis-oss-across-aws-accounts-real-issues-faced-and-the-production-safe-solution</guid><category><![CDATA[AWS]]></category><category><![CDATA[Amazon Elasticache]]></category><category><![CDATA[Redis]]></category><category><![CDATA[Amazon S3]]></category><category><![CDATA[cloud architecture]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Wed, 04 Mar 2026 03:15:00 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/64f9b973c160b374c81c2b0e/0b9e9461-8b0d-4bc9-a4f8-1718afbd95aa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Migrating Redis OSS data across AWS accounts sounds simple:</p>
<p>Export snapshot → Restore in another account.</p>
<p>In practice, it is not that straightforward.</p>
<p>In this case, we migrated Redis OSS from <strong>Account A</strong> to <strong>Account B</strong> in the same region (<code>ap-south-1</code>) using default encryption (<strong>SSE-S3</strong>).</p>
<p>This post documents:</p>
<ul>
<li><p>Why snapshot export initially failed</p>
</li>
<li><p>Why cross-account restore failed</p>
</li>
<li><p>What actually worked</p>
</li>
<li><p>The production-safe migration pattern</p>
</li>
</ul>
<p>No assumptions. Only what was tested and validated.</p>
<hr />
<h3>1. Architecture Overview</h3>
<p><strong>Source Account (Account A)</strong><br />Redis OSS → Manual Snapshot → Export to S3</p>
<p><strong>Target Account (Account B)</strong><br />S3 (copied snapshot) → Restore Redis OSS</p>
<p>Key constraint:<br />No public bucket. No insecure workaround. Production-safe design.</p>
<hr />
<h3>2. Phase 1 — Snapshot Export to S3 (Why It Failed)</h3>
<p>When exporting the snapshot from ElastiCache to S3, we encountered:</p>
<pre><code class="language-plaintext">ElastiCache was unable to validate the authenticated user has access to the S3 bucket
</code></pre>
<p>At that time:</p>
<ul>
<li><p>Bucket existed</p>
</li>
<li><p>Same region (<code>ap-south-1</code>)</p>
</li>
<li><p>Block public access enabled</p>
</li>
<li><p>SSE-S3 encryption enabled</p>
</li>
<li><p>Bucket policy configured</p>
</li>
</ul>
<p>Yet export failed.</p>
<hr />
<h3>Root Cause</h3>
<p>In standard AWS regions, ElastiCache export and seed operations commonly require ACL-based grants using the service Canonical ID.<br />In opt-in regions, AWS also documents a bucket policy–based approach using the <code>elasticache-snapshot</code> service principal.</p>
<p>Modern S3 guidance promotes disabling ACLs, but Redis export requires:</p>
<ul>
<li><p>ACLs enabled</p>
</li>
<li><p>ElastiCache Canonical ID added</p>
</li>
</ul>
<p>Without ACL configuration, export validation fails.</p>
<hr />
<h3>Correct Export Configuration</h3>
<p><strong>Step 1 — Enable ACLs</strong></p>
<p>S3 → Bucket → Permissions → Object Ownership</p>
<p>Change from:</p>
<pre><code class="language-plaintext">Bucket owner enforced
</code></pre>
<p>To:</p>
<pre><code class="language-plaintext">ACLs enabled (Bucket owner preferred)
</code></pre>
<p>Save changes.</p>
<hr />
<p><strong>Step 2 — Add ElastiCache Canonical ID</strong></p>
<p>S3 → Bucket → Permissions → ACL → Edit</p>
<p>Add Canonical ID:</p>
<pre><code class="language-plaintext">540804c33a284a299d2547575ce1010f2312ef3da9b3a053c8bc45bf233e4353
</code></pre>
<p>Grant:</p>
<ul>
<li><p>Objects → List, Write</p>
</li>
<li><p>Bucket ACL → Read, Write</p>
</li>
</ul>
<p>Important:</p>
<ul>
<li><p>Block Public Access remained enabled</p>
</li>
<li><p>No “Everyone” permission added</p>
</li>
</ul>
<p>After this configuration, the snapshot export succeeded.</p>
<p>For standard (non-GovCloud) AWS regions, ElastiCache uses the same Canonical ID. Only GovCloud regions use different Canonical IDs. Always verify from AWS official <a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/backups-exporting.html">documentation</a> before configuring.</p>
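<p>The same grants can be applied from the CLI. A sketch, assuming the bucket above; note that <code>put-bucket-acl</code> replaces the entire ACL, so the bucket owner's grant must be included (the owner canonical ID is a placeholder):</p>
<pre><code class="language-plaintext">ECACHE_ID=540804c33a284a299d2547575ce1010f2312ef3da9b3a053c8bc45bf233e4353

# Bucket READ/WRITE = list and write objects; READ_ACP/WRITE_ACP = read and write the bucket ACL
aws s3api put-bucket-acl \
  --bucket redis-oss-backup-bucket \
  --grant-full-control id=YOUR_OWN_CANONICAL_ID \
  --grant-read id=$ECACHE_ID \
  --grant-write id=$ECACHE_ID \
  --grant-read-acp id=$ECACHE_ID \
  --grant-write-acp id=$ECACHE_ID
</code></pre>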
<hr />
<h3>3. Phase 2 — Cross-Account Restore Attempt</h3>
<p>Snapshot successfully exported to:</p>
<pre><code class="language-plaintext">s3://redis-oss-backup-bucket/redis-oss-backup-0001.rdb
</code></pre>
<p>The next step was restoring in Account B.</p>
<p>The restore operation is executed by the ElastiCache service in the target account. Cross-account restore therefore requires an exact service-principal and permission configuration, and our bucket policy did not satisfy those requirements.</p>
<hr />
<h3>Attempt — Cross-Account Bucket Policy</h3>
<p>We initially configured a bucket policy allowing the ElastiCache service principal access. However, snapshot seeding requires the correct regional elasticache-snapshot service principal and specific permissions (including s3:GetBucketAcl). Our configuration did not match the documented requirement, which contributed to restore failure.</p>
<pre><code class="language-plaintext">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowElastiCacheFromAccountB",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticache.amazonaws.com"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::redis-oss-backup-bucket",
        "arn:aws:s3:::redis-oss-backup-bucket/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "ACCOUNT_B_ID"
        }
      }
    }
  ]
}
</code></pre>
<p>Restore failed with:</p>
<pre><code class="language-plaintext">No permission to access S3 object
</code></pre>
<p>Even though:</p>
<ul>
<li><p>Region matched</p>
</li>
<li><p>Object key was correct</p>
</li>
<li><p>SSE-S3 encryption used</p>
</li>
<li><p>Service-linked role existed</p>
</li>
</ul>
<p>Restore did not succeed.</p>
<hr />
<h3><strong>Why It Failed</strong></h3>
<p>In our tested configuration, cross-account restore required more precise service principal permissions than our implementation provided.</p>
<p>The restore process expects:</p>
<ul>
<li><p>Bucket ownership alignment</p>
</li>
<li><p>Same-account execution boundary</p>
</li>
</ul>
<p>Rather than iterating further on service-layer permission nuances, we adopted a production-safe pattern: copy the snapshot into an S3 bucket owned by the target account and restore from there.</p>
<hr />
<h3>4. Diagnostic Test — Public Object</h3>
<p>To isolate the issue, we temporarily made the object public.</p>
<p>Restore succeeded immediately.</p>
<p>This confirmed:</p>
<ul>
<li><p>The snapshot file was valid</p>
</li>
<li><p>The region was correct</p>
</li>
<li><p>Encryption was not the issue</p>
</li>
<li><p>The permission boundary was the root cause</p>
</li>
</ul>
<p>Public access is not production-safe and was used only for troubleshooting.</p>
<hr />
<h3>5. Final Production-Safe Migration Pattern</h3>
<p>Instead of forcing cross-account restore, we redesigned the architecture.</p>
<h3>Correct Approach</h3>
<p>Account A<br />Export snapshot → S3 bucket A</p>
<p>↓</p>
<p>Secure object copy</p>
<p>↓</p>
<p>Account B<br />Restore from S3 bucket B</p>
<p>This ensures ownership alignment and eliminates the cross-account restore boundary.</p>
<hr />
<h3>6. Implementation Steps</h3>
<h3>Step 1 — Create S3 Bucket in Account B</h3>
<ul>
<li><p>Region: <code>ap-south-1</code></p>
</li>
<li><p>Block Public Access: Enabled</p>
</li>
<li><p>Object Ownership: Bucket owner enforced</p>
</li>
<li><p>Default Encryption: SSE-S3</p>
</li>
</ul>
<hr />
<h3>Step 2 — Copy Snapshot Securely</h3>
<pre><code class="language-plaintext">aws s3 cp s3://redis-oss-backup-bucket/redis-oss-backup-0001.rdb \
s3://redis-oss-backup-bucket-b/redis-oss-backup-0001.rdb \
--source-region ap-south-1 \
--region ap-south-1
</code></pre>
<p>After copy:</p>
<ul>
<li><p>Object owner becomes Account B</p>
</li>
<li><p>No cross-account policy required</p>
</li>
</ul>
<p>Ownership alignment resolves the restore issue.</p>
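<p>Ownership can be confirmed after the copy. A quick check (bucket and key as above):</p>
<pre><code class="language-plaintext">aws s3api get-object-acl \
  --bucket redis-oss-backup-bucket-b \
  --key redis-oss-backup-0001.rdb \
  --query 'Owner'
</code></pre>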
<hr />
<h3>Step 3 — Restore in Account B</h3>
<p>Use:</p>
<pre><code class="language-plaintext">redis-oss-backup-bucket-b/redis-oss-backup-0001.rdb
</code></pre>
<p>Restore completed successfully.</p>
<hr />
<h3>7. When to Use Replication Instead</h3>
<p>If the requirement is disaster recovery or continuous synchronization:</p>
<p>Use S3 Cross-Account Replication.</p>
<p>For one-time migration, manual secure copy is simpler and cleaner.</p>
<hr />
<h3>8. Key Engineering Takeaways</h3>
<ol>
<li><p>In standard regions, Redis OSS export typically requires ACL configuration with the ElastiCache Canonical ID.</p>
</li>
<li><p>Cross-account restore requires precise configuration of the service principal and permissions.</p>
</li>
<li><p>Ownership alignment via a secure object copy eliminates cross-account complexity, reduces operational risk, and is the cleanest production solution.</p>
</li>
<li><p>SSE-S3 simplifies cross-account migration.</p>
</li>
<li><p>Making the object public works as a diagnostic, but is never acceptable in production.</p>
</li>
</ol>
<hr />
<h3>Final Principle</h3>
<p>When migrating workloads across AWS accounts:</p>
<p>Do not force cross-account service execution boundaries when the service is not designed to support them.</p>
<p>Move the data.</p>
<p>Align ownership.</p>
<p>Restore cleanly.</p>
<p>That is production-grade cloud engineering.</p>
]]></content:encoded></item><item><title><![CDATA[Cross-Region RDS Disaster Recovery: Production Failover Architecture]]></title><description><![CDATA[1. Overview
This post documents how I designed and implemented cross-region disaster recovery for a production MySQL database running on Amazon RDS.
The requirement was straightforward:
If the primary region (ap-south-1) becomes unavailable, the data...]]></description><link>https://devopsofworld.com/cross-region-rds-disaster-recovery-production-architecture</link><guid isPermaLink="true">https://devopsofworld.com/cross-region-rds-disaster-recovery-production-architecture</guid><category><![CDATA[AWS]]></category><category><![CDATA[amazon-rds]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud architecture]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 03 Mar 2026 03:30:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771393054460/a3c8cd1f-1dc9-4f33-88e3-70b5fb567437.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-1-overview">1. Overview</h2>
<p>This post documents how I designed and implemented cross-region disaster recovery for a production MySQL database running on Amazon RDS.</p>
<p>The requirement was straightforward:</p>
<p>If the primary region (ap-south-1) becomes unavailable, the database must recover automatically from another region with minimal downtime and no manual intervention.</p>
<p>The approach relies entirely on native AWS services and focuses on controlled automation rather than introducing additional orchestration layers.</p>
<hr />
<h2 id="heading-2-problem-statement">2. Problem Statement</h2>
<p>The application was running with:</p>
<ul>
<li><p>A single primary RDS instance in ap-south-1</p>
</li>
<li><p>No cross-region replica</p>
</li>
<li><p>A manual disaster recovery procedure</p>
</li>
</ul>
<p>In the event of a regional failure:</p>
<ul>
<li><p>The database endpoint would become unreachable</p>
</li>
<li><p>The application would lose connectivity</p>
</li>
<li><p>Recovery would depend on human response time</p>
</li>
</ul>
<p>This created an unacceptable recovery model for a production-facing workload.</p>
<p>The goal was to introduce regional resilience while keeping the system operationally predictable and architecturally simple.</p>
<hr />
<h2 id="heading-3-architecture-after-dr-implementation">3. Architecture After DR Implementation</h2>
<p>Application<br />↓<br />Primary RDS (ap-south-1)<br />↓<br />Cross-Region Read Replica (us-east-1)</p>
<p>Automation Flow:</p>
<p>CloudWatch Alarm<br />→ SNS Topic<br />→ Lambda (us-east-1)<br />→ Promote Read Replica</p>
<p>After promotion, the replica becomes a standalone primary database.</p>
<hr />
<h2 id="heading-4-design-decisions">4. Design Decisions</h2>
<h3 id="heading-41-cross-region-read-replica">4.1 Cross-Region Read Replica</h3>
<p>A read replica was created in us-east-1 from the primary RDS instance.</p>
<p>Replication is managed natively by RDS.</p>
<p>Important consideration:</p>
<p>Cross-region replication is asynchronous. This introduces a non-zero Recovery Point Objective (RPO). A small amount of recent data may not be replicated during failover.</p>
<p>For this workload, that trade-off was acceptable.</p>
<hr />
<h3 id="heading-42-failure-detection">4.2 Failure Detection</h3>
<p>Failure detection was implemented using Amazon CloudWatch.</p>
<p>A CloudWatch alarm was configured on the primary instance. The initial metric used was:</p>
<p>DatabaseConnections &lt; 1</p>
<p>If the instance stops accepting connections, the alarm transitions to ALARM state.</p>
<p>In more mature setups, detection can be improved using:</p>
<ul>
<li><p>RDS instance status</p>
</li>
<li><p>RDS event notifications</p>
</li>
<li><p>Composite alarms</p>
</li>
</ul>
<p>The alarm publishes its state change to an SNS topic.</p>
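<p>A sketch of the alarm definition from the CLI (instance identifier, account ID, and topic name are illustrative):</p>
<pre><code class="lang-bash"># Alarm when the minimum connection count stays below 1 for 3 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name rds-primary-unreachable \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-mysql-primary \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:dr-failover-topic
</code></pre>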
<hr />
<h3 id="heading-43-event-routing">4.3 Event Routing</h3>
<p>Amazon Simple Notification Service was introduced to decouple monitoring from execution.</p>
<p>The flow:</p>
<p>CloudWatch → SNS → Lambda</p>
<p>This separation provides:</p>
<ul>
<li><p>Clear event-driven architecture</p>
</li>
<li><p>Loose coupling</p>
</li>
<li><p>Extensibility for future automation</p>
</li>
</ul>
<hr />
<h3 id="heading-44-automated-promotion">4.4 Automated Promotion</h3>
<p>Promotion logic was implemented using AWS Lambda in us-east-1.</p>
<p>The Lambda role was granted minimal permission:</p>
<p>rds:PromoteReadReplica</p>
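<p>A minimal policy sketch for that role (account ID and replica identifier are placeholders):</p>
<pre><code class="lang-bash">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "rds:PromoteReadReplica",
      "Resource": "arn:aws:rds:us-east-1:ACCOUNT_ID:db:your-read-replica-id"
    }
  ]
}
</code></pre>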
<p>Core logic:</p>
<pre><code class="lang-bash">import boto3

def lambda_handler(event, context):
    client = boto3.client(<span class="hljs-string">'rds'</span>, region_name=<span class="hljs-string">'us-east-1'</span>)

    replica_id = <span class="hljs-string">"your-read-replica-id"</span>

    client.promote_read_replica(
        DBInstanceIdentifier=replica_id
    )

    <span class="hljs-built_in">return</span> {
        <span class="hljs-string">"statusCode"</span>: 200,
        <span class="hljs-string">"message"</span>: f<span class="hljs-string">"Promotion initiated for {replica_id}"</span>
    }
</code></pre>
<p>When triggered, Lambda initiates promotion of the replica. Promotion typically completes within a few minutes.</p>
<hr />
<h2 id="heading-5-dns-consideration">5. DNS Consideration</h2>
<p>Promotion alone does not restore application traffic.</p>
<p>The application must connect to the new primary endpoint.</p>
<p>This is typically handled using Amazon Route 53 with:</p>
<ul>
<li><p>Failover routing policies</p>
</li>
<li><p>Health checks</p>
</li>
<li><p>Low TTL configuration</p>
</li>
</ul>
<p>Without DNS integration, traffic would continue to target the failed endpoint.</p>
<p>This is a critical part of the disaster recovery strategy.</p>
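<p>One simple pattern is a low-TTL CNAME that the promotion automation (or an operator) repoints at the new primary endpoint. A sketch with illustrative names:</p>
<pre><code class="lang-bash"># Repoint the application's DB alias to the promoted replica
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "db.internal.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "promoted-replica.xxxx.us-east-1.rds.amazonaws.com"}]
      }
    }]
  }'
</code></pre>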
<hr />
<h2 id="heading-6-validation-process">6. Validation Process</h2>
<p>To validate the setup:</p>
<ul>
<li><p>The primary instance was stopped</p>
</li>
<li><p>CloudWatch alarm triggered</p>
</li>
<li><p>SNS published the event</p>
</li>
<li><p>Lambda executed</p>
</li>
<li><p>The read replica was promoted</p>
</li>
</ul>
<p>Post-promotion validation included:</p>
<ul>
<li><p>Verifying instance state</p>
</li>
<li><p>Testing database connectivity</p>
</li>
<li><p>Executing queries</p>
</li>
<li><p>Confirming application behavior</p>
</li>
</ul>
<p>Failover occurred without manual intervention.</p>
<hr />
<h2 id="heading-7-cost-and-trade-offs">7. Cost and Trade-offs</h2>
<p>Cross-region disaster recovery increases infrastructure cost:</p>
<ul>
<li><p>An additional RDS instance</p>
</li>
<li><p>Cross-region replication traffic</p>
</li>
<li><p>Additional storage</p>
</li>
</ul>
<p>This cost is deliberate and justified by the reduction in downtime risk.</p>
<p>RTO: A few minutes (detection + promotion time)<br />RPO: A few seconds (due to asynchronous replication)</p>
<p>These expectations must be clearly defined with stakeholders before implementation.</p>
<hr />
<h2 id="heading-8-when-this-approach-makes-sense">8. When This Approach Makes Sense</h2>
<p>This architecture is suitable when:</p>
<ul>
<li><p>A few seconds of data loss is acceptable</p>
</li>
<li><p>Automated regional failover is required</p>
</li>
<li><p>Operational simplicity is preferred</p>
</li>
<li><p>The workload is not using Aurora Global Database</p>
</li>
</ul>
<p>For workloads requiring near-zero RPO and faster failover, managed global database architectures may be more appropriate.</p>
<hr />
<h2 id="heading-9-final-takeaway">9. Final Takeaway</h2>
<p>I implemented a cross-region disaster recovery architecture for Amazon RDS using native AWS services.</p>
<p>The system:</p>
<ul>
<li><p>Detects primary failure automatically</p>
</li>
<li><p>Triggers event-driven automation</p>
</li>
<li><p>Promotes a cross-region replica</p>
</li>
<li><p>Reduces human dependency during outages</p>
</li>
</ul>
<p>The design balances simplicity, cost, and resilience. It introduces regional fault tolerance without adding unnecessary operational complexity.</p>
]]></content:encoded></item><item><title><![CDATA[Migrating from Amazon Linux 2 to Amazon Linux 2023]]></title><description><![CDATA[A Practical Production Playbook (200 Instance Scenario)
Amazon Linux 2 (AL2) will reach end-of-support on June 30, 2026. After that date, AWS will no longer provide security updates, patches, or new packages.
Although AL2 continues to receive mainten...]]></description><link>https://devopsofworld.com/amazon-linux-2-to-amazon-linux-2023-migration-guide</link><guid isPermaLink="true">https://devopsofworld.com/amazon-linux-2-to-amazon-linux-2023-migration-guide</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Amazon EKS]]></category><category><![CDATA[ec2]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Linux]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Thu, 26 Feb 2026 03:30:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771374712810/fdda2b3e-dd44-42b3-a501-0fd4c7c7bfc1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-a-practical-production-playbook-200-instance-scenario">A Practical Production Playbook (200 Instance Scenario)</h2>
<p>Amazon Linux 2 (AL2) will reach end-of-support on <strong>June 30, 2026</strong>. After that date, AWS will no longer provide security updates, patches, or new packages.</p>
<p>Although AL2 continues to receive maintenance updates today, it is no longer the forward-looking platform. Amazon Linux 2023 (AL2023) is the long-term replacement, offering a predictable <strong>5-year lifecycle per release (2 years standard + 3 years maintenance)</strong> along with modernized system components.</p>
<p>If you are running production workloads on AL2, migration should be planned early — not rushed near 2026.</p>
<p>This article explains how to safely migrate in a large production environment with:</p>
<ul>
<li><p>Standalone EC2 instances</p>
</li>
<li><p>Auto Scaling Groups (ASG)</p>
</li>
<li><p>Amazon EKS worker nodes</p>
</li>
</ul>
<p>Assume a worst-case environment of 200 AL2 instances.</p>
<hr />
<h2 id="heading-why-in-place-upgrade-is-not-recommended">Why In-Place Upgrade Is Not Recommended</h2>
<p>There is no supported in-place upgrade path from AL2 to AL2023.</p>
<p>AL2023 introduces:</p>
<ul>
<li><p>Updated kernel</p>
</li>
<li><p>Newer system libraries</p>
</li>
<li><p>Updated OpenSSL and crypto policies</p>
</li>
<li><p>cgroup v2 by default</p>
</li>
<li><p>Updated container runtime stack</p>
</li>
</ul>
<p>Attempting in-place OS mutation:</p>
<ul>
<li><p>Is unsupported</p>
</li>
<li><p>Is difficult to test</p>
</li>
<li><p>Has no clean rollback</p>
</li>
<li><p>Is unsafe for Kubernetes worker nodes</p>
</li>
</ul>
<p>The correct production pattern is:</p>
<p><strong>Build new → Validate → Controlled cutover → Preserve rollback → Decommission old</strong></p>
<hr />
<h2 id="heading-example-production-layout-200-instances">Example Production Layout (200 Instances)</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Workload Type</td><td>Count</td></tr>
</thead>
<tbody>
<tr>
<td>Standalone EC2 (stateful)</td><td>40</td></tr>
<tr>
<td>Auto Scaling Groups (stateless)</td><td>120</td></tr>
<tr>
<td>EKS worker nodes</td><td>40</td></tr>
<tr>
<td><strong>Total</strong></td><td><strong>200</strong></td></tr>
</tbody>
</table>
</div><p>Each workload type requires a different strategy.</p>
<hr />
<h2 id="heading-1-standalone-ec2-stateful-systems">1. Standalone EC2 (Stateful Systems)</h2>
<h3 id="heading-typical-examples">Typical Examples</h3>
<ul>
<li><p>Databases</p>
</li>
<li><p>Legacy applications</p>
</li>
<li><p>EC2 instances using Elastic IP</p>
</li>
<li><p>Applications dependent on local storage</p>
</li>
</ul>
<hr />
<h2 id="heading-step-1-create-a-safety-checkpoint">Step 1 – Create a Safety Checkpoint</h2>
<p>Before any migration:</p>
<ul>
<li><p>Create EBS snapshots</p>
</li>
<li><p>Confirm snapshot completion</p>
</li>
</ul>
<p>This is your rollback baseline.</p>
<hr />
<h2 id="heading-step-2-launch-parallel-al2023-instance">Step 2 – Launch Parallel AL2023 Instance</h2>
<p>Create a new EC2 instance with:</p>
<ul>
<li><p>Same VPC and subnet</p>
</li>
<li><p>Same security groups</p>
</li>
<li><p>Same IAM role</p>
</li>
<li><p>Same instance type</p>
</li>
</ul>
<p>Do not modify the AL2 instance.</p>
<hr />
<h2 id="heading-step-3-validate-dependency-compatibility">Step 3 – Validate Dependency Compatibility</h2>
<p>Test explicitly:</p>
<ul>
<li><p>Runtime versions (Java, Python, Node)</p>
</li>
<li><p>OpenSSL behavior</p>
</li>
<li><p>Crypto policy differences</p>
</li>
<li><p>Systemd services</p>
</li>
<li><p>Hardcoded paths in custom scripts</p>
</li>
</ul>
<p>Small OS-level changes can break production services.</p>
<hr />
<h2 id="heading-step-4-controlled-data-migration">Step 4 – Controlled Data Migration</h2>
<ol>
<li><p>Initial sync while application is running</p>
</li>
<li><p>Stop or freeze writes</p>
</li>
<li><p>Final sync with checksum validation</p>
</li>
</ol>
<p>Example:</p>
<pre><code class="lang-bash">rsync -avh --checksum --numeric-ids /data/ new-server:/data/
</code></pre>
<p>For database systems, prefer logical dump/restore over raw filesystem copy to avoid corruption risk.</p>
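<p>For MySQL, a dump/restore sketch (host, user, and database names are illustrative):</p>
<pre><code class="lang-bash"># On the AL2 host: consistent logical dump without locking InnoDB tables
mysqldump --single-transaction --routines --triggers \
  -h 127.0.0.1 -u admin -p appdb &gt; appdb.sql

# On the AL2023 host: restore into a freshly created database
mysql -h 127.0.0.1 -u admin -p appdb &lt; appdb.sql
</code></pre>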
<hr />
<h2 id="heading-step-5-validate-before-cutover">Step 5 – Validate Before Cutover</h2>
<p>Start the application on AL2023 and verify:</p>
<ul>
<li><p>Application logs</p>
</li>
<li><p>Health endpoints</p>
</li>
<li><p>Downstream connectivity</p>
</li>
<li><p>Resource utilization</p>
</li>
</ul>
<p>Only after validation:</p>
<ul>
<li><p>Stop service on AL2</p>
</li>
<li><p>Switch Elastic IP or DNS</p>
</li>
</ul>
<p>Keep AL2 intact until full confidence.</p>
<hr />
<h2 id="heading-2-auto-scaling-groups-stateless-systems">2. Auto Scaling Groups (Stateless Systems)</h2>
<h3 id="heading-typical-examples-1">Typical Examples</h3>
<ul>
<li><p>Web servers</p>
</li>
<li><p>APIs</p>
</li>
<li><p>Microservices</p>
</li>
</ul>
<p>The main risk is pushing a broken AMI to the entire fleet.</p>
<hr />
<h2 id="heading-step-1-build-an-al2023-golden-ami">Step 1 – Build an AL2023 Golden AMI</h2>
<p>Include:</p>
<ul>
<li><p>Monitoring agents</p>
</li>
<li><p>Security agents</p>
</li>
<li><p>Logging agents</p>
</li>
<li><p>Application bootstrap scripts</p>
</li>
</ul>
<p>Test full userdata execution.<br />Simulate instance termination to confirm auto-recovery.</p>
<hr />
<h2 id="heading-step-2-create-new-launch-template-version">Step 2 – Create New Launch Template Version</h2>
<p>Update only:</p>
<ul>
<li>AMI ID</li>
</ul>
<p>Keep AL2 template available for rollback.</p>
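<p>A sketch of the version bump (template name and AMI ID are illustrative):</p>
<pre><code class="lang-bash"># New version inherits everything from the current one except the AMI
aws ec2 create-launch-template-version \
  --launch-template-name app-fleet \
  --source-version '$Latest' \
  --launch-template-data '{"ImageId":"ami-0123456789abcdef0"}'
</code></pre>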
<hr />
<h2 id="heading-step-3-canary-deployment">Step 3 – Canary Deployment</h2>
<p>Increase desired capacity by 1.</p>
<p>Validate:</p>
<ul>
<li><p>Load balancer health checks</p>
</li>
<li><p>Application startup</p>
</li>
<li><p>Logs and error rate</p>
</li>
<li><p>Metrics stability</p>
</li>
</ul>
<p>Do not skip canary testing.</p>
<hr />
<h2 id="heading-step-4-controlled-instance-refresh">Step 4 – Controlled Instance Refresh</h2>
<p>Use safe rollout configuration:</p>
<ul>
<li><p>Configure minimum healthy percentage appropriate for fleet size (e.g., 90–100% for large fleets).</p>
</li>
<li><p>Warm-up time configured</p>
</li>
<li><p>ELB health checks enabled</p>
</li>
</ul>
<p>Monitor closely during rollout.</p>
<p>If instability occurs:</p>
<ul>
<li><p>Cancel refresh</p>
</li>
<li><p>Revert launch template</p>
</li>
</ul>
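<p>Both the guarded rollout and the rollback can be driven from the CLI. A sketch (the ASG name is illustrative):</p>
<pre><code class="lang-bash"># Guarded rollout: keep 90% healthy, give instances 5 minutes to warm up
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name app-fleet-asg \
  --preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'

# Abort if instability is observed
aws autoscaling cancel-instance-refresh \
  --auto-scaling-group-name app-fleet-asg
</code></pre>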
<hr />
<h2 id="heading-3-amazon-eks-worker-nodes-highest-blast-radius">3. Amazon EKS Worker Nodes (Highest Blast Radius)</h2>
<p>AL2023 introduces changes that can affect Kubernetes workloads:</p>
<ul>
<li><p>cgroup v2</p>
</li>
<li><p>Updated kernel</p>
</li>
<li><p>Updated container runtime</p>
</li>
</ul>
<p>This can impact:</p>
<ul>
<li><p>DaemonSets</p>
</li>
<li><p>Monitoring agents</p>
</li>
<li><p>Security tooling</p>
</li>
<li><p>CNI plugins</p>
</li>
</ul>
<hr />
<h2 id="heading-safe-eks-migration-flow">Safe EKS Migration Flow</h2>
<h3 id="heading-step-1-add-al2023-node-group">Step 1 – Add AL2023 Node Group</h3>
<p>Create a new managed node group (or Karpenter pool).<br />Do not modify AL2 nodes yet.</p>
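<p>A sketch using eksctl, which supports AL2023 node groups in recent versions (cluster and node group names are illustrative):</p>
<pre><code class="lang-bash">eksctl create nodegroup \
  --cluster prod-cluster \
  --name al2023-ng \
  --node-ami-family AmazonLinux2023 \
  --nodes 3
</code></pre>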
<hr />
<h3 id="heading-step-2-taint-al2-nodes">Step 2 – Taint AL2 Nodes</h3>
<pre><code class="lang-bash">kubectl taint node &lt;al2-node&gt; os=al2:NoSchedule
</code></pre>
<p>Effect:</p>
<ul>
<li><p>No new pods schedule on AL2</p>
</li>
<li><p>Existing pods continue running</p>
</li>
</ul>
<hr />
<h3 id="heading-step-3-validate-scheduling-on-al2023">Step 3 – Validate Scheduling on AL2023</h3>
<p>Scale workloads or deploy new services.</p>
<p>Confirm:</p>
<ul>
<li><p>Pods schedule on AL2023</p>
</li>
<li><p>Networking functions correctly</p>
</li>
<li><p>Metrics and logs flow normally</p>
</li>
</ul>
<hr />
<h3 id="heading-step-4-validate-cluster-add-ons-and-daemonsets">Step 4 – Validate Cluster Add-ons and DaemonSets</h3>
<p>Check:</p>
<ul>
<li><p>VPC CNI</p>
</li>
<li><p>CoreDNS</p>
</li>
<li><p>kube-proxy</p>
</li>
<li><p>Logging agents</p>
</li>
<li><p>Monitoring agents</p>
</li>
<li><p>Security tools</p>
</li>
</ul>
<p>Ensure cluster add-ons (VPC CNI, CoreDNS, kube-proxy) versions are compatible with AL2023 node AMIs before rollout.</p>
<p>Also verify PodDisruptionBudgets:</p>
<pre><code class="lang-bash">kubectl get pdb -A
</code></pre>
<p>Ignoring PDBs can cause draining failures or partial outages.</p>
<hr />
<h3 id="heading-step-5-drain-al2-nodes">Step 5 – Drain AL2 Nodes</h3>
<pre><code class="lang-bash">kubectl drain &lt;node&gt; --ignore-daemonsets
</code></pre>
<p>Observe workload behavior during drain.</p>
<hr />
<h3 id="heading-step-6-delete-al2-node-group">Step 6 – Delete AL2 Node Group</h3>
<p>Delete only after full stability confirmation.</p>
<p>Rollback is possible only until deletion.</p>
<hr />
<h2 id="heading-common-production-mistakes">Common Production Mistakes</h2>
<ul>
<li><p>Attempting in-place OS upgrades</p>
</li>
<li><p>Skipping canary validation</p>
</li>
<li><p>Ignoring bootstrap script testing</p>
</li>
<li><p>Draining EKS nodes prematurely</p>
</li>
<li><p>Not snapshotting stateful systems</p>
</li>
<li><p>Removing rollback resources too early</p>
</li>
</ul>
<hr />
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>Amazon Linux 2 reaches end-of-support on June 30, 2026. Migration to Amazon Linux 2023 should follow immutable infrastructure principles. For stateful EC2, use parallel instances with validated data sync and controlled DNS cutover. For Auto Scaling Groups, roll out a new AMI using canary and guarded instance refresh. For EKS, introduce AL2023 nodes, prevent new scheduling on AL2, validate workloads and cluster add-ons, then drain and remove AL2 nodes after stability is confirmed. Maintain rollback until the final step.</p>
<hr />
<h2 id="heading-final-takeaway">Final Takeaway</h2>
<p>This migration is not about replacing servers.<br />It is about maintaining production stability while upgrading the platform.</p>
<p>Plan early.<br />Validate carefully.<br />Preserve rollback.<br />Decommission only after confidence.</p>
]]></content:encoded></item><item><title><![CDATA[Production Incident: Control Plane Latency During Large-Scale Rollout on Amazon EKS]]></title><description><![CDATA[1. Context
As part of readiness planning for high-demand production scenarios, we executed a large-scale rollout simulation on one of our production clusters running onAmazon Elastic Kubernetes Service.
The cluster hosts thousands of pods, supports a...]]></description><link>https://devopsofworld.com/eks-control-plane-latency-production-incident</link><guid isPermaLink="true">https://devopsofworld.com/eks-control-plane-latency-production-incident</guid><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[SRE]]></category><category><![CDATA[production-incident]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Wed, 25 Feb 2026 03:30:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771376235927/a5e68572-e680-4b1f-8144-4fb70a65a1de.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-1-context">1. Context</h2>
<p>As part of readiness planning for high-demand production scenarios, we executed a large-scale rollout simulation on one of our production clusters running on Amazon Elastic Kubernetes Service.</p>
<p>The cluster hosts thousands of pods, supports active CI/CD workflows, and runs with autoscaling enabled.</p>
<p>To validate system behavior under operational stress, we triggered an update of approximately 2000 pods in parallel.</p>
<p>The objective was simple: identify performance boundaries before peak traffic windows.</p>
<hr />
<h2 id="heading-2-what-happened">2. What Happened</h2>
<p>During the rollout:</p>
<ul>
<li><p><code>kubectl get pods</code> responses became noticeably slow</p>
</li>
<li><p>Deployment progression slowed</p>
</li>
<li><p>CI pipelines interacting with the cluster experienced delays</p>
</li>
<li><p>API response times increased</p>
</li>
</ul>
<p>There were:</p>
<ul>
<li><p>No worker node failures</p>
</li>
<li><p>No pod crashes</p>
</li>
<li><p>No autoscaling instability</p>
</li>
</ul>
<p>Once the rollout completed, API performance returned to normal.</p>
<hr />
<h2 id="heading-3-observations">3. Observations</h2>
<p>CloudWatch Container Insights showed:</p>
<ul>
<li><p>A sharp spike in API server request volume</p>
</li>
<li><p>Increased API server latency</p>
</li>
<li><p>Minor request drops during peak rollout</p>
</li>
<li><p>Automatic normalization after rollout completion</p>
</li>
</ul>
<p>The behavior was consistent during heavy parallel updates.</p>
<p>This indicated temporary control plane saturation under burst traffic.</p>
<hr />
<h2 id="heading-4-root-cause-analysis">4. Root Cause Analysis</h2>
<p>Updating ~2000 pods simultaneously generated significant API traffic, including:</p>
<ul>
<li><p>Pod create and update requests</p>
</li>
<li><p>Deployment controller reconciliation</p>
</li>
<li><p>Watch stream updates</p>
</li>
<li><p>kubelet status reporting</p>
</li>
<li><p>Autoscaler interactions</p>
</li>
<li><p>CI/CD polling</p>
</li>
</ul>
<p>All of these operations flow through the Kubernetes API server.</p>
<p>By default, Amazon Elastic Kubernetes Service uses reactive auto-scaling for its control plane.</p>
<p>Reactive scaling introduces a short window where burst request volume can temporarily exceed allocated API capacity before scaling adjusts.</p>
<p>During that window:</p>
<ul>
<li><p>API latency increases</p>
</li>
<li><p>kubectl commands respond slowly</p>
</li>
<li><p>Rollout completion time extends</p>
</li>
</ul>
<p>The system stabilizes once traffic decreases and scaling catches up.</p>
<p>This was a burst capacity boundary — not a failure.</p>
<hr />
<h2 id="heading-5-risk-consideration">5. Risk Consideration</h2>
<p>Under normal operating conditions, temporary latency during heavy rollout may be acceptable.</p>
<p>However, during high-demand production windows:</p>
<ul>
<li><p>Deployment speed directly impacts mitigation time</p>
</li>
<li><p>Autoscaling responsiveness is critical</p>
</li>
<li><p>API stability affects operational recovery</p>
</li>
</ul>
<p>Control plane latency becomes an operational risk during peak events.</p>
<hr />
<h2 id="heading-6-solution-evaluated">6. Solution Evaluated</h2>
<p>To eliminate the burst latency window, we evaluated Provisioned Control Plane in Amazon Elastic Kubernetes Service.</p>
<p>Provisioned Control Plane allows selecting predefined control plane capacity tiers instead of relying entirely on reactive scaling.</p>
<p>This provides:</p>
<ul>
<li><p>Reserved API throughput</p>
</li>
<li><p>Predictable control plane performance</p>
</li>
<li><p>Reduced throttling during heavy rollouts</p>
</li>
<li><p>Improved stability under burst conditions</p>
</li>
</ul>
<p>Higher tiers provide greater sustained API capacity, with increased operational cost.</p>
<hr />
<h2 id="heading-7-action-taken">7. Action Taken</h2>
<p>We decided to validate the higher control plane tier in non-production first.</p>
<p>Steps performed:</p>
<ol>
<li><p>Upgraded the control plane tier.</p>
</li>
<li><p>Re-ran heavy rollout simulations.</p>
</li>
<li><p>Compared API latency and request drop metrics.</p>
</li>
<li><p>Evaluated stability improvement versus cost impact.</p>
</li>
</ol>
<p>Command used:</p>
<pre><code class="lang-bash">aws eks update-cluster-config \
  --name apps-eks \
  --control-plane-scaling-config tier=tier-xl
</code></pre>
<p>Verification:</p>
<pre><code class="lang-bash">aws eks describe-cluster --name apps-eks
</code></pre>
<p>Production rollout will be based on measured improvement and cost justification.</p>
<hr />
<h2 id="heading-8-key-learnings">8. Key Learnings</h2>
<ul>
<li><p>Large clusters expose control plane limits during parallel rollouts.</p>
</li>
<li><p>Reactive scaling introduces short latency windows under burst traffic.</p>
</li>
<li><p>Deployment scale directly influences API server performance.</p>
</li>
<li><p>Control plane capacity planning must be part of production architecture decisions.</p>
</li>
<li><p>Provisioned Control Plane is suitable for environments with frequent heavy updates or high operational demand.</p>
</li>
</ul>
<hr />
<h2 id="heading-final-outcome">Final Outcome</h2>
<p>The incident did not cause downtime or workload failure.</p>
<p>It identified a control plane burst capacity boundary during large-scale rollout testing.</p>
<p>By addressing it during readiness validation, we reduced operational risk before peak demand scenarios.</p>
]]></content:encoded></item><item><title><![CDATA[Production Change: Migrating a StatefulSet from Large to Smaller Nodes in EKS (Without Downtime)]]></title><description><![CDATA[We had a production application running on Amazon EKS as a StatefulSet.Each replica had its own PersistentVolumeClaim backed by Amazon EBS.
During the initial launch phase, we deployed the workload on]]></description><link>https://devopsofworld.com/production-change-migrating-a-statefulset-from-large-to-smaller-nodes-in-eks-without-downtime</link><guid isPermaLink="true">https://devopsofworld.com/production-change-migrating-a-statefulset-from-large-to-smaller-nodes-in-eks-without-downtime</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AWS]]></category><category><![CDATA[statefulsets]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 24 Feb 2026 03:16:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771313635166/4d1a5c59-8770-4c4b-846a-30af5e77be46.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We had a production application running on <strong>Amazon EKS</strong> as a StatefulSet.<br />Each replica had its own PersistentVolumeClaim backed by <strong>Amazon EBS</strong>.</p>
<p>During the initial launch phase, we deployed the workload on large instances to remove any performance uncertainty.</p>
<p>After a few weeks of monitoring, the data was clear:</p>
<ul>
<li><p>CPU utilization consistently below 35%</p>
</li>
<li><p>Memory below 40%</p>
</li>
<li><p>No disk pressure</p>
</li>
<li><p>Stable traffic and latency</p>
</li>
</ul>
<p>We were clearly over-provisioned.</p>
<p>The goal was straightforward:</p>
<p>Move the StatefulSet from a large node group to a smaller node group to reduce infrastructure cost — without downtime.</p>
<p>This was a live production system.</p>
<hr />
<h2>Cluster Setup</h2>
<p>We had two managed node groups:</p>
<ul>
<li><p><code>aws-devops-agent-eks-test-ng1</code> (large instances)</p>
</li>
<li><p><code>migration-ng</code> (smaller instances)</p>
</li>
</ul>
<p>Workload characteristics:</p>
<ul>
<li><p>StatefulSet with 3 replicas</p>
</li>
<li><p>Each pod had its own PVC (created via <code>volumeClaimTemplates</code>)</p>
</li>
<li><p>StorageClass backed by EBS</p>
</li>
<li><p>PodDisruptionBudget configured:</p>
</li>
</ul>
<pre><code class="language-bash">maxUnavailable: 1
</code></pre>
<p>Before touching the production workload, we tested the entire flow using a demo Nginx StatefulSet with PVCs. This allowed us to observe storage detach/attach behavior safely.</p>
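<p>A minimal sketch of such a demo StatefulSet (names, image, and StorageClass are illustrative):</p>
<pre><code class="language-bash"># Headless Service required by the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: web-demo
spec:
  clusterIP: None
  selector:
    app: web-demo
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-demo
spec:
  serviceName: web-demo
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: aws-devops-agent-eks-test-ng1
      containers:
      - name: nginx
        image: nginx:1.25
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 1Gi
</code></pre>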
<hr />
<h2>What Happens When You Change nodeSelector in a StatefulSet</h2>
<p>Changing the <code>nodeSelector</code> modifies the pod template inside the StatefulSet:</p>
<pre><code class="language-bash">spec:
  template:
    spec:
      nodeSelector:
</code></pre>
<p>Any change under <code>spec.template</code> updates the pod template hash.</p>
<p>That automatically triggers a rolling update.</p>
<p>No manual restart command is required.</p>
<p>For each pod, Kubernetes performs the following lifecycle:</p>
<ol>
<li><p>Terminate pod on the old node</p>
</li>
<li><p>Detach the EBS volume</p>
</li>
<li><p>Schedule pod on a new node</p>
</li>
<li><p>Attach the same volume</p>
</li>
<li><p>Mount filesystem</p>
</li>
<li><p>Start container</p>
</li>
<li><p>Wait for readiness probe</p>
</li>
</ol>
<p>Because this is a StatefulSet:</p>
<ul>
<li><p>Pod identity remains stable</p>
</li>
<li><p>PVC remains the same</p>
</li>
<li><p>The EBS volume remains in its original Availability Zone</p>
</li>
</ul>
<p>The primary risk during migration is not parallel restarts — StatefulSet prevents that by default.<br />The real concern is restart pacing and storage stability between transitions.</p>
<hr />
<h2>Why We Explicitly Kept OrderedReady</h2>
<p>We defined:</p>
<pre><code class="language-bash">podManagementPolicy: OrderedReady
</code></pre>
<p>StatefulSet supports two policies:</p>
<ul>
<li><p>OrderedReady</p>
</li>
<li><p>Parallel</p>
</li>
</ul>
<p>With OrderedReady:</p>
<ul>
<li><p>Pods are terminated in reverse ordinal order (pod-2 → pod-1 → pod-0)</p>
</li>
<li><p>The controller waits for a pod to become Ready before proceeding to the next one</p>
</li>
</ul>
<p>This guarantees serialized lifecycle transitions.</p>
<p>Only one pod is ever moving at a time.</p>
<p>For storage-backed workloads, predictability is more important than speed.</p>
<hr />
<h2>Why We Increased minReadySeconds</h2>
<p>Originally:</p>
<pre><code class="language-bash">minReadySeconds: 30
</code></pre>
<p>During migration, we increased it:</p>
<pre><code class="language-bash">minReadySeconds: 60
</code></pre>
<p>This does not delay traffic.</p>
<p>It delays rollout progression.</p>
<p>The behavior becomes:</p>
<ul>
<li><p>Pod becomes Ready</p>
</li>
<li><p>Controller waits 60 seconds</p>
</li>
<li><p>Then proceeds to the next pod</p>
</li>
</ul>
<p>That buffer provides:</p>
<ul>
<li><p>Storage stabilization time</p>
</li>
<li><p>Application warm-up window</p>
</li>
<li><p>A monitoring observation period before the next restart</p>
</li>
</ul>
<p>OrderedReady ensures serialization.<br />minReadySeconds ensures pacing.</p>
<p>Together, they create controlled transitions.</p>
<hr />
<h2>What PodDisruptionBudget Actually Protects</h2>
<p>The PodDisruptionBudget does not control rolling update speed.</p>
<p>It protects against voluntary disruptions such as:</p>
<ul>
<li><p>Node drain</p>
</li>
<li><p>Evictions</p>
</li>
<li><p>Autoscaler actions</p>
</li>
</ul>
<p>With:</p>
<pre><code class="language-bash">maxUnavailable: 1
</code></pre>
<p>We ensured that even outside the rollout logic, no more than one pod could be voluntarily disrupted at a time.</p>
<p>This preserved availability guarantees during infrastructure operations.</p>
<hr />
<h2>Availability Zone Validation</h2>
<p>EBS volumes are Availability Zone bound.</p>
<p>Because PVCs were already provisioned, each PersistentVolume existed in a specific AZ.</p>
<p>Before migration, we verified:</p>
<ul>
<li><p>The smaller node group spans the same AZs as the large node group</p>
</li>
<li><p>There is node capacity in those AZs</p>
</li>
</ul>
<p>If a pod’s volume resides in <code>ap-south-1a</code>, the new node must also be in <code>ap-south-1a</code>.<br />Otherwise, the pod remains Pending due to volume node affinity constraints.</p>
<p>This check is mandatory for StatefulSet migrations using EBS.</p>
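<p>Both sides of the check can be done with kubectl. A sketch:</p>
<pre><code class="language-bash"># Inspect the AZ binding of each PersistentVolume
kubectl describe pv | grep -A3 "Node Affinity"

# Confirm the smaller node group has nodes in those AZs
kubectl get nodes -l eks.amazonaws.com/nodegroup=migration-ng \
  -L topology.kubernetes.io/zone
</code></pre>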
<hr />
<h2>What We Changed</h2>
<p>Only two fields were modified.</p>
<p>Before:</p>
<pre><code class="language-bash">minReadySeconds: 30

nodeSelector:
  eks.amazonaws.com/nodegroup: aws-devops-agent-eks-test-ng1
</code></pre>
<p>After:</p>
<pre><code class="language-bash">minReadySeconds: 60

nodeSelector:
  eks.amazonaws.com/nodegroup: migration-ng
</code></pre>
<p>Everything else remained unchanged.</p>
<p>Small change. Controlled blast radius.</p>
<hr />
<h2>Execution</h2>
<p>Steps:</p>
<ol>
<li><p>Created the smaller node group (<code>migration-ng</code>)</p>
</li>
<li><p>Verified AZ alignment and resource headroom</p>
</li>
<li><p>Updated the StatefulSet manifest</p>
</li>
<li><p>Applied the updated configuration:</p>
</li>
</ol>
<pre><code class="language-bash">kubectl apply -f statefulset.yaml
</code></pre>
<p>Because the pod template changed, the StatefulSet controller automatically initiated a rolling update.</p>
<p>We first executed this full flow using the demo Nginx StatefulSet to validate behavior before applying it to the production workload.</p>
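<p>Progress can be watched while the controller walks through the replicas one at a time:</p>
<pre><code class="language-bash"># Blocks until the rolling update completes
kubectl rollout status statefulset/&lt;name&gt;

# Watch pods move to the new nodes, one ordinal at a time
kubectl get pods -o wide -w
</code></pre>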
<hr />
<h2>Observed Behavior</h2>
<p>For each replica:</p>
<ul>
<li><p>Pod terminated on the large node</p>
</li>
<li><p>EBS detached</p>
</li>
<li><p>Pod scheduled onto a smaller node in the same AZ</p>
</li>
<li><p>Volume attached</p>
</li>
<li><p>Pod became Ready</p>
</li>
<li><p>Controller waited 60 seconds</p>
</li>
<li><p>Next pod restarted</p>
</li>
</ul>
<p>There were no overlapping transitions.</p>
<p>No downtime.<br />No error spike.<br />Stable latency throughout.</p>
<hr />
<h2>Final Result</h2>
<p>Before:</p>
<p>3 large instances</p>
<p>After:</p>
<p>3 smaller instances</p>
<p>Outcome:</p>
<ul>
<li><p>Reduced compute cost</p>
</li>
<li><p>Preserved availability</p>
</li>
<li><p>Maintained performance</p>
</li>
<li><p>No operational instability</p>
</li>
</ul>
<hr />
<h2>What This Actually Was</h2>
<p>This was not just resizing infrastructure.</p>
<p>It was a controlled lifecycle transition of a stateful workload.</p>
<p>StatefulSet migrations are safe when you:</p>
<ul>
<li><p>Respect controller behavior</p>
</li>
<li><p>Serialize restarts</p>
</li>
<li><p>Add rollout pacing</p>
</li>
<li><p>Validate storage topology</p>
</li>
<li><p>Test with a safe workload first</p>
</li>
</ul>
<p>That’s exactly what we did.</p>
<p>And the migration completed without downtime.</p>
]]></content:encoded></item><item><title><![CDATA[Zero-Downtime Migration from NGINX Ingress to Gateway API on Amazon EKS (Production Case Study)]]></title><description><![CDATA[A Zero-Downtime, Step-by-Step Implementation Guide

1. Overview
In this post, we walk through a real production migration of a Kubernetes workload from NGINX Ingress Controller to Kubernetes Gateway API, implemented using Envoy Gateway, on Amazon EKS...]]></description><link>https://devopsofworld.com/zero-downtime-nginx-ingress-to-gateway-api-eks</link><guid isPermaLink="true">https://devopsofworld.com/zero-downtime-nginx-ingress-to-gateway-api-eks</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Amazon EKS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Gateway API]]></category><category><![CDATA[nginx]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Thu, 19 Feb 2026 03:30:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771168253814/252fc831-0443-4517-8426-2ca451f70d5d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-a-zero-downtime-step-by-step-implementation-guide">A Zero-Downtime, Step-by-Step Implementation Guide</h3>
<hr />
<h2 id="heading-1-overview">1. Overview</h2>
<p>In this post, we walk through a <strong>real production migration</strong> of a Kubernetes workload from <strong>NGINX Ingress Controller</strong> to <strong>Kubernetes Gateway API</strong>, implemented using <strong>Envoy Gateway</strong>, on <strong>Amazon EKS</strong>.</p>
<p>The key objective was to:</p>
<ul>
<li><p>Migrate safely with <strong>zero downtime</strong></p>
</li>
<li><p>Avoid introducing unnecessary cloud-specific complexity</p>
</li>
<li><p>Align the platform with Kubernetes’ <strong>future networking direction</strong></p>
</li>
</ul>
<p>This guide is written from a <strong>platform ownership perspective</strong>, not a lab or demo setup.</p>
<hr />
<h2 id="heading-2-problem-statement">2. Problem Statement</h2>
<p>The application was already running in production and exposed using <strong>NGINX Ingress Controller</strong>.</p>
<p>While the setup was stable, the following risks were identified:</p>
<ul>
<li><p>The NGINX Ingress Controller project has moved toward reduced long-term maintenance focus, increasing uncertainty around future support guarantees.</p>
</li>
<li><p>No long-term guarantees for:</p>
<ul>
<li><p>Security patches</p>
</li>
<li><p>CVE fixes</p>
</li>
<li><p>Compatibility with future Kubernetes versions</p>
</li>
</ul>
</li>
<li><p>Ingress sits at the <strong>cluster edge</strong>, making it a high-blast-radius component</p>
</li>
</ul>
<p>Although there was <strong>no immediate outage</strong>, continuing with an edge component under reduced maintenance posed long-term operational and security risks.</p>
<hr />
<h2 id="heading-3-existing-production-architecture-before-migration">3. Existing Production Architecture (Before Migration)</h2>
<pre><code class="lang-bash">User
  ↓
AWS LoadBalancer (auto-created by Service)
  ↓
NGINX Ingress Controller
  ↓
Application Service (ClusterIP)
  ↓
Application Pods
</code></pre>
<h3 id="heading-characteristics-of-the-existing-setup">Characteristics of the existing setup</h3>
<ul>
<li><p>Stable and functional</p>
</li>
<li><p>Easy to operate</p>
</li>
<li><p>Tightly coupled to controller-specific annotations</p>
</li>
<li><p>Limited separation between platform and application ownership</p>
</li>
</ul>
<hr />
<h2 id="heading-4-why-gateway-api">4. Why Gateway API?</h2>
<p>Kubernetes <strong>Gateway API</strong> is positioned as the <strong>successor to Ingress</strong>, designed to solve long-standing limitations.</p>
<h3 id="heading-key-improvements-over-ingress">Key improvements over Ingress</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Ingress</td><td>Gateway API</td></tr>
</thead>
<tbody>
<tr>
<td>Single resource</td><td>Role-oriented resources</td></tr>
<tr>
<td>Annotation-driven</td><td>Spec-defined configuration</td></tr>
<tr>
<td>Weak ownership boundaries</td><td>Clear infra vs app separation</td></tr>
<tr>
<td>Controller-specific behavior</td><td>Standardized API</td></tr>
</tbody>
</table>
</div><p>Gateway API introduces:</p>
<ul>
<li><p><strong>GatewayClass</strong> – defines platform capability</p>
</li>
<li><p><strong>Gateway</strong> – infrastructure-level entry point</p>
</li>
<li><p><strong>HTTPRoute</strong> – application-level routing rules</p>
</li>
</ul>
<p>This model is more scalable, auditable, and production-safe.</p>
<hr />
<h2 id="heading-5-why-envoy-gateway-in-this-case">5. Why Envoy Gateway in This Case?</h2>
<p>The cluster did <strong>not</strong> have AWS Load Balancer Controller installed.</p>
<p>Installing it mid-migration would have required:</p>
<ul>
<li><p>IAM and IRSA setup</p>
</li>
<li><p>Additional operational complexity</p>
</li>
<li><p>Increased blast radius during a live migration</p>
</li>
</ul>
<p>Instead, we chose <strong>Envoy Gateway</strong>, because it:</p>
<ul>
<li><p>Is a first-class Gateway API implementation</p>
</li>
<li><p>Does not depend on AWS-specific controllers</p>
</li>
<li><p>Creates and manages its own dataplane</p>
</li>
<li><p>Is vendor-neutral and portable</p>
</li>
<li><p>Allows parallel validation with minimal risk</p>
</li>
</ul>
<p>This decision was <strong>intentional</strong>, not a workaround.</p>
<p>I intentionally avoided introducing AWS Load Balancer Controller during migration to prevent IAM, IRSA, and cloud-controller changes from increasing the migration blast radius. The goal was to change one edge component at a time.</p>
<hr />
<h2 id="heading-6-migration-strategy-zero-downtime">6. Migration Strategy (Zero Downtime)</h2>
<p>A direct replacement was <strong>not acceptable</strong>.</p>
<h3 id="heading-chosen-strategy">Chosen strategy</h3>
<pre><code class="lang-bash">NGINX Ingress LoadBalancer  → continues serving production traffic
Envoy Gateway LoadBalancer → used <span class="hljs-keyword">for</span> validation
</code></pre>
<p>Traffic was cut over only after validation completed successfully.</p>
<p>The existing Ingress resource was left untouched to prevent configuration drift and unintended side effects during migration.</p>
<p>This ensured:</p>
<ul>
<li><p>No user impact</p>
</li>
<li><p>Easy rollback</p>
</li>
<li><p>Controlled blast radius</p>
</li>
</ul>
<hr />
<h2 id="heading-7-step-by-step-implementation">7. Step-by-Step Implementation</h2>
<h3 id="heading-step-1-application-deployment-already-in-place">Step 1: Application Deployment (Already in Place)</h3>
<p>The application was deployed with:</p>
<ul>
<li><p>Kubernetes <code>Deployment</code></p>
</li>
<li><p><code>Service</code> of type <code>ClusterIP</code></p>
</li>
</ul>
<p>No changes were required at the application level.</p>
<hr />
<h3 id="heading-step-2-nginx-ingress-existing-production-entry">Step 2: NGINX Ingress (Existing Production Entry)</h3>
<p>NGINX Ingress Controller was already installed and exposed the application via an AWS LoadBalancer.</p>
<p>This remained untouched during the migration.</p>
<hr />
<h3 id="heading-step-3-install-gateway-api-crds">Step 3: Install Gateway API CRDs</h3>
<p>Gateway API resources must exist before any controller can operate.</p>
<pre><code class="lang-bash">kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
</code></pre>
<hr />
<h3 id="heading-step-4-install-envoy-gateway">Step 4: Install Envoy Gateway</h3>
<p>Envoy Gateway was installed using Helm via OCI registry.</p>
<pre><code class="lang-bash">helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.7.0 \
  -n envoy-gateway-system \
  --create-namespace
</code></pre>
<blockquote>
<p>The Envoy Gateway version was explicitly pinned to v1.7.0 after verifying compatibility with Gateway API v1.0.0 and the EKS cluster version.<br />Version pinning ensures deterministic deployments, reproducibility, and safe rollback capability in production environments.</p>
</blockquote>
<hr />
<h3 id="heading-step-5-create-gatewayclass-platform-ownership">Step 5: Create GatewayClass (Platform Ownership)</h3>
<pre><code class="lang-bash">apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
</code></pre>
<p>This explicitly defined <strong>Envoy Gateway</strong> as the cluster’s Gateway API implementation.</p>
<hr />
<h3 id="heading-step-6-create-gateway-infrastructure-entry-point">Step 6: Create Gateway (Infrastructure Entry Point)</h3>
<pre><code class="lang-bash">apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: app-gateway
  namespace: default
spec:
  gatewayClassName: envoy
  listeners:
  - name: http
    protocol: HTTP
    port: 80
</code></pre>
<p>This created a <strong>new AWS LoadBalancer</strong>, separate from the existing NGINX Ingress LB.</p>
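<p>Once the Gateway was programmed, the new LoadBalancer hostname could be read from its status:</p>
<pre><code class="lang-bash">kubectl get gateway app-gateway -n default
</code></pre>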
<hr />
<h3 id="heading-step-7-create-httproute-application-routing">Step 7: Create HTTPRoute (Application Routing)</h3>
<pre><code class="lang-bash">apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app
  namespace: default
spec:
  parentRefs:
  - name: app-gateway
  rules:
  - matches:
    - path:
        <span class="hljs-built_in">type</span>: PathPrefix
        value: /
    backendRefs:
    - name: app
      port: 8088
</code></pre>
<p>This replaced the Ingress routing logic using Gateway API primitives.</p>
<hr />
<h2 id="heading-8-validation">8. Validation</h2>
<p>At this stage:</p>
<p>NGINX LB → Production users<br />Gateway LB → Validation traffic</p>
<p>Validation was performed at multiple levels:</p>
<h3 id="heading-application-layer">Application Layer</h3>
<ul>
<li><p>Verified HTTP 200 responses using curl (see the sketch after this list)</p>
</li>
<li><p>Tested authentication flows</p>
</li>
<li><p>Executed critical user workflows</p>
</li>
<li><p>Confirmed session persistence behavior</p>
</li>
</ul>
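<p>A sketch of the curl check against the validation LoadBalancer (the hostname is illustrative):</p>
<pre><code class="lang-bash"># Expect 200 from the Gateway entry point before any cutover
curl -s -o /dev/null -w "%{http_code}\n" http://&lt;gateway-lb-dns&gt;/
</code></pre>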
<h3 id="heading-infrastructure-layer">Infrastructure Layer</h3>
<ul>
<li><p>Checked LoadBalancer health check status</p>
</li>
<li><p>Verified readiness and liveness probes</p>
</li>
<li><p>Monitored pod logs for errors or unexpected restarts</p>
</li>
<li><p>Confirmed correct backend service port mapping</p>
</li>
<li><p>Reviewed Envoy Gateway metrics and controller logs to ensure no reconciliation errors or route attachment failures were present.</p>
</li>
</ul>
<h3 id="heading-traffic-amp-stability">Traffic &amp; Stability</h3>
<ul>
<li><p>Compared response latency between both entry points</p>
</li>
<li><p>Monitored 4xx and 5xx error rates</p>
</li>
<li><p>Verified no increase in backend CPU or memory usage</p>
</li>
</ul>
<p>Only after all validation checkpoints passed was production cutover approved.</p>
<h2 id="heading-9-cost-considerations-during-migration"><strong>9. Cost Considerations During Migration</strong></h2>
<p>Running NGINX Ingress and Envoy Gateway in parallel resulted in two active AWS LoadBalancers during the validation window, temporarily increasing infrastructure cost.</p>
<p>However:</p>
<ul>
<li><p>The overlap period was intentionally short.</p>
</li>
<li><p>The additional cost was justified to eliminate downtime risk.</p>
</li>
<li><p>The parallel approach reduced blast radius during migration.</p>
</li>
</ul>
<p>Cost was intentionally traded for reliability and controlled risk.</p>
<h2 id="heading-10-cutover-and-cleanup">10. Cutover and Cleanup</h2>
<p>After all validation checks passed:</p>
<pre><code class="lang-bash">kubectl delete ingress app-ingress
</code></pre>
<blockquote>
<p>Traffic shift was verified immediately after deletion by validating active connections on the Gateway LoadBalancer and confirming healthy backend responses.</p>
</blockquote>
<p>The legacy NGINX Ingress was removed only after confirming stable traffic flow through the Gateway LoadBalancer.</p>
<p>Rollback plan:</p>
<ul>
<li><p>Re-apply the Ingress resource if needed</p>
</li>
<li><p>Restore DNS if traffic switch involved domain update</p>
</li>
</ul>
<p>The migration was reversible during the validation window.</p>
<p>Optionally, after a stability window:</p>
<pre><code class="lang-bash">helm uninstall ingress-nginx -n ingress-nginx
</code></pre>
<p>The Gateway API entry point became the sole production path.</p>
<hr />
<h2 id="heading-11-final-architecture-after-migration">11. Final Architecture (After Migration)</h2>
<pre><code class="lang-bash">User
  ↓
AWS LoadBalancer
  ↓
Envoy Gateway (Gateway API)
  ↓
Application Service
  ↓
Application Pods
</code></pre>
<hr />
<h2 id="heading-12-key-learnings">12. Key Learnings</h2>
<ol>
<li><p><strong>Gateway without HTTPRoute does nothing</strong> — infrastructure and routing are intentionally separated</p>
</li>
<li><p>Gateway API enforces clearer ownership boundaries than Ingress</p>
</li>
<li><p>Parallel migration is the safest approach for production workloads</p>
</li>
<li><p>Envoy Gateway is an effective bridge when cloud-native controllers are not yet in place</p>
</li>
</ol>
<hr />
<h2 id="heading-13-when-would-aws-load-balancer-controller-be-used">13. When Would AWS Load Balancer Controller Be Used?</h2>
<p>In a later phase, once the platform is stable on the Gateway API and deeper AWS-native integration (such as ALB/NLB resources provisioned directly by the controller) becomes the priority.</p>
<p>Typical evolution:</p>
<pre><code class="lang-bash">NGINX Ingress
→ Envoy Gateway (Gateway API adoption)
→ AWS Load Balancer Controller (cloud-native optimization)
</code></pre>
<hr />
<h2 id="heading-14-failure-scenarios-considered">14. Failure Scenarios Considered</h2>
<p>The following risks were evaluated before migration:</p>
<ul>
<li><p>Gateway created without HTTPRoute (no traffic routing)</p>
</li>
<li><p>Incorrect backend service port reference</p>
</li>
<li><p>Namespace mismatch between Gateway and HTTPRoute</p>
</li>
<li><p>LoadBalancer health check failures</p>
</li>
<li><p>Controller crash or misconfiguration</p>
</li>
<li><p>Gateway API CRD and controller version mismatch</p>
</li>
<li><p>DNS TTL delays during traffic switch</p>
</li>
</ul>
<p>By running both entry points in parallel, these risks were isolated and mitigated.</p>
<h2 id="heading-15-final-takeaway">15. Final Takeaway</h2>
<p>I designed and executed a zero-downtime migration from NGINX Ingress to Gateway API by running both entry points in parallel.</p>
<p>I validated routing behavior, health checks, infrastructure readiness, and traffic stability before shifting production traffic.</p>
<p>This approach reduced blast radius, preserved service availability, and aligned the platform with Kubernetes’ evolving networking model.</p>
]]></content:encoded></item><item><title><![CDATA[Production Incident: Node.js Application Did Not Start After Server Reboot (PM2 + systemd Fix)]]></title><description><![CDATA[Context
We were running a Node.js backend using PM2 on a Linux server.
Application details:

Process manager: PM2

Mode: fork

User: root

Deployment: Manual setup on VM

No containerization

No autoscaling


The service was running fine in steady st...]]></description><link>https://devopsofworld.com/production-incident-nodejs-application-did-not-start-after-server-reboot-pm2-systemd-fix</link><guid isPermaLink="true">https://devopsofworld.com/production-incident-nodejs-application-did-not-start-after-server-reboot-pm2-systemd-fix</guid><category><![CDATA[production-incident]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[pm2]]></category><category><![CDATA[Linux]]></category><category><![CDATA[systemd]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Wed, 18 Feb 2026 03:30:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770961439314/2f628061-0a40-4d15-828f-fb1a3ec5a601.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-context">Context</h2>
<p>We were running a Node.js backend using PM2 on a Linux server.</p>
<p>Application details:</p>
<ul>
<li><p>Process manager: PM2</p>
</li>
<li><p>Mode: fork</p>
</li>
<li><p>User: root</p>
</li>
<li><p>Deployment: Manual setup on VM</p>
</li>
<li><p>No containerization</p>
</li>
<li><p>No autoscaling</p>
</li>
</ul>
<p>The service was running fine in steady state.</p>
<hr />
<h2 id="heading-incident-summary"><strong>Incident Summary</strong></h2>
<p>• Trigger: Server reboot during OS patching</p>
<p>• Impact: Application unavailable for 18 minutes</p>
<p>• Root Cause: PM2 not registered with systemd</p>
<p>• Resolution: Integrated PM2 with systemd and enabled process resurrection</p>
<h2 id="heading-incident-timeline">Incident Timeline</h2>
<p>The server was restarted as part of routine OS patching.</p>
<p>After the reboot:</p>
<ul>
<li><p>The server came up successfully</p>
</li>
<li><p>SSH access was normal</p>
</li>
<li><p>But the backend application was down</p>
</li>
<li><p>API health checks failed</p>
</li>
<li><p>External traffic started returning errors</p>
</li>
</ul>
<p>Running:</p>
<pre><code class="lang-bash">pm2 ls
</code></pre>
<p>This returned an empty process list: the PM2 daemon itself had not been started after the reboot.</p>
<hr />
<h2 id="heading-impact">Impact</h2>
<ul>
<li><p>Application downtime until manual intervention</p>
</li>
<li><p>No auto-recovery mechanism</p>
</li>
<li><p>Increased MTTR</p>
</li>
<li><p>Hidden operational risk exposed</p>
</li>
</ul>
<p>This exposed a design gap.</p>
<hr />
<h2 id="heading-detection"><strong>Detection</strong></h2>
<p>The issue was detected via failed API health checks after reboot.</p>
<p>There was no alert configured to monitor the PM2 daemon state.</p>
<p>Downtime lasted approximately 18 minutes until manual intervention restored the service.</p>
<h2 id="heading-root-cause-analysis">Root Cause Analysis</h2>
<p>PM2 had previously been started manually in an interactive shell session.</p>
<p>It was never registered with systemd.</p>
<p>As a result, it did not start automatically after reboot.</p>
<p>There was:</p>
<ul>
<li><p>No systemd integration</p>
</li>
<li><p>No startup registration</p>
</li>
<li><p>No process resurrection configuration</p>
</li>
</ul>
<p>On reboot:</p>
<pre><code class="lang-bash">System Boot
  ↓
No PM2 daemon started
  ↓
No Node process started
  ↓
Application Down
</code></pre>
<p>This was not a runtime failure.</p>
<p>This was a lifecycle management design failure.</p>
<hr />
<h2 id="heading-design-correction">Design Correction</h2>
<p>I redesigned the startup flow to align with production expectations.</p>
<h3 id="heading-goal">Goal</h3>
<p>Ensure that:</p>
<ul>
<li><p>PM2 daemon starts automatically on boot</p>
</li>
<li><p>Saved processes are restored</p>
</li>
<li><p>No manual intervention required</p>
</li>
</ul>
<hr />
<h2 id="heading-implementation">Implementation</h2>
<h3 id="heading-step-1-register-pm2-with-systemd">Step 1: Register PM2 with systemd</h3>
<pre><code class="lang-bash">pm2 startup
</code></pre>
<p>This generated a systemd unit file:</p>
<pre><code class="lang-bash">/etc/systemd/system/pm2-root.service
</code></pre>
<p>And enabled it:</p>
<pre><code class="lang-bash">systemctl <span class="hljs-built_in">enable</span> pm2-root
</code></pre>
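<p>A quick confirmation that the unit is now registered for boot (assuming the generated unit name above):</p>
<pre><code class="lang-bash">systemctl is-enabled pm2-root
</code></pre>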
<hr />
<h3 id="heading-step-2-freeze-running-processes">Step 2: Freeze Running Processes</h3>
<pre><code class="lang-bash">pm2 save
</code></pre>
<p>This created:</p>
<pre><code class="lang-bash">/root/.pm2/dump.pm2
</code></pre>
<p>Without this file, resurrection would not occur.</p>
<hr />
<h3 id="heading-step-3-ensure-systemd-controls-pm2">Step 3: Ensure systemd Controls PM2</h3>
<p>Initially:</p>
<pre><code class="lang-bash">systemctl status pm2-root
</code></pre>
<p>Showed:</p>
<pre><code class="lang-bash">inactive (dead)
</code></pre>
<p>This meant PM2 was still running from the earlier interactive shell session, not under systemd.</p>
<p>Corrected by:</p>
<pre><code class="lang-bash">pm2 <span class="hljs-built_in">kill</span>
systemctl start pm2-root
</code></pre>
<p>Now:</p>
<pre><code class="lang-bash">Active: active (running)
</code></pre>
<hr />
<h2 id="heading-final-boot-flow-production-aligned">Final Boot Flow (Production Aligned)</h2>
<pre><code class="lang-bash">System Boot
   ↓
systemd
   ↓
pm2-root.service
   ↓
pm2 resurrect
   ↓
Node Application Starts
</code></pre>
<hr />
<h2 id="heading-validation">Validation</h2>
<p>Server reboot was performed.</p>
<p>Post-reboot validation:</p>
<pre><code class="lang-bash">pm2 ls
systemctl status pm2-root
</code></pre>
<p>Result:</p>
<ul>
<li><p>The application started automatically</p>
</li>
<li><p>No manual intervention was required</p>
</li>
<li><p>MTTR for reboot-related events dropped to near zero</p>
</li>
</ul>
<hr />
<h2 id="heading-rollback-strategy"><strong>Rollback Strategy</strong></h2>
<p>If the systemd integration failed:</p>
<p>• Disable pm2-root service</p>
<p>• Manually start PM2 using pm2 start</p>
<p>• Validate application health endpoint</p>
<p>• Restore previous working state</p>
<p>This ensured there was a recovery path during configuration changes.</p>
<h2 id="heading-preventive-measures"><strong>Preventive Measures</strong></h2>
<p>• Standardized server bootstrap process to register PM2 with systemd</p>
<p>• Added reboot validation checklist after OS patching</p>
<p>• Integrated service state checks into monitoring alerts (see the sketch below)</p>
<p>• Planned migration to a dedicated service user</p>
<p>• Documented lifecycle management requirements</p>
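<p>As a minimal example of such a service-state check, a cron-driven probe can alert when the PM2 unit or the application health endpoint is down (a sketch; the port, endpoint, and alerting hook are placeholders):</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# Probe the PM2 unit state and a hypothetical app health endpoint; log an alert on failure.
set -euo pipefail

if ! systemctl is-active --quiet pm2-root; then
  echo "ALERT: pm2-root unit is not active" | logger -t pm2-probe
  exit 1
fi

# /health on port 3000 is a placeholder; point this at the real endpoint.
curl -sf http://localhost:3000/health &gt;/dev/null || {
  echo "ALERT: application health check failed" | logger -t pm2-probe
  exit 1
}
</code></pre>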
<h2 id="heading-risks-identified">Risks Identified</h2>
<h3 id="heading-1-running-as-root">1. Running as Root</h3>
<p>PM2 was configured under root.</p>
<p>Risk:</p>
<ul>
<li><p>Larger blast radius in case of compromise</p>
</li>
<li><p>Violation of the principle of least privilege</p>
</li>
</ul>
<p>Future improvement:</p>
<ul>
<li>Dedicated service user</li>
</ul>
<hr />
<h3 id="heading-2-using-nvm-for-node">2. Using NVM for Node</h3>
<p>Systemd environment path contained:</p>
<pre><code class="lang-bash">/root/.nvm/versions/node/...
</code></pre>
<p>Risk:</p>
<ul>
<li><p>Node version changes may break startup</p>
</li>
<li><p>NVM is not ideal for production servers</p>
</li>
</ul>
<p>Better design:</p>
<ul>
<li><p>Install Node globally</p>
</li>
<li><p>Lock version</p>
</li>
</ul>
<hr />
<h2 id="heading-production-takeaway">Production Takeaway</h2>
<p>In production systems, every long-running process must be supervised by the system init layer.</p>
<p>Running does not imply lifecycle management.</p>
<p>If the init system does not supervise your process, you do not have a resilient system.</p>
<p>The failure was not due to Node. Not due to PM2. Not due to application code.</p>
<p>It was a lifecycle management design gap.</p>
]]></content:encoded></item><item><title><![CDATA[Deploying Apache Superset on Kubernetes (Helm): From Chaos to Production]]></title><description><![CDATA[Introduction
Deploying Apache Superset on Kubernetes using the official Helm chart appears straightforward when following the documentation. In real-world environments, however, production deployments often expose issues across multiple layers — Helm...]]></description><link>https://devopsofworld.com/deploying-apache-superset-on-kubernetes-helm-from-chaos-to-production</link><guid isPermaLink="true">https://devopsofworld.com/deploying-apache-superset-on-kubernetes-helm-from-chaos-to-production</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Apache Superset]]></category><category><![CDATA[Helm]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AWS]]></category><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 17 Feb 2026 03:30:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771133577563/be806fd0-daa2-4d66-bb9b-52fe7e00612a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Deploying Apache Superset on Kubernetes using the official Helm chart appears straightforward when following the documentation. In real-world environments, however, production deployments often expose issues across multiple layers — Helm dependency resolution, container image integrity, Python runtime behavior, database connectivity, and secret management.</p>
<p>This article walks through a real-world failure analysis, explains the root causes, and documents the production-ready deployment that supports:</p>
<ul>
<li><p>In-cluster PostgreSQL &amp; Redis</p>
</li>
<li><p>External PostgreSQL (e.g., AWS RDS) &amp; External Redis</p>
</li>
<li><p>Optional Kubernetes Secret–based credential injection</p>
</li>
</ul>
<p>The final architecture is flexible, secure, and restart-safe.</p>
<hr />
<h2 id="heading-1-problem-statement">1. Problem Statement</h2>
<p>We attempted to deploy Apache Superset on Kubernetes using the official Helm chart.</p>
<h3 id="heading-target-setup">Target Setup</h3>
<ul>
<li><p>Apache Superset (Web + Celery Worker)</p>
</li>
<li><p>PostgreSQL (metadata database)</p>
</li>
<li><p>Redis (Celery broker and caching)</p>
</li>
<li><p>Kubernetes</p>
</li>
<li><p>Helm-based deployment</p>
</li>
<li><p>Custom Superset image</p>
</li>
<li><p>Optional external PostgreSQL (AWS RDS)</p>
</li>
<li><p>Optional external Redis</p>
</li>
</ul>
<h3 id="heading-expected-outcome">Expected Outcome</h3>
<ul>
<li><p>Superset UI accessible</p>
</li>
<li><p>Database migrations completed successfully</p>
</li>
<li><p>Celery workers start without errors</p>
</li>
<li><p>Stable across restarts</p>
</li>
<li><p>Secure credential handling</p>
</li>
</ul>
<h3 id="heading-what-actually-happened">What Actually Happened</h3>
<p>The deployment failed at multiple stages:</p>
<ul>
<li><p>Dependency image pull failures</p>
</li>
<li><p>Python module errors inside the container</p>
</li>
<li><p>Runtime package installation failures</p>
</li>
<li><p>SECRET_KEY validation error</p>
</li>
<li><p>Database connectivity issues</p>
</li>
</ul>
<p>This was a multi-layer failure — not a single misconfiguration.</p>
<hr />
<h2 id="heading-2-issue-1-postgresql-and-redis-images-not-found">2. Issue #1: PostgreSQL and Redis Images Not Found</h2>
<h3 id="heading-observed-error">Observed Error</h3>
<pre><code class="lang-bash">ImagePullBackOff
Failed to pull image
not found
</code></pre>
<p>Both PostgreSQL and Redis pods failed to start.</p>
<h3 id="heading-root-cause">Root Cause</h3>
<p>The Helm chart referenced specific image tags that were no longer available in the container registry.</p>
<p>Helm does not validate tag existence.<br />Kubernetes only detects the failure during image pull.</p>
<p>Until dependencies are healthy:</p>
<ul>
<li><p>Superset init job cannot complete</p>
</li>
<li><p>Application errors remain hidden</p>
</li>
<li><p>Debugging becomes misleading</p>
</li>
</ul>
<p>Infrastructure must be stable before diagnosing application issues.</p>
<hr />
<h2 id="heading-3-fix-1-diagnostic-use-of-latest">3. Fix #1: Diagnostic Use of <code>latest</code></h2>
<p>To confirm whether the issue was caused by deprecated image tags or application logic, dependency images were temporarily switched to <code>latest</code>.</p>
<pre><code class="lang-yaml">postgresql:
  image:
    tag: latest

redis:
  image:
    tag: latest
</code></pre>
<p>This confirmed:</p>
<ul>
<li><p>The Helm chart’s default tags were deprecated.</p>
</li>
<li><p>The infrastructure was blocking deployment.</p>
</li>
<li><p>Superset itself was not the initial issue.</p>
</li>
</ul>
<p>⚠ The <code>latest</code> tag was used only for diagnostics.<br />In production environments, pinned image versions are recommended for deterministic deployments.</p>
<p>Once dependencies were running, the real application error surfaced.</p>
<hr />
<h2 id="heading-4-issue-2-psycopg2-module-missing">4. Issue #2: psycopg2 Module Missing</h2>
<p>Superset failed with:</p>
<pre><code class="lang-bash">ModuleNotFoundError: No module named <span class="hljs-string">'psycopg2'</span>
</code></pre>
<p>This affected:</p>
<ul>
<li><p>Superset Web pod</p>
</li>
<li><p>Superset Worker pod</p>
</li>
<li><p>Superset Init DB job</p>
</li>
</ul>
<hr />
<h2 id="heading-5-why-this-breaks-superset">5. Why This Breaks Superset</h2>
<p>Superset requires a metadata database.</p>
<p>Dependency chain:</p>
<pre><code class="lang-bash">Superset → SQLAlchemy → psycopg2 → PostgreSQL
</code></pre>
<p>If psycopg2 is missing:</p>
<ul>
<li><p>Superset cannot start</p>
</li>
<li><p>Database migrations fail</p>
</li>
<li><p>Celery workers fail</p>
</li>
<li><p>No fallback mode exists</p>
</li>
</ul>
<hr />
<h2 id="heading-6-why-runtime-installation-failed">6. Why Runtime Installation Failed</h2>
<p>Attempts included:</p>
<ul>
<li><p><code>extraPipPackages</code></p>
</li>
<li><p><code>bootstrapScript</code></p>
</li>
<li><p>Installing packages inside running pods</p>
</li>
<li><p>Init container installation</p>
</li>
</ul>
<p>All failed.</p>
<h3 id="heading-root-cause-1">Root Cause</h3>
<p>The official Superset image runs inside a prebuilt Python virtual environment:</p>
<pre><code class="lang-bash">/app/.venv/
</code></pre>
<p>Key details:</p>
<ul>
<li><p>Superset executes strictly inside this environment</p>
</li>
<li><p>Runtime installations either failed outright or placed packages outside the active environment</p>
</li>
<li><p>Post-start installation violated container immutability</p>
</li>
</ul>
<p>Even when psycopg2 appeared installed, it was outside Superset’s active virtual environment — making it effectively unusable.</p>
<hr />
<h2 id="heading-7-correct-fix-build-a-custom-immutable-superset-image">7. Correct Fix: Build a Custom Immutable Superset Image</h2>
<p>Database drivers must be installed at image build time.</p>
<h3 id="heading-dockerfile-used">Dockerfile Used</h3>
<pre><code class="lang-dockerfile">FROM apachesuperset.docker.scarf.sh/apache/superset:3.0.0

USER root

RUN apt-get update &amp;&amp; apt-get install -y libpq-dev gcc \
 &amp;&amp; /app/.venv/bin/python -m ensurepip --upgrade \
 &amp;&amp; /app/.venv/bin/python -m pip install --no-cache-dir psycopg2==2.9.9

USER superset
</code></pre>
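<p>Building and publishing the image, then pointing the chart at it, looks roughly like this (a sketch; the registry path and tag are placeholders):</p>
<pre><code class="lang-bash">docker build -t registry.example.com/analytics/superset:3.0.0-psycopg2 .
docker push registry.example.com/analytics/superset:3.0.0-psycopg2
</code></pre>
<p>The Helm values then reference this image via the chart’s <code>image.repository</code> and <code>image.tag</code> fields.</p>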
<h3 id="heading-why-this-works">Why This Works</h3>
<ul>
<li><p>Installs psycopg2 inside Superset’s active virtual environment</p>
</li>
<li><p>Immutable and reproducible</p>
</li>
<li><p>Restart-safe</p>
</li>
<li><p>Production aligned</p>
</li>
</ul>
<hr />
<h2 id="heading-8-flexible-credential-management">8. Flexible Credential Management</h2>
<p>Superset supports multiple ways to provide database and Redis credentials.</p>
<h3 id="heading-option-a-directly-in-helm-values-testing-only">Option A – Directly in Helm Values (Testing Only)</h3>
<pre><code class="lang-yaml">supersetNode:
  connections:
    db_type: postgresql
    db_host: my-db-endpoint
    db_port: <span class="hljs-string">"5432"</span>
    db_user: superset
    db_pass: superset123
    db_name: superset
</code></pre>
<p>Suitable for:</p>
<ul>
<li><p>Local testing</p>
</li>
<li><p>Temporary debugging</p>
</li>
<li><p>Learning environments</p>
</li>
</ul>
<p>⚠ Credentials stored in plaintext.</p>
<hr />
<h3 id="heading-option-b-kubernetes-secret-injection-recommended">Option B – Kubernetes Secret Injection (Recommended)</h3>
<p>Instead of storing credentials in Helm values, they can be injected securely.</p>
<h3 id="heading-create-secret">Create Secret</h3>
<pre><code class="lang-bash">kubectl create secret generic superset-backend-secret \
  --from-literal=DB_HOST=&lt;db-endpoint&gt; \
  --from-literal=DB_PORT=5432 \
  --from-literal=DB_USER=&lt;db-user&gt; \
  --from-literal=DB_PASSWORD=&lt;db-password&gt; \
  --from-literal=DB_NAME=&lt;db-name&gt; \
  --from-literal=REDIS_HOST=&lt;redis-endpoint&gt; \
  --from-literal=REDIS_PORT=6379
</code></pre>
<h3 id="heading-reference-secret-in-helm-values">Reference Secret in Helm Values</h3>
<pre><code class="lang-yaml">envFromSecrets:
  - superset-backend-secret
</code></pre>
<p>Superset connections then use environment variables:</p>
<pre><code class="lang-yaml">supersetNode:
  connections:
    db_type: postgresql
    db_host: <span class="hljs-string">"<span class="hljs-subst">$(DB_HOST)</span>"</span>
    db_port: <span class="hljs-string">"<span class="hljs-subst">$(DB_PORT)</span>"</span>
    db_user: <span class="hljs-string">"<span class="hljs-subst">$(DB_USER)</span>"</span>
    db_pass: <span class="hljs-string">"<span class="hljs-subst">$(DB_PASSWORD)</span>"</span>
    db_name: <span class="hljs-string">"<span class="hljs-subst">$(DB_NAME)</span>"</span>
    redis_host: <span class="hljs-string">"<span class="hljs-subst">$(REDIS_HOST)</span>"</span>
    redis_port: <span class="hljs-string">"<span class="hljs-subst">$(REDIS_PORT)</span>"</span>
</code></pre>
<p>Benefits:</p>
<ul>
<li><p>No plaintext credentials in Git</p>
</li>
<li><p>Secure runtime injection</p>
</li>
<li><p>Easier rotation</p>
</li>
<li><p>Environment portability</p>
</li>
</ul>
<p>Using Kubernetes Secrets is optional but strongly recommended for production.</p>
<hr />
<h2 id="heading-9-database-amp-redis-architecture-options">9. Database &amp; Redis Architecture Options</h2>
<p>Superset supports two architectural modes.</p>
<hr />
<h2 id="heading-option-1-in-cluster-postgresql-amp-redis">Option 1 – In-Cluster PostgreSQL &amp; Redis</h2>
<p>Enable Helm-managed dependencies:</p>
<pre><code class="lang-yaml">postgresql:
  enabled: <span class="hljs-literal">true</span>

redis:
  enabled: <span class="hljs-literal">true</span>
</code></pre>
<p>Best for:</p>
<ul>
<li><p>Development</p>
</li>
<li><p>Testing</p>
</li>
<li><p>Small internal tools</p>
</li>
</ul>
<p>Pros:</p>
<ul>
<li><p>Simple</p>
</li>
<li><p>Self-contained</p>
</li>
</ul>
<p>Cons:</p>
<ul>
<li><p>You manage backups</p>
</li>
<li><p>You manage scaling</p>
</li>
<li><p>Higher operational overhead</p>
</li>
</ul>
<hr />
<h2 id="heading-option-2-external-postgresql-amp-redis-optional">Option 2 – External PostgreSQL &amp; Redis (Optional)</h2>
<p>Disable internal services:</p>
<pre><code class="lang-yaml">postgresql:
  enabled: <span class="hljs-literal">false</span>

redis:
  enabled: <span class="hljs-literal">false</span>
</code></pre>
<p>Best for:</p>
<ul>
<li><p>Production</p>
</li>
<li><p>High availability needs</p>
</li>
<li><p>Managed backups</p>
</li>
<li><p>Reduced operational risk</p>
</li>
</ul>
<p>Pros:</p>
<ul>
<li><p>Managed durability</p>
</li>
<li><p>Better reliability</p>
</li>
<li><p>Clear stateless/stateful separation</p>
</li>
</ul>
<p>External services are optional — the deployment remains flexible.</p>
<blockquote>
<p>The final production architecture is designed to support both Helm-managed in-cluster stateful services and externally managed database/cache services (such as AWS RDS and ElastiCache), ensuring operational flexibility and scalability across environments.</p>
</blockquote>
<hr />
<h2 id="heading-10-enforcing-ssl-for-database-connections">10. Enforcing SSL for Database Connections</h2>
<pre><code class="lang-python">import os

SQLALCHEMY_DATABASE_URI = (
    f<span class="hljs-string">"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"</span>
    f<span class="hljs-string">"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"</span>
    <span class="hljs-string">"?sslmode=require"</span>
)
</code></pre>
<p>Ensures encrypted communication with PostgreSQL.</p>
<hr />
<h2 id="heading-11-startup-readiness-handling">11. Startup Readiness Handling</h2>
<h3 id="heading-init-containers-wait-for-db-and-redis">Init containers wait for DB and Redis:</h3>
<pre><code class="lang-yaml"><span class="hljs-built_in">command</span>:
  - dockerize
  - -<span class="hljs-built_in">wait</span>
  - tcp://$(DB_HOST):$(DB_PORT)
  - -<span class="hljs-built_in">wait</span>
  - tcp://$(REDIS_HOST):$(REDIS_PORT)
  - -timeout
  - 120s
</code></pre>
<p>Prevents:</p>
<ul>
<li><p>CrashLoopBackOff</p>
</li>
<li><p>Early DB connection failures</p>
</li>
<li><p>Celery startup issues</p>
</li>
</ul>
<hr />
<h2 id="heading-12-secure-supersetsecretkey">12. Secure SUPERSET_SECRET_KEY</h2>
<pre><code class="lang-yaml">extraSecretEnv:
  SUPERSET_SECRET_KEY: &lt;strong-random-secret&gt;
</code></pre>
<p>Superset refuses to start without a secure secret key.</p>
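<p>A sufficiently strong value can be generated once with OpenSSL and stored with the rest of the secrets (a quick sketch):</p>
<pre><code class="lang-bash">openssl rand -base64 42
</code></pre>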
<hr />
<h2 id="heading-13-final-deployment">13. Final Deployment</h2>
<pre><code class="lang-bash">helm upgrade --install superset apache/superset \
  -f values.yaml \
  --namespace superset \
  --create-namespace
</code></pre>
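<p>A quick post-deployment check (a sketch; exact job and pod names vary with the release name):</p>
<pre><code class="lang-bash">kubectl get pods -n superset
kubectl get jobs -n superset   # the init-db job should report COMPLETIONS 1/1
</code></pre>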
<hr />
<h2 id="heading-14-final-root-cause-summary">14. Final Root Cause Summary</h2>
<p>Deployment failed due to:</p>
<ul>
<li><p>Deprecated dependency image tags</p>
</li>
<li><p>Missing psycopg2 driver in container</p>
</li>
<li><p>Runtime package installation incompatible with Superset’s prebuilt virtual environment</p>
</li>
<li><p>Missing secure SECRET_KEY</p>
</li>
</ul>
<p>Resolution involved:</p>
<ul>
<li><p>Diagnosing infrastructure image failures</p>
</li>
<li><p>Building a custom immutable Superset image</p>
</li>
<li><p>Securely injecting credentials</p>
</li>
<li><p>Supporting flexible DB/Redis architecture</p>
</li>
<li><p>Enforcing SSL</p>
</li>
<li><p>Implementing readiness checks</p>
</li>
</ul>
<hr />
<h2 id="heading-15-30-second-summary">15. 30-Second Summary</h2>
<p>Apache Superset initially failed due to deprecated dependency image tags and a missing PostgreSQL driver inside the container. Runtime installation failed because the official Superset image runs inside a prebuilt Python virtual environment, making post-start package installation ineffective. The issue was resolved by building a custom immutable image with psycopg2 installed at build time, securely managing credentials, and supporting both in-cluster and external database/Redis architectures. The final deployment is stable, secure, and production-ready.</p>
<p><strong>Keywords:</strong><br />Apache Superset Kubernetes,<br />Superset Helm Chart,<br />Superset Production Deployment,<br />psycopg2 error in Superset,<br />Kubernetes ImagePullBackOff,<br />Superset with AWS RDS,<br />Superset External PostgreSQL,<br />Superset Redis configuration</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes Outage Postmortem: Nodes Stuck in NotReady Due to CNI Failure]]></title><description><![CDATA[Recently, we encountered a critical production outage in our Kubernetes cluster. New nodes provisioned during autoscaling remained in a NotReady state, leading to service disruptions and failed health checks across workloads.
In this post, I’ll walk ...]]></description><link>https://devopsofworld.com/kubernetes-outage-postmortem-nodes-stuck-in-notready-due-to-cni-failure</link><guid isPermaLink="true">https://devopsofworld.com/kubernetes-outage-postmortem-nodes-stuck-in-notready-due-to-cni-failure</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[cni]]></category><category><![CDATA[calico]]></category><category><![CDATA[Devops]]></category><category><![CDATA[postmortem]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Wed, 28 May 2025 10:30:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747926151051/bd7a91c1-4854-4968-a1b7-d1fb7f0847a7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, we encountered a <strong>critical production outage</strong> in our Kubernetes cluster. New nodes provisioned during autoscaling remained in a <code>NotReady</code> state, leading to service disruptions and failed health checks across workloads.</p>
<p>In this post, I’ll walk you through:</p>
<ul>
<li><p>What caused the issue</p>
</li>
<li><p>How we identified and resolved it</p>
</li>
<li><p>Best practices to prevent similar failures in your clusters</p>
</li>
</ul>
<hr />
<h2 id="heading-what-went-wrong">🔥 What Went Wrong</h2>
<p>During a surge in traffic, our cluster autoscaler kicked in and added new nodes. However, these nodes failed to become <strong>Ready</strong>, resulting in:</p>
<ul>
<li><p>❌ Workloads not scheduled</p>
</li>
<li><p>❌ Services unreachable</p>
</li>
<li><p>❌ Health checks failing, pods crashing</p>
</li>
</ul>
<p>A quick check with:</p>
<pre><code class="lang-bash">kubectl get nodes
</code></pre>
<p>revealed multiple entries like:</p>
<pre><code class="lang-bash">ip-node-ip.eu-west-1.compute.internal   NotReady
</code></pre>
<p>To dig deeper, we inspected system logs and found this error:</p>
<pre><code class="lang-bash">container runtime network not ready: NetworkReady=<span class="hljs-literal">false</span>
NetworkPluginNotReady: docker: network plugin is not ready: cni config uninitialized
</code></pre>
<hr />
<h2 id="heading-root-cause-cni-plugin-failure-calico">⚠️ Root Cause: CNI Plugin Failure (Calico)</h2>
<p>These errors indicated a <strong>CNI (Container Network Interface) misconfiguration</strong>. Our cluster was using <strong>Calico</strong> as the CNI, and it wasn’t initializing properly.</p>
<p>The Calico pods responsible for managing the network stack were either stuck or not starting due to missing configurations.</p>
<hr />
<h2 id="heading-how-we-fixed-it">🛠️ How We Fixed It</h2>
<h3 id="heading-1-delete-and-recreate-calico-pods">1. Delete and Recreate Calico Pods</h3>
<p>Force Kubernetes to restart Calico:</p>
<pre><code class="lang-bash">kubectl delete pod -n kube-system -l k8s-app=calico-node
</code></pre>
<p>This recreated the Calico pods with the current (and correct) configuration.</p>
<hr />
<h3 id="heading-2-reapply-calico-configuration">2. Reapply Calico Configuration</h3>
<p>On some nodes, the CNI config was missing or corrupted:</p>
<pre><code class="lang-bash">/etc/cni/net.d/10-calico.conflist
</code></pre>
<p>We reinstalled Calico using the official manifest:</p>
<pre><code class="lang-bash">kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
</code></pre>
<hr />
<h3 id="heading-3-restart-the-kubelet-on-affected-nodes">3. Restart the Kubelet on Affected Nodes</h3>
<p>Restarting the kubelet reinitialized the CNI network stack:</p>
<pre><code class="lang-bash">sudo systemctl restart kubelet
</code></pre>
<hr />
<h3 id="heading-verification">✅ Verification</h3>
<p>After applying the fixes, we verified node status:</p>
<pre><code class="lang-bash">kubectl get nodes
</code></pre>
<p>The nodes now showed:</p>
<pre><code class="lang-bash">ip-node-ip.eu-west-1.compute.internal   Ready
</code></pre>
<p>Services started recovering, and workloads were rescheduled.</p>
<hr />
<h2 id="heading-lessons-learned-how-to-prevent-this-in-the-future">🧠 Lessons Learned: How to Prevent This in the Future</h2>
<h3 id="heading-1-backup-cni-configurations">🔁 1. Backup CNI Configurations</h3>
<p>Always back up CNI configs, especially:</p>
<pre><code class="lang-bash">/etc/cni/net.d/10-calico.conflist
</code></pre>
<p>This helps with disaster recovery and rapid bootstrapping.</p>
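<p>A minimal backup, for example via a nightly cron entry on each node (a sketch; the destination path is a placeholder):</p>
<pre><code class="lang-bash">sudo cp /etc/cni/net.d/10-calico.conflist /var/backups/cni/10-calico.conflist.$(date +%F)
</code></pre>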
<hr />
<h3 id="heading-2-monitor-calico-health-and-node-status">📈 2. Monitor Calico Health and Node Status</h3>
<p>Set up monitoring and alerting for:</p>
<pre><code class="lang-bash">kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get nodes
</code></pre>
<p>Alert if:</p>
<ul>
<li><p>Calico pods crash or restart frequently</p>
</li>
<li><p>Nodes enter or stay in <code>NotReady</code></p>
</li>
</ul>
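<p>As a minimal example, a watchdog script like the following can feed an alerting pipeline (a sketch; the alert hook is a placeholder):</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# Alert if any node is NotReady or any calico-node pod is not Running.
set -euo pipefail

notready=$(kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}')
badcni=$(kubectl get pods -n kube-system -l k8s-app=calico-node --no-headers | awk '$3 != "Running" {print $1}')

if [ -n "$notready" ] || [ -n "$badcni" ]; then
  echo "ALERT: NotReady nodes: $notready | unhealthy calico pods: $badcni"
  # Hook into your alerting here (SNS, PagerDuty, Slack webhook, ...)
  exit 1
fi
</code></pre>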
<hr />
<h3 id="heading-3-avoid-mixing-cnis">⚠️ 3. Avoid Mixing CNIs</h3>
<p><strong>Do not mix different CNI plugins</strong> (e.g., Calico and AWS VPC CNI) unless you're explicitly building a hybrid setup. It introduces instability and unexpected behavior.</p>
<blockquote>
<p>In our case, we’ve since migrated to the <strong>AWS VPC CNI</strong>, which aligns better with EKS and provides native integration with VPC IP address management.</p>
</blockquote>
<hr />
<h2 id="heading-conclusion">📌 Conclusion</h2>
<p>Networking is the backbone of Kubernetes, and when the CNI fails, everything breaks.</p>
<p>This incident was a sharp reminder of the importance of:</p>
<ul>
<li><p>Validating CNI configurations</p>
</li>
<li><p>Monitoring node readiness</p>
</li>
<li><p>Keeping your control plane and worker nodes in sync</p>
</li>
</ul>
<p>By following the steps outlined above and applying proactive monitoring, you can <strong>prevent CNI-related outages</strong> and ensure high availability for your workloads.</p>
]]></content:encoded></item><item><title><![CDATA[How to Set Up Disaster Recovery (DR) for AWS MSK with MirrorMaker 2 – Step-by-Step Guide]]></title><description><![CDATA[In today's cloud-native world, ensuring high availability and resilience for streaming platforms like Apache Kafka is mission-critical. Amazon MSK (Managed Streaming for Apache Kafka) offers a powerful, fully managed Kafka service. However, it doesn'...]]></description><link>https://devopsofworld.com/how-to-set-up-disaster-recovery-dr-for-aws-msk-with-mirrormaker-2-step-by-step-guide</link><guid isPermaLink="true">https://devopsofworld.com/how-to-set-up-disaster-recovery-dr-for-aws-msk-with-mirrormaker-2-step-by-step-guide</guid><category><![CDATA[AWS]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[big data]]></category><category><![CDATA[kafka]]></category><category><![CDATA[Apache Kafka]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 27 May 2025 10:30:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748137173547/5cd42d50-0c11-475f-bd49-4920b2090c31.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today's cloud-native world, ensuring high availability and resilience for streaming platforms like Apache Kafka is mission-critical. <strong>Amazon MSK (Managed Streaming for Apache Kafka)</strong> offers a powerful, fully managed Kafka service. However, it doesn't natively provide cross-region disaster recovery (DR). In this guide, you’ll learn how to configure <strong>cross-region DR for AWS MSK using Apache Kafka MirrorMaker 2 (MM2)</strong> — a robust, open-source replication tool.</p>
<p>This comprehensive walkthrough includes prerequisites, cluster setup, networking, and end-to-end validation to help you build a production-ready DR solution.</p>
<hr />
<h2 id="heading-what-is-aws-msk">🔍 What is AWS MSK?</h2>
<p><strong>AWS MSK (Managed Streaming for Apache Kafka)</strong> is a fully managed service that simplifies running Apache Kafka on AWS. It eliminates the operational overhead of provisioning servers, configuring clusters, and managing availability.</p>
<p><strong>Key features of AWS MSK</strong>:</p>
<ul>
<li><p>Fully managed Apache Kafka</p>
</li>
<li><p>Secure by default with VPC, TLS, and IAM integration</p>
</li>
<li><p>Scalable with automatic broker scaling and storage expansion</p>
</li>
<li><p>Native support for monitoring via CloudWatch and logging integrations</p>
</li>
</ul>
<hr />
<h2 id="heading-what-is-mirrormaker-2">🔄 What is MirrorMaker 2?</h2>
<p><strong>MirrorMaker 2 (MM2)</strong> is the enhanced replication utility introduced in <strong>Apache Kafka 2.4+</strong>. It’s designed for copying data between Kafka clusters and is built on Kafka Connect, providing modularity, scalability, and fault tolerance.</p>
<h3 id="heading-key-capabilities">Key capabilities:</h3>
<ul>
<li><p>Real-time replication of topics and consumer offsets</p>
</li>
<li><p>Support for multiple clusters</p>
</li>
<li><p>Active-passive and active-active configurations</p>
</li>
<li><p>Flexible replication policies and error handling</p>
</li>
</ul>
<hr />
<h2 id="heading-available-methods-for-kafka-dr-why-mirrormaker-2">💡 Available Methods for Kafka DR – Why MirrorMaker 2?</h2>
<p>Several options exist for disaster recovery in Kafka:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Method</td><td>Real-Time</td><td>Offset Sync</td><td>Cost</td><td>Complexity</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td><strong>MirrorMaker 2 (MM2)</strong></td><td>✅</td><td>✅</td><td>Low–Med</td><td>Medium</td><td>Open-source Kafka-native tool ideal for AWS MSK with IAM support.</td></tr>
<tr>
<td><strong>Confluent Replicator</strong></td><td>✅</td><td>✅</td><td>High</td><td>High</td><td>Commercial-grade tool with advanced features.</td></tr>
<tr>
<td><strong>Custom Producers/Consumers</strong></td><td>✅</td><td>❌</td><td>Medium</td><td>High</td><td>Build-your-own with full control.</td></tr>
<tr>
<td><strong>Kafka Streams or Flink</strong></td><td>✅</td><td>❌</td><td>High</td><td>High</td><td>Stream processing with built-in replication logic.</td></tr>
<tr>
<td><strong>S3 Backup &amp; Restore</strong></td><td>❌</td><td>❌</td><td>Low</td><td>Low</td><td>Periodic export-import, cold DR only.</td></tr>
</tbody>
</table>
</div><p><strong>Why we chose MirrorMaker 2 for this guide</strong>:</p>
<ul>
<li><p>Seamless integration with AWS MSK and IAM authentication</p>
</li>
<li><p>No additional licensing or external dependencies</p>
</li>
<li><p>Good balance of simplicity, performance, and reliability</p>
</li>
</ul>
<hr />
<h2 id="heading-prerequisites">🧰 Prerequisites</h2>
<p>To follow this tutorial, ensure you have:</p>
<ul>
<li><p>An AWS account with permissions for MSK, EC2, IAM, and VPC</p>
</li>
<li><p>AWS CLI installed and configured</p>
</li>
<li><p>Java 11+ installed on the EC2 instance</p>
</li>
<li><p>Kafka client tools (Apache Kafka binaries)</p>
</li>
<li><p>Two VPCs in different regions (e.g., <code>ap-south-1</code> and <code>us-east-1</code>)</p>
</li>
<li><p>AWS MSK IAM Authentication JAR: <code>aws-msk-iam-auth.jar</code></p>
</li>
</ul>
<hr />
<h2 id="heading-step-1-create-primary-and-dr-msk-clusters">🏗️ Step 1: Create Primary and DR MSK Clusters</h2>
<h3 id="heading-create-primary-cluster-ap-south-1">🔹 Create Primary Cluster (<code>ap-south-1</code>)</h3>
<ol>
<li><p>Go to <strong>Amazon MSK &gt; Create Cluster</strong></p>
</li>
<li><p>Select <strong>Custom create</strong></p>
</li>
<li><p>Cluster name: <code>msk-primary</code></p>
</li>
<li><p>Kafka version: <code>3.0+</code></p>
</li>
<li><p>Brokers: <code>kafka.m5.large</code>, 3 brokers, 1000 GiB EBS each</p>
</li>
<li><p>Network: Select VPC and 3 subnets</p>
</li>
<li><p>Enable:</p>
<ul>
<li><p>Encryption at rest (KMS)</p>
</li>
<li><p>TLS in-transit encryption</p>
</li>
<li><p>IAM authentication</p>
</li>
</ul>
</li>
<li><p>Assign a security group to allow port <strong>9198</strong> from EC2</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcfwe7JRLC7qCPwnDbg8f7pJvaFLL9bItw6BcM-yPwSP8OmDUd-DSMEvj4mOAChy4Um71gpsvcTjDw8O1uWi_08l7XRbybeI7Tl6kThtlDnI6wSAqSKcmuaOXqQ1IgQoiUSMtMi4g?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
</li>
</ol>
<h3 id="heading-create-dr-cluster-us-east-1">🔹 Create DR Cluster (<code>us-east-1</code>)</h3>
<p>Repeat the above steps in the <code>us-east-1</code> region, using the cluster name <code>msk-dr</code>. Ensure consistent configuration across clusters.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXesMYJ57wAGFVYZj4Cgh-YLT_Ccv8AwiuSboN8ooug8INyuXQOEmOasFLwS7VGeljbIFgIQGH64NqbaGSV9OSVyRr7tHUWBnpibrMOenvFoOE9jIZ2kxqbG_IgFdb4zsoAWNmJY?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<hr />
<h2 id="heading-step-2-configure-network-and-security">🔐 Step 2: Configure Network and Security</h2>
<p>Update security group rules:</p>
<ul>
<li><p><strong>MSK SG</strong>: Allow <strong>inbound TCP 9198</strong> from EC2 SG or IP</p>
</li>
<li><p><strong>EC2 SG</strong>: Allow <strong>outbound 9198</strong> to both MSK clusters</p>
</li>
<li><p>Enable SSH (port 22) on EC2 for management access</p>
<p>  <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXexFgSA-bBAS3UVh-dTa1gKgAMhB86420zNuHKDzklcYz1EH9h910wPx7-go9X7_E_YutJmta02PYQVKe1RQqKEVYqToCCjOpxBpj9AJdz1Zgk3--wAJkU_A0JOdrGAoqFrIPpgZw?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
</li>
</ul>
<hr />
<h2 id="heading-step-3-launch-ec2-with-kafka-client">💻 Step 3: Launch EC2 with Kafka Client</h2>
<h3 id="heading-launch-ec2">🚀 Launch EC2</h3>
<ul>
<li><p>Region: <code>ap-south-1</code></p>
</li>
<li><p>Type: <code>t3.medium</code> or higher</p>
</li>
<li><p>Attach IAM role for MSK and Secrets Manager (if needed)</p>
</li>
<li><p>Ensure internet access (NAT or public IP)</p>
</li>
</ul>
<h3 id="heading-install-tools-and-configure">🛠️ Install Tools and Configure</h3>
<pre><code class="lang-bash">sudo yum update -y
sudo yum install -y java-11-amazon-corretto
wget https://downloads.apache.org/kafka/3.3.1/kafka_2.13-3.3.1.tgz
tar -xzf kafka_2.13-3.3.1.tgz
<span class="hljs-built_in">export</span> KAFKA_HOME=$(<span class="hljs-built_in">pwd</span>)/kafka_2.13-3.3.1

wget https://github.com/aws/aws-msk-iam-auth/releases/latest/download/aws-msk-iam-auth.jar
<span class="hljs-built_in">export</span> IAM_JAR=$(<span class="hljs-built_in">pwd</span>)/aws-msk-iam-auth.jar
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfYQhwJhHrxk5y_JvR2XL2w_sRf-a0_P76-2nlrjNQb-58W0iz5WlRGzyRliMmgQzGuXQDoD1BdXVMnrK7YymQgvWkNtfp5vXSFt-MP_UKe27bNFxix0V9BUBSJzTi7lwZ38uVkRw?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfC3S3k1B-8zYjrsUBR25yPPGY4UDv947CuUzBHAcnAuzJfIFN1_2mXmtujmyzTrgYGe4ZMGdZzC4WXUEKfwmvb7A4lFYaaGa8y4HSuatVa1SN_heazurb89J0tPDTrTvGXvsUWig?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcBynanupq_F3UGyPHXNcgK5SDIrdSXCeuut2ly4i-9It0oUioezYI9J_3QXcxvjF4mZO_a_yCrem2zN7H6I6hmwPxGN8vJrNieVTbTb3GCIAhuy3Z9Vn9GZgCD6NXrWy0RZH-D1A?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<h3 id="heading-create-clientproperties">✍️ Create <code>client.properties</code></h3>
<pre><code class="lang-bash">security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler

ssl.truststore.location=/home/ec2-user/msk-certs/truststore.jks
ssl.truststore.password=anjali
</code></pre>
<blockquote>
<p>☝️ Make sure the truststore includes AWS MSK’s CA certificate.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe3ongNAZID_n1cTtMkzMCxWAs4LxGdWa3zlu53ySopTyhx6FarbLpA5DbcOGCY2CympwmQgER4Z7FReCklCsfAC7YhRY6erK6fV5SW4cYXmObvNM9S1twtzYXx5mDtLk1CN6Lo?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
</blockquote>
<hr />
<h2 id="heading-step-4-create-kafka-topic-on-primary-cluster">🧵 Step 4: Create Kafka Topic on Primary Cluster</h2>
<pre><code class="lang-bash">CLASSPATH=<span class="hljs-variable">$IAM_JAR</span>:<span class="hljs-variable">$KAFKA_HOME</span>/libs/* <span class="hljs-variable">$KAFKA_HOME</span>/bin/kafka-topics.sh \
  --create \
  --topic test-topic \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server &lt;primary-broker-list&gt; \
  --command-config /home/ec2-user/msk-certs/client.properties
</code></pre>
<p>Replace <code>&lt;primary-broker-list&gt;</code> with your actual MSK bootstrap brokers.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf8DCENnIY-dL4eptLVsRZ08XZq57lj_0NwPGZh_0od4L6buYGCmaqRaP9gVEWI1jxzQNdjUn7A6h1zEkqj1CPauBZFshdkxgurFXyYrI_GuOyrtd8J780SaV5lLcQOdiB9dkosUQ?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
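<p>The broker list itself can be retrieved with the AWS CLI (the cluster ARN below is a placeholder):</p>
<pre><code class="lang-bash">aws kafka get-bootstrap-brokers \
  --cluster-arn arn:aws:kafka:ap-south-1:123456789012:cluster/msk-primary/xxxx \
  --query BootstrapBrokerStringSaslIam --output text
</code></pre>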
<hr />
<h2 id="heading-step-5-configure-and-run-mirrormaker-2">⚙️ Step 5: Configure and Run MirrorMaker 2</h2>
<h3 id="heading-create-mm2properties">✍️ Create <code>mm2.properties</code></h3>
<pre><code class="lang-bash">clusters = primary,dr

primary.bootstrap.servers=&lt;primary-brokers&gt;
primary.security.protocol=SASL_SSL
primary.sasl.mechanism=AWS_MSK_IAM
primary.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
primary.ssl.truststore.location=/home/ec2-user/msk-certs/kafka.client.truststore.jks
primary.ssl.truststore.password=anjali

dr.bootstrap.servers=&lt;dr-brokers&gt;
dr.security.protocol=SASL_SSL
dr.sasl.mechanism=AWS_MSK_IAM
dr.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
dr.ssl.truststore.location=/home/ec2-user/msk-certs/kafka.client.truststore.jks
dr.ssl.truststore.password=anjali

tasks.max=2
topics=test-topic
groups=.*
replication.policy.class=org.apache.kafka.connect.mirror.DefaultReplicationPolicy
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcXZSys_yPQIU6rGBy7VIE_gMBJLrNPsMdaGn06Y9_JuIeK-JXfhxnKDE1SxCs-tTEFrstW_E4IzJkpmNC_O3HZbmXD2qWcs5ltv7l3RmQogGmXwtBMPfIhhzunfzK4IpVrmzvukA?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<h3 id="heading-run-mm2">▶️ Run MM2</h3>
<pre><code class="lang-bash">CLASSPATH=<span class="hljs-variable">$IAM_JAR</span>:<span class="hljs-variable">$KAFKA_HOME</span>/libs/*:<span class="hljs-variable">$KAFKA_HOME</span>/libs/connect-runtime-*.jar:<span class="hljs-variable">$KAFKA_HOME</span>/libs/connect-api-*.jar \
  <span class="hljs-variable">$KAFKA_HOME</span>/bin/connect-mirror-maker.sh /home/ec2-user/mm2.properties
</code></pre>
<hr />
<h2 id="heading-step-6-validate-replication">✅ Step 6: Validate Replication</h2>
<h3 id="heading-list-topics-on-dr">🔎 List Topics on DR</h3>
<pre><code class="lang-bash">CLASSPATH=<span class="hljs-variable">$IAM_JAR</span>:<span class="hljs-variable">$KAFKA_HOME</span>/libs/* <span class="hljs-variable">$KAFKA_HOME</span>/bin/kafka-topics.sh \
  --list \
  --bootstrap-server &lt;dr-broker&gt; \
  --command-config /home/ec2-user/msk-certs/client.properties
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdbmVytrDCrmVzHof-R4sG2EalG6Ug9zk-EX37LiOtDmOm2pUsZrkJEBkujwMprD5z9DbawYpJyjdpJX-Mtfucob3VUfcwOF6k3BZIuo72XlOl2Y0IOavDphtGiXtZYTpX4eVSwEw?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<h3 id="heading-consume-messages-from-dr">📥 Consume Messages from DR</h3>
<pre><code class="lang-bash">CLASSPATH=<span class="hljs-variable">$IAM_JAR</span>:<span class="hljs-variable">$KAFKA_HOME</span>/libs/* <span class="hljs-variable">$KAFKA_HOME</span>/bin/kafka-console-consumer.sh \
  --topic primary.test-topic \
  --from-beginning \
  --bootstrap-server &lt;dr-broker&gt; \
  --consumer.config /home/ec2-user/msk-certs/client.properties
</code></pre>
<p>Note that with <code>DefaultReplicationPolicy</code>, replicated topics are prefixed with the source cluster alias, so <code>test-topic</code> from the primary cluster appears as <code>primary.test-topic</code> on the DR cluster.</p>
<hr />
<h2 id="heading-step-7-test-message-flow">✍️ Step 7: Test Message Flow</h2>
<h3 id="heading-produce-to-primary">📨 Produce to Primary</h3>
<pre><code class="lang-bash">CLASSPATH=<span class="hljs-variable">$IAM_JAR</span>:<span class="hljs-variable">$KAFKA_HOME</span>/libs/* <span class="hljs-variable">$KAFKA_HOME</span>/bin/kafka-console-producer.sh \
  --topic test-topic \
  --bootstrap-server &lt;primary-broker&gt; \
  --producer.config /home/ec2-user/msk-certs/client.properties
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdaSxnzdlDhBQXHaTYQlyVPr2AvQ6yOwHQ-BAF0eKDxEQPi1dopd1m5br-f6UI18Co_gfVSeY2DARQZzQvsLYcBr_LGumLXyiH7jnxdclAMPC6b8xV5p8vQafKK79huIV3VX7vu?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<p>Type some messages and hit Enter.</p>
<h3 id="heading-confirm-on-dr">✅ Confirm on DR</h3>
<p>Re-run the consumer from the DR cluster to verify real-time replication.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf2zfw9b2QDGEGZapUlBJA1yWqIo4Uc3-0R2zJ-alUkmgpzQ8Q4F9XG-wrW_syPJcyaRia3zlBmsN_WdjFu7pE03hasTHTP1RquwO3velEsxPzc_NqQ3LNgSA5TRk17ohq-i0VyCA?key=kyqryDzyBOwqmUNT6iVkzw" alt /></p>
<p>💬 <strong>Have you implemented DR for Kafka in your architecture?</strong><br />Drop your approach or challenges in the comments — I'd love to hear how others tackle cross-region resilience!</p>
<p>🔔 <strong>Follow me for more AWS infrastructure and streaming data posts!</strong></p>
]]></content:encoded></item><item><title><![CDATA[Set Up Cross-Region Disaster Recovery for DynamoDB Using Global Tables (Step-by-Step Guide)]]></title><description><![CDATA[In today’s cloud-native world, ensuring the continuous availability of data across regions is crucial. Amazon DynamoDB Global Tables offer a powerful, fully managed solution for building active-active cross-region architectures with built-in replicat...]]></description><link>https://devopsofworld.com/set-up-cross-region-disaster-recovery-for-dynamodb-using-global-tables-step-by-step-guide</link><guid isPermaLink="true">https://devopsofworld.com/set-up-cross-region-disaster-recovery-for-dynamodb-using-global-tables-step-by-step-guide</guid><category><![CDATA[AWS]]></category><category><![CDATA[DynamoDB]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[cloud architecture]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 27 May 2025 08:45:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747500298262/aea72a6a-2b53-4c32-8a33-2a69b76d525d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today’s cloud-native world, ensuring the continuous availability of data across regions is crucial. <strong>Amazon DynamoDB Global Tables</strong> offer a powerful, fully managed solution for building <strong>active-active cross-region architectures</strong> with built-in replication.</p>
<p>In this post, we’ll walk through how to set up a <strong>cross-region disaster recovery (DR)</strong> strategy for DynamoDB using Global Tables across <code>us-east-1</code> and <code>ap-south-1</code>, covering setup, replication, and failover validation.</p>
<h2 id="heading-use-case">🔧 Use Case</h2>
<ul>
<li><p><strong>Architecture</strong>: Active-Active (bi-directional replication)</p>
</li>
<li><p><strong>Primary Region</strong>: <code>us-east-1</code> (US East - N. Virginia)</p>
</li>
<li><p><strong>Disaster Recovery Region</strong>: <code>ap-south-1</code> (Asia Pacific - Mumbai)</p>
</li>
<li><p><strong>Objective</strong>: Set up Global Tables in DynamoDB to ensure DR-readiness with real-time replication.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747500356633/082d90b2-cb54-4fcc-bcd8-93e69929082b.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h2 id="heading-step-by-step-setup">🛠 Step-by-Step Setup</h2>
<h3 id="heading-step-1-create-a-dynamodb-table-in-the-primary-region">Step 1: Create a DynamoDB Table in the Primary Region</h3>
<ol>
<li><p><strong>Log in</strong> to the AWS Management Console.</p>
</li>
<li><p><strong>Switch Region</strong> to <strong>US East (N. Virginia) – us-east-1</strong>.</p>
</li>
<li><p>Navigate to <strong>Services &gt; DynamoDB</strong>.</p>
</li>
<li><p>Click on <strong>"Create Table"</strong>.</p>
</li>
<li><p>Enter table details:</p>
<ul>
<li><p><strong>Table name</strong>: <code>test-table</code></p>
</li>
<li><p><strong>Partition key</strong>: <code>UserID</code> (String)</p>
</li>
</ul>
</li>
<li><p><strong>Leave defaults</strong> as-is (On-demand billing recommended).</p>
</li>
<li><p>Click <strong>"Create Table"</strong>.</p>
</li>
<li><p>Wait until the table status is <strong>“Active”</strong>.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXchQrBHo1zoR-bHc47PyzblZt013RkVgXQ335f7Q6nOkAk7mZD0ejXOCICVF8c-XxKYentoVZSSj_iyk02by6hNTCGJgP_pDatTJjmgOw86_0OHT4Gv3m1D35abTY0tmA8vzPxI_g?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
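<p>For repeatable setups, the same table can be created with the AWS CLI (a sketch mirroring the console choices above, with on-demand billing):</p>
<pre><code class="lang-bash">aws dynamodb create-table \
  --table-name test-table \
  --attribute-definitions AttributeName=UserID,AttributeType=S \
  --key-schema AttributeName=UserID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
</code></pre>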
<hr />
<h3 id="heading-step-2-add-replica-region-ap-south-1">Step 2: Add Replica Region (ap-south-1)</h3>
<ol>
<li><p>While still in <strong>us-east-1</strong>, open the <code>test-table</code>.</p>
</li>
<li><p>Go to the <strong>"Global Tables"</strong> tab.</p>
</li>
<li><p>Click <strong>"Add region"</strong>.</p>
</li>
<li><p>Select <strong>Asia Pacific (Mumbai) – ap-south-1</strong>.</p>
</li>
<li><p>Click <strong>"Create Replica"</strong>.</p>
</li>
<li><p>AWS will provision a replica in ap-south-1 with automatic schema synchronization.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcdXvfl6_iXglQV-fTBHQPlFYzisDwDYExWRe_Zdr-EV39o67L9WJHCeDAWldt3U74H0CAgrc5gRKQQCkUS-qc4_KtUCrStLHdRcfUY7ALOhC-tNPO_Wjg_VfNF3_r21piOTLCc_Q?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
<hr />
<h3 id="heading-step-3-verify-replication">Step 3: Verify Replication</h3>
<ol>
<li><p><strong>Switch Region</strong> to <strong>ap-south-1</strong>.</p>
</li>
<li><p>Go to <strong>DynamoDB &gt; Tables</strong>.</p>
</li>
<li><p>Confirm that <code>test-table</code> appears.</p>
</li>
<li><p>Open it and validate that the schema matches the original.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcYaCyPq7azWg4P7XP48ZYMT1usKXxOZDO5Wfp87XAOa2j59ZvThKF7FJcfxe6d_deu7S6bgRzafkZ2H9NhWFvYc-IdhjJsRFC9S3N6AaiYxejdeSzBV9aTGGnhf3AI0SpeU177rQ?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
<h3 id="heading-step-4-test-bi-directional-replication">Step 4: Test Bi-Directional Replication</h3>
<h4 id="heading-a-insert-data-in-primary-us-east-1">A. Insert Data in Primary (us-east-1)</h4>
<ol>
<li><p>Switch back to <strong>us-east-1</strong>.</p>
</li>
<li><p>Open <code>test-table</code> &gt; <strong>Explore table items</strong> &gt; <strong>Create item</strong>.</p>
</li>
<li><p>Add:</p>
<pre><code class="lang-json"> {
   <span class="hljs-attr">"UserID"</span>: <span class="hljs-string">"U001"</span>,
   <span class="hljs-attr">"Name"</span>: <span class="hljs-string">"Alice"</span>
 }
</code></pre>
</li>
<li><p>Click <strong>"Create item"</strong>.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe6qcKIGWgePCz9VXu3_XXUrCxED1xnZfMjgYX5xj3EjWlCxoPt5lh9IbkxObUUuD5sEu9P4TBj8QXwcL2br-UI-6Dc-mONWiByz5Lmta7n9iAskhYtxfJagkaqymgLP30S6apl?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
<h4 id="heading-b-verify-in-dr-region-ap-south-1">B. Verify in DR Region (ap-south-1)</h4>
<ol>
<li><p>Switch to <strong>ap-south-1</strong>.</p>
</li>
<li><p>Navigate to <code>test-table</code> &gt; <strong>Explore table items</strong>.</p>
</li>
<li><p>Verify that Alice’s record is replicated.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeAT0_87x1NTpx1etPe2qm2OKFoII8zupL2bE_5OpiMUKNxa09Je9PqA110wgnOQYqpsnJ_b__zfNt5NXMoiDH0SEftoqtrWveHpCDu3eWW-62hkU97pmgb-rmjFKPcKbfJZVFi?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
<h4 id="heading-c-insert-in-dr-region-ap-south-1">C. Insert in DR Region (ap-south-1)</h4>
<ol>
<li><p>In <strong>ap-south-1</strong>, add:</p>
<pre><code class="lang-json"> {
   <span class="hljs-attr">"UserID"</span>: <span class="hljs-string">"U002"</span>,
   <span class="hljs-attr">"Name"</span>: <span class="hljs-string">"Bob"</span>
 }
</code></pre>
</li>
<li><p>Switch to <strong>us-east-1</strong> and verify Bob's entry is available there too.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdKRN5iHt9-_9IQyjgPMgs1wkPJeVTsDcGk1PGOXAKY4RChpUp-fVBtgwGfKQSNgSqCUkjTBcmvw9pUPPtk_cQJTBtZz0Sk3dwgGKEHxa3p-eLexNiyXRz5yq6An4iHvAe_MXx83Q?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
<h3 id="heading-step-5-simulate-region-failure-optional-testing">Step 5: Simulate Region Failure (Optional Testing)</h3>
<p>To test failover without bringing down the actual AWS region:</p>
<ol>
<li><p><strong>Block the us-east-1 endpoint</strong> locally:</p>
<pre><code class="lang-bash"> sudo nano /etc/hosts
</code></pre>
</li>
<li><p>Add the following line:</p>
<pre><code class="lang-bash"> 127.0.0.1 dynamodb.us-east-1.amazonaws.com
</code></pre>
</li>
<li><p>Now perform read/write operations only in <strong>ap-south-1</strong>.</p>
</li>
<li><p>Once done, remove or comment out the line from <code>/etc/hosts</code>.</p>
<p> <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcs-U8VwIhApgkhKSueY5ktxnP15-D7TDEJPzPTSYmsiMlNXM7n_DOZ8XnOW4mb03MM34CZgFkfc68lRvyzfs7mDEvfICU_l7BG0VDOfj2KpkNpUJUEtpUfRNoH7DYT8ywCDWDwnA?key=w7Vteh5Mb4gMkEmUSh32fg" alt /></p>
</li>
</ol>
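<p>With the hosts entry in place, you can confirm the simulated outage from the CLI: requests to the blocked Region should fail with a connection error, while the DR Region keeps serving traffic.</p>
<pre><code class="lang-bash"># Should fail (the endpoint now resolves to 127.0.0.1)
aws dynamodb list-tables --region us-east-1

# Should still succeed
aws dynamodb list-tables --region ap-south-1
</code></pre>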
<h2 id="heading-conclusion">✅ Conclusion</h2>
<p>With <strong>DynamoDB Global Tables</strong>, building a <strong>resilient, low-latency, and globally available</strong> application becomes straightforward. This architecture is ideal for mission-critical systems that demand real-time DR and global data access. By following this guide, you’ve successfully implemented an <strong>active-active cross-region DR strategy</strong> using only the AWS Console.</p>
<h2 id="heading-further-reading">📌 Further Reading</h2>
<ul>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html">AWS Docs: DynamoDB Global Tables</a></p>
</li>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.HowItWorks.html">Handling Conflicts in DynamoDB Global Tables</a></p>
</li>
<li><p><a target="_blank" href="https://aws.amazon.com/dynamodb/pricing/">DynamoDB Pricing</a></p>
</li>
</ul>
<p>💬 Have you implemented cross-region DR for DynamoDB or other AWS services? Share your setup in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[MongoDB Cross-Region Disaster Recovery (DR) on AWS EC2: Step-by-Step Guide]]></title><description><![CDATA[In today’s high-availability landscape, disaster recovery (DR) is more than a best practice—it's a requirement. This tutorial walks you through implementing a cross-region MongoDB DR solution using EC2 instances with public IPs. You’ll learn to repli...]]></description><link>https://devopsofworld.com/mongodb-cross-region-disaster-recovery-dr-on-aws-ec2-step-by-step-guide</link><guid isPermaLink="true">https://devopsofworld.com/mongodb-cross-region-disaster-recovery-dr-on-aws-ec2-step-by-step-guide</guid><category><![CDATA[MongoDB]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[cloud architecture]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 27 May 2025 07:38:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748326895259/2e4c0762-8cd7-4ecb-bebd-55319dd6f89a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today’s high-availability landscape, <strong>disaster recovery (DR)</strong> is more than a best practice—it's a requirement. This tutorial walks you through implementing a <strong>cross-region MongoDB DR solution using EC2 instances with public IPs</strong>. You’ll learn to replicate data between AWS regions to ensure business continuity even in regional outages.</p>
<hr />
<h2 id="heading-architecture-overview">🧭 Architecture Overview</h2>
<p>The DR architecture consists of:</p>
<ul>
<li><p>A <strong>primary MongoDB instance</strong> running in one AWS region.</p>
</li>
<li><p>A <strong>replica set member</strong> (secondary) in a separate AWS region.</p>
</li>
<li><p><strong>Data replication</strong> between these two nodes.</p>
</li>
</ul>
<p>📌 <em>The diagram below</em> shows a simple two-region EC2-based MongoDB setup with replication arrows connecting the nodes.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdmrTT9golye1yAWeI6kCSR3QpDxwRAjE2eZq4hM9-dFlQqZb_JNCJfwhn0idO4w2b9DUWh9NYFK_iZKTEVFwhqWDIBTGoXU-fXDHH3qNBW9wFKhlMTbqsl21-eCGa_KUexYQ9haA?key=7COqV19dSCUFk1rY8mZjBQ" alt /></p>
<hr />
<h2 id="heading-implementation-guide">🔧 Implementation Guide</h2>
<h3 id="heading-1-ec2-instance-setup">1. EC2 Instance Setup</h3>
<ul>
<li><p><strong>Launch two Ubuntu EC2 instances</strong>, one in each AWS region.</p>
</li>
<li><p><strong>Security Groups</strong> must allow:</p>
<ul>
<li>Inbound traffic on <strong>TCP port 27017</strong> (MongoDB default port) from known IPs.</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-2-install-mongodb">2. Install MongoDB</h3>
<p>Run these commands on both instances:</p>
<pre><code class="lang-bash">sudo apt-get install gnupg curl
curl -fsSL https://www.mongodb.org/static/pgp/server-8.0.asc | \
  sudo gpg -o /usr/share/keyrings/mongodb-server-8.0.gpg --dearmor

<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-8.0.gpg ] \
https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/8.0 multiverse"</span> | \
sudo tee /etc/apt/sources.list.d/mongodb-org-8.0.list

sudo apt-get update
sudo apt-get install -y mongodb-org
sudo systemctl start mongod
sudo systemctl <span class="hljs-built_in">enable</span> mongod
</code></pre>
<hr />
<h3 id="heading-3-configure-mongodb">3. Configure MongoDB</h3>
<p>Edit the MongoDB config file on <strong>both servers</strong>:</p>
<pre><code class="lang-bash">sudo nano /etc/mongod.conf
</code></pre>
<p>Set the following values:</p>
<pre><code class="lang-bash">net:
  port: 27017
  bindIp: 0.0.0.0

replication:
  replSetName: <span class="hljs-string">"rs0"</span>

security:
  authorization: enabled
</code></pre>
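<p>⚠️ One caveat: when <code>authorization</code> is enabled on a replica set, MongoDB also requires <strong>internal authentication</strong> between members (a shared key file or x.509 certificates); without it, the nodes cannot authenticate to each other and replication fails. A minimal key-file sketch, assuming the arbitrary path <code>/etc/mongod.key</code> (generate the key once, then copy the same file to both servers):</p>
<pre><code class="lang-bash"># Generate a random key and lock down its permissions
openssl rand -base64 756 | sudo tee /etc/mongod.key &gt; /dev/null
sudo chmod 400 /etc/mongod.key
sudo chown mongodb:mongodb /etc/mongod.key
</code></pre>
<p>Then reference it in the <code>security</code> section of <code>/etc/mongod.conf</code> on both nodes:</p>
<pre><code class="lang-bash">security:
  authorization: enabled
  keyFile: /etc/mongod.key
</code></pre>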
<p>Then restart the MongoDB service:</p>
<pre><code class="lang-bash">sudo systemctl restart mongod
</code></pre>
<hr />
<h3 id="heading-4-initialize-replica-set">4. Initialize Replica Set</h3>
<p>Log in to the Mongo shell on the <strong>primary</strong>:</p>
<pre><code class="lang-bash">mongosh
</code></pre>
<p>Run this setup:</p>
<pre><code class="lang-bash">rs.initiate({
  _id: <span class="hljs-string">"rs0"</span>,
  members: [
    { _id: 0, host: <span class="hljs-string">"primary-public-ip:27017"</span>, priority: 2 },
    { _id: 1, host: <span class="hljs-string">"secondary-public-ip:27017"</span>, priority: 1 }
  ]
})
</code></pre>
<p>Verify it using:</p>
<pre><code class="lang-bash">rs.status()
</code></pre>
<hr />
<h3 id="heading-5-create-admin-user">5. Create Admin User</h3>
<p>On the <strong>primary</strong>:</p>
<pre><code class="lang-bash">db.createUser({
  user: <span class="hljs-string">"adminUser"</span>,
  <span class="hljs-built_in">pwd</span>: <span class="hljs-string">"securePassword"</span>,
  roles: [{ role: <span class="hljs-string">"root"</span>, db: <span class="hljs-string">"admin"</span> }]
})
</code></pre>
<p>Reconnect with auth:</p>
<pre><code class="lang-bash">mongosh --host primary-public-ip:27017 -u adminUser -p securePassword --authenticationDatabase admin
</code></pre>
<hr />
<h3 id="heading-6-test-replication">6. Test Replication</h3>
<p><strong>Insert on primary</strong>:</p>
<pre><code class="lang-bash">use testdb
db.testCollection.insertOne({ message: <span class="hljs-string">"Testing replication"</span>, timestamp: new Date() })
</code></pre>
<p><strong>Read from the secondary</strong>:</p>
<pre><code class="lang-bash">mongosh --host secondary-public-ip:27017 -u adminUser -p securePassword --authenticationDatabase admin
</code></pre>
<p>Enable read preference:</p>
<pre><code class="lang-bash">db.getMongo().setReadPref(<span class="hljs-string">"secondaryPreferred"</span>)
use testdb
db.testCollection.find()
</code></pre>
<hr />
<h3 id="heading-7-simulate-failover">7. Simulate Failover</h3>
<p>Stop MongoDB on the <strong>primary</strong>:</p>
<pre><code class="lang-bash">sudo systemctl stop mongod
</code></pre>
<p>Then, on the <strong>secondary</strong>, check the replica set status:</p>
<pre><code class="lang-bash">rs.status()
</code></pre>
<p>⚠️ Note: with only <strong>two voting members</strong>, the surviving node cannot reach an election majority, so it will remain SECONDARY rather than promote itself. Automatic failover requires a third voting member; see the arbiter sketch below.</p>
<p>Bring the primary back:</p>
<pre><code class="lang-bash">sudo systemctl start mongod
</code></pre>
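<p>As noted above, a two-member replica set cannot elect a new primary on its own. The cheapest way to get true automatic failover is to add an <strong>arbiter</strong>: a third voting member that stores no data. A minimal sketch (the arbiter host below is a placeholder; recent MongoDB versions may also require setting a cluster-wide default write concern before an arbiter can be added):</p>
<pre><code class="lang-bash"># Run from mongosh on the current primary
db.adminCommand({ setDefaultRWConcern: 1, defaultWriteConcern: { w: 1 } })
rs.addArb("arbiter-public-ip:27017")
</code></pre>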
<hr />
<h3 id="heading-8-checklist-tldr">📝 8. Checklist / TL;DR</h3>
<p>Here’s a quick reference summary to validate your setup:</p>
<p>✅ EC2s launched in two AWS regions<br />✅ MongoDB installed and configured<br />✅ Replica set initialized<br />✅ Authentication set up<br />✅ Replication verified<br />✅ Failover tested</p>
<h2 id="heading-conclusion">✅ Conclusion</h2>
<p>This guide demonstrates how to:</p>
<ul>
<li><p>Set up a <strong>MongoDB replica set</strong> across AWS regions.</p>
</li>
<li><p>Secure it with authentication.</p>
</li>
<li><p>Test real-time <strong>replication and failover</strong>.</p>
</li>
</ul>
<p>While this is a basic manual setup, production deployments may benefit from:</p>
<ul>
<li><p>Private networking (e.g., VPC peering).</p>
</li>
<li><p>Automation via Terraform or Ansible.</p>
</li>
<li><p>Monitoring using MongoDB Ops Manager or Prometheus.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Boosting Linux Host Performance for Containerized Workloads (Nginx Benchmarking Guide)]]></title><description><![CDATA[🚀 Why Host Tuning Matters
Even though containers provide abstraction, they still rely on the host’s kernel, CPU scheduler, memory handling, and I/O layers. With smart tuning:

Network throughput can rise 📈

Memory management can become more efficie...]]></description><link>https://devopsofworld.com/boosting-linux-host-performance-for-containerized-workloads-nginx-benchmarking-guide</link><guid isPermaLink="true">https://devopsofworld.com/boosting-linux-host-performance-for-containerized-workloads-nginx-benchmarking-guide</guid><category><![CDATA[Linux]]></category><category><![CDATA[optimization]]></category><category><![CDATA[tunning]]></category><category><![CDATA[benchmarking]]></category><category><![CDATA[Performance Optimization]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 20 May 2025 11:50:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747737031009/41c13a94-f263-451b-b84a-d5f173f1c9bb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-why-host-tuning-matters">🚀 Why Host Tuning Matters</h2>
<p>Even though containers provide abstraction, they still rely on the host’s kernel, CPU scheduler, memory handling, and I/O layers. With smart tuning:</p>
<ul>
<li><p>Network throughput can rise 📈</p>
</li>
<li><p>Memory management can become more efficient 🔁</p>
</li>
<li><p>Latency can be reduced for disk and CPU operations ⚡</p>
</li>
</ul>
<h2 id="heading-flow-of-the-optimization-process">📊 Flow of the Optimization Process</h2>
<p>This clear 5-step flow ensures reproducibility and insight into performance improvements.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747737129655/6cf6e053-3b42-49e0-b96a-99bd1b8c6943.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-setup-and-preliminaries">🛠️ Setup and Preliminaries</h2>
<ul>
<li><p><strong>OS:</strong> Ubuntu 22.04</p>
</li>
<li><p><strong>Application:</strong> Nginx inside a Docker container</p>
</li>
<li><p><strong>Benchmark Tools:</strong> <code>stress-ng</code>, <code>iperf3</code>, <code>dd</code>, <code>htop</code>, <code>iotop</code>, <code>iftop</code>, <code>sysstat</code></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p><strong>Ensure Docker is installed and running:</strong></p>
<pre><code class="lang-bash">docker --version
</code></pre>
<p><strong>Install benchmarking and monitoring tools:</strong></p>
<pre><code class="lang-bash">sudo apt update
sudo add-apt-repository universe
sudo apt update
sudo apt install -y stress-ng iperf3 htop iotop iftop sysstat
</code></pre>
<h2 id="heading-step-1-launch-the-nginx-container">📦 Step 1: Launch the Nginx Container</h2>
<pre><code class="lang-bash">docker run -d --name nginx-test -p 8089:80 nginx
</code></pre>
<p>Verify by navigating to: <a target="_blank" href="http://localhost:8089"><code>http://localhost:8089</code></a></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdGNG6gvLTfeZL_w_tJPblsdNK4890-erZZciWQ12IFsSSMGiT-h_T__J7nFhnZjtkCHyacF9hy-W8HMXkP4VMSXMseceZXpDnwtLfe5xve6Ro5cReVovEBNVA6JhkF7Bk7gl-FFQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd-7kyhgJA014hqHn7plCpzpS3oPRNaMZYEzhmhyuPGJ_tdPXKmczMQynEj79xyBnow_pL3oAgsFzVvP0clilG_PjGSndv0ZIhW8aQ0X7XfisqKy_IUDVa4GuwV8a5xFrjqJXF3IQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h2 id="heading-step-2-baseline-performance-before-tuning">📉 Step 2: Baseline Performance (Before Tuning)</h2>
<h3 id="heading-cpu-test">CPU Test</h3>
<pre><code class="lang-bash">stress-ng --cpu 4 --timeout 60s --metrics-brief
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe3shVLEKln2xm4-27gleNOCVNbAPvZnvJply-uzmz2zYk6TjpgPvkaIAy2D50qKw0SFaXeKaz3xnEmYcJGOwZJSw0tcoQh_Tfqd3y62pMb7bKAI1pwhMM_XLb2A_upUyl3fEZguA?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-memory-test">Memory Test</h3>
<pre><code class="lang-bash">stress-ng --vm 2 --vm-bytes 1G --timeout 60s --metrics-brief
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf279CU692RxVUss_rNO4fjkOBpYWgyCMa95mIeV7TmDTc8B1L5jus2p-_0n3vMtZVSBpIs5TRwlIv_SP5Xe3ZxXtJisBjURv8_tI1ojbOckqFlGCFe4RZOhnDcS7Z1c3zJU7SoKA?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-disk-io-test">Disk I/O Test</h3>
<pre><code class="lang-bash">dd <span class="hljs-keyword">if</span>=/dev/zero of=testfile bs=1G count=1 oflag=dsync
rm -f testfile
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf-8_QuJmkL8-P-APShb5KhNGhD5QQ-4U84_6Sgw5UV0Wr0QKaxlH6eitIMdQye2dZA3aZrC2mrWvqjn7bQ77XJfUe2IijQu_ij0iJgqJysHAMIKzJ-Of5ZzW3R_8moPqfk2tgk?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-network-test-loopback">Network Test (Loopback)</h3>
<p><strong>Terminal 1:</strong></p>
<pre><code class="lang-bash">iperf3 -s
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfpDQk9DOIWOjObTnny4MbwOUswXj-Kk4lCZ-QUkvsgyfFQX7xC8GLDUXTfdV8TTMJn2Em9rzraMCBz4e4jtXSsiEJBvj0ZNIyBHZaqel1wQLuSp7sr-B00YjB69my4KD2zcEHDWg?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<p><strong>Terminal 2:</strong></p>
<pre><code class="lang-bash">iperf3 -c 127.0.0.1
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcxGTOlX2IcAZTkbsZvcvd5W3QlMIPdVar2YTIzYTRkraamY1FfzHfur2bktaGoKaLAP72AhMpI7fhiGc2eTe4_jlIGnZzVigVQh185Z3zALhvWelvNEcvvqF1uekn4g0UluG1ZPg?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h2 id="heading-step-3-apply-linux-host-tuning">⚙️ Step 3: Apply Linux Host Tuning</h2>
<h3 id="heading-cpu-tuning">CPU Tuning</h3>
<p>Install the cpupower tools and set the performance governor:</p>
<pre><code class="lang-bash">sudo apt install -y linux-tools-common linux-tools-generic
sudo cpupower frequency-set --governor performance
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfZBHIoiCMzQgJNQOcskFZpRSSNqlT2nQvFsWOrpPcVtPhNDUt-6zKmtuF4jJH5NmTnDZJOzls83HFKwJHwoyiFkJhTRjl_qL3FCVfaZH_Vbs8kQpGF40X4JZLjHhDX-P3FWeIq7A?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdWHmvnp-EjQjOoR-TS-HChZMGEVstnfKxJClsXddr0UwshTf32FIF6OY28O_3H7jMTzYRybMHL9gR1zSh-c8GjpysP0JHNLXDZiMc9MxZeskhl8q2KeVY45i8iOzt5wFzHvD9RBg?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-memory-tuning">Memory Tuning</h3>
<pre><code class="lang-bash">sudo sysctl -w vm.swappiness=10
sudo sysctl -w vm.dirty_ratio=20
sudo sysctl -w vm.dirty_background_ratio=10
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe3OuceLXY9M87ecNIj8WpqQ2JCUQTwJ41FMgbi3uMPlkb5fLM25KFewImtKOmMAEgM-QMEs3byx6ik8NXtBe-oxVtzmZJtuj4fsVNRwwnp0P7yYHWIdKQm8d4ZFElre2UZaaoLSA?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdaFlQQQ2eZNtlDm5ikOsn4ObyOzTbwpUPcJDty02B6ySyhwH3ZUgEEG6bxDA2E_lFNRWhn0Gqakhu7nEiMWCf_-kjHP_tPLCI3dNNe-qwuZu-TajXYtGFptHkBVE-HtVJdOMWQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXflSWbdFF0K6aSlRPwGphLjw65wVMk9c8cER4bgtL8oLBHXOTnNazH0AZGn7WIRrIb8oV6qI4MjTwiNtpHels62ZXB7M_46M3aDs5k7GclYFvCd5ZQ4kFSMEgpwpDY-vSv90rNL?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-disk-io-tuning">Disk I/O Tuning</h3>
<p>Change scheduler (check your block device name, e.g., <code>nvme0n1</code>):</p>
<pre><code class="lang-bash">cat /sys/block/nvme0n1/queue/scheduler
<span class="hljs-built_in">echo</span> mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXco6iuoa_uA4frJhQNzmPYxNhkqXNBsE78DGsUTnC16fx6HPVfE1mLiIoc9C8lXI9Z4LyN-bCsyPOuRZggjKbcolL4jbkzaZ5OThwz2MG7S1g7ltLGaNF4MmjKkhONAJXFP2WlgRQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-network-tuning">Network Tuning</h3>
<pre><code class="lang-bash">sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.core.netdev_max_backlog=250000
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcQNnz-pkKJUHD6Ek-rmzTHvIEAOLwG-LzZ6xsfJIuInG8S65uxAijS1H50eTAMtk5GbeEiHSRv5T25jTQ-liz5kKfi4P4JbsysD9Eop8Fj7ANK2C8Ggd10jQ2wZABAyl6D87xq2w?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-kernel-parameters-for-containers">Kernel Parameters for Containers</h3>
<pre><code class="lang-bash">sudo tee /etc/sysctl.d/99-container-tuning.conf &gt; /dev/null &lt;&lt;EOF
fs.file-max = 2097152
vm.max_map_count = 262144
net.ipv4.ip_local_port_range = 1024 65535
EOF

sudo sysctl --system
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXflHmGgnWz2cIrZ3K9jAiF6QzevjGpofRL9r2zn-W5rcnXP3I9jwEVp8oMinjkY3yDFFUks8TIQ9O4g9qfnZpJf3KYLP5LCFHDVyANMXBsVfMCCc2RJwZ8FqCM3cTlVPRXK4nUbCQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h2 id="heading-step-4-re-deploy-and-re-test">🔁 Step 4: Re-deploy and Re-test</h2>
<p>Recreate the container:</p>
<pre><code class="lang-bash">docker rm -f nginx-test
docker run -d --name nginx-test -p 8089:80 nginx
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfhWI-tb6PD8BUFNJAOBsdiFrpA6RpqvtiqTpoyTeFsS8GQ58YiGcaHkHdy2_PEp3okeEyZ9RBhAOPvZVTdsCNnSzkq8PQUjMTfGqqgrRgz6T1UMMZ4w4VV-VdgjwRXJOW-NNJDlA?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h2 id="heading-step-6-re-test-the-performance-post-tuning">🔁 Step 6: Re-test the Performance (Post-Tuning)</h2>
<p>After applying system-level optimizations, it's time to <strong>benchmark again</strong> and compare the performance with the baseline.</p>
<h3 id="heading-cpu-re-test">✅ CPU Re-test</h3>
<p>Use the following command to stress the CPU with 4 workers for 60 seconds:</p>
<pre><code class="lang-bash">stress-ng --cpu 4 --timeout 60s --metrics-brief
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfJVGWvpQ_QzxHQJyU9ZZNauHnoHhVHPMt9ASphq7TPeDYLrYxbJgw8oe3FZBtm0LwWuGweM_mr6IGufX73mUcEVbX2vNAdXZSEABt5hPwljBLwVGFd2s5hyJJHw4NJVDBo9aeqpQ?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-memory-re-test">✅ Memory Re-test</h3>
<p>Run a memory stress test with 2 workers using 1 GB each for 60 seconds:</p>
<pre><code class="lang-bash">stress-ng --vm 2 --vm-bytes 1G --timeout 60s --metrics-brief
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfClHvTniyvnvn44H-uCNAQ3n8QSzZ8oftTxC5yWR5al93f1bk2tpxMB_MRgvf1XPYiY3L3K_IMSV9gbbISDFHyknslNd9zBbGYkNJGSfC8XIYnvFlDqwnanEGTbjwfpxIVEJG9?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-disk-io-re-test">✅ Disk I/O Re-test</h3>
<p>Evaluate sequential disk write speed using <code>dd</code>:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span>=/dev/zero of=testfile bs=1G count=1 oflag=dsync
rm -f testfile
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcF7WTq1EMmHlE4Qw_Pp1_7hw24WKlnGwMMT_fkwfxxWUd3M2NWhlU-ctxakbQtNRC1xlOdr_sVvqTr-kQZct53NbADWZYBQ5FG2NZKdhol_0ZvnsCPR6BLuI5AdeoWmproIG2U?key=Ef__CL8GAMxaKEucTOeICg" alt /></p>
<h3 id="heading-network-re-test-loopback-throughput">✅ Network Re-test (Loopback Throughput)</h3>
<p><strong>Step 1: Start the iperf3 server</strong></p>
<pre><code class="lang-bash">iperf3 -s
</code></pre>
<p><strong>Step 2: In another terminal (same host), run the iperf3 client</strong></p>
<pre><code class="lang-bash">iperf3 -c 127.0.0.1
</code></pre>
<p>This will test internal loopback throughput after networking optimizations.</p>
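<p>The tests above measure the host in isolation. To tie the numbers back to the workload itself, you can also benchmark the NGINX container directly; one simple option (not installed earlier) is ApacheBench:</p>
<pre><code class="lang-bash">sudo apt install -y apache2-utils

# 10,000 requests, 100 concurrent, against the published port
ab -n 10000 -c 100 http://localhost:8089/
</code></pre>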
<h2 id="heading-benchmark-results-before-vs-after">📊 Benchmark Results: Before vs. After</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Metric</strong></td><td><strong>Before Tuning</strong></td><td><strong>After Tuning</strong></td><td><strong>Result</strong></td></tr>
</thead>
<tbody>
<tr>
<td>CPU (stress-ng)</td><td>4367.92 bogo ops/s</td><td>4300.34 bogo ops/s</td><td>🔻 1.6% Decrease</td></tr>
<tr>
<td>Memory (stress-ng)</td><td>92099.68 bogo ops/s</td><td>91439.94 bogo ops/s</td><td>🔻 0.7% Slight Decrease</td></tr>
<tr>
<td>Disk I/O (dd)</td><td>1.4 GB/s</td><td>1.4 GB/s</td><td>➖ No Change</td></tr>
<tr>
<td>Network (iperf3)</td><td>67.3 Gbps</td><td>69.7 Gbps</td><td>🔺 3.6% Improvement</td></tr>
</tbody>
</table>
</div><h2 id="heading-analysis">📌 Analysis</h2>
<ul>
<li><p><strong>CPU &amp; Memory:</strong> Minor decrease in throughput—likely due to more deterministic scheduling from tuning.</p>
</li>
<li><p><strong>Disk I/O:</strong> No change—indicating the scheduler change didn’t affect sequential write speed.</p>
</li>
<li><p><strong>Network:</strong> Solid improvement in throughput thanks to buffer tuning and backlog increase.</p>
</li>
</ul>
<h2 id="heading-summary">✅ Summary</h2>
<p>Tuning a Linux host for containerized workloads <strong>isn't just about maxing out performance</strong>—it's about tailoring the system to match your application’s behavior.</p>
<h3 id="heading-key-takeaways">Key Takeaways:</h3>
<ul>
<li><p>Network stack tuning had the <strong>most measurable benefit</strong> (3.6% gain).</p>
</li>
<li><p>Memory and CPU tuning had <strong>slight trade-offs</strong>, possibly due to overhead or governor changes.</p>
</li>
<li><p>Always <strong>benchmark before and after</strong> to validate changes.</p>
</li>
<li><p>Apply these techniques to <strong>any containerized app</strong>, not just Nginx.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Building a Real-Time AWS WAF Log Analytics Dashboard with OpenSearch]]></title><description><![CDATA[This blog will walk you through a powerful end-to-end log analytics pipeline using AWS WAF, Kinesis Firehose, S3, Logstash, and OpenSearch Dashboards. We aim to analyze and visualize traffic patterns, particularly unwanted requests, filtered by WAF o...]]></description><link>https://devopsofworld.com/building-a-real-time-aws-waf-log-analytics-dashboard-with-opensearch</link><guid isPermaLink="true">https://devopsofworld.com/building-a-real-time-aws-waf-log-analytics-dashboard-with-opensearch</guid><category><![CDATA[opensearch]]></category><category><![CDATA[waf]]></category><category><![CDATA[logstash]]></category><category><![CDATA[#OpenSearchService]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Tue, 20 May 2025 07:34:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747726136476/c226f499-e338-45ec-be22-8b61def39952.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog will walk you through a powerful end-to-end log analytics pipeline using <strong>AWS WAF</strong>, <strong>Kinesis Firehose</strong>, <strong>S3</strong>, <strong>Logstash</strong>, and <strong>OpenSearch Dashboards</strong>. We aim to analyze and visualize traffic patterns, particularly unwanted requests, filtered by WAF on an EC2-hosted NGINX server.</p>
<h2 id="heading-architecture-overview">🧩 Architecture Overview</h2>
<p>Here’s the flow of data across our components:</p>
<ol>
<li><p><strong>NGINX</strong> application on an EC2 instance in a <strong>public subnet</strong>.</p>
</li>
<li><p>A <strong>Load Balancer</strong> is attached to the EC2 instance.</p>
</li>
<li><p><strong>AWS WAF</strong> rules are applied to block access from certain sources.</p>
</li>
<li><p>Logs sent to <strong>Amazon Kinesis Data Firehose</strong>.</p>
</li>
<li><p>Firehose delivers logs to an <strong>S3 Bucket</strong>.</p>
</li>
<li><p><strong>Logstash</strong> reads from S3 and parses logs.</p>
</li>
<li><p>Logs are indexed into <strong>OpenSearch servers</strong>.</p>
</li>
<li><p>Visualized through <strong>OpenSearch Dashboards</strong>.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747716079426/4bbac428-ec4c-4895-a7a5-3859407d5509.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-1-deploy-nginx-on-ec2">🏗️ Step 1: Deploy NGINX on EC2</h2>
<p>Spin up an EC2 instance in a <strong>public subnet</strong> and install NGINX:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747719785591/40c96e75-66af-4176-afec-5402a97bad13.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-bash">sudo apt-get update -y
sudo apt-get install nginx -y 
sudo systemctl <span class="hljs-built_in">enable</span> nginx
sudo systemctl start nginx
sudo systemctl status nginx
</code></pre>
<p>Ensure the EC2 is behind an <strong>Application Load Balancer (ALB)</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720304888/3c878072-752d-45f5-89bf-32e6b43addc6.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720350971/8686c47e-8f47-42dd-a086-718b4da36576.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720405800/3b62dbcb-206e-401f-8184-26e37ba2db56.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720518949/03217a05-8e60-4b6a-9ccd-a6c25b812fd6.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-2-protect-with-aws-waf">Step 2: Protect with AWS WAF</h2>
<p>Attach AWS WAF to your ALB. Create custom rules to filter traffic (e.g., block based on IP, User-Agent, geo, etc.).</p>
<p>Enable <strong>logging</strong> in WAF and configure <strong>Kinesis Data Firehose</strong> as the log destination.</p>
<h3 id="heading-step-by-step-instructions">🧾 Step-by-Step Instructions</h3>
<h4 id="heading-1-create-an-aws-waf-web-acl"><strong>1. Create an AWS WAF Web ACL</strong></h4>
<ol>
<li><p>Go to the <strong>AWS WAF &amp; Shield</strong> service in the AWS Console.</p>
</li>
<li><p>Click <strong>“Create web ACL”</strong>.</p>
</li>
<li><p>Provide a <strong>name</strong> and <strong>description</strong> (e.g., <code>nginx-web-acl</code>).</p>
</li>
<li><p>Select <strong>Region</strong> (e.g., <code>us-east-1</code>) and <strong>Resource type</strong> as <strong>“Regional resources”</strong>.</p>
</li>
<li><p>Choose <strong>Associated AWS resource type</strong>: select <strong>Application Load Balancer</strong>.</p>
</li>
<li><p>Choose the ALB that routes traffic to your NGINX EC2 instance.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720959248/a3d81922-8c26-4044-a4dd-2a23b7bb7b09.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747720996792/e1eba011-1136-4fa1-ab42-ed9c3756a022.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-2-add-waf-rules"><strong>2. Add WAF Rules</strong></h4>
<p>You can use managed rules or define custom ones. Examples:</p>
<h5 id="heading-block-local-ip-range-simulate-external-access-block">➤ <strong>Block Local IP Range (Simulate External Access Block)</strong></h5>
<p>To simulate blocking local system traffic, add a custom rule to block a specific IP:</p>
<ol>
<li><p>Under <strong>“Add rules”</strong>, click <strong>Add my own rules and rule groups</strong>.</p>
</li>
<li><p>Create a rule:</p>
<ul>
<li><p>Rule name: <code>BlockLocalIP</code></p>
</li>
<li><p>Type: <strong>IP set</strong></p>
</li>
<li><p>Create a new IP set with your <strong>local public IP</strong>.</p>
</li>
<li><p>Set action to <strong>Block</strong>.</p>
</li>
</ul>
</li>
<li><p>Add the rule to the Web ACL.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721050318/054f0eb4-aaf2-4dac-9f5e-7cf5e38121dc.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721083148/791b8d47-898d-474a-9bde-2ac02b8051a4.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<h4 id="heading-3-configure-logging-to-kinesis-data-firehose"><strong>3. Configure Logging to Kinesis Data Firehose</strong></h4>
<p>WAF logs can be sent to <strong>Amazon Kinesis Data Firehose</strong>, which will forward them to <strong>S3</strong>. Note that WAF only accepts delivery streams whose names start with <code>aws-waf-logs-</code>.</p>
<ol>
<li><p>On the left nav, go to <strong>Logging and metrics</strong> under WAF.</p>
</li>
<li><p>Click <strong>“Enable logging”</strong>.</p>
</li>
<li><p>Select your <strong>Web ACL</strong>.</p>
</li>
<li><p>Choose the <strong>Kinesis Firehose delivery stream</strong> that you created earlier (e.g. <code>aws-waf-logs-stream</code>).</p>
</li>
<li><p>Optionally, add filters or redactions.</p>
</li>
<li><p>Save changes.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721298863/2487cd79-53f9-4175-827f-df64d67d6f7c.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721369014/739ee873-2b61-4425-abce-9b6f364c464b.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721385928/4db69dd4-9c0a-42e3-85d2-9573c27e9008.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<h4 id="heading-4-verify-logs-are-flowing"><strong>4. Verify Logs Are Flowing</strong></h4>
<ol>
<li><p>Send traffic through your Load Balancer (via browser or <code>curl</code>).</p>
</li>
<li><p>In the <strong>S3 bucket</strong>, JSON log files arrive via Firehose.</p>
<ul>
<li>Each log event will contain request metadata like <code>clientIp</code>, <code>action</code>, <code>ruleMatched</code>, <code>httpRequest</code>, etc.</li>
</ul>
</li>
<li><p>These logs will be picked up later by <strong>Logstash</strong> and placed in your pipeline.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721485057/dfe056b5-6213-4d3c-8e63-9a08200d5f5b.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721518439/1485627a-b37d-4050-9581-5c68db330cb3.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
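<p>If you have not created the delivery stream yet, a minimal CLI sketch looks like this (the account ID and role name are placeholders; the role must allow Firehose to write to your bucket):</p>
<pre><code class="lang-bash">aws firehose create-delivery-stream \
  --delivery-stream-name aws-waf-logs-stream \
  --delivery-stream-type DirectPut \
  --s3-destination-configuration RoleARN=arn:aws:iam::123456789012:role/firehose-waf-role,BucketARN=arn:aws:s3:::poc-s3-bucket-00000001

# After sending some traffic, confirm objects are landing in the bucket
aws s3 ls s3://poc-s3-bucket-00000001/ --recursive
</code></pre>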
<h2 id="heading-step-3-deploy-opensearch-and-logstash-with-docker"><strong>🐳 Step 3: Deploy OpenSearch and Logstash with Docker</strong></h2>
<p>Create a Docker network:</p>
<pre><code class="lang-bash">docker network create opensearch-net
</code></pre>
<p>🔍 <strong>Run OpenSearch:</strong></p>
<pre><code class="lang-bash">docker run -d --name opensearch-node1 \
  --network opensearch-net \
  -p 9200:9200 -p 9600:9600 \
  -e <span class="hljs-string">"discovery.type=single-node"</span> \
  -e <span class="hljs-string">"OPENSEARCH_INITIAL_ADMIN_PASSWORD=Redhat@123"</span> \
  opensearchproject/opensearch:latest
</code></pre>
<p>📊 <strong>OpenSearch Dashboards:</strong></p>
<pre><code class="lang-bash">docker run -d --name opensearch-dashboards \
  --network opensearch-net \
  -p 5601:5601 \
  -e OPENSEARCH_HOSTS=<span class="hljs-string">'["https://opensearch-node1:9200"]'</span> \
  -e OPENSEARCH_USERNAME=<span class="hljs-string">'admin'</span> \
  -e OPENSEARCH_PASSWORD=<span class="hljs-string">'Redhat@123'</span> \
  opensearchproject/opensearch-dashboards:latest
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747721955952/4140b7dc-e061-4550-adf7-6fc197ab2971.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-5-logstash-to-process-waf-logs">📥 Step 5: Logstash to Process WAF Logs</h2>
<p>Use the official image with the OpenSearch plugin:</p>
<pre><code class="lang-bash">docker run -d --name logstash-01 \
  --network opensearch-net \
  -v /home/neetesh/cloudkeeper-workspace/waf-promethuses-02/logstash-pipeline:/usr/share/logstash/pipeline \
  -v /tmp/logstash-s3:/tmp/logstash-s3 \
  opensearchproject/logstash-oss-with-opensearch-output-plugin:latest
</code></pre>
<h3 id="heading-logstash-pipeline-configuration">🔧 Logstash Pipeline Configuration</h3>
<pre><code class="lang-bash">
input {
  s3 {
    bucket =&gt; <span class="hljs-string">"poc-s3-bucket-00000001"</span>
    prefix =&gt; <span class="hljs-string">""</span>                   <span class="hljs-comment"># Adjust if you want to limit path</span>
    region =&gt; <span class="hljs-string">"us-east-1"</span>               <span class="hljs-comment"># Replace with your bucket’s region</span>
    codec =&gt; <span class="hljs-string">"json"</span>
    sincedb_path =&gt; <span class="hljs-string">"/tmp/logstash-s3.sincedb"</span>
    temporary_directory =&gt; <span class="hljs-string">"/tmp/logstash-s3"</span> 
    access_key_id =&gt; <span class="hljs-string">"ACCESS_KEY_ID"</span>
    secret_access_key =&gt; <span class="hljs-string">"SECRET_ACCESS_KEY"</span>
  }
}

filter {
  <span class="hljs-keyword">if</span> <span class="hljs-string">"_jsonparsefailure"</span> <span class="hljs-keyword">in</span> [tags] {
    drop { }
  }

  <span class="hljs-keyword">if</span> [action] {
    mutate { add_field =&gt; { <span class="hljs-string">"action"</span> =&gt; <span class="hljs-string">"%{[action]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [terminatingRuleId] {
    mutate { add_field =&gt; { <span class="hljs-string">"terminating_rule"</span> =&gt; <span class="hljs-string">"%{[terminatingRuleId]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][clientIp] {
    mutate { add_field =&gt; { <span class="hljs-string">"client_ip"</span> =&gt; <span class="hljs-string">"%{[httpRequest][clientIp]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][country] {
    mutate { add_field =&gt; { <span class="hljs-string">"country"</span> =&gt; <span class="hljs-string">"%{[httpRequest][country]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][httpMethod] {
    mutate { add_field =&gt; { <span class="hljs-string">"http_method"</span> =&gt; <span class="hljs-string">"%{[httpRequest][httpMethod]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][uri] {
    mutate { add_field =&gt; { <span class="hljs-string">"uri"</span> =&gt; <span class="hljs-string">"%{[httpRequest][uri]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][httpVersion] {
    mutate { add_field =&gt; { <span class="hljs-string">"http_version"</span> =&gt; <span class="hljs-string">"%{[httpRequest][httpVersion]}"</span> } }
  }

  <span class="hljs-keyword">if</span> [httpRequest][headers] {
    ruby {
      code =&gt; <span class="hljs-string">'
        begin
          headers = event.get("[httpRequest][headers]")
          headers.each do |h|
            if h["name"].downcase == "user-agent"
              event.set("user_agent", h["value"])
            elsif h["name"].downcase == "host"
              event.set("host_header", h["value"])
            end
          end
        rescue =&gt; e
          event.tag("_header_parse_failure")
        end
      '</span>
    }
  }
}

output {
  opensearch {
    hosts =&gt; [<span class="hljs-string">"https://opensearch-node1:9200"</span>]
    user =&gt; <span class="hljs-string">"admin"</span>
    password =&gt; <span class="hljs-string">"Redhat@123"</span>
    ssl =&gt; <span class="hljs-literal">true</span>
    ssl_certificate_verification =&gt; <span class="hljs-literal">false</span>
    index =&gt; <span class="hljs-string">"aws-waf-logs-test-%{+YYYY.MM.dd}"</span>
  }
  stdout { codec =&gt; rubydebug }
}
</code></pre>
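<p>Once Logstash starts pulling objects from S3, confirm documents are actually being indexed before building dashboards:</p>
<pre><code class="lang-bash"># A daily aws-waf-logs-test-* index should appear with a growing docs.count
curl -sk -u admin:Redhat@123 "https://localhost:9200/_cat/indices/aws-waf-logs-*?v"
</code></pre>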
<h2 id="heading-step-6-visualize-logs-in-opensearch-dashboards">📈 Step 6: Visualize Logs in OpenSearch Dashboards</h2>
<p>Navigate to <a target="_blank" href="http://localhost:5601"><code>http://localhost:5601</code></a> and log in with:</p>
<ul>
<li><p><strong>Username:</strong> <code>admin</code></p>
</li>
<li><p><strong>Password:</strong> <code>Redhat@123</code></p>
</li>
</ul>
<p>Create an index pattern: <code>aws-waf-logs-test-*</code></p>
<p>You can now create rich visualizations such as:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747722151075/96aae3c3-23e1-483b-9e21-33e0b119fe61.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747722189418/8b239835-0fa7-4bd0-a5a4-16d54bba8f0e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747722223067/d12caf61-8952-42d8-a93b-792550364465.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-full-opensearch-dashboard-json-import-snippets">📦 Full OpenSearch Dashboard: JSON Import Snippets</h2>
<p>You can now import a ready-made dashboard with rich visualizations into OpenSearch Dashboards using the exported JSON below.</p>
<h3 id="heading-included-visualizations">📁 Included Visualizations</h3>
<p>Your exported dashboard includes:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Title</td><td>Type</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td><strong>WAF Actions</strong></td><td>Pie Chart</td><td>Visual breakdown of <code>ALLOW</code>, <code>BLOCK</code>, <code>CAPTCHA</code> actions.</td></tr>
<tr>
<td><strong>Total HTTP Requests</strong></td><td>Metric</td><td>Count of total requests received.</td></tr>
<tr>
<td><strong>Blocked HTTP Requests</strong></td><td>Metric</td><td>Requests specifically marked as <code>BLOCK</code>.</td></tr>
<tr>
<td><strong>HTTP Versions Breakdown</strong></td><td>Pie Chart</td><td>Shows HTTP protocol versions like 1.1 vs 2.0.</td></tr>
<tr>
<td><strong>HTTP Methods</strong></td><td>Pie Chart</td><td>GET, POST, etc.</td></tr>
<tr>
<td><strong>Top Hosts</strong></td><td>Pie Chart</td><td>Popular host headers seen in WAF logs.</td></tr>
<tr>
<td><strong>Top Countries</strong></td><td>Pie Chart</td><td>Countries from which requests originated.</td></tr>
<tr>
<td><strong>Top IP Addresses</strong></td><td>Pie Chart</td><td>Most frequent source IPs.</td></tr>
<tr>
<td><strong>Top User Agents</strong></td><td>Pie Chart</td><td>Devices or clients initiating traffic.</td></tr>
<tr>
<td><strong>Top Web ACLs</strong></td><td>Table</td><td>Lists WAF WebACLs that matched requests.</td></tr>
<tr>
<td><strong>Unique IP Address Count</strong></td><td>Metric</td><td>Unique source IPs seen.</td></tr>
<tr>
<td><strong>Number of Requests per Country</strong></td><td>Bar Chart</td><td>Comparative view of traffic volume per country.</td></tr>
</tbody>
</table>
</div><pre><code class="lang-json">{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"buildNum"</span>:<span class="hljs-number">8430</span>,<span class="hljs-attr">"defaultIndex"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"3.0.0"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"config"</span>:<span class="hljs-string">"7.9.0"</span>},<span class="hljs-attr">"references"</span>:[],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"config"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T11:19:38.632Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzIsMV0="</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"fields"</span>:<span class="hljs-string">"[{\"count\":0,\"name\":\"@timestamp\",\"type\":\"date\",\"esTypes\":[\"date\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true},{\"count\":0,\"name\":\"@version\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"@version.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"@version\"}}},{\"count\":0,\"name\":\"_id\",\"type\":\"string\",\"esTypes\":[\"_id\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":false},{\"count\":0,\"name\":\"_index\",\"type\":\"string\",\"esTypes\":[\"_index\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":false},{\"count\":0,\"name\":\"_score\",\"type\":\"number\",\"scripted\":false,\"searchable\":false,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"_source\",\"type\":\"_source\",\"esTypes\":[\"_source\"],\"scripted\":false,\"searchable\":false,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"_type\",\"type\":\"string\",\"scripted\":false,\"searchable\":false,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"action\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"action.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"action\"}}},{\"count\":0,\"name\":\"captchaResponse.failureReason\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"captchaResponse.failureReason.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"captchaResponse.failureReason\"}}},{\"count\":0,\"name\":\"captchaResponse.responseCode\",\"type\":\"number\",\"esTypes\":[\"long\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true},{\"count\":0,\"name\":\"client_ip\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"client_ip.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"client_ip\"}}},{\"count\":0,\"name\":\"country\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"country.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"country\"}}},{\"count\":0,\"name\":\"event.original\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"event.original.keyword\",\"type\":\"string\",\"esTypes
\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"event.original\"}}},{\"count\":0,\"name\":\"formatVersion\",\"type\":\"number\",\"esTypes\":[\"long\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true},{\"count\":0,\"name\":\"host.name\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"host.name.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"host.name\"}}},{\"count\":0,\"name\":\"host_header\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"host_header.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"host_header\"}}},{\"count\":0,\"name\":\"httpRequest.clientIp\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.clientIp.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.clientIp\"}}},{\"count\":0,\"name\":\"httpRequest.country\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.country.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.country\"}}},{\"count\":0,\"name\":\"httpRequest.headers.name\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.headers.name.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.headers.name\"}}},{\"count\":0,\"name\":\"httpRequest.headers.value\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.headers.value.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.headers.value\"}}},{\"count\":0,\"name\":\"httpRequest.httpMethod\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.httpMethod.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.httpMethod\"}}},{\"count\":0,\"name\":\"httpRequest.httpVersion\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.
httpVersion.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.httpVersion\"}}},{\"count\":0,\"name\":\"httpRequest.uri\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"httpRequest.uri.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"httpRequest.uri\"}}},{\"count\":0,\"name\":\"http_method\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"http_method.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"http_method\"}}},{\"count\":0,\"name\":\"http_version\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"http_version.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"http_version\"}}},{\"count\":0,\"name\":\"log.file.path\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"log.file.path.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"log.file.path\"}}},{\"count\":0,\"name\":\"nonTerminatingMatchingRules.action\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"nonTerminatingMatchingRules.action.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"nonTerminatingMatchingRules.action\"}}},{\"count\":0,\"name\":\"nonTerminatingMatchingRules.ruleId\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"nonTerminatingMatchingRules.ruleId.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"nonTerminatingMatchingRules.ruleId\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.key\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.key.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.customValues.key\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.name\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.name.keyword
\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.customValues.name\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.value\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.customValues.value.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.customValues.value\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.evaluationWindowSec\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.evaluationWindowSec.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.evaluationWindowSec\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.limitKey\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.limitKey.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.limitKey\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.maxRateAllowed\",\"type\":\"number\",\"esTypes\":[\"long\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true},{\"count\":0,\"name\":\"rateBasedRuleList.rateBasedRuleId\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.rateBasedRuleId.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.rateBasedRuleId\"}}},{\"count\":0,\"name\":\"rateBasedRuleList.rateBasedRuleName\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"rateBasedRuleList.rateBasedRuleName.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"rateBasedRuleList.rateBasedRuleName\"}}},{\"count\":0,\"name\":\"terminatingRuleId\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"terminatingRuleId.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"terminatingRuleId\"}}},{\"count\":0,\"name\":\"terminatingRuleType\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"terminatingRuleType.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValue
s\":true,\"subType\":{\"multi\":{\"parent\":\"terminatingRuleType\"}}},{\"count\":0,\"name\":\"terminating_rule\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"terminating_rule.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"terminating_rule\"}}},{\"count\":0,\"name\":\"timestamp\",\"type\":\"number\",\"esTypes\":[\"long\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true},{\"count\":0,\"name\":\"uri\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"uri.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"uri\"}}},{\"count\":0,\"name\":\"user_agent\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"user_agent.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"user_agent\"}}},{\"count\":0,\"name\":\"webaclId\",\"type\":\"string\",\"esTypes\":[\"text\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":false,\"readFromDocValues\":false},{\"count\":0,\"name\":\"webaclId.keyword\",\"type\":\"string\",\"esTypes\":[\"keyword\"],\"scripted\":false,\"searchable\":true,\"aggregatable\":true,\"readFromDocValues\":true,\"subType\":{\"multi\":{\"parent\":\"webaclId\"}}}]"</span>,<span class="hljs-attr">"timeFieldName"</span>:<span class="hljs-string">"@timestamp"</span>,<span class="hljs-attr">"title"</span>:<span class="hljs-string">"aws-waf-logs-*"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"index-pattern"</span>:<span class="hljs-string">"7.6.0"</span>},<span class="hljs-attr">"references"</span>:[],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T11:19:34.933Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzEsMV0="</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"WAF Actions (ALLOW vs BLOCK vs CAPTCHA)"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{\"vis\":{\"colors\":{\"BLOCK\":\"#ef9988\"}}}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"WAF Actions (ALLOW vs BLOCK vs CAPTCHA)\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"action.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"addLegend\":true,\"addTooltip\":true,\"isDonut\":true,\"labels\":{\"last_level\":true,\"show\":false,\"truncate\":100,\"values\":true},\"legendPosition\":\"right\",\"row\":true,\"type\":\"pie\"}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"99127f00-30b7-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T12:15:02.158Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzEzLDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Total HTTP Requests"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Total HTTP Requests\",\"type\":\"metric\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{\"customLabel\":\"Total HTTP Requests\"},\"schema\":\"metric\"}],\"params\":{\"addTooltip\":true,\"addLegend\":false,\"type\":\"metric\",\"metric\":{\"percentageMode\":false,\"useRanges\":false,\"colorSchema\":\"Green to Red\",\"metricColorMode\":\"None\",\"colorsRange\":[{\"from\":0,\"to\":10000}],\"labels\":{\"show\":true},\"invertColors\":false,\"style\":{\"bgFill\":\"#000\",\"bgColor\":false,\"labelColor\":false,\"subText\":\"\",\"fontSize\":60}}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"63e0d4c0-30b8-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T11:41:35.628Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzYsMV0="</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[{\"$state\":{\"store\":\"appState\"},\"meta\":{\"alias\":null,\"disabled\":false,\"key\":\"action.keyword\",\"negate\":false,\"params\":{\"query\":\"BLOCK\"},\"type\":\"phrase\",\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.filter[0].meta.index\"},\"query\":{\"match_phrase\":{\"action.keyword\":\"BLOCK\"}}}],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Blocked HTTP Requests"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Blocked HTTP Requests\",\"type\":\"metric\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{\"customLabel\":\"Blocked Requests\"},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"filters\",\"params\":{\"filters\":[{\"input\":{\"query\":\"\",\"language\":\"kuery\"},\"label\":\"\"}]},\"schema\":\"group\"}],\"params\":{\"addTooltip\":true,\"addLegend\":false,\"type\":\"metric\",\"metric\":{\"percentageMode\":false,\"useRanges\":false,\"colorSchema\":\"Green to Red\",\"metricColorMode\":\"None\",\"colorsRange\":[{\"from\":0,\"to\":10000}],\"labels\":{\"show\":true},\"invertColors\":false,\"style\":{\"bgFill\":\"#000\",\"bgColor\":false,\"labelColor\":false,\"subText\":\"\",\"fontSize\":60}}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"7132d620-30bb-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.filter[0].meta.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T12:03:30.357Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzgsMV0="</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"HTTP Versions Breakdown"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"HTTP Versions Breakdown\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"http_version.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"7119bbf0-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T14:55:14.223Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzE4LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"HTTP Methods"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"HTTP Methods\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"http_method.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100},\"row\":true}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"1f420b20-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T14:53:01.512Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzE1LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Top  Hosts"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Top  Hosts\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"host_header.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"49df5b20-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:01:22.321Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzI3LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Top Countries "</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Top Countries \",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"country.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"0824e100-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:05:39.905Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzMwLDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Top IP Addresses"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Top IP Addresses\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"client_ip.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"e4418fe0-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:05:50.998Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzMxLDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Top User Agents"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Top User Agents\",\"type\":\"pie\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"user_agent.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"pie\",\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true,\"labels\":{\"show\":false,\"values\":true,\"last_level\":true,\"truncate\":100}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"26b45d80-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:00:35.246Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzI1LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Top Web ACLs"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Top Web ACLs\",\"type\":\"table\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"webaclId.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"bucket\"}],\"params\":{\"perPage\":10,\"showPartialRows\":false,\"showMetricsAtAllLevels\":false,\"showTotal\":false,\"totalFunc\":\"sum\",\"percentageCol\":\"\"}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"c6952af0-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:05:19.523Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzI5LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"hits"</span>:<span class="hljs-number">0</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"</span>},<span class="hljs-attr">"optionsJSON"</span>:<span class="hljs-string">"{\"hidePanelTitles\":false,\"useMargins\":true}"</span>,<span class="hljs-attr">"panelsJSON"</span>:<span class="hljs-string">"[{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"674ecf5a-ed50-411f-8178-d0c28c2f0acd\",\"w\":24,\"x\":0,\"y\":0},\"panelIndex\":\"674ecf5a-ed50-411f-8178-d0c28c2f0acd\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_0\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"9a2fb88c-a482-4adc-9ab9-1693caca9e07\",\"w\":24,\"x\":24,\"y\":0},\"panelIndex\":\"9a2fb88c-a482-4adc-9ab9-1693caca9e07\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_1\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"02404d24-4e9f-4120-bc80-5931a1e8fe7c\",\"w\":24,\"x\":24,\"y\":15},\"panelIndex\":\"02404d24-4e9f-4120-bc80-5931a1e8fe7c\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_2\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"681f6ea4-757a-4fbd-b74d-20698edf01dd\",\"w\":24,\"x\":0,\"y\":15},\"panelIndex\":\"681f6ea4-757a-4fbd-b74d-20698edf01dd\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_3\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"1c3d4763-f99f-4945-a32f-c6553518f059\",\"w\":24,\"x\":24,\"y\":30},\"panelIndex\":\"1c3d4763-f99f-4945-a32f-c6553518f059\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_4\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"9ce23390-217d-4d1d-a9df-0d9a2d858966\",\"w\":24,\"x\":0,\"y\":30},\"panelIndex\":\"9ce23390-217d-4d1d-a9df-0d9a2d858966\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_5\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"d9b7d60d-e78b-473c-9493-4ed9cdeb824f\",\"w\":24,\"x\":24,\"y\":45},\"panelIndex\":\"d9b7d60d-e78b-473c-9493-4ed9cdeb824f\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_6\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"a7e493f6-23df-4c6d-b95a-d45fe9735d57\",\"w\":24,\"x\":0,\"y\":45},\"panelIndex\":\"a7e493f6-23df-4c6d-b95a-d45fe9735d57\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_7\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"60758b57-6454-4bd0-a723-091816c4ed24\",\"w\":24,\"x\":24,\"y\":60},\"panelIndex\":\"60758b57-6454-4bd0-a723-091816c4ed24\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_8\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"26cb5b48-4840-4a02-92f5-f783e6053c98\",\"w\":24,\"x\":0,\"y\":60},\"panelIndex\":\"26cb5b48-4840-4a02-92f5-f783e6053c98\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_9\"},{\"embeddableConfig\":{},\"gridData\":{\"h\":15,\"i\":\"61169e8e-a911-47dd-8f4f-abab036fa0a7\",\"w\":24,\"x\":24,\"y\":75},\"panelIndex\":\"61169e8e-a911-47dd-8f4f-abab036fa0a7\",\"version\":\"3.0.0\",\"panelRefName\":\"panel_10\"}]"</span>,<span class="hljs-attr">"timeRestore"</span>:<span class="hljs-literal">false</span>,<span class="hljs-attr">"title"</span>:<span class="hljs-string">"WAF-Monitorings"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"7d2643d0-30bc-11f0-9eb5-8f6a0d106a1d"</span>,<span 
class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"dashboard"</span>:<span class="hljs-string">"7.9.3"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"99127f00-30b7-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_0"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"63e0d4c0-30b8-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_1"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"7132d620-30bb-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_2"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"7119bbf0-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_3"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"1f420b20-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_4"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"49df5b20-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_5"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"0824e100-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_6"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"e4418fe0-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_7"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"26b45d80-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_8"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"c6952af0-30d4-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_9"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>},{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"63e0d4c0-30b8-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"panel_10"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"dashboard"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-15T06:06:12.962Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzM2LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Unique IP Address Count"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Unique IP Address Count\",\"type\":\"metric\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"client_ip.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":5,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\"},\"schema\":\"group\"}],\"params\":{\"addTooltip\":true,\"addLegend\":false,\"type\":\"metric\",\"metric\":{\"percentageMode\":false,\"useRanges\":false,\"colorSchema\":\"Green to Red\",\"metricColorMode\":\"None\",\"colorsRange\":[{\"from\":0,\"to\":10000}],\"labels\":{\"show\":true},\"invertColors\":false,\"style\":{\"bgFill\":\"#000\",\"bgColor\":false,\"labelColor\":false,\"subText\":\"\",\"fontSize\":60}}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"8e9c5c50-30d3-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T14:56:03.733Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzE5LDFd"</span>}
{<span class="hljs-attr">"attributes"</span>:{<span class="hljs-attr">"description"</span>:<span class="hljs-string">""</span>,<span class="hljs-attr">"kibanaSavedObjectMeta"</span>:{<span class="hljs-attr">"searchSourceJSON"</span>:<span class="hljs-string">"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[],\"indexRefName\":\"kibanaSavedObjectMeta.searchSourceJSON.index\"}"</span>},<span class="hljs-attr">"title"</span>:<span class="hljs-string">"Number of Requests per Country"</span>,<span class="hljs-attr">"uiStateJSON"</span>:<span class="hljs-string">"{}"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"visState"</span>:<span class="hljs-string">"{\"title\":\"Number of Requests per Country\",\"type\":\"histogram\",\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"params\":{\"customLabel\":\"Total Requests\"},\"schema\":\"metric\"},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"params\":{\"field\":\"country.keyword\",\"orderBy\":\"1\",\"order\":\"desc\",\"size\":10,\"otherBucket\":false,\"otherBucketLabel\":\"Other\",\"missingBucket\":false,\"missingBucketLabel\":\"Missing\",\"customLabel\":\"Country\"},\"schema\":\"segment\"}],\"params\":{\"type\":\"histogram\",\"grid\":{\"categoryLines\":false},\"categoryAxes\":[{\"id\":\"CategoryAxis-1\",\"type\":\"category\",\"position\":\"bottom\",\"show\":true,\"style\":{},\"scale\":{\"type\":\"linear\"},\"labels\":{\"show\":true,\"filter\":true,\"truncate\":100},\"title\":{}}],\"valueAxes\":[{\"id\":\"ValueAxis-1\",\"name\":\"LeftAxis-1\",\"type\":\"value\",\"position\":\"left\",\"show\":true,\"style\":{},\"scale\":{\"type\":\"linear\",\"mode\":\"normal\"},\"labels\":{\"show\":true,\"rotate\":0,\"filter\":false,\"truncate\":100},\"title\":{\"text\":\"Total Requests\"}}],\"seriesParams\":[{\"show\":true,\"type\":\"histogram\",\"mode\":\"stacked\",\"data\":{\"label\":\"Total Requests\",\"id\":\"1\"},\"valueAxis\":\"ValueAxis-1\",\"drawLinesBetweenPoints\":true,\"lineWidth\":2,\"showCircles\":true}],\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"times\":[],\"addTimeMarker\":false,\"labels\":{\"show\":false},\"thresholdLine\":{\"show\":false,\"value\":10,\"width\":1,\"style\":\"full\",\"color\":\"#E7664C\"}}}"</span>},<span class="hljs-attr">"id"</span>:<span class="hljs-string">"397b01f0-30d7-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"migrationVersion"</span>:{<span class="hljs-attr">"visualization"</span>:<span class="hljs-string">"7.10.0"</span>},<span class="hljs-attr">"references"</span>:[{<span class="hljs-attr">"id"</span>:<span class="hljs-string">"50aec450-30b5-11f0-9eb5-8f6a0d106a1d"</span>,<span class="hljs-attr">"name"</span>:<span class="hljs-string">"kibanaSavedObjectMeta.searchSourceJSON.index"</span>,<span class="hljs-attr">"type"</span>:<span class="hljs-string">"index-pattern"</span>}],<span class="hljs-attr">"type"</span>:<span class="hljs-string">"visualization"</span>,<span class="hljs-attr">"updated_at"</span>:<span class="hljs-string">"2025-05-14T15:22:22.388Z"</span>,<span class="hljs-attr">"version"</span>:<span class="hljs-string">"WzM0LDFd"</span>}
{<span class="hljs-attr">"exportedCount"</span>:<span class="hljs-number">15</span>,<span class="hljs-attr">"missingRefCount"</span>:<span class="hljs-number">0</span>,<span class="hljs-attr">"missingReferences"</span>:[]}
</code></pre>
<h3 id="heading-how-to-import-the-dashboard">🛠️ How to Import the Dashboard</h3>
<ol>
<li><p>Navigate to <strong>OpenSearch Dashboards</strong> → <strong>Dashboards Management</strong> → <strong>Saved Objects</strong>.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747897938665/d1acf6db-441c-4441-bc07-abbfcc87bed3.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Click <strong>Import</strong> and upload your JSON file.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747898004708/6e897ecc-9ad5-49ce-a1df-8501590e7642.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Confirm and overwrite if prompted.</p>
</li>
<li><p>Navigate to <strong>Dashboards</strong> → <strong>WAF-Monitorings</strong>.</p>
</li>
</ol>
<p>This dashboard will instantly visualize live or historical WAF logs streamed from AWS into OpenSearch via your pipeline.</p>
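<p>If you prefer to script the import instead of clicking through the console, OpenSearch Dashboards exposes the same operation through its saved objects API. Below is a minimal sketch using Python's <code>requests</code> library; the endpoint URL, credentials, and file name are placeholders you would replace with your own values.</p>
<pre><code class="lang-python">import requests

# Placeholder endpoint and credentials -- substitute your own.
DASHBOARDS_URL = 'https://your-dashboards-host:5601'

with open('waf-dashboard.ndjson', 'rb') as f:
    resp = requests.post(
        f'{DASHBOARDS_URL}/api/saved_objects/_import',
        params={'overwrite': 'true'},      # same as confirming overwrite in the UI
        headers={'osd-xsrf': 'true'},      # XSRF header required by OpenSearch Dashboards
        files={'file': ('waf-dashboard.ndjson', f)},
        auth=('admin', 'your-password'),
    )

resp.raise_for_status()
print(resp.json())  # reports whether the import succeeded and how many objects were loaded
</code></pre>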
<h2 id="heading-conclusion">✅ Conclusion</h2>
<p>This blog walked you through building a log analysis pipeline using AWS WAF, Firehose, S3, Logstash, and OpenSearch. With this setup, you gain full visibility into suspicious traffic, helping you improve your application’s security posture in real time.</p>
]]></content:encoded></item><item><title><![CDATA[Automatically Stop Amazon RDS Instances Daily Using AWS Lambda and EventBridge]]></title><description><![CDATA[Managing AWS costs efficiently is crucial, especially when dealing with development, testing, or QA environments. A common and effective approach is to stop RDS instances during off-hours automatically. In this guide, you’ll learn how to set up an au...]]></description><link>https://devopsofworld.com/automatically-stop-amazon-rds-instances-daily-using-aws-lambda-and-eventbridge</link><guid isPermaLink="true">https://devopsofworld.com/automatically-stop-amazon-rds-instances-daily-using-aws-lambda-and-eventbridge</guid><category><![CDATA[rds]]></category><category><![CDATA[automation]]></category><category><![CDATA[CostSavings]]></category><dc:creator><![CDATA[DevOpsofworld]]></dc:creator><pubDate>Sun, 18 May 2025 11:04:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747500564380/ffaccefc-4368-4ab9-a1ad-b1475272452a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Managing AWS costs efficiently is crucial, especially when dealing with development, testing, or QA environments. A common and effective approach is to <strong>stop RDS instances</strong> during off-hours automatically. In this guide, you’ll learn how to set up an <strong>automated daily RDS shutdown</strong> using <strong>AWS Lambda</strong>, <strong>EventBridge Scheduler</strong>, and <strong>IAM roles</strong>.</p>
<h2 id="heading-solution-overview">Solution Overview</h2>
<h3 id="heading-components-used">🧱 Components Used:</h3>
<ul>
<li><p><strong>AWS Lambda (Python)</strong>: A serverless function that stops RDS instances.</p>
</li>
<li><p><strong>Amazon EventBridge Scheduler</strong>: Triggers the Lambda function once per day.</p>
</li>
<li><p><strong>IAM Role</strong>: Grants permissions for the Lambda to stop RDS instances.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747500635846/0a00cb5b-87ab-4f52-818c-9b61a3ccf6e4.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h3 id="heading-how-it-works">⚙️ How It Works:</h3>
<ol>
<li><p><strong>EventBridge Scheduler</strong> triggers the Lambda function every day.</p>
</li>
<li><p><strong>Lambda function</strong> iterates over a list of RDS instances and attempts to stop each one.</p>
</li>
<li><p>All execution logs are stored in <strong>Amazon CloudWatch Logs</strong> for auditing and debugging.</p>
</li>
</ol>
<h2 id="heading-step-by-step-setup-guide">Step-by-Step Setup Guide</h2>
<h3 id="heading-1-create-the-lambda-function">1️⃣ . Create the Lambda Function</h3>
<ul>
<li><p><strong>Go to</strong> the AWS Lambda Console.</p>
</li>
<li><p><strong>Click</strong> “Create function” → Select <strong>Author from scratch</strong>.</p>
</li>
<li><p><strong>Function Name</strong>: <code>stop-rds-lambda-function</code></p>
</li>
<li><p><strong>Runtime</strong>: Choose <strong>Python 3.13</strong> or above.</p>
</li>
<li><p><strong>Permissions</strong>: Attach an IAM role with the following permissions (a scripted way to attach them is sketched after this list):</p>
<ul>
<li><p><code>rds:StopDBInstance</code></p>
</li>
<li><p><code>rds:DescribeDBInstances</code></p>
</li>
<li><p><code>logs:CreateLogGroup</code></p>
</li>
<li><p><code>logs:CreateLogStream</code></p>
</li>
<li><p><code>logs:PutLogEvents</code></p>
</li>
<li><p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXePZWBdTv__21FXOyAHc3-Jg0TyfJ13OxAjtkakG-tah3w3B_TFq4P7rr4CUTvo4erHsoxm9FfbWHfCe15BbmPG-o9ccD6oHiMBXcNV2ixCT8cT4OQJsxrw9xMdMxIheQvdcTi74g?key=zhdBqEuEsI0vnjepXvtwiw" alt /></p>
</li>
</ul>
</li>
</ul>
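<p>As referenced above, the permissions can also be attached programmatically. A minimal sketch using boto3's IAM client, assuming an inline policy on the function's execution role; the role and policy names are placeholders, and in production you should scope <code>Resource</code> down to specific instance ARNs instead of <code>*</code>:</p>
<pre><code class="lang-python">import json

import boto3

iam = boto3.client('iam')

# Placeholder names -- use the execution role your Lambda actually runs under.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["rds:StopDBInstance", "rds:DescribeDBInstances"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName='stop-rds-lambda-role',
    PolicyName='stop-rds-permissions',
    PolicyDocument=json.dumps(policy),
)
</code></pre>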
<h3 id="heading-add-lambda-code">Add Lambda Code</h3>
<p>Paste the following code in the function editor:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> logging

rds = boto3.client(<span class="hljs-string">'rds'</span>)
logger = logging.getLogger()
logger.setLevel(logging.INFO)

DB_INSTANCES = [<span class="hljs-string">'testing-db-us-east-1-demo'</span>]  <span class="hljs-comment"># Add your DB instance identifiers here</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event, context</span>):</span>
    stopped_instances = []
    failed_instances = []

    <span class="hljs-keyword">for</span> db_id <span class="hljs-keyword">in</span> DB_INSTANCES:
        <span class="hljs-keyword">try</span>:
            logger.info(<span class="hljs-string">f"Attempting to stop DB instance: <span class="hljs-subst">{db_id}</span>"</span>)
            response = rds.stop_db_instance(DBInstanceIdentifier=db_id)
            logger.info(<span class="hljs-string">f"Stop initiated for: <span class="hljs-subst">{db_id}</span>"</span>)
            stopped_instances.append(db_id)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to stop <span class="hljs-subst">{db_id}</span>: <span class="hljs-subst">{str(e)}</span>"</span>)
            failed_instances.append({<span class="hljs-string">'db_id'</span>: db_id, <span class="hljs-string">'error'</span>: str(e)})

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">200</span>,
        <span class="hljs-string">'stopped_instances'</span>: stopped_instances,
        <span class="hljs-string">'failed_instances'</span>: failed_instances
    }
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdhPfrI7pdS5HH59E0zNXafGpu3ulcohqlslMXhDz5j68nq_kC_IUevDKpj_tGyn_-iQbiT_zoTzCDsoQAfw48yq-8-EP-rbfk3rjyXuvFGgrb22rGpL4oUEmhx_eclWGiDU3qUow?key=zhdBqEuEsI0vnjepXvtwiw" alt /></p>
<h3 id="heading-3-create-eventbridge-schedule">3️⃣ . Create EventBridge Schedule</h3>
<ul>
<li><p><strong>Navigate to</strong> Amazon EventBridge → <strong>Scheduler</strong>.</p>
</li>
<li><p>Click <strong>Create schedule</strong>.</p>
</li>
<li><p><strong>Schedule type</strong>: Choose <strong>Rate-based schedule</strong>.</p>
</li>
<li><p><strong>Set Rate</strong>: Every <strong>1 day</strong>.</p>
</li>
<li><p><strong>Target</strong>:</p>
<ul>
<li><p>Choose a <strong>Lambda function</strong>.</p>
</li>
<li><p>Select the Lambda function you created (<code>stop-rds-lambda-function</code>).</p>
</li>
</ul>
</li>
<li><p>Leave the input payload blank (the handler ignores the event).</p>
</li>
<li><p><strong>Click</strong> Next → <strong>Create schedule</strong>.</p>
</li>
</ul>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc1cwgRTSlVBPT-NVhzMJHEozw4HTLefZDexTa4vdUfk2XviMfOIDT79oiSCNefS70cnSBNOvHjYVMDKe0_O4nMB5cYWqpyXbmnCMTGAl-VVMjFpe7r2EDm48HHrSgNqD73B2H3Ag?key=zhdBqEuEsI0vnjepXvtwiw" alt /></p>
<h2 id="heading-testing-the-lambda">✅ Testing the Lambda</h2>
<ul>
<li><p>Go to your Lambda function.</p>
</li>
<li><p>Click <strong>Test</strong> to run it manually.</p>
</li>
<li><p><strong>Check</strong> the Amazon RDS Console to verify that the specified instances have entered the <code>stopping</code> state (they transition to <code>stopped</code> shortly after).</p>
</li>
<li><p>Review <strong>CloudWatch Logs</strong> for detailed output and error handling.</p>
</li>
</ul>
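<p>The same manual test can be run from a terminal with boto3, which also prints the function's summary of stopped and failed instances (the function name matches the one created above):</p>
<pre><code class="lang-python">import json

import boto3

client = boto3.client('lambda')

resp = client.invoke(
    FunctionName='stop-rds-lambda-function',
    Payload=b'{}',  # the handler ignores the event payload
)
print(json.loads(resp['Payload'].read()))  # stopped_instances / failed_instances
</code></pre>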
<h2 id="heading-summary">📌 Summary</h2>
<p>By using AWS Lambda with EventBridge Scheduler, you can <strong>automate daily shutdowns of RDS instances</strong>, reducing unnecessary costs without manual intervention. This is especially helpful for non-production environments. Keep in mind that AWS automatically restarts a stopped RDS instance after seven days, so a daily stop schedule also keeps instances from quietly coming back online.</p>
]]></content:encoded></item></channel></rss>