Production Incident: Node.js App Not Starting After Reboot (PM2)

Context

We were running a Node.js backend using PM2 on a Linux server.

Application details:

Process manager: PM2
Mode: fork
User: root
Deployment: Manual setup on VM
No containerization
No autoscaling

The service was running fine in steady state.

Incident Summary

• Trigger: Server reboot during OS patching

• Impact: Application unavailable for 18 minutes

• Root Cause: PM2 not registered with systemd

• Resolution: Integrated PM2 with systemd and enabled process resurrection

Incident Timeline

The server was restarted as part of routine OS patching.

After the reboot:

The server came up successfully
SSH access was normal
But the backend application was down
API health checks failed
External traffic started returning errors

Running pm2 ls confirmed it:

pm2 ls

The process list was empty. Nothing had started the PM2 daemon at boot, so none of the application processes had been restored.

Impact

Application downtime until manual intervention
No auto-recovery mechanism
Increased MTTR
Hidden operational risk exposed

This pointed to a design gap rather than a one-off fault: nothing guaranteed that the application would start at boot.

Detection

The issue was detected via failed API health checks after reboot.

There was no alert configured to monitor the PM2 daemon state.

Downtime lasted approximately 18 minutes until manual intervention restored the service.
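
One way to close this monitoring gap is a small check that a cron job or monitoring agent could run. The sketch below assumes the unit name pm2-root from this incident; the function name check_pm2_unit and its exit codes are illustrative and not part of PM2 or systemd.

```shell
# Sketch: report whether the PM2 systemd unit is active.
# Assumption: the unit is named pm2-root (PM2 registered for the root user).
check_pm2_unit() {
  local unit="${1:-pm2-root}"
  local state

  # `systemctl is-active` prints the unit state (active, inactive, failed, ...)
  # and exits non-zero when the unit is not active.
  state="$(systemctl is-active "$unit" 2>/dev/null || true)"

  if [ "$state" = "active" ]; then
    echo "OK: $unit is active"
    return 0
  fi

  echo "CRITICAL: $unit is ${state:-unknown}"
  return 2
}
```

Calling check_pm2_unit with no arguments checks pm2-root. The non-zero return code on failure is what an alerting wrapper would key on.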

Root Cause Analysis

PM2 had previously been started manually in an interactive shell session.

It was never registered with systemd.

As a result, it did not start automatically after reboot.

There was:

No systemd integration
No startup registration
No process resurrection configuration

On reboot:

System Boot
  ↓
No PM2 daemon started
  ↓
No Node process started
  ↓
Application Down

This was not a runtime failure.

This was a lifecycle management design failure.

Design Correction

I redesigned the startup flow to align with production expectations.

Goal

Ensure that:

PM2 daemon starts automatically on boot
Saved processes are restored
No manual intervention required

Implementation

Step 1: Register PM2 with systemd

pm2 startup

This generated a systemd unit file:

/etc/systemd/system/pm2-root.service

The unit was then enabled so that systemd starts it at boot:

systemctl enable pm2-root

Step 2: Freeze Running Processes

pm2 save

This wrote the current process list to:

/root/.pm2/dump.pm2

Without this file, pm2 resurrect has nothing to restore, so the application would not come back after a reboot.
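
Steps 1 and 2 only help if both are in place, so a quick preflight check can be useful before the next reboot. This is a minimal sketch assuming the unit name and dump path shown above; the function name pm2_boot_ready is hypothetical.

```shell
# Sketch: confirm both halves of the boot path are in place.
# Assumptions: unit name pm2-root and dump path /root/.pm2/dump.pm2.
pm2_boot_ready() {
  local unit="${1:-pm2-root}"
  local dump="${2:-/root/.pm2/dump.pm2}"
  local ok=0

  # The unit must be enabled, or systemd will not start it at boot.
  if ! systemctl is-enabled --quiet "$unit" 2>/dev/null; then
    echo "FAIL: $unit is not enabled"
    ok=1
  fi

  # The dump file must exist and be non-empty, or there is nothing to resurrect.
  if [ ! -s "$dump" ]; then
    echo "FAIL: $dump is missing or empty"
    ok=1
  fi

  [ "$ok" -eq 0 ] && echo "OK: $unit enabled and $dump present"
  return "$ok"
}
```

The file test only checks that the dump is non-empty. It does not verify that the saved list contains the expected applications, so pm2 ls after a test reboot is still the real confirmation.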

Step 3: Ensure systemd Controls PM2

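Initially, checking the unit:

systemctl status pm2-root

showed:

Active: inactive (dead)

This meant the running PM2 daemon was still the one launched from the shell, not one started by systemd.

This was corrected by stopping that daemon and starting PM2 through systemd. Note that pm2 kill also stops every process the daemon manages, so the application restarts briefly: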

pm2 kill
systemctl start pm2-root

After this, systemctl status pm2-root reported:

Active: active (running)
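
To check ownership without reading status output by eye, the PID that systemd tracks for the unit can be compared with the PID the PM2 daemon records for itself. This is a sketch under two assumptions: the unit is pm2-root, and PM2 writes its daemon PID to /root/.pm2/pm2.pid. The function name is hypothetical.

```shell
# Sketch: check that the running PM2 daemon is the one systemd is tracking.
# Assumptions: unit pm2-root; PM2 daemon PID file at /root/.pm2/pm2.pid.
pm2_owned_by_systemd() {
  local unit="${1:-pm2-root}"
  local pidfile="${2:-/root/.pm2/pm2.pid}"
  local main_pid daemon_pid

  # MainPID is 0 when the unit has no running main process.
  main_pid="$(systemctl show --property MainPID --value "$unit" 2>/dev/null || true)"
  daemon_pid="$(cat "$pidfile" 2>/dev/null || true)"

  if [ -n "$daemon_pid" ] && [ "$main_pid" = "$daemon_pid" ]; then
    echo "OK: PM2 daemon (PID $daemon_pid) is managed by $unit"
    return 0
  fi

  echo "WARN: $unit MainPID=${main_pid:-none}, PM2 daemon PID=${daemon_pid:-none}"
  return 1
}
```

A mismatch here corresponds to the inactive (dead) situation above: a PM2 daemon exists, but systemd did not start it.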

Final Boot Flow (Production Aligned)

System Boot
   ↓
systemd
   ↓
pm2-root.service
   ↓
pm2 resurrect
   ↓
Node Application Starts

Validation

Server reboot was performed.

Post-reboot validation:

pm2 ls
systemctl status pm2-root

Result:

The application started automatically after the reboot with no manual intervention, reducing MTTR for reboot-related events to near zero.
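
Right after a reboot the application may need a few seconds before it answers, so a post-reboot check is easier to trust if it retries. The helper below is a generic sketch, not something from the original setup; the name retry_until and the health URL in the usage note are assumptions.

```shell
# Sketch: run a command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1

  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done

  return 1
}
```

For example, retry_until 10 3 systemctl is-active --quiet pm2-root waits up to roughly 30 seconds for the unit, and retry_until 10 3 curl -fsS http://localhost:3000/health would do the same for a health endpoint, where the port and path are placeholders.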

Rollback Strategy

If the systemd integration had failed, the plan was to:

• Disable pm2-root service

• Manually start PM2 using pm2 start

• Validate application health endpoint

• Restore previous working state

This ensured there was a recovery path during configuration changes.
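
The rollback steps could be scripted roughly as follows. This is a sketch, not the procedure that was actually run: the function name, the application entry point, and the health URL are all placeholders passed in by the caller.

```shell
# Sketch of the rollback path described above.
# Assumptions: all three arguments are supplied by the operator;
# the entry point and health URL are hypothetical placeholders.
rollback_pm2_systemd() {
  local unit="$1" entry="$2" health_url="$3"

  # Stop the unit and prevent it from starting at the next boot.
  systemctl disable --now "$unit" || return 1

  # Start the application manually under a shell-launched PM2 daemon.
  pm2 start "$entry" || return 1

  # Validate the health endpoint; -f makes curl fail on HTTP errors.
  curl -fsS "$health_url" >/dev/null || return 1

  echo "Rollback complete: $unit disabled, app started manually"
}
```

Each step returns early on failure so a broken rollback is not reported as complete. After this runs, the server is back in the pre-fix state and will again not recover from a reboot on its own.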

Preventive Measures

• Standardized server bootstrap process to register PM2 with systemd

• Added reboot validation checklist after OS patching

• Integrated service state checks into monitoring alerts

• Planned migration to a dedicated service user

• Documented lifecycle management requirements

Risks Identified

1. Running as Root

PM2 was configured under root.

Risk:

Larger blast radius in case of compromise
Violates the principle of least privilege

Future improvement:

Dedicated service user
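
As a sketch of what that migration might look like, pm2 startup accepts a target user and home directory. The wrapper below assumes it is run as root and that the user already exists; the function name and the example user nodeapp are hypothetical.

```shell
# Sketch: register PM2 at boot for a dedicated, non-root user.
# Assumptions: run as root; the user and home directory already exist.
register_pm2_for_user() {
  local user="$1" home="$2"

  if [ -z "$user" ] || [ -z "$home" ]; then
    echo "usage: register_pm2_for_user <user> <home-dir>" >&2
    return 64
  fi

  # -u selects the user the unit runs as; --hp sets that user's home,
  # under which PM2 keeps its .pm2 directory.
  pm2 startup systemd -u "$user" --hp "$home"
}
```

For example, register_pm2_for_user nodeapp /home/nodeapp. The application would then need to be started and saved with pm2 save as that user, so that the dump file lands under its home directory and not under /root.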

2. Using NVM for Node

The PATH in the generated systemd unit contained:

/root/.nvm/versions/node/...

Risk:

Node version changes may break startup
NVM is not ideal for production servers

Better design:

Install Node system-wide, outside NVM
Pin the Node version
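
A quick way to spot this dependency on other servers is to search the unit definition for an NVM path. This is a sketch assuming the unit name pm2-root; the function name is hypothetical.

```shell
# Sketch: detect whether a systemd unit references an NVM-managed path.
# Returns 0 when an NVM path is found, 1 otherwise.
unit_uses_nvm() {
  local unit="${1:-pm2-root}"

  # `systemctl cat` prints the unit file as systemd sees it.
  if systemctl cat "$unit" 2>/dev/null | grep -q '/\.nvm/'; then
    echo "WARN: $unit references an NVM-managed path"
    return 0
  fi

  echo "OK: $unit does not reference NVM"
  return 1
}
```

If the check warns, switching or removing the NVM-managed Node version can leave the unit pointing at a path that no longer exists, which is the startup risk described above.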

Production Takeaway

In production systems, every long-running process must be supervised by the system init layer.

Running does not imply lifecycle management.

If the init system does not supervise your process, you do not have a resilient system.

The failure was not due to Node. Not due to PM2. Not due to application code.

It was a lifecycle management design gap.
