Production Incident: Node.js Application Did Not Start After Server Reboot (PM2 + systemd Fix)

Context

We were running a Node.js backend using PM2 on a Linux server.

Application details:

  • Process manager: PM2

  • Mode: fork

  • User: root

  • Deployment: Manual setup on VM

  • No containerization

  • No autoscaling

The service was running fine in steady state.


Incident Summary

• Trigger: Server reboot during OS patching

• Impact: Application unavailable for 18 minutes

• Root Cause: PM2 not registered with systemd

• Resolution: Integrated PM2 with systemd and enabled process resurrection

Incident Timeline

The server was restarted as part of routine OS patching.

After the reboot:

  • The server came up successfully

  • SSH access was normal

  • But the backend application was down

  • API health checks failed

  • External traffic started returning errors

Running:

pm2 ls

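It returned an empty process list. The PM2 daemon had not been started at boot, so nothing had launched the application; running pm2 ls simply spawned a fresh daemon with no processes to manage.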
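Because most pm2 commands, including pm2 ls, spawn a new daemon when none is running, a check that never calls the pm2 CLI can help when diagnosing this state. The following is a minimal sketch, assuming the default PM2 home of $HOME/.pm2 and that the daemon records its PID in pm2.pid inside that directory. The function name pm2_daemon_running is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a PM2 daemon appears to be alive, without invoking pm2
# (which would spawn a daemon as a side effect).
# Assumption: the daemon writes its PID to <PM2 home>/pm2.pid.
pm2_daemon_running() {
    pm2_home="${1:-${PM2_HOME:-$HOME/.pm2}}"
    pid_file="$pm2_home/pm2.pid"

    # No readable PID file means no daemon has recorded itself here.
    [ -r "$pid_file" ] || return 1

    pid="$(cat "$pid_file")"
    # Reject empty or non-numeric content.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Signal 0 only tests whether the process exists and can be signalled.
    kill -0 "$pid" 2>/dev/null
}
```

Usage: pm2_daemon_running /root/.pm2 && echo "daemon up" || echo "daemon down". A stale PID file whose number has been reused by another process would produce a false positive, so treat the result as a hint rather than proof.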


Impact

  • Application downtime until manual intervention

  • No auto-recovery mechanism

  • Increased MTTR

  • Hidden operational risk exposed

This exposed a design gap.


Detection

The issue was detected via failed API health checks after reboot.

There was no alert configured to monitor the PM2 daemon state.

Downtime lasted approximately 18 minutes until manual intervention restored the service.
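The missing alert on PM2 state could be covered by a small check that looks at systemd's view of the unit. This is a sketch only: the function name pm2_unit_alert_level is hypothetical, and the 0/1/2 exit codes follow the common OK/WARNING/CRITICAL convention used by Nagios-style monitoring, which is an assumption about the monitoring stack rather than something described above.

```shell
#!/bin/sh
# Sketch: turn systemd's state for pm2-root into a monitoring result.
#   $1 = output of: systemctl is-active pm2-root   (active, inactive, failed, ...)
#   $2 = output of: systemctl is-enabled pm2-root  (enabled, disabled, ...)
# Exit codes (assumed convention): 0 = OK, 1 = WARNING, 2 = CRITICAL.
pm2_unit_alert_level() {
    active_state="$1"
    enabled_state="$2"

    if [ "$active_state" != "active" ]; then
        echo "CRITICAL: pm2-root is not active (is-active: $active_state)"
        return 2
    fi
    if [ "$enabled_state" != "enabled" ]; then
        # Running now, but it would not come back after the next reboot.
        echo "WARNING: pm2-root is active but not enabled at boot (is-enabled: $enabled_state)"
        return 1
    fi
    echo "OK: pm2-root is active and enabled"
    return 0
}

# Example wiring on a real host (commented out so the sketch has no side effects):
# pm2_unit_alert_level "$(systemctl is-active pm2-root)" "$(systemctl is-enabled pm2-root)"
```

The WARNING branch is the one that would have caught this incident ahead of time: a service that is up but not registered to start at boot.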

Root Cause Analysis

PM2 had previously been started manually in an interactive shell session.

It was never registered with systemd.

As a result, it did not start automatically after reboot.

There was:

  • No systemd integration

  • No startup registration

  • No process resurrection configuration

On reboot:

System Boot
  ↓
No PM2 daemon started
  ↓
No Node process started
  ↓
Application Down

This was not a runtime failure.

This was a lifecycle management design failure.


Design Correction

I redesigned the startup flow to align with production expectations.

Goal

Ensure that:

  • PM2 daemon starts automatically on boot

  • Saved processes are restored

  • No manual intervention required


Implementation

Step 1: Register PM2 with systemd

pm2 startup

This generated a systemd unit file:

/etc/systemd/system/pm2-root.service

And enabled it:

systemctl enable pm2-root

Step 2: Freeze Running Processes

pm2 save

This created:

/root/.pm2/dump.pm2

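Without this file, pm2 resurrect has nothing to restore and the daemon comes up with an empty process list. The file is a snapshot, so pm2 save must be run again whenever the set of managed processes changes.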
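Since resurrection depends entirely on the dump file, it may be worth checking it before any planned reboot. A minimal sketch, assuming the dump is a JSON array of process definitions so that an empty save looks like []. The function name pm2_dump_usable is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if the PM2 dump file exists, is non-empty,
# and is not just an empty JSON array.
# Assumption: dump.pm2 holds a JSON array of saved processes.
pm2_dump_usable() {
    dump_file="${1:-${PM2_HOME:-$HOME/.pm2}/dump.pm2}"

    # -s is true only for an existing file with size greater than zero.
    [ -s "$dump_file" ] || return 1

    # Strip all whitespace so "[ ]" and "[]\n" compare the same way.
    compact="$(tr -d '[:space:]' < "$dump_file")"
    [ "$compact" != "[]" ]
}
```

Usage: pm2_dump_usable /root/.pm2/dump.pm2 || echo "run pm2 save before rebooting".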


Step 3: Ensure systemd Controls PM2

Initially:

systemctl status pm2-root

Showed:

inactive (dead)

This meant the running PM2 daemon was still the one launched from the interactive shell, not one started by systemd.

Corrected by:

pm2 kill
systemctl start pm2-root

Now:

Active: active (running)
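One way to confirm that the live daemon really is the one systemd started is to compare the unit's main PID with the PID the daemon recorded. This is a sketch under two assumptions: that the generated unit tracks the daemon through a PID file, and that the daemon writes pm2.pid into its home directory. The function name pm2_supervised_by_systemd is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if systemd's main PID for the unit matches
# the PID recorded by the PM2 daemon.
#   $1 = main PID as reported by systemd, for example:
#        systemctl show -p MainPID --value pm2-root
#   $2 = path to the daemon's PID file (defaults to <PM2 home>/pm2.pid)
pm2_supervised_by_systemd() {
    main_pid="$1"
    pid_file="${2:-${PM2_HOME:-$HOME/.pm2}/pm2.pid}"

    [ -r "$pid_file" ] || return 1
    daemon_pid="$(cat "$pid_file")"

    # systemd reports MainPID=0 when the unit has no running main process,
    # which is what a shell-started daemon looks like from its side.
    [ -n "$main_pid" ] && [ "$main_pid" != "0" ] && [ "$main_pid" = "$daemon_pid" ]
}
```

Usage: pm2_supervised_by_systemd "$(systemctl show -p MainPID --value pm2-root)" || echo "PM2 is not under systemd control".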

Final Boot Flow (Production Aligned)

System Boot
   ↓
systemd
   ↓
pm2-root.service
   ↓
pm2 resurrect
   ↓
Node Application Starts

Validation

The server was rebooted to validate the fix.

Post-reboot validation:

pm2 ls
systemctl status pm2-root

Result:

  • The application started automatically

  • No manual intervention was required

  • MTTR for reboot-related events dropped to near zero
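Right after boot the application may need a few seconds before it answers, so a post-reboot check is more reliable when it retries. A small sketch of a generic retry helper follows. The helper name retry_until and the health URL in the usage comment are hypothetical, since the actual endpoint is not given here.

```shell
#!/bin/sh
# Sketch: run a command up to N times, pausing between attempts.
#   $1 = maximum number of attempts
#   $2 = delay in seconds between attempts
#   remaining arguments = the command to run
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until() {
    attempts="$1"
    delay="$2"
    shift 2

    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        # Do not sleep after the final failed attempt.
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    return 1
}

# Example post-reboot validation (hypothetical URL, commented out):
# retry_until 10 3 systemctl is-active --quiet pm2-root
# retry_until 10 3 curl -fsS http://localhost:3000/health
```

Checking the unit first and the endpoint second separates "PM2 did not come up" from "PM2 came up but the application is unhealthy".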


Rollback Strategy

If the systemd integration failed:

• Disable pm2-root service

• Manually start PM2 using pm2 start

• Validate application health endpoint

• Restore previous working state

This ensured there was a recovery path during configuration changes.
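The rollback steps above can be sketched as a script with a dry-run switch, so the sequence can be reviewed before it touches a live server. The app target and health URL are hypothetical parameters, and the names run and rollback_pm2_systemd are illustrative.

```shell
#!/bin/sh
# Sketch: rollback from the systemd integration to a manually started PM2.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

#   $1 = what to hand to pm2 start (script or ecosystem file; hypothetical)
#   $2 = health endpoint to validate afterwards (hypothetical)
rollback_pm2_systemd() {
    app_target="$1"
    health_url="$2"

    # Stop the unit and remove it from the boot sequence in one step.
    run systemctl disable --now pm2-root || return 1
    # Bring the application back the way it ran before the change.
    run pm2 start "$app_target" || return 1
    # Fail the rollback if the application does not answer.
    run curl -fsS "$health_url" || return 1
}
```

Usage: DRY_RUN=1 rollback_pm2_systemd app.js http://localhost:3000/health prints the three commands in order without running them.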

Preventive Measures

• Standardized server bootstrap process to register PM2 with systemd

• Added reboot validation checklist after OS patching

• Integrated service state checks into monitoring alerts

• Planned migration to a dedicated service user

• Documented lifecycle management requirements

Risks Identified

1. Running as Root

PM2 was configured under root.

Risk:

  • Larger blast radius in case of compromise

  • Violates the principle of least privilege

Future improvement:

  • Dedicated service user

2. Using NVM for Node

The PATH in the generated systemd unit's environment contained:

/root/.nvm/versions/node/...

Risk:

  • The path embeds the Node version, so switching or removing that version under NVM can break startup

  • NVM is not ideal for production servers

Better design:

  • Install Node system-wide, outside any user's home directory

  • Pin the Node version
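A bootstrap or audit script could flag this dependency automatically. A minimal sketch, assuming an NVM-managed binary is recognisable by a .nvm directory somewhere in its path; the function name node_path_uses_nvm is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the given path points inside an NVM installation.
# Assumption: NVM-managed binaries live under a directory named .nvm.
node_path_uses_nvm() {
    case "$1" in
        */.nvm/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example checks on a real host (commented out):
# node_path_uses_nvm "$(command -v node)" && echo "node comes from NVM"
# node_path_uses_nvm "$(systemctl show -p Environment --value pm2-root)" \
#     && echo "pm2-root unit depends on an NVM path"
```

The second check looks at the unit's environment rather than the current shell, which matters because the two can differ.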


Production Takeaway

In production systems, every long-running process must be supervised by the system init layer.

Running does not imply lifecycle management.

If the init system does not supervise your process, you do not have a resilient system.

The failure was not due to Node. Not due to PM2. Not due to application code.

It was a lifecycle management design gap.
