Production Incident: Node.js Application Did Not Start After Server Reboot (PM2 + systemd Fix)

Context
We were running a Node.js backend using PM2 on a Linux server.
Application details:
Process manager: PM2
Mode: fork
User: root
Deployment: Manual setup on VM
No containerization
No autoscaling
The service was running fine in steady state.
Incident Summary
• Trigger: Server reboot during OS patching
• Impact: Application unavailable for 18 minutes
• Root Cause: PM2 not registered with systemd
• Resolution: Integrated PM2 with systemd and enabled process resurrection
Incident Timeline
The server was restarted as part of routine OS patching.
After the reboot:
The server came up successfully
SSH access was normal
But the backend application was down
API health checks failed
External traffic started returning errors
Running:
pm2 ls
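It returned an empty process list. The PM2 daemon had not been started at boot, so none of the application processes had been restored. (Any pm2 command spawns a fresh daemon if none is running, which is why the command itself worked but listed nothing.)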
Impact
Application downtime until manual intervention
No auto-recovery mechanism
Increased MTTR
Hidden operational risk that only surfaced on reboot
Taken together, this pointed to a design gap in how the process was started, not a one-off fault.
Detection
The issue was detected via failed API health checks after reboot.
There was no alert configured to monitor the PM2 daemon state.
Downtime lasted approximately 18 minutes until manual intervention restored the service.
Root Cause Analysis
PM2 had previously been started manually in an interactive shell session.
It was never registered with systemd.
As a result, it did not start automatically after reboot.
There was:
No systemd integration
No startup registration
No process resurrection configuration
On reboot:
System Boot
↓
No PM2 daemon started
↓
No Node process started
↓
Application Down
This was not a runtime failure.
This was a lifecycle management design failure.
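The gaps listed above can be checked mechanically before a reboot ever happens. The following is a minimal sketch, not a complete audit: the function name pm2_boot_ready is hypothetical, and the defaults assume the unit name pm2-root and the dump path /root/.pm2/dump.pm2 used in this setup.

```shell
#!/bin/sh
# Hypothetical helper: reports whether PM2 is set up to come back after a reboot.
# Arguments (both optional): systemd unit name, path to the PM2 dump file.
pm2_boot_ready() {
    unit="${1:-pm2-root}"
    dump="${2:-/root/.pm2/dump.pm2}"
    status=0

    # Startup registration: the unit must be enabled to start at boot.
    if ! systemctl is-enabled --quiet "$unit" 2>/dev/null; then
        echo "MISSING: $unit is not enabled in systemd"
        status=1
    fi

    # Resurrection data: the saved process list must exist and be non-empty.
    if [ ! -s "$dump" ]; then
        echo "MISSING: no saved process list at $dump"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "OK: $unit enabled and $dump present"
    return "$status"
}
```

Calling pm2_boot_ready with no arguments on the affected server before the incident would have reported both items as missing.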
Design Correction
We redesigned the startup flow so that it no longer depended on a manual shell session.
Goal
Ensure that:
PM2 daemon starts automatically on boot
Saved processes are restored
No manual intervention required
Implementation
Step 1: Register PM2 with systemd
pm2 startup
This generated a systemd unit file:
/etc/systemd/system/pm2-root.service
And enabled it:
systemctl enable pm2-root
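It is worth reading the generated unit rather than trusting it blindly. The sketch below prints only the lines that decide what runs at boot, as which user, and with which environment. The function name show_pm2_unit is hypothetical, and the exact contents of the unit depend on the PM2 version that generated it.

```shell
#!/bin/sh
# Hypothetical helper: show the key directives of the generated PM2 unit.
# Argument (optional): systemd unit name.
show_pm2_unit() {
    unit="${1:-pm2-root}"
    # `systemctl cat` prints the unit file as systemd sees it;
    # keep only the directives that matter for boot behaviour.
    systemctl cat "$unit" | grep -E '^(User|Environment|PIDFile|ExecStart|WantedBy)='
}
```

The Environment line is the one to look at closely, since it records the PATH that was active when pm2 startup ran.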
Step 2: Freeze Running Processes
pm2 save
This created:
/root/.pm2/dump.pm2
Without this file, pm2 resurrect has nothing to restore, so no application comes back after boot even if the PM2 daemon itself starts.
Step 3: Ensure systemd Controls PM2
Initially:
systemctl status pm2-root
Showed:
inactive (dead)
This meant the running PM2 daemon was still the one started from the shell, not one started by systemd.
Corrected by:
pm2 kill
systemctl start pm2-root
Now:
Active: active (running)
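The handover in Step 3 can be wrapped so that the order is never mixed up. This is a sketch under the assumptions of this setup (unit name pm2-root, PM2 run as root). The function name is hypothetical, and the application is briefly down between pm2 kill and the unit finishing its start, so it belongs in a maintenance window.

```shell
#!/bin/sh
# Hypothetical helper: move a shell-started PM2 daemon under systemd control.
# Argument (optional): systemd unit name.
handover_pm2_to_systemd() {
    unit="${1:-pm2-root}"

    pm2 save || return 1                 # refresh the dump so nothing is lost
    pm2 kill || return 1                 # stop the shell-owned daemon and its apps
    systemctl start "$unit" || return 1  # systemd starts PM2, which restores the dump
    systemctl is-active "$unit"          # prints the state; non-zero unless active
}
```

Each step aborts the sequence on failure, so a failed pm2 save never leads to a pm2 kill with a stale dump.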
Final Boot Flow (Production Aligned)
System Boot
↓
systemd
↓
pm2-root.service
↓
pm2 resurrect
↓
Node Application Starts
Validation
Server reboot was performed.
Post-reboot validation:
pm2 ls
systemctl status pm2-root
Result:
Application started automatically
No manual intervention required
This reduced MTTR for reboot-related events to near zero.
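The manual checks above can be turned into a repeatable post-reboot check that also confirms the application answers. This is only a sketch: the health URL http://localhost:3000/health, the retry count, and the RETRY_DELAY variable are assumptions for illustration, not values from this deployment.

```shell
#!/bin/sh
# Hypothetical post-reboot check.
# Arguments (all optional): unit name, health URL (assumed), number of attempts.
validate_after_reboot() {
    unit="${1:-pm2-root}"
    url="${2:-http://localhost:3000/health}"
    attempts="${3:-10}"

    systemctl is-active --quiet "$unit" || { echo "FAIL: $unit is not active"; return 1; }

    # The app may need a few seconds after PM2 restores it, so poll.
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            echo "OK: $unit active, health check passed on attempt $i"
            return 0
        fi
        i=$((i + 1))
        sleep "${RETRY_DELAY:-3}"
    done
    echo "FAIL: $url did not respond after $attempts attempts"
    return 1
}
```

Checking the health endpoint matters because an active unit only proves that PM2 started, not that the Node process behind it is serving requests.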
Rollback Strategy
If the systemd integration failed:
• Disable pm2-root service
• Manually start PM2 using pm2 start
• Validate application health endpoint
• Restore previous working state
This ensured there was a recovery path during configuration changes.
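The rollback steps can be sketched as a single function. The application entry point /opt/app/server.js and the health URL are hypothetical placeholders, and the function name is an assumption; substitute the real values for the service.

```shell
#!/bin/sh
# Hypothetical rollback to the pre-change state (PM2 started by hand).
# Arguments (all optional): unit name, app entry point (assumed), health URL (assumed).
rollback_pm2_systemd() {
    unit="${1:-pm2-root}"
    app_entry="${2:-/opt/app/server.js}"
    url="${3:-http://localhost:3000/health}"

    # Stop the unit and keep it from starting at the next boot.
    systemctl disable --now "$unit" || echo "WARN: could not disable $unit"

    # Return to the previous state: PM2 started manually from a shell.
    pm2 start "$app_entry" || return 1

    # Confirm the application answers before calling the rollback done.
    curl -fs -o /dev/null --max-time 5 "$url"
}
```

The function returns the result of the health check, so a non-zero exit means the rollback itself needs attention. An application that is slow to start may need a short wait before that final check.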
Preventive Measures
• Standardized server bootstrap process to register PM2 with systemd
• Added reboot validation checklist after OS patching
• Integrated service state checks into monitoring alerts
• Planned migration to a dedicated service user
• Documented lifecycle management requirements
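For the monitoring item, a probe along these lines could be run by whatever scheduler or agent is already in place. It is a sketch only: the function name is hypothetical, and the exit codes follow the common convention of 0 for OK and 2 for critical, which may need adapting to the monitoring tool in use.

```shell
#!/bin/sh
# Hypothetical monitoring probe for the PM2 systemd unit.
# Argument (optional): systemd unit name.
check_pm2_unit() {
    unit="${1:-pm2-root}"
    # is-active prints the state (active, inactive, failed, ...) and
    # exits non-zero unless the unit is active, hence the `|| true`.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    if [ "$state" = "active" ]; then
        echo "OK: $unit is active"
        return 0
    fi
    echo "CRITICAL: $unit is ${state:-unknown}"
    return 2
}
```

A probe like this would have flagged the problem directly, instead of leaving it to be inferred from failing API health checks.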
Risks Identified
1. Running as Root
PM2 was configured under root.
Risk:
Larger blast radius in case of compromise
Principle of least privilege was violated.
Future improvement:
Run PM2 and the application under a dedicated, unprivileged service user
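The planned move away from root could look roughly like this. It is a sketch that has not been applied to this server: the user name nodeapp and its home directory are hypothetical, and the commands must be run as root. pm2 startup accepts -u for the user and --hp for that user's home directory, and names the generated unit after the user (pm2-nodeapp here).

```shell
#!/bin/sh
# Hypothetical migration sketch; run as root.
# Arguments (both optional): service user name (assumed), home directory.
setup_pm2_service_user() {
    user="${1:-nodeapp}"
    home="${2:-/home/$user}"

    # Create the account only if it does not exist yet.
    id "$user" >/dev/null 2>&1 ||
        useradd --system --create-home --home-dir "$home" "$user" ||
        return 1

    # Ask PM2 to generate and enable a unit that runs as that user.
    pm2 startup systemd -u "$user" --hp "$home"
}
```

After this, the application has to be started and saved with pm2 save as that user, so that the dump file lives under the new home directory rather than /root/.pm2.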
2. Using NVM for Node
The PATH recorded in the systemd unit pointed into NVM's directory:
/root/.nvm/versions/node/...
Risk:
Switching or removing the Node version under NVM changes this path and can break startup
NVM is designed for per-user, interactive use and is not ideal for production servers
Better design:
Install Node system-wide, outside NVM
Pin the Node version
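Both points can be checked with a small pre-flight script. This is a sketch with hypothetical function names, and the version string in the usage example is a placeholder rather than the version this service runs.

```shell
#!/bin/sh
# Returns success if the given path points into an NVM-managed directory.
is_nvm_path() {
    case "$1" in
        */.nvm/*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Compares the Node version on PATH with a pinned one.
# Argument: expected version string exactly as printed by `node --version`.
check_node_version() {
    expected="$1"
    actual=$(node --version 2>/dev/null) || actual="not installed"
    if [ "$actual" = "$expected" ]; then
        echo "OK: node $actual"
        return 0
    fi
    echo "FAIL: expected node $expected, found $actual"
    return 1
}
```

A possible use is is_nvm_path "$(command -v node)" && echo "WARN: node comes from NVM", followed by check_node_version with the pinned version, for example v20.11.1 as a placeholder.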
Production Takeaway
In production systems, every long-running process must be supervised by the system init layer.
Running does not imply lifecycle management.
If the init system does not supervise your process, you do not have a resilient system.
The failure was not due to Node. Not due to PM2. Not due to application code.
It was a lifecycle management design gap.






