# Production Incident: Node.js Application Did Not Start After Server Reboot (PM2 + systemd Fix)

## Context

We were running a Node.js backend using PM2 on a Linux server.

Application details:

* Process manager: PM2
    
* Mode: fork
    
* User: root
    
* Deployment: Manual setup on VM
    
* No containerization
    
* No autoscaling
    

The service was running fine in steady state.

---

## Incident Summary

* Trigger: Server reboot during OS patching
    
* Impact: Application unavailable for 18 minutes
    
* Root Cause: PM2 not registered with systemd
    
* Resolution: Integrated PM2 with systemd and enabled process resurrection
    

## Incident Timeline

The server was restarted as part of routine OS patching.

After the reboot:

* The server came up successfully
    
* SSH access was normal
    
* But the backend application was down
    
* API health checks failed
    
* External traffic started returning errors
    

Running:

```bash
pm2 ls
```

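It returned an empty process list. Nothing had started the PM2 daemon at boot, so no application processes had been restored; the `pm2 ls` call itself spawned a fresh daemon with nothing to manage.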

---

## Impact

* Application downtime until manual intervention
    
* No auto-recovery mechanism
    
* Increased MTTR
    
* Hidden operational risk exposed
    

This exposed a design gap.

---

## Detection

The issue was detected via failed API health checks after reboot.

There was no alert configured to monitor the PM2 daemon state.

Downtime lasted approximately 18 minutes until manual intervention restored the service.
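
Because no alert watched PM2 itself, a small check along these lines could feed a monitoring system. This is a minimal sketch, not what was deployed: the function names `count_online` and `check_pm2_online` are hypothetical, it relies on `pm2 jlist` (PM2's JSON process listing), and it uses a rough text match where a real JSON parser would be more robust.

```bash
# Count processes that PM2 reports as online.
# Reads the JSON printed by `pm2 jlist` on stdin (rough text match, not a JSON parser).
count_online() {
  # grep exits non-zero when nothing matches; ignore that so a count of 0 is still printed.
  { grep -o '"status"[[:space:]]*:[[:space:]]*"online"' || true; } | wc -l | tr -d '[:space:]'
}

# Nagios-style check: returns 0 when at least one process is online, 2 otherwise.
check_pm2_online() {
  online="$(pm2 jlist 2>/dev/null | count_online)"
  if [ "${online:-0}" -ge 1 ]; then
    echo "OK: $online process(es) online"
    return 0
  fi
  echo "CRITICAL: no online PM2 processes"
  return 2
}
```

Calling `check_pm2_online` from a cron job or a monitoring agent would have flagged the empty process list shortly after the reboot, independently of the API health checks.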

## Root Cause Analysis

PM2 had previously been started manually in an interactive shell session.

It was never registered with systemd.

As a result, it did not start automatically after reboot.

There was:

* No systemd integration
    
* No startup registration
    
* No process resurrection configuration
    

On reboot:

```text
System Boot
  ↓
No PM2 daemon started
  ↓
No Node process started
  ↓
Application Down
```

This was not a runtime failure.

This was a lifecycle management design failure.
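
The gap described above can be detected before a reboot exposes it. The following is a sketch under assumptions: `classify_pm2_supervision` is a hypothetical helper, and the default unit name `pm2-root` is the one `pm2 startup` generates for the root user.

```bash
# Classify how (or whether) systemd supervises PM2.
classify_pm2_supervision() {
  unit="${1:-pm2-root}"
  if ! systemctl is-enabled --quiet "$unit" 2>/dev/null; then
    echo "not-registered"            # unit missing or disabled: nothing starts PM2 on boot
  elif ! systemctl is-active --quiet "$unit" 2>/dev/null; then
    echo "registered-but-inactive"   # enabled for boot, but PM2 is not currently run by systemd
  else
    echo "supervised"                # enabled and currently running under systemd
  fi
}
```

On the server as it was before the fix, this would have printed `not-registered`.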

---

## Design Correction

I redesigned the startup flow to align with production expectations.

### Goal

Ensure that:

* PM2 daemon starts automatically on boot
    
* Saved processes are restored
    
* No manual intervention is required
    

---

## Implementation

### Step 1: Register PM2 with systemd

```bash
pm2 startup
```

This generated a systemd unit file:

```text
/etc/systemd/system/pm2-root.service
```

It also enabled the unit at boot. `pm2 startup` does this itself; the equivalent manual command is:

```bash
systemctl enable pm2-root
```

---

### Step 2: Freeze Running Processes

```bash
pm2 save
```

This created:

```text
/root/.pm2/dump.pm2
```

Without this file, resurrection would not occur.
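
Since resurrection depends entirely on this file, it is worth checking it before trusting a reboot. A minimal sketch, assuming the default dump location shown above; `dump_is_usable` is a hypothetical helper name.

```bash
# Return 0 only if the PM2 dump file exists, is non-empty,
# and holds something other than an empty JSON array.
dump_is_usable() {
  dump="${1:-/root/.pm2/dump.pm2}"
  [ -s "$dump" ] || return 1
  # A dump that is just "[]" would restore nothing on boot.
  [ "$(tr -d '[:space:]' < "$dump")" != "[]" ]
}
```

For example, `dump_is_usable || echo "run pm2 save first"` could be part of a pre-patching checklist.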

---

### Step 3: Ensure systemd Controls PM2

Initially:

```bash
systemctl status pm2-root
```

It showed:

```text
inactive (dead)
```

This meant PM2 was still running from the manually started shell session, not under systemd.

This was corrected by:

```bash
pm2 kill
systemctl start pm2-root
```

After that, `systemctl status pm2-root` reported:

```text
Active: active (running)
```
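
The order of these commands matters, so the handover can be wrapped in one function. This is a sketch rather than the exact procedure used: `handover_to_systemd` is a hypothetical name, and it repeats `pm2 save` defensively before killing the shell-started daemon. Expect a brief outage between the kill and the start.

```bash
# Move PM2 from a shell-started daemon to the systemd-managed unit.
handover_to_systemd() {
  unit="${1:-pm2-root}"
  # Save first: `pm2 kill` stops every managed process, and the unit's
  # `pm2 resurrect` can only restore what is in the dump file.
  pm2 save || { echo "pm2 save failed; aborting before kill" >&2; return 1; }
  pm2 kill || return 1
  systemctl start "$unit" || return 1
  # Final status: 0 only if systemd now reports the unit as active.
  systemctl is-active --quiet "$unit"
}
```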

---

## Final Boot Flow (Production Aligned)

```text
System Boot
   ↓
systemd
   ↓
pm2-root.service
   ↓
pm2 resurrect
   ↓
Node Application Starts
```
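
The link between the unit and `pm2 resurrect` in this flow can be confirmed from the unit file itself. A small sketch, assuming the generated unit starts PM2 with an `ExecStart=` line ending in `pm2 resurrect`; `unit_restores_processes` is a hypothetical helper.

```bash
# Return 0 if the unit's ExecStart line invokes `pm2 resurrect`.
unit_restores_processes() {
  unit="${1:-pm2-root}"
  # `systemctl cat` prints the unit file as systemd sees it.
  systemctl cat "$unit" 2>/dev/null | grep -Eq '^ExecStart=.*pm2 resurrect'
}
```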

---

## Validation

The server was rebooted to validate the change.

Post-reboot validation:

```bash
pm2 ls
systemctl status pm2-root
```

Result:

* Application started automatically
    
* No manual intervention required
    
* MTTR for reboot-related events reduced to near zero
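
Beyond the two commands above, post-reboot validation can also poll the application's health endpoint, since the Node process needs a moment to come up. A minimal sketch: `wait_for_health` and the URL in the usage line are hypothetical, and the attempt count and delay are arbitrary defaults.

```bash
# Poll a health endpoint until curl succeeds or the attempts run out.
# curl -f exits non-zero on HTTP 4xx/5xx as well as on connection errors.
wait_for_health() {
  url="$1"; attempts="${2:-10}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

Usage, with an assumed local endpoint: `wait_for_health http://127.0.0.1:3000/health 10 3`.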
    

---

## Rollback Strategy

If the systemd integration had failed, the recovery path was:

* Disable the `pm2-root` service
    
* Start the application manually with `pm2 start`
    
* Validate the application health endpoint
    
* Restore the previous working state
    

This ensured there was a recovery path during configuration changes.
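
The rollback steps can be sketched as a single function. Everything named here is an assumption for illustration: `rollback_to_manual_pm2`, the application entry file, and the health URL are hypothetical and would need to match the real deployment.

```bash
# Fall back to a manually started PM2 if the systemd unit misbehaves.
rollback_to_manual_pm2() {
  unit="$1"; app="$2"; health_url="$3"
  # --now stops the unit in addition to disabling it for future boots.
  systemctl disable --now "$unit" || return 1
  # Start the application from the shell, as before the change.
  pm2 start "$app" || return 1
  # Give the app a few chances to come up before declaring the rollback failed.
  curl -fsS --max-time 5 --retry 5 --retry-delay 2 --retry-connrefused \
    -o /dev/null "$health_url"
}
```

Usage, with assumed values: `rollback_to_manual_pm2 pm2-root ./ecosystem.config.js http://127.0.0.1:3000/health`.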

## Preventive Measures

* Standardized server bootstrap process to register PM2 with systemd
    
* Added reboot validation checklist after OS patching
    
* Integrated service state checks into monitoring alerts
    
* Planned migration to a dedicated service user
    
* Documented lifecycle management requirements
    

## Risks Identified

### 1\. Running as Root

PM2 was configured under root.

Risk:

* Larger blast radius in case of compromise
    
* Principle of least privilege was violated.
    

Future improvement:

* Dedicated service user
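
As a sketch of what that migration might look like, `pm2 startup` accepts `-u` for the account to run as and `--hp` for that account's home directory. The helper name `register_pm2_for_user` and the example account `nodeapp` are hypothetical; this has not been applied to the server described here.

```bash
# Register PM2 with systemd for a non-root account (run as root).
# Prints the name of the unit that PM2 generates for that user.
register_pm2_for_user() {
  user="$1"
  home="${2:-/home/$user}"
  # -u selects the account PM2 runs as; --hp is the home directory
  # where that account's .pm2 directory (and dump file) lives.
  pm2 startup systemd -u "$user" --hp "$home" || return 1
  echo "pm2-$user"
}
```

For example, `register_pm2_for_user nodeapp` would target a unit named `pm2-nodeapp`. The application would then need to be started and saved with `pm2 save` as that user, so the dump file lands in the user's own `.pm2` directory.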
    

---

### 2\. Using NVM for Node

The `PATH` in the generated systemd unit's environment contained:

```text
/root/.nvm/versions/node/...
```

Risk:

* Node version changes may break startup
    
* NVM is not ideal for production servers
    

Better design:

* Install Node system-wide, outside any user's home directory
    
* Lock version
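
A quick way to spot this risk on other servers is to check where `node` resolves. A minimal sketch with hypothetical helper names `is_nvm_path` and `report_node_location`; it only inspects the path and does not change anything.

```bash
# Return 0 if the given executable path lives inside an NVM directory.
is_nvm_path() {
  case "$1" in
    */.nvm/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report where `node` resolves on PATH and whether NVM manages that location.
report_node_location() {
  node_bin="$(command -v node)" || { echo "node not found on PATH"; return 1; }
  if is_nvm_path "$node_bin"; then
    echo "WARN: $node_bin is under NVM; a version switch can break the systemd unit"
  else
    echo "OK: $node_bin"
  fi
}
```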
    

---

## Production Takeaway

In production systems, every long-running process must be supervised by the system init layer.

Running does not imply lifecycle management.

If the init system does not supervise your process, you do not have a resilient system.

The failure was not due to Node. Not due to PM2. Not due to application code.

It was a lifecycle management design gap.
