Server Watchdog

Monitor and auto-heal remote servers over SSH. Check services, databases, disk, and memory; restart what is down and alert on what is wrong.

Prerequisites

SSH access to target server (password or key-based)
expect available locally (for password-based SSH)
Target server runs PM2, systemd, or Docker for service management

Quick Reference

Check PM2 services

ssh user@host "pm2 list"
ssh user@host "pm2 logs --lines 20 --nostream"

Check MongoDB

# Windows
ssh user@host "net start | findstr MongoDB"
ssh user@host "powershell -Command \"(Test-NetConnection -ComputerName 127.0.0.1 -Port 27017).TcpTestSucceeded\""

# Linux
ssh user@host "systemctl status mongod"
ssh user@host "mongosh --eval 'db.runCommand({ping:1})' --quiet"

Check disk & memory

# Linux
ssh user@host "df -h && free -h"
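
To turn the Linux disk check into a pass/fail signal, the output can be filtered against a usage threshold. The following is a minimal sketch, not part of the skill itself: `disk_over_threshold` is a hypothetical helper name, the 90% default is an arbitrary assumption, and it reads `df -P` (POSIX output format) rather than `df -h` so each filesystem stays on one line. Mount points containing spaces are not handled.

```shell
# Sketch: flag filesystems whose usage meets or exceeds a threshold.
# Reads `df -P` output on stdin; prints "<mount> <percent>" per offender.
# disk_over_threshold is a hypothetical helper, not part of any tool.
disk_over_threshold() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)              # "87%" -> "87"
            if (pct + 0 >= limit) print $6, pct
        }'
}

# Usage (assumes key-based SSH to a Linux host):
#   ssh user@host "df -P" | disk_over_threshold 90
```

No output means every filesystem is below the threshold.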

# Windows
ssh user@host "powershell -Command \"Get-PSDrive -PSProvider FileSystem | Select Root,Used,Free; \$os=Get-CimInstance Win32_OperatingSystem; Write-Output ('RAM: '+[math]::Round((\$os.TotalVisibleMemorySize-\$os.FreePhysicalMemory)/1MB,1)+'GB / '+[math]::Round(\$os.TotalVisibleMemorySize/1MB,1)+'GB')\""

Workflow

Diagnose — SSH in, check service status, logs, disk, memory
Identify — Parse logs for errors, crashes, OOM, or unclean shutdowns
Fix — Restart crashed services (pm2 restart, net start, systemctl restart)
Verify — Confirm service is back up and responding
Alert — Notify user via messaging with summary
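
The steps above can be sketched as a single shell function. This is an illustration under stated assumptions, not the skill's actual implementation: `heal_service` and `notify` are hypothetical names, the check and restart commands are passed in as strings, only one restart is attempted, and the Identify step (log parsing) is left out for brevity.

```shell
# Sketch of the diagnose -> fix -> verify -> alert loop for one service.
# heal_service and notify are hypothetical names used only in this example.
notify() {
    # Placeholder: replace with your messaging integration.
    printf '%s\n' "$1"
}

heal_service() {
    local name="$1" check_cmd="$2" restart_cmd="$3"

    # Diagnose: a zero exit status from the check means healthy.
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        return 0
    fi

    # Fix: attempt a single restart.
    sh -c "$restart_cmd" >/dev/null 2>&1

    # Verify: rerun the same check, then alert with the outcome.
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        notify "$name was DOWN; restarted and back online"
        return 0
    fi
    notify "$name is DOWN; restart did not recover it"
    return 1
}

# Usage (assumes key-based SSH and a systemd-managed mongod):
#   heal_service mongod \
#       'ssh user@host "systemctl is-active --quiet mongod"' \
#       'ssh user@host "sudo systemctl restart mongod"'
```

Passing the commands as strings keeps the same function usable for PM2, Windows services, and systemd units.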

Crash Analysis

When a service is down, check these in order:

Service logs — pm2 logs, journalctl -u service, Windows Event Log
Application logs — Check log files at configured paths
System events — OOM killer, unexpected shutdowns, disk full
Database logs — MongoDB: check mongod.log for Fatal ("s":"F") entries
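
The database-log check could be automated by counting the strings listed under MongoDB crash patterns. The sketch below assumes the structured JSON log format, where the severity appears as `"s":"F"`; `classify_mongod_log` is a hypothetical helper name, and the log path in the usage comment is an assumption that depends on how MongoDB was installed.

```shell
# Sketch: count lines matching each crash pattern in mongod.log text on stdin.
# classify_mongod_log is a hypothetical helper; patterns match as fixed strings.
classify_mongod_log() {
    local log pattern
    log="$(cat)"
    for pattern in '"s":"F"' 'Unhandled exception' \
                   'Detected unclean shutdown' 'WiredTiger error'; do
        # grep -c prints the number of matching lines (0 when none match).
        printf '%s: %s\n' "$pattern" \
            "$(printf '%s\n' "$log" | grep -c -F -- "$pattern")"
    done
}

# Usage (the log path is an assumption; adjust to your systemLog.path):
#   ssh user@host "tail -n 500 /var/log/mongodb/mongod.log" | classify_mongod_log
```

A non-zero count for any pattern is a reason to read the surrounding log lines before restarting.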

MongoDB crash patterns

"s":"F" — Fatal error (crash)
"Unhandled exception" — Internal bug (often FTDC related)
"Detected unclean shutdown" — Process killed without graceful shutdown
"WiredTiger error" — Storage engine corruption

Auto-Heal Recipes

PM2 service restart

pm2 restart <service-name>
pm2 save  # save the process list; restored at boot only if "pm2 startup" is configured

MongoDB (Windows)

net stop MongoDB
powershell -Command "Start-Sleep -Seconds 5"
net start MongoDB

MongoDB (Linux)

sudo systemctl restart mongod

Deploy watchdog service

For persistent monitoring, deploy the included watchdog script:

Copy scripts/mongodb-watchdog.js to target server
Install: npm init -y && npm install mongodb
Start: pm2 start mongodb-watchdog.js --name mongodb-watchdog
Save: pm2 save

SSH with password (via expect)

When key-based auth isn't available:

expect -c 'set timeout 20
spawn ssh -o StrictHostKeyChecking=accept-new user@host "COMMAND"
expect {
    "password:" { send "PASSWORD\r"; exp_continue }
    eof
}
'

Alert Template

🚨 Server Alert — [hostname]

⏰ Time: [timestamp]
❌ Issue: [service] is DOWN
📋 Cause: [crash reason from logs]
🔄 Action: Auto-restarted [service]
✅ Status: [service] is back online

📊 System Health:
• Memory: X GB / Y GB
• Disk: Z% used
• Services: N/N online
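
On Linux targets, the Memory and Disk lines of the System Health block could be filled from `free -m` and `df -P` output. This is a sketch only: `memory_line` and `disk_line` are hypothetical helper names, and since `free -m` reports MiB the printed figures are strictly GiB under the template's "GB" label.

```shell
# Sketch: build the Memory and Disk lines of the System Health block.
# Both function names are hypothetical.
memory_line() {
    # Reads `free -m` output on stdin; uses the "Mem:" row (total=$2, used=$3).
    awk '/^Mem:/ { printf "• Memory: %.1f GB / %.1f GB\n", $3 / 1024, $2 / 1024 }'
}

disk_line() {
    # Reads `df -P <path>` output on stdin; row 2, column 5 is the capacity.
    awk 'NR == 2 { printf "• Disk: %s used\n", $5 }'
}

# Usage (assumes key-based SSH to a Linux host):
#   ssh user@host "free -m" | memory_line
#   ssh user@host "df -P /" | disk_line
```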
