Server Watchdog
Monitor and auto-heal remote servers via SSH. Check services, databases, disk, and memory; restart what's down and alert on what's wrong.
Prerequisites
- SSH access to target server (password or key-based)
- expect available locally (for password-based SSH)
- Target server runs PM2, systemd, or Docker for service management
Quick Reference
Check PM2 services
ssh user@host "pm2 list"
ssh user@host "pm2 logs --lines 20 --nostream"
Check MongoDB
# Windows
ssh user@host "net start | findstr MongoDB"
ssh user@host "powershell -Command \"(Test-NetConnection -ComputerName 127.0.0.1 -Port 27017).TcpTestSucceeded\""
# Linux
ssh user@host "systemctl status mongod"
ssh user@host "mongosh --eval 'db.runCommand({ping:1})' --quiet"
Check disk & memory
# Linux
ssh user@host "df -h && free -h"
# Windows
ssh user@host "powershell -Command \"Get-PSDrive -PSProvider FileSystem | Select Root,Used,Free; \$os=Get-CimInstance Win32_OperatingSystem; Write-Output ('RAM: '+[math]::Round((\$os.TotalVisibleMemorySize-\$os.FreePhysicalMemory)/1MB,1)+'GB / '+[math]::Round(\$os.TotalVisibleMemorySize/1MB,1)+'GB')\""
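The Linux disk check can be turned into a pass/fail test. A minimal sketch, assuming a hypothetical helper named disk_over_threshold and a 90% limit chosen only for illustration. It reads `df -P` output on stdin, since `-P` gives stable POSIX columns:

```shell
# Hypothetical helper: print mount points whose usage exceeds a limit.
# Reads `df -P` output on stdin; returns 0 if any mount is over the limit,
# 1 otherwise. Mount points containing spaces are not handled.
disk_over_threshold() {
  limit="${1:-90}"   # percent; 90 is an illustrative default
  awk -v limit="$limit" '
    NR > 1 {                        # skip the header row
      use = $5; sub(/%/, "", use)   # column 5 is the capacity, e.g. "42%"
      if (use + 0 > limit) { print $6 " " $5; found = 1 }
    }
    END { exit found ? 0 : 1 }
  '
}
```

Possible usage: `ssh user@host "df -P" | disk_over_threshold 90 && echo "disk alert"`.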
Workflow
- Diagnose: SSH in, check service status, logs, disk, memory
- Identify: Parse logs for errors, crashes, OOM, or unclean shutdowns
- Fix: Restart crashed services (pm2 restart, net start, systemctl restart)
- Verify: Confirm service is back up and responding
- Alert: Notify user via messaging with summary
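The steps above can be sketched as a single shell function for a systemd-managed service. This is an illustration under assumptions, not the watchdog's actual implementation: run_remote, notify, heal_service, WATCHDOG_TARGET, and WATCHDOG_SETTLE_SECONDS are hypothetical names, and the remote user is assumed to have passwordless sudo.

```shell
# Hypothetical wrapper around ssh; override it to add options or for testing.
run_remote() { ssh "$WATCHDOG_TARGET" "$@"; }

# Placeholder alert channel; replace with your messaging integration.
notify() { printf 'ALERT: %s\n' "$*"; }

# Diagnose, identify, fix, verify, and alert for one systemd unit.
heal_service() {
  svc="$1"
  # Diagnose: is-active exits 0 only when the unit is running.
  if run_remote "systemctl is-active --quiet $svc"; then
    return 0
  fi
  # Identify: keep the most recent journal line for the alert.
  cause=$(run_remote "journalctl -u $svc -n 20 --no-pager" 2>/dev/null | tail -n 1)
  # Fix, then give the service a moment before verifying.
  run_remote "sudo systemctl restart $svc"
  sleep "${WATCHDOG_SETTLE_SECONDS:-5}"
  # Verify and alert.
  if run_remote "systemctl is-active --quiet $svc"; then
    notify "$svc was down and has been restarted. Last log line: $cause"
  else
    notify "$svc is still down after restart. Last log line: $cause"
    return 1
  fi
}
```

Possible usage: `WATCHDOG_TARGET=user@host heal_service mongod`. For PM2 or Windows targets, the status and restart commands would need to be swapped for their equivalents from the Quick Reference.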
Crash Analysis
When a service is down, check these in order:
- Service logs: pm2 logs, journalctl -u service, Windows Event Log
- Application logs: Check log files at configured paths
- System events: OOM killer, unexpected shutdowns, disk full
- Database logs: MongoDB: check mongod.log for Fatal ("s":"F") entries
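The database-log step can be sketched as a fixed-string grep over a local copy of mongod.log. The helper name scan_mongod_log and its labels are hypothetical; the search strings are the fatal marker plus the three messages listed under MongoDB crash patterns.

```shell
# Hypothetical helper: report which crash patterns appear in a mongod log.
# Prints "label: count" per matching pattern; returns 1 when none match.
scan_mongod_log() {
  log="$1"
  found=1
  while IFS='|' read -r pattern label; do
    # -F treats the pattern as a literal string, not a regex.
    if grep -Fq -- "$pattern" "$log"; then
      printf '%s: %s\n' "$label" "$(grep -Fc -- "$pattern" "$log")"
      found=0
    fi
  done <<'EOF'
"s":"F"|fatal
Unhandled exception|unhandled-exception
Detected unclean shutdown|unclean-shutdown
WiredTiger error|wiredtiger-error
EOF
  return "$found"
}
```

Possible usage, assuming the log lives at /var/log/mongodb/mongod.log on the target (the path depends on the MongoDB configuration): `ssh user@host "cat /var/log/mongodb/mongod.log" > /tmp/mongod.log && scan_mongod_log /tmp/mongod.log`.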
MongoDB crash patterns
"s":"F" — Fatal error (crash)
"Unhandled exception" — Internal bug (often FTDC related)
"Detected unclean shutdown" — Process killed without graceful shutdown
"WiredTiger error" — Storage engine corruption
Auto-Heal Recipes
PM2 service restart
pm2 restart <service-name>
pm2 save # save the process list; restored at boot if pm2 startup is configured
MongoDB (Windows)
net stop MongoDB
timeout /t 5
net start MongoDB
MongoDB (Linux)
sudo systemctl restart mongod
Deploy watchdog service
For persistent monitoring, deploy the included watchdog script:
- Copy scripts/mongodb-watchdog.js to target server
- Install: npm init -y && npm install mongodb
- Start: pm2 start mongodb-watchdog.js --name mongodb-watchdog
- Save: pm2 save
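The deployment steps can be collected into one helper that prints the commands for review before anything is run. A sketch only: watchdog_deploy_commands and the remote directory name are hypothetical, and `mkdir -p` assumes a Unix-like target with key-based SSH.

```shell
# Hypothetical helper: print the deploy commands for a target so they can be
# reviewed first. Nothing is executed by this function.
watchdog_deploy_commands() {
  target="$1"                     # e.g. user@host
  dir="${2:-mongodb-watchdog}"    # assumed remote directory, relative to $HOME
  printf 'ssh %s "mkdir -p %s"\n' "$target" "$dir"
  printf 'scp scripts/mongodb-watchdog.js %s:%s/\n' "$target" "$dir"
  printf 'ssh %s "cd %s && npm init -y && npm install mongodb && pm2 start mongodb-watchdog.js --name mongodb-watchdog && pm2 save"\n' "$target" "$dir"
}
```

Possible usage: run `watchdog_deploy_commands user@host`, check the three printed commands, then paste them into a terminal. Piping the output straight into `sh` is not recommended, because ssh would read the remaining commands from stdin.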
SSH with password (via expect)
When key-based auth isn't available:
expect -c 'set timeout 20
spawn ssh -o StrictHostKeyChecking=no user@host "COMMAND"
expect {
"password:" { send "PASSWORD\r"; exp_continue }
eof
}
'
Alert Template
🚨 Server Alert — [hostname]
⏰ Time: [timestamp]
❌ Issue: [service] is DOWN
📋 Cause: [crash reason from logs]
🔄 Action: Auto-restarted [service]
✅ Status: [service] is back online
📊 System Health:
• Memory: X GB / Y GB
• Disk: Z% used
• Services: N/N online
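The bracketed placeholders in the template can be filled with sed. A minimal sketch, assuming the template is supplied on stdin and every value is a single line; render_alert and sed_escape are hypothetical names, and the X, Y, Z, and N fields under System Health are left for the caller to fill in.

```shell
# Escape characters that are special in a sed replacement when "|" is the
# delimiter: ampersand, pipe, and backslash.
sed_escape() { printf '%s' "$1" | sed 's/[&|\\]/\\&/g'; }

# Hypothetical helper: replace [hostname], [timestamp], [service], and
# [crash reason from logs] in a template read from stdin.
render_alert() {
  host=$(sed_escape "$1")
  service=$(sed_escape "$2")
  cause=$(sed_escape "$3")
  ts=$(sed_escape "${4:-$(date '+%Y-%m-%d %H:%M:%S')}")
  sed -e "s|\[hostname\]|$host|g" \
      -e "s|\[timestamp\]|$ts|g" \
      -e "s|\[service\]|$service|g" \
      -e "s|\[crash reason from logs\]|$cause|g"
}
```

Possible usage, with the template saved to a file named alert-template.txt (an assumed filename): `render_alert db01 mongod "Detected unclean shutdown" < alert-template.txt`.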