Monitoring

This page covers the monitoring systems, health checks, and alerting configuration.

Overview

Monitoring covers four layers: system resources, service health, network connectivity, and security events.

Monitoring Components

System Monitoring

Host Resources: CPU, memory, disk usage, network traffic
Docker Metrics: Container resource usage and health
Process Monitoring: Critical process status and performance
Log Monitoring: System and application log analysis

Service Health Checks

HTTP Endpoints: Web service availability and response times
Port Connectivity: Service port accessibility and response
Docker Health: Container health status and restart counts
Application Metrics: Service-specific performance indicators
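
The HTTP endpoint and port checks listed above could be sketched roughly as follows. The function names (`http_status_ok`, `check_http`, `check_port`) and the 5-second timeouts are illustrative assumptions, not part of the documented scripts.

```shell
#!/bin/bash
# Illustrative helpers; names and timeouts are assumptions, not taken from the
# infrastructure scripts described in this document.

# Succeed if an HTTP status code is 2xx or 3xx.
http_status_ok() {
    case "$1" in
        [23][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print "<status> <seconds>" for a URL. curl reports 000 when the request fails.
check_http() {
    local url="$1"
    curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' "$url"
}

# Succeed if a TCP connection to host:port opens within 5 seconds.
# Relies on bash's /dev/tcp pseudo-device.
check_port() {
    local host="$1" port="$2"
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}
```

A caller might read the two fields from `check_http` and pass the first to `http_status_ok` to decide whether an alert is needed.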

Network Monitoring

Connectivity: Internet and inter-host connectivity
DNS Resolution: Domain name resolution functionality
Certificate Status: SSL certificate validity and expiration
Firewall Status: UFW rule effectiveness and security

Alerting System

Telegram Integration

Real-time notifications via Telegram bot for:

Service outages and failures
System resource exhaustion
Security events and intrusions
Backup success/failure status
Certificate expiration warnings

Alert Categories

Critical: Immediate attention required (service down)
Warning: Potential issues (high resource usage)
Info: Operational status updates (backup completion)
Security: Security-related events (login failures)
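
One way to map a resource reading onto these categories is a pair of thresholds. This is a minimal sketch: the function name `classify_usage` is hypothetical, and treating a very high reading as Critical rather than Warning is an assumption, since the categories above only give one example each.

```shell
#!/bin/bash
# classify_usage PERCENT WARN CRIT
# Prints Info, Warning, or Critical for an integer percentage.
# Thresholds are supplied by the caller; none are prescribed by this document.
classify_usage() {
    local pct="$1" warn="$2" crit="$3"
    if [ "$pct" -ge "$crit" ]; then
        echo "Critical"
    elif [ "$pct" -ge "$warn" ]; then
        echo "Warning"
    else
        echo "Info"
    fi
}
```

For example, `classify_usage "$DISK_USAGE" 80 95` would report Warning for a disk that is 85% full.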

Configuration

# Telegram bot configuration
TELEGRAM_BOT_TOKEN="your-bot-token"
TELEGRAM_CHAT_ID="your-chat-id"
TELEGRAM_BOT_URL="https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage?chat_id=${TELEGRAM_CHAT_ID}"

# Usage in scripts
telegram_success "Backup Complete" "$TELEGRAM_BOT_URL" "Tier sync completed successfully" "$NODE_NAME"
telegram_error "Service Down" "$TELEGRAM_BOT_URL" "Service failed to start" "$NODE_NAME"
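
The `telegram_*` helpers are not defined on this page. The sketch below shows one way such helpers might be written, keeping the argument order used above (title, URL, message, node). Everything here is a hypothetical stand-in, including `format_alert`, `send_alert`, and the message layout; the real helpers may differ.

```shell
#!/bin/bash
# Hypothetical stand-ins for the telegram_* helpers used in this document.

# Build the message text: "[LEVEL] title (node): message".
format_alert() {
    local level="$1" title="$2" message="$3" node="$4"
    printf '[%s] %s (%s): %s' "$level" "$title" "$node" "$message"
}

# Send the text through the Bot API sendMessage method. The URL is expected to
# carry chat_id already; -G makes curl append the URL-encoded text to the query.
send_alert() {
    local level="$1" title="$2" url="$3" message="$4" node="$5"
    curl -sS -f -G --max-time 10 -o /dev/null \
        --data-urlencode "text=$(format_alert "$level" "$title" "$message" "$node")" \
        "$url"
}

# Wrappers that keep the call signature shown above.
telegram_success() { send_alert "OK" "$@"; }
telegram_warn()    { send_alert "WARN" "$@"; }
telegram_error()   { send_alert "ERROR" "$@"; }
```

Because the bot token is embedded in the URL, the file that defines these variables is best kept out of version control and readable only by the account that runs the scripts.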

Health Check Procedures

Automated Health Checks

#!/bin/bash
# System health check script
source .scripts/bootstrap.sh

# Check disk space
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    telegram_warn "High Disk Usage" "$TELEGRAM_BOT_URL" "Disk usage: ${DISK_USAGE}%" "$NODE_NAME"
fi

# Check memory usage
MEMORY_USAGE=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
if (( $(echo "$MEMORY_USAGE > 85" | bc -l) )); then
    telegram_warn "High Memory Usage" "$TELEGRAM_BOT_URL" "Memory usage: ${MEMORY_USAGE}%" "$NODE_NAME"
fi

# Check service health
docker-compose ps --services | while read -r service; do
    if ! docker-compose ps "$service" | grep -q "Up"; then
        telegram_error "Service Down" "$TELEGRAM_BOT_URL" "Service $service is not running" "$NODE_NAME"
    fi
done

Manual Health Checks

# Quick system health overview
df -h                    # Disk usage
free -h                  # Memory usage
docker-compose ps        # Service status
ss -tulpn               # Network ports
systemctl status        # System services

# Service-specific checks
curl -I https://domain.com
docker-compose logs --tail=20 service-name
docker stats --no-stream

Performance Monitoring

Resource Metrics

CPU Usage: Per-core and overall utilization
Memory Usage: RAM and swap utilization
Disk I/O: Read/write operations and throughput
Network Traffic: Bandwidth usage and connection counts
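
Overall CPU utilization can be derived from two samples of the kernel counters in `/proc/stat`. The following is a sketch under stated assumptions: the function names are hypothetical, iowait is counted as idle time, and the result is an integer percentage.

```shell
#!/bin/bash
# Hypothetical helpers for sampling overall CPU utilization on Linux.

# Print "<total> <idle>" jiffies from the aggregate "cpu" line of /proc/stat.
# total sums user..steal (fields 2-9); idle is idle + iowait (fields 5 and 6).
read_cpu_counters() {
    awk '/^cpu / {
        total = 0
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        printf "%.0f %.0f\n", total, $5 + $6
    }' /proc/stat
}

# cpu_busy_pct TOTAL1 IDLE1 TOTAL2 IDLE2
# Prints the integer percentage of non-idle time between two samples.
cpu_busy_pct() {
    local dt=$(( $3 - $1 )) di=$(( $4 - $2 ))
    if [ "$dt" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( (dt - di) * 100 / dt ))
}

# Example (not run here): sample twice, one second apart.
#   s1=$(read_cpu_counters); sleep 1; s2=$(read_cpu_counters)
#   cpu_busy_pct $s1 $s2
```

Per-core figures follow the same arithmetic applied to the `cpu0`, `cpu1`, and later lines instead of the aggregate line.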

Service Metrics

Response Times: HTTP request/response latency
Error Rates: Service error frequency and types
Throughput: Request processing capacity
Resource Usage: Per-service CPU and memory consumption

Monitoring Tools

# System resource monitoring
htop                     # Interactive process viewer
iotop                    # Disk I/O monitoring
iftop                    # Network bandwidth monitoring
nethogs                 # Network usage per process

# Docker monitoring
docker stats             # Container resource usage
docker system df         # Docker disk usage
docker system events     # Docker events stream

Log Management

Log Collection

System Logs: journalctl and /var/log/* files
Service Logs: Docker container logs
Application Logs: Service-specific log files
Security Logs: Authentication and firewall logs

Log Analysis

# System log analysis
journalctl -f            # Follow system journal
journalctl -u service    # Service-specific logs
grep -s ERROR /var/log/*    # Error pattern matching (-s hides directory and permission errors)

# Service log analysis
docker-compose logs -f service-name
docker-compose logs --tail=100 service-name | grep ERROR
grep -i fail service.log
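
The grep-based checks above can be turned into a rough error rate by counting matching lines against the total. This is a minimal sketch; `error_rate` is a hypothetical helper, and it assumes that error lines contain the literal string ERROR.

```shell
#!/bin/bash
# error_rate: read log lines on stdin and print "<errors> <total> <percent>",
# where percent is the integer share of lines containing ERROR.
error_rate() {
    awk '/ERROR/ { e++ }
         END { t = NR; printf "%d %d %d\n", e, t, (t ? e * 100 / t : 0) }'
}

# Example (not run here):
#   docker-compose logs --tail=100 service-name | error_rate
```

The percentage is only as meaningful as the sample window, so a fixed `--tail` or `--since` value makes successive readings comparable.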

Log Retention

System Logs: 30-day retention via journalctl
Service Logs: Configurable per service
Backup Logs: Retained with backup data
Security Logs: Extended retention for compliance

Backup Monitoring

Backup Verification

Success Confirmation: Verify backup completion
Integrity Checks: Validate backup file integrity
Restore Testing: Periodic restore verification
Storage Monitoring: Cloud storage usage and limits
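
A local integrity check can be sketched with a checksum manifest written at backup time and verified later or after a test restore. The helper names and the `SHA256SUMS` file name are assumptions, and the commands rely on GNU coreutils and findutils. For comparing a local tier against its remote copy, rclone's `check` command is another option.

```shell
#!/bin/bash
# Hypothetical helpers for checksum-based backup verification.

# write_manifest DIR
# Creates DIR/SHA256SUMS covering every regular file under DIR.
write_manifest() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        find . -type f ! -name SHA256SUMS -print0 | sort -z \
            | xargs -0 -r sha256sum > SHA256SUMS
    )
}

# verify_manifest DIR
# Succeeds only if every listed file still matches its recorded hash.
verify_manifest() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        sha256sum --quiet -c SHA256SUMS
    )
}
```

Running `verify_manifest` against a restored copy covers both the integrity check and the periodic restore test in one step.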

Backup Alerts

# Backup monitoring in sync-tiers script
if rclone sync /data/tier1 remote:backup/tier1; then
    telegram_success "Backup Complete" "$TELEGRAM_BOT_URL" "Tier 1 backup successful" "$NODE_NAME"
else
    telegram_error "Backup Failed" "$TELEGRAM_BOT_URL" "Tier 1 backup failed" "$NODE_NAME"
fi

Security Monitoring

Security Events

Failed Login Attempts: SSH and service authentication failures
Firewall Blocks: UFW denied connections
Certificate Issues: SSL certificate problems
Unusual Activity: Anomalous network or system behavior
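
Failed login attempts are easier to act on when grouped by source address. The sketch below assumes the usual OpenSSH log wording ("Failed password for ... from ADDRESS port ..."); `failed_logins_by_ip` is a hypothetical helper name.

```shell
#!/bin/bash
# failed_logins_by_ip: read auth log lines on stdin and print "<count> <ip>",
# highest count first. Only lines containing "Failed password" are counted.
failed_logins_by_ip() {
    awk '/Failed password/ {
        for (i = 1; i < NF; i++) {
            if ($i == "from") { count[$(i + 1)]++; break }
        }
    }
    END { for (ip in count) print count[ip], ip }' | sort -k1,1nr -k2
}

# Example (not run here):
#   failed_logins_by_ip < /var/log/auth.log | head -5
```

An address that appears far more often than the rest is a candidate for a Security alert or a firewall rule.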

Security Alerts

# Monitor failed SSH attempts
grep "Failed password" /var/log/auth.log | tail -10

# Check firewall denials
grep "UFW BLOCK" /var/log/syslog | tail -10

# Monitor certificate expiration
for domain in vlt.ermnvldmr.com jef.ermnvldmr.com; do
    expiry=$(openssl s_client -connect "$domain:443" -servername "$domain" </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
    echo "$domain expires: $expiry"
done
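
To turn the expiry date into a warning, it helps to convert it to days remaining. The following sketch makes several assumptions: `days_left` and `cert_end_epoch` are hypothetical names, GNU `date` is available for parsing the openssl date string, and any alert threshold is chosen by the caller.

```shell
#!/bin/bash
# Hypothetical helpers for certificate expiry warnings.

# days_left END_EPOCH [NOW_EPOCH]
# Prints whole days until END_EPOCH, truncated toward zero (negative once expired).
days_left() {
    local end="$1" now="${2:-$(date +%s)}"
    echo $(( (end - now) / 86400 ))
}

# cert_end_epoch DOMAIN
# Prints the notAfter time of the certificate served on port 443 as a Unix timestamp.
cert_end_epoch() {
    local domain="$1" enddate
    enddate=$(openssl s_client -connect "${domain}:443" -servername "$domain" \
        </dev/null 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$enddate" ] || return 1
    date -d "$enddate" +%s
}

# Example (not run here): warn when fewer than 14 days remain.
#   end=$(cert_end_epoch vlt.ermnvldmr.com) && [ "$(days_left "$end")" -lt 14 ] && echo "renew soon"
```

The optional second argument to `days_left` exists so the arithmetic can be checked against a fixed reference time.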

Monitoring Dashboard

Status Overview

A simple monitoring dashboard can be created to show:

Service status (up/down)
Resource utilization (CPU, memory, disk)
Recent alerts and events
Backup status and history

Implementation

#!/bin/bash
# Simple status check script
echo "=== Infrastructure Status ==="
echo "Date: $(date)"
echo
echo "=== System Resources ==="
df -h / | tail -1
free -h | grep Mem
echo "Load: $(uptime | awk -F'load average:' '{print $2}')"
echo
echo "=== Service Status ==="
docker-compose ps
echo
echo "=== Recent Alerts ==="
journalctl -u telegram-alerts --since="1 hour ago" --no-pager

Together, these health checks, alerts, and log reviews give visibility into system health and performance across hosts and services.

Last updated on October 2, 2025
