Monitoring

This page covers the monitoring systems, health checks, and alerting configuration used across the infrastructure.

Overview

Monitoring spans four layers: system resources, service health, network connectivity, and security events.

Monitoring Components

System Monitoring

  • Host Resources: CPU, memory, disk usage, network traffic
  • Docker Metrics: Container resource usage and health
  • Process Monitoring: Critical process status and performance
  • Log Monitoring: System and application log analysis

Service Health Checks

  • HTTP Endpoints: Web service availability and response times
  • Port Connectivity: Service port accessibility and response
  • Docker Health: Container health status and restart counts
  • Application Metrics: Service-specific performance indicators
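
The port and HTTP endpoint checks listed above can be sketched as two small shell functions. This is a minimal sketch under stated assumptions, not the project's actual implementation: check_port and http_probe are hypothetical names, and the sketch assumes bash (for its /dev/tcp pseudo-device), the coreutils timeout command, and curl are installed.

```shell
#!/bin/bash
# Hypothetical helpers; names and default timeouts are assumptions,
# not part of the original scripts.

# check_port HOST PORT [TIMEOUT_SECONDS]
# Exit status 0 if a TCP connection can be opened, non-zero otherwise.
check_port() {
    local host="$1" port="$2" wait="${3:-3}"
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
    # timeout bounds the attempt so a filtered port cannot hang the check.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# http_probe URL [TIMEOUT_SECONDS]
# Prints "<status code> <total seconds>"; curl reports 000 when the request fails.
http_probe() {
    local url="$1" wait="${2:-10}"
    curl -s -o /dev/null --max-time "$wait" -w '%{http_code} %{time_total}\n' "$url"
}
```

For example, check_port 127.0.0.1 443 || echo "port closed" tests a local listener, and http_probe https://domain.com prints the status code together with the measured response time.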

Network Monitoring

  • Connectivity: Internet and inter-host connectivity
  • DNS Resolution: Domain name resolution functionality
  • Certificate Status: SSL certificate validity and expiration
  • Firewall Status: UFW rule effectiveness and security

Alerting System

Telegram Integration

Real-time notifications via Telegram bot for:

  • Service outages and failures
  • System resource exhaustion
  • Security events and intrusions
  • Backup success/failure status
  • Certificate expiration warnings

Alert Categories

  • Critical: Immediate attention required (service down)
  • Warning: Potential issues (high resource usage)
  • Info: Operational status updates (backup completion)
  • Security: Security-related events (login failures)

Configuration

# Telegram bot configuration
TELEGRAM_BOT_TOKEN="your-bot-token"
TELEGRAM_CHAT_ID="your-chat-id"
TELEGRAM_BOT_URL="https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage?chat_id=${TELEGRAM_CHAT_ID}"

# Usage in scripts
telegram_success "Backup Complete" "$TELEGRAM_BOT_URL" "Tier sync completed successfully" "$NODE_NAME"
telegram_error "Service Down" "$TELEGRAM_BOT_URL" "Service failed to start" "$NODE_NAME"
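
The telegram_* helpers used above are defined elsewhere and are not shown on this page. As a rough illustration of what such a helper might do, the sketch below formats a message per alert category and posts it with curl. The names format_alert and send_alert, the bracketed category tags, and the message layout are all assumptions; the sketch also assumes the Bot API accepts the chat_id already present in the URL together with a form-encoded text field.

```shell
#!/bin/bash
# Hypothetical sketch of an alert helper; the real telegram_* functions
# are not shown in this document and may differ.

# format_alert LEVEL TITLE MESSAGE NODE
# Prints the message text, prefixed with a tag for the alert category.
format_alert() {
    local level="$1" title="$2" message="$3" node="$4" tag
    case "$level" in
        critical) tag="[CRITICAL]" ;;
        warning)  tag="[WARNING]" ;;
        security) tag="[SECURITY]" ;;
        *)        tag="[INFO]" ;;
    esac
    printf '%s %s\nNode: %s\n%s\n' "$tag" "$title" "$node" "$message"
}

# send_alert LEVEL TITLE BOT_URL MESSAGE NODE
# Argument order after LEVEL mirrors the telegram_* calls above.
send_alert() {
    local level="$1" title="$2" url="$3" message="$4" node="$5"
    # --data-urlencode sends "text" as a URL-encoded POST field;
    # -f makes curl return non-zero on an HTTP error response.
    curl -fsS --max-time 10 -o /dev/null "$url" \
        --data-urlencode "text=$(format_alert "$level" "$title" "$message" "$node")"
}
```

A call such as send_alert critical "Service Down" "$TELEGRAM_BOT_URL" "Service failed to start" "$NODE_NAME" would then deliver a Critical alert.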

Health Check Procedures

Automated Health Checks

#!/bin/bash
# System health check script
source .scripts/bootstrap.sh

# Check disk space
DISK_USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    telegram_warn "High Disk Usage" "$TELEGRAM_BOT_URL" "Disk usage: ${DISK_USAGE}%" "$NODE_NAME"
fi

# Check memory usage
MEMORY_USAGE=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
if (( $(echo "$MEMORY_USAGE > 85" | bc -l) )); then
    telegram_warn "High Memory Usage" "$TELEGRAM_BOT_URL" "Memory usage: ${MEMORY_USAGE}%" "$NODE_NAME"
fi

# Check service health
docker-compose ps --services | while read -r service; do
    if ! docker-compose ps "$service" | grep -q "Up"; then
        telegram_error "Service Down" "$TELEGRAM_BOT_URL" "Service $service is not running" "$NODE_NAME"
    fi
done
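
The memory check above depends on bc, which may be missing on minimal installations. One possible alternative is to do the comparison in awk, which handles integers and decimals alike. The helper name exceeds is hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper; the name is an assumption.

# exceeds VALUE LIMIT
# Exit status 0 when VALUE is strictly greater than LIMIT, 1 otherwise.
exceeds() {
    # Values passed with -v that look numeric are compared as numbers,
    # so 9 is correctly treated as smaller than 80.
    awk -v v="$1" -v l="$2" 'BEGIN { exit !(v > l) }'
}
```

With this helper, the memory test could be written as: if exceeds "$MEMORY_USAGE" 85; then ... fi.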

Manual Health Checks

# Quick system health overview
df -h                    # Disk usage
free -h                  # Memory usage
docker-compose ps        # Service status
ss -tulpn               # Network ports
systemctl status        # System services

# Service-specific checks
curl -I https://domain.com
docker-compose logs --tail=20 service-name
docker stats --no-stream

Performance Monitoring

Resource Metrics

  • CPU Usage: Per-core and overall utilization
  • Memory Usage: RAM and swap utilization
  • Disk I/O: Read/write operations and throughput
  • Network Traffic: Bandwidth usage and connection counts

Service Metrics

  • Response Times: HTTP request/response latency
  • Error Rates: Service error frequency and types
  • Throughput: Request processing capacity
  • Resource Usage: Per-service CPU and memory consumption

Monitoring Tools

# System resource monitoring
htop                     # Interactive process viewer
iotop                    # Disk I/O monitoring
iftop                    # Network bandwidth monitoring
nethogs                 # Network usage per process

# Docker monitoring
docker stats             # Container resource usage
docker system df         # Docker disk usage
docker system events     # Docker events stream

Log Management

Log Collection

  • System Logs: journalctl and /var/log/* files
  • Service Logs: Docker container logs
  • Application Logs: Service-specific log files
  • Security Logs: Authentication and firewall logs

Log Analysis

# System log analysis
journalctl -f            # Follow system journal
journalctl -u service    # Service-specific logs
grep ERROR /var/log/*    # Error pattern matching

# Service log analysis
docker-compose logs -f service-name
docker-compose logs --tail=100 service-name | grep ERROR
grep -i fail service.log
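
When a quick count is more useful than the matching lines themselves, the grep patterns above can be folded into one pass. The following is a minimal sketch; summarize_levels is a hypothetical name, and the three keywords simply mirror the patterns used in this section.

```shell
#!/bin/bash
# Hypothetical helper: counts lines read from standard input that contain
# ERROR, WARN, or (case-insensitively) "fail". A line can count in more
# than one category.
summarize_levels() {
    awk '
        /ERROR/              { n["ERROR"]++ }
        /WARN/               { n["WARN"]++ }
        tolower($0) ~ /fail/ { n["FAIL"]++ }
        END {
            printf "ERROR=%d WARN=%d FAIL=%d\n", n["ERROR"], n["WARN"], n["FAIL"]
        }'
}
```

It reads from a pipe, for example: docker-compose logs --tail=1000 service-name | summarize_levels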

Log Retention

  • System Logs: 30 days retention, enforced through journald (for example, journalctl --vacuum-time=30d)
  • Service Logs: Configurable per service
  • Backup Logs: Retained with backup data
  • Security Logs: Extended retention for compliance

Backup Monitoring

Backup Verification

  • Success Confirmation: Verify backup completion
  • Integrity Checks: Validate backup file integrity
  • Restore Testing: Periodic restore verification
  • Storage Monitoring: Cloud storage usage and limits
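
One way to approach the integrity check is a checksum manifest written at backup time and re-verified later, for instance after a test restore. The sketch below assumes GNU coreutils (sha256sum) and findutils; the function names and the SHA256SUMS file name are hypothetical. For remote copies, rclone also offers a check subcommand that compares source and destination.

```shell
#!/bin/bash
# Hypothetical helpers; SHA256SUMS as the manifest name is an assumption.

# write_manifest DIR
# Records a SHA-256 checksum for every regular file under DIR.
write_manifest() {
    (
        cd "$1" || exit 1
        # The manifest itself is excluded so it never lists its own checksum.
        find . -type f ! -name SHA256SUMS -print0 | sort -z \
            | xargs -0 -r sha256sum > SHA256SUMS
    )
}

# verify_manifest DIR
# Exit status 0 only if every recorded file still matches its checksum.
verify_manifest() {
    (
        cd "$1" || exit 1
        sha256sum --check --quiet SHA256SUMS
    )
}
```

A failed verify_manifest /data/tier1 could then trigger the same telegram_error call used for a failed sync.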

Backup Alerts

# Backup monitoring in sync-tiers script
if rclone sync /data/tier1 remote:backup/tier1; then
    telegram_success "Backup Complete" "$TELEGRAM_BOT_URL" "Tier 1 backup successful" "$NODE_NAME"
else
    telegram_error "Backup Failed" "$TELEGRAM_BOT_URL" "Tier 1 backup failed" "$NODE_NAME"
fi

Security Monitoring

Security Events

  • Failed Login Attempts: SSH and service authentication failures
  • Firewall Blocks: UFW denied connections
  • Certificate Issues: SSL certificate problems
  • Unusual Activity: Anomalous network or system behavior

Security Alerts

# Monitor failed SSH attempts
grep "Failed password" /var/log/auth.log | tail -10

# Check firewall denials
grep "UFW BLOCK" /var/log/syslog | tail -10

# Monitor certificate expiration
for domain in vlt.ermnvldmr.com jef.ermnvldmr.com; do
    expiry=$(openssl s_client -connect "$domain:443" -servername "$domain" </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
    echo "$domain expires: $expiry"
done
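
To turn the expiry date printed above into an expiration warning, the date can be converted to days remaining. This is a sketch that assumes GNU date, which can parse the date format openssl prints; days_until is a hypothetical helper, and any warning threshold (such as 14 days) would be a local choice rather than something this page prescribes.

```shell
#!/bin/bash
# Hypothetical helper; assumes GNU date.

# days_until "EXPIRY_DATE" [NOW_EPOCH]
# Prints the whole days remaining until EXPIRY_DATE (truncated toward zero).
# NOW_EPOCH defaults to the current time and exists mainly for testing.
days_until() {
    local expiry_epoch now_epoch
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch="${2:-$(date -u +%s)}"
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

Inside the loop above, a check such as [ "$(days_until "$expiry")" -lt 14 ] could then send a certificate expiration warning.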

Monitoring Dashboard

Status Overview

A simple monitoring dashboard can be created to show:

  • Service status (up/down)
  • Resource utilization (CPU, memory, disk)
  • Recent alerts and events
  • Backup status and history

Implementation

#!/bin/bash
# Simple status check script
echo "=== Infrastructure Status ==="
echo "Date: $(date)"
echo
echo "=== System Resources ==="
df -h / | tail -1
free -h | grep Mem
echo "Load: $(uptime | awk -F'load average:' '{print $2}')"
echo
echo "=== Service Status ==="
docker-compose ps
echo
echo "=== Recent Alerts ==="
journalctl -u telegram-alerts --since="1 hour ago" --no-pager

Together, these checks, alerts, and log reviews give visibility into system health, service availability, backup status, and security events.
