ns8-backup-monitor

NethServer 8 backup failure notification service.

Receives Alertmanager webhook alerts, correlates per-module backup status from the cluster Redis, optionally probes restic repositories, and sends a detailed HTML/text email through the NS8 mail relay.


Table of contents

  1. Architecture
  2. File layout
  3. Runtime paths
  4. Requirements
  5. Installation
  6. Configuration
  7. Alertmanager integration
  8. Outcome classification
  9. Redis key structure
  10. Service management
  11. Troubleshooting
  12. Uninstallation
  13. License

Architecture

Alertmanager  ──POST /alert──►  receiver.py
                                    │
                          (wait N seconds for all modules
                           to finish writing their status)
                                    │
                                    ▼
                              correlator.py
                          (reads Redis KEYS/HGETALL,
                           classifies outcome:
                           SUCCESS / PARTIAL / REPO_FAILURE)
                                    │
                                    ▼
                              repo_check.py          ← optional
                          (runagent → restic snapshots
                           on each module's repository)
                                    │
                                    ▼
                               notifier.py
                          (builds HTML + plain-text email,
                           dispatches via ns8-sendmail)

Key design decision: the service is a long-running HTTP server managed by systemd, not a one-shot script. This means it is always ready to receive an alert regardless of whether the backup was triggered manually or by a scheduled timer.
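The flow above (accept the POST, wait, then correlate) can be sketched with the standard library alone. This is a minimal illustration, not the code in receiver.py: the names `make_handler` and `serve`, the `on_alert` callback, and the 202 response code are all assumptions made for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(on_alert, wait_seconds):
    """Build a handler class that calls on_alert(payload) after a delay."""

    class AlertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/alert":
                self.send_error(404, "unknown path")
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                self.send_error(400, "invalid JSON")
                return
            # Reply right away; the correlation step runs later on a timer
            # thread so Alertmanager is not kept waiting during the grace period.
            timer = threading.Timer(wait_seconds, on_alert, args=(payload,))
            timer.daemon = True
            timer.start()
            self.send_response(202)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet; a real service would log here

    return AlertHandler


def serve(host, port, on_alert, wait_seconds):
    """Create the HTTP server; the caller decides when to run serve_forever()."""
    return HTTPServer((host, port), make_handler(on_alert, wait_seconds))
```

Calling `serve("127.0.0.1", 9099, print, 30).serve_forever()` would print each alert payload 30 seconds after it arrives. Deferring the work on a timer is one possible design; the actual service may instead block inside the request handler.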


File layout

ns8-backup-monitor/
│
├── README.md                          ← this file
│
├── config/
│   └── config.yml.example             ← annotated configuration template
│                                         (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│   ├── install.sh                     ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package
    ├── __init__.py                    ← package metadata, version string
    ├── __main__.py                    ← entry point: arg parsing, logging init,
    │                                     hands off to receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
    ├── correlator.py                  ← reads Redis, classifies backup outcome
    ├── repo_check.py                  ← probes restic repositories via runagent
    ├── notifier.py                    ← builds and sends email notifications
    └── utils.py                       ← load_config(), setup_logging()

Runtime paths

The following paths are created by deploy/install.sh and assumed by the default configuration.

| Purpose          | Path                                            |
|------------------|-------------------------------------------------|
| Python package   | /opt/ns8-backup-monitor/ns8_backup_monitor/     |
| Deploy scripts   | /opt/ns8-backup-monitor/deploy/                 |
| Configuration    | /etc/ns8-backup-monitor/config.yml              |
| systemd unit     | /etc/systemd/system/ns8-backup-monitor.service  |
| Log file         | /var/log/ns8-backup-monitor.log                 |
| NS8 Redis socket | /var/lib/nethserver/cluster/state/redis.sock    |

Requirements

| Dependency    | Provided by         | Notes                                               |
|---------------|---------------------|-----------------------------------------------------|
| python3 ≥ 3.8 | OS                  | Standard on AlmaLinux / Rocky 8+                    |
| pyyaml        | pip3 install pyyaml | Only non-stdlib dependency                          |
| redis-cli     | NethServer 8        | Used via subprocess, no Python Redis client needed  |
| runagent      | NethServer 8        | Required for repo_check only                        |
| ns8-sendmail  | NethServer 8        | Required for email delivery                         |
| systemd       | OS                  | Service management                                  |

Run this service on the NS8 leader node, or on any other node that has read access to the cluster Redis socket and has runagent in PATH.
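Because redis-cli is called through subprocess rather than a Python Redis client, a query can be as small as the sketch below. The helper names `redis_cli_argv` and `redis_cli` and the 10-second timeout are assumptions for illustration; only the `-s` socket option of redis-cli and the socket path come from this document.

```python
import subprocess

# Socket path as documented under "Runtime paths".
DEFAULT_SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def redis_cli_argv(command, *args, socket_path=DEFAULT_SOCKET):
    """Build the argument vector for one redis-cli call over the Unix socket."""
    return ["redis-cli", "-s", socket_path, command, *args]


def redis_cli(command, *args, socket_path=DEFAULT_SOCKET, timeout=10):
    """Run redis-cli and return its output as a list of lines.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the call takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        redis_cli_argv(command, *args, socket_path=socket_path),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.splitlines()
```

On a node that meets the requirements, `redis_cli("PING")` should return `["PONG"]`.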


Installation

bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)

The installer will:

  1. Check prerequisites (python3, curl, tar, ns8-sendmail).
  2. Download and extract the latest source archive from the Gitea repository.
  3. Prompt interactively for sender address, recipient list, and subject prefix.
  4. Write /etc/ns8-backup-monitor/config.yml with the supplied values.
  5. Install and start the systemd service.

Manual installation

git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor

# Install Python dependency
pip3 install pyyaml

# Create directories
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor

# Copy source and config template
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
# Edit the config before starting
nano /etc/ns8-backup-monitor/config.yml

# Install systemd unit
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor

Configuration

The configuration file is a YAML document. The installer writes it to /etc/ns8-backup-monitor/config.yml; a fully annotated template is available at config/config.yml.example.

# ---------------------------------------------------------------------------
# Email notification settings
# ---------------------------------------------------------------------------
# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
# configured in NethServer 8.  No SMTP credentials are needed here.
mail:
  # Envelope / header sender address.
  from: "ns8-backup-monitor@yourdomain.com"

  # One or more recipient addresses.  At least one is required.
  to:
    - "admin@yourdomain.com"

  # String prepended to every email subject line.
  subject_prefix: "[NS8 Backup]"

# ---------------------------------------------------------------------------
# Webhook receiver (HTTP server)
# ---------------------------------------------------------------------------
receiver:
  # Interface to listen on.  127.0.0.1 is recommended when Alertmanager
  # runs on the same host; use 0.0.0.0 only if it runs on a different node.
  host: "127.0.0.1"
  # TCP port.  Must match the webhook URL configured in Alertmanager.
  port: 9099

# ---------------------------------------------------------------------------
# Timing
# ---------------------------------------------------------------------------
correlator:
  # Seconds to wait after receiving the alert before reading Redis.
  # This grace period allows all module agents to finish writing their
  # per-module status hashes.  30 s is sufficient for most deployments.
  wait_seconds: 30

  # Look-back window in seconds used when the alert does not include a
  # backup_id label.  Any plan whose Redis status was updated within this
  # window is considered "recent" and included in the report.
  recent_window: 3600

# ---------------------------------------------------------------------------
# Redis connection
# ---------------------------------------------------------------------------
redis:
  # Path to the NS8 cluster Redis Unix socket.
  # On a standard NS8 installation this path never changes.
  socket: "/var/lib/nethserver/cluster/state/redis.sock"

# ---------------------------------------------------------------------------
# Repository check (optional, uses runagent + restic)
# ---------------------------------------------------------------------------
repo_check:
  # Maximum seconds to wait for each repository check before giving up.
  timeout: 60
  # Extra flags passed verbatim to every restic invocation.
  # Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
  restic_flags: ""

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging:
  # Python log level: DEBUG, INFO, WARNING, ERROR.
  level: INFO
  # Absolute path for the rotating log file (5 MB × 3 backups).
  # Leave empty to log to stdout / journald only.
  file: "/var/log/ns8-backup-monitor.log"
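One way to handle a partially filled config.yml is to overlay it on a set of defaults and then validate the result. The sketch below works on the dictionary that a YAML loader would return. The values in `DEFAULTS` are copied from the template above; whether the package applies them as built-in defaults is an assumption, and `merge_config` and `validate_config` are hypothetical names.

```python
import copy

# Values copied from the annotated template; treating them as built-in
# defaults is an assumption of this sketch.
DEFAULTS = {
    "mail": {"from": "", "to": [], "subject_prefix": "[NS8 Backup]"},
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"timeout": 60, "restic_flags": ""},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_config(user, defaults=DEFAULTS):
    """Overlay user settings on the defaults, one section at a time."""
    merged = copy.deepcopy(defaults)
    for section, values in (user or {}).items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)
        else:
            merged[section] = values
    return merged


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not cfg["mail"]["to"]:
        errors.append("mail.to must list at least one recipient")
    port = cfg["receiver"]["port"]
    if not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append("receiver.port must be an integer between 1 and 65535")
    if cfg["correlator"]["wait_seconds"] < 0:
        errors.append("correlator.wait_seconds must not be negative")
    return errors
```

Collecting every problem in a list, rather than raising on the first one, lets a start-up check report all configuration mistakes in a single log entry.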

Alertmanager integration

Add a receiver pointing to the service in your Alertmanager configuration:

# alertmanager.yml (relevant excerpt)
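route:
  # Top-level (fallback) receiver. With ns8-backup-monitor here, alerts that
  # match no child route are also sent to the webhook; keep your existing
  # default receiver instead if only backup alerts should reach this service.
  receiver: ns8-backup-monitor
  routes:
    # Route backup failure alerts to this receiver.
    - match:
        alertname: NethServerBackupFailed
      receiver: ns8-backup-monitor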

receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: "http://127.0.0.1:9099/alert"
        # Send resolved alerts too so the service can log them.
        send_resolved: true

Reload Alertmanager after editing:

systemctl reload alertmanager
# or, for the NS8 metrics module:
runagent -m metrics1 systemctl reload alertmanager
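On the receiving side, the body that Alertmanager posts to /alert is a JSON object with an `alerts` list, where each entry carries a `status` and a `labels` map. The sketch below picks out the firing backup alerts and their optional `backup_id` label. The function name `firing_backup_alerts` is hypothetical, and skipping resolved alerts is an assumption about how the service treats them.

```python
def firing_backup_alerts(payload, alertname="NethServerBackupFailed"):
    """Return one dict per firing alert whose alertname label matches."""
    selected = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # resolved alerts are skipped in this sketch
        if labels.get("alertname") != alertname:
            continue
        selected.append({
            # backup_id is optional; None means "fall back to the
            # recent_window look-back" described in the configuration.
            "backup_id": labels.get("backup_id"),
            "labels": labels,
        })
    return selected
```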

Outcome classification

For each backup plan the correlator reads all per-module status hashes and produces one of three outcomes:

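| Outcome      | Condition                                          | Email subject              |
|--------------|----------------------------------------------------|----------------------------|
| SUCCESS      | All modules finished with result=success           | ✅ Backup completed        |
| PARTIAL      | At least one module succeeded, at least one failed | ⚠️ Backup partially failed |
| REPO_FAILURE | All modules failed or no status found in Redis     | ❌ Backup failed           |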
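The three rules can be written as a small pure function. This is a sketch of the classification as described here, not the code in correlator.py; the function name `classify` and its input shape (a mapping from module ID to the `result` field) are assumptions.

```python
def classify(module_results):
    """Map per-module result strings to SUCCESS, PARTIAL, or REPO_FAILURE.

    module_results: dict of module_id -> "success" or "error".
    """
    if not module_results:
        # No status found in Redis counts as a failure.
        return "REPO_FAILURE"
    succeeded = sum(1 for result in module_results.values() if result == "success")
    if succeeded == len(module_results):
        return "SUCCESS"
    if succeeded == 0:
        return "REPO_FAILURE"
    return "PARTIAL"
```

For example, `classify({"mail1": "success", "nextcloud1": "error"})` yields `"PARTIAL"`; the module IDs are illustrative.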

Redis key structure

The correlator reads two families of keys from the NS8 cluster Redis:

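| Key pattern                                  | Description                                                                  |
|----------------------------------------------|------------------------------------------------------------------------------|
| cluster/backup/<backup_id>/status            | Plan-level status hash. Fields: result, timestamp, errors (integer count).   |
| module/<module_id>/backup/<backup_id>/status | Per-module status hash. Fields: result, timestamp, error (message string).   |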

result is either "success" or "error". timestamp is an ISO 8601 string in UTC (e.g. 2024-01-15T03:00:05Z).
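As an illustration of how these hashes might be consumed, the sketch below builds the two key names, turns the line-oriented output of `redis-cli HGETALL` (field and value on alternating lines when the output is not a terminal) into a dictionary, and checks a timestamp against the `recent_window` setting. All function names are hypothetical, and the parser assumes that no field value contains a newline.

```python
from datetime import datetime, timezone


def plan_key(backup_id):
    return f"cluster/backup/{backup_id}/status"


def module_key(module_id, backup_id):
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(lines):
    """Pair alternating field/value lines from redis-cli into a dict."""
    return dict(zip(lines[0::2], lines[1::2]))


def parse_timestamp(value):
    """Parse a UTC ISO 8601 string such as 2024-01-15T03:00:05Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_recent(status, now, recent_window=3600):
    """True if the status hash was written within recent_window seconds of now."""
    age = (now - parse_timestamp(status["timestamp"])).total_seconds()
    return age <= recent_window
```

`strptime` is used instead of `datetime.fromisoformat` because the latter does not accept the trailing `Z` on Python 3.8, the minimum version listed in the requirements.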


Service management

# Check service status
systemctl status ns8-backup-monitor

# Follow live logs via journald
journalctl -u ns8-backup-monitor -f

# Follow the rotating log file directly
tail -f /var/log/ns8-backup-monitor.log

# Restart after a config change
systemctl restart ns8-backup-monitor

# Test the webhook endpoint manually
curl -s -X POST http://127.0.0.1:9099/alert \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
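The same manual test can be scripted in Python, which is convenient when curl is not at hand or when the payload needs a `backup_id` label. The helper names `build_test_alert` and `post_alert` are hypothetical; the URL and payload mirror the curl command above.

```python
import json
import urllib.request


def build_test_alert(alertname="NethServerBackupFailed", backup_id=None):
    """Build a minimal Alertmanager-style payload with one firing alert."""
    labels = {"alertname": alertname}
    if backup_id is not None:
        labels["backup_id"] = str(backup_id)
    return {"alerts": [{"status": "firing", "labels": labels}]}


def post_alert(payload, url="http://127.0.0.1:9099/alert", timeout=5):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the service running, `post_alert(build_test_alert())` sends the same request as the curl example.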

Troubleshooting

Service starts but no email is received

  1. Verify ns8-sendmail works independently:
    echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
    
  2. Check mail.to in /etc/ns8-backup-monitor/config.yml.
  3. Increase log level to DEBUG and restart the service.
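For step 3, it helps to know roughly what the `logging` section controls. The sketch below wires a log level and an optional rotating file (5 MB, three backups, as stated in the configuration template) into Python's logging module. It is an approximation, not the `setup_logging()` shipped in utils.py; the function name, logger name, and message format are assumptions.

```python
import logging
import sys
from logging.handlers import RotatingFileHandler


def setup_logging_sketch(level="INFO", file=""):
    """Configure a logger with a stdout handler and an optional rotating file."""
    logger = logging.getLogger("ns8_backup_monitor")
    # Unknown level names fall back to INFO.
    logger.setLevel(getattr(logging, str(level).upper(), logging.INFO))
    logger.handlers.clear()

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Under systemd, stdout ends up in journald.
    stream = logging.StreamHandler(sys.stdout)
    stream.setFormatter(formatter)
    logger.addHandler(stream)

    if file:
        rotating = RotatingFileHandler(file, maxBytes=5 * 1024 * 1024, backupCount=3)
        rotating.setFormatter(formatter)
        logger.addHandler(rotating)
    return logger
```

With `level: DEBUG` in config.yml, messages below INFO would then appear both in `journalctl -u ns8-backup-monitor` and in the log file.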

REPO_FAILURE on every alert even though backups succeed

  • The correlator may be reading Redis before all modules have finished.
    Increase correlator.wait_seconds (e.g. to 60).
  • Check that the Redis socket path is correct:
    redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING

Alertmanager does not reach the webhook

  • Confirm the service is listening:
    ss -tlnp | grep 9099
  • If Alertmanager runs on a different host, change receiver.host to 0.0.0.0 and open the port in the firewall.
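The listening check can also be done from Python, for instance from the Alertmanager host when `ss` output on the monitor node is not available. This is a plain TCP connect test; the function name `is_listening` is hypothetical, and the defaults mirror the `receiver` section of the configuration.

```python
import socket


def is_listening(host="127.0.0.1", port=9099, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A `False` result from the Alertmanager host combined with a `True` result on the monitor node itself usually points at the bind address or the firewall rather than at the service.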

Uninstallation

bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall

The script will stop and disable the service, remove the install directory, and optionally remove the configuration directory.


License

MIT — see LICENSE if present, otherwise contact the repository owner.