# ns8-backup-monitor

A lightweight webhook receiver for **NethServer 8** that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via `restic`, and delivers a detailed email notification through the NS8 configured mail relay.

Unlike solutions that hook into `run-backup` (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel, the same source used by the NS8 monitoring stack, and therefore captures **both manual and scheduled automatic backups**.

---

## Architecture overview

```
Alertmanager
    │  POST /alert  (NsBackupFailed | NsBackupMissing)
    ▼
[receiver.py]      HTTP webhook listener (localhost:9099)
    │  waits N seconds for modules to settle
    ▼
[correlator.py]    Reads Redis cluster state, classifies outcome
    │  SUCCESS | PARTIAL | REPO_FAILURE
    ▼
[repo_check.py]    (only on non-SUCCESS) Probes restic repos via runagent
    │
    ▼
[notifier.py]      Builds HTML/text email, sends via ns8-sendmail
```

---

## Requirements

| Dependency | Notes |
|---|---|
| NS8 leader or worker node | Must have access to the cluster Redis socket |
| `redis-cli` | Included in standard NS8 installations |
| `runagent` | NS8 binary used to invoke `restic` inside module containers |
| `ns8-sendmail` | NS8 mail relay script (invoked via `runagent`) |
| Python 3.8+ | Standard library only, no pip dependencies |
| Alertmanager | Must be configured to send webhooks to this service |

---

## File layout

```
ns8-backup-monitor/
│
├── README.md                          ← This file
│
├── config/
│   └── config.yml.example             ← Annotated configuration template
│
├── deploy/
│   ├── install.sh                     ← Interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package (main application)
    ├── __init__.py                    ← Package marker, exposes version
    ├── __main__.py                    ← CLI entry point (`python3 -m ns8_backup_monitor`)
    ├── receiver.py                    ← HTTP webhook server (Alertmanager → pipeline)
    ├── correlator.py                  ← Redis reader and outcome classifier
    ├── repo_check.py                  ← restic repository health prober
    ├── notifier.py                    ← Email builder and sender
    └── utils.py                       ← Config loader and logging setup
```

### Runtime paths (after installation)

| Path | Purpose |
|---|---|
| `/opt/ns8-backup-monitor/` | Application root (Python package) |
| `/etc/ns8-backup-monitor/config.yml` | Active configuration file |
| `/etc/systemd/system/ns8-backup-monitor.service` | systemd unit |
| `/var/log/ns8-backup-monitor/` | Log directory (if file logging is enabled) |
| `/var/lib/nethserver/cluster/state/redis.sock` | NS8 cluster Redis socket (default) |

---

## Installation

### Quick install (interactive)

```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```

> **Note:** Use `bash <(curl ...)` rather than `curl ... | bash`.
> The interactive installer reads answers from your terminal via `read`; piping stdin
> from curl breaks that interaction.

### Non-interactive install (CI / automation)

```bash
curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \
  | bash -s -- \
      --from "backup@example.com" \
      --to   "admin@example.com"
```

### What the installer does

1. Copies the Python package to `/opt/ns8-backup-monitor/`
2. Writes `/etc/ns8-backup-monitor/config.yml` from the template
3. Installs and enables the systemd unit
4. Prints the Alertmanager webhook receiver URL

---

## Uninstallation

```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```

The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory.

---

## Configuration

The active configuration file is `/etc/ns8-backup-monitor/config.yml`. Edit it directly and restart the service to apply changes.

```bash
nano /etc/ns8-backup-monitor/config.yml
systemctl restart ns8-backup-monitor
```

See `config/config.yml.example` for a fully annotated reference with all available options.

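As a rough illustration of how settings in this file might relate to built-in defaults, the sketch below layers a user-supplied mapping over a defaults dictionary. It is not the project's actual loader: `DEFAULTS` only mirrors the values documented under "Key sections", and `merge_config` and the already-parsed `overrides` dictionary are hypothetical names introduced here. Parsing the YAML file itself is left out.

```python
import copy

# Assumed defaults, copied from the documented key sections.
# The real loader in utils.py may behave differently.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"enabled": True, "timeout": 60},
    "logging": {"level": "INFO", "file": ""},
}


def merge_config(defaults, overrides):
    """Return a new dict with overrides layered recursively over defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge section by section.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new sections replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical overrides, as they might look after parsing config.yml.
overrides = {
    "mail": {"from": "backup@example.com", "to": ["admin@example.com"]},
    "correlator": {"wait_seconds": 60},
}

config = merge_config(DEFAULTS, overrides)
print(config["correlator"])        # {'wait_seconds': 60, 'recent_window': 3600}
print(config["receiver"]["port"])  # 9099
```

With this approach a partial file only needs to list the values that differ, and the defaults dictionary itself is never mutated.
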
### Key sections

```yaml
# ── Mail settings ─────────────────────────────────────────────
mail:
  from: "backup@ns02.example.com"   # Envelope From address
  to:
    - "admin@example.com"           # One or more recipient addresses
  subject_prefix: "[NS8 Backup]"    # Prepended to every subject line

# ── Webhook receiver ──────────────────────────────────────────
receiver:
  host: "127.0.0.1"   # Bind address (keep localhost unless Alertmanager is remote)
  port: 9099          # Must match the Alertmanager webhook URL

# ── Correlator behaviour ─────────────────────────────────────
correlator:
  wait_seconds: 30     # Seconds to wait after alert before reading Redis
                       # (allows slow modules to write their final status)
  recent_window: 3600  # When no backup_id label is present, scan Redis for
                       # plan status keys updated within this many seconds

# ── Redis connection ─────────────────────────────────────────
redis:
  socket: "/var/lib/nethserver/cluster/state/redis.sock"

# ── Repository health check ──────────────────────────────────
repo_check:
  enabled: true
  timeout: 60   # Seconds per restic check call

# ── Logging ──────────────────────────────────────────────────
logging:
  level: "INFO"   # DEBUG | INFO | WARNING | ERROR
  file: ""        # Leave empty to log to stdout (journald captures it)
```

---

## Alertmanager integration

Add a receiver to your Alertmanager configuration on the NS8 leader node (`/etc/alertmanager/alertmanager.yml` or via the NS8 `metrics1` module):

```yaml
receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: "http://127.0.0.1:9099/alert"
        send_resolved: false

route:
  receiver: ns8-backup-monitor
  group_by: [alertname]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match_re:
        alertname: "NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
```

The service handles two alert names:

| Alert name | Meaning |
|---|---|
| `NsBackupFailed` | One or more backup modules reported an error |
| `NsBackupMissing` | Expected backup did not run within the time window |

---

## Outcome classification

After reading per-module Redis keys, the correlator assigns one of three outcomes:

| Outcome | Condition | Email subject |
|---|---|---|
| `SUCCESS` | All modules succeeded | ✅ Backup completed successfully |
| `PARTIAL` | Some modules failed, some succeeded | ⚠️ Backup partially failed |
| `REPO_FAILURE` | All modules failed, or no status found in Redis | ❌ Backup failed – possible repository error |

On `PARTIAL` or `REPO_FAILURE`, the repo health check runs automatically (provided `repo_check.enabled` is `true`) and appends diagnostic information (restic error output) to the email.

---

## Redis key structure

The correlator reads the following NS8 Redis key patterns:

```
cluster/backup//status           → overall plan status (hash)
module//backup//status           → per-module status (hash)
```

Hash fields:

| Field | Values | Description |
|---|---|---|
| `result` | `success` / `error` | Outcome of the backup operation |
| `timestamp` | ISO 8601 | When the status was last written |
| `error` | string | Error message, if any |
| `errors` | integer | Number of module errors (plan-level hash only) |

---

## Service management

```bash
# Check service status
systemctl status ns8-backup-monitor

# View live logs
journalctl -u ns8-backup-monitor -f

# Restart after config change
systemctl restart ns8-backup-monitor

# Disable start on boot
systemctl disable ns8-backup-monitor
```

---

## Troubleshooting

### Service fails to start

```bash
journalctl -u ns8-backup-monitor --no-pager -n 50
```

Common causes:

- `config.yml` not found at the expected path → check `/etc/ns8-backup-monitor/config.yml`
- Port 9099 already in use → change `receiver.port` in config

### No email received after a backup failure

1. Verify Alertmanager is firing the webhook:

   ```bash
   journalctl -u ns8-backup-monitor -f
   ```

   You should see `Received N relevant alert(s)` within a minute of the backup failure.

2. Check that `wait_seconds` has elapsed (default 30 s) and look for `Sending notification...` in the log.

3. Verify the mail relay works independently:

   ```bash
   echo "Test" | runagent ns8-sendmail -s "test" admin@example.com
   ```

### Correlator finds no modules

If the log shows `No recent backup status keys found in Redis`, possible causes:

- `recent_window` is too short: the backup ran longer ago than the window allows (more than 1 hour with the default of 3600 s)
- Redis socket path is wrong for your installation
- The backup plan wrote status to a non-standard key pattern

---

## Development

The application is pure Python 3 with no third-party dependencies.

```bash
# Run locally (requires NS8 Redis socket access)
python3 -m ns8_backup_monitor --config ./config/config.yml.example

# Send a test webhook payload
curl -s -X POST http://127.0.0.1:9099/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}'
```

---

## License

MIT License. Contributions are welcome via pull request.