# ns8-backup-monitor

> **NethServer 8 backup failure notification service.**
>
> Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a
> structured plain-text email through the NS8 mail relay.

---

## Table of contents

1. [Architecture](#architecture)
2. [File layout](#file-layout)
3. [Runtime paths](#runtime-paths)
4. [Requirements](#requirements)
5. [Installation](#installation)
6. [Configuration](#configuration)
7. [Alertmanager integration](#alertmanager-integration)
8. [Outcome classification](#outcome-classification)
9. [Redis key structure](#redis-key-structure)
10. [Service management](#service-management)
11. [Troubleshooting](#troubleshooting)
12. [Uninstallation](#uninstallation)
13. [License](#license)

---

## Architecture

```
Alertmanager ──POST /alert──► receiver.py
                                   │  (wait N seconds for all modules to finish writing their status)
                                   │
                                   ▼
                             correlator.py
                             (reads Redis KEYS/HGETALL, classifies outcome:
                              SUCCESS / PARTIAL / REPO_FAILURE)
                                   │
                                   ▼
                             repo_check.py   ← skipped on SUCCESS
                             (restic snapshots --last --no-cache on each configured repository)
                                   │
                                   ▼
                             notifier.py
                             (builds plain-text email, dispatches via ns8-sendmail)
```

**Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. It is always ready to receive alerts whether the
backup is triggered manually from the UI or by an automatic scheduled timer.

> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom); both are checked.

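
The name matching described in the note can be sketched roughly as follows. This is an illustration, not the code in `receiver.py`: the function name `matching_alerts` is hypothetical, the contents of `ALERT_NAMES` are simply the four names listed above, and skipping non-firing alerts inside the service is an assumption. The payload shape (`alerts[].status`, `alerts[].labels`) is the standard Alertmanager webhook format.

```python
import json

# The four alert names listed in this README (native and custom/legacy).
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def matching_alerts(payload: dict) -> list:
    """Return the firing alerts in a webhook payload whose alertname is recognised.

    Hypothetical helper for illustration only.
    """
    selected = []
    for alert in payload.get("alerts", []):
        # Assumption: alerts that are not firing (e.g. resolved) are skipped.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        if labels.get("alertname") in ALERT_NAMES:
            selected.append(alert)
    return selected


if __name__ == "__main__":
    body = json.loads(
        '{"alerts": ['
        '{"status": "firing", "labels": {"alertname": "backup_failed", "id": "1"}},'
        '{"status": "firing", "labels": {"alertname": "HighLoad"}},'
        '{"status": "resolved", "labels": {"alertname": "backup_missing", "id": "2"}}'
        ']}'
    )
    # Only the first alert has a recognised name and is still firing.
    print([a["labels"]["alertname"] for a in matching_alerts(body)])
```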

---

## File layout

```
ns8-backup-monitor/
│
├── README.md                          ← this file
│
├── config/
│   └── config.yml.example             ← annotated configuration template
│                                        (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│   ├── install.sh                     ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package (the service code)
    ├── __init__.py                    ← package metadata and version string
    ├── __main__.py                    ← entry point: argument parsing, logging
    │                                    initialisation, calls receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
    │                                    matches alert names, spawns pipeline thread
    ├── correlator.py                  ← reads NS8 Redis, classifies backup outcome
    ├── repo_check.py                  ← probes restic repositories for health status
    ├── notifier.py                    ← builds and sends the status email
    └── utils.py                       ← load_config() and setup_logging() helpers
```

---

## Runtime paths

Paths created by `deploy/install.sh` and assumed by the default configuration.

| Purpose | Default path |
|---------|-------------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |

---

## Requirements

| Dependency | Provided by | Notes |
|------------|-------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |

> **This service must run on an NS8 leader node** (or any node with read access
> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).

---

## Installation

### One-liner (recommended)

```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```

The installer will:

1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service.

### Manual installation

```bash
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor

pip3 install pyyaml

mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml   # edit before starting

cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
```

---

## Configuration

Full reference is in `config/config.yml.example`.
Key parameters:

```yaml
receiver:
  host: 127.0.0.1       # bind address (use 0.0.0.0 only for remote Alertmanager)
  port: 9099            # webhook listening port

correlator:
  wait_seconds: 30      # wait after alert before reading Redis (allow slow modules to finish)
  recent_window: 3600   # fallback scan window (seconds) when neither id nor backup_id label is present

repo_check:
  enabled: true         # set false to skip restic health checks entirely
  timeout: 60           # per-repository restic timeout (seconds)

notification:
  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com

redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock

logging:
  level: INFO           # DEBUG for verbose output during troubleshooting
  file: /var/log/ns8-backup-monitor.log
```

---

## Alertmanager integration

Add a receiver and route to your Alertmanager configuration:

```yaml
# alertmanager.yml
receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: http://127.0.0.1:9099/alert
        send_resolved: false   # resolved alerts are intentionally ignored

route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
```

### Supported alert names

| Alert name | Rule set | Trigger |
|------------|----------|---------|
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |

### Label mapping

| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |

Both labels are checked.
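
A minimal sketch of that label lookup is shown here. The helper name `plan_id_from_labels` is hypothetical, and trying `id` before `backup_id` is an assumption; the README does not state which label wins when both are set. A `None` result marks the case where no plan identifier was supplied.

```python
from typing import Optional


def plan_id_from_labels(labels: dict) -> Optional[str]:
    """Return the backup plan identifier from alert labels, or None if absent.

    Hypothetical helper for illustration only.
    """
    # Assumption: the native `id` label is tried before the custom `backup_id`.
    for key in ("id", "backup_id"):
        value = labels.get(key)
        if value:
            return str(value)
    return None


if __name__ == "__main__":
    print(plan_id_from_labels({"alertname": "backup_failed", "id": "1"}))
    print(plan_id_from_labels({"alertname": "NsBackupFailed", "backup_id": "7"}))
    print(plan_id_from_labels({"alertname": "backup_missing"}))
```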
When neither is present the correlator falls back to scanning Redis for all plan
status keys updated within `correlator.recent_window`.

---

## Outcome classification

| Outcome | Condition | Email subject |
|---------|-----------|---------------|
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - ` |

`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling issue).

The repository health check (`repo_check.py`) runs automatically for `PARTIAL` and
`REPO_FAILURE` outcomes to provide additional diagnostics.

---

## Redis key structure

Keys read by `correlator.py` and `repo_check.py`:

```
cluster/backup//status
  Hash fields:
    result      "success" | "error"
    timestamp   ISO 8601 (UTC)
    errors      integer count of failed modules

module//backup//status
  Hash fields:
    result      "success" | "error"
    timestamp   ISO 8601 (UTC)
    error       human-readable error message (empty on success)

cluster/backup_repository//parameters
  Hash fields:
    url                    cloud backend URL (S3, B2, rclone, ...)
    path                   local or SFTP path
    password               restic repository password
    backend                "s3" | "b2" | "sftp" | "rclone" | "local"
    aws_access_key_id      S3 key ID (also used as b2_account_id for B2)
    aws_secret_access_key  S3 secret (also used as b2_account_key for B2)
    rclone_config          path to rclone configuration file
```

---

## Service management

```bash
# Status
systemctl status ns8-backup-monitor

# Restart after config change
systemctl restart ns8-backup-monitor

# Live logs
journalctl -u ns8-backup-monitor -f

# Live logs with DEBUG lines filtered out
# (debug logging itself is enabled via logging.level in config.yml)
journalctl -u ns8-backup-monitor -f | grep -v DEBUG

# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
     -H "Content-Type: application/json" \
     -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
```

---

## Troubleshooting

### Pipeline does not trigger on automatic backups

**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
when the scheduled timer fires.

**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed` and
`NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.

**Fix:** update to the current version; both alert name sets and both labels are
now handled automatically.

**Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert with
`amtool alert add alertname=backup_failed id=1`. With `logging.level` set to
`DEBUG`, you should see the `Received alert:` line followed by the pipeline start.

### No email received

1. Check the service is running: `systemctl status ns8-backup-monitor`.
2. Verify Alertmanager is routing to the correct receiver:
   `amtool config routes test alertname=backup_failed`.
3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
   `echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase log level to `DEBUG` in `config.yml` and restart the service.

### REPO_FAILURE with no modules found

**Cause:** the correlator ran before backup modules finished writing their status
to Redis.

**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60–120
seconds is safe for most clusters. Restart the service after changing the value.

---

## Uninstallation

```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```

---

## License

MIT; see [LICENSE](LICENSE).