ns8-backup-monitor

A lightweight webhook receiver for NethServer 8 that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via restic, and delivers a detailed email notification through the mail relay configured in NS8.

Unlike solutions that hook into run-backup (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel — the same source used by the NS8 monitoring stack — and therefore captures both manual and scheduled automatic backups.


Architecture overview

Alertmanager
     │  POST /alert  (NsBackupFailed | NsBackupMissing)
     ▼
[receiver.py]            HTTP webhook listener (localhost:9099)
     │  waits N seconds for modules to settle
     ▼
[correlator.py]          Reads Redis cluster state, classifies outcome
     │  SUCCESS | PARTIAL | REPO_FAILURE
     ▼
[repo_check.py]          (only on non-SUCCESS) Probes restic repos via runagent
     ▼
[notifier.py]            Builds HTML/text email, sends via ns8-sendmail
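
The flow in the diagram can be read as a short orchestration function. The following is a minimal sketch, not the project's actual code: `run_pipeline` and its callable parameters (`correlate`, `probe_repos`, `notify`) are hypothetical stand-ins for the four modules, and the report is assumed to be a dict with an "outcome" key.

```python
import time
from typing import Callable, Optional


def run_pipeline(
    alerts: list,
    correlate: Callable[[list], dict],
    probe_repos: Callable[[dict], dict],
    notify: Callable[[dict, Optional[dict]], None],
    wait_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run the stages in diagram order and return the classified outcome."""
    # Give slow modules time to write their final status to Redis.
    sleep(wait_seconds)

    # Hypothetical contract: the correlator returns a dict with an "outcome" key.
    report = correlate(alerts)

    diagnostics = None
    if report["outcome"] != "SUCCESS":
        # The repository probe only runs when something went wrong.
        diagnostics = probe_repos(report)

    notify(report, diagnostics)
    return report["outcome"]
```

Passing the stages in as callables keeps the sketch self-contained and makes the "probe only on non-SUCCESS" branch easy to exercise without Redis, restic, or a mail relay.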

Requirements

| Dependency | Notes |
|---|---|
| NS8 leader or worker node | Must have access to the cluster Redis socket |
| redis-cli | Included in standard NS8 installations |
| runagent | NS8 binary used to invoke restic inside module containers |
| ns8-sendmail | NS8 mail relay script (invoked via runagent) |
| Python 3.8+ | Standard library only; no pip dependencies |
| Alertmanager | Must be configured to send webhooks to this service |

File layout

ns8-backup-monitor/
│
├── README.md                          ← This file
│
├── config/
│   └── config.yml.example             ← Annotated configuration template
│
├── deploy/
│   ├── install.sh                     ← Interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package (main application)
    ├── __init__.py                    ← Package marker, exposes version
    ├── __main__.py                    ← CLI entry point (`python3 -m ns8_backup_monitor`)
    ├── receiver.py                    ← HTTP webhook server (Alertmanager → pipeline)
    ├── correlator.py                  ← Redis reader and outcome classifier
    ├── repo_check.py                  ← restic repository health prober
    ├── notifier.py                    ← Email builder and sender
    └── utils.py                       ← Config loader and logging setup

Runtime paths (after installation)

| Path | Purpose |
|---|---|
| /opt/ns8-backup-monitor/ | Application root (Python package) |
| /etc/ns8-backup-monitor/config.yml | Active configuration file |
| /etc/systemd/system/ns8-backup-monitor.service | systemd unit |
| /var/log/ns8-backup-monitor/ | Log directory (if file logging is enabled) |
| /var/lib/nethserver/cluster/state/redis.sock | NS8 cluster Redis socket (default) |

Installation

Quick install (interactive)

bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)

Note: Use bash <(curl ...) rather than curl ... | bash. The interactive installer reads answers from your terminal via read; piping stdin from curl breaks that interaction.

Non-interactive install (CI / automation)

curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \
  | bash -s -- \
      --from "backup@example.com" \
      --to   "admin@example.com"

What the installer does

  1. Copies the Python package to /opt/ns8-backup-monitor/
  2. Writes /etc/ns8-backup-monitor/config.yml from the template
  3. Installs and enables the systemd unit
  4. Prints the Alertmanager webhook receiver URL

Uninstallation

bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall

The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory.


Configuration

The active configuration file is /etc/ns8-backup-monitor/config.yml. Edit it directly and restart the service to apply changes.

nano /etc/ns8-backup-monitor/config.yml
systemctl restart ns8-backup-monitor

See config/config.yml.example for a fully annotated reference with all available options.

Key sections

# ── Mail settings ─────────────────────────────────────────────
mail:
  from: "backup@ns02.example.com"   # Envelope From address
  to:
    - "admin@example.com"           # One or more recipient addresses
  subject_prefix: "[NS8 Backup]"    # Prepended to every subject line

# ── Webhook receiver ──────────────────────────────────────────
receiver:
  host: "127.0.0.1"   # Bind address (keep localhost unless Alertmanager is remote)
  port: 9099          # Must match the Alertmanager webhook URL

# ── Correlator behaviour ─────────────────────────────────────
correlator:
  wait_seconds: 30      # Seconds to wait after alert before reading Redis
                        # (allows slow modules to write their final status)
  recent_window: 3600   # When no backup_id label is present, scan Redis for
                        # plan status keys updated within this many seconds

# ── Redis connection ─────────────────────────────────────────
redis:
  socket: "/var/lib/nethserver/cluster/state/redis.sock"

# ── Repository health check ──────────────────────────────────
repo_check:
  enabled: true
  timeout: 60     # Seconds per restic check call

# ── Logging ──────────────────────────────────────────────────
logging:
  level: "INFO"          # DEBUG | INFO | WARNING | ERROR
  file: ""               # Leave empty to log to stdout (journald captures it)
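
One way to think about the options above is as overrides layered on a set of defaults. The sketch below is illustrative only: it assumes the configuration has already been parsed into a dict, that the values shown in the template double as fallbacks, and the names `DEFAULTS` and `with_defaults` are hypothetical rather than taken from utils.py.

```python
import copy

# Values copied from the template above; treating them as built-in
# fallbacks is an assumption of this sketch.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"enabled": True, "timeout": 60},
    "logging": {"level": "INFO", "file": ""},
}


def with_defaults(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a new dict where user settings override defaults, section by section."""
    merged = copy.deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With this approach, a config file that only sets `receiver.port` would still end up with `receiver.host` populated, and the `DEFAULTS` dict itself is never mutated.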

Alertmanager integration

Add a receiver to your Alertmanager configuration on the NS8 leader node (/etc/alertmanager/alertmanager.yml or via the NS8 metrics1 module):

receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: "http://127.0.0.1:9099/alert"
        send_resolved: false

route:
  receiver: ns8-backup-monitor
  group_by: [alertname]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match_re:
        alertname: "NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor

The service handles two alert names:

| Alert name | Meaning |
|---|---|
| NsBackupFailed | One or more backup modules reported an error |
| NsBackupMissing | Expected backup did not run within the time window |
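
To illustrate how a receiver might pick these two alerts out of a webhook body, here is a minimal sketch. It relies on the standard Alertmanager webhook payload (a top-level "alerts" list whose entries carry "status" and "labels"); the function name `relevant_alerts` and the choice to keep only firing alerts are assumptions, not the project's confirmed behaviour.

```python
import json

RELEVANT_ALERTS = {"NsBackupFailed", "NsBackupMissing"}


def relevant_alerts(body: bytes) -> list:
    """Return the firing alerts whose alertname is one the service handles."""
    payload = json.loads(body)
    selected = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumption: resolved alerts are ignored, matching send_resolved: false.
        if alert.get("status") == "firing" and labels.get("alertname") in RELEVANT_ALERTS:
            selected.append(alert)
    return selected
```

Filtering inside the service as well as in the Alertmanager route means unrelated alerts that reach the webhook are dropped rather than turned into emails.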

Outcome classification

After reading per-module Redis keys, the correlator assigns one of three outcomes:

| Outcome | Condition | Email subject |
|---|---|---|
| SUCCESS | All modules succeeded | Backup completed successfully |
| PARTIAL | Some modules failed, some succeeded | ⚠️ Backup partially failed |
| REPO_FAILURE | All modules failed, or no status found in Redis | Backup failed possible repository error |

On PARTIAL or REPO_FAILURE, the repo health check runs automatically and appends diagnostic information (restic error output) to the email.
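
The three conditions can be expressed as a small pure function. This is a sketch of the rules as stated in this section, not the code in correlator.py; the name `classify` and the input shape (a mapping of module ID to its `result` field) are assumptions.

```python
def classify(module_results: dict) -> str:
    """Map per-module results ("success" or "error") to an overall outcome."""
    if not module_results:
        # No status found in Redis is treated like a total failure.
        return "REPO_FAILURE"

    failed = [module for module, result in module_results.items() if result != "success"]
    if not failed:
        return "SUCCESS"
    if len(failed) == len(module_results):
        return "REPO_FAILURE"
    return "PARTIAL"
```

For example, `classify({"mail1": "success", "nextcloud1": "error"})` yields PARTIAL, while an empty mapping yields REPO_FAILURE (the module IDs here are only illustrative).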


Redis key structure

The correlator reads the following NS8 Redis key patterns:

cluster/backup/<backup_id>/status          → overall plan status (hash)
module/<module_id>/backup/<backup_id>/status → per-module status (hash)

Hash fields:

| Field | Values | Description |
|---|---|---|
| result | success / error | Outcome of the backup operation |
| timestamp | ISO 8601 | When the status was last written |
| error | string | Error message, if any |
| errors | integer | Number of module errors (plan-level hash only) |
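
As an illustration of how these keys and fields fit together, the sketch below builds the two key names and turns a hash reply into a dict. The helper names are hypothetical, and the parser assumes the reply arrives as alternating field and value lines, which is how redis-cli prints HGETALL when its output is not a terminal; values containing newlines would need a different approach.

```python
def plan_status_key(backup_id: str) -> str:
    """Key of the plan-level status hash."""
    return f"cluster/backup/{backup_id}/status"


def module_status_key(module_id: str, backup_id: str) -> str:
    """Key of the per-module status hash."""
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(raw: str) -> dict:
    """Pair up alternating field/value lines from a raw HGETALL reply."""
    lines = raw.splitlines()
    # Even positions hold field names, odd positions hold their values.
    return dict(zip(lines[0::2], lines[1::2]))
```

A per-module hash with `result` set to `error` would then be available as `parse_hgetall(reply)["result"]`, ready to feed into the outcome classification.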

Service management

# Check service status
systemctl status ns8-backup-monitor

# View live logs
journalctl -u ns8-backup-monitor -f

# Restart after config change
systemctl restart ns8-backup-monitor

# Disable on boot
systemctl disable ns8-backup-monitor

Troubleshooting

Service fails to start

journalctl -u ns8-backup-monitor --no-pager -n 50

Common causes:

  • config.yml not found at the expected path → check /etc/ns8-backup-monitor/config.yml
  • Port 9099 already in use → change receiver.port in config
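
To confirm the second cause before editing the config, a quick bind test from Python can show whether the port is available. This is a generic sketch using only the standard socket module; `port_is_free` is a hypothetical helper and not part of the package. Run it while the service is stopped, otherwise the service itself will be holding the port.

```python
import socket


def port_is_free(host: str = "127.0.0.1", port: int = 9099) -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    print("9099 free" if port_is_free() else "9099 in use")
```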

No email received after a backup failure

  1. Verify Alertmanager is firing the webhook:

    journalctl -u ns8-backup-monitor -f
    

    You should see the log message "Received N relevant alert(s)" within a minute of the backup failure.

  2. Check that wait_seconds has elapsed (default 30 s), then look for the message "Sending notification..." in the log.

  3. Verify the mail relay works independently:

    echo "Test" | runagent ns8-sendmail -s "test" admin@example.com
    

Correlator finds no modules

If the log shows "No recent backup status keys found in Redis", the possible causes are:

  • recent_window is too short: the backup ran longer ago than the window allows (default 3600 s, i.e. 1 hour)
  • Redis socket path is wrong for your installation
  • The backup plan wrote status to a non-standard key pattern
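
The first cause is easiest to reason about with the window check written out. The sketch below shows one plausible way to compare a status hash's `timestamp` field against `recent_window`; the function name `is_recent`, the UTC assumption for timestamps without an offset, and the handling of a trailing "Z" are all assumptions of this example.

```python
from datetime import datetime, timedelta, timezone


def is_recent(timestamp: str, recent_window: int = 3600, now: datetime = None) -> bool:
    """Return True if an ISO 8601 timestamp lies within recent_window seconds of now."""
    now = now or datetime.now(timezone.utc)

    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"

    written = datetime.fromisoformat(timestamp)
    if written.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        written = written.replace(tzinfo=timezone.utc)

    return now - written <= timedelta(seconds=recent_window)
```

With the default window, a status written 30 minutes before the alert passes the check, while one written 90 minutes earlier does not; raising recent_window in the config widens that margin.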

Development

The application is pure Python 3 with no third-party dependencies.

# Run locally (requires NS8 Redis socket access)
python3 -m ns8_backup_monitor --config ./config/config.yml.example

# Send a test webhook payload
curl -s -X POST http://127.0.0.1:9099/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}'

License

MIT License — contributions welcome via pull request.
