ns8-backup-monitor
NethServer 8 backup failure notification service.
Receives Alertmanager webhook alerts, correlates per-module backup status from the cluster Redis, optionally probes restic repositories, and sends a structured plain-text email through the NS8 mail relay.
Table of contents
- Architecture
- File layout
- Runtime paths
- Requirements
- Installation
- Configuration
- Alertmanager integration
- Outcome classification
- Redis key structure
- Service management
- Troubleshooting
- Uninstallation
- License
Architecture
Alertmanager ──POST /alert──► receiver.py
│
(wait N seconds for all modules
to finish writing their status)
│
▼
correlator.py
(reads Redis KEYS/HGETALL,
classifies outcome:
SUCCESS / PARTIAL / REPO_FAILURE)
│
▼
repo_check.py ← skipped on SUCCESS
(restic snapshots --last --no-cache
on each configured repository)
│
▼
notifier.py
(builds plain-text email,
dispatches via ns8-sendmail)
Key design decision: the service is a long-running HTTP server managed by systemd, not a one-shot script. It is always ready to receive alerts whether the backup is triggered manually from the UI or by an automatic scheduled timer.
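The long-running receiver described above can be sketched with the standard library alone. This is a minimal illustration, not the code in receiver.py: `make_handler`, `serve`, and the `run_pipeline` callable are hypothetical names, and filtering on `status == "firing"` is an assumption based on resolved alerts being ignored.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(run_pipeline):
    """Build a handler class that passes firing alerts to run_pipeline (hypothetical callable)."""

    class AlertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/alert":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                self.send_error(400, "invalid JSON")
                return
            if not isinstance(payload, dict):
                self.send_error(400, "expected a JSON object")
                return
            # Assumption: only firing alerts start the pipeline.
            firing = [a for a in payload.get("alerts", []) if a.get("status") == "firing"]
            if firing:
                # Reply right away; the slow work runs in a background thread.
                threading.Thread(target=run_pipeline, args=(firing,), daemon=True).start()
            self.send_response(202)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # silence per-request logging in this sketch

    return AlertHandler


def serve(host="127.0.0.1", port=9099, run_pipeline=print):
    """Block forever, serving POST /alert on the given address."""
    server = ThreadingHTTPServer((host, port), make_handler(run_pipeline))
    server.serve_forever()
```

Answering before the pipeline finishes keeps Alertmanager from timing out while the service waits for modules to write their status.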
Note on automatic alerts: NS8 native Prometheus rules emit alerts named
backup_failed and backup_missing (not NsBackupFailed / NsBackupMissing). All four names are matched, so the pipeline fires on both native and custom rules. The plan identifier is extracted from the id label (native) or the backup_id label (custom); both are checked.
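The matching rule in the note can be sketched as follows. The helper names `is_backup_alert` and `plan_id` are hypothetical, and checking `id` before `backup_id` is an assumption; the README only states that both labels are checked.

```python
# The four alert names listed in this README.
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def is_backup_alert(alert):
    """Return True when the alert's alertname label is one of the supported names."""
    return alert.get("labels", {}).get("alertname") in ALERT_NAMES


def plan_id(alert):
    """Return the plan identifier from `id` (native) or `backup_id` (custom), else None."""
    labels = alert.get("labels", {})
    # Assumption: the native label wins when both are present.
    return labels.get("id") or labels.get("backup_id")
```

A `None` result corresponds to the case where neither label is present and the plan has to be found another way.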
File layout
ns8-backup-monitor/
│
├── README.md ← this file
│
├── config/
│ └── config.yml.example ← annotated configuration template
│ (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│ ├── install.sh ← interactive installer / uninstaller
│ └── ns8-backup-monitor.service ← systemd unit file
│
└── ns8_backup_monitor/ ← Python package (the service code)
├── __init__.py ← package metadata and version string
├── __main__.py ← entry point: argument parsing, logging
│ initialisation, calls receiver.run_server()
├── receiver.py ← HTTP webhook server (POST /alert)
│ matches alert names, spawns pipeline thread
├── correlator.py ← reads NS8 Redis, classifies backup outcome
├── repo_check.py ← probes restic repositories for health status
├── notifier.py ← builds and sends the status email
└── utils.py ← load_config() and setup_logging() helpers
Runtime paths
Paths created by deploy/install.sh and assumed by the default configuration.
| Purpose | Default path |
|---|---|
| Python package | /opt/ns8-backup-monitor/ns8_backup_monitor/ |
| Deploy scripts | /opt/ns8-backup-monitor/deploy/ |
| Configuration file | /etc/ns8-backup-monitor/config.yml |
| systemd unit | /etc/systemd/system/ns8-backup-monitor.service |
| Log file | /var/log/ns8-backup-monitor.log |
| NS8 cluster Redis socket | /var/lib/nethserver/cluster/state/redis.sock |
Requirements
| Dependency | Provided by | Notes |
|---|---|---|
| python3 ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| pyyaml | pip3 install pyyaml | Only non-stdlib dependency |
| redis-cli | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| restic | NethServer 8 / manual | Required for repo_check only |
| ns8-sendmail | NethServer 8 | Required for email delivery |
| systemd | OS | Service management |
This service must run on an NS8 leader node (or any node with read access to the cluster Redis Unix socket and
ns8-sendmail in PATH).
Installation
One-liner (recommended)
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
The installer will:
- Check prerequisites (python3, curl, tar, ns8-sendmail).
- Download and extract the latest source from the Gitea repository.
- Prompt interactively for sender address, recipient list, and subject prefix.
- Write /etc/ns8-backup-monitor/config.yml with the supplied values.
- Install and start the systemd service.
Manual installation
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor
pip3 install pyyaml
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml # edit before starting
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
Configuration
Full reference is in config/config.yml.example. Key parameters:
receiver:
host: 127.0.0.1 # bind address (use 0.0.0.0 only for remote Alertmanager)
port: 9099 # webhook listening port
correlator:
wait_seconds: 30 # wait after alert before reading Redis (allow slow modules to finish)
recent_window: 3600 # fallback scan window (seconds) when no backup_id label is present
repo_check:
enabled: true # set false to skip restic health checks entirely
timeout: 60 # per-repository restic timeout (seconds)
notification:
mail_from: ns8-backup-monitor@example.com
mail_to:
- admin@example.com
- ops@example.com
redis:
socket: /var/lib/nethserver/cluster/state/redis.sock
logging:
level: INFO # DEBUG for verbose output during troubleshooting
file: /var/log/ns8-backup-monitor.log
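One way to treat the values above as defaults is to merge a partially filled config over them. The sketch below assumes the YAML has already been parsed into a dict; `DEFAULTS` and `merge_defaults` are hypothetical names, and whether `load_config()` in utils.py behaves this way is not stated here.

```python
import copy

# Default values copied from the key parameters shown above.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "repo_check": {"enabled": True, "timeout": 60},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_defaults(user, defaults=DEFAULTS):
    """Return a new dict: defaults overlaid with the user's values, section by section."""
    merged = copy.deepcopy(defaults)
    for key, value in (user or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that setting one key does not wipe its siblings.
            merged[key] = merge_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged
```

With this approach a config.yml that only sets `correlator.wait_seconds` still ends up with a complete `correlator` section.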
Alertmanager integration
Add a receiver and route to your Alertmanager configuration:
# alertmanager.yml
receivers:
- name: ns8-backup-monitor
webhook_configs:
- url: http://127.0.0.1:9099/alert
send_resolved: false # resolved alerts are intentionally ignored
route:
routes:
- matchers:
- alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
receiver: ns8-backup-monitor
group_wait: 10s
group_interval: 5m
repeat_interval: 12h
Supported alert names
| Alert name | Rule set | Trigger |
|---|---|---|
| backup_failed | NS8 native (node_backup_status) | One or more plans reported result != success |
| backup_missing | NS8 native (node_backup_status) | Expected backup did not complete in time |
| NsBackupFailed | Custom / legacy | Same semantics as backup_failed |
| NsBackupMissing | Custom / legacy | Same semantics as backup_missing |
Label mapping
| Label | Used by | Contains |
|---|---|---|
| id | NS8 native alerts | Backup plan identifier |
| backup_id | Custom / legacy alerts | Backup plan identifier |
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within correlator.recent_window.
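The fallback window can be sketched as a timestamp filter. This assumes the plan status hashes have already been read into a dict keyed by plan identifier and that `timestamp` is ISO 8601 in UTC, as listed under Redis key structure; `parse_utc` and `recent_plans` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts):
    """Parse an ISO 8601 timestamp, treating a trailing Z or a missing offset as UTC."""
    if ts.endswith("Z"):
        # datetime.fromisoformat() only accepts "Z" from Python 3.11 onwards.
        ts = ts[:-1] + "+00:00"
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def recent_plans(statuses, window_seconds, now=None):
    """Return the sorted plan ids whose status was updated within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=window_seconds)
    return sorted(
        plan for plan, status in statuses.items()
        if parse_utc(status["timestamp"]) >= cutoff
    )
```

Passing `now` explicitly makes the filter easy to check against fixed timestamps.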
Outcome classification
| Outcome | Condition | Email subject |
|---|---|---|
| SUCCESS | failed == 0 and total > 0 | [ns8-backup] SUCCESS - all N module(s) backed up successfully |
| PARTIAL | 0 < failed < total | [ns8-backup] PARTIAL - N/M module(s) failed |
| REPO_FAILURE | failed == total or total == 0 | [ns8-backup] REPO_FAILURE - <reason> |
REPO_FAILURE covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). The repository health check (repo_check.py) runs automatically for
PARTIAL and REPO_FAILURE outcomes to provide additional diagnostics.
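The table maps directly onto a small function. The sketch below follows the conditions as written; the function names, the default `reason` text, and reading N/M as failed/total are assumptions made for illustration.

```python
def classify(total, failed):
    """Map module counts to SUCCESS, PARTIAL, or REPO_FAILURE."""
    if total > 0 and failed == 0:
        return "SUCCESS"
    if 0 < failed < total:
        return "PARTIAL"
    # Covers failed == total and total == 0 (no status found in Redis).
    return "REPO_FAILURE"


def needs_repo_check(outcome):
    """The repository health check is skipped only on SUCCESS."""
    return outcome != "SUCCESS"


def subject(total, failed, prefix="[ns8-backup]", reason="no module status found"):
    """Build the email subject; `reason` is a hypothetical placeholder text."""
    outcome = classify(total, failed)
    if outcome == "SUCCESS":
        detail = f"all {total} module(s) backed up successfully"
    elif outcome == "PARTIAL":
        detail = f"{failed}/{total} module(s) failed"  # assumption: N/M is failed/total
    else:
        detail = reason
    return f"{prefix} {outcome} - {detail}"
```

Checking the SUCCESS condition first keeps the `total == 0` case from being reported as a success.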
Redis key structure
Keys read by correlator.py and repo_check.py:
cluster/backup/<backup_id>/status
Hash fields:
result "success" | "error"
timestamp ISO 8601 (UTC)
errors integer count of failed modules
module/<module_id>/backup/<backup_id>/status
Hash fields:
result "success" | "error"
timestamp ISO 8601 (UTC)
error human-readable error message (empty on success)
cluster/backup_repository/<repo_id>/parameters
Hash fields:
url cloud backend URL (S3, B2, rclone, ...)
path local or SFTP path
password restic repository password
backend "s3" | "b2" | "sftp" | "rclone" | "local"
aws_access_key_id S3 key ID (also used as b2_account_id for B2)
aws_secret_access_key S3 secret (also used as b2_account_key for B2)
rclone_config path to rclone configuration file
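Because Redis is reached through redis-cli rather than a Python client, reading one of these hashes comes down to running `HGETALL` and pairing up the output lines. The sketch below assumes redis-cli's plain output when stdout is not a terminal (one field or value per line) and values without embedded newlines; `parse_hgetall` and `hgetall` are hypothetical names.

```python
import subprocess

DEFAULT_SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def parse_hgetall(output):
    """Turn alternating field/value lines into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def hgetall(key, socket_path=DEFAULT_SOCKET, timeout=10):
    """Read a Redis hash through redis-cli over the Unix socket."""
    result = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if redis-cli exits non-zero
    )
    return parse_hgetall(result.stdout)
```

For example, `hgetall("cluster/backup/1/status")` would return the `result`, `timestamp`, and `errors` fields as strings. Repository parameter hashes contain the restic password, so their contents should not be written to logs.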
Service management
# Status
systemctl status ns8-backup-monitor
# Restart after config change
systemctl restart ns8-backup-monitor
# Live logs
journalctl -u ns8-backup-monitor -f
# Follow live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
-H "Content-Type: application/json" \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
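The same manual test can be scripted in Python, which is convenient for repeated checks. This is a sketch using only the standard library; `build_test_payload` and `send_test_alert` are hypothetical helpers, and the payload mirrors the curl example above.

```python
import json
import urllib.request


def build_test_payload(alertname="backup_failed", plan="1", name="Daily"):
    """Build a minimal Alertmanager-style webhook body with one firing alert."""
    return {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": alertname, "id": plan, "name": name},
            }
        ]
    }


def send_test_alert(url="http://127.0.0.1:9099/alert", timeout=10, **kwargs):
    """POST the test payload to the webhook and return the HTTP status code."""
    body = json.dumps(build_test_payload(**kwargs)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_test_alert(alertname="backup_missing")` exercises the second native alert name without touching Alertmanager.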
Troubleshooting
Pipeline does not trigger on automatic backups
Symptom: email arrives when you click "Run backup" from the NS8 UI but not when the scheduled timer fires.
Cause (pre-fix): the old ALERT_NAMES set only contained NsBackupFailed
and NsBackupMissing. NS8 native Prometheus rules emit backup_failed and
backup_missing instead. Additionally, native alerts carry the plan identifier
in the id label, not in backup_id.
Fix: update to the current version; both alert name sets and both labels are now handled automatically.
Verify: run journalctl -u ns8-backup-monitor -f and trigger a test alert
with amtool alert add alertname=backup_failed id=1. You should see the
Received alert: DEBUG line followed by the pipeline start.
No email received
- Check the service is running: systemctl status ns8-backup-monitor.
- Verify Alertmanager is routing to the correct receiver: amtool config routes test alertname=backup_failed.
- Check mail_to in config.yml is set and ns8-sendmail works: echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test".
- Increase log level to DEBUG in config.yml and restart the service.
REPO_FAILURE with no modules found
Cause: the correlator ran before backup modules finished writing their status to Redis.
Fix: increase correlator.wait_seconds in config.yml. A value of 60–120
is safe for most clusters. Restart the service after changing the value.
Uninstallation
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
License
MIT — see LICENSE.