admin 07830e1467 docs: rewrite README with alert name mapping, label mapping, troubleshooting for automatic backups

2026-05-18 21:57:02 +00:00

ns8-backup-monitor

NethServer 8 backup failure notification service.

Receives Alertmanager webhook alerts, correlates per-module backup status from the cluster Redis, optionally probes restic repositories, and sends a structured plain-text email through the NS8 mail relay.

Architecture
File layout
Runtime paths
Requirements
Installation
Configuration
Alertmanager integration
Outcome classification
Redis key structure
Service management
Troubleshooting
Uninstallation
License

Architecture

Alertmanager  ──POST /alert──►  receiver.py
                                    │
                          (wait N seconds for all modules
                           to finish writing their status)
                                    │
                                    ▼
                              correlator.py
                          (reads Redis KEYS/HGETALL,
                           classifies outcome:
                           SUCCESS / PARTIAL / REPO_FAILURE)
                                    │
                                    ▼
                              repo_check.py          ← skipped on SUCCESS
                          (restic snapshots --last --no-cache
                           on each configured repository)
                                    │
                                    ▼
                               notifier.py
                          (builds plain-text email,
                           dispatches via ns8-sendmail)

Key design decision: the service is a long-running HTTP server managed by systemd, not a one-shot script. It is always ready to receive alerts whether the backup is triggered manually from the UI or by an automatic scheduled timer.
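
The flow above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual receiver.py: the names `firing_alerts`, `run_pipeline`, `AlertHandler`, and `WAIT_SECONDS` are assumptions made for the example, and the pipeline body is a placeholder.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

WAIT_SECONDS = 30  # illustrative; mirrors correlator.wait_seconds


def firing_alerts(body: bytes) -> list:
    """Return only the firing alerts from an Alertmanager webhook body."""
    payload = json.loads(body)
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def run_pipeline(alert: dict) -> None:
    """Hypothetical placeholder for correlate -> repo check -> notify."""
    print("pipeline started for", alert.get("labels", {}))


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alert":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            alerts = firing_alerts(self.rfile.read(length))
        except (ValueError, AttributeError):
            # Malformed JSON or a payload that is not a JSON object.
            self.send_error(400, "invalid JSON payload")
            return
        for alert in alerts:
            # Delay the pipeline so modules can finish writing their status.
            timer = threading.Timer(WAIT_SECONDS, run_pipeline, args=(alert,))
            timer.daemon = True
            timer.start()
        self.send_response(200)
        self.end_headers()


def run_server(host: str = "127.0.0.1", port: int = 9099) -> None:
    """Serve forever; systemd is expected to supervise the process."""
    ThreadingHTTPServer((host, port), AlertHandler).serve_forever()
```

Answering the webhook immediately and deferring the work to a timer thread keeps Alertmanager from timing out while the service waits for slow modules.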

Note on automatic alerts: NS8 native Prometheus rules emit alerts named backup_failed and backup_missing (not NsBackupFailed / NsBackupMissing). All four names are matched so the pipeline fires on both native and custom rules. The plan identifier is extracted from the id label (native) or backup_id label (custom) — both are checked.

File layout

ns8-backup-monitor/
│
├── README.md                          ← this file
│
├── config/
│   └── config.yml.example             ← annotated configuration template
│                                         (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│   ├── install.sh                     ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package (the service code)
    ├── __init__.py                    ← package metadata and version string
    ├── __main__.py                    ← entry point: argument parsing, logging
    │                                     initialisation, calls receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
    │                                     matches alert names, spawns pipeline thread
    ├── correlator.py                  ← reads NS8 Redis, classifies backup outcome
    ├── repo_check.py                  ← probes restic repositories for health status
    ├── notifier.py                    ← builds and sends the status email
    └── utils.py                       ← load_config() and setup_logging() helpers

Runtime paths

Paths created by deploy/install.sh and assumed by the default configuration.

| Purpose | Default path |
| --- | --- |
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |

Requirements

| Dependency | Provided by | Notes |
| --- | --- | --- |
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |

This service must run on an NS8 leader node (or any node with read access to the cluster Redis Unix socket and ns8-sendmail in PATH).

Installation

One-liner (recommended)

bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)

The installer will:

1. Check prerequisites (python3, curl, tar, ns8-sendmail).
2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write /etc/ns8-backup-monitor/config.yml with the supplied values.
5. Install and start the systemd service.

Manual installation

git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor

pip3 install pyyaml

mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml   # edit before starting

cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor

Configuration

Full reference is in config/config.yml.example. Key parameters:

receiver:
  host: 127.0.0.1   # bind address (use 0.0.0.0 only for remote Alertmanager)
  port: 9099         # webhook listening port

correlator:
  wait_seconds: 30   # wait after alert before reading Redis (allow slow modules to finish)
  recent_window: 3600 # fallback scan window (seconds) when neither id nor backup_id label is present

repo_check:
  enabled: true      # set false to skip restic health checks entirely
  timeout: 60        # per-repository restic timeout (seconds)

notification:
  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com

redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock

logging:
  level: INFO        # DEBUG for verbose output during troubleshooting
  file: /var/log/ns8-backup-monitor.log
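
One way to treat a partially filled config.yml is to overlay it on built-in defaults. The sketch below assumes the YAML file has already been parsed into a dict. The `DEFAULTS` table and `merge_config` helper are hypothetical names for this example, and whether the service applies defaults this way is an assumption; the values are copied from the sample above.

```python
import copy

# Defaults copied from the sample configuration; "notification" has none
# because sender and recipients must be supplied by the administrator.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "repo_check": {"enabled": True, "timeout": 60},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_config(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Overlay user settings on the defaults, one section at a time."""
    merged = copy.deepcopy(defaults)
    for section, values in (user or {}).items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)  # keep defaults for unset keys
        else:
            merged[section] = values  # new section or scalar override
    return merged
```

With this approach a file that only sets `correlator.wait_seconds` and the `notification` section would still yield a complete configuration.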

Alertmanager integration

Add a receiver and route to your Alertmanager configuration:

# alertmanager.yml

receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: http://127.0.0.1:9099/alert
        send_resolved: false   # resolved alerts are intentionally ignored

route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h

Supported alert names

| Alert name | Rule set | Trigger |
| --- | --- | --- |
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |

Label mapping

| Label | Used by | Contains |
| --- | --- | --- |
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |

Both labels are checked. When neither is present the correlator falls back to scanning Redis for all plan status keys updated within correlator.recent_window.
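
The name and label handling described above could look roughly like the following. This is a sketch, not the code in receiver.py: the helper names are hypothetical, and checking `id` before `backup_id` is an assumed precedence that the README does not specify.

```python
from typing import Optional

# The four alert names listed in the table above.
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def is_backup_alert(alert: dict) -> bool:
    """True when the alert name belongs to the native or custom rule set."""
    return alert.get("labels", {}).get("alertname") in ALERT_NAMES


def extract_plan_id(alert: dict) -> Optional[str]:
    """Return the plan identifier, or None to signal the Redis fallback scan."""
    labels = alert.get("labels", {})
    # Native alerts use "id"; custom or legacy alerts use "backup_id".
    return labels.get("id") or labels.get("backup_id") or None
```

A `None` result is the cue for the caller to scan Redis for plan status keys updated within `correlator.recent_window`.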

Outcome classification

| Outcome | Condition | Email subject |
| --- | --- | --- |
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |

REPO_FAILURE covers both the case where all modules failed and the case where no status was found in Redis at all (possible repository-level or scheduling issue). The repository health check (repo_check.py) runs automatically for PARTIAL and REPO_FAILURE outcomes to provide additional diagnostics.
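
The conditions in the table translate directly into a small function. The sketch below is illustrative only; `classify` and `needs_repo_check` are hypothetical names and may not match correlator.py.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to an outcome, following the table above."""
    if total > 0 and failed == 0:
        return "SUCCESS"
    if 0 < failed < total:
        return "PARTIAL"
    # Either every module failed or no status was found at all.
    return "REPO_FAILURE"


def needs_repo_check(outcome: str) -> bool:
    """The repository probe is skipped only on SUCCESS."""
    return outcome != "SUCCESS"
```

For example, three modules with one failure give `PARTIAL`, while zero modules found gives `REPO_FAILURE`.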

Redis key structure

Keys read by correlator.py and repo_check.py:

cluster/backup/<backup_id>/status
    Hash fields:
        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        errors     integer count of failed modules

module/<module_id>/backup/<backup_id>/status
    Hash fields:
        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        error      human-readable error message (empty on success)

cluster/backup_repository/<repo_id>/parameters
    Hash fields:
        url               cloud backend URL (S3, B2, rclone, ...)
        path              local or SFTP path
        password          restic repository password
        backend           "s3" | "b2" | "sftp" | "rclone" | "local"
        aws_access_key_id S3 key ID (also used as b2_account_id for B2)
        aws_secret_access_key S3 secret (also used as b2_account_key for B2)
        rclone_config     path to rclone configuration file
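
Because the service calls redis-cli through a subprocess, a hash read comes back as plain text. When its output is not a terminal, redis-cli prints each field and each value on its own line, so the pairs can be rebuilt by taking alternating lines. The helper names below are assumptions for illustration, and the parser does not handle values that contain newlines.

```python
import subprocess

REDIS_SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def module_status_key(module_id: str, backup_id: str) -> str:
    """Build the per-module status key shown above."""
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(output: str) -> dict:
    """Turn alternating field/value lines into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key: str, socket: str = REDIS_SOCKET) -> dict:
    """Run HGETALL over the Unix socket and parse the raw reply."""
    result = subprocess.run(
        ["redis-cli", "-s", socket, "HGETALL", key],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_hgetall(result.stdout)
```

A call such as `read_hash(module_status_key("mail1", "1"))` (the module ID is made up) would return the `result`, `timestamp`, and `error` fields as strings.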

Service management

# Status
systemctl status ns8-backup-monitor

# Restart after config change
systemctl restart ns8-backup-monitor

# Live logs
journalctl -u ns8-backup-monitor -f

# Follow live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG

# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
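
The same manual test can be sent from Python, which may be handy in a scripted health check. This sketch uses only the standard library; `build_test_payload` and `post_test_alert` are hypothetical helpers, and the label values repeat those of the curl example.

```python
import json
import urllib.request


def build_test_payload(alertname: str = "backup_failed",
                       plan_id: str = "1",
                       name: str = "Daily") -> bytes:
    """Build a minimal Alertmanager-style body with one firing alert."""
    alert = {
        "status": "firing",
        "labels": {"alertname": alertname, "id": plan_id, "name": name},
    }
    return json.dumps({"alerts": [alert]}).encode("utf-8")


def post_test_alert(url: str = "http://127.0.0.1:9099/alert",
                    timeout: float = 10) -> int:
    """POST the test payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_test_payload(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `post_test_alert()` requires the service to be listening on the configured host and port.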

Troubleshooting

Pipeline does not trigger on automatic backups

Symptom: email arrives when you click "Run backup" from the NS8 UI but not when the scheduled timer fires.

Cause (pre-fix): the old ALERT_NAMES set only contained NsBackupFailed and NsBackupMissing. NS8 native Prometheus rules emit backup_failed and backup_missing instead. Additionally, native alerts carry the plan identifier in the id label, not in backup_id.

Fix: update to the current version; both alert name sets and both labels are now handled automatically.

Verify: run journalctl -u ns8-backup-monitor -f and trigger a test alert with amtool alert add alertname=backup_failed id=1. With logging.level set to DEBUG, you should see the "Received alert:" line logged at DEBUG level, followed by the pipeline start.

No email received

1. Check the service is running: systemctl status ns8-backup-monitor.
2. Verify Alertmanager is routing to the correct receiver: amtool config routes test alertname=backup_failed.
3. Check mail_to in config.yml is set and ns8-sendmail works: echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test".
4. Increase log level to DEBUG in config.yml and restart the service.

REPO_FAILURE with no modules found

Cause: the correlator ran before backup modules finished writing their status to Redis.

Fix: increase correlator.wait_seconds in config.yml. A value of 60–120 is safe for most clusters. Restart the service after changing the value.

Uninstallation

bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall

License

MIT — see LICENSE.
