ns8-backup-monitor

NethServer 8 backup failure notification service.

Receives Alertmanager webhook alerts, correlates per-module backup status from the cluster Redis, optionally probes restic repositories, and sends a structured plain-text email through the NS8 mail relay.


Table of contents

  1. Architecture
  2. File layout
  3. Runtime paths
  4. Requirements
  5. Installation
  6. Configuration
  7. Alertmanager integration
  8. Outcome classification
  9. Redis key structure
  10. Service management
  11. Troubleshooting
  12. Uninstallation
  13. License

Architecture

Alertmanager  ──POST /alert──►  receiver.py
                                    │
                          (wait N seconds for all modules
                           to finish writing their status)
                                    │
                                    ▼
                              correlator.py
                          (reads Redis KEYS/HGETALL,
                           classifies outcome:
                           SUCCESS / PARTIAL / REPO_FAILURE)
                                    │
                                    ▼
                              repo_check.py          ← skipped on SUCCESS
                          (restic snapshots --last --no-cache
                           on each configured repository)
                                    │
                                    ▼
                               notifier.py
                          (builds plain-text email,
                           dispatches via ns8-sendmail)

Key design decision: the service is a long-running HTTP server managed by systemd, not a one-shot script. It is always ready to receive alerts whether the backup is triggered manually from the UI or by an automatic scheduled timer.
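
The long-running receiver described above could look roughly like the sketch below. It is an illustration built on the standard library only, not the code in receiver.py: the names make_server and on_alerts are hypothetical, and the only details taken from this README are the POST /alert endpoint and the fact that the pipeline runs on its own thread.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_server(host: str, port: int,
                on_alerts: Callable[[list], None]) -> ThreadingHTTPServer:
    """Build an HTTP server that hands each webhook's alert list to on_alerts.

    Hypothetical helper: on_alerts stands in for the correlate/check/notify pipeline.
    """

    class AlertHandler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            if self.path != "/alert":
                self.send_error(404)
                return
            try:
                length = int(self.headers.get("Content-Length", 0))
                payload = json.loads(self.rfile.read(length) or b"{}")
                alerts = payload.get("alerts", [])
            except (ValueError, AttributeError):
                # Malformed JSON, or a JSON document that is not an object.
                self.send_error(400)
                return
            # Answer Alertmanager right away; the slow pipeline runs in the background.
            threading.Thread(target=on_alerts, args=(alerts,), daemon=True).start()
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args) -> None:
            # Suppress the default per-request stderr logging.
            pass

    return ThreadingHTTPServer((host, port), AlertHandler)
```

Calling make_server("127.0.0.1", 9099, pipeline).serve_forever() would keep the process listening until systemd stops it. Replying before the pipeline finishes matters because the pipeline deliberately waits before reading Redis, and a webhook sender should not be kept waiting for that long.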

Note on automatic alerts: NS8 native Prometheus rules emit alerts named backup_failed and backup_missing (not NsBackupFailed / NsBackupMissing). All four names are matched so the pipeline fires on both native and custom rules. The plan identifier is extracted from the id label (native) or backup_id label (custom) — both are checked.
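
The matching rule in the note above can be sketched as follows. This is a minimal illustration, not the code in receiver.py: the function name match_alert and its (matched, plan_id) return shape are assumptions, while the four alert names, the two labels, and the rule that resolved alerts are ignored come from this README.

```python
from typing import Optional, Tuple

# Alert names this README lists as matched (native NS8 rules plus custom/legacy ones).
ALERT_NAMES = frozenset(
    {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}
)


def match_alert(alert: dict) -> Tuple[bool, Optional[str]]:
    """Return (matched, plan_id) for one entry of the webhook's "alerts" list.

    Hypothetical helper: the real receiver.py may be structured differently.
    """
    labels = alert.get("labels") or {}
    # Resolved alerts are ignored; only firing alerts start the pipeline.
    if alert.get("status") != "firing":
        return False, None
    if labels.get("alertname") not in ALERT_NAMES:
        return False, None
    # Native rules carry the plan in "id"; custom/legacy rules use "backup_id".
    plan_id = labels.get("id") or labels.get("backup_id")
    return True, plan_id
```

A matched alert with plan_id set to None is the case where the correlator would have to fall back to scanning recent plan status keys.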


File layout

ns8-backup-monitor/
│
├── README.md                          ← this file
│
├── config/
│   └── config.yml.example             ← annotated configuration template
│                                         (copy to /etc/ns8-backup-monitor/config.yml)
│
├── deploy/
│   ├── install.sh                     ← interactive installer / uninstaller
│   └── ns8-backup-monitor.service     ← systemd unit file
│
└── ns8_backup_monitor/                ← Python package (the service code)
    ├── __init__.py                    ← package metadata and version string
    ├── __main__.py                    ← entry point: argument parsing, logging
    │                                     initialisation, calls receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
    │                                     matches alert names, spawns pipeline thread
    ├── correlator.py                  ← reads NS8 Redis, classifies backup outcome
    ├── repo_check.py                  ← probes restic repositories for health status
    ├── notifier.py                    ← builds and sends the status email
    └── utils.py                       ← load_config() and setup_logging() helpers

Runtime paths

Paths created by deploy/install.sh and assumed by the default configuration.

| Purpose | Default path |
| --- | --- |
| Python package | /opt/ns8-backup-monitor/ns8_backup_monitor/ |
| Deploy scripts | /opt/ns8-backup-monitor/deploy/ |
| Configuration file | /etc/ns8-backup-monitor/config.yml |
| systemd unit | /etc/systemd/system/ns8-backup-monitor.service |
| Log file | /var/log/ns8-backup-monitor.log |
| NS8 cluster Redis socket | /var/lib/nethserver/cluster/state/redis.sock |

Requirements

| Dependency | Provided by | Notes |
| --- | --- | --- |
| python3 ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| pyyaml | pip3 install pyyaml | Only non-stdlib dependency |
| redis-cli | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| restic | NethServer 8 / manual | Required for repo_check only |
| ns8-sendmail | NethServer 8 | Required for email delivery |
| systemd | OS | Service management |

This service must run on an NS8 leader node (or any node with read access to the cluster Redis Unix socket and ns8-sendmail in PATH).
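
A quick way to confirm a node meets the requirement above is a small preflight check. The sketch below is an assumption about how such a check could look (the function missing_prerequisites is hypothetical and not part of the package). It only verifies that the commands are on PATH and that the socket path exists and is a Unix socket; it does not prove that the current user can actually read from Redis.

```python
import os
import shutil
import stat
from typing import Iterable, List


def missing_prerequisites(commands: Iterable[str], socket_path: str) -> List[str]:
    """Return human-readable problems; an empty list means the node looks usable."""
    problems = [
        f"command not found in PATH: {name}"
        for name in commands
        if shutil.which(name) is None
    ]
    try:
        mode = os.stat(socket_path).st_mode
    except OSError as exc:
        problems.append(f"cannot stat Redis socket {socket_path}: {exc.strerror}")
    else:
        if not stat.S_ISSOCK(mode):
            problems.append(f"not a Unix socket: {socket_path}")
    return problems
```

For example, missing_prerequisites(["redis-cli", "restic", "ns8-sendmail"], "/var/lib/nethserver/cluster/state/redis.sock") should return an empty list on a node that satisfies the table above.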


Installation

bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)

The installer will:

  1. Check prerequisites (python3, curl, tar, ns8-sendmail).
  2. Download and extract the latest source from the Gitea repository.
  3. Prompt interactively for sender address, recipient list, and subject prefix.
  4. Write /etc/ns8-backup-monitor/config.yml with the supplied values.
  5. Install and start the systemd service.

Manual installation

git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor

pip3 install pyyaml

mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml   # edit before starting

cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor

Configuration

Full reference is in config/config.yml.example. Key parameters:

receiver:
  host: 127.0.0.1   # bind address (use 0.0.0.0 only for remote Alertmanager)
  port: 9099         # webhook listening port

correlator:
  wait_seconds: 30   # wait after alert before reading Redis (allow slow modules to finish)
  recent_window: 3600 # fallback scan window (seconds) when neither id nor backup_id label is present

repo_check:
  enabled: true      # set false to skip restic health checks entirely
  timeout: 60        # per-repository restic timeout (seconds)

notification:
  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com

redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock

logging:
  level: INFO        # DEBUG for verbose output during troubleshooting
  file: /var/log/ns8-backup-monitor.log
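
Because the file only needs to override what differs from the defaults, a loader would typically overlay the parsed YAML on a built-in defaults dictionary. The sketch below shows that merge step under stated assumptions: DEFAULTS and merge_config are hypothetical names, the default values simply mirror the ones shown above, and the actual load_config() in utils.py may behave differently.

```python
import copy
from typing import Any, Dict, Optional

# Defaults mirror the values shown in this README; config.yml.example may define more keys.
DEFAULTS: Dict[str, Any] = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "repo_check": {"enabled": True, "timeout": 60},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_config(user: Optional[Dict[str, Any]],
                 defaults: Dict[str, Any] = DEFAULTS) -> Dict[str, Any]:
    """Recursively overlay user settings on the defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in (user or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale.
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

The user argument would normally be the result of yaml.safe_load() on the configuration file; that call returns None for an empty file, which is why None is accepted here.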

Alertmanager integration

Add a receiver and route to your Alertmanager configuration:

# alertmanager.yml

receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: http://127.0.0.1:9099/alert
        send_resolved: false   # resolved alerts are intentionally ignored

route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h

Supported alert names

| Alert name | Rule set | Trigger |
| --- | --- | --- |
| backup_failed | NS8 native (node_backup_status) | One or more plans reported result != success |
| backup_missing | NS8 native (node_backup_status) | Expected backup did not complete in time |
| NsBackupFailed | Custom / legacy | Same semantics as backup_failed |
| NsBackupMissing | Custom / legacy | Same semantics as backup_missing |

Label mapping

| Label | Used by | Contains |
| --- | --- | --- |
| id | NS8 native alerts | Backup plan identifier |
| backup_id | Custom / legacy alerts | Backup plan identifier |

Both labels are checked. When neither is present the correlator falls back to scanning Redis for all plan status keys updated within correlator.recent_window.
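
The fallback described above amounts to a timestamp filter over the plan status hashes. The sketch below illustrates it on already-loaded data; recent_plan_ids and parse_utc are hypothetical names, and the exact timestamp format written by NS8 is an assumption (ISO 8601 in UTC, with or without a trailing Z or offset).

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a missing offset as UTC."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def recent_plan_ids(statuses: Dict[str, Dict[str, str]], recent_window: int,
                    now: Optional[datetime] = None) -> List[str]:
    """Return the plan ids whose status hash was updated within recent_window seconds."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=recent_window)
    recent = []
    for plan_id, fields in statuses.items():
        try:
            updated = parse_utc(fields.get("timestamp", ""))
        except ValueError:
            continue  # skip hashes with a missing or unparseable timestamp
        if updated >= cutoff:
            recent.append(plan_id)
    return sorted(recent)
```

Here statuses maps each plan id to the fields of its cluster/backup/<backup_id>/status hash.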


Outcome classification

| Outcome | Condition | Email subject |
| --- | --- | --- |
| SUCCESS | `failed == 0 and total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| PARTIAL | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| REPO_FAILURE | `failed == total or total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |

REPO_FAILURE covers both the case where all modules failed and the case where no status was found in Redis at all (possible repository-level or scheduling issue). The repository health check (repo_check.py) runs automatically for PARTIAL and REPO_FAILURE outcomes to provide additional diagnostics.
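
The conditions above translate directly into a small function. This is a sketch of the rule as documented here, not the code in correlator.py; the names classify and needs_repo_check are hypothetical.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to SUCCESS, PARTIAL, or REPO_FAILURE."""
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    # No status at all, or every module failed: treat as a repository-level problem.
    if total == 0 or failed == total:
        return "REPO_FAILURE"
    if failed == 0:
        return "SUCCESS"
    return "PARTIAL"


def needs_repo_check(outcome: str) -> bool:
    """The repository health check is skipped only when everything succeeded."""
    return outcome != "SUCCESS"
```

Checking the REPO_FAILURE condition first keeps the total == 0 case from being read as a success with zero failures.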


Redis key structure

Keys read by correlator.py and repo_check.py:

cluster/backup/<backup_id>/status
    Hash fields:
        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        errors     integer count of failed modules

module/<module_id>/backup/<backup_id>/status
    Hash fields:
        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        error      human-readable error message (empty on success)

cluster/backup_repository/<repo_id>/parameters
    Hash fields:
        url               cloud backend URL (S3, B2, rclone, ...)
        path              local or SFTP path
        password          restic repository password
        backend           "s3" | "b2" | "sftp" | "rclone" | "local"
        aws_access_key_id S3 key ID (also used as b2_account_id for B2)
        aws_secret_access_key S3 secret (also used as b2_account_key for B2)
        rclone_config     path to rclone configuration file
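
Since Redis is reached through redis-cli rather than a Python client, reading one of these hashes means running a subprocess and parsing its output. The sketch below assumes redis-cli's raw output mode (used automatically when stdout is not a terminal), which prints each field name and each value on its own line. The helper names hgetall and parse_hgetall are hypothetical, and values that contain newlines would not survive this parsing.

```python
import subprocess
from typing import Dict


def parse_hgetall(raw: str) -> Dict[str, str]:
    """Turn alternating field/value lines into a dict, keeping empty values."""
    lines = raw.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the artifact of the trailing newline only
    return dict(zip(lines[0::2], lines[1::2]))


def hgetall(socket_path: str, key: str, timeout: float = 10.0) -> Dict[str, str]:
    """Read one hash through redis-cli over the cluster's Unix socket."""
    result = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return parse_hgetall(result.stdout)
```

A call such as hgetall("/var/lib/nethserver/cluster/state/redis.sock", "cluster/backup/1/status") would return the result, timestamp, and errors fields as strings. Splitting on newlines without discarding blank lines is deliberate, because an empty error field is printed as an empty line.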

Service management

# Status
systemctl status ns8-backup-monitor

# Restart after config change
systemctl restart ns8-backup-monitor

# Live logs
journalctl -u ns8-backup-monitor -f

# Live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG

# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'

Troubleshooting

Pipeline does not trigger on automatic backups

Symptom: email arrives when you click "Run backup" from the NS8 UI but not when the scheduled timer fires.

Cause (pre-fix): the old ALERT_NAMES set only contained NsBackupFailed and NsBackupMissing. NS8 native Prometheus rules emit backup_failed and backup_missing instead. Additionally, native alerts carry the plan identifier in the id label, not in backup_id.

Fix: update to the current version; both alert name sets and both labels are now handled automatically.

Verify: run journalctl -u ns8-backup-monitor -f and trigger a test alert with amtool alert add alertname=backup_failed id=1. With logging.level set to DEBUG, the log should show a Received alert: line followed by the pipeline start.

No email received

  1. Check the service is running: systemctl status ns8-backup-monitor.
  2. Verify Alertmanager is routing to the correct receiver: amtool config routes test alertname=backup_failed.
  3. Check mail_to in config.yml is set and ns8-sendmail works: echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test".
  4. Increase log level to DEBUG in config.yml and restart the service.

REPO_FAILURE with no modules found

Cause: the correlator ran before backup modules finished writing their status to Redis.

Fix: increase correlator.wait_seconds in config.yml. A value of 60 to 120 seconds is safe for most clusters. Restart the service after changing the value.


Uninstallation

bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall

License

MIT — see LICENSE.