diff --git a/README.md b/README.md index c5a7a6d..167ad58 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,60 @@ # ns8-backup-monitor -A lightweight webhook receiver for **NethServer 8** that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via `restic`, and delivers a detailed email notification through the NS8 configured mail relay. - -Unlike solutions that hook into `run-backup` (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel — the same source used by the NS8 monitoring stack — and therefore captures **both manual and scheduled automatic backups**. +> **NethServer 8 backup failure notification service.** +> +> Receives Alertmanager webhook alerts, correlates per-module backup status +> from the cluster Redis, optionally probes restic repositories, and sends a +> detailed HTML/text email through the NS8 mail relay. --- -## Architecture overview +## Table of contents -``` -Alertmanager - │ POST /alert (NsBackupFailed | NsBackupMissing) - ▼ -[receiver.py] HTTP webhook listener (localhost:9099) - │ waits N seconds for modules to settle - ▼ -[correlator.py] Reads Redis cluster state, classifies outcome - │ SUCCESS | PARTIAL | REPO_FAILURE - ▼ -[repo_check.py] (only on non-SUCCESS) Probes restic repos via runagent - ▼ -[notifier.py] Builds HTML/text email, sends via ns8-sendmail -``` +1. [Architecture](#architecture) +2. [File layout](#file-layout) +3. [Runtime paths](#runtime-paths) +4. [Requirements](#requirements) +5. [Installation](#installation) +6. [Configuration](#configuration) +7. [Alertmanager integration](#alertmanager-integration) +8. [Outcome classification](#outcome-classification) +9. [Redis key structure](#redis-key-structure) +10. [Service management](#service-management) +11. [Troubleshooting](#troubleshooting) +12. [Uninstallation](#uninstallation) +13. 
[License](#license) --- -## Requirements +## Architecture -| Dependency | Notes | -|---|---| -| NS8 leader or worker node | Must have access to the cluster Redis socket | -| `redis-cli` | Included in standard NS8 installations | -| `runagent` | NS8 binary used to invoke `restic` inside module containers | -| `ns8-sendmail` | NS8 mail relay script (invoked via `runagent`) | -| Python 3.8+ | Standard library only — no pip dependencies | -| Alertmanager | Must be configured to send webhooks to this service | +``` +Alertmanager ──POST /alert──► receiver.py + │ + (wait N seconds for all modules + to finish writing their status) + │ + ▼ + correlator.py + (reads Redis KEYS/HGETALL, + classifies outcome: + SUCCESS / PARTIAL / REPO_FAILURE) + │ + ▼ + repo_check.py ← optional + (runagent → restic snapshots + on each module's repository) + │ + ▼ + notifier.py + (builds HTML + plain-text email, + dispatches via ns8-sendmail) +``` + +**Key design decision:** the service is a long-running HTTP server managed by +systemd, not a one-shot script. This means it is always ready to receive an +alert regardless of whether the backup was triggered manually or by a scheduled +timer. 
--- @@ -43,193 +63,236 @@ Alertmanager ``` ns8-backup-monitor/ │ -├── README.md ← This file +├── README.md ← this file │ ├── config/ -│ └── config.yml.example ← Annotated configuration template +│ └── config.yml.example ← annotated configuration template +│ (copy to /etc/ns8-backup-monitor/config.yml) │ ├── deploy/ -│ ├── install.sh ← Interactive installer / uninstaller +│ ├── install.sh ← interactive installer / uninstaller │ └── ns8-backup-monitor.service ← systemd unit file │ -└── ns8_backup_monitor/ ← Python package (main application) - ├── __init__.py ← Package marker, exposes version - ├── __main__.py ← CLI entry point (`python3 -m ns8_backup_monitor`) - ├── receiver.py ← HTTP webhook server (Alertmanager → pipeline) - ├── correlator.py ← Redis reader and outcome classifier - ├── repo_check.py ← restic repository health prober - ├── notifier.py ← Email builder and sender - └── utils.py ← Config loader and logging setup +└── ns8_backup_monitor/ ← Python package + ├── __init__.py ← package metadata, version string + ├── __main__.py ← entry point: arg parsing, logging init, + │ hands off to receiver.run_server() + ├── receiver.py ← HTTP webhook server (POST /alert) + ├── correlator.py ← reads Redis, classifies backup outcome + ├── repo_check.py ← probes restic repositories via runagent + ├── notifier.py ← builds and sends email notifications + └── utils.py ← load_config(), setup_logging() ``` -### Runtime paths (after installation) +--- -| Path | Purpose | -|---|---| -| `/opt/ns8-backup-monitor/` | Application root (Python package) | -| `/etc/ns8-backup-monitor/config.yml` | Active configuration file | -| `/etc/systemd/system/ns8-backup-monitor.service` | systemd unit | -| `/var/log/ns8-backup-monitor/` | Log directory (if file logging is enabled) | -| `/var/lib/nethserver/cluster/state/redis.sock` | NS8 cluster Redis socket (default) | +## Runtime paths + +The following paths are created by `deploy/install.sh` and assumed by the +default configuration. 
+ +| Purpose | Path | +|---------|------| +| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` | +| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` | +| Configuration | `/etc/ns8-backup-monitor/config.yml` | +| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` | +| Log file | `/var/log/ns8-backup-monitor.log` | +| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` | + +--- + +## Requirements + +| Dependency | Provided by | Notes | +|------------|------------|-------| +| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ | +| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency | +| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed | +| `runagent` | NethServer 8 | Required for `repo_check` only | +| `ns8-sendmail` | NethServer 8 | Required for email delivery | +| `systemd` | OS | Service management | + +> **This service must run on an NS8 leader node** (or any node that has +> read access to the cluster Redis socket and `runagent` in `PATH`). --- ## Installation -### Quick install (interactive) +### One-liner (recommended) ```bash bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh) ``` -> **Note:** Use `bash <(curl ...)` rather than `curl ... | bash`. -> The interactive installer reads answers from your terminal via `read`; piping stdin -> from curl breaks that interaction. +The installer will: +1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`). +2. Download and extract the latest source archive from the Gitea repository. +3. Prompt interactively for sender address, recipient list, and subject prefix. +4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values. +5. Install and start the systemd service. 
-### Non-interactive install (CI / automation) +### Manual installation ```bash -curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \ - | bash -s -- \ - --from "backup@example.com" \ - --to "admin@example.com" +git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git +cd ns8-backup-monitor + +# Install Python dependency +pip3 install pyyaml + +# Create directories +mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor + +# Copy source and config template +cp -r . /opt/ns8-backup-monitor/ +cp config/config.yml.example /etc/ns8-backup-monitor/config.yml +# Edit the config before starting +nano /etc/ns8-backup-monitor/config.yml + +# Install systemd unit +cp deploy/ns8-backup-monitor.service /etc/systemd/system/ +systemctl daemon-reload +systemctl enable --now ns8-backup-monitor ``` -### What the installer does - -1. Copies the Python package to `/opt/ns8-backup-monitor/` -2. Writes `/etc/ns8-backup-monitor/config.yml` from the template -3. Installs and enables the systemd unit -4. Prints the Alertmanager webhook receiver URL - ---- - -## Uninstallation - -```bash -bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall -``` - -The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory. - --- ## Configuration -The active configuration file is `/etc/ns8-backup-monitor/config.yml`. -Edit it directly and restart the service to apply changes. - -```bash -nano /etc/ns8-backup-monitor/config.yml -systemctl restart ns8-backup-monitor -``` - -See `config/config.yml.example` for a fully annotated reference with all available options. - -### Key sections +The configuration file is a YAML document. The installer writes it to +`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available +at `config/config.yml.example`. 
```yaml -# ── Mail settings ───────────────────────────────────────────── +# --------------------------------------------------------------------------- +# Email notification settings +# --------------------------------------------------------------------------- +# Delivery is handled by ns8-sendmail, which uses the SMTP relay already +# configured in NethServer 8. No SMTP credentials are needed here. mail: - from: "backup@ns02.example.com" # Envelope From address + # Envelope / header sender address. + from: "ns8-backup-monitor@yourdomain.com" + + # One or more recipient addresses. At least one is required. to: - - "admin@example.com" # One or more recipient addresses - subject_prefix: "[NS8 Backup]" # Prepended to every subject line + - "admin@yourdomain.com" -# ── Webhook receiver ────────────────────────────────────────── + # String prepended to every email subject line. + subject_prefix: "[NS8 Backup]" + +# --------------------------------------------------------------------------- +# Webhook receiver (HTTP server) +# --------------------------------------------------------------------------- receiver: - host: "127.0.0.1" # Bind address (keep localhost unless Alertmanager is remote) - port: 9099 # Must match the Alertmanager webhook URL + # Interface to listen on. 127.0.0.1 is recommended when Alertmanager + # runs on the same host; use 0.0.0.0 only if it runs on a different node. + host: "127.0.0.1" + # TCP port. Must match the webhook URL configured in Alertmanager. 
+ port: 9099 -# ── Correlator behaviour ───────────────────────────────────── +# --------------------------------------------------------------------------- +# Timing +# --------------------------------------------------------------------------- correlator: - wait_seconds: 30 # Seconds to wait after alert before reading Redis - # (allows slow modules to write their final status) - recent_window: 3600 # When no backup_id label is present, scan Redis for - # plan status keys updated within this many seconds + # Seconds to wait after receiving the alert before reading Redis. + # This grace period allows all module agents to finish writing their + # per-module status hashes. 30 s is sufficient for most deployments. + wait_seconds: 30 -# ── Redis connection ───────────────────────────────────────── + # Look-back window in seconds used when the alert does not include a + # backup_id label. Any plan whose Redis status was updated within this + # window is considered "recent" and included in the report. + recent_window: 3600 + +# --------------------------------------------------------------------------- +# Redis connection +# --------------------------------------------------------------------------- redis: + # Path to the NS8 cluster Redis Unix socket. + # On a standard NS8 installation this path never changes. socket: "/var/lib/nethserver/cluster/state/redis.sock" -# ── Repository health check ────────────────────────────────── +# --------------------------------------------------------------------------- +# Repository check (optional, uses runagent + restic) +# --------------------------------------------------------------------------- repo_check: - enabled: true - timeout: 60 # Seconds per restic check call + # Maximum seconds to wait for each repository check before giving up. + timeout: 60 + # Extra flags passed verbatim to every restic invocation. 
+ # Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt" + restic_flags: "" -# ── Logging ────────────────────────────────────────────────── +# --------------------------------------------------------------------------- +# Logging +# --------------------------------------------------------------------------- logging: - level: "INFO" # DEBUG | INFO | WARNING | ERROR - file: "" # Leave empty to log to stdout (journald captures it) + # Python log level: DEBUG, INFO, WARNING, ERROR. + level: INFO + # Absolute path for the rotating log file (5 MB × 3 backups). + # Leave empty to log to stdout / journald only. + file: "/var/log/ns8-backup-monitor.log" ``` --- ## Alertmanager integration -Add a receiver to your Alertmanager configuration on the NS8 leader node -(`/etc/alertmanager/alertmanager.yml` or via the NS8 `metrics1` module): +Add a receiver pointing to the service in your Alertmanager configuration: ```yaml +# alertmanager.yml (relevant excerpt) +route: + receiver: ns8-backup-monitor + # Only route backup-related alerts to this receiver. + routes: + - match: + alertname: NethServerBackupFailed + receiver: ns8-backup-monitor + receivers: - name: ns8-backup-monitor webhook_configs: - url: "http://127.0.0.1:9099/alert" - send_resolved: false - -route: - receiver: ns8-backup-monitor - group_by: [alertname] - group_wait: 10s - group_interval: 5m - repeat_interval: 12h - routes: - - match_re: - alertname: "NsBackupFailed|NsBackupMissing" - receiver: ns8-backup-monitor + # Send resolved alerts too so the service can log them. 
+ send_resolved: true ``` -The service handles two alert names: +Reload Alertmanager after editing: -| Alert name | Meaning | -|---|---| -| `NsBackupFailed` | One or more backup modules reported an error | -| `NsBackupMissing` | Expected backup did not run within the time window | +```bash +systemctl reload alertmanager +# or, for the NS8 metrics module: +runagent -m metrics1 systemctl reload alertmanager +``` --- ## Outcome classification -After reading per-module Redis keys, the correlator assigns one of three outcomes: +For each backup plan the correlator reads all per-module status hashes and +produces one of three outcomes: | Outcome | Condition | Email subject | -|---|---|---| -| `SUCCESS` | All modules succeeded | ✅ Backup completed successfully | -| `PARTIAL` | Some modules failed, some succeeded | ⚠️ Backup partially failed | -| `REPO_FAILURE` | All modules failed, or no status found in Redis | ❌ Backup failed – possible repository error | - -On `PARTIAL` or `REPO_FAILURE`, the repo health check runs automatically and appends -diagnostic information (restic error output) to the email. +|---------|-----------|---------------| +| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` | +| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` | +| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` | --- ## Redis key structure -The correlator reads the following NS8 Redis key patterns: +The correlator reads two families of keys from the NS8 cluster Redis: -``` -cluster/backup//status → overall plan status (hash) -module//backup//status → per-module status (hash) -``` +| Key pattern | Description | +|-------------|-------------| +| `cluster/backup//status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). | +| `module//backup//status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). 
| -Hash fields: - -| Field | Values | Description | -|---|---|---| -| `result` | `success` / `error` | Outcome of the backup operation | -| `timestamp` | ISO 8601 | When the status was last written | -| `error` | string | Error message, if any | -| `errors` | integer | Number of module errors (plan-level hash only) | +`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601 +string in UTC (e.g. `2024-01-15T03:00:05Z`). --- @@ -239,70 +302,61 @@ Hash fields: # Check service status systemctl status ns8-backup-monitor -# View live logs +# Follow live logs via journald journalctl -u ns8-backup-monitor -f -# Restart after config change +# Follow the rotating log file directly +tail -f /var/log/ns8-backup-monitor.log + +# Restart after a config change systemctl restart ns8-backup-monitor -# Disable on boot -systemctl disable ns8-backup-monitor +# Test the webhook endpoint manually +curl -s -X POST http://127.0.0.1:9099/alert \ + -H 'Content-Type: application/json' \ + -d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}' ``` --- ## Troubleshooting -### Service fails to start +### Service starts but no email is received -```bash -journalctl -u ns8-backup-monitor --no-pager -n 50 -``` - -Common causes: -- `config.yml` not found at the expected path → check `/etc/ns8-backup-monitor/config.yml` -- Port 9099 already in use → change `receiver.port` in config - -### No email received after a backup failure - -1. Verify Alertmanager is firing the webhook: +1. Verify `ns8-sendmail` works independently: ```bash - journalctl -u ns8-backup-monitor -f + echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com ``` - You should see `Received N relevant alert(s)` within a minute of the backup failure. +2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`. +3. Increase log level to `DEBUG` and restart the service. -2. Check that `wait_seconds` has elapsed (default 30 s) and look for `Sending notification...` in the log. 
+### `REPO_FAILURE` on every alert even though backups succeed -3. Verify the mail relay works independently: - ```bash - echo "Test" | runagent ns8-sendmail -s "test" admin@example.com - ``` +- The correlator may be reading Redis before all modules have finished. + Increase `correlator.wait_seconds` (e.g. to `60`). +- Check that the Redis socket path is correct: + `redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING` -### Correlator finds no modules +### Alertmanager does not reach the webhook -If the log shows `No recent backup status keys found in Redis`, possible causes: -- `recent_window` is too short — the backup ran more than 1 hour ago -- Redis socket path is wrong for your installation -- The backup plan wrote status to a non-standard key pattern +- Confirm the service is listening: + `ss -tlnp | grep 9099` +- If Alertmanager runs on a different host, change `receiver.host` to + `0.0.0.0` and open the port in the firewall. --- -## Development - -The application is pure Python 3 with no third-party dependencies. +## Uninstallation ```bash -# Run locally (requires NS8 Redis socket access) -python3 -m ns8_backup_monitor --config ./config/config.yml.example - -# Send a test webhook payload -curl -s -X POST http://127.0.0.1:9099/alert \ - -H "Content-Type: application/json" \ - -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}' +bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall ``` +The script will stop and disable the service, remove the install directory, +and optionally remove the configuration directory. + --- ## License -MIT License — contributions welcome via pull request. +MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner.
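
Note on the "Outcome classification" table added by this diff: the three rules can be sketched in a few lines of Python. This is a minimal illustration, not the code in `correlator.py`. The function name `classify`, the `Outcome` enum, and the module ids in the example are hypothetical. It also assumes that any `result` value other than `"success"` (including a missing field) counts as a failure, which the README does not state explicitly.

```python
from enum import Enum
from typing import Mapping


class Outcome(Enum):
    SUCCESS = "SUCCESS"
    PARTIAL = "PARTIAL"
    REPO_FAILURE = "REPO_FAILURE"


def classify(module_status: Mapping[str, Mapping[str, str]]) -> Outcome:
    """Classify one backup plan from its per-module status hashes.

    module_status maps a module id to the fields of its status hash;
    only the "result" field is inspected here.
    """
    results = [fields.get("result") for fields in module_status.values()]
    succeeded = sum(1 for r in results if r == "success")

    if not results or succeeded == 0:
        # No status found in Redis, or every module failed.
        return Outcome.REPO_FAILURE
    if succeeded == len(results):
        # Every module reported result=success.
        return Outcome.SUCCESS
    # At least one module succeeded and at least one failed.
    return Outcome.PARTIAL


if __name__ == "__main__":
    # Hypothetical module ids, for illustration only.
    sample = {
        "mail1": {"result": "success"},
        "nextcloud1": {"result": "error", "error": "repository locked"},
    }
    print(classify(sample).value)
```

Treating an empty status set as `REPO_FAILURE` matches the table's "no status found in Redis" condition, and is the reason the troubleshooting entry suggests raising `correlator.wait_seconds` when that outcome appears for backups that actually succeeded.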