diff --git a/README.md b/README.md
index 5089723..c5a7a6d 100644
--- a/README.md
+++ b/README.md
@@ -1,143 +1,308 @@
 # ns8-backup-monitor
-Sistema di monitoraggio dei backup per **NethServer 8** basato su tre livelli:
+A lightweight webhook receiver for **NethServer 8** that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via `restic`, and delivers a detailed email notification through the NS8 configured mail relay.
-1. **Trigger**: riceve l'alert da Prometheus/Alertmanager (`NsBackupFailed`, `NsBackupMissing`)
-2. **Correlazione**: interroga lo stato del piano e dei singoli moduli via Redis cluster
-3. **Classificazione**: distingue tra successo totale, fallimento parziale o fallimento globale di repository
+Unlike solutions that hook into `run-backup` (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel — the same source used by the NS8 monitoring stack — and therefore captures **both manual and scheduled automatic backups**.
-## Architettura +--- + +## Architecture overview ``` -Alertmanager --webhook--> receiver.py - | - +--------------v--------------+ - | correlator.py | <- stato piano + per-modulo via Redis HGETALL - +--------------+--------------+ - | - +--------------v--------------+ - | repo_check.py | <- verifica repository destinazione (restic) - +--------------+--------------+ - | - +--------------v--------------+ - | notifier.py | <- email unica con esito classificato (HTML+text) - +-----------------------------+ +Alertmanager + │ POST /alert (NsBackupFailed | NsBackupMissing) + ▼ +[receiver.py] HTTP webhook listener (localhost:9099) + │ waits N seconds for modules to settle + ▼ +[correlator.py] Reads Redis cluster state, classifies outcome + │ SUCCESS | PARTIAL | REPO_FAILURE + ▼ +[repo_check.py] (only on non-SUCCESS) Probes restic repos via runagent + ▼ +[notifier.py] Builds HTML/text email, sends via ns8-sendmail ``` -## Logica di classificazione +--- -| Esito | Condizione | +## Requirements + +| Dependency | Notes | |---|---| -| `SUCCESS` | Tutti i moduli del piano completati, nessun errore repo | -| `PARTIAL` | Almeno un modulo fallito, repository raggiungibile | -| `REPO_FAILURE` | Nessuno stato trovato in Redis, o errori di connessione/scrittura sulla destinazione | +| NS8 leader or worker node | Must have access to the cluster Redis socket | +| `redis-cli` | Included in standard NS8 installations | +| `runagent` | NS8 binary used to invoke `restic` inside module containers | +| `ns8-sendmail` | NS8 mail relay script (invoked via `runagent`) | +| Python 3.8+ | Standard library only — no pip dependencies | +| Alertmanager | Must be configured to send webhooks to this service | -## Requisiti +--- -- NethServer 8 (leader node) -- Python 3.9+ -- `redis-cli` installato (pacchetto `redis` su Rocky Linux) -- `restic` installato e nel PATH (per `repo_check.py`) -- Accesso Redis locale del cluster NS8 via socket Unix -- `metrics1` configurato con Alertmanager webhook abilitato 
verso `http://localhost:9099/alert` - -## Struttura file +## File layout ``` ns8-backup-monitor/ -├── README.md -├── install.sh -├── ns8_backup_monitor/ -│ ├── __init__.py -│ ├── __main__.py # entry point: python3 -m ns8_backup_monitor -│ ├── receiver.py # HTTP webhook receiver (porta 9099) -│ ├── correlator.py # correlazione stato backup cluster -│ ├── repo_check.py # verifica repository destinazione -│ ├── notifier.py # invio email con esito classificato -│ └── utils.py # config loading + logging setup +│ +├── README.md ← This file +│ +├── config/ +│ └── config.yml.example ← Annotated configuration template +│ ├── deploy/ -│ └── ns8-backup-monitor.service # systemd unit -└── config/ - └── config.yml.example +│ ├── install.sh ← Interactive installer / uninstaller +│ └── ns8-backup-monitor.service ← systemd unit file +│ +└── ns8_backup_monitor/ ← Python package (main application) + ├── __init__.py ← Package marker, exposes version + ├── __main__.py ← CLI entry point (`python3 -m ns8_backup_monitor`) + ├── receiver.py ← HTTP webhook server (Alertmanager → pipeline) + ├── correlator.py ← Redis reader and outcome classifier + ├── repo_check.py ← restic repository health prober + ├── notifier.py ← Email builder and sender + └── utils.py ← Config loader and logging setup ``` -## Installazione +### Runtime paths (after installation) + +| Path | Purpose | +|---|---| +| `/opt/ns8-backup-monitor/` | Application root (Python package) | +| `/etc/ns8-backup-monitor/config.yml` | Active configuration file | +| `/etc/systemd/system/ns8-backup-monitor.service` | systemd unit | +| `/var/log/ns8-backup-monitor/` | Log directory (if file logging is enabled) | +| `/var/lib/nethserver/cluster/state/redis.sock` | NS8 cluster Redis socket (default) | + +--- + +## Installation + +### Quick install (interactive) ```bash -# 1. Clona la repo -cd /opt -git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git -cd ns8-backup-monitor - -# 2. 
Installa dipendenze Python -pip3 install pyyaml - -# 3. Crea configurazione -mkdir -p /etc/ns8-backup-monitor -cp config/config.yml.example /etc/ns8-backup-monitor/config.yml -# Edita /etc/ns8-backup-monitor/config.yml con smtp, mail.to, ecc. - -# 4. Installa e avvia il servizio -cp deploy/ns8-backup-monitor.service /etc/systemd/system/ -systemctl daemon-reload -systemctl enable --now ns8-backup-monitor - -# 5. Verifica -systemctl status ns8-backup-monitor -journalctl -u ns8-backup-monitor -f +bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh) ``` -## Configurazione Alertmanager +> **Note:** Use `bash <(curl ...)` rather than `curl ... | bash`. +> The interactive installer reads answers from your terminal via `read`; piping stdin +> from curl breaks that interaction. -Aggiungere in `alertmanager.yml` il receiver: +### Non-interactive install (CI / automation) + +```bash +curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \ + | bash -s -- \ + --from "backup@example.com" \ + --to "admin@example.com" +``` + +### What the installer does + +1. Copies the Python package to `/opt/ns8-backup-monitor/` +2. Writes `/etc/ns8-backup-monitor/config.yml` from the template +3. Installs and enables the systemd unit +4. Prints the Alertmanager webhook receiver URL + +--- + +## Uninstallation + +```bash +bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall +``` + +The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory. + +--- + +## Configuration + +The active configuration file is `/etc/ns8-backup-monitor/config.yml`. +Edit it directly and restart the service to apply changes. + +```bash +nano /etc/ns8-backup-monitor/config.yml +systemctl restart ns8-backup-monitor +``` + +See `config/config.yml.example` for a fully annotated reference with all available options. 
+ +### Key sections + +```yaml +# ── Mail settings ───────────────────────────────────────────── +mail: + from: "backup@ns02.example.com" # Envelope From address + to: + - "admin@example.com" # One or more recipient addresses + subject_prefix: "[NS8 Backup]" # Prepended to every subject line + +# ── Webhook receiver ────────────────────────────────────────── +receiver: + host: "127.0.0.1" # Bind address (keep localhost unless Alertmanager is remote) + port: 9099 # Must match the Alertmanager webhook URL + +# ── Correlator behaviour ───────────────────────────────────── +correlator: + wait_seconds: 30 # Seconds to wait after alert before reading Redis + # (allows slow modules to write their final status) + recent_window: 3600 # When no backup_id label is present, scan Redis for + # plan status keys updated within this many seconds + +# ── Redis connection ───────────────────────────────────────── +redis: + socket: "/var/lib/nethserver/cluster/state/redis.sock" + +# ── Repository health check ────────────────────────────────── +repo_check: + enabled: true + timeout: 60 # Seconds per restic check call + +# ── Logging ────────────────────────────────────────────────── +logging: + level: "INFO" # DEBUG | INFO | WARNING | ERROR + file: "" # Leave empty to log to stdout (journald captures it) +``` + +--- + +## Alertmanager integration + +Add a receiver to your Alertmanager configuration on the NS8 leader node +(`/etc/alertmanager/alertmanager.yml` or via the NS8 `metrics1` module): ```yaml receivers: - name: ns8-backup-monitor webhook_configs: - - url: 'http://localhost:9099/alert' - send_resolved: true + - url: "http://127.0.0.1:9099/alert" + send_resolved: false route: receiver: ns8-backup-monitor - matchers: - - alertname =~ "NsBackupFailed|NsBackupMissing" + group_by: [alertname] + group_wait: 10s + group_interval: 5m + repeat_interval: 12h + routes: + - match_re: + alertname: "NsBackupFailed|NsBackupMissing" + receiver: ns8-backup-monitor ``` -Riavviare Alertmanager 
dopo la modifica:
-```bash
-systemctl restart alertmanager
-```
+The service handles two alert names:
-## Backend supportati per repo_check
+| Alert name | Meaning |
+|---|---|
+| `NsBackupFailed` | One or more backup modules reported an error |
+| `NsBackupMissing` | Expected backup did not run within the time window |
-`repo_check.py` legge le credenziali direttamente da Redis e imposta le variabili d'ambiente necessarie per `restic`:
+---
-| Backend | Campi Redis letti | Env vars impostate |
+
+## Outcome classification
+
+After reading per-module Redis keys, the correlator assigns one of three outcomes:
+
+| Outcome | Condition | Email subject |
 |---|---|---|
-| `local` / `fs` | `url` o `path` | `RESTIC_PASSWORD` |
-| `s3` / `aws` | `url`, `aws_access_key_id`, `aws_secret_access_key` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` |
-| `b2` / `backblaze` | `url`, `b2_account_id`, `b2_account_key` | `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY` |
-| `sftp` | `url` (formato `sftp:host:path`) | `RESTIC_PASSWORD` |
-| `rclone` | `url`, `rclone_config` | `RCLONE_CONFIG` |
+| `SUCCESS` | All modules succeeded | ✅ Backup completed successfully |
+| `PARTIAL` | Some modules failed, some succeeded | ⚠️ Backup partially failed |
+| `REPO_FAILURE` | All modules failed, or no status found in Redis | ❌ Backup failed – possible repository error |
-## Debug / test manuale
+On `PARTIAL` or `REPO_FAILURE`, the repo health check runs automatically and appends
+diagnostic information (restic error output) to the email.
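
The three outcome rules in the classification table can be sketched in Python. This is a minimal illustration and not the actual `correlator.py`: the function name `classify_outcome` and its input shape (a mapping from module ID to the `result` field of that module's status hash) are assumptions made for the example.

```python
from typing import Mapping


def classify_outcome(module_results: Mapping[str, str]) -> str:
    """Map per-module `result` values to SUCCESS, PARTIAL, or REPO_FAILURE.

    `module_results` is a hypothetical structure: module ID -> "success" or "error".
    """
    # No status found in Redis is treated as a possible repository failure.
    if not module_results:
        return "REPO_FAILURE"

    failed = [module for module, result in module_results.items() if result != "success"]

    if not failed:
        return "SUCCESS"
    # Every module failing points at the shared destination rather than one module.
    if len(failed) == len(module_results):
        return "REPO_FAILURE"
    return "PARTIAL"


if __name__ == "__main__":
    print(classify_outcome({"mail1": "success", "nextcloud1": "error"}))
```

Under these assumptions, an empty mapping and an all-failed mapping both yield `REPO_FAILURE`, which matches the two conditions listed for that outcome.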
+ +--- + +## Redis key structure + +The correlator reads the following NS8 Redis key patterns: + +``` +cluster/backup//status → overall plan status (hash) +module//backup//status → per-module status (hash) +``` + +Hash fields: + +| Field | Values | Description | +|---|---|---| +| `result` | `success` / `error` | Outcome of the backup operation | +| `timestamp` | ISO 8601 | When the status was last written | +| `error` | string | Error message, if any | +| `errors` | integer | Number of module errors (plan-level hash only) | + +--- + +## Service management ```bash -# Test del correlatore (senza inviare email) -python3 -c " -import json -from ns8_backup_monitor.utils import load_config -from ns8_backup_monitor.correlator import correlate_backup_status -cfg = load_config() -print(json.dumps(correlate_backup_status(cfg), indent=2)) -" +# Check service status +systemctl status ns8-backup-monitor -# Test invio webhook simulato -curl -s -X POST http://localhost:9099/alert \ - -H 'Content-Type: application/json' \ - -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed"}}]}' +# View live logs +journalctl -u ns8-backup-monitor -f -# Verifica log -journalctl -u ns8-backup-monitor --since '1 hour ago' +# Restart after config change +systemctl restart ns8-backup-monitor + +# Disable on boot +systemctl disable ns8-backup-monitor ``` + +--- + +## Troubleshooting + +### Service fails to start + +```bash +journalctl -u ns8-backup-monitor --no-pager -n 50 +``` + +Common causes: +- `config.yml` not found at the expected path → check `/etc/ns8-backup-monitor/config.yml` +- Port 9099 already in use → change `receiver.port` in config + +### No email received after a backup failure + +1. Verify Alertmanager is firing the webhook: + ```bash + journalctl -u ns8-backup-monitor -f + ``` + You should see `Received N relevant alert(s)` within a minute of the backup failure. + +2. 
Check that `wait_seconds` has elapsed (default 30 s) and look for `Sending notification...` in the log. + +3. Verify the mail relay works independently: + ```bash + echo "Test" | runagent ns8-sendmail -s "test" admin@example.com + ``` + +### Correlator finds no modules + +If the log shows `No recent backup status keys found in Redis`, possible causes: +- `recent_window` is too short — the backup ran more than 1 hour ago +- Redis socket path is wrong for your installation +- The backup plan wrote status to a non-standard key pattern + +--- + +## Development + +The application is pure Python 3 with no third-party dependencies. + +```bash +# Run locally (requires NS8 Redis socket access) +python3 -m ns8_backup_monitor --config ./config/config.yml.example + +# Send a test webhook payload +curl -s -X POST http://127.0.0.1:9099/alert \ + -H "Content-Type: application/json" \ + -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}' +``` + +--- + +## License + +MIT License — contributions welcome via pull request.
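
The test payload from the Development section can also illustrate how a receiver might select the alerts it cares about. The following is a hedged sketch only: `relevant_alerts` is a hypothetical helper, and the rule that only `firing` alerts named `NsBackupFailed` or `NsBackupMissing` count is inferred from the alert table and `send_resolved: false`, not taken from `receiver.py`.

```python
import json
from typing import Any, Dict, List

# Alert names the service is documented to handle.
HANDLED_ALERTS = {"NsBackupFailed", "NsBackupMissing"}


def relevant_alerts(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return firing alerts whose `alertname` label is one of the handled names."""
    return [
        alert
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname") in HANDLED_ALERTS
    ]


# Same JSON body as the curl example in the Development section.
body = '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}'
matched = relevant_alerts(json.loads(body))
print(f"Received {len(matched)} relevant alert(s)")
```

Filtering before the `wait_seconds` delay would keep unrelated or resolved alerts from triggering a Redis scan, though whether the real service orders the steps this way is not stated here.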