diff --git a/README.md b/README.md
index 167ad58..43ffd04 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 >
 > Receives Alertmanager webhook alerts, correlates per-module backup status
 > from the cluster Redis, optionally probes restic repositories, and sends a
-> detailed HTML/text email through the NS8 mail relay.
+> structured plain-text email through the NS8 mail relay.
 
 ---
 
@@ -41,20 +41,25 @@ Alertmanager ──POST /alert──► receiver.py
                      SUCCESS / PARTIAL / REPO_FAILURE)
                              │
                              ▼
-                     repo_check.py  ← optional
-                     (runagent → restic snapshots
-                      on each module's repository)
+                     repo_check.py  ← skipped on SUCCESS
+                     (restic snapshots --last --no-cache
+                      on each configured repository)
                              │
                              ▼
                      notifier.py
-                     (builds HTML + plain-text email,
+                     (builds plain-text email,
                       dispatches via ns8-sendmail)
 ```
 
 **Key design decision:** the service is a long-running HTTP server managed by
-systemd, not a one-shot script. This means it is always ready to receive an
-alert regardless of whether the backup was triggered manually or by a scheduled
-timer.
+systemd, not a one-shot script. It is always ready to receive alerts whether
+the backup is triggered manually from the UI or by an automatic scheduled timer.
+
+> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
+> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
+> All four names are matched so the pipeline fires on both native and custom rules.
+> The plan identifier is extracted from the `id` label (native) or `backup_id`
+> label (custom) — both are checked.

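The matching described in the note can be sketched in Python roughly as follows. This is a minimal illustration and not the code in `receiver.py`: `is_backup_alert` and `plan_id_from_alert` are hypothetical helper names, and the precedence of `id` over `backup_id` when both labels are set is an assumption. Only the four alert names and the two label names come from the note.

```python
from typing import Optional

# The four alert names listed in the note: NS8 native plus custom/legacy.
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def is_backup_alert(alert: dict) -> bool:
    """Return True when the alert's `alertname` label is one of the four matched names."""
    return alert.get("labels", {}).get("alertname") in ALERT_NAMES


def plan_id_from_alert(alert: dict) -> Optional[str]:
    """Return the plan identifier from `id` (native) or `backup_id` (custom).

    Assumption: `id` wins when both labels are set.
    Returns None when neither label is present or both are empty.
    """
    labels = alert.get("labels", {})
    return labels.get("id") or labels.get("backup_id") or None
```

An alert for which `plan_id_from_alert` returns `None` would be the case where a fallback lookup is needed.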
 ---
 
@@ -73,48 +78,48 @@ ns8-backup-monitor/
 │   ├── install.sh                  ← interactive installer / uninstaller
 │   └── ns8-backup-monitor.service  ← systemd unit file
 │
-└── ns8_backup_monitor/             ← Python package
-    ├── __init__.py                 ← package metadata, version string
-    ├── __main__.py                 ← entry point: arg parsing, logging init,
-    │                                 hands off to receiver.run_server()
+└── ns8_backup_monitor/             ← Python package (the service code)
+    ├── __init__.py                 ← package metadata and version string
+    ├── __main__.py                 ← entry point: argument parsing, logging
+    │                                 initialisation, calls receiver.run_server()
     ├── receiver.py                 ← HTTP webhook server (POST /alert)
-    ├── correlator.py               ← reads Redis, classifies backup outcome
-    ├── repo_check.py               ← probes restic repositories via runagent
-    ├── notifier.py                 ← builds and sends email notifications
-    └── utils.py                    ← load_config(), setup_logging()
+    │                                 matches alert names, spawns pipeline thread
+    ├── correlator.py               ← reads NS8 Redis, classifies backup outcome
+    ├── repo_check.py               ← probes restic repositories for health status
+    ├── notifier.py                 ← builds and sends the status email
+    └── utils.py                    ← load_config() and setup_logging() helpers
 ```
 
 ---
 
 ## Runtime paths
 
-The following paths are created by `deploy/install.sh` and assumed by the
-default configuration.
+Paths created by `deploy/install.sh` and assumed by the default configuration.

-| Purpose | Path |
-|---------|------|
+| Purpose | Default path |
+|---------|-------------|
 | Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
 | Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
-| Configuration | `/etc/ns8-backup-monitor/config.yml` |
+| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
 | systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
 | Log file | `/var/log/ns8-backup-monitor.log` |
-| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
+| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
 
 ---
 
 ## Requirements
 
 | Dependency | Provided by | Notes |
-|------------|------------|-------|
+|------------|-------------|-------|
 | `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
 | `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
-| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed |
-| `runagent` | NethServer 8 | Required for `repo_check` only |
+| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
+| `restic` | NethServer 8 / manual | Required for `repo_check` only |
 | `ns8-sendmail` | NethServer 8 | Required for email delivery |
 | `systemd` | OS | Service management |
 
-> **This service must run on an NS8 leader node** (or any node that has
-> read access to the cluster Redis socket and `runagent` in `PATH`).
+> **This service must run on an NS8 leader node** (or any node with read access
+> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
 
 ---
 
@@ -128,7 +133,7 @@ bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/
 The installer will:
 
 1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
-2. Download and extract the latest source archive from the Gitea repository.
+2. Download and extract the latest source from the Gitea repository.
 3. Prompt interactively for sender address, recipient list, and subject prefix.
 4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
 5. Install and start the systemd service.
@@ -139,19 +144,13 @@ The installer will:
 git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
 cd ns8-backup-monitor
 
-# Install Python dependency
 pip3 install pyyaml
 
-# Create directories
 mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
-
-# Copy source and config template
 cp -r . /opt/ns8-backup-monitor/
 cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
 
-# Edit the config before starting
-nano /etc/ns8-backup-monitor/config.yml
+nano /etc/ns8-backup-monitor/config.yml # edit before starting
 
-# Install systemd unit
 cp deploy/ns8-backup-monitor.service /etc/systemd/system/
 systemctl daemon-reload
 systemctl enable --now ns8-backup-monitor
@@ -161,188 +160,183 @@ systemctl enable --now ns8-backup-monitor
 
 ## Configuration
 
-The configuration file is a YAML document. The installer writes it to
-`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
-at `config/config.yml.example`.
+Full reference is in `config/config.yml.example`. Key parameters:
 
 ```yaml
-# ---------------------------------------------------------------------------
-# Email notification settings
-# ---------------------------------------------------------------------------
-# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
-# configured in NethServer 8. No SMTP credentials are needed here.
-mail:
-  # Envelope / header sender address.
-  from: "ns8-backup-monitor@yourdomain.com"
-
-  # One or more recipient addresses. At least one is required.
-  to:
-    - "admin@yourdomain.com"
-
-  # String prepended to every email subject line.
-  subject_prefix: "[NS8 Backup]"
-
-# ---------------------------------------------------------------------------
-# Webhook receiver (HTTP server)
-# ---------------------------------------------------------------------------
 receiver:
-  # Interface to listen on. 127.0.0.1 is recommended when Alertmanager
-  # runs on the same host; use 0.0.0.0 only if it runs on a different node.
-  host: "127.0.0.1"
-  # TCP port. Must match the webhook URL configured in Alertmanager.
-  port: 9099
+  host: 127.0.0.1 # bind address (use 0.0.0.0 only for remote Alertmanager)
+  port: 9099 # webhook listening port
 
-# ---------------------------------------------------------------------------
-# Timing
-# ---------------------------------------------------------------------------
 correlator:
-  # Seconds to wait after receiving the alert before reading Redis.
-  # This grace period allows all module agents to finish writing their
-  # per-module status hashes. 30 s is sufficient for most deployments.
-  wait_seconds: 30
+  wait_seconds: 30 # wait after alert before reading Redis (allow slow modules to finish)
+  recent_window: 3600 # fallback scan window (seconds) when no backup_id label is present
 
-  # Look-back window in seconds used when the alert does not include a
-  # backup_id label. Any plan whose Redis status was updated within this
-  # window is considered "recent" and included in the report.
-  recent_window: 3600
-
-# ---------------------------------------------------------------------------
-# Redis connection
-# ---------------------------------------------------------------------------
-redis:
-  # Path to the NS8 cluster Redis Unix socket.
-  # On a standard NS8 installation this path never changes.
-  socket: "/var/lib/nethserver/cluster/state/redis.sock"
-
-# ---------------------------------------------------------------------------
-# Repository check (optional, uses runagent + restic)
-# ---------------------------------------------------------------------------
 repo_check:
-  # Maximum seconds to wait for each repository check before giving up.
-  timeout: 60
-  # Extra flags passed verbatim to every restic invocation.
-  # Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
-  restic_flags: ""
+  enabled: true # set false to skip restic health checks entirely
+  timeout: 60 # per-repository restic timeout (seconds)
+
+notification:
+  mail_from: ns8-backup-monitor@example.com
+  mail_to:
+    - admin@example.com
+    - ops@example.com
+
+redis:
+  socket: /var/lib/nethserver/cluster/state/redis.sock
 
-# ---------------------------------------------------------------------------
-# Logging
-# ---------------------------------------------------------------------------
 logging:
-  # Python log level: DEBUG, INFO, WARNING, ERROR.
-  level: INFO
-  # Absolute path for the rotating log file (5 MB × 3 backups).
-  # Leave empty to log to stdout / journald only.
-  file: "/var/log/ns8-backup-monitor.log"
+  level: INFO # DEBUG for verbose output during troubleshooting
+  file: /var/log/ns8-backup-monitor.log
 ```
 
 ---
 
 ## Alertmanager integration
 
-Add a receiver pointing to the service in your Alertmanager configuration:
+Add a receiver and route to your Alertmanager configuration:
 
 ```yaml
-# alertmanager.yml (relevant excerpt)
-route:
-  receiver: ns8-backup-monitor
-  # Only route backup-related alerts to this receiver.
-  routes:
-    - match:
-        alertname: NethServerBackupFailed
-      receiver: ns8-backup-monitor
+# alertmanager.yml
 
 receivers:
   - name: ns8-backup-monitor
     webhook_configs:
-      - url: "http://127.0.0.1:9099/alert"
-        # Send resolved alerts too so the service can log them.
-        send_resolved: true
+      - url: http://127.0.0.1:9099/alert
+        send_resolved: false # resolved alerts are intentionally ignored
+
+route:
+  routes:
+    - matchers:
+        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
+      receiver: ns8-backup-monitor
+      group_wait: 10s
+      group_interval: 5m
+      repeat_interval: 12h
 ```
 
-Reload Alertmanager after editing:
+### Supported alert names
 
-```bash
-systemctl reload alertmanager
-# or, for the NS8 metrics module:
-runagent -m metrics1 systemctl reload alertmanager
-```
+| Alert name | Rule set | Trigger |
+|------------|----------|---------|
+| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
+| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
+| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
+| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
+
+### Label mapping
+
+| Label | Used by | Contains |
+|-------|---------|----------|
+| `id` | NS8 native alerts | Backup plan identifier |
+| `backup_id` | Custom / legacy alerts | Backup plan identifier |
+
+Both labels are checked. When neither is present, the correlator falls back to
+scanning Redis for all plan status keys updated within `correlator.recent_window`.

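The fallback can be pictured as a timestamp filter. The sketch below is an illustration under stated assumptions, not the correlator's code: `recent_plan_ids` and its input mapping (plan id to status timestamp) are hypothetical, and it assumes each plan status carries a UTC ISO 8601 timestamp of the form `2024-01-15T03:00:05Z`. Reading the keys from Redis is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def recent_plan_ids(timestamps: Dict[str, str], recent_window: int,
                    now: Optional[datetime] = None) -> List[str]:
    """Return the plan ids whose status timestamp lies within `recent_window` seconds of `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=recent_window)
    recent = []
    for plan_id, raw in timestamps.items():
        # strptime is used because datetime.fromisoformat() rejects a trailing "Z"
        # on Python versions before 3.11.
        updated = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if cutoff <= updated <= now:
            recent.append(plan_id)
    return sorted(recent)
```

Passing `now` explicitly keeps the function deterministic, which makes the window logic easy to check without touching the system clock.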
 ---
 
 ## Outcome classification
 
-For each backup plan the correlator reads all per-module status hashes and
-produces one of three outcomes:
-
 | Outcome | Condition | Email subject |
 |---------|-----------|---------------|
-| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` |
-| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` |
-| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` |
+| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
+| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
+| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - ` |
+
+`REPO_FAILURE` covers both the case where all modules failed and the case where
+no status was found in Redis at all (possible repository-level or scheduling
+issue). The repository health check (`repo_check.py`) runs automatically for
+`PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
 
 ---
 
 ## Redis key structure
 
-The correlator reads two families of keys from the NS8 cluster Redis:
+Keys read by `correlator.py` and `repo_check.py`:
 
-| Key pattern | Description |
-|-------------|-------------|
-| `cluster/backup//status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). |
-| `module//backup//status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). |
+```
+cluster/backup//status
+  Hash fields:
+    result     "success" | "error"
+    timestamp  ISO 8601 (UTC)
+    errors     integer count of failed modules
 
-`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601
-string in UTC (e.g. `2024-01-15T03:00:05Z`).
+module//backup//status
+  Hash fields:
+    result     "success" | "error"
+    timestamp  ISO 8601 (UTC)
+    error      human-readable error message (empty on success)
+
+cluster/backup_repository//parameters
+  Hash fields:
+    url                    cloud backend URL (S3, B2, rclone, ...)
+    path                   local or SFTP path
+    password               restic repository password
+    backend                "s3" | "b2" | "sftp" | "rclone" | "local"
+    aws_access_key_id      S3 key ID (also used as b2_account_id for B2)
+    aws_secret_access_key  S3 secret (also used as b2_account_key for B2)
+    rclone_config          path to rclone configuration file
+```
 
 ---
 
 ## Service management
 
 ```bash
-# Check service status
+# Status
 systemctl status ns8-backup-monitor
 
-# Follow live logs via journald
-journalctl -u ns8-backup-monitor -f
-
-# Follow the rotating log file directly
-tail -f /var/log/ns8-backup-monitor.log
-
-# Restart after a config change
+# Restart after config change
 systemctl restart ns8-backup-monitor
 
-# Test the webhook endpoint manually
+# Live logs
+journalctl -u ns8-backup-monitor -f
+
+# Live logs with DEBUG lines filtered out
+journalctl -u ns8-backup-monitor -f | grep -v DEBUG
+
+# Test the webhook manually
 curl -s -X POST http://127.0.0.1:9099/alert \
-  -H 'Content-Type: application/json' \
-  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
+  -H "Content-Type: application/json" \
+  -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
 ```
 
 ---
 
 ## Troubleshooting
 
-### Service starts but no email is received
+### Pipeline does not trigger on automatic backups
 
-1. Verify `ns8-sendmail` works independently:
-   ```bash
-   echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
-   ```
-2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
-3. Increase log level to `DEBUG` and restart the service.
+**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
+when the scheduled timer fires.

-### `REPO_FAILURE` on every alert even though backups succeed
+**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
+and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
+`backup_missing` instead. Additionally, native alerts carry the plan identifier
+in the `id` label, not in `backup_id`.
 
-- The correlator may be reading Redis before all modules have finished.
-  Increase `correlator.wait_seconds` (e.g. to `60`).
-- Check that the Redis socket path is correct:
-  `redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`
+**Fix:** update to the current version; both alert name sets and both labels
+are now handled automatically.
 
-### Alertmanager does not reach the webhook
+**Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert
+with `amtool alert add alertname=backup_failed id=1`. You should see the
+`Received alert:` DEBUG line followed by the pipeline start.
 
-- Confirm the service is listening:
-  `ss -tlnp | grep 9099`
-- If Alertmanager runs on a different host, change `receiver.host` to
-  `0.0.0.0` and open the port in the firewall.
+### No email received
+
+1. Check the service is running: `systemctl status ns8-backup-monitor`.
+2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
+3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
+   `echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
+4. Increase log level to `DEBUG` in `config.yml` and restart the service.
+
+### REPO_FAILURE with no modules found
+
+**Cause:** the correlator ran before backup modules finished writing their
+status to Redis.
+
+**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60–120
+seconds is safe for most clusters. Restart the service after changing the value.

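Why an empty Redis read ends up as `REPO_FAILURE` follows from the classification conditions (`failed == 0` and `total > 0`, `0 < failed < total`, `failed == total` or `total == 0`). A minimal sketch, assuming the per-module `result` values have already been read into a list; `classify` is a hypothetical helper and not necessarily how `correlator.py` is written:

```python
from typing import List


def classify(results: List[str]) -> str:
    """Map per-module `result` values ("success" / "error") to an outcome name."""
    total = len(results)
    failed = sum(1 for r in results if r != "success")
    if total > 0 and failed == 0:
        return "SUCCESS"
    if 0 < failed < total:
        return "PARTIAL"
    # failed == total, which includes total == 0 (no status found in Redis)
    return "REPO_FAILURE"
```

With an empty list the function returns `REPO_FAILURE`, which is the situation a longer `correlator.wait_seconds` is meant to avoid.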
 ---
 
@@ -352,11 +346,8 @@ curl -s -X POST http://127.0.0.1:9099/alert \
 bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
 ```
 
-The script will stop and disable the service, remove the install directory,
-and optionally remove the configuration directory.
-
 ---
 
 ## License
 
-MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner.
+MIT — see [LICENSE](LICENSE).