docs: rewrite README with alert name mapping, label mapping, troubleshooting for automatic backups

2026-05-18 21:57:02 +00:00
parent 80f3ff5e50
commit 07830e1467
1 changed files with 152 additions and 161 deletions
@@ -4,7 +4,7 @@
 >
 > Receives Alertmanager webhook alerts, correlates per-module backup status
 > from the cluster Redis, optionally probes restic repositories, and sends a
-> detailed HTML/text email through the NS8 mail relay.
+> structured plain-text email through the NS8 mail relay.
 ---
@@ -41,20 +41,25 @@ Alertmanager  ──POST /alert──►  receiver.py
                           SUCCESS / PARTIAL / REPO_FAILURE)
                                    │
                                    ▼
-                              repo_check.py          ← optional
+                              repo_check.py          ← skipped on SUCCESS
-                          (runagent → restic snapshots
+                          (restic snapshots --last --no-cache
-                           on each module's repository)
+                           on each configured repository)
                                    │
                                    ▼
                               notifier.py
-                          (builds HTML + plain-text email,
+                          (builds plain-text email,
                           dispatches via ns8-sendmail)
 ```
 **Key design decision:** the service is a long-running HTTP server managed by
-systemd, not a one-shot script.  This means it is always ready to receive an
+systemd, not a one-shot script. It is always ready to receive alerts whether
-alert regardless of whether the backup was triggered manually or by a scheduled
+the backup is triggered manually from the UI or by an automatic scheduled timer.
-timer.
+
 > **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
 > `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
 > All four names are matched so the pipeline fires on both native and custom rules.
 > The plan identifier is extracted from the `id` label (native) or `backup_id`
 > label (custom) — both are checked.
 ---
@@ -73,48 +78,48 @@ ns8-backup-monitor/
 │   ├── install.sh                     ← interactive installer / uninstaller
 │   └── ns8-backup-monitor.service     ← systemd unit file
 │
-└── ns8_backup_monitor/                ← Python package
+└── ns8_backup_monitor/                ← Python package (the service code)
-    ├── __init__.py                    ← package metadata, version string
+    ├── __init__.py                    ← package metadata and version string
-    ├── __main__.py                    ← entry point: arg parsing, logging init,
+    ├── __main__.py                    ← entry point: argument parsing, logging
-    │                                     hands off to receiver.run_server()
+    │                                     initialisation, calls receiver.run_server()
    ├── receiver.py                    ← HTTP webhook server (POST /alert)
-    ├── correlator.py                  ← reads Redis, classifies backup outcome
+    │                                     matches alert names, spawns pipeline thread
-    ├── repo_check.py                  ← probes restic repositories via runagent
+    ├── correlator.py                  ← reads NS8 Redis, classifies backup outcome
-    ├── notifier.py                    ← builds and sends email notifications
+    ├── repo_check.py                  ← probes restic repositories for health status
-    └── utils.py                       ← load_config(), setup_logging()
+    ├── notifier.py                    ← builds and sends the status email
    └── utils.py                       ← load_config() and setup_logging() helpers
 ```
 ---
 ## Runtime paths
-The following paths are created by `deploy/install.sh` and assumed by the
+Paths created by `deploy/install.sh` and assumed by the default configuration.
-default configuration.
-| Purpose | Path |
+| Purpose | Default path |
-|---------|------|
+|---------|-------------|
 | Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
 | Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
-| Configuration | `/etc/ns8-backup-monitor/config.yml` |
+| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
 | systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
 | Log file | `/var/log/ns8-backup-monitor.log` |
-| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
+| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
 ---
 ## Requirements
 | Dependency | Provided by | Notes |
-|------------|------------|-------|
+|------------|-------------|-------|
 | `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
 | `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
-| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed |
+| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
-| `runagent` | NethServer 8 | Required for `repo_check` only |
+| `restic` | NethServer 8 / manual | Required for `repo_check` only |
 | `ns8-sendmail` | NethServer 8 | Required for email delivery |
 | `systemd` | OS | Service management |
-> **This service must run on an NS8 leader node** (or any node that has
+> **This service must run on an NS8 leader node** (or any node with read access
-> read access to the cluster Redis socket and `runagent` in `PATH`).
+> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
 ---
@@ -128,7 +133,7 @@ bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/
 The installer will:
 1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
-2. Download and extract the latest source archive from the Gitea repository.
+2. Download and extract the latest source from the Gitea repository.
 3. Prompt interactively for sender address, recipient list, and subject prefix.
 4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
 5. Install and start the systemd service.
@@ -139,19 +144,13 @@ The installer will:
 git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
 cd ns8-backup-monitor
 # Install Python dependency
 pip3 install pyyaml
 # Create directories
 mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
 # Copy source and config template
 cp -r . /opt/ns8-backup-monitor/
 cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
-# Edit the config before starting
+nano /etc/ns8-backup-monitor/config.yml   # edit before starting
-nano /etc/ns8-backup-monitor/config.yml
 # Install systemd unit
 cp deploy/ns8-backup-monitor.service /etc/systemd/system/
 systemctl daemon-reload
 systemctl enable --now ns8-backup-monitor
@@ -161,188 +160,183 @@ systemctl enable --now ns8-backup-monitor
 ## Configuration
-The configuration file is a YAML document.  The installer writes it to
+Full reference is in `config/config.yml.example`. Key parameters:
-`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
-at `config/config.yml.example`.
 ```yaml
 # ---------------------------------------------------------------------------
 # Email notification settings
 # ---------------------------------------------------------------------------
 # Delivery is handled by ns8-sendmail, which uses the SMTP relay already
 # configured in NethServer 8.  No SMTP credentials are needed here.
 mail:
  # Envelope / header sender address.
  from: "ns8-backup-monitor@yourdomain.com"
  # One or more recipient addresses.  At least one is required.
  to:
    - "admin@yourdomain.com"
  # String prepended to every email subject line.
  subject_prefix: "[NS8 Backup]"
 # ---------------------------------------------------------------------------
 # Webhook receiver (HTTP server)
 # ---------------------------------------------------------------------------
 receiver:
-  # Interface to listen on.  127.0.0.1 is recommended when Alertmanager
+  host: 127.0.0.1   # bind address (use 0.0.0.0 only for remote Alertmanager)
-  # runs on the same host; use 0.0.0.0 only if it runs on a different node.
+  port: 9099         # webhook listening port
  host: "127.0.0.1"
  # TCP port.  Must match the webhook URL configured in Alertmanager.
  port: 9099
 # ---------------------------------------------------------------------------
 # Timing
 # ---------------------------------------------------------------------------
 correlator:
-  # Seconds to wait after receiving the alert before reading Redis.
+  wait_seconds: 30   # wait after alert before reading Redis (allow slow modules to finish)
-  # This grace period allows all module agents to finish writing their
+  recent_window: 3600 # fallback scan window (seconds) when no backup_id label is present
  # per-module status hashes.  30 s is sufficient for most deployments.
  wait_seconds: 30
  # Look-back window in seconds used when the alert does not include a
  # backup_id label.  Any plan whose Redis status was updated within this
  # window is considered "recent" and included in the report.
  recent_window: 3600
 # ---------------------------------------------------------------------------
 # Redis connection
 # ---------------------------------------------------------------------------
 redis:
  # Path to the NS8 cluster Redis Unix socket.
  # On a standard NS8 installation this path never changes.
  socket: "/var/lib/nethserver/cluster/state/redis.sock"
 # ---------------------------------------------------------------------------
 # Repository check (optional, uses runagent + restic)
 # ---------------------------------------------------------------------------
 repo_check:
-  # Maximum seconds to wait for each repository check before giving up.
+  enabled: true      # set false to skip restic health checks entirely
-  timeout: 60
+  timeout: 60        # per-repository restic timeout (seconds)
-  # Extra flags passed verbatim to every restic invocation.
+
-  # Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
+notification:
-  restic_flags: ""
+  mail_from: ns8-backup-monitor@example.com
  mail_to:
    - admin@example.com
    - ops@example.com
 redis:
  socket: /var/lib/nethserver/cluster/state/redis.sock
 # ---------------------------------------------------------------------------
 # Logging
 # ---------------------------------------------------------------------------
 logging:
-  # Python log level: DEBUG, INFO, WARNING, ERROR.
+  level: INFO        # DEBUG for verbose output during troubleshooting
-  level: INFO
+  file: /var/log/ns8-backup-monitor.log
  # Absolute path for the rotating log file (5 MB × 3 backups).
  # Leave empty to log to stdout / journald only.
  file: "/var/log/ns8-backup-monitor.log"
 ```
 ---
 ## Alertmanager integration
-Add a receiver pointing to the service in your Alertmanager configuration:
+Add a receiver and route to your Alertmanager configuration:
 ```yaml
-# alertmanager.yml (relevant excerpt)
+# alertmanager.yml
 route:
  receiver: ns8-backup-monitor
  # Only route backup-related alerts to this receiver.
  routes:
    - match:
        alertname: NethServerBackupFailed
      receiver: ns8-backup-monitor
 receivers:
  - name: ns8-backup-monitor
    webhook_configs:
-      - url: "http://127.0.0.1:9099/alert"
+      - url: http://127.0.0.1:9099/alert
-        # Send resolved alerts too so the service can log them.
+        send_resolved: false   # resolved alerts are intentionally ignored
-        send_resolved: true
+
 route:
  routes:
    - matchers:
        - alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
 ```
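Because `send_resolved` is `false` and resolved alerts are ignored, the receiver only needs the labels of firing alerts. The following is a minimal sketch of that filtering step, not the actual `receiver.py` code; the function name `firing_labels` is hypothetical, and the payload shape (`alerts[].status`, `alerts[].labels`) follows the standard Alertmanager webhook body.

```python
import json
from typing import Dict, List


def firing_labels(body: bytes) -> List[Dict[str, str]]:
    """Return the label sets of all firing alerts in a webhook body.

    Hypothetical helper: resolved alerts are dropped, mirroring the
    "resolved alerts are intentionally ignored" note in the config above.
    """
    payload = json.loads(body)
    return [
        alert.get("labels", {})
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]
```

Filtering in the service as well as in Alertmanager keeps the pipeline quiet even if `send_resolved` is later switched on by mistake.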
-Reload Alertmanager after editing:
+### Supported alert names
-```bash
+| Alert name | Rule set | Trigger |
-systemctl reload alertmanager
+|------------|----------|---------|
-# or, for the NS8 metrics module:
+| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
-runagent -m metrics1 systemctl reload alertmanager
+| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
-```
+| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
 | `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
 ### Label mapping
 | Label | Used by | Contains |
 |-------|---------|----------|
 | `id` | NS8 native alerts | Backup plan identifier |
 | `backup_id` | Custom / legacy alerts | Backup plan identifier |
 Both labels are checked. When neither is present the correlator falls back to
 scanning Redis for all plan status keys updated within `correlator.recent_window`.
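The name and label matching described above can be sketched as follows. This is an illustration under assumptions, not the service's real code: `ALERT_NAMES` is mentioned in the Troubleshooting section, but the helper names `is_backup_alert` and `extract_plan_id` are hypothetical, and checking `id` before `backup_id` is an assumed precedence that the README does not state.

```python
from typing import Mapping, Optional

# The four alert names listed in "Supported alert names".
ALERT_NAMES = frozenset(
    {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}
)


def is_backup_alert(labels: Mapping[str, str]) -> bool:
    """True when the alertname label is one of the supported names."""
    return labels.get("alertname") in ALERT_NAMES


def extract_plan_id(labels: Mapping[str, str]) -> Optional[str]:
    """Return the plan identifier, or None to trigger the Redis fallback scan.

    Assumption: the native `id` label is checked before `backup_id`.
    """
    for key in ("id", "backup_id"):
        value = labels.get(key)
        if value:
            return str(value)
    return None
```

A `None` result corresponds to the fallback case, where the correlator scans for plans updated within `correlator.recent_window`.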
 ---
 ## Outcome classification
 For each backup plan the correlator reads all per-module status hashes and
 produces one of three outcomes:
 | Outcome | Condition | Email subject |
 |---------|-----------|---------------|
-| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` |
+| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
-| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` |
+| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
-| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` |
+| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
 `REPO_FAILURE` covers both the case where all modules failed and the case where
 no status was found in Redis at all (possible repository-level or scheduling
 issue). The repository health check (`repo_check.py`) runs automatically for
 `PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
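The conditions in the table translate directly into a small function. The sketch below assumes the correlator has already counted `total` modules and `failed` modules for a plan; the function name `classify` and the input validation are illustrative and not taken from `correlator.py`.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to SUCCESS, PARTIAL, or REPO_FAILURE."""
    if total < 0 or failed < 0 or failed > total:
        raise ValueError("expected 0 <= failed <= total")
    # No status found at all, or every module failed.
    if total == 0 or failed == total:
        return "REPO_FAILURE"
    if failed == 0:
        return "SUCCESS"
    # Remaining case: 0 < failed < total.
    return "PARTIAL"
```

Checking the `REPO_FAILURE` condition first matters: with `total == 0`, `failed == 0` is also true, and the table assigns that case to `REPO_FAILURE` rather than `SUCCESS`.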
 ---
 ## Redis key structure
-The correlator reads two families of keys from the NS8 cluster Redis:
+Keys read by `correlator.py` and `repo_check.py`:
-| Key pattern | Description |
+```
-|-------------|-------------|
+cluster/backup/<backup_id>/status
-| `cluster/backup/<backup_id>/status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). |
+    Hash fields:
-| `module/<module_id>/backup/<backup_id>/status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). |
+        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        errors     integer count of failed modules
-`result` is either `"success"` or `"error"`.  `timestamp` is an ISO 8601
+module/<module_id>/backup/<backup_id>/status
-string in UTC (e.g. `2024-01-15T03:00:05Z`).
+    Hash fields:
        result     "success" | "error"
        timestamp  ISO 8601 (UTC)
        error      human-readable error message (empty on success)
 cluster/backup_repository/<repo_id>/parameters
    Hash fields:
        url               cloud backend URL (S3, B2, rclone, ...)
        path              local or SFTP path
        password          restic repository password
        backend           "s3" | "b2" | "sftp" | "rclone" | "local"
        aws_access_key_id S3 key ID (also used as b2_account_id for B2)
        aws_secret_access_key S3 secret (also used as b2_account_key for B2)
        rclone_config     path to rclone configuration file
 ```
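Since Redis is accessed through `redis-cli` in a subprocess, one of these hashes can be read with `HGETALL` and parsed from the command's plain-text output. The sketch below is an assumption about how that might look, not the code in `correlator.py`; the names `parse_hgetall` and `read_hash` are hypothetical. It relies on non-interactive `redis-cli` printing each field and its value on alternating lines, so it would misparse values that contain newlines.

```python
import subprocess
from typing import Dict


def parse_hgetall(output: str) -> Dict[str, str]:
    """Turn alternating field/value lines into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(socket_path: str, key: str, timeout: int = 10) -> Dict[str, str]:
    """Read one Redis hash through redis-cli over the Unix socket."""
    result = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_hgetall(result.stdout)
```

For example, `read_hash("/var/lib/nethserver/cluster/state/redis.sock", "cluster/backup/1/status")` would return the plan-level fields for a plan with ID `1`, and an empty dict when the key does not exist.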
 ---
 ## Service management
 ```bash
-# Check service status
+# Status
 systemctl status ns8-backup-monitor
-# Follow live logs via journald
+# Restart after config change
 journalctl -u ns8-backup-monitor -f
 # Follow the rotating log file directly
 tail -f /var/log/ns8-backup-monitor.log
 # Restart after a config change
 systemctl restart ns8-backup-monitor
-# Test the webhook endpoint manually
+# Live logs
 journalctl -u ns8-backup-monitor -f
 # Follow live logs with DEBUG lines filtered out
 journalctl -u ns8-backup-monitor -f | grep -v DEBUG
 # Test the webhook manually
 curl -s -X POST http://127.0.0.1:9099/alert \
-  -H 'Content-Type: application/json' \
+  -H "Content-Type: application/json" \
-  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
+  -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
 ```
 ---
 ## Troubleshooting
-### Service starts but no email is received
+### Pipeline does not trigger on automatic backups
-1. Verify `ns8-sendmail` works independently:
+**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
-   ```bash
+when the scheduled timer fires.
   echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
   ```
 2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
 3. Increase log level to `DEBUG` and restart the service.
-### `REPO_FAILURE` on every alert even though backups succeed
+**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
 and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
 `backup_missing` instead. Additionally, native alerts carry the plan identifier
 in the `id` label, not in `backup_id`.
- The correlator may be reading Redis before all modules have finished.  
+**Fix:** update to the current version; both alert name sets and both labels
-  Increase `correlator.wait_seconds` (e.g. to `60`).
+are now handled automatically.
 - Check that the Redis socket path is correct:  
  `redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`
-### Alertmanager does not reach the webhook
+**Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert
 with `amtool alert add alertname=backup_failed id=1`. You should see the
 `Received alert:` DEBUG line followed by the pipeline start.
- Confirm the service is listening:  
+### No email received
-  `ss -tlnp | grep 9099`
+
- If Alertmanager runs on a different host, change `receiver.host` to
+1. Check the service is running: `systemctl status ns8-backup-monitor`.
-  `0.0.0.0` and open the port in the firewall.
+2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
 3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
   `echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
 4. Increase log level to `DEBUG` in `config.yml` and restart the service.
 ### REPO_FAILURE with no modules found
 **Cause:** the correlator ran before backup modules finished writing their
 status to Redis.
 **Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60–120
 seconds is safe for most clusters. Restart the service after changing the value.
 ---
@@ -352,11 +346,8 @@ curl -s -X POST http://127.0.0.1:9099/alert \
 bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
 ```
 The script will stop and disable the service, remove the install directory,
 and optionally remove the configuration directory.
 ---
 ## License
-MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner.
+MIT — see [LICENSE](LICENSE).