docs: rewrite README with alert name mapping, label mapping, troubleshooting for automatic backups

2026-05-18 21:57:02 +00:00
parent 80f3ff5e50
commit 07830e1467
+152 -161
@@ -4,7 +4,7 @@
> >
> Receives Alertmanager webhook alerts, correlates per-module backup status > Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a > from the cluster Redis, optionally probes restic repositories, and sends a
> detailed HTML/text email through the NS8 mail relay. > structured plain-text email through the NS8 mail relay.
--- ---
@@ -41,20 +41,25 @@ Alertmanager ──POST /alert──► receiver.py
SUCCESS / PARTIAL / REPO_FAILURE) SUCCESS / PARTIAL / REPO_FAILURE)
repo_check.py ← optional repo_check.py ← skipped on SUCCESS
(runagent → restic snapshots (restic snapshots --last --no-cache
on each module's repository) on each configured repository)
notifier.py notifier.py
(builds HTML + plain-text email, (builds plain-text email,
dispatches via ns8-sendmail) dispatches via ns8-sendmail)
``` ```
**Key design decision:** the service is a long-running HTTP server managed by **Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. This means it is always ready to receive an systemd, not a one-shot script. It is always ready to receive alerts whether
alert regardless of whether the backup was triggered manually or by a scheduled the backup is triggered manually from the UI or by an automatic scheduled timer.
timer.
> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom) — both are checked.
--- ---
@@ -73,48 +78,48 @@ ns8-backup-monitor/
│ ├── install.sh ← interactive installer / uninstaller │ ├── install.sh ← interactive installer / uninstaller
│ └── ns8-backup-monitor.service ← systemd unit file │ └── ns8-backup-monitor.service ← systemd unit file
└── ns8_backup_monitor/ ← Python package └── ns8_backup_monitor/ ← Python package (the service code)
├── __init__.py ← package metadata, version string ├── __init__.py ← package metadata and version string
├── __main__.py ← entry point: arg parsing, logging init, ├── __main__.py ← entry point: argument parsing, logging
hands off to receiver.run_server() initialisation, calls receiver.run_server()
├── receiver.py ← HTTP webhook server (POST /alert) ├── receiver.py ← HTTP webhook server (POST /alert)
├── correlator.py ← reads Redis, classifies backup outcome │ matches alert names, spawns pipeline thread
├── repo_check.py ← probes restic repositories via runagent ├── correlator.py ← reads NS8 Redis, classifies backup outcome
├── notifier.py ← builds and sends email notifications ├── repo_check.py ← probes restic repositories for health status
└── utils.py ← load_config(), setup_logging() ├── notifier.py ← builds and sends the status email
└── utils.py ← load_config() and setup_logging() helpers
``` ```
--- ---
## Runtime paths ## Runtime paths
The following paths are created by `deploy/install.sh` and assumed by the Paths created by `deploy/install.sh` and assumed by the default configuration.
default configuration.
| Purpose | Path | | Purpose | Default path |
|---------|------| |---------|-------------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` | | Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` | | Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration | `/etc/ns8-backup-monitor/config.yml` | | Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` | | systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` | | Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` | | NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
--- ---
## Requirements ## Requirements
| Dependency | Provided by | Notes | | Dependency | Provided by | Notes |
|------------|------------|-------| |------------|-------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ | | `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency | | `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed | | `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `runagent` | NethServer 8 | Required for `repo_check` only | | `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery | | `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management | | `systemd` | OS | Service management |
> **This service must run on an NS8 leader node** (or any node that has > **This service must run on an NS8 leader node** (or any node with read access
> read access to the cluster Redis socket and `runagent` in `PATH`). > to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
--- ---
@@ -128,7 +133,7 @@ bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/
The installer will: The installer will:
1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`). 1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source archive from the Gitea repository. 2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix. 3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values. 4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service. 5. Install and start the systemd service.
@@ -139,19 +144,13 @@ The installer will:
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor cd ns8-backup-monitor
# Install Python dependency
pip3 install pyyaml pip3 install pyyaml
# Create directories
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
# Copy source and config template
cp -r . /opt/ns8-backup-monitor/ cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
# Edit the config before starting nano /etc/ns8-backup-monitor/config.yml # edit before starting
nano /etc/ns8-backup-monitor/config.yml
# Install systemd unit
cp deploy/ns8-backup-monitor.service /etc/systemd/system/ cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload systemctl daemon-reload
systemctl enable --now ns8-backup-monitor systemctl enable --now ns8-backup-monitor
@@ -161,188 +160,183 @@ systemctl enable --now ns8-backup-monitor
## Configuration ## Configuration
The configuration file is a YAML document. The installer writes it to Full reference is in `config/config.yml.example`. Key parameters:
`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
at `config/config.yml.example`.
```yaml ```yaml
# ---------------------------------------------------------------------------
# Email notification settings
# ---------------------------------------------------------------------------
# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
# configured in NethServer 8. No SMTP credentials are needed here.
mail:
# Envelope / header sender address.
from: "ns8-backup-monitor@yourdomain.com"
# One or more recipient addresses. At least one is required.
to:
- "admin@yourdomain.com"
# String prepended to every email subject line.
subject_prefix: "[NS8 Backup]"
# ---------------------------------------------------------------------------
# Webhook receiver (HTTP server)
# ---------------------------------------------------------------------------
receiver: receiver:
# Interface to listen on. 127.0.0.1 is recommended when Alertmanager host: 127.0.0.1 # bind address (use 0.0.0.0 only for remote Alertmanager)
# runs on the same host; use 0.0.0.0 only if it runs on a different node. port: 9099 # webhook listening port
host: "127.0.0.1"
# TCP port. Must match the webhook URL configured in Alertmanager.
port: 9099
# ---------------------------------------------------------------------------
# Timing
# ---------------------------------------------------------------------------
correlator: correlator:
# Seconds to wait after receiving the alert before reading Redis. wait_seconds: 30 # wait after alert before reading Redis (allow slow modules to finish)
# This grace period allows all module agents to finish writing their recent_window: 3600 # fallback scan window (seconds) when no backup_id label is present
# per-module status hashes. 30 s is sufficient for most deployments.
wait_seconds: 30
# Look-back window in seconds used when the alert does not include a
# backup_id label. Any plan whose Redis status was updated within this
# window is considered "recent" and included in the report.
recent_window: 3600
# ---------------------------------------------------------------------------
# Redis connection
# ---------------------------------------------------------------------------
redis:
# Path to the NS8 cluster Redis Unix socket.
# On a standard NS8 installation this path never changes.
socket: "/var/lib/nethserver/cluster/state/redis.sock"
# ---------------------------------------------------------------------------
# Repository check (optional, uses runagent + restic)
# ---------------------------------------------------------------------------
repo_check: repo_check:
# Maximum seconds to wait for each repository check before giving up. enabled: true # set false to skip restic health checks entirely
timeout: 60 timeout: 60 # per-repository restic timeout (seconds)
# Extra flags passed verbatim to every restic invocation.
# Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt" notification:
restic_flags: "" mail_from: ns8-backup-monitor@example.com
mail_to:
- admin@example.com
- ops@example.com
redis:
socket: /var/lib/nethserver/cluster/state/redis.sock
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging: logging:
# Python log level: DEBUG, INFO, WARNING, ERROR. level: INFO # DEBUG for verbose output during troubleshooting
level: INFO file: /var/log/ns8-backup-monitor.log
# Absolute path for the rotating log file (5 MB × 3 backups).
# Leave empty to log to stdout / journald only.
file: "/var/log/ns8-backup-monitor.log"
``` ```
--- ---
## Alertmanager integration ## Alertmanager integration
Add a receiver pointing to the service in your Alertmanager configuration: Add a receiver and route to your Alertmanager configuration:
```yaml ```yaml
# alertmanager.yml (relevant excerpt) # alertmanager.yml
route:
receiver: ns8-backup-monitor
# Only route backup-related alerts to this receiver.
routes:
- match:
alertname: NethServerBackupFailed
receiver: ns8-backup-monitor
receivers: receivers:
- name: ns8-backup-monitor - name: ns8-backup-monitor
webhook_configs: webhook_configs:
- url: "http://127.0.0.1:9099/alert" - url: http://127.0.0.1:9099/alert
# Send resolved alerts too so the service can log them. send_resolved: false # resolved alerts are intentionally ignored
send_resolved: true
route:
routes:
- matchers:
- alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
receiver: ns8-backup-monitor
group_wait: 10s
group_interval: 5m
repeat_interval: 12h
``` ```
Reload Alertmanager after editing: ### Supported alert names
```bash | Alert name | Rule set | Trigger |
systemctl reload alertmanager |------------|----------|---------|
# or, for the NS8 metrics module: | `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
runagent -m metrics1 systemctl reload alertmanager | `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
``` | `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
### Label mapping
| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within `correlator.recent_window`.
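
The matching and label lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `receiver.py`: the helper names `is_backup_alert` and `extract_plan_id` are hypothetical, and checking `id` before `backup_id` is an assumed order.

```python
from typing import Optional

# Alert names from the "Supported alert names" table.
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def is_backup_alert(alert: dict) -> bool:
    """Return True for a firing alert whose name is in ALERT_NAMES.

    Resolved alerts are ignored, matching `send_resolved: false`.
    (Hypothetical helper, not taken from receiver.py.)
    """
    labels = alert.get("labels", {})
    return alert.get("status") == "firing" and labels.get("alertname") in ALERT_NAMES


def extract_plan_id(alert: dict) -> Optional[str]:
    """Return the plan identifier from `id` (native) or `backup_id` (custom).

    Returns None when neither label is present, in which case the caller
    would fall back to the `correlator.recent_window` scan.
    The lookup order is an assumption made for this sketch.
    """
    labels = alert.get("labels", {})
    return labels.get("id") or labels.get("backup_id") or None
```

With the payload from the manual webhook test, `extract_plan_id` would return `"1"`; an alert without either label yields `None`.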
--- ---
## Outcome classification ## Outcome classification
For each backup plan the correlator reads all per-module status hashes and
produces one of three outcomes:
| Outcome | Condition | Email subject | | Outcome | Condition | Email subject |
|---------|-----------|---------------| |---------|-----------|---------------|
| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` | | `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` | | `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` | | `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). The repository health check (`repo_check.py`) runs automatically for
`PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
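
As a rough sketch of the conditions in the table, the classification reduces to a comparison of two counters. The function names `classify` and `needs_repo_check` are hypothetical, and `total` and `failed` are assumed to be the per-plan counts of modules found in Redis and of modules whose `result` is not `success`.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to an outcome, following the table's conditions."""
    if total == 0 or failed == total:
        # No status found at all, or every module failed.
        return "REPO_FAILURE"
    if failed == 0:
        return "SUCCESS"
    # Remaining case: 0 < failed < total.
    return "PARTIAL"


def needs_repo_check(outcome: str) -> bool:
    """The repository probe is skipped on SUCCESS and run otherwise."""
    return outcome in ("PARTIAL", "REPO_FAILURE")
```

Testing `total == 0` first keeps the empty case from being read as a success, since `failed == 0` is also true when nothing was found.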
--- ---
## Redis key structure ## Redis key structure
The correlator reads two families of keys from the NS8 cluster Redis: Keys read by `correlator.py` and `repo_check.py`:
| Key pattern | Description | ```
|-------------|-------------| cluster/backup/<backup_id>/status
| `cluster/backup/<backup_id>/status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). | Hash fields:
| `module/<module_id>/backup/<backup_id>/status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). | result "success" | "error"
timestamp ISO 8601 (UTC)
errors integer count of failed modules
`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601 module/<module_id>/backup/<backup_id>/status
string in UTC (e.g. `2024-01-15T03:00:05Z`). Hash fields:
result "success" | "error"
timestamp ISO 8601 (UTC)
error human-readable error message (empty on success)
cluster/backup_repository/<repo_id>/parameters
Hash fields:
url cloud backend URL (S3, B2, rclone, ...)
path local or SFTP path
password restic repository password
backend "s3" | "b2" | "sftp" | "rclone" | "local"
aws_access_key_id S3 key ID (also used as b2_account_id for B2)
aws_secret_access_key S3 secret (also used as b2_account_key for B2)
rclone_config path to rclone configuration file
```
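
Since the service talks to Redis through `redis-cli` rather than a Python client, reading one of these hashes could look like the sketch below. It is an illustration only: `parse_hgetall` and `read_status` are hypothetical names, and the parsing assumes the non-interactive `redis-cli` output for `HGETALL`, where field names and values alternate one per line.

```python
import subprocess

# Default socket path from the "Runtime paths" table.
SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def parse_hgetall(output: str) -> dict:
    """Turn alternating field/value lines into a dict.

    Assumes no value contains a newline; an empty value (such as `error`
    on success) shows up as an empty line and is preserved.
    """
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_status(key: str, socket: str = SOCKET) -> dict:
    """Run `redis-cli -s <socket> HGETALL <key>` and parse the result."""
    result = subprocess.run(
        ["redis-cli", "-s", socket, "HGETALL", key],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return parse_hgetall(result.stdout)
```

For example, `read_status("module/mail1/backup/1/status")` would return the `result`, `timestamp`, and `error` fields as strings, where `mail1` and `1` are placeholder identifiers.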
--- ---
## Service management ## Service management
```bash ```bash
# Check service status # Status
systemctl status ns8-backup-monitor systemctl status ns8-backup-monitor
# Follow live logs via journald # Restart after config change
journalctl -u ns8-backup-monitor -f
# Follow the rotating log file directly
tail -f /var/log/ns8-backup-monitor.log
# Restart after a config change
systemctl restart ns8-backup-monitor systemctl restart ns8-backup-monitor
# Test the webhook endpoint manually # Live logs
journalctl -u ns8-backup-monitor -f
# Live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \ curl -s -X POST http://127.0.0.1:9099/alert \
-H 'Content-Type: application/json' \ -H "Content-Type: application/json" \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}' -d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
``` ```
--- ---
## Troubleshooting ## Troubleshooting
### Service starts but no email is received ### Pipeline does not trigger on automatic backups
1. Verify `ns8-sendmail` works independently: **Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
```bash when the scheduled timer fires.
echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
```
2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
3. Increase log level to `DEBUG` and restart the service.
### `REPO_FAILURE` on every alert even though backups succeed **Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.
- The correlator may be reading Redis before all modules have finished. **Fix:** update to the current version; both alert name sets and both labels
Increase `correlator.wait_seconds` (e.g. to `60`). are now handled automatically.
- Check that the Redis socket path is correct:
`redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`
### Alertmanager does not reach the webhook **Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert
with `amtool alert add alertname=backup_failed id=1`. You should see the
`Received alert:` DEBUG line (visible only when `logging.level` is `DEBUG`) followed by the pipeline start.
- Confirm the service is listening: ### No email received
`ss -tlnp | grep 9099`
- If Alertmanager runs on a different host, change `receiver.host` to 1. Check the service is running: `systemctl status ns8-backup-monitor`.
`0.0.0.0` and open the port in the firewall. 2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
3. Check `mail_to` in `config.yml` is set and `ns8-sendmail` works:
`echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase log level to `DEBUG` in `config.yml` and restart the service.
### REPO_FAILURE with no modules found
**Cause:** the correlator ran before backup modules finished writing their
status to Redis.
**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60 to 120
seconds is safe for most clusters. Restart the service after changing the value.
--- ---
@@ -352,11 +346,8 @@ curl -s -X POST http://127.0.0.1:9099/alert \
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
``` ```
The script will stop and disable the service, remove the install directory,
and optionally remove the configuration directory.
--- ---
## License ## License
MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner. MIT — see [LICENSE](LICENSE).