docs: rewrite README with alert name mapping, label mapping, troubleshooting for automatic backups

2026-05-18 21:57:02 +00:00
parent 80f3ff5e50
commit 07830e1467
+152 -161
@@ -4,7 +4,7 @@
>
> Receives Alertmanager webhook alerts, correlates per-module backup status
> from the cluster Redis, optionally probes restic repositories, and sends a
> detailed HTML/text email through the NS8 mail relay.
> structured plain-text email through the NS8 mail relay.
---
@@ -41,20 +41,25 @@ Alertmanager ──POST /alert──► receiver.py
SUCCESS / PARTIAL / REPO_FAILURE)
repo_check.py ← optional
(runagent → restic snapshots
on each module's repository)
repo_check.py ← skipped on SUCCESS
(restic snapshots --last --no-cache
on each configured repository)
notifier.py
(builds HTML + plain-text email,
(builds plain-text email,
dispatches via ns8-sendmail)
```
**Key design decision:** the service is a long-running HTTP server managed by
systemd, not a one-shot script. This means it is always ready to receive an
alert regardless of whether the backup was triggered manually or by a scheduled
timer.
systemd, not a one-shot script. It is always ready to receive alerts whether
the backup is triggered manually from the UI or by an automatic scheduled timer.
> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom) — both are checked.
---
@@ -73,48 +78,48 @@ ns8-backup-monitor/
│ ├── install.sh ← interactive installer / uninstaller
│ └── ns8-backup-monitor.service ← systemd unit file
└── ns8_backup_monitor/ ← Python package
├── __init__.py ← package metadata, version string
├── __main__.py ← entry point: arg parsing, logging init,
hands off to receiver.run_server()
└── ns8_backup_monitor/ ← Python package (the service code)
├── __init__.py ← package metadata and version string
├── __main__.py ← entry point: argument parsing, logging
initialisation, calls receiver.run_server()
├── receiver.py ← HTTP webhook server (POST /alert)
├── correlator.py ← reads Redis, classifies backup outcome
├── repo_check.py ← probes restic repositories via runagent
├── notifier.py ← builds and sends email notifications
└── utils.py ← load_config(), setup_logging()
│ matches alert names, spawns pipeline thread
├── correlator.py ← reads NS8 Redis, classifies backup outcome
├── repo_check.py ← probes restic repositories for health status
├── notifier.py ← builds and sends the status email
└── utils.py ← load_config() and setup_logging() helpers
```
---
## Runtime paths
The following paths are created by `deploy/install.sh` and assumed by the
default configuration.
Paths created by `deploy/install.sh` and assumed by the default configuration.
| Purpose | Path |
|---------|------|
| Purpose | Default path |
|---------|-------------|
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
| Configuration | `/etc/ns8-backup-monitor/config.yml` |
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
| Log file | `/var/log/ns8-backup-monitor.log` |
| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
---
## Requirements
| Dependency | Provided by | Notes |
|------------|------------|-------|
|------------|-------------|-------|
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed |
| `runagent` | NethServer 8 | Required for `repo_check` only |
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
| `systemd` | OS | Service management |
> **This service must run on an NS8 leader node** (or any node that has
> read access to the cluster Redis socket and `runagent` in `PATH`).
> **This service must run on an NS8 leader node** (or any node with read access
> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
---
@@ -128,7 +133,7 @@ bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/
The installer will:
1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
2. Download and extract the latest source archive from the Gitea repository.
2. Download and extract the latest source from the Gitea repository.
3. Prompt interactively for sender address, recipient list, and subject prefix.
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
5. Install and start the systemd service.
@@ -139,19 +144,13 @@ The installer will:
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
cd ns8-backup-monitor
# Install Python dependency
pip3 install pyyaml
# Create directories
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
# Copy source and config template
cp -r . /opt/ns8-backup-monitor/
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
# Edit the config before starting
nano /etc/ns8-backup-monitor/config.yml
nano /etc/ns8-backup-monitor/config.yml # edit before starting
# Install systemd unit
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now ns8-backup-monitor
@@ -161,188 +160,183 @@ systemctl enable --now ns8-backup-monitor
## Configuration
The configuration file is a YAML document. The installer writes it to
`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
at `config/config.yml.example`.
Full reference is in `config/config.yml.example`. Key parameters:
```yaml
# ---------------------------------------------------------------------------
# Email notification settings
# ---------------------------------------------------------------------------
# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
# configured in NethServer 8. No SMTP credentials are needed here.
mail:
# Envelope / header sender address.
from: "ns8-backup-monitor@yourdomain.com"
# One or more recipient addresses. At least one is required.
to:
- "admin@yourdomain.com"
# String prepended to every email subject line.
subject_prefix: "[NS8 Backup]"
# ---------------------------------------------------------------------------
# Webhook receiver (HTTP server)
# ---------------------------------------------------------------------------
receiver:
# Interface to listen on. 127.0.0.1 is recommended when Alertmanager
# runs on the same host; use 0.0.0.0 only if it runs on a different node.
host: "127.0.0.1"
# TCP port. Must match the webhook URL configured in Alertmanager.
port: 9099
host: 127.0.0.1 # bind address (use 0.0.0.0 only for remote Alertmanager)
port: 9099 # webhook listening port
# ---------------------------------------------------------------------------
# Timing
# ---------------------------------------------------------------------------
correlator:
# Seconds to wait after receiving the alert before reading Redis.
# This grace period allows all module agents to finish writing their
# per-module status hashes. 30 s is sufficient for most deployments.
wait_seconds: 30
wait_seconds: 30 # wait after alert before reading Redis (allow slow modules to finish)
recent_window: 3600 # fallback scan window (seconds) when neither the id nor the backup_id label is present
# Look-back window in seconds used when the alert does not include a
# backup_id label. Any plan whose Redis status was updated within this
# window is considered "recent" and included in the report.
recent_window: 3600
# ---------------------------------------------------------------------------
# Redis connection
# ---------------------------------------------------------------------------
redis:
# Path to the NS8 cluster Redis Unix socket.
# On a standard NS8 installation this path never changes.
socket: "/var/lib/nethserver/cluster/state/redis.sock"
# ---------------------------------------------------------------------------
# Repository check (optional, uses runagent + restic)
# ---------------------------------------------------------------------------
repo_check:
# Maximum seconds to wait for each repository check before giving up.
timeout: 60
# Extra flags passed verbatim to every restic invocation.
# Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
restic_flags: ""
enabled: true # set false to skip restic health checks entirely
timeout: 60 # per-repository restic timeout (seconds)
notification:
mail_from: ns8-backup-monitor@example.com
mail_to:
- admin@example.com
- ops@example.com
redis:
socket: /var/lib/nethserver/cluster/state/redis.sock
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging:
# Python log level: DEBUG, INFO, WARNING, ERROR.
level: INFO
# Absolute path for the rotating log file (5 MB × 3 backups).
# Leave empty to log to stdout / journald only.
file: "/var/log/ns8-backup-monitor.log"
level: INFO # DEBUG for verbose output during troubleshooting
file: /var/log/ns8-backup-monitor.log
```
---
## Alertmanager integration
Add a receiver pointing to the service in your Alertmanager configuration:
Add a receiver and route to your Alertmanager configuration:
```yaml
# alertmanager.yml (relevant excerpt)
route:
receiver: ns8-backup-monitor
# Only route backup-related alerts to this receiver.
routes:
- match:
alertname: NethServerBackupFailed
receiver: ns8-backup-monitor
# alertmanager.yml
receivers:
- name: ns8-backup-monitor
webhook_configs:
- url: "http://127.0.0.1:9099/alert"
# Send resolved alerts too so the service can log them.
send_resolved: true
- url: http://127.0.0.1:9099/alert
send_resolved: false # resolved alerts are intentionally ignored
route:
routes:
- matchers:
- alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
receiver: ns8-backup-monitor
group_wait: 10s
group_interval: 5m
repeat_interval: 12h
```
Reload Alertmanager after editing:
### Supported alert names
```bash
systemctl reload alertmanager
# or, for the NS8 metrics module:
runagent -m metrics1 systemctl reload alertmanager
```
| Alert name | Rule set | Trigger |
|------------|----------|---------|
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
### Label mapping
| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within `correlator.recent_window`.
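The matching rules in the two tables above can be sketched as follows. This is a minimal illustration, not the actual `receiver.py` code: the function names `extract_plan_id` and `select_alerts` are hypothetical, and the preference for `id` over `backup_id` when both labels are present is an assumption.

```python
from typing import Any, Dict, List, Optional

# Alert names from the "Supported alert names" table (native and custom/legacy).
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def extract_plan_id(labels: Dict[str, str]) -> Optional[str]:
    """Return the plan identifier from `id` (native) or `backup_id` (custom).

    Assumption: `id` wins when both labels are present.
    """
    return labels.get("id") or labels.get("backup_id") or None


def select_alerts(payload: Dict[str, Any]) -> List[Optional[str]]:
    """Return one plan id (or None) per firing, backup-related alert.

    None means no plan label was found, so the caller would fall back to
    the `correlator.recent_window` scan.
    """
    selected: List[Optional[str]] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts are ignored
        labels = alert.get("labels", {})
        if labels.get("alertname") not in ALERT_NAMES:
            continue  # not a backup alert
        selected.append(extract_plan_id(labels))
    return selected
```

Returning `None` rather than raising keeps the fallback decision in the caller, which is where the `recent_window` setting would be available.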
---
## Outcome classification
For each backup plan the correlator reads all per-module status hashes and
produces one of three outcomes:
| Outcome | Condition | Email subject |
|---------|-----------|---------------|
| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` |
| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` |
| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` |
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). The repository health check (`repo_check.py`) runs automatically for
`PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
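The conditions in the table translate directly into a small function. The sketch below only illustrates the documented rules; the names `classify` and `needs_repo_check` are hypothetical and are not taken from `correlator.py`.

```python
def classify(total: int, failed: int) -> str:
    """Map per-plan module counts to an outcome, following the table above."""
    if total == 0 or failed == total:
        # No status found in Redis at all, or every module failed.
        return "REPO_FAILURE"
    if failed == 0:
        return "SUCCESS"
    return "PARTIAL"  # 0 < failed < total


def needs_repo_check(outcome: str) -> bool:
    """The repository health check is skipped on SUCCESS."""
    return outcome in ("PARTIAL", "REPO_FAILURE")
```

Testing `total == 0` first matters: with no modules, `failed == 0` is also true, and the plan must not be reported as `SUCCESS`.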
---
## Redis key structure
The correlator reads two families of keys from the NS8 cluster Redis:
Keys read by `correlator.py` and `repo_check.py`:
| Key pattern | Description |
|-------------|-------------|
| `cluster/backup/<backup_id>/status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). |
| `module/<module_id>/backup/<backup_id>/status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). |
```
cluster/backup/<backup_id>/status
Hash fields:
result "success" | "error"
timestamp ISO 8601 (UTC)
errors integer count of failed modules
`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601
string in UTC (e.g. `2024-01-15T03:00:05Z`).
module/<module_id>/backup/<backup_id>/status
Hash fields:
result "success" | "error"
timestamp ISO 8601 (UTC)
error human-readable error message (empty on success)
cluster/backup_repository/<repo_id>/parameters
Hash fields:
url cloud backend URL (S3, B2, rclone, ...)
path local or SFTP path
password restic repository password
backend "s3" | "b2" | "sftp" | "rclone" | "local"
aws_access_key_id S3 key ID (also used as b2_account_id for B2)
aws_secret_access_key S3 secret (also used as b2_account_key for B2)
rclone_config path to rclone configuration file
```
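As an illustration of how these hashes could be read through `redis-cli` in a subprocess, here is a minimal sketch. The helper names (`module_status_key`, `parse_hgetall`, `read_hash`, `is_recent`) are hypothetical and may not match the real `correlator.py`. It assumes that `redis-cli` prints one field or value per line when its output is not a terminal, and that no value contains a newline.

```python
import subprocess
from datetime import datetime, timezone
from typing import Dict, List

# Default socket path from the "Runtime paths" table.
SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def module_status_key(module_id: str, backup_id: str) -> str:
    """Build the per-module status key described above."""
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(lines: List[str]) -> Dict[str, str]:
    """Pair up the alternating field/value lines of HGETALL output."""
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key: str, socket: str = SOCKET) -> Dict[str, str]:
    """Read one hash through redis-cli over the Unix socket."""
    result = subprocess.run(
        ["redis-cli", "-s", socket, "HGETALL", key],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_hgetall(result.stdout.splitlines())


def is_recent(timestamp: str, window_seconds: int, now: datetime) -> bool:
    """True if an ISO 8601 UTC timestamp is no older than the window."""
    # datetime.fromisoformat() on Python 3.8 rejects a trailing "Z".
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return (now - ts).total_seconds() <= window_seconds
```

A caller would pass `datetime.now(timezone.utc)` as `now` and `correlator.recent_window` as `window_seconds`.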
---
## Service management
```bash
# Check service status
# Status
systemctl status ns8-backup-monitor
# Follow live logs via journald
journalctl -u ns8-backup-monitor -f
# Follow the rotating log file directly
tail -f /var/log/ns8-backup-monitor.log
# Restart after a config change
# Restart after config change
systemctl restart ns8-backup-monitor
# Test the webhook endpoint manually
# Live logs
journalctl -u ns8-backup-monitor -f
# Follow live logs with DEBUG lines filtered out
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
# Test the webhook manually
curl -s -X POST http://127.0.0.1:9099/alert \
-H 'Content-Type: application/json' \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
-H "Content-Type: application/json" \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
```
---
## Troubleshooting
### Service starts but no email is received
### Pipeline does not trigger on automatic backups
1. Verify `ns8-sendmail` works independently:
```bash
echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
```
2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
3. Increase log level to `DEBUG` and restart the service.
**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
when the scheduled timer fires.
### `REPO_FAILURE` on every alert even though backups succeed
**Cause (pre-fix):** the old `ALERT_NAMES` set only contained `NsBackupFailed`
and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.
- The correlator may be reading Redis before all modules have finished.
Increase `correlator.wait_seconds` (e.g. to `60`).
- Check that the Redis socket path is correct:
`redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`
**Fix:** update to the current version; both alert name sets and both labels
are now handled automatically.
### Alertmanager does not reach the webhook
**Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert
with `amtool alert add alertname=backup_failed id=1`. With `logging.level` set
to `DEBUG`, you should see the `Received alert:` line followed by the pipeline start.
- Confirm the service is listening:
`ss -tlnp | grep 9099`
- If Alertmanager runs on a different host, change `receiver.host` to
`0.0.0.0` and open the port in the firewall.
### No email received
1. Check the service is running: `systemctl status ns8-backup-monitor`.
2. Verify Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
3. Check that `mail_to` in `config.yml` is set and that `ns8-sendmail` works:
`echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase log level to `DEBUG` in `config.yml` and restart the service.
### REPO_FAILURE with no modules found
**Cause:** the correlator ran before backup modules finished writing their
status to Redis.
**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60 to 120
is safe for most clusters. Restart the service after changing the value.
---
@@ -352,11 +346,8 @@ curl -s -X POST http://127.0.0.1:9099/alert \
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```
The script will stop and disable the service, remove the install directory,
and optionally remove the configuration directory.
---
## License
MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner.
MIT — see [LICENSE](LICENSE).