docs: rewrite README with alert name mapping, label mapping, troubleshooting for automatic backups
@@ -4,7 +4,7 @@
|
||||
>
|
||||
> Receives Alertmanager webhook alerts, correlates per-module backup status
|
||||
> from the cluster Redis, optionally probes restic repositories, and sends a
|
||||
> detailed HTML/text email through the NS8 mail relay.
|
||||
> structured plain-text email through the NS8 mail relay.
|
||||
|
||||
---
|
||||
|
||||
@@ -41,20 +41,25 @@ Alertmanager ──POST /alert──► receiver.py
|
||||
SUCCESS / PARTIAL / REPO_FAILURE)
|
||||
│
|
||||
▼
|
||||
repo_check.py ← optional
|
||||
(runagent → restic snapshots
|
||||
on each module's repository)
|
||||
repo_check.py ← skipped on SUCCESS
|
||||
(restic snapshots --last --no-cache
|
||||
on each configured repository)
|
||||
│
|
||||
▼
|
||||
notifier.py
|
||||
(builds HTML + plain-text email,
|
||||
(builds plain-text email,
|
||||
dispatches via ns8-sendmail)
|
||||
```
|
||||
|
||||
**Key design decision:** the service is a long-running HTTP server managed by
|
||||
systemd, not a one-shot script. This means it is always ready to receive an
|
||||
alert regardless of whether the backup was triggered manually or by a scheduled
|
||||
timer.
|
||||
systemd, not a one-shot script. It is always ready to receive alerts whether
|
||||
the backup is triggered manually from the UI or by an automatic scheduled timer.
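
A minimal sketch of the long-running webhook design, using only the standard library. It assumes the handler acknowledges the POST immediately and runs the slow pipeline (wait, correlate, check, notify) in a background thread; `make_handler`, `run_server_sketch`, and the `pipeline` callable are hypothetical names, not the actual contents of `receiver.py`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(pipeline):
    """Build a request handler that forwards each POST /alert body to `pipeline`."""

    class AlertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/alert":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                self.send_error(400)
                return
            # Run the slow part in a thread so Alertmanager gets its reply at once.
            threading.Thread(target=pipeline, args=(payload,), daemon=True).start()
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            # Silence the default per-request stderr logging in this sketch.
            pass

    return AlertHandler


def run_server_sketch(host="127.0.0.1", port=9099, pipeline=print):
    """Serve forever; systemd would be responsible for restarts."""
    HTTPServer((host, port), make_handler(pipeline)).serve_forever()
```

Because the process stays resident, the same code path handles alerts from manual and scheduled backups alike.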
|
||||
|
||||
> **Note on automatic alerts:** NS8 native Prometheus rules emit alerts named
> `backup_failed` and `backup_missing` (not `NsBackupFailed` / `NsBackupMissing`).
> All four names are matched so the pipeline fires on both native and custom rules.
> The plan identifier is extracted from the `id` label (native) or `backup_id`
> label (custom); both are checked.
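
The matching described in the note can be sketched roughly as follows. The `ALERT_NAMES` set is named in this README; the helper names `is_backup_alert` and `extract_plan_id`, and the choice to prefer `id` over `backup_id` when both are present, are assumptions made for illustration.

```python
from typing import Optional

# The four alert names listed above: two native NS8 names, two custom/legacy names.
ALERT_NAMES = {"backup_failed", "backup_missing", "NsBackupFailed", "NsBackupMissing"}


def is_backup_alert(alert: dict) -> bool:
    """Return True for a firing alert whose alertname is one of ALERT_NAMES."""
    labels = alert.get("labels", {})
    return alert.get("status") == "firing" and labels.get("alertname") in ALERT_NAMES


def extract_plan_id(alert: dict) -> Optional[str]:
    """Check the native `id` label first, then the custom `backup_id` label."""
    labels = alert.get("labels", {})
    # Precedence between the two labels is an assumption of this sketch.
    return labels.get("id") or labels.get("backup_id") or None
```

An alert with neither label yields `None`, which is the case where a time-window scan of Redis would take over.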
|
||||
|
||||
---
|
||||
|
||||
@@ -73,48 +78,48 @@ ns8-backup-monitor/
|
||||
│ ├── install.sh ← interactive installer / uninstaller
|
||||
│ └── ns8-backup-monitor.service ← systemd unit file
|
||||
│
|
||||
└── ns8_backup_monitor/ ← Python package
|
||||
├── __init__.py ← package metadata, version string
|
||||
├── __main__.py ← entry point: arg parsing, logging init,
|
||||
│ hands off to receiver.run_server()
|
||||
└── ns8_backup_monitor/ ← Python package (the service code)
|
||||
├── __init__.py ← package metadata and version string
|
||||
├── __main__.py ← entry point: argument parsing, logging
|
||||
│ initialisation, calls receiver.run_server()
|
||||
├── receiver.py ← HTTP webhook server (POST /alert)
|
||||
├── correlator.py ← reads Redis, classifies backup outcome
|
||||
├── repo_check.py ← probes restic repositories via runagent
|
||||
├── notifier.py ← builds and sends email notifications
|
||||
└── utils.py ← load_config(), setup_logging()
|
||||
│ matches alert names, spawns pipeline thread
|
||||
├── correlator.py ← reads NS8 Redis, classifies backup outcome
|
||||
├── repo_check.py ← probes restic repositories for health status
|
||||
├── notifier.py ← builds and sends the status email
|
||||
└── utils.py ← load_config() and setup_logging() helpers
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Runtime paths
|
||||
|
||||
The following paths are created by `deploy/install.sh` and assumed by the
|
||||
default configuration.
|
||||
Paths created by `deploy/install.sh` and assumed by the default configuration.
|
||||
|
||||
| Purpose | Path |
|
||||
|---------|------|
|
||||
| Purpose | Default path |
|
||||
|---------|-------------|
|
||||
| Python package | `/opt/ns8-backup-monitor/ns8_backup_monitor/` |
|
||||
| Deploy scripts | `/opt/ns8-backup-monitor/deploy/` |
|
||||
| Configuration | `/etc/ns8-backup-monitor/config.yml` |
|
||||
| Configuration file | `/etc/ns8-backup-monitor/config.yml` |
|
||||
| systemd unit | `/etc/systemd/system/ns8-backup-monitor.service` |
|
||||
| Log file | `/var/log/ns8-backup-monitor.log` |
|
||||
| NS8 Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
|
||||
| NS8 cluster Redis socket | `/var/lib/nethserver/cluster/state/redis.sock` |
|
||||
|
||||
---
|
||||
|
||||
## Requirements
|
||||
|
||||
| Dependency | Provided by | Notes |
|
||||
|------------|------------|-------|
|
||||
|------------|-------------|-------|
|
||||
| `python3` ≥ 3.8 | OS | Standard on AlmaLinux / Rocky 8+ |
|
||||
| `pyyaml` | `pip3 install pyyaml` | Only non-stdlib dependency |
|
||||
| `redis-cli` | NethServer 8 | Used via subprocess, no Python Redis client needed |
|
||||
| `runagent` | NethServer 8 | Required for `repo_check` only |
|
||||
| `redis-cli` | NethServer 8 | Accessed via subprocess; no Python Redis client needed |
|
||||
| `restic` | NethServer 8 / manual | Required for `repo_check` only |
|
||||
| `ns8-sendmail` | NethServer 8 | Required for email delivery |
|
||||
| `systemd` | OS | Service management |
|
||||
|
||||
> **This service must run on an NS8 leader node** (or any node that has
|
||||
> read access to the cluster Redis socket and `runagent` in `PATH`).
|
||||
> **This service must run on an NS8 leader node** (or any node with read access
|
||||
> to the cluster Redis Unix socket and `ns8-sendmail` in `PATH`).
|
||||
|
||||
---
|
||||
|
||||
@@ -128,7 +133,7 @@ bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/
|
||||
|
||||
The installer will:
|
||||
1. Check prerequisites (`python3`, `curl`, `tar`, `ns8-sendmail`).
|
||||
2. Download and extract the latest source archive from the Gitea repository.
|
||||
2. Download and extract the latest source from the Gitea repository.
|
||||
3. Prompt interactively for sender address, recipient list, and subject prefix.
|
||||
4. Write `/etc/ns8-backup-monitor/config.yml` with the supplied values.
|
||||
5. Install and start the systemd service.
|
||||
@@ -139,19 +144,13 @@ The installer will:
|
||||
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
|
||||
cd ns8-backup-monitor
|
||||
|
||||
# Install Python dependency
|
||||
pip3 install pyyaml
|
||||
|
||||
# Create directories
|
||||
mkdir -p /opt/ns8-backup-monitor /etc/ns8-backup-monitor
|
||||
|
||||
# Copy source and config template
|
||||
cp -r . /opt/ns8-backup-monitor/
|
||||
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
|
||||
# Edit the config before starting
|
||||
nano /etc/ns8-backup-monitor/config.yml
|
||||
nano /etc/ns8-backup-monitor/config.yml # edit before starting
|
||||
|
||||
# Install systemd unit
|
||||
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now ns8-backup-monitor
|
||||
@@ -161,188 +160,183 @@ systemctl enable --now ns8-backup-monitor
|
||||
|
||||
## Configuration
|
||||
|
||||
The configuration file is a YAML document. The installer writes it to
|
||||
`/etc/ns8-backup-monitor/config.yml`; a fully annotated template is available
|
||||
at `config/config.yml.example`.
|
||||
Full reference is in `config/config.yml.example`. Key parameters:
|
||||
|
||||
```yaml
|
||||
# ---------------------------------------------------------------------------
|
||||
# Email notification settings
|
||||
# ---------------------------------------------------------------------------
|
||||
# Delivery is handled by ns8-sendmail, which uses the SMTP relay already
|
||||
# configured in NethServer 8. No SMTP credentials are needed here.
|
||||
mail:
|
||||
# Envelope / header sender address.
|
||||
from: "ns8-backup-monitor@yourdomain.com"
|
||||
|
||||
# One or more recipient addresses. At least one is required.
|
||||
to:
|
||||
- "admin@yourdomain.com"
|
||||
|
||||
# String prepended to every email subject line.
|
||||
subject_prefix: "[NS8 Backup]"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Webhook receiver (HTTP server)
|
||||
# ---------------------------------------------------------------------------
|
||||
receiver:
|
||||
# Interface to listen on. 127.0.0.1 is recommended when Alertmanager
|
||||
# runs on the same host; use 0.0.0.0 only if it runs on a different node.
|
||||
host: "127.0.0.1"
|
||||
# TCP port. Must match the webhook URL configured in Alertmanager.
|
||||
port: 9099
|
||||
host: 127.0.0.1 # bind address (use 0.0.0.0 only for remote Alertmanager)
|
||||
port: 9099 # webhook listening port
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Timing
|
||||
# ---------------------------------------------------------------------------
|
||||
correlator:
|
||||
# Seconds to wait after receiving the alert before reading Redis.
|
||||
# This grace period allows all module agents to finish writing their
|
||||
# per-module status hashes. 30 s is sufficient for most deployments.
|
||||
wait_seconds: 30
|
||||
wait_seconds: 30 # wait after alert before reading Redis (allow slow modules to finish)
|
||||
recent_window: 3600 # fallback scan window (seconds) when neither the id nor the backup_id label is present
|
||||
|
||||
# Look-back window in seconds used when the alert does not include a
|
||||
# backup_id label. Any plan whose Redis status was updated within this
|
||||
# window is considered "recent" and included in the report.
|
||||
recent_window: 3600
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Redis connection
|
||||
# ---------------------------------------------------------------------------
|
||||
redis:
|
||||
# Path to the NS8 cluster Redis Unix socket.
|
||||
# On a standard NS8 installation this path never changes.
|
||||
socket: "/var/lib/nethserver/cluster/state/redis.sock"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Repository check (optional, uses runagent + restic)
|
||||
# ---------------------------------------------------------------------------
|
||||
repo_check:
|
||||
# Maximum seconds to wait for each repository check before giving up.
|
||||
timeout: 60
|
||||
# Extra flags passed verbatim to every restic invocation.
|
||||
# Example: "--cacert /etc/pki/tls/certs/ca-bundle.crt"
|
||||
restic_flags: ""
|
||||
enabled: true # set false to skip restic health checks entirely
|
||||
timeout: 60 # per-repository restic timeout (seconds)
|
||||
|
||||
notification:
|
||||
mail_from: ns8-backup-monitor@example.com
|
||||
mail_to:
|
||||
- admin@example.com
|
||||
- ops@example.com
|
||||
|
||||
redis:
|
||||
socket: /var/lib/nethserver/cluster/state/redis.sock
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Logging
|
||||
# ---------------------------------------------------------------------------
|
||||
logging:
|
||||
# Python log level: DEBUG, INFO, WARNING, ERROR.
|
||||
level: INFO
|
||||
# Absolute path for the rotating log file (5 MB × 3 backups).
|
||||
# Leave empty to log to stdout / journald only.
|
||||
file: "/var/log/ns8-backup-monitor.log"
|
||||
level: INFO # DEBUG for verbose output during troubleshooting
|
||||
file: /var/log/ns8-backup-monitor.log
|
||||
```
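
As a rough illustration of how the defaults shown above could be layered under a user's `config.yml`, here is a small deep-merge sketch. `DEFAULTS`, `merge_defaults`, and `validate_config` are hypothetical names; the real `load_config()` in `utils.py` may work differently, and YAML parsing with `pyyaml` is left out so the example stays standard-library only.

```python
import copy

# Default values taken from the key parameters listed in this section.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "repo_check": {"enabled": True, "timeout": 60},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "logging": {"level": "INFO", "file": "/var/log/ns8-backup-monitor.log"},
}


def merge_defaults(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of `defaults` with values from `user` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in (user or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


def validate_config(config: dict) -> None:
    """Reject a configuration that has no recipient address."""
    recipients = config.get("notification", {}).get("mail_to") or []
    if not recipients:
        raise ValueError("notification.mail_to must list at least one address")
```

With this approach a config file only needs to contain the keys that differ from the defaults, plus the mail settings.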
|
||||
|
||||
---
|
||||
|
||||
## Alertmanager integration
|
||||
|
||||
Add a receiver pointing to the service in your Alertmanager configuration:
|
||||
Add a receiver and route to your Alertmanager configuration:
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml (relevant excerpt)
|
||||
route:
|
||||
receiver: ns8-backup-monitor
|
||||
# Only route backup-related alerts to this receiver.
|
||||
routes:
|
||||
- match:
|
||||
alertname: NethServerBackupFailed
|
||||
receiver: ns8-backup-monitor
|
||||
# alertmanager.yml
|
||||
|
||||
receivers:
|
||||
- name: ns8-backup-monitor
|
||||
webhook_configs:
|
||||
- url: "http://127.0.0.1:9099/alert"
|
||||
# Send resolved alerts too so the service can log them.
|
||||
send_resolved: true
|
||||
- url: http://127.0.0.1:9099/alert
|
||||
send_resolved: false # resolved alerts are intentionally ignored
|
||||
|
||||
route:
|
||||
routes:
|
||||
- matchers:
|
||||
- alertname =~ "backup_failed|backup_missing|NsBackupFailed|NsBackupMissing"
|
||||
receiver: ns8-backup-monitor
|
||||
group_wait: 10s
|
||||
group_interval: 5m
|
||||
repeat_interval: 12h
|
||||
```
|
||||
|
||||
Reload Alertmanager after editing:
|
||||
### Supported alert names
|
||||
|
||||
```bash
|
||||
systemctl reload alertmanager
|
||||
# or, for the NS8 metrics module:
|
||||
runagent -m metrics1 systemctl reload alertmanager
|
||||
```
|
||||
| Alert name | Rule set | Trigger |
|------------|----------|---------|
| `backup_failed` | NS8 native (`node_backup_status`) | One or more plans reported `result != success` |
| `backup_missing` | NS8 native (`node_backup_status`) | Expected backup did not complete in time |
| `NsBackupFailed` | Custom / legacy | Same semantics as `backup_failed` |
| `NsBackupMissing` | Custom / legacy | Same semantics as `backup_missing` |
|
||||
|
||||
### Label mapping
|
||||
|
||||
| Label | Used by | Contains |
|-------|---------|----------|
| `id` | NS8 native alerts | Backup plan identifier |
| `backup_id` | Custom / legacy alerts | Backup plan identifier |
|
||||
|
||||
Both labels are checked. When neither is present the correlator falls back to
scanning Redis for all plan status keys updated within `correlator.recent_window`.
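
The fallback window can be pictured with a short sketch. It assumes each plan's status hash carries an ISO 8601 UTC `timestamp` and that the plan-to-timestamp mapping has already been read from Redis; `parse_timestamp` and `recent_plans` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2024-01-15T03:00:05Z."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def recent_plans(timestamps: dict, now: datetime, recent_window: int = 3600) -> list:
    """Return the plan IDs whose status was updated within the last `recent_window` seconds."""
    cutoff = now - timedelta(seconds=recent_window)
    return sorted(
        plan_id for plan_id, stamp in timestamps.items()
        if parse_timestamp(stamp) >= cutoff
    )
```

For example, with `now` at 04:00:00 UTC and the default window of 3600 seconds, a plan stamped 03:00:05 is included and one stamped a day earlier is not.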
|
||||
|
||||
---
|
||||
|
||||
## Outcome classification
|
||||
|
||||
For each backup plan the correlator reads all per-module status hashes and
|
||||
produces one of three outcomes:
|
||||
|
||||
| Outcome | Condition | Email subject |
|
||||
|---------|-----------|---------------|
|
||||
| `SUCCESS` | All modules finished with `result=success` | `✅ Backup completed` |
|
||||
| `PARTIAL` | At least one module succeeded, at least one failed | `⚠️ Backup partially failed` |
|
||||
| `REPO_FAILURE` | All modules failed **or** no status found in Redis | `❌ Backup failed` |
|
||||
| `SUCCESS` | `failed == 0` and `total > 0` | `[ns8-backup] SUCCESS - all N module(s) backed up successfully` |
|
||||
| `PARTIAL` | `0 < failed < total` | `[ns8-backup] PARTIAL - N/M module(s) failed` |
|
||||
| `REPO_FAILURE` | `failed == total` or `total == 0` | `[ns8-backup] REPO_FAILURE - <reason>` |
|
||||
|
||||
`REPO_FAILURE` covers both the case where all modules failed and the case where
no status was found in Redis at all (possible repository-level or scheduling
issue). The repository health check (`repo_check.py`) runs automatically for
`PARTIAL` and `REPO_FAILURE` outcomes to provide additional diagnostics.
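
The conditions in the table translate into a few comparisons. The sketch below is one way to write them; `classify` and `needs_repo_check` are hypothetical names, and `total` and `failed` are assumed to be the counts of modules found and modules with a non-success result.

```python
def classify(total: int, failed: int) -> str:
    """Map module counts to SUCCESS, PARTIAL, or REPO_FAILURE."""
    if total < 0 or failed < 0 or failed > total:
        raise ValueError("expected 0 <= failed <= total")
    # No status at all, or every module failed.
    if total == 0 or failed == total:
        return "REPO_FAILURE"
    if failed == 0:
        return "SUCCESS"
    # Remaining case: 0 < failed < total.
    return "PARTIAL"


def needs_repo_check(outcome: str) -> bool:
    """The repository health check is skipped on SUCCESS."""
    return outcome in {"PARTIAL", "REPO_FAILURE"}
```

Checking `total == 0` before `failed == 0` matters: an empty result set has zero failures but must not be reported as a success.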
|
||||
|
||||
---
|
||||
|
||||
## Redis key structure
|
||||
|
||||
The correlator reads two families of keys from the NS8 cluster Redis:
|
||||
Keys read by `correlator.py` and `repo_check.py`:
|
||||
|
||||
| Key pattern | Description |
|
||||
|-------------|-------------|
|
||||
| `cluster/backup/<backup_id>/status` | Plan-level status hash. Fields: `result`, `timestamp`, `errors` (integer count). |
|
||||
| `module/<module_id>/backup/<backup_id>/status` | Per-module status hash. Fields: `result`, `timestamp`, `error` (message string). |
|
||||
```
|
||||
cluster/backup/<backup_id>/status
|
||||
Hash fields:
|
||||
result "success" | "error"
|
||||
timestamp ISO 8601 (UTC)
|
||||
errors integer count of failed modules
|
||||
|
||||
`result` is either `"success"` or `"error"`. `timestamp` is an ISO 8601
|
||||
string in UTC (e.g. `2024-01-15T03:00:05Z`).
|
||||
module/<module_id>/backup/<backup_id>/status
|
||||
Hash fields:
|
||||
result "success" | "error"
|
||||
timestamp ISO 8601 (UTC)
|
||||
error human-readable error message (empty on success)
|
||||
|
||||
cluster/backup_repository/<repo_id>/parameters
|
||||
Hash fields:
|
||||
url cloud backend URL (S3, B2, rclone, ...)
|
||||
path local or SFTP path
|
||||
password restic repository password
|
||||
backend "s3" | "b2" | "sftp" | "rclone" | "local"
|
||||
aws_access_key_id S3 key ID (also used as b2_account_id for B2)
|
||||
aws_secret_access_key S3 secret (also used as b2_account_key for B2)
|
||||
rclone_config path to rclone configuration file
|
||||
```
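
Since Redis is accessed through `redis-cli` in a subprocess, reading one of these hashes might look roughly like the sketch below. The function names are hypothetical, and the parser assumes `redis-cli --raw HGETALL` prints each field and each value on its own line, which breaks down if a value (for example a multi-line error message) contains a newline.

```python
import subprocess

SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def module_status_key(module_id: str, backup_id: str) -> str:
    """Build the per-module status key shown above."""
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(output: str) -> dict:
    """Turn alternating field/value lines into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key: str, socket: str = SOCKET, timeout: int = 10) -> dict:
    """Run redis-cli against the Unix socket and parse the HGETALL reply."""
    result = subprocess.run(
        ["redis-cli", "-s", socket, "--raw", "HGETALL", key],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_hgetall(result.stdout)
```

A missing key produces empty output, so `read_hash` returns an empty dict in that case.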
|
||||
|
||||
---
|
||||
|
||||
## Service management
|
||||
|
||||
```bash
|
||||
# Check service status
|
||||
# Status
|
||||
systemctl status ns8-backup-monitor
|
||||
|
||||
# Follow live logs via journald
|
||||
journalctl -u ns8-backup-monitor -f
|
||||
|
||||
# Follow the rotating log file directly
|
||||
tail -f /var/log/ns8-backup-monitor.log
|
||||
|
||||
# Restart after a config change
|
||||
# Restart after config change
|
||||
systemctl restart ns8-backup-monitor
|
||||
|
||||
# Test the webhook endpoint manually
|
||||
# Live logs
|
||||
journalctl -u ns8-backup-monitor -f
|
||||
|
||||
# Follow live logs with DEBUG lines filtered out
|
||||
journalctl -u ns8-backup-monitor -f | grep -v DEBUG
|
||||
|
||||
# Test the webhook manually
|
||||
curl -s -X POST http://127.0.0.1:9099/alert \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"alerts":[{"status":"firing","labels":{"alertname":"NethServerBackupFailed"}}]}'
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"alerts":[{"status":"firing","labels":{"alertname":"backup_failed","id":"1","name":"Daily"}}]}'
|
||||
```
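
The same manual test can be scripted in Python when `curl` is not convenient. This is a sketch using only the standard library; `build_test_payload` and `post_alert` are hypothetical names, and the payload mirrors the labels used in the `curl` example above.

```python
import json
import urllib.request


def build_test_payload(alertname: str = "backup_failed",
                       plan_id: str = "1", name: str = "Daily") -> dict:
    """Build a minimal Alertmanager-style payload with one firing alert."""
    return {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": alertname, "id": plan_id, "name": name},
            }
        ]
    }


def post_alert(payload: dict, url: str = "http://127.0.0.1:9099/alert",
               timeout: int = 5) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `post_alert(build_test_payload())` on the host where the service runs should exercise the same code path as a real alert for plan `1`.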
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Service starts but no email is received
|
||||
### Pipeline does not trigger on automatic backups
|
||||
|
||||
1. Verify `ns8-sendmail` works independently:
|
||||
```bash
|
||||
echo 'Test' | ns8-sendmail -s 'Test' admin@yourdomain.com
|
||||
```
|
||||
2. Check `mail.to` in `/etc/ns8-backup-monitor/config.yml`.
|
||||
3. Increase log level to `DEBUG` and restart the service.
|
||||
**Symptom:** email arrives when you click "Run backup" from the NS8 UI but not
when the scheduled timer fires.
|
||||
|
||||
### `REPO_FAILURE` on every alert even though backups succeed
|
||||
**Cause (pre-fix):** the old `ALERT_NAMES` set contained only `NsBackupFailed`
and `NsBackupMissing`. NS8 native Prometheus rules emit `backup_failed` and
`backup_missing` instead. Additionally, native alerts carry the plan identifier
in the `id` label, not in `backup_id`.
|
||||
|
||||
- The correlator may be reading Redis before all modules have finished.
|
||||
Increase `correlator.wait_seconds` (e.g. to `60`).
|
||||
- Check that the Redis socket path is correct:
|
||||
`redis-cli -s /var/lib/nethserver/cluster/state/redis.sock PING`
|
||||
**Fix:** update to the current version; both alert name sets and both labels
are now handled automatically.
|
||||
|
||||
### Alertmanager does not reach the webhook
|
||||
**Verify:** run `journalctl -u ns8-backup-monitor -f` and trigger a test alert
with `amtool alert add alertname=backup_failed id=1`. With the log level set to
`DEBUG`, you should see the `Received alert:` line followed by the pipeline start.
|
||||
|
||||
- Confirm the service is listening:
|
||||
`ss -tlnp | grep 9099`
|
||||
- If Alertmanager runs on a different host, change `receiver.host` to
|
||||
`0.0.0.0` and open the port in the firewall.
|
||||
### No email received
|
||||
|
||||
1. Check that the service is running: `systemctl status ns8-backup-monitor`.
2. Verify that Alertmanager is routing to the correct receiver: `amtool config routes test alertname=backup_failed`.
3. Check that `mail_to` in `config.yml` is set and that `ns8-sendmail` works:
`echo "test" | ns8-sendmail --from test@localhost --to your@email --subject "test"`.
4. Increase the log level to `DEBUG` in `config.yml` and restart the service.
|
||||
|
||||
### REPO_FAILURE with no modules found
|
||||
|
||||
**Cause:** the correlator ran before the backup modules finished writing their
status to Redis.

**Fix:** increase `correlator.wait_seconds` in `config.yml`. A value of 60–120
seconds is safe for most clusters. Restart the service after changing the value.
|
||||
|
||||
---
|
||||
|
||||
@@ -352,11 +346,8 @@ curl -s -X POST http://127.0.0.1:9099/alert \
|
||||
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
|
||||
```
|
||||
|
||||
The script will stop and disable the service, remove the install directory,
|
||||
and optionally remove the configuration directory.
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
MIT — see [LICENSE](LICENSE) if present, otherwise contact the repository owner.
|
||||
MIT — see [LICENSE](LICENSE).
|
||||
|
||||