# ns8-backup-monitor
A lightweight webhook receiver for **NethServer 8** that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via `restic`, and delivers a detailed email notification through the NS8 configured mail relay.
Unlike solutions that hook into `run-backup` (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel — the same source used by the NS8 monitoring stack — and therefore captures **both manual and scheduled automatic backups**.
---
## Architecture overview
```
Alertmanager
    │  POST /alert  (NsBackupFailed | NsBackupMissing)
    ▼
[receiver.py]    HTTP webhook listener (localhost:9099)
    │  waits N seconds for modules to settle
    ▼
[correlator.py]  Reads Redis cluster state, classifies outcome
    │  SUCCESS | PARTIAL | REPO_FAILURE
    ▼
[repo_check.py]  (only on non-SUCCESS) Probes restic repos via runagent
    │
    ▼
[notifier.py]    Builds HTML/text email, sends via ns8-sendmail
```
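
The flow in the diagram can be sketched as a single function that calls each stage in turn. This is only an illustration under stated assumptions: `run_pipeline` and its `correlate`, `check_repos` and `notify` parameters are hypothetical stand-ins for the real modules, whose actual interfaces are not documented here. The sketch shows the two behaviours the diagram describes: the wait before reading Redis, and the repository probe that only runs on a non-`SUCCESS` outcome.

```python
import time


def run_pipeline(alerts, correlate, check_repos, notify,
                 wait_seconds=30, sleep=time.sleep):
    """Run the four stages in order for one batch of alerts.

    correlate, check_repos and notify are hypothetical callables standing in
    for correlator.py, repo_check.py and notifier.py.
    """
    # Give slow modules time to write their final status to Redis
    sleep(wait_seconds)

    # Expected to return (outcome, per-module results)
    outcome, modules = correlate(alerts)

    # The repository probe only runs when something went wrong
    diagnostics = None
    if outcome != "SUCCESS":
        diagnostics = check_repos(modules)

    notify(outcome, modules, diagnostics)
    return outcome
```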
---
## Requirements
| Dependency | Notes |
|---|---|
| NS8 leader or worker node | Must have access to the cluster Redis socket |
| `redis-cli` | Included in standard NS8 installations |
| `runagent` | NS8 binary used to invoke `restic` inside module containers |
| `ns8-sendmail` | NS8 mail relay script (invoked via `runagent`) |
| Python 3.8+ | Standard library only — no pip dependencies |
| Alertmanager | Must be configured to send webhooks to this service |
---
## File layout
```
ns8-backup-monitor/
├── README.md                        ← This file
├── config/
│   └── config.yml.example           ← Annotated configuration template
├── deploy/
│   ├── install.sh                   ← Interactive installer / uninstaller
│   └── ns8-backup-monitor.service   ← systemd unit file
└── ns8_backup_monitor/              ← Python package (main application)
    ├── __init__.py                  ← Package marker, exposes version
    ├── __main__.py                  ← CLI entry point (`python3 -m ns8_backup_monitor`)
    ├── receiver.py                  ← HTTP webhook server (Alertmanager → pipeline)
    ├── correlator.py                ← Redis reader and outcome classifier
    ├── repo_check.py                ← restic repository health prober
    ├── notifier.py                  ← Email builder and sender
    └── utils.py                     ← Config loader and logging setup
```
### Runtime paths (after installation)
| Path | Purpose |
|---|---|
| `/opt/ns8-backup-monitor/` | Application root (Python package) |
| `/etc/ns8-backup-monitor/config.yml` | Active configuration file |
| `/etc/systemd/system/ns8-backup-monitor.service` | systemd unit |
| `/var/log/ns8-backup-monitor/` | Log directory (if file logging is enabled) |
| `/var/lib/nethserver/cluster/state/redis.sock` | NS8 cluster Redis socket (default) |
---
## Installation
### Quick install (interactive)
```bash
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
```
> **Note:** Use `bash <(curl ...)` rather than `curl ... | bash`.
> The interactive installer reads answers from your terminal via `read`; piping stdin
> from curl breaks that interaction.
### Non-interactive install (CI / automation)
```bash
curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \
  | bash -s -- \
      --from "backup@example.com" \
      --to "admin@example.com"
```
### What the installer does
1. Copies the Python package to `/opt/ns8-backup-monitor/`
2. Writes `/etc/ns8-backup-monitor/config.yml` from the template
3. Installs and enables the systemd unit
4. Prints the Alertmanager webhook receiver URL
---
## Uninstallation
```bash
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
```
The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory.
---
## Configuration
The active configuration file is `/etc/ns8-backup-monitor/config.yml`.
Edit it directly and restart the service to apply changes.
```bash
nano /etc/ns8-backup-monitor/config.yml
systemctl restart ns8-backup-monitor
```
See `config/config.yml.example` for a fully annotated reference with all available options.
### Key sections
```yaml
# ── Mail settings ─────────────────────────────────────────────
mail:
  from: "backup@ns02.example.com"   # Envelope From address
  to:
    - "admin@example.com"           # One or more recipient addresses
  subject_prefix: "[NS8 Backup]"    # Prepended to every subject line

# ── Webhook receiver ──────────────────────────────────────────
receiver:
  host: "127.0.0.1"   # Bind address (keep localhost unless Alertmanager is remote)
  port: 9099          # Must match the Alertmanager webhook URL

# ── Correlator behaviour ─────────────────────────────────────
correlator:
  wait_seconds: 30      # Seconds to wait after alert before reading Redis
                        # (allows slow modules to write their final status)
  recent_window: 3600   # When no backup_id label is present, scan Redis for
                        # plan status keys updated within this many seconds

# ── Redis connection ─────────────────────────────────────────
redis:
  socket: "/var/lib/nethserver/cluster/state/redis.sock"

# ── Repository health check ──────────────────────────────────
repo_check:
  enabled: true
  timeout: 60   # Seconds per restic check call

# ── Logging ──────────────────────────────────────────────────
logging:
  level: "INFO"   # DEBUG | INFO | WARNING | ERROR
  file: ""        # Leave empty to log to stdout (journald captures it)
```
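
As an illustration of how a partially filled configuration might be combined with defaults, here is a minimal sketch. It assumes the file has already been parsed into a nested `dict`; how the service parses YAML with the standard library alone is not described in this README. The `DEFAULTS` values are copied from the example above, but whether the service really falls back to them is an assumption, and `merge_config` is a hypothetical helper name.

```python
import copy

# Values copied from the example configuration above (assumed to be defaults)
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"enabled": True, "timeout": 60},
    "logging": {"level": "INFO", "file": ""},
}


def merge_config(user, defaults=DEFAULTS):
    """Overlay user settings on the defaults, one section at a time."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for section, values in user.items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)  # keep defaults for unset keys
        else:
            merged[section] = values  # sections without defaults, e.g. "mail"
    return merged


config = merge_config({
    "mail": {"from": "backup@example.com", "to": ["admin@example.com"]},
    "receiver": {"port": 9100},
})
```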
---
## Alertmanager integration
Add a receiver to your Alertmanager configuration on the NS8 leader node
(`/etc/alertmanager/alertmanager.yml` or via the NS8 `metrics1` module):
```yaml
receivers:
  - name: ns8-backup-monitor
    webhook_configs:
      - url: "http://127.0.0.1:9099/alert"
        send_resolved: false
route:
  receiver: ns8-backup-monitor
  group_by: [alertname]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match_re:
        alertname: "NsBackupFailed|NsBackupMissing"
      receiver: ns8-backup-monitor
```
The service handles two alert names:
| Alert name | Meaning |
|---|---|
| `NsBackupFailed` | One or more backup modules reported an error |
| `NsBackupMissing` | Expected backup did not run within the time window |
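
Alertmanager delivers alerts as a JSON document with an `alerts` list, where each entry carries a `status` and a `labels` map. The sketch below shows one way to pick out the alerts this service cares about. The function name `relevant_alerts` is hypothetical, and skipping non-firing alerts is an assumption based on `send_resolved: false` in the configuration above, not a documented behaviour of `receiver.py`.

```python
HANDLED_ALERTS = {"NsBackupFailed", "NsBackupMissing"}


def relevant_alerts(payload):
    """Return the firing alerts in a webhook payload that match a handled name."""
    selected = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        is_firing = alert.get("status") == "firing"
        if is_firing and labels.get("alertname") in HANDLED_ALERTS:
            selected.append(alert)
    return selected


payload = {
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "NsBackupFailed", "backup_id": "1"}},
        {"status": "firing",
         "labels": {"alertname": "SomeOtherAlert"}},  # ignored: not handled
    ]
}
matched = relevant_alerts(payload)
```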
---
## Outcome classification
After reading per-module Redis keys, the correlator assigns one of three outcomes:
| Outcome | Condition | Email subject |
|---|---|---|
| `SUCCESS` | All modules succeeded | ✅ Backup completed successfully |
| `PARTIAL` | Some modules failed, some succeeded | ⚠️ Backup partially failed |
| `REPO_FAILURE` | All modules failed, or no status found in Redis | ❌ Backup failed, possible repository error |
On `PARTIAL` or `REPO_FAILURE`, the repo health check runs automatically and appends
diagnostic information (restic error output) to the email.
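
The three conditions in the table can be written as a small function. This is a sketch of the rule as stated here, not the code in `correlator.py`; the `classify` name and the shape of its input (a mapping from module id to the `result` value) are assumptions made for the example.

```python
def classify(module_results):
    """Map per-module results ('success' or 'error') to an outcome name."""
    # No status found in Redis at all is treated like a total failure
    if not module_results:
        return "REPO_FAILURE"

    failed = [m for m, result in module_results.items() if result != "success"]
    if not failed:
        return "SUCCESS"
    if len(failed) == len(module_results):
        return "REPO_FAILURE"  # every module failed: suspect the repository
    return "PARTIAL"
```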
---
## Redis key structure
The correlator reads the following NS8 Redis key patterns:
```
cluster/backup/<backup_id>/status → overall plan status (hash)
module/<module_id>/backup/<backup_id>/status → per-module status (hash)
```
Hash fields:
| Field | Values | Description |
|---|---|---|
| `result` | `success` / `error` | Outcome of the backup operation |
| `timestamp` | ISO 8601 | When the status was last written |
| `error` | string | Error message, if any |
| `errors` | integer | Number of module errors (plan-level hash only) |
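
To make the key patterns concrete, the sketch below builds both key names and reads a hash through `redis-cli`, which prints field names and values on alternating lines when its output is not a terminal. All function names are hypothetical. The sketch also assumes the socket accepts unauthenticated connections and that values contain no newlines; an installation that requires Redis credentials would need extra arguments.

```python
import subprocess

DEFAULT_SOCKET = "/var/lib/nethserver/cluster/state/redis.sock"


def plan_status_key(backup_id):
    return f"cluster/backup/{backup_id}/status"


def module_status_key(module_id, backup_id):
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(output):
    """Turn raw HGETALL output (alternating field and value lines) into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key, socket_path=DEFAULT_SOCKET, timeout=10):
    """Read one hash via redis-cli over the Unix socket (-s)."""
    result = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_hgetall(result.stdout)
```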
---
## Service management
```bash
# Check service status
systemctl status ns8-backup-monitor
# View live logs
journalctl -u ns8-backup-monitor -f
# Restart after config change
systemctl restart ns8-backup-monitor
# Disable on boot
systemctl disable ns8-backup-monitor
```
---
## Troubleshooting
### Service fails to start
```bash
journalctl -u ns8-backup-monitor --no-pager -n 50
```
Common causes:
- `config.yml` not found at the expected path → check `/etc/ns8-backup-monitor/config.yml`
- Port 9099 already in use → change `receiver.port` in config
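
To confirm the second cause, a quick probe can try to bind the address itself. This is a small diagnostic sketch, not part of the package; `port_is_free` is a hypothetical name. Run it while the service is stopped, otherwise the service's own listener makes the port look busy.

```python
import socket


def port_is_free(host="127.0.0.1", port=9099):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


if __name__ == "__main__":
    print("9099 free" if port_is_free() else "9099 in use")
```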
### No email received after a backup failure
1. Verify Alertmanager is firing the webhook:
   ```bash
   journalctl -u ns8-backup-monitor -f
   ```
   You should see `Received N relevant alert(s)` within a minute of the backup failure.
2. Check that `wait_seconds` has elapsed (default 30 s) and look for `Sending notification...` in the log.
3. Verify the mail relay works independently:
   ```bash
   echo "Test" | runagent ns8-sendmail -s "test" admin@example.com
   ```
### Correlator finds no modules
If the log shows `No recent backup status keys found in Redis`, possible causes:
- `recent_window` is too short: with the default of 3600, a backup that finished more than 1 hour ago is ignored
- Redis socket path is wrong for your installation
- The backup plan wrote status to a non-standard key pattern
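
The first cause is easier to reason about with the window check written out. The sketch below tests whether an ISO 8601 `timestamp` value falls inside `recent_window`. The function name `is_recent` is hypothetical, and treating timestamps without an offset as UTC is an assumption, since the README does not say which timezone the status hashes use.

```python
from datetime import datetime, timezone


def is_recent(timestamp, now, recent_window=3600):
    """True if timestamp lies within recent_window seconds before now."""
    # datetime.fromisoformat() on Python 3.8 rejects a trailing "Z"
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    age = (now - parsed).total_seconds()
    # Timestamps in the future (negative age) are not counted as recent
    return 0 <= age <= recent_window


now = datetime(2024, 5, 1, 3, 0, 0, tzinfo=timezone.utc)
half_hour_old = is_recent("2024-05-01T02:30:00Z", now)
too_old = is_recent("2024-05-01T01:59:59+00:00", now)
```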
---
## Development
The application is pure Python 3 with no third-party dependencies.
```bash
# Run locally (requires NS8 Redis socket access)
python3 -m ns8_backup_monitor --config ./config/config.yml.example
# Send a test webhook payload
curl -s -X POST http://127.0.0.1:9099/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}'
```
---
## License
MIT License — contributions welcome via pull request.