docs: rewrite README in English with full file layout, runtime paths, config reference, troubleshooting
@@ -1,143 +1,308 @@
|
|||||||
# ns8-backup-monitor
|
# ns8-backup-monitor
|
||||||
|
|
||||||
Backup monitoring system for **NethServer 8** built on three layers:
|
A lightweight webhook receiver for **NethServer 8** that intercepts Alertmanager backup failure alerts, enriches them with per-module status data from the cluster Redis, optionally checks repository health via `restic`, and delivers a detailed email notification through the mail relay configured in NS8.
|
||||||
|
|
||||||
1. **Trigger**: receives the alert from Prometheus/Alertmanager (`NsBackupFailed`, `NsBackupMissing`)
|
Unlike solutions that hook into `run-backup` (which only fires on manual UI launches), this service listens to the Alertmanager webhook channel — the same source used by the NS8 monitoring stack — and therefore captures **both manual and scheduled automatic backups**.
|
||||||
2. **Correlation**: queries the status of the plan and of the individual modules via the cluster Redis
|
|
||||||
3. **Classification**: distinguishes between total success, partial failure, and global repository failure
|
|
||||||
|
|
||||||
## Architecture
|
---
|
||||||
|
|
||||||
|
## Architecture overview
|
||||||
|
|
||||||
```
|
```
|
||||||
Alertmanager --webhook--> receiver.py
|
Alertmanager
|
||||||
|
|
│ POST /alert (NsBackupFailed | NsBackupMissing)
|
||||||
+--------------v--------------+
|
▼
|
||||||
| correlator.py | <- plan status + per-module status via Redis HGETALL
|
[receiver.py] HTTP webhook listener (localhost:9099)
|
||||||
+--------------+--------------+
|
│ waits N seconds for modules to settle
|
||||||
|
|
▼
|
||||||
+--------------v--------------+
|
[correlator.py] Reads Redis cluster state, classifies outcome
|
||||||
| repo_check.py | <- checks the destination repository (restic)
|
│ SUCCESS | PARTIAL | REPO_FAILURE
|
||||||
+--------------+--------------+
|
▼
|
||||||
|
|
[repo_check.py] (only on non-SUCCESS) Probes restic repos via runagent
|
||||||
+--------------v--------------+
|
▼
|
||||||
| notifier.py | <- single email with the classified outcome (HTML+text)
|
[notifier.py] Builds HTML/text email, sends via ns8-sendmail
|
||||||
+-----------------------------+
|
|
||||||
```
|
```
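The flow in the diagram can be sketched as a single function. This is a minimal illustration, not the project's actual code: `handle_alert` and its callable parameters are hypothetical names, and the only behaviour taken from the diagram is the order of the stages, the settle delay, and the fact that the repository probe runs only on a non-`SUCCESS` outcome.

```python
import time
from typing import Callable, Optional


def handle_alert(
    alert: dict,
    correlate: Callable[[dict], dict],
    check_repos: Callable[[dict], dict],
    notify: Callable[[dict, Optional[dict]], None],
    wait_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one alert through the stages and return the classified outcome.

    All callables are hypothetical stand-ins for correlator.py,
    repo_check.py and notifier.py.
    """
    # Give slow modules time to write their final status to Redis.
    sleep(wait_seconds)

    # The correlator is assumed to return a dict with an "outcome" key.
    result = correlate(alert)
    outcome = result["outcome"]

    # The repository probe only runs when something went wrong.
    repo_report = check_repos(result) if outcome != "SUCCESS" else None

    notify(result, repo_report)
    return outcome
```

Passing the stages in as callables keeps the sketch testable without Redis, `restic`, or a mail relay.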
|
||||||
|
|
||||||
## Classification logic
|
---
|
||||||
|
|
||||||
| Outcome | Condition |
|
## Requirements
|
||||||
|
|
||||||
|
| Dependency | Notes |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `SUCCESS` | All modules in the plan completed, no repository errors |
|
| NS8 leader or worker node | Must have access to the cluster Redis socket |
|
||||||
| `PARTIAL` | At least one module failed, repository reachable |
|
| `redis-cli` | Included in standard NS8 installations |
|
||||||
| `REPO_FAILURE` | No status found in Redis, or connection/write errors on the destination |
|
| `runagent` | NS8 binary used to invoke `restic` inside module containers |
|
||||||
|
| `ns8-sendmail` | NS8 mail relay script (invoked via `runagent`) |
|
||||||
|
| Python 3.8+ | Standard library only — no pip dependencies |
|
||||||
|
| Alertmanager | Must be configured to send webhooks to this service |
|
||||||
|
|
||||||
## Requirements
|
---
|
||||||
|
|
||||||
- NethServer 8 (leader node)
|
## File layout
|
||||||
- Python 3.9+
|
|
||||||
- `redis-cli` installed (`redis` package on Rocky Linux)
|
|
||||||
- `restic` installed and on the PATH (for `repo_check.py`)
|
|
||||||
- Local access to the NS8 cluster Redis via Unix socket
|
|
||||||
- `metrics1` configured with the Alertmanager webhook enabled and pointing to `http://localhost:9099/alert`
|
|
||||||
|
|
||||||
## File structure
|
|
||||||
|
|
||||||
```
|
```
|
||||||
ns8-backup-monitor/
|
ns8-backup-monitor/
|
||||||
├── README.md
|
│
|
||||||
├── install.sh
|
├── README.md ← This file
|
||||||
├── ns8_backup_monitor/
|
│
|
||||||
│ ├── __init__.py
|
├── config/
|
||||||
│ ├── __main__.py # entry point: python3 -m ns8_backup_monitor
|
│ └── config.yml.example ← Annotated configuration template
|
||||||
│ ├── receiver.py # HTTP webhook receiver (port 9099)
|
│
|
||||||
│ ├── correlator.py # cluster backup status correlation
|
|
||||||
│ ├── repo_check.py # destination repository check
|
|
||||||
│ ├── notifier.py # sends the email with the classified outcome
|
|
||||||
│ └── utils.py # config loading + logging setup
|
|
||||||
├── deploy/
|
├── deploy/
|
||||||
│ └── ns8-backup-monitor.service # systemd unit
|
│ ├── install.sh ← Interactive installer / uninstaller
|
||||||
└── config/
|
│ └── ns8-backup-monitor.service ← systemd unit file
|
||||||
└── config.yml.example
|
│
|
||||||
|
└── ns8_backup_monitor/ ← Python package (main application)
|
||||||
|
├── __init__.py ← Package marker, exposes version
|
||||||
|
├── __main__.py ← CLI entry point (`python3 -m ns8_backup_monitor`)
|
||||||
|
├── receiver.py ← HTTP webhook server (Alertmanager → pipeline)
|
||||||
|
├── correlator.py ← Redis reader and outcome classifier
|
||||||
|
├── repo_check.py ← restic repository health prober
|
||||||
|
├── notifier.py ← Email builder and sender
|
||||||
|
└── utils.py ← Config loader and logging setup
|
||||||
```
|
```
|
||||||
|
|
||||||
## Installation
|
### Runtime paths (after installation)
|
||||||
|
|
||||||
|
| Path | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `/opt/ns8-backup-monitor/` | Application root (Python package) |
|
||||||
|
| `/etc/ns8-backup-monitor/config.yml` | Active configuration file |
|
||||||
|
| `/etc/systemd/system/ns8-backup-monitor.service` | systemd unit |
|
||||||
|
| `/var/log/ns8-backup-monitor/` | Log directory (if file logging is enabled) |
|
||||||
|
| `/var/lib/nethserver/cluster/state/redis.sock` | NS8 cluster Redis socket (default) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
### Quick install (interactive)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 1. Clone the repository
|
bash <(curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh)
|
||||||
cd /opt
|
|
||||||
git clone https://repo.lelekaos.com/admin/ns8-backup-monitor.git
|
|
||||||
cd ns8-backup-monitor
|
|
||||||
|
|
||||||
# 2. Install Python dependencies
|
|
||||||
pip3 install pyyaml
|
|
||||||
|
|
||||||
# 3. Create the configuration
|
|
||||||
mkdir -p /etc/ns8-backup-monitor
|
|
||||||
cp config/config.yml.example /etc/ns8-backup-monitor/config.yml
|
|
||||||
# Edit /etc/ns8-backup-monitor/config.yml to set smtp, mail.to, etc.
|
|
||||||
|
|
||||||
# 4. Install and start the service
|
|
||||||
cp deploy/ns8-backup-monitor.service /etc/systemd/system/
|
|
||||||
systemctl daemon-reload
|
|
||||||
systemctl enable --now ns8-backup-monitor
|
|
||||||
|
|
||||||
# 5. Verify
|
|
||||||
systemctl status ns8-backup-monitor
|
|
||||||
journalctl -u ns8-backup-monitor -f
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Alertmanager configuration
|
> **Note:** Use `bash <(curl ...)` rather than `curl ... | bash`.
|
||||||
|
> The interactive installer reads answers from your terminal via `read`; piping stdin
|
||||||
|
> from curl breaks that interaction.
|
||||||
|
|
||||||
Add the receiver to `alertmanager.yml`:
|
### Non-interactive install (CI / automation)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -fsSL https://repo.lelekaos.com/admin/ns8-backup-monitor/raw/branch/main/deploy/install.sh \
|
||||||
|
| bash -s -- \
|
||||||
|
--from "backup@example.com" \
|
||||||
|
--to "admin@example.com"
|
||||||
|
```
|
||||||
|
|
||||||
|
### What the installer does
|
||||||
|
|
||||||
|
1. Copies the Python package to `/opt/ns8-backup-monitor/`
|
||||||
|
2. Writes `/etc/ns8-backup-monitor/config.yml` from the template
|
||||||
|
3. Installs and enables the systemd unit
|
||||||
|
4. Prints the Alertmanager webhook receiver URL
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Uninstallation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash /opt/ns8-backup-monitor/deploy/install.sh --uninstall
|
||||||
|
```
|
||||||
|
|
||||||
|
The uninstaller stops and removes the systemd unit, then optionally removes the configuration directory.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
The active configuration file is `/etc/ns8-backup-monitor/config.yml`.
|
||||||
|
Edit it directly and restart the service to apply changes.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
nano /etc/ns8-backup-monitor/config.yml
|
||||||
|
systemctl restart ns8-backup-monitor
|
||||||
|
```
|
||||||
|
|
||||||
|
See `config/config.yml.example` for a fully annotated reference with all available options.
|
||||||
|
|
||||||
|
### Key sections
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# ── Mail settings ─────────────────────────────────────────────
|
||||||
|
mail:
|
||||||
|
from: "backup@ns02.example.com" # Envelope From address
|
||||||
|
to:
|
||||||
|
- "admin@example.com" # One or more recipient addresses
|
||||||
|
subject_prefix: "[NS8 Backup]" # Prepended to every subject line
|
||||||
|
|
||||||
|
# ── Webhook receiver ──────────────────────────────────────────
|
||||||
|
receiver:
|
||||||
|
host: "127.0.0.1" # Bind address (keep localhost unless Alertmanager is remote)
|
||||||
|
port: 9099 # Must match the Alertmanager webhook URL
|
||||||
|
|
||||||
|
# ── Correlator behaviour ─────────────────────────────────────
|
||||||
|
correlator:
|
||||||
|
wait_seconds: 30 # Seconds to wait after alert before reading Redis
|
||||||
|
# (allows slow modules to write their final status)
|
||||||
|
recent_window: 3600 # When no backup_id label is present, scan Redis for
|
||||||
|
# plan status keys updated within this many seconds
|
||||||
|
|
||||||
|
# ── Redis connection ─────────────────────────────────────────
|
||||||
|
redis:
|
||||||
|
socket: "/var/lib/nethserver/cluster/state/redis.sock"
|
||||||
|
|
||||||
|
# ── Repository health check ──────────────────────────────────
|
||||||
|
repo_check:
|
||||||
|
enabled: true
|
||||||
|
timeout: 60 # Seconds per restic check call
|
||||||
|
|
||||||
|
# ── Logging ──────────────────────────────────────────────────
|
||||||
|
logging:
|
||||||
|
level: "INFO" # DEBUG | INFO | WARNING | ERROR
|
||||||
|
file: "" # Leave empty to log to stdout (journald captures it)
|
||||||
|
```
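One way to keep the configuration file short is to merge it over a table of defaults after parsing. The sketch below assumes the values shown in the example above are the defaults (the README only states this for the Redis socket and for `wait_seconds`), and `merge_config` is a hypothetical helper, not necessarily what `utils.py` does. It works on an already-parsed dictionary and does not read YAML itself.

```python
import copy

# Assumed defaults, copied from the annotated example above.
DEFAULTS = {
    "receiver": {"host": "127.0.0.1", "port": 9099},
    "correlator": {"wait_seconds": 30, "recent_window": 3600},
    "redis": {"socket": "/var/lib/nethserver/cluster/state/redis.sock"},
    "repo_check": {"enabled": True, "timeout": 60},
    "logging": {"level": "INFO", "file": ""},
}


def merge_config(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a new dict with user settings layered over the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that one overridden key keeps its sibling defaults.
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = value
    return merged
```

With this approach a file that only sets `receiver.port` still ends up with a bind address and every other section populated.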
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Alertmanager integration
|
||||||
|
|
||||||
|
Add a receiver to your Alertmanager configuration on the NS8 leader node
|
||||||
|
(`/etc/alertmanager/alertmanager.yml` or via the NS8 `metrics1` module):
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
receivers:
|
receivers:
|
||||||
- name: ns8-backup-monitor
|
- name: ns8-backup-monitor
|
||||||
webhook_configs:
|
webhook_configs:
|
||||||
- url: 'http://localhost:9099/alert'
|
- url: "http://127.0.0.1:9099/alert"
|
||||||
send_resolved: true
|
send_resolved: false
|
||||||
|
|
||||||
route:
|
route:
|
||||||
receiver: ns8-backup-monitor
|
receiver: ns8-backup-monitor
|
||||||
matchers:
|
group_by: [alertname]
|
||||||
- alertname =~ "NsBackupFailed|NsBackupMissing"
|
group_wait: 10s
|
||||||
|
group_interval: 5m
|
||||||
|
repeat_interval: 12h
|
||||||
|
routes:
|
||||||
|
- match_re:
|
||||||
|
alertname: "NsBackupFailed|NsBackupMissing"
|
||||||
|
receiver: ns8-backup-monitor
|
||||||
```
|
```
|
||||||
|
|
||||||
Restart Alertmanager after the change:
|
The service handles two alert names:
|
||||||
```bash
|
|
||||||
systemctl restart alertmanager
|
|
||||||
```
|
|
||||||
|
|
||||||
## Backends supported by repo_check
|
| Alert name | Meaning |
|
||||||
|
|---|---|
|
||||||
|
| `NsBackupFailed` | One or more backup modules reported an error |
|
||||||
|
| `NsBackupMissing` | Expected backup did not run within the time window |
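A receiver can select the alerts it cares about from the Alertmanager webhook body with a small filter. This is a sketch under assumptions: `relevant_alerts` is a hypothetical name, and restricting the match to `firing` alerts is a guess based on `send_resolved: false` in the configuration above. The payload shape (`alerts`, `status`, `labels.alertname`) follows the test payload used elsewhere in this README.

```python
HANDLED_ALERTS = {"NsBackupFailed", "NsBackupMissing"}


def relevant_alerts(payload: dict) -> list:
    """Return the firing alerts whose alertname this service handles."""
    return [
        alert
        for alert in payload.get("alerts", [])
        # Assumption: resolved notifications are ignored.
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname") in HANDLED_ALERTS
    ]
```

Using `.get()` throughout means a malformed or empty payload yields an empty list instead of raising.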
|
||||||
|
|
||||||
`repo_check.py` reads the credentials directly from Redis and sets the environment variables that `restic` needs:
|
---
|
||||||
|
|
||||||
| Backend | Redis fields read | Env vars set |
|
## Outcome classification
|
||||||
|
|
||||||
|
After reading per-module Redis keys, the correlator assigns one of three outcomes:
|
||||||
|
|
||||||
|
| Outcome | Condition | Email subject |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `local` / `fs` | `url` or `path` | `RESTIC_PASSWORD` |
|
| `SUCCESS` | All modules succeeded | ✅ Backup completed successfully |
|
||||||
| `s3` / `aws` | `url`, `aws_access_key_id`, `aws_secret_access_key` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` |
|
| `PARTIAL` | Some modules failed, some succeeded | ⚠️ Backup partially failed |
|
||||||
| `b2` / `backblaze` | `url`, `b2_account_id`, `b2_account_key` | `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY` |
|
| `REPO_FAILURE` | All modules failed, or no status found in Redis | ❌ Backup failed – possible repository error |
|
||||||
| `sftp` | `url` (format `sftp:host:path`) | `RESTIC_PASSWORD` |
|
|
||||||
| `rclone` | `url`, `rclone_config` | `RCLONE_CONFIG` |
|
|
||||||
|
|
||||||
## Debugging / manual testing
|
On `PARTIAL` or `REPO_FAILURE`, the repo health check runs automatically and appends
|
||||||
|
diagnostic information (restic error output) to the email.
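The three-way rule in the outcome table can be written as a short function. This is a minimal sketch, assuming the correlator has already reduced each module's Redis hash to its `result` value; `classify` and the shape of `module_results` are hypothetical.

```python
def classify(module_results: dict) -> str:
    """Map {module_id: "success" | "error"} to an outcome name."""
    # No status found in Redis at all.
    if not module_results:
        return "REPO_FAILURE"

    failed = [m for m, result in module_results.items() if result != "success"]

    if not failed:
        return "SUCCESS"
    # Every module failed: the repository itself is the likely cause.
    if len(failed) == len(module_results):
        return "REPO_FAILURE"
    return "PARTIAL"
```

Any value other than `success` is counted as a failure here, so an unexpected or missing `result` is not silently reported as a success.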
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Redis key structure
|
||||||
|
|
||||||
|
The correlator reads the following NS8 Redis key patterns:
|
||||||
|
|
||||||
|
```
|
||||||
|
cluster/backup/<backup_id>/status → overall plan status (hash)
|
||||||
|
module/<module_id>/backup/<backup_id>/status → per-module status (hash)
|
||||||
|
```
|
||||||
|
|
||||||
|
Hash fields:
|
||||||
|
|
||||||
|
| Field | Values | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| `result` | `success` / `error` | Outcome of the backup operation |
|
||||||
|
| `timestamp` | ISO 8601 | When the status was last written |
|
||||||
|
| `error` | string | Error message, if any |
|
||||||
|
| `errors` | integer | Number of module errors (plan-level hash only) |
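Since the service relies on `redis-cli` rather than a Python Redis client, reading one of these hashes comes down to running `HGETALL` and pairing up the output lines. The helpers below are a sketch with hypothetical names. The sketch assumes that `redis-cli`, when its output is captured rather than sent to a terminal, prints each field and each value on its own line, and that no value contains a newline.

```python
import subprocess


def plan_key(backup_id: str) -> str:
    return f"cluster/backup/{backup_id}/status"


def module_key(module_id: str, backup_id: str) -> str:
    return f"module/{module_id}/backup/{backup_id}/status"


def parse_hgetall(output: str) -> dict:
    """Pair alternating field/value lines into a dict."""
    lines = output.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))


def read_hash(key: str, socket_path: str) -> dict:
    """Run `redis-cli -s <socket> HGETALL <key>` and parse the reply."""
    proc = subprocess.run(
        ["redis-cli", "-s", socket_path, "HGETALL", key],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return parse_hgetall(proc.stdout)
```

A key that does not exist produces empty output and therefore an empty dictionary, which a caller can treat as "no status found".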
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Service management
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Test the correlator (without sending email)
|
# Check service status
|
||||||
python3 -c "
|
systemctl status ns8-backup-monitor
|
||||||
import json
|
|
||||||
from ns8_backup_monitor.utils import load_config
|
|
||||||
from ns8_backup_monitor.correlator import correlate_backup_status
|
|
||||||
cfg = load_config()
|
|
||||||
print(json.dumps(correlate_backup_status(cfg), indent=2))
|
|
||||||
"
|
|
||||||
|
|
||||||
# Send a simulated webhook
|
# View live logs
|
||||||
curl -s -X POST http://localhost:9099/alert \
|
journalctl -u ns8-backup-monitor -f
|
||||||
-H 'Content-Type: application/json' \
|
|
||||||
-d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed"}}]}'
|
|
||||||
|
|
||||||
# Check the logs
|
# Restart after config change
|
||||||
journalctl -u ns8-backup-monitor --since '1 hour ago'
|
systemctl restart ns8-backup-monitor
|
||||||
|
|
||||||
|
# Disable on boot
|
||||||
|
systemctl disable ns8-backup-monitor
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Service fails to start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
journalctl -u ns8-backup-monitor --no-pager -n 50
|
||||||
|
```
|
||||||
|
|
||||||
|
Common causes:
|
||||||
|
- `config.yml` not found at the expected path → check `/etc/ns8-backup-monitor/config.yml`
|
||||||
|
- Port 9099 already in use → change `receiver.port` in config
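To confirm the second cause before editing the configuration, you can try to bind the address yourself while the service is stopped. This is a quick diagnostic sketch, not part of the package; `port_is_free` is a hypothetical helper and the defaults mirror `receiver.host` and `receiver.port` from the example configuration.

```python
import socket


def port_is_free(host: str = "127.0.0.1", port: int = 9099) -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


if __name__ == "__main__":
    print("free" if port_is_free() else "in use")
```

The check is only meaningful with `ns8-backup-monitor` stopped, because the service itself holds the port while it is running.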
|
||||||
|
|
||||||
|
### No email received after a backup failure
|
||||||
|
|
||||||
|
1. Verify Alertmanager is firing the webhook:
|
||||||
|
```bash
|
||||||
|
journalctl -u ns8-backup-monitor -f
|
||||||
|
```
|
||||||
|
You should see `Received N relevant alert(s)` within a minute of the backup failure.
|
||||||
|
|
||||||
|
2. Check that `wait_seconds` has elapsed (default 30 s) and look for `Sending notification...` in the log.
|
||||||
|
|
||||||
|
3. Verify the mail relay works independently:
|
||||||
|
```bash
|
||||||
|
echo "Test" | runagent ns8-sendmail -s "test" admin@example.com
|
||||||
|
```
|
||||||
|
|
||||||
|
### Correlator finds no modules
|
||||||
|
|
||||||
|
If the log shows `No recent backup status keys found in Redis`, possible causes:
|
||||||
|
- `recent_window` is too short — the backup ran more than 1 hour ago
|
||||||
|
- Redis socket path is wrong for your installation
|
||||||
|
- The backup plan wrote status to a non-standard key pattern
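For the first cause, it helps to see how a `timestamp` field compares against `recent_window`. The function below is an illustrative sketch with a hypothetical name, not the correlator's actual check. It assumes timestamps without an offset are UTC, and note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onwards.

```python
from datetime import datetime, timezone
from typing import Optional


def is_recent(
    timestamp: str,
    recent_window: int = 3600,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if an ISO 8601 timestamp is within recent_window seconds of now."""
    written = datetime.fromisoformat(timestamp)
    if written.tzinfo is None:
        # Assumption: naive timestamps are UTC.
        written = written.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - written).total_seconds() <= recent_window
```

If the alert regularly arrives more than an hour after the plan finishes, raising `correlator.recent_window` in `config.yml` is the corresponding fix.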
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Development
|
||||||
|
|
||||||
|
The application is pure Python 3 with no third-party dependencies.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Run locally (requires NS8 Redis socket access)
|
||||||
|
python3 -m ns8_backup_monitor --config ./config/config.yml.example
|
||||||
|
|
||||||
|
# Send a test webhook payload
|
||||||
|
curl -s -X POST http://127.0.0.1:9099/alert \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"alerts":[{"status":"firing","labels":{"alertname":"NsBackupFailed","backup_id":"1"}}]}'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
MIT License — contributions welcome via pull request.
|
||||||
|
|||||||