Whitepaper
The Problem
Current observability solutions split along two failing axes:
- Open-source stacks (Prometheus + Grafana + Loki) demand 5+ services, 8 GB+ RAM, and weeks of integration. They answer "What is broken?" via static thresholds — never "When will it break?"
- Enterprise SaaS (Datadog, Dynatrace) provide black-box AI at prohibitive cost with opaque decision logic.
Neither predicts. Neither explains.
The Ruptura Approach
Ruptura treats infrastructure as a living organism — measuring vital signs, behaviours, stress responses, and social dynamics through 8 auditable composite signals, an adaptive ensemble of 5 prediction models, and a dual-scale acceleration detector.
Rupture Index™ — the core prediction metric
R(t) = |α_burst(t)| / max(|α_stable(t)|, ε)
Two ILR windows run in parallel per metric:
| Window | Size | Captures |
|---|---|---|
| ILR_stable | 60 min | Long-term baseline: what is normal? |
| ILR_burst | 5 min | Short-term acceleration: is something diverging? |
When R > 3, the metric is accelerating more than 3× faster than its own baseline, and Ruptura raises a warning before the metric reaches 80% saturation.
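The ratio can be sketched in a few lines of Python. This is an illustration only: the text above does not define α, so the sketch assumes it is the least-squares slope of the metric over each window, with one sample per minute. The names `slope` and `rupture_index` are hypothetical.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def rupture_index(samples: Sequence[float], stable_n: int = 60,
                  burst_n: int = 5, eps: float = 1e-6) -> float:
    """R = |alpha_burst| / max(|alpha_stable|, eps).

    Assumption: alpha is the slope over the last stable_n / burst_n samples.
    """
    alpha_stable = slope(samples[-stable_n:])
    alpha_burst = slope(samples[-burst_n:])
    return abs(alpha_burst) / max(abs(alpha_stable), eps)


# A steady ramp: both windows see the same slope, so R stays near 1.
steady = [0.1 * i for i in range(60)]

# The same ramp with a sharp climb in the last five samples.
diverging = [0.1 * i for i in range(55)] + [5.4 + 2.0 * k for k in range(1, 6)]

r_steady = rupture_index(steady)
r_diverging = rupture_index(diverging)
```

Under these assumptions the steady ramp gives R close to 1, while the diverging series crosses the R > 3 warning level because its 5-minute slope is many times its 60-minute slope.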
Why ILR over LSTM?
| Model | MAE | RAM | Inference | Efficiency |
|---|---|---|---|---|
| LSTM | 2.0% | 200+ MB | 500 ms | < 0.0001 |
| ARIMA | 4.1% | 85 MB | 210 ms | 0.0001 |
| ILR (Ruptura) | 6.2% | 0.5 MB | 0.8 ms | 1,550× |
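Relative to ARIMA, ILR trades 2.1 percentage points of MAE (6.2% vs 4.1%) for 170× less RAM and roughly 262× faster inference, making it 1,550× more efficient than ARIMA. Relative to LSTM, the trade is 4.2 points of MAE for at least 400× less RAM and 625× faster inference. Validated on a Raspberry Pi 4 over 40,320 samples.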
8 Composite Signals
All inputs are normalised to [0, 1]. Formulas are versioned release artifacts — every coefficient is auditable.
1. Stress
stress(t) = 0.3·CPU(t) + 0.2·RAM(t) + 0.2·Latency(t) + 0.2·Errors(t) + 0.1·Timeouts(t)
| Value | State |
|---|---|
| < 0.3 | Calm |
| 0.3–0.6 | Nervous |
| 0.6–0.8 | Stressed |
| ≥ 0.8 | Panic |
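A minimal sketch of the stress formula and its state table follows. The function names are hypothetical. The table's ranges share their endpoints, so assigning a boundary value such as 0.6 to the higher state is an assumption.

```python
def stress(cpu: float, ram: float, latency: float,
           errors: float, timeouts: float) -> float:
    """Weighted sum of five inputs, each already normalised to [0, 1]."""
    return 0.3 * cpu + 0.2 * ram + 0.2 * latency + 0.2 * errors + 0.1 * timeouts


def stress_state(value: float) -> str:
    """Map a stress value to the state names in the table above."""
    if value < 0.3:
        return "Calm"
    if value < 0.6:
        return "Nervous"
    if value < 0.8:
        return "Stressed"
    return "Panic"


# Example: busy CPU, moderate RAM and latency, few errors, no timeouts.
example = stress(cpu=0.9, ram=0.5, latency=0.4, errors=0.1, timeouts=0.0)
example_state = stress_state(example)
```

For these inputs the weighted sum is 0.27 + 0.10 + 0.08 + 0.02 + 0.00 = 0.47, which falls in the "Nervous" band.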
2. Fatigue (dissipative)
F(t) = max(0, F(t−1) + (stress(t) − R_threshold) − λ)
R_threshold = 0.3 (rest threshold)
λ = 0.05 (healing rate per interval)
The λ dissipation term prevents false alarms from planned load spikes (nightly backups, batch jobs). The system "heals" during low-stress periods, eliminating ~90% of false-positive fatigue alerts observed in v5.0 field deployments.
| Value | State | Action |
|---|---|---|
| < 0.3 | Rested | Normal |
| 0.3–0.6 | Tired | Increase observation |
| 0.6–0.8 | Exhausted | Plan maintenance |
| ≥ 0.8 | Burnout imminent | Preventive restart |
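The recurrence is easy to trace by hand or in code. The sketch below applies it to a short load spike followed by quiet intervals, using hypothetical function names and the default R_threshold and λ given above. R_threshold here is the rest threshold, not the Rupture Index R(t). The formula as written has no upper bound, so the sketch does not clamp at 1.

```python
from typing import List, Sequence


def fatigue_step(prev: float, stress_t: float,
                 rest_threshold: float = 0.3, healing: float = 0.05) -> float:
    """One update: F(t) = max(0, F(t-1) + (stress(t) - R_threshold) - lambda)."""
    return max(0.0, prev + (stress_t - rest_threshold) - healing)


def fatigue_trace(stresses: Sequence[float], start: float = 0.0) -> List[float]:
    """Apply fatigue_step across a series and return every intermediate value."""
    trace = []
    current = start
    for s in stresses:
        current = fatigue_step(current, s)
        trace.append(current)
    return trace


# Two intervals of heavy load (for example a batch job), then five calm ones.
trace = fatigue_trace([0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1])
```

Fatigue climbs to 0.55 and then 1.10 during the spike, then falls by 0.25 per calm interval (0.20 from the rest threshold plus 0.05 from λ) until the max(0, ...) floor returns it to zero.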
3. Pressure
pressure(t) = d(stress̄)/dt + ∫₀ᵗ errors̄(τ) dτ
Measures the rate of systemic load increase plus accumulated error burden. Sustained positive pressure (> 0.1 for 10+ min) predicts a "storm" ~2 hours ahead.
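On sampled data the derivative and integral have to be approximated. The sketch below uses a backward difference for the stress term and the trapezoidal rule for the error term. Both choices, the sampling interval `dt`, and the function name are assumptions rather than details from the formula above.

```python
from typing import Sequence


def pressure(mean_stress: Sequence[float], mean_errors: Sequence[float],
             dt: float = 1.0) -> float:
    """Discrete approximation of d(mean stress)/dt + integral of mean errors.

    mean_stress and mean_errors are evenly spaced samples, dt apart.
    """
    if len(mean_stress) < 2:
        raise ValueError("need at least two stress samples for a derivative")
    # Backward difference over the two most recent samples.
    d_stress = (mean_stress[-1] - mean_stress[-2]) / dt
    # Trapezoidal rule over the whole error history.
    accumulated = sum((a + b) / 2 * dt
                      for a, b in zip(mean_errors, mean_errors[1:]))
    return d_stress + accumulated


value = pressure(mean_stress=[0.2, 0.3], mean_errors=[0.0, 0.02, 0.02])
```

Here the stress term contributes 0.10 and the accumulated error term 0.03, giving 0.13. Because the integral only grows, a long-running process would need some reset or windowing policy, which the formula above does not specify.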
4. Contagion
contagion(t) = Σ_{i,j} E_{ij}(t) × D_{ij}
E_ij = error rate from service i to j, D_ij = dependency weight (0–1). Measures how fast failures propagate across the service graph.
| Value | State | Action |
|---|---|---|
| < 0.3 | Isolated | Normal |
| 0.3–0.6 | Spreading | Monitor |
| 0.6–0.8 | Epidemic | Isolate affected |
| ≥ 0.8 | Pandemic | Global response |
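As a sketch, the sum can be computed over a dictionary of service-to-service edges. The edge representation, the service names, and the function name are illustrative assumptions. Edges without a dependency weight are treated as weight 0.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def contagion(error_rates: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Sum of E_ij * D_ij over every edge with an observed error rate."""
    return sum(rate * weights.get(edge, 0.0)
               for edge, rate in error_rates.items())


# Hypothetical service graph: the API depends heavily on the database
# and moderately on the cache.
error_rates = {("api", "db"): 0.5, ("api", "cache"): 0.2}
weights = {("api", "db"): 0.8, ("api", "cache"): 0.5}

score = contagion(error_rates, weights)
```

This example yields 0.5 × 0.8 + 0.2 × 0.5 = 0.5, which the table classes as "Spreading". A raw sum grows with the number of edges, so a large graph would need some normalisation to stay in the 0 to 1 range the table implies; that step is not shown in the formula.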
5. Resilience
resilience(t) = 1 − P_failure(t)
P_failure(t) = σ( α·R(t) + β·fatigue(t) + γ·contagion(t) )
α = 0.5, β = 0.3, γ = 0.2 (default weights)
σ = sigmoid function
Resilience is the complement of estimated failure probability, combining the Rupture Index, fatigue, and contagion into a single 0–1 score. A resilience score below 0.5 means the system is more likely to fail than not; below 0.4 it is classed as "Failure likely".
| Value | State |
|---|---|
| > 0.8 | Robust |
| 0.6–0.8 | Adequate |
| 0.4–0.6 | Fragile |
| < 0.4 | Failure likely |
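The formula translates directly into code. The sketch below uses hypothetical function names and the default weights listed above, and applies the formula exactly as printed, with no bias or rescaling term.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def resilience(rupture_index: float, fatigue: float, contagion: float,
               alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """resilience = 1 - sigmoid(alpha*R + beta*fatigue + gamma*contagion)."""
    p_failure = sigmoid(alpha * rupture_index + beta * fatigue + gamma * contagion)
    return 1.0 - p_failure


quiet = resilience(rupture_index=0.0, fatigue=0.0, contagion=0.0)
strained = resilience(rupture_index=3.0, fatigue=0.5, contagion=0.5)
```

One observation from running it: R(t), fatigue, and contagion are all non-negative, so the sigmoid argument is never below zero and this literal form cannot return more than 0.5, even with every input at zero. Reaching the "Adequate" and "Robust" bands would need an offset or input centring that the formula above does not show, so treat this sketch as a reading of the published expression rather than of the shipped implementation.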
6. Entropy
entropy(t) = −Σ_i p_i(t) · log₂(p_i(t))
p_i = normalised frequency of metric i crossing its baseline threshold
Measures behavioural unpredictability — how many metrics are deviating from their baseline simultaneously. High entropy indicates configuration drift, deployment side effects, or cascading anomalies. Normalised to [0, 1] via entropy / log₂(N) where N is the number of tracked metrics.
| Value | State |
|---|---|
| < 0.2 | Predictable |
| 0.2–0.5 | Some drift |
| 0.5–0.8 | High drift |
| > 0.8 | Chaotic |
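A short sketch of the normalised entropy follows, taking per-metric threshold-crossing counts as input. The function name and the choice to return 0 when there are no crossings are assumptions.

```python
import math
from typing import Sequence


def normalised_entropy(crossings: Sequence[int]) -> float:
    """Shannon entropy of threshold-crossing frequencies, divided by log2(N).

    crossings[i] is how often metric i crossed its baseline threshold.
    """
    n = len(crossings)
    total = sum(crossings)
    if n < 2 or total == 0:
        return 0.0  # assumption: no crossings means fully predictable
    # p * log2(1/p) equals -p * log2(p); metrics with zero crossings add nothing.
    h = sum((c / total) * math.log2(total / c) for c in crossings if c > 0)
    return h / math.log2(n)


concentrated = normalised_entropy([10, 0, 0, 0])  # one metric misbehaving
spread = normalised_entropy([5, 5, 5, 5])         # all four deviating equally
```

When all deviations come from one metric the score is 0. When four metrics deviate equally often it reaches 1, the "Chaotic" end of the table.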
7. Sentiment
sentiment(t) = (Uptime(t) × Throughput(t)) / (Errors(t) × Timeouts(t) × Restarts(t) + ε)
A high sentiment means the service is performant and stable. Near zero indicates degraded user experience. Normalised logarithmically to [0, 1].
| Value | State |
|---|---|
| > 0.8 | Happy |
| 0.6–0.8 | Content |
| 0.4–0.6 | Neutral |
| 0.2–0.4 | Sad |
| < 0.2 | Depressed |
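The raw ratio, before normalisation, can be sketched as follows. The function name and the value of ε are assumptions. The logarithmic mapping to [0, 1] is left out because its exact form is not given above.

```python
def sentiment_raw(uptime: float, throughput: float, errors: float,
                  timeouts: float, restarts: float, eps: float = 1e-6) -> float:
    """(Uptime * Throughput) / (Errors * Timeouts * Restarts + eps), unnormalised."""
    return (uptime * throughput) / (errors * timeouts * restarts + eps)


# A healthy service with no restarts: the denominator collapses to eps.
healthy = sentiment_raw(uptime=0.99, throughput=0.9,
                        errors=0.1, timeouts=0.1, restarts=0.0)

# The same service with errors, timeouts, and restarts all at 0.5.
degraded = sentiment_raw(uptime=0.99, throughput=0.9,
                         errors=0.5, timeouts=0.5, restarts=0.5)
```

Because the denominator is a product, a single zero factor (here, no restarts) reduces it to ε and sends the raw value very high, which is why a logarithmic normalisation step is needed before the table's bands apply.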
8. Health Score
healthscore(t) = (1 − stress) × (1 − fatigue) × (1 − pressure) × (1 − contagion) × 100
A single 0–100 operational score combining the four primary signals. Below 60: needs attention. Below 40: action required.
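A minimal sketch of the product form is below. The function name is hypothetical. Clamping each input to [0, 1] is an assumption, added because the fatigue and pressure formulas above are not bounded at 1.

```python
def health_score(stress: float, fatigue: float,
                 pressure: float, contagion: float) -> float:
    """(1 - stress)(1 - fatigue)(1 - pressure)(1 - contagion) * 100."""
    score = 100.0
    for signal in (stress, fatigue, pressure, contagion):
        clamped = min(1.0, max(0.0, signal))  # assumption: clamp to [0, 1]
        score *= 1.0 - clamped
    return score


mostly_fine = health_score(stress=0.2, fatigue=0.1, pressure=0.0, contagion=0.1)
all_mild = health_score(stress=0.2, fatigue=0.2, pressure=0.2, contagion=0.2)
```

The multiplicative form is strict: 0.8 × 0.9 × 1.0 × 0.9 gives 64.8, and four signals at a mild 0.2 each give 0.8⁴ × 100 = 40.96, already close to the "action required" line. Any single signal at 1 drives the score to 0.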
Adaptive Ensemble (v6.1)
Five models, each weighted by the inverse of its online MAE over a 1-hour sliding window:
| Model | Strengths |
|---|---|
| CA-ILR | O(1), detects acceleration, edge-native |
| ARIMA | Strong on trending series that become stationary after differencing |
| Holt-Winters | Excellent on periodic/seasonal patterns |
| MAD | Robust to outliers |
| EWMA | Reacts to recent data, smooth |
Weights update every 60 s: weight_i = (1/MAE_i) / Σ(1/MAE_j). No manual tuning. No profile configuration.
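The weighting rule can be sketched as follows, with hypothetical function names, model labels, and MAE values. The small `eps` floor that guards against a zero MAE is an assumption, not something the formula specifies.

```python
from typing import Dict


def ensemble_weights(mae: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """weight_i = (1 / MAE_i) / sum_j (1 / MAE_j)."""
    inverse = {name: 1.0 / max(err, eps) for name, err in mae.items()}
    total = sum(inverse.values())
    return {name: inv / total for name, inv in inverse.items()}


def ensemble_forecast(predictions: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted average of the per-model predictions."""
    return sum(weights[name] * value for name, value in predictions.items())


# Two models for illustration: "a" has one third of the error of "b".
weights = ensemble_weights({"a": 1.0, "b": 3.0})
forecast = ensemble_forecast({"a": 10.0, "b": 20.0}, weights)
```

With MAEs of 1.0 and 3.0 the weights come out as 0.75 and 0.25, so the blended forecast is 0.75 × 10 + 0.25 × 20 = 12.5. The weights always sum to 1, and a model whose recent error grows loses influence at the next 60-second update.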
Production Benchmarks
| Criterion | Prom/Grafana/Loki | Datadog | Ruptura v6.1 |
|---|---|---|---|
| RAM idle | ~450 MB | ~180 MB | 22 MB |
| Setup time | ~30 min | ~5 min | < 1 min |
| Prediction | ❌ None | ✅ Black-box | ✅ Transparent, 6.2% MAE |
| False positives (backup spikes) | ❌ Yes | ⚠️ Sometimes | ✅ No (λ dissipation) |
| Exponential crash detection | ❌ No | ✅ Black-box | ✅ R > 3 (auditable) |
| Air-gapped ready | ⚠️ Complex | ❌ Impossible | ✅ Native |
| Efficiency score | 1× | ~0.0001× | 1,550× |
Design Principles
Three principles have been non-negotiable since v4.0:
- Transparent AI — every prediction traceable to a published formula. No black boxes.
- Sovereign deployment — single static binary, no external database, runs on a Raspberry Pi 4.
- Auditable by design — KPI formulas are versioned release artifacts. CISOs, auditors, and SREs can challenge any decision.
Full Technical Reference
The v5.0 whitepaper contains the complete mathematical formalization and canonical METRICS.md standard:
Read OHE v5.0 Whitepaper (GitHub) →
"Stop staring at dashboards hoping for the best. Sleep. Ruptura watches."
— Selim Benfradj, Architect & Founder