Composite Signals

Ruptura computes 10 composite KPI signals from raw telemetry. Each maps multiple input metrics to a single interpretable index with a published formula: nine signals range over 0–1, and health_score ranges over 0–100. No black boxes.

Signal overview

| Signal | Range | What it measures |
| --- | --- | --- |
| stress | 0–1 | Instantaneous load pressure across CPU, RAM, latency, errors, timeouts |
| fatigue | 0–1 | Accumulated stress over time; dissipative, recovers during low-stress periods |
| mood | 0–1 | System well-being: uptime × throughput vs errors × timeouts × restarts |
| pressure | 0–1 | Rate of change in stress + integrated error load (storm approaching) |
| humidity | 0–1 | Error × timeout density relative to throughput |
| contagion | 0–1 | Error propagation across service dependencies (topology-based when traces available) |
| resilience | 0–1 | How quickly the workload recovers from stress |
| entropy | 0–1 | Behavioral unpredictability: rolling median absolute deviation (MAD) of HealthScore |
| velocity | 0–1 | Rate of change of HealthScore: how fast the workload is degrading or recovering |
| health_score | 0–100 | Composite: additive-penalty sum of the primary signals |

Formulas

Stress

stress(t) = 0.3·CPU(t) + 0.2·RAM(t) + 0.2·Latency(t) + 0.2·Errors(t) + 0.1·Timeouts(t)

All inputs normalised to [0, 1].

| stress | State |
| --- | --- |
| < 0.3 | Calm |
| 0.3 – 0.6 | Nervous |
| 0.6 – 0.8 | Stressed |
| ≥ 0.8 | Panic |
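
As a minimal sketch of the weighted sum and the state bands above: the weights come from the formula, while the function names and the dictionary layout are illustrative assumptions rather than Ruptura's own code.

```python
# Weights copied from the stress formula; everything else here is illustrative.
STRESS_WEIGHTS = {"cpu": 0.3, "ram": 0.2, "latency": 0.2, "errors": 0.2, "timeouts": 0.1}


def stress(inputs: dict[str, float]) -> float:
    """Weighted sum of inputs that are already normalised to [0, 1]."""
    return sum(weight * inputs[name] for name, weight in STRESS_WEIGHTS.items())


def stress_state(value: float) -> str:
    """Map a stress value to the state bands in the table."""
    if value < 0.3:
        return "Calm"
    if value < 0.6:
        return "Nervous"
    if value < 0.8:
        return "Stressed"
    return "Panic"


sample = {"cpu": 0.9, "ram": 0.5, "latency": 0.4, "errors": 0.1, "timeouts": 0.0}
value = stress(sample)
print(round(value, 2), stress_state(value))
```

With these sample inputs the result is 0.47, so 90% CPU on its own does not push the workload out of the Nervous band while the other inputs stay low.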

Fatigue (dissipative)

F(t) = max(0, F(t−1) + (stress(t) − R_threshold) − λ)

  R_threshold = 0.3  (rest threshold — stress below this heals fatigue)
  λ           = 0.05 (healing rate per 15-second interval)

The dissipative term λ prevents false fatigue alarms from legitimate scheduled spikes (nightly backups, batch jobs). After 24h of observation, each workload's baseline is learned and thresholds become relative — a batch job at 90% CPU is never "fatigued" if that is its normal state.

| fatigue | State | Recommended action |
| --- | --- | --- |
| < 0.3 | Rested | Normal monitoring |
| 0.3 – 0.6 | Tired | Increase observation frequency |
| 0.6 – 0.8 | Exhausted | Plan maintenance window |
| ≥ 0.8 | Burnout imminent | Preventive restart |
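
The recurrence is easier to follow on a short trace. The sketch below applies the published update once per interval; the helper names and the sample stress series are hypothetical, and the learned per-workload baseline is not modelled.

```python
R_THRESHOLD = 0.3  # rest threshold from the formula
LAMBDA = 0.05      # healing rate per interval from the formula


def fatigue_step(previous: float, stress: float) -> float:
    """One update of F(t) = max(0, F(t-1) + (stress(t) - R_threshold) - lambda)."""
    return max(0.0, previous + (stress - R_THRESHOLD) - LAMBDA)


def run_fatigue(stress_series: list[float], start: float = 0.0) -> list[float]:
    """Apply the update to each stress sample and return the fatigue trace."""
    trace, fatigue = [], start
    for sample in stress_series:
        fatigue = fatigue_step(fatigue, sample)
        trace.append(fatigue)
    return trace


# Four intervals of moderate stress, then five intervals of rest.
trace = run_fatigue([0.5] * 4 + [0.2] * 5)
print([round(f, 2) for f in trace])
```

At stress 0.5 fatigue rises by 0.15 per interval (0.2 above the rest threshold, minus 0.05 healing); at stress 0.2 it drains by 0.15 per interval and is held at zero by the max(0, ...) floor.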

Mood

mood(t) = log(uptime × throughput + 1) / log(errors × timeouts × restarts + 2)

High mood = service is happy and performant. Low mood = degraded user experience regardless of raw CPU/memory numbers.

| mood | State |
| --- | --- |
| > 0.7 | Happy |
| 0.5 – 0.7 | Content |
| 0.3 – 0.5 | Neutral |
| 0.1 – 0.3 | Sad |
| < 0.1 | Depressed |
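
A small sketch of the mood ratio follows. The source does not state the units of the inputs, so this example assumes all five are normalised to [0, 1]; under that assumption the numerator is at most log 2 and the denominator is at least log 2, which keeps the ratio inside 0–1. The function names and the handling of exact band boundaries are also assumptions.

```python
import math


def mood(uptime: float, throughput: float, errors: float,
         timeouts: float, restarts: float) -> float:
    """log(uptime*throughput + 1) / log(errors*timeouts*restarts + 2).

    Assumes every input is normalised to [0, 1] (not stated in the source).
    """
    good = math.log(uptime * throughput + 1)
    bad = math.log(errors * timeouts * restarts + 2)
    return good / bad


def mood_state(value: float) -> str:
    """Map a mood value to the state bands in the table."""
    if value > 0.7:
        return "Happy"
    if value >= 0.5:
        return "Content"
    if value >= 0.3:
        return "Neutral"
    if value >= 0.1:
        return "Sad"
    return "Depressed"


healthy = mood(1.0, 1.0, 0.0, 0.0, 0.0)   # no errors, timeouts, or restarts
degraded = mood(1.0, 1.0, 1.0, 1.0, 1.0)  # all three failure inputs at maximum
print(mood_state(healthy), round(degraded, 2), mood_state(degraded))
```

Because the failure terms are multiplied together, the denominator only grows when errors, timeouts, and restarts are all non-zero at the same time.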

Pressure

pressure(t) = d(stress̄)/dt + ∫₀ᵗ errors̄(τ) dτ

| pressure | Interpretation |
| --- | --- |
| > 0.1 sustained for 10 min | Storm likely in ~2 hours |
| Stable | Steady-state conditions |
| Declining | System recovering |
| ≥ 0.8 | Storm approaching |

Humidity

humidity(t) = (errors(t) × timeouts(t)) / max(throughput(t), ε)

A high-throughput service absorbs more errors before becoming "humid." A low-throughput service with even a few errors/timeouts gets high humidity — which is usually a sign of a problem.
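
To make the throughput effect concrete, here is a minimal sketch of the ratio. The value of ε and the sample counts are assumptions chosen for illustration; the source does not specify either.

```python
EPSILON = 1e-9  # assumed guard against division by zero; the real value is not documented


def humidity(errors: float, timeouts: float, throughput: float,
             eps: float = EPSILON) -> float:
    """(errors * timeouts) / max(throughput, eps)."""
    return (errors * timeouts) / max(throughput, eps)


# Same error and timeout counts, very different throughput.
busy = humidity(errors=5, timeouts=2, throughput=1000)
quiet = humidity(errors=5, timeouts=2, throughput=20)
print(busy, quiet)
```

The same five errors and two timeouts give 0.01 on the busy service and 0.5 on the quiet one, which is the absorption effect described above.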

Contagion

When trace spans are available (OTLP traces ingested):

contagion(t) = Σ_{i,j} E_{ij}(t) × D_{ij}

Where E_ij = error rate from service i to j (from real trace spans), D_ij = dependency weight (0–1) from call volume.

Fallback (no trace topology):

contagion(t) ≈ errors(t) × cpu(t)   # proxy signal

| contagion | State | Action |
| --- | --- | --- |
| < 0.3 | Isolated | Normal |
| 0.3 – 0.6 | Spreading | Monitor closely |
| 0.6 – 0.8 | Epidemic | Isolate affected services |
| ≥ 0.8 | Pandemic | Global incident response |
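
Both forms of the signal can be sketched in a few lines. The edge dictionaries, service names, and numbers below are hypothetical; how Ruptura derives E_ij and D_ij from trace spans is not shown in this section.

```python
Edge = tuple[str, str]  # (caller, callee)


def contagion(error_rate: dict[Edge, float], dep_weight: dict[Edge, float]) -> float:
    """Sum of E_ij * D_ij over every observed call edge."""
    return sum(rate * dep_weight.get(edge, 0.0) for edge, rate in error_rate.items())


def contagion_fallback(errors: float, cpu: float) -> float:
    """Proxy used when no trace topology is available: errors * cpu."""
    return errors * cpu


# Hypothetical two-edge topology.
error_rate = {("checkout", "payment"): 0.20, ("checkout", "inventory"): 0.05}
dep_weight = {("checkout", "payment"): 0.90, ("checkout", "inventory"): 0.40}
print(round(contagion(error_rate, dep_weight), 2), contagion_fallback(0.5, 0.8))
```

In this example the heavily used checkout → payment edge contributes 0.18 of the 0.20 total, so a high error rate matters most on edges that carry a large share of the call volume.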

Resilience

resilience(t) = mood(t) × (1 − fatigue(t)) × (1 − contagion(t))

A workload that is in good mood, not fatigued, and not spreading errors to its peers has high resilience.
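
A short sketch of the product, with illustrative input values, shows how any one weak factor pulls the result down:

```python
def resilience(mood: float, fatigue: float, contagion: float) -> float:
    """mood * (1 - fatigue) * (1 - contagion), with all inputs in [0, 1]."""
    return mood * (1.0 - fatigue) * (1.0 - contagion)


# Illustrative values: good mood, mild fatigue, little contagion.
print(round(resilience(mood=0.8, fatigue=0.25, contagion=0.1), 2))
# The same workload once it is fully fatigued.
print(resilience(mood=0.8, fatigue=1.0, contagion=0.1))
```

Because the terms are multiplied, full fatigue or full contagion drives resilience to zero regardless of mood.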

Entropy

entropy(t) = MAD(HealthScore history, window=20)

The median absolute deviation of the last 20 HealthScore samples. High entropy means the workload is behaving unpredictably — oscillating between healthy and degraded.
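
The following sketch computes the MAD over the most recent 20 samples using the standard library. It returns the raw deviation in HealthScore points; how that value is scaled to the 0–1 range listed in the overview is not documented here, so no scaling is attempted.

```python
from statistics import median


def mad(history: list[float], window: int = 20) -> float:
    """Median absolute deviation of the last `window` samples."""
    recent = history[-window:]
    centre = median(recent)
    return median(abs(value - centre) for value in recent)


steady = [80.0] * 20            # never moves
flapping = [90.0, 50.0] * 10    # alternates between healthy and degraded
print(mad(steady), mad(flapping))
```

The steady workload has a MAD of 0, while the flapping workload sits 20 points from its median on every sample.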

Velocity

velocity(t) = |ΔHealthScore| / Δt

Rate of change of HealthScore. High velocity means the workload is degrading (or recovering) rapidly.
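
As a minimal sketch, velocity is a finite difference between two consecutive samples. The unit of Δt (seconds here) and the sample values are assumptions, and any normalisation to the 0–1 range is left out because it is not described in this section.

```python
def velocity(previous_score: float, current_score: float, dt_seconds: float) -> float:
    """|delta HealthScore| / delta t, in HealthScore points per second."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    return abs(current_score - previous_score) / dt_seconds


# A drop and a recovery of the same size give the same velocity.
print(velocity(74.0, 68.0, 15.0), velocity(68.0, 74.0, 15.0))
```

The absolute value discards direction, which is why the signal cannot by itself distinguish degradation from recovery.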

Health Score

health_score = 100 × (1 − (
    w_stress    × stress +
    w_fatigue   × fatigue +
    w_mood      × (1 − mood) +
    w_pressure  × pressure +
    w_humidity  × humidity +
    w_contagion × contagion
))

Default weights: stress=0.25, fatigue=0.20, mood=0.20, pressure=0.15, humidity=0.10, contagion=0.10.

The score uses an additive-penalty model: a single high signal degrades the score in proportion to its weight, rather than collapsing it the way a multiplicative model would. A score below 60 indicates a workload that needs attention.

Weights are configurable per workload or namespace (v6.6.0+). See Signal Weight Configuration for the API reference and Helm workloadWeights for static bootstrap config.

| health_score | State |
| --- | --- |
| 80–100 | Excellent |
| 60–80 | Good |
| 40–60 | Fair |
| 20–40 | Poor |
| < 20 | Critical |
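
The additive-penalty formula with the default weights can be sketched as below. The signal values are illustrative, the function names are hypothetical, and treating a boundary value such as 60 as the higher band is an assumption (the table ranges overlap at their edges).

```python
DEFAULT_WEIGHTS = {
    "stress": 0.25, "fatigue": 0.20, "mood": 0.20,
    "pressure": 0.15, "humidity": 0.10, "contagion": 0.10,
}


def health_score(signals: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """100 * (1 - weighted penalty); mood is inverted because high mood is good."""
    penalty = sum(
        weight * ((1.0 - signals[name]) if name == "mood" else signals[name])
        for name, weight in weights.items()
    )
    return 100.0 * (1.0 - penalty)


def health_state(score: float) -> str:
    """Map a score to the state bands in the table."""
    for floor, label in ((80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor")):
        if score >= floor:
            return label
    return "Critical"


signals = {"stress": 0.52, "fatigue": 0.81, "mood": 0.6,
           "pressure": 0.2, "humidity": 0.1, "contagion": 0.1}
score = health_score(signals)
print(round(score, 1), health_state(score))
```

With these inputs the penalties add up to 0.422, giving a score of 57.8 (Fair). Fatigue at 0.81 costs 16.2 points on its own, but it cannot push the score to zero.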

Calibration Warm-Up

For the first 96 observations (~24 hours at the default 15-second interval), Ruptura is in calibrating state. During this period:

  • KPI signals are computed and stored normally
  • Rupture predictions and Tier-1/Tier-2 action recommendations are suppressed — the baseline is not yet reliable enough to act on
  • The API response includes a clear calibration status so you are never confused by the silence

Every rupture snapshot carries:

{
  "status": "calibrating",
  "calibration_progress": 43,
  "calibration_eta_minutes": 820
}

Once calibration completes, status switches to "active" and the full prediction + action pipeline comes online.

{
  "status": "active",
  "calibration_progress": 100,
  "calibration_eta_minutes": 0
}

You can fast-track calibration in demos using ruptura-sim.


HealthScore Trend Forecast

When a workload is active (calibration complete) and at least 10 health history points are available, Ruptura runs an OLS linear regression over the rolling 60-point health history and projects the critical-threshold crossing time.

{
  "health_forecast": {
    "trend": "degrading",
    "in_15min": 51.2,
    "in_30min": 38.7,
    "critical_eta_minutes": 28
  }
}

| Field | Meaning |
| --- | --- |
| trend | "improving" \| "stable" \| "degrading" |
| in_15min | Projected HealthScore (0–100) in 15 minutes |
| in_30min | Projected HealthScore (0–100) in 30 minutes |
| critical_eta_minutes | Minutes until HealthScore is projected to fall below 40 (Fair → Poor). 0 if not degrading toward critical. |

This turns "your score is 54" into "you have 28 minutes." The forecast is null during calibration and when the trend is flat (insufficient variance to project).
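
The projection can be sketched as an ordinary least-squares line fitted to the recent history and extrapolated forward. This is an illustration under stated assumptions, not Ruptura's implementation: the function name, the `interval_min` parameter, the clamp to 0–100, and the `flat_tol` cut-off are all hypothetical, and because the rule for reporting "stable" is not given, a near-zero slope simply yields no forecast.

```python
from typing import Optional


def health_forecast(history: list[float], interval_min: float,
                    critical: float = 40.0, window: int = 60,
                    flat_tol: float = 1e-6) -> Optional[dict]:
    """Fit y = a + b*x by OLS over the last `window` points and extrapolate."""
    if len(history) < 10:
        return None                           # not enough points yet
    ys = history[-window:]
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx                         # HealthScore points per sample
    if abs(slope) < flat_tol:
        return None                           # flat trend: nothing to project
    now = mean_y + slope * (n - 1 - mean_x)   # fitted value at the latest sample

    def project(minutes: float) -> float:
        value = now + slope * (minutes / interval_min)
        return min(100.0, max(0.0, value))    # assumed clamp to the 0-100 range

    eta = 0.0
    if slope < 0 and now > critical:
        eta = (now - critical) / -slope * interval_min
    return {
        "trend": "degrading" if slope < 0 else "improving",
        "in_15min": project(15),
        "in_30min": project(30),
        "critical_eta_minutes": eta,
    }


# Hypothetical history: one sample per minute, losing 0.5 points each time.
print(health_forecast([70 - 0.5 * i for i in range(20)], interval_min=1.0))
```

For this perfectly linear history the fitted line passes through the latest value of 60.5, so the projections are about 53.0 and 45.5, and the line crosses 40 roughly 41 minutes out.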


Adaptive Per-Workload Baselines

After 96 observations (~24 hours at the default 15s interval), Ruptura switches from global thresholds to workload-specific baselines using Welford online statistics.

  • A batch job at 90% CPU → stress = 0.9 globally, but z-score = 0.1 (normal for this workload) → no alarm
  • An API server normally at 10% CPU, now at 40% → z-score = 4.2 → stress alarm fires

Fatigue thresholds remain absolute because sustained effort IS fatigue regardless of whether it is normal.
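
Welford's algorithm keeps a running mean and variance without storing the history, which is what makes a per-workload z-score cheap to maintain. The class below is a generic sketch of that technique; the class name, the use of CPU fraction as the tracked metric, and the sample values are assumptions for illustration.

```python
import math


class Baseline:
    """Running mean and sample standard deviation via Welford's update."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def zscore(self, x: float) -> float:
        spread = self.std()
        return (x - self.mean) / spread if spread > 0 else 0.0


batch, api = Baseline(), Baseline()
for cpu in (0.88, 0.91, 0.90, 0.89, 0.92, 0.90):   # batch job: always busy
    batch.update(cpu)
for cpu in (0.08, 0.10, 0.12, 0.09, 0.11, 0.10):   # API server: normally idle
    api.update(cpu)

print(round(batch.zscore(0.90), 2), round(api.zscore(0.40), 2))
```

The batch job at 90% CPU scores close to zero against its own baseline, while the API server at 40% CPU lands many standard deviations above its norm.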


API

# By Kubernetes workload (primary)
GET /api/v2/kpi/{signal}/{namespace}/{workload}

# By legacy host name (fallback)
GET /api/v2/kpi/{signal}/{host}

# Full workload snapshot (all signals at once)
GET /api/v2/rupture/{namespace}/{workload}

Example request:

curl -H "Authorization: Bearer $API_KEY" \
  "http://localhost:8080/api/v2/kpi/fatigue/default/payment-api"

Example response:

{
  "signal": "fatigue",
  "workload": {
    "namespace": "default",
    "kind": "Deployment",
    "name": "payment-api"
  },
  "value": 0.81,
  "state": "burnout_imminent",
  "timestamp": "2026-05-01T09:00:00Z"
}
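
The same request can be issued from Python with only the standard library. This is a sketch: the helper names and the placeholder key are hypothetical, and the base URL is the one used in the curl example.

```python
import json
import urllib.request
from urllib.parse import quote


def build_kpi_request(base_url: str, api_key: str, signal: str,
                      namespace: str, workload: str) -> urllib.request.Request:
    """Build GET /api/v2/kpi/{signal}/{namespace}/{workload} with a bearer token."""
    path = "/".join(quote(part, safe="") for part in (signal, namespace, workload))
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/kpi/{path}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_kpi(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_kpi_request("http://localhost:8080", "example-key",
                            "fatigue", "default", "payment-api")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the URL and header logic testable without a running Ruptura instance; call `fetch_kpi(request)` once the server is available.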

Prometheus metrics

Scrape at GET /api/v2/metrics.

All 10 signals (plus fused_rupture_index and throughput) are exported as:

ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="fatigue"} 0.81
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="stress"} 0.52
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="health_score"} 74.0
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="fused_rupture_index"} 1.8
# ... one series per signal per workload

The Grafana dashboard at deploy/grafana/dashboards/ruptura_overview.json is pre-configured to use these label selectors.