Composite Signals

Ruptura computes 10 composite KPI signals from raw telemetry. Each maps multiple input metrics to a single 0–1 interpretable index with a published formula. No black boxes.

Signal overview

Signal	Range	What it measures
`stress`	0–1	Instantaneous load pressure across CPU, RAM, latency, errors, timeouts
`fatigue`	0–1	Accumulated stress over time — dissipative, recovers during low-stress periods
`mood`	0–1	System well-being: uptime × throughput vs errors × timeouts × restarts
`pressure`	0–1	Rate of change in stress + integrated error load (storm approaching)
`humidity`	0–1	Error × timeout density relative to throughput
`contagion`	0–1	Error propagation across service dependencies (topology-based when traces available)
`resilience`	0–1	How quickly the workload recovers from stress
`entropy`	0–1	Behavioral unpredictability — rolling variance of HealthScore
`velocity`	0–1	Rate of change of HealthScore — how fast the workload is degrading
`health_score`	0–100	Composite: additive-penalty sum of the primary signals

Formulas

Stress

stress(t) = 0.3·CPU(t) + 0.2·RAM(t) + 0.2·Latency(t) + 0.2·Errors(t) + 0.1·Timeouts(t)

All inputs normalised to [0, 1].

stress	State
< 0.3	Calm
0.3 – 0.6	Nervous
0.6 – 0.8	Stressed
≥ 0.8	Panic

Fatigue (dissipative)

F(t) = max(0, F(t−1) + (stress(t) − R_threshold) − λ)

  R_threshold = 0.3  (rest threshold — stress below this heals fatigue)
  λ           = 0.05 (healing rate per 15-second interval)

The dissipative term λ prevents false fatigue alarms from legitimate scheduled spikes (nightly backups, batch jobs). After 24h of observation, each workload's baseline is learned and thresholds become relative — a batch job at 90% CPU is never "fatigued" if that is its normal state.

fatigue	State	Recommended action
< 0.3	Rested	Normal monitoring
0.3 – 0.6	Tired	Increase observation frequency
0.6 – 0.8	Exhausted	Plan maintenance window
≥ 0.8	Burnout imminent	Preventive restart

Mood

mood(t) = log(uptime × throughput + 1) / log(errors × timeouts × restarts + 2)

High mood = service is happy and performant. Low mood = degraded user experience regardless of raw CPU/memory numbers.

mood	State
> 0.7	Happy
0.5–0.7	Content
0.3–0.5	Neutral
0.1–0.3	Sad
< 0.1	Depressed

Pressure

pressure(t) = d(stress̄)/dt + ∫₀ᵗ errors̄(τ) dτ

pressure	Interpretation
> 0.1 sustained for 10 min	Storm likely in ~2 hours
Stable	Steady-state conditions
Declining	System recovering
≥ 0.8	Storm approaching

Humidity

humidity(t) = (errors(t) × timeouts(t)) / max(throughput(t), ε)

A high-throughput service absorbs more errors before becoming "humid." A low-throughput service with even a few errors/timeouts gets high humidity — which is usually a sign of a problem.

Contagion

When trace spans are available (OTLP traces ingested):

contagion(t) = Σ_{i,j} E_{ij}(t) × D_{ij}

Where E_ij = error rate from service i to j (from real trace spans), D_ij = dependency weight (0–1) from call volume.

Fallback (no trace topology):

contagion(t) ≈ errors(t) × cpu(t)   # proxy signal

contagion	State	Action
< 0.3	Isolated	Normal
0.3 – 0.6	Spreading	Monitor closely
0.6 – 0.8	Epidemic	Isolate affected services
≥ 0.8	Pandemic	Global incident response

Resilience

resilience(t) = mood(t) × (1 − fatigue(t)) × (1 − contagion(t))

A workload that is in good mood, not fatigued, and not spreading errors to its peers has high resilience.

Entropy

entropy(t) = MAD(HealthScore history, window=20)

The median absolute deviation of the last 20 HealthScore samples. High entropy means the workload is behaving unpredictably — oscillating between healthy and degraded.

Velocity

velocity(t) = |ΔHealthScore| / Δt

Rate of change of HealthScore. High velocity means the workload is degrading (or recovering) rapidly.

Health Score

health_score = 100 × (1 − (
    w_stress    × stress +
    w_fatigue   × fatigue +
    w_mood      × (1 − mood) +
    w_pressure  × pressure +
    w_humidity  × humidity +
    w_contagion × contagion
))

Default weights: stress=0.25, fatigue=0.20, mood=0.20, pressure=0.15, humidity=0.10, contagion=0.10.

Additive-penalty model. A single high signal degrades the score proportionally — it does not collapse the score the way a multiplicative model would. Below 60 indicates a workload needing attention.

Weights are configurable per workload or namespace (v6.6.0+). See Signal Weight Configuration for the API reference and Helm workloadWeights for static bootstrap config.

health_score	State
80–100	Excellent
60–80	Good
40–60	Fair
20–40	Poor
< 20	Critical

Calibration Warm-Up

For the first 96 observations (~24 hours at the default 15-second interval), Ruptura is in calibrating state. During this period:

KPI signals are computed and stored normally
Rupture predictions and Tier-1/Tier-2 action recommendations are suppressed — the baseline is not yet reliable enough to act on
The API response includes a clear calibration status so you are never confused by the silence

Every rupture snapshot carries:

{
  "status": "calibrating",
  "calibration_progress": 43,
  "calibration_eta_minutes": 820
}

Once calibration completes, status switches to "active" and the full prediction + action pipeline comes online.

{
  "status": "active",
  "calibration_progress": 100,
  "calibration_eta_minutes": 0
}

You can fast-track calibration in demos using ruptura-sim.

HealthScore Trend Forecast

When a workload is active (calibration complete) and at least 10 health history points are available, Ruptura runs an OLS linear regression over the rolling 60-point health history and projects the critical-threshold crossing time.

{
  "health_forecast": {
    "trend": "degrading",
    "in_15min": 51.2,
    "in_30min": 38.7,
    "critical_eta_minutes": 28
  }
}

field	Meaning
`trend`	`"improving"` \| `"stable"` \| `"degrading"`
`in_15min`	Projected HealthScore (0–100) in 15 minutes
`in_30min`	Projected HealthScore (0–100) in 30 minutes
`critical_eta_minutes`	Minutes until HealthScore is projected to fall below 40 (Fair → Poor). `0` if not degrading toward critical.

This turns "your score is 54" into "you have 28 minutes." The forecast is null during calibration and when the trend is flat (insufficient variance to project).

Adaptive Per-Workload Baselines

After 96 observations (~24 hours at the default 15s interval), Ruptura switches from global thresholds to workload-specific baselines using Welford online statistics.

A batch job at 90% CPU → stress = 0.9 globally, but z-score = 0.1 (normal for this workload) → no alarm
An API server normally at 10% CPU, now at 40% → z-score = 4.2 → stress alarm fires

Fatigue thresholds remain absolute because sustained effort IS fatigue regardless of whether it is normal.

API

# By Kubernetes workload (primary)
GET /api/v2/kpi/{signal}/{namespace}/{workload}

# By legacy host name (fallback)
GET /api/v2/kpi/{signal}/{host}

# Full workload snapshot (all signals at once)
GET /api/v2/rupture/{namespace}/{workload}

Example request:

curl -H "Authorization: Bearer $API_KEY" \
  "http://localhost:8080/api/v2/kpi/fatigue/default/payment-api"

Example response:

{
  "signal": "fatigue",
  "workload": {
    "namespace": "default",
    "kind": "Deployment",
    "name": "payment-api"
  },
  "value": 0.81,
  "state": "burnout_imminent",
  "timestamp": "2026-05-01T09:00:00Z"
}

Prometheus metrics

Scrape at GET /api/v2/metrics.

All 10 signals (plus fused_rupture_index and throughput) are exported as:

ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="fatigue"} 0.81
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="stress"} 0.52
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="health_score"} 74.0
ruptura_kpi{namespace="default",kind="Deployment",workload="payment-api",signal="fused_rupture_index"} 1.8
# ... one series per signal per workload

The Grafana dashboard at deploy/grafana/dashboards/ruptura_overview.json is pre-configured to use these label selectors.