BioSentinel — Biomedical Anomaly Detection

The Problem

When Sensors Fail,
Patients Pay the Price

Cold-chain equipment and lab instruments stream data constantly. Small anomalies go unnoticed until a drug batch is already ruined. BioSentinel catches them as they happen.

Cold storage failures

A refrigerator storing vaccines or insulin can fail silently at night. By morning the batch is ruined and patients go without.

Sensor tampering

In pharmaceutical settings, tampered or miscalibrated sensors can hide contaminated batches from quality control behind fake-normal readings.

The detection gap

Simple thresholds miss slow sustained failures. Advanced systems exist but are proprietary and unauditable. There is no open baseline anyone can inspect.

A security question

Sensor integrity is a security problem, not just a maintenance one. BioSentinel treats it with the same rigor applied to network intrusion detection.

Machine temperature sensor signal with anomaly windows

Temperature every 5 minutes Known anomaly window NAB machine_temperature_system_failure — Dec 2013 to Feb 2014 — 22,695 rows

Methodology

Seven Detectors.
One Honest Story.

BioSentinel builds detectors in order of complexity. Each step is driven by where the previous one failed. Every design choice is backed by data from a hyperparameter sweep.

3-Sigma Baseline (Global)

Flag anything more than 3 standard deviations from the global mean. A pharma auditor can verify every flag in two lines of arithmetic. Sets the floor that every future model must beat.

Precision0.991

Recall0.202

F1 Score0.336

Key finding: Near-perfect precision (almost no false alarms) but misses 80% of real anomalies. The global standard deviation is inflated by the failure period itself, widening the band so the model cannot catch sustained low-temperature readings at 50°C.

3-sigma detector output showing flagged anomaly points

3-Sigma detector — red dots are flags. The sharp December crash is caught but the sustained February failure is mostly missed. Amber dashed lines show the global ±3-sigma bounds.

EWMA — Exponentially Weighted Moving Average

Uses an exponentially weighted baseline instead of a global mean. With span=5000 (roughly 17 days at 5-minute intervals), the baseline adapts very slowly, staying stable through long failure events.

Precision0.586

Recall0.513

F1 Score0.547

Key finding: Tuning the span from 48 to 5000 improved F1 by 10x (0.004 to 0.547). Short spans adapt into the failure period and make it look normal. A span of 5000 behaves close to a global mean, keeping the baseline stable across a 2-day failure event.

Precision, recall and F1 bar chart for all detectors

Detector comparison bar chart — EWMA (second group) achieves the highest F1 of any statistical method, but still falls short of IsolationForest-based approaches.

Hampel Filter and LOF — Why They Fail Here

Two well-known algorithms were tested and both performed poorly. Their failure modes reveal important properties of sustained failures in time-series data.

Hampel Filter F10.050

LOF (k=20) F10.119

Hampel: Uses rolling median plus MAD. The rolling window adapts into the sustained failure, making 40°C readings look normal relative to the local median. Same failure mode as rolling 3-sigma.

LOF: Compares local density to that of neighbors. The 48-hour failure creates a dense cluster of roughly 576 similar readings. Each anomalous point has neighbors with similar density, so the LOF score stays low. LOF is designed for isolated outliers, not large dense clusters.

Precision-recall curves for all detectors

Precision-Recall curves across all score thresholds. LOF (pink) stays near the random baseline. The IF + Rolling curve (cyan) dominates at high recall values, confirming it as the strongest detector.

IsolationForest — Raw Value

Build 200 random isolation trees. Points isolated in fewer cuts are anomalies. Uses the global distribution, not a local window, so it is immune to the adaptation problem that kills rolling methods.

Precision0.471

Recall0.566

F1 Score0.514

Key finding: Recall jumps from 0.20 to 0.57 compared to 3-sigma — a 2.8x improvement. Precision drops because many readings in the 60–70°C range look globally unusual but fall inside normal-labeled windows. The model needs additional context to separate a brief dip from a sustained failure.

Feature insight showing sustained_dev vs raw signal

Top: raw signal, where the 48-hour failure looks similar to brief dips. Bottom: sustained_dev (value minus 30-hour rolling mean), which drops sharply and stays low through the full failure — giving the next model a clear signal to exploit.

IsolationForest + Rolling Features Best

Adds one engineered feature: sustained_dev = value - rolling_mean(360). A hyperparameter sweep over 24 window sizes confirmed 360 rows (30 hours) is optimal. Both precision and recall improve over raw IsolationForest.

Precision0.568

Recall0.682

F1 Score0.619

Key result: Recall of 0.682 exceeds the 61.2% ceiling for value-only detectors. A reading of 50°C with sustained_dev = -30 is distinguishable from a brief 50°C dip with sustained_dev = +2. One feature, added with intent, breaks the apparent ceiling.

IsolationForest + Rolling detector output

IsolationForest + Rolling (w=360) — red dots show flagged points. The model catches the full February failure period and much of the sustained December events, while the 3-sigma detector above misses nearly all of them.

Results

Real Numbers.
Honest Limits.

All metrics computed against four ground-truth windows from the Numenta Anomaly Benchmark. No cherry-picking. The theoretical ceiling is documented and then broken.

Value-only recall ceiling: 38.8% of labeled anomaly points fall within the normal temperature range because NAB windows start before the temperature drops. No raw-value-only detector can exceed roughly 61.2% recall. The IsolationForest + Rolling model achieves recall=0.682 by using sustained_dev as a second feature, providing temporal context the raw value cannot.

Detector	Type	Flagged	TP	FP	Precision	Recall	F1
3-Sigma (Global)	Statistical	462	458	4	0.991	0.202	0.336
Rolling 3-Sigma	Statistical	703	199	504	0.283	0.088	0.134
EWMA (span=5000)	Statistical	1,983	1,163	820	0.586	0.513	0.547
Hampel Filter	Statistical	691	74	617	0.107	0.033	0.050
LOF (k=20)	ML	2,724	298	2,426	0.109	0.131	0.119
IsolationForest (Raw)	ML	2,724	1,283	1,441	0.471	0.566	0.514
IsolationForest + Rolling	ML + FE	2,724	1,546	1,178	0.568	0.682	0.619

Zoomed in on the February 7–9 failure event. Top (3-Sigma): catches only the absolute lowest readings, misses the majority of the sustained failure. Bottom (IF + Rolling): catches the full 48-hour period because sustained_dev stays strongly negative throughout.

Precision, Recall, and F1 for the four primary detectors. IF + Rolling (rightmost group) wins on both recall and F1. Statistical methods have high precision but low recall — they sound few false alarms but miss most of the actual failures.

Precision-recall curves with AUC-PR scores

Precision-Recall curves across all decision thresholds. AUC-PR summarises ranking ability: a perfect detector has AUC-PR=1.0; a random detector matches the base rate (10.0%). IF + Rolling (cyan, AUC-PR=0.610) dominates at high recall values.

All four detectors plotted against ground truth

All four primary detectors plotted against ground-truth windows (yellow shading). Colored dots are flagged anomaly points. Notice how the 3-sigma detector (top) catches almost nothing in the sustained periods, while IF + Rolling (bottom) covers them well.

Hyperparameter Analysis

Every Parameter
Was Swept, Not Guessed.

Contamination and window size were each swept across a full range of values. The optimal settings were selected by maximizing F1 on the ground-truth labels — documented so anyone can reproduce it.

Hyperparameter sensitivity sweep for contamination and window size

Left: F1 vs contamination. Peaks at 0.12, slightly above the true anomaly rate of 10.0% — the model's internal threshold is slightly conservative. Right: F1 vs window size (hours). Peaks at 360 rows = 30 hours. Shorter windows adapt into the failure period; longer windows introduce noise. The 30-hour baseline stays stable across the 48-hour February event.

Why contamination = 0.12?

The true anomaly rate in this dataset is 10.0% (2,268 of 22,695 readings). A sweep from 0.02 to 0.30 found F1 peaks at contamination=0.12. The IsolationForest threshold calibration is slightly conservative, so setting contamination above the true rate compensates and yields the best precision-recall balance.

Why long_window = 360 rows (30 hours)?

A sweep from 24 to 576 rows found F1 peaks at 360 rows — 30 hours at 5-minute intervals. The February failure lasts roughly 48 hours. A 30-hour baseline does not fully adapt into the failure, keeping sustained_dev clearly negative throughout the entire event. Shorter windows normalize the failure; longer ones lose resolution near the onset.

Technical Depth

Why Each Decision
Was Made

Four findings that are non-obvious and each backed by data run on this dataset.

Why IsolationForest instead of a threshold?

A threshold is a single scalar — it only says "too high" or "too low." IsolationForest scores anomalies by how few random cuts it takes to isolate a point. With two input features it operates across a 2D space, detecting combinations that no single numeric threshold could capture.

Why LOF failed on this dataset

Local Outlier Factor compares the local density of each point to that of its neighbors. The 48-hour February failure creates roughly 576 readings in a dense cluster. Each anomalous point has neighbors with identical density, so the LOF score stays low. LOF was designed for isolated outliers, not large sustained anomaly regions.

Breaking the 61.2% recall ceiling

38.8% of labeled anomaly points have normal-range temperatures — no raw-value detector can flag them. Adding sustained_dev gives the model a second dimension: a reading of 50°C with sustained_dev = -30 is fundamentally different from a brief 50°C dip with sustained_dev = +2. One feature breaks the apparent ceiling.

AUC-PR versus F1: two different questions

F1 measures performance at a single threshold. AUC-PR (0.610 for IF + Rolling) measures ranking ability across all thresholds — it answers whether the model consistently assigns higher anomaly scores to true positives than to normal points. AUC-PR is the standard evaluation metric for imbalanced detection problems.

biosentinel — terminal

# Core function — two features that push F1 from 0.514 to 0.619
def detect_iforest_rolling(df, contamination=0.12, long_window=360):
    v             = df["value"]
    sustained_dev = v - v.rolling(long_window, min_periods=1).mean()
    X             = pd.DataFrame({"value": v, "sustained_dev": sustained_dev})
    clf           = IsolationForest(contamination=contamination, random_state=42, n_estimators=200)
    clf.fit(X)
    df["anomaly"] = clf.predict(X) == -1
    df["score"]  = -clf.decision_function(X)  # higher = more anomalous
    return df
    

Stack

Built in Python

pandas

scikit-learn

matplotlib

Streamlit

numpy

NAB Dataset

Python 3.13

Project Files

A complete, runnable codebase

biosentinel.py

All 7 detectors, CLI, and scoring

compare.py

Run all detectors side by side

tune.py

Hyperparameter sweep analysis

pr_curve.py

PR curves and AUC-PR scores

app.py

Streamlit interactive demo

explore.py

Load and visualize the signal

evaluate.py

Baseline evaluation pipeline

generate_web_images.py

Reproducible figure generation

About

Built by Srujan Chilakapati

I built BioSentinel because I wanted to work at the intersection of machine learning, biomedical data, and security. The idea that one sensor reading could be the difference between a safe drug batch and a compromised one is what got me started.

The design principle is honesty over polish. Every detector is motivated by the failure of the one before it. The theoretical recall ceiling is documented. The hyperparameter tuning is reproducible with one command. The detectors that failed — Hampel and LOF — are included and explained rather than quietly dropped.

If you are a researcher or student working with biomedical time-series data, this codebase is open, readable, and built to be extended.

Contact View on GitHub

Machine Learning Biocybersecurity Anomaly Detection IsolationForest Python scikit-learn Time-Series Feature Engineering Hyperparameter Tuning AUC-PR NAB Benchmark Open Source

"The recall ceiling is not a failure. It is a result. Knowing exactly why a detector cannot exceed 61.2% recall — and then engineering a feature that breaks through it — is more valuable than a black-box model claiming 95%."

Catching Failures in Before They Cause Harm

When Sensors Fail,Patients Pay the Price