An ingest pipeline can fail in ways that produce no errors, no crashes, and no alerts. The service stays up. The process keeps running. The database just stops receiving new rows. We discovered this the hard way when Avo's site was displaying equity prices from the previous trading day as “live” data for over 24 hours before we noticed.
The Problem with Process-Level Monitoring
Standard infrastructure monitoring answers: is the process running? Is the service responding to health checks? Are CPU and memory within normal bounds? These checks are necessary but insufficient for data ingest systems.
Ingest pipelines fail silently in several specific patterns:
- Rate limiting from the upstream API: the source responds with 429, the pipeline backs off, and a bug in the backoff logic causes indefinite backoff instead of an eventual retry.
- Authentication token expiry: the API key expired or the OAuth token was not refreshed. The pipeline starts receiving 401 responses and silently drops rows rather than crashing.
- Schema mismatch after an upstream change: the source API changed its response format. The parser fails on the new format, logs an error, and continues without inserting. The service is healthy; the data is stale.
- Network partition: a firewall rule change blocks outbound traffic to a specific API endpoint. TCP connections time out slowly. The pipeline is technically running but producing zero rows.
In each case, the process monitor shows green. The health check returns 200. The only signal that something is wrong is that the data in the database is getting older.
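To make the second pattern concrete, here is a hypothetical sketch of the kind of error handling that produces it. This is illustrative code, not our actual pipeline; it assumes the reqwest crate, and the function name is made up:

// Hypothetical illustration of the silent-drop pattern: once the token
// expires, every fetch hits the non-success branch, the batch is skipped,
// and the process keeps reporting itself as healthy.
async fn fetch_batch(client: &reqwest::Client, url: &str) -> Result<(), reqwest::Error> {
    let resp = client.get(url).send().await?;
    if !resp.status().is_success() {
        // A 401 after token expiry lands here: logged, ignored, no alert.
        eprintln!("upstream returned {}, skipping batch", resp.status());
        return Ok(());
    }
    // ... parse the body and insert rows ...
    Ok(())
}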
Our Specific Failures
When we audited Avo's ingest pipelines, we found several freshness violations across our 10 exchange feeds:
- US equities: daily bar downloader broken, site showed previous-day closing prices as current. Users saw 24+ hour stale data presented as live.
- Funding rates: 48+ hours stale across all exchanges. Service was running but not producing rows.
- Liquidation tracker: 0 rows. Service running but not producing.
- Kucoin and Coinbase feeds: 14 hours stale, services reconnecting in a crash loop.
- Binance and Bybit historical fills: 25-35 days old. Background fill jobs had silently stopped.
None of these failures produced a process crash or an alert. All of them were invisible to our existing monitoring. The only people positioned to notice were the users relying on the data for trading decisions.
The Freshness Monitor Pattern
A freshness monitor is a separate process that queries your database on a schedule and compares the maximum timestamp in each table against the current time. If the gap exceeds a defined threshold, it fires an alert. The monitor has no knowledge of the ingest pipelines themselves. It only looks at the data.
This separation is the key insight: the monitor is fully independent of the pipelines it monitors. A bug in a pipeline cannot also break the monitor. A pipeline crash cannot silence the monitor. The monitor only trusts the database.
-- Freshness check query for ClickHouse
-- Run every 5 minutes via cron or scheduled job
SELECT
    table_name,
    max_ts,
    now() - max_ts AS staleness_seconds,
    CASE
        WHEN now() - max_ts > threshold_seconds THEN 'STALE'
        ELSE 'FRESH'
    END AS status
FROM (
    SELECT 'bars_1m' AS table_name, max(ts) AS max_ts, 300 AS threshold_seconds FROM bars_1m
    UNION ALL
    SELECT 'bars_1d' AS table_name, max(ts) AS max_ts, 86400 AS threshold_seconds FROM bars_1d
    UNION ALL
    SELECT 'funding_rates' AS table_name, max(ts) AS max_ts, 3600 AS threshold_seconds FROM funding_rates
    UNION ALL
    SELECT 'liquidations' AS table_name, max(ts) AS max_ts, 3600 AS threshold_seconds FROM liquidations
    UNION ALL
    SELECT 'ticks_okx' AS table_name, max(ts) AS max_ts, 60 AS threshold_seconds FROM ticks_okx
)
WHERE status = 'STALE'
ORDER BY staleness_seconds DESC;

The thresholds are per-table because different data sources have different update frequencies. Minute bars have a 5-minute threshold (1 missed update = alert). Daily bars have a 24-hour threshold, allowing for the normal overnight gap. Tick data has a 60-second threshold. Funding rates update every 8 hours, so the threshold is 1 hour.
Rust Implementation
We implemented the freshness monitor as a small Rust binary that runs every 5 minutes via PM2. It queries ClickHouse, checks each table against its threshold, and POSTs to a Slack webhook if any table is stale:
use std::time::{SystemTime, UNIX_EPOCH};
use serde::Deserialize;

// Shape of a row returned by the freshness query.
#[derive(Debug, Deserialize)]
struct FreshnessRow {
    table_name: String,
    staleness_seconds: f64,
    max_ts: u64,
}

// One freshness rule: a table and its maximum allowed staleness in seconds.
struct FreshnessConfig {
    table: &'static str,
    threshold_secs: u64,
}

// Per-table thresholds, mirroring the SQL version above.
const CHECKS: &[FreshnessConfig] = &[
    FreshnessConfig { table: "bars_1m", threshold_secs: 300 },
    FreshnessConfig { table: "bars_1d", threshold_secs: 86400 },
    FreshnessConfig { table: "funding_rates", threshold_secs: 3600 },
    FreshnessConfig { table: "liquidations", threshold_secs: 3600 },
    FreshnessConfig { table: "ticks_okx", threshold_secs: 60 },
];

// Compare each table's max(ts) against the wall clock and collect a
// human-readable violation message for every table past its threshold.
// `ClickHouseClient` is the binary's database handle (definition omitted here).
async fn check_freshness(client: &ClickHouseClient) -> Vec<String> {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    let mut violations = Vec::new();
    for check in CHECKS {
        // A failed or empty query falls back to 0, which reads as maximally stale.
        let max_ts = client
            .query_scalar::<u64>(&format!(
                "SELECT toUnixTimestamp(max(ts)) FROM {}",
                check.table
            ))
            .await
            .unwrap_or(0);
        let staleness = now.saturating_sub(max_ts);
        if staleness > check.threshold_secs {
            violations.push(format!(
                "[STALE] {}: {}s stale (threshold: {}s)",
                check.table, staleness, check.threshold_secs
            ));
        }
    }
    violations
}
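Delivery is just an HTTP POST of the violation list. A minimal sketch of that step, assuming the reqwest crate (with its json feature), serde_json, and a standard Slack incoming-webhook URL supplied via an environment variable; the variable name and function name are illustrative, not the exact ones in our binary:

// Sketch: send any violations to a Slack incoming webhook. Called after
// check_freshness() on each 5-minute run; does nothing when all tables are fresh.
async fn alert_slack(violations: &[String]) -> Result<(), reqwest::Error> {
    if violations.is_empty() {
        return Ok(());
    }
    let webhook = std::env::var("SLACK_WEBHOOK_URL").expect("SLACK_WEBHOOK_URL not set");
    let payload = serde_json::json!({
        "text": format!("Freshness monitor:\n{}", violations.join("\n")),
    });
    reqwest::Client::new()
        .post(webhook)
        .json(&payload)
        .send()
        .await?
        .error_for_status()?; // surface non-2xx responses from Slack as errors
    Ok(())
}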
Symbol-Level Freshness

Table-level freshness is necessary but not sufficient. A single active symbol can keep the table's max(ts) fresh while hundreds of other symbols have gone stale. This is common when a subset of exchange feeds is healthy and the rest are failing.
Symbol-level freshness adds a second check: for each table, compute the count of symbols with a max(ts) within the freshness threshold. If the fresh symbol percentage drops below 90%, alert.
-- Symbol-level freshness: what % of symbols have recent data?
SELECT
    countIf(now() - max_ts <= 86400) AS fresh_symbol_count,
    count() AS total_symbol_count,
    fresh_symbol_count / total_symbol_count AS fresh_pct
FROM (
    SELECT symbol, max(ts) AS max_ts
    FROM bars_1d
    GROUP BY symbol
);
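In the monitor binary, this becomes one more check alongside the table-level loop. A minimal sketch, reusing the same hypothetical client wrapper and the 90% cutoff described above; the function name and inlined query are illustrative:

// Sketch: symbol-level freshness for one table, alerting when fewer than 90%
// of symbols have a bar within the last 24 hours.
async fn check_symbol_freshness(client: &ClickHouseClient) -> Option<String> {
    let fresh_pct = client
        .query_scalar::<f64>(
            "SELECT countIf(now() - max_ts <= 86400) / count() \
             FROM (SELECT symbol, max(ts) AS max_ts FROM bars_1d GROUP BY symbol)",
        )
        .await
        .unwrap_or(0.0); // a failed query is treated as fully stale
    if fresh_pct < 0.90 {
        Some(format!(
            "[STALE] bars_1d: only {:.1}% of symbols fresh in the last 24h",
            fresh_pct * 100.0
        ))
    } else {
        None
    }
}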
UI Transparency Layer

Monitoring catches the failure internally. The user still needs to know whether the data they are looking at is real-time or delayed. We added a staleness indicator to every data display on the Avo platform (the mapping logic is sketched after the list below):
- Green dot with 'Live' label: data is within 2x the expected update frequency.
- Yellow dot with 'Delayed Xm' label: data is between 2x and 10x the expected update frequency.
- Red dot with 'Stale': data has not updated in over 24 hours. Displayed with the last update timestamp.
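The mapping itself is a small pure function. A sketch of the decision logic follows; the real indicator lives in the frontend, and the enum, function name, and the handling of the gap between 10x the expected frequency and 24 hours are illustrative assumptions:

// Sketch of the staleness-to-indicator mapping described above. Anything
// between 10x the expected interval and 24 hours is shown as Delayed here;
// the exact cutoff in the product may differ.
enum Indicator {
    Live,                     // within 2x the expected update frequency
    Delayed { minutes: u64 }, // older than 2x the expected update frequency
    Stale,                    // no update in over 24 hours
}

fn classify(staleness_secs: u64, expected_interval_secs: u64) -> Indicator {
    if staleness_secs > 24 * 3600 {
        Indicator::Stale
    } else if staleness_secs <= 2 * expected_interval_secs {
        Indicator::Live
    } else {
        Indicator::Delayed { minutes: staleness_secs / 60 }
    }
}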
This is the Bloomberg Terminal pattern: Bloomberg always shows the last update time next to data. Users of financial data know how to read staleness indicators. Hiding stale data or presenting it as live without disclosure is not acceptable in a finance product.
Results After Deploying the Monitor
- Time-to-detect ingest failures dropped from hours/days to under 5 minutes.
- Found and fixed 6 separate stale pipeline issues in the first week after deployment.
- UI staleness indicators shipped for all 26,000+ symbol data displays.
- Zero user-reported stale data incidents in the 30 days following deployment.
Lessons
Process health and data freshness are different dimensions of system health. A system can pass all process health checks and still be serving stale data to users. Monitoring both is not optional for a data-dependent product.
The freshness monitor pattern is simple enough to build in a few hours and catches failure modes that would otherwise only surface through user complaints. Build it before you ship data to users. We built it after. The order matters.