Every signal you see on Avo, from regime detections and anomaly flags to correlation scores and pattern matches, is the output of a data pipeline processing 2.5 billion rows of market history. This is how that pipeline actually works: the database choices, the ingest architecture, the trade-offs we made, and the numbers behind the system.
The Storage Layer: Why ClickHouse
When you're storing time-series market data at scale, database choice is the most consequential architectural decision you make. We evaluated several options before settling on ClickHouse as our primary OLAP store.
The core data: 713 million minute bars and 38 million daily bars across 56,000+ symbols, stretching back to 2010 for US equities, with varying depth for global markets and crypto. Each row carries timestamp, symbol, open, high, low, close, volume, and a handful of computed fields. That's roughly 2.5 billion rows total when you include signals, outcomes, regime states, and derived tables.
ClickHouse handles this well for several reasons. It is a columnar database: instead of storing rows together, it stores each column separately. For time-series queries that aggregate one or two fields across millions of rows (typical for signal generation and backtesting), this is dramatically faster than row-oriented storage. A query scanning 100 million rows for close prices never touches volume, open, or high data on disk.
Compression is the other major benefit. Market data is highly compressible: price series for a single symbol change slowly relative to their absolute values (high locality for delta encoding), and adjacent timestamps are nearly identical. ClickHouse's LZ4HC compression on columnar data achieves 5-10x compression ratios compared to raw storage. Our 2.5 billion rows occupy roughly 14 GB on disk rather than the 100+ GB you might expect from a row-oriented database without compression.
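To make the columnar and codec discussion concrete, here is a sketch of what a minute-bar table can look like in ClickHouse DDL, carried as a Rust string constant the way a schema-management binary might embed it. The column list, codecs, and partitioning are illustrative choices for this post, not a verbatim copy of the production schema.

```rust
// Illustrative ClickHouse DDL for a minute-bar table, embedded as a Rust
// string constant. The codec and partitioning choices are examples of the
// delta-encoding + LZ4HC idea described above, not the exact production schema.
const CREATE_BARS_1M: &str = r#"
CREATE TABLE IF NOT EXISTS bars_1m
(
    ts     DateTime               CODEC(DoubleDelta, LZ4HC), -- adjacent timestamps delta-encode to almost nothing
    symbol LowCardinality(String),                           -- ~56K distinct symbols: dictionary-encoded
    open   Float64                CODEC(Gorilla, LZ4HC),
    high   Float64                CODEC(Gorilla, LZ4HC),
    low    Float64                CODEC(Gorilla, LZ4HC),
    close  Float64                CODEC(Gorilla, LZ4HC),     -- slow-moving prices XOR-compress well
    volume Float64                CODEC(LZ4HC)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (symbol, ts)
"#;

fn main() {
    // In practice this would be sent to ClickHouse (e.g. over its HTTP
    // interface); printing it keeps the sketch self-contained.
    println!("{CREATE_BARS_1M}");
}
```

The sorting key does quiet work here: with ORDER BY (symbol, ts), a per-symbol scan reads only the granules containing that symbol instead of the whole table.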
Query performance is where ClickHouse earns its reputation. A typical signal generation query (scanning 90 days of minute bars for 500 symbols, computing rolling statistics, filtering on conditions) runs in under 100 milliseconds. The same query on PostgreSQL at this data volume would take minutes. That performance is what lets the Avo UI serve real-time analytics without pre-computation.
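For a feel of what those queries look like, here is an illustrative rolling-statistics scan: a 390-bar (roughly one US trading day) rolling mean and z-score over 90 days of minute bars, filtered on a threshold. The table and column names follow the schema sketch above, the threshold is arbitrary, and execution is shown through ClickHouse's HTTP interface assuming the reqwest crate with its blocking feature.

```rust
// Sketch of a signal-generation-shaped query: a 390-bar rolling mean and
// z-score over ~90 days of minute bars, filtered on a threshold. Table and
// column names follow the illustrative schema above; the threshold is
// arbitrary. Execution goes through ClickHouse's HTTP interface (port 8123 by
// default), assuming the reqwest crate with its "blocking" feature enabled.
const ROLLING_ZSCORE_SQL: &str = r#"
SELECT symbol, ts, close, rolling_mean,
       (close - rolling_mean) / nullIf(rolling_std, 0) AS zscore
FROM
(
    SELECT
        symbol, ts, close,
        avg(close)       OVER w AS rolling_mean,
        stddevPop(close) OVER w AS rolling_std
    FROM bars_1m
    WHERE ts >= now() - INTERVAL 90 DAY
      -- plus a symbol-universe filter (e.g. symbol IN (...)) in practice
    WINDOW w AS (PARTITION BY symbol ORDER BY ts ROWS BETWEEN 389 PRECEDING AND CURRENT ROW)
)
WHERE abs(zscore) > 2.0
FORMAT TSV
"#;

fn main() -> Result<(), reqwest::Error> {
    let body = reqwest::blocking::Client::new()
        .post("http://localhost:8123/")
        .body(ROLLING_ZSCORE_SQL)
        .send()?
        .text()?;
    println!("{body}");
    Ok(())
}
```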
The Ingest Layer: Rust Binaries
Getting data into the system is a different engineering challenge than querying it. Avo monitors 56,000+ symbols across 13 data sources. For real-time coverage, that means maintaining WebSocket connections to exchange data feeds, parsing streaming JSON, computing derived fields, and writing to ClickHouse with minimal latency, continuously and without dropping data.
We built the ingest layer in Rust. This was not premature optimization: market data ingest has specific characteristics that make Rust a genuine fit rather than an over-engineered choice.
First, reliability. A Python ingest daemon that crashes silently loses data you can never recover. Rust's type system and exhaustive, Result-based error handling force you to handle every failure path explicitly. Our ingest binaries have defined behavior for every edge case: WebSocket disconnects trigger reconnection with exponential backoff, write failures queue to an in-memory buffer and retry, and corrupt messages are logged and skipped rather than crashing the process.
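In code, that behavior reduces to a loop like the sketch below: reconnect with capped exponential backoff, log and skip anything that fails to parse, and never let a single bad frame take the process down. It assumes the tokio, tokio-tungstenite, futures-util, and serde_json crates; the URL is a placeholder and handle_text stands in for the real parse-compute-write path.

```rust
// Sketch of the reconnect / log-and-skip behavior described above. Assumes the
// tokio, tokio-tungstenite, futures-util, and serde_json crates; the URL is a
// placeholder and `handle_text` stands in for the real parse-compute-write path.
use futures_util::StreamExt;
use std::time::Duration;

fn handle_text(raw: &str) -> Result<(), serde_json::Error> {
    // A corrupt message surfaces as an Err here and is skipped by the caller.
    let _msg: serde_json::Value = serde_json::from_str(raw)?;
    // ...compute derived fields and enqueue a ClickHouse write...
    Ok(())
}

async fn run_feed(url: &str) {
    let mut backoff = Duration::from_secs(1);
    loop {
        match tokio_tungstenite::connect_async(url).await {
            Ok((mut ws, _response)) => {
                backoff = Duration::from_secs(1); // reset after a successful connect
                while let Some(frame) = ws.next().await {
                    match frame {
                        Ok(msg) if msg.is_text() => {
                            if let Ok(text) = msg.to_text() {
                                if let Err(e) = handle_text(text) {
                                    // Log and skip; never crash the daemon on bad input.
                                    eprintln!("skipping corrupt message: {e}");
                                }
                            }
                        }
                        Ok(_) => {} // pings, pongs, binary frames: ignored in this sketch
                        Err(e) => {
                            eprintln!("websocket error, will reconnect: {e}");
                            break;
                        }
                    }
                }
            }
            Err(e) => eprintln!("connect failed: {e}"),
        }
        tokio::time::sleep(backoff).await;
        backoff = (backoff * 2).min(Duration::from_secs(64)); // capped exponential backoff
    }
}

#[tokio::main]
async fn main() {
    run_feed("wss://example-exchange.invalid/ws").await; // placeholder URL
}
```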
Second, performance at the tail. With 10 WebSocket daemons running simultaneously (covering OKX, Kraken, Binance, Bybit, Coinbase, Deribit, and several equity feeds), peak ingest moments involve hundreds of concurrent messages. Rust handles this without the garbage collection pauses or memory allocation overhead that would cause latency spikes in a managed runtime.
The current ingest toolbelt is 58 binaries: 23 data downloaders, 15 signal generators, 8 analytics tools, and 12 utility binaries for maintenance, validation, and backfill. Each binary does one thing: one exchange, one data type, one transformation. Composition happens via Unix pipes and systemd services rather than monolithic processes.
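The style is easiest to show as a stripped-down filter binary: read newline-delimited records on stdin, do exactly one job, write to stdout, and let the shell or a systemd unit handle composition. The binary names in the example pipeline are hypothetical.

```rust
// Minimal illustration of the one-binary-one-job style: read newline-delimited
// records on stdin, apply a single transformation, write to stdout. The binary
// names in the example pipeline are hypothetical.
//
//   okx-trades-ws | normalize-trades | clickhouse-writer
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = io::stdout().lock();
    for line in stdin.lock().lines() {
        let line = line?;
        // One job per binary; here the "transformation" is just dropping blank lines.
        if !line.trim().is_empty() {
            writeln!(out, "{line}")?;
        }
    }
    Ok(())
}
```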
Hot State: Redis at 685K Keys
ClickHouse is fast for analytics, but it is not designed for sub-millisecond random access to current state. For hot data (the current regime for each symbol, the latest signal scores, recent price ticks) we use Redis.
The Redis keyspace holds 685,000 keys at any given time. The majority are regime state keys: one per symbol per timeframe, updated as regime detection runs. These keys carry a 24-hour TTL: regime data older than a day is stale and should be recomputed rather than served from cache.
API response times are where Redis earns its keep. Without caching, the Avo market regime API endpoint requires a ClickHouse query across thousands of recent rows to compute current regime state. With Redis, the cached regime lookup itself takes 2-5 milliseconds. Measured end to end, the cold-cache response time was 1,048 milliseconds and the warm-cache (Redis hit) response is 79-90 milliseconds, roughly a 13x improvement on one of the most queried API routes.
A background cache warmer runs continuously, pre-populating Redis with the most-accessed symbols before requests arrive, so most real-world API calls hit a warm cache; the cold-cache path is rare in production.
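Put together, this is an ordinary read-through cache plus a warming loop: check Redis, fall back to the ClickHouse path on a miss, write the result back with the 24-hour TTL, and have the warmer exercise the same write path ahead of demand. A minimal sketch assuming the redis crate; the key format and the recompute stub are illustrative, not the production code.

```rust
// Read-through cache plus warmer sketch for regime state. Assumes the `redis`
// crate; the key format and `recompute_regime_from_clickhouse` are
// illustrative stand-ins, not the production code.
use redis::Commands;

fn recompute_regime_from_clickhouse(symbol: &str, timeframe: &str) -> String {
    // Placeholder for the ClickHouse query path (the ~1s cold-cache case).
    format!(r#"{{"symbol":"{symbol}","timeframe":"{timeframe}","regime":"unknown"}}"#)
}

fn regime_state(con: &mut redis::Connection, symbol: &str, timeframe: &str) -> redis::RedisResult<String> {
    let key = format!("regime:{symbol}:{timeframe}"); // one key per symbol per timeframe
    // Warm path: a few milliseconds.
    if let Some(cached) = con.get::<_, Option<String>>(&key)? {
        return Ok(cached);
    }
    // Cold path: recompute, then cache with a 24-hour TTL so stale regimes
    // age out instead of being served indefinitely.
    let fresh = recompute_regime_from_clickhouse(symbol, timeframe);
    let _: () = con.set_ex(&key, &fresh, 60 * 60 * 24)?;
    Ok(fresh)
}

fn warm_cache(con: &mut redis::Connection, hot_symbols: &[&str]) -> redis::RedisResult<()> {
    // Background warmer: refresh the most-accessed symbols ahead of demand so
    // real-world requests almost always take the warm path.
    for symbol in hot_symbols {
        let fresh = recompute_regime_from_clickhouse(symbol, "1d");
        let _: () = con.set_ex(format!("regime:{symbol}:1d"), fresh, 60 * 60 * 24)?;
    }
    Ok(())
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    warm_cache(&mut con, &["AAPL", "NVDA", "BTC-USDT"])?;
    println!("{}", regime_state(&mut con, "AAPL", "1d")?);
    Ok(())
}
```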
Exchange Coverage: 13 Data Sources, 56K Symbols
Coverage breadth is one of Avo's core data advantages. Most data providers offer comprehensive US equity coverage but limited international exposure. Avo covers:
- US equities: all NYSE- and NASDAQ-listed securities, updated via EOD and intraday feeds.
- Global equities: 11,706 India symbols (BSE and NSE), major European exchanges, and Asia-Pacific markets.
- Crypto: 17,000+ trading pairs across 10 exchanges, including OKX (our primary real-time source for 1,210 pairs), Kraken, Coinbase, Binance, Bybit, and Deribit.
- Macro: 32 of 63 critical FRED series with automated daily updates.
- Forex and commodities: major pairs and futures contracts via Yahoo Finance and direct exchange feeds.
Managing 13 data source connections means managing 13 sets of quirks. Every exchange has different WebSocket message formats, different rate limits, different reconnection behavior, and different symbol naming conventions. The ingest layer normalizes everything into a common schema before writing to ClickHouse; the analytics layer never sees exchange-specific formats.
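Normalization itself is mostly unglamorous mapping: every exchange-specific message becomes one canonical row type before anything is written, so the analytics layer only ever sees that shape. A simplified sketch, with serde assumed and stand-in field names rather than any real feed format:

```rust
// Sketch of the normalize-before-write idea: each exchange-specific message is
// mapped onto one canonical row type, and only that type reaches ClickHouse.
// Assumes serde and serde_json; the exchange-side field names are simplified
// stand-ins, not any real feed format.
use serde::Deserialize;

/// Canonical row shape shared by every data source.
#[derive(Debug)]
struct Tick {
    symbol: String, // normalized symbol, e.g. "BTC-USDT"
    ts_ms: u64,     // event timestamp, milliseconds UTC
    price: f64,
    size: f64,
}

/// Simplified stand-in for one exchange's trade message.
#[derive(Deserialize)]
struct ExchangeTradeMsg {
    inst_id: String, // venue-specific instrument id, e.g. "BTC-USDT-SWAP"
    ts: String,      // some feeds send numbers as strings
    px: String,
    sz: String,
}

fn normalize(raw: &str) -> Result<Tick, Box<dyn std::error::Error>> {
    let msg: ExchangeTradeMsg = serde_json::from_str(raw)?;
    Ok(Tick {
        // Symbol naming conventions differ per venue; map them onto one scheme.
        symbol: msg.inst_id.trim_end_matches("-SWAP").to_string(),
        ts_ms: msg.ts.parse()?,
        price: msg.px.parse()?,
        size: msg.sz.parse()?,
    })
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = r#"{"inst_id":"BTC-USDT-SWAP","ts":"1700000000000","px":"37000.5","sz":"0.25"}"#;
    println!("{:?}", normalize(raw)?);
    Ok(())
}
```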
Why Not Postgres, TimescaleDB, or InfluxDB
We evaluated alternatives before choosing this stack. The short version:
PostgreSQL: excellent for transactional workloads, wrong for analytical queries at 2.5 billion rows. MVCC overhead, row-oriented storage, and the query planner's behavior on large table scans make it unsuitable for signal generation at our data volume. We still use Postgres for user data, subscriptions, and transactional records; it is the right tool for that layer.
TimescaleDB: a reasonable middle ground that extends Postgres with time-series optimizations. We found it faster than vanilla Postgres but still 3-5x slower than ClickHouse on our specific query patterns. The operational familiarity of Postgres is a genuine advantage, but not enough to offset the performance gap.
InfluxDB: purpose-built for time-series but optimized for simple metric ingestion (server monitoring, IoT telemetry). Its query language and data model are poorly suited for the multi-symbol, multi-field analytical queries that power signal generation. ClickHouse's SQL dialect handles these naturally.
QuestDB: we evaluated and briefly used QuestDB as an earlier store. It offers excellent ingest performance and a good SQL dialect, but WAL table behavior (specifically the inability to safely drop partitions from active WAL tables) caused data reliability issues during our migration. ClickHouse's partition management is more predictable at scale.
The Numbers
To make it concrete:
- 2.5 billion total rows across all tables
- 713 million minute bars (bars_1m), covering 2010-present for US equities
- 38 million daily bars (bars_1d) across all covered symbols
- 56,000+ symbols across 13 data sources
- 14 GB total on-disk storage after LZ4HC compression
- sub-100ms query times for typical signal generation queries
- 685,000 Redis keys for hot state
- 10 WebSocket daemons running concurrently for real-time ingest
- 58 Rust binaries in the ingest/analytics toolbelt
Every regime detection, correlation score, anomaly flag, and pattern match you see on Avo is computed against this data infrastructure in real time. The UI is the front door; this is what powers it.