There is a genre of startup writing that presents the journey as a clean narrative arc: insight, execution, traction, success. The reality of building a data-intensive market intelligence platform is messier, more iterative, and more instructive precisely because of the mess. This is an honest account of Avo's first 90 days.
Day 1: The Data Problem
The thesis behind Avo was simple: the data that institutional traders use to make decisions (real-time market data across thousands of symbols, macro context, cross-asset correlations, regime classification) is technically available to anyone but practically inaccessible. Bloomberg Terminal charges $24,000 per seat per year. Alternatives are either too narrow, too expensive, or too raw to be useful without significant engineering.
The first problem was data collection. To build anything useful, we needed historical data at scale: years of daily bars across global equities, ETFs, crypto, forex, commodities, and indices. The free data APIs block programmatic access from server IPs. The paid APIs are expensive at scale. The research data vendors want enterprise contracts.
We built our own data collection infrastructure. Rust-based WebSocket daemons for real-time crypto market data from OKX, Kraken, Kucoin, Coinbase, Binance, and others. Historical downloaders for Yahoo Finance, with proxy rotation to handle rate limits. FRED API integration for macroeconomic data. The collection layer ended up as 23 Rust binaries running as system services.
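To make the shape of those daemons concrete, here is a minimal sketch of a real-time collector, assuming the tokio, tokio-tungstenite, futures-util, serde_json, and anyhow crates. The OKX endpoint and subscribe payload follow the exchange's public docs as we understand them, not our production code, so treat them as illustrative:

```rust
// Minimal sketch of a trade-stream collector: connect, subscribe, stream frames.
// Requires tokio-tungstenite with a TLS feature enabled for wss:// URLs.
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Illustrative OKX public WebSocket endpoint.
    let url = "wss://ws.okx.com:8443/ws/v5/public";
    let (ws, _response) = connect_async(url).await?;
    let (mut tx, mut rx) = ws.split();

    // Subscribe to the BTC-USDT trades channel (illustrative channel name).
    let sub = serde_json::json!({
        "op": "subscribe",
        "args": [{ "channel": "trades", "instId": "BTC-USDT" }]
    });
    tx.send(Message::Text(sub.to_string().into())).await?;

    // In production each daemon parses, batches, and writes rows downstream;
    // here we just echo the raw frames and answer pings.
    while let Some(frame) = rx.next().await {
        match frame? {
            Message::Text(txt) => println!("{txt}"),
            Message::Ping(payload) => tx.send(Message::Pong(payload)).await?,
            _ => {}
        }
    }
    Ok(())
}
```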
Day 30: First Numbers, First Problems
By day 30 we had crossed 400 million rows of OHLCV data in QuestDB (our initial database choice) and deployed the first version of the signal detection pipeline. The signals were running. They were also producing too many outputs to be useful.
The first signal engine scanned for volume anomalies and price momentum patterns across the full universe and produced several hundred candidate signals per day. We quickly learned that quantity and quality are inversely correlated in signal output. The candidates needed a quality gate, and the quality gate needed data that we were still building.
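For context on what "volume anomaly" means at this stage, here is a simplified sketch of the kind of check the first engine ran. The 20-bar baseline and 3-sigma cutoff are illustrative, not the production parameters:

```rust
// Flag a bar whose volume sits far above its trailing mean.
fn volume_zscore(history: &[f64], latest: f64) -> Option<f64> {
    if history.len() < 20 {
        return None; // not enough trailing bars to form a baseline
    }
    let n = history.len() as f64;
    let mean = history.iter().sum::<f64>() / n;
    let var = history.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    if std == 0.0 {
        return None;
    }
    Some((latest - mean) / std)
}

fn is_volume_anomaly(history: &[f64], latest: f64) -> bool {
    // A 3-sigma spike becomes a candidate signal; everything below is ignored.
    volume_zscore(history, latest).map_or(false, |z| z > 3.0)
}

fn main() {
    let trailing: Vec<f64> = (0..30).map(|i| 900.0 + f64::from(i % 5) * 50.0).collect();
    println!("anomaly: {}", is_volume_anomaly(&trailing, 5_000.0));
}
```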
We also encountered the first serious infrastructure problem: QuestDB's WAL tables. During a partition management operation, we triggered a bug where DROP TABLE on an active WAL table silently loses partition data. We lost a batch of historical partitions and had to rebuild from source. It took three days. The lesson was expensive and clear: understand your database's edge cases before you depend on it for production data.
The Database Migration Decision
The QuestDB incident, combined with growing query performance concerns as data scaled, pushed us to evaluate alternatives. We spent a week benchmarking TimescaleDB (built on PostgreSQL, row-oriented, strong time-series tooling) against ClickHouse (columnar, analytical, much faster for our query patterns but harder to operate).
For our analytical query patterns (full-universe aggregations, cross-asset correlation queries, signal backtesting), ClickHouse was 3-5x faster at comparable data volumes. We migrated. The migration itself was a week of careful data movement, validation, and schema redesign to take advantage of ClickHouse's MergeTree engine and LZ4HC compression.
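A sketch of what a MergeTree table along those lines looks like, issued here through the clickhouse crate (our choice for this example; any client speaking to ClickHouse's HTTP interface would do). The column set and partitioning are illustrative rather than our exact schema:

```rust
// Create an OHLCV table with LZ4HC column codecs and a (symbol, ts) sort key.
use clickhouse::Client;

const CREATE_BARS: &str = r#"
CREATE TABLE IF NOT EXISTS bars_daily
(
    symbol LowCardinality(String),
    ts     DateTime,
    open   Float64 CODEC(LZ4HC),
    high   Float64 CODEC(LZ4HC),
    low    Float64 CODEC(LZ4HC),
    close  Float64 CODEC(LZ4HC),
    volume Float64 CODEC(LZ4HC)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (symbol, ts)
"#;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::default().with_url("http://localhost:8123");
    client.query(CREATE_BARS).execute().await?;
    Ok(())
}
```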
In retrospect, this was the most important architectural decision we made in the first 90 days. The compression efficiency (14.19 GB for 2.5 billion rows) and query performance characteristics (sub-100ms on full-universe queries) are foundational to what the platform can deliver at the API layer.
Day 60: Pattern Discovery and the Quality Gate
The pattern discovery work started in earnest around day 45. Running systematic backtests across the full historical database, looking for price and volume patterns with documented predictive value, we identified 555 patterns that passed the initial statistical screen: minimum 250 historical instances, minimum 55% win rate on the training set, positive expected value net of a 0.1% transaction cost assumption.
The 555 number is not a marketing figure. It is the result of testing a much larger initial set of candidate patterns and discarding those that did not survive out-of-sample validation. We started with several thousand candidates and eliminated roughly 80% of them.
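A minimal sketch of that screen as a filter. The thresholds are the ones stated above; the struct fields and the expected-value calculation are simplified for illustration:

```rust
// A candidate pattern survives only if it has enough history, a high enough
// win rate, and positive expected value after a flat transaction-cost haircut.
struct PatternStats {
    instances: usize, // historical occurrences in the training window
    win_rate: f64,    // fraction of instances that hit the target
    avg_return: f64,  // mean per-instance return, e.g. 0.004 = 0.4%
}

const MIN_INSTANCES: usize = 250;
const MIN_WIN_RATE: f64 = 0.55;
const TRANSACTION_COST: f64 = 0.001; // 0.1% cost assumption

fn passes_initial_screen(p: &PatternStats) -> bool {
    p.instances >= MIN_INSTANCES
        && p.win_rate >= MIN_WIN_RATE
        && (p.avg_return - TRANSACTION_COST) > 0.0
}

fn main() {
    let candidate = PatternStats { instances: 312, win_rate: 0.58, avg_return: 0.004 };
    println!("passes: {}", passes_initial_screen(&candidate));
}
```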
Simultaneously, the LightGBM model training pipeline came online. Training on historical signal outcomes, the model learned to score candidate signals on a 0-1 probability scale. The score floor of 0.60 was calibrated to eliminate the bottom tier of candidates while retaining real edge. Setting it too high reduced signal volume below useful levels; setting it too low let noise through.
The 6-layer quality gate (ML scoring, pattern matching, leading indicator confirmation, portfolio risk checks, AI reasoning synthesis, and plain-English commentary) was the result of this calibration work. Each layer was added when we found a specific failure mode that the previous layers were not catching.
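Structurally, the gate behaves like a chain of layers, each of which can reject a candidate. A simplified sketch, with only the ML score floor filled in and the remaining layers left as placeholders:

```rust
// A candidate signal is dropped at the first layer that rejects it.
struct Candidate {
    symbol: String,
    ml_score: f64, // model probability, 0.0 - 1.0
}

trait GateLayer {
    fn name(&self) -> &'static str;
    fn passes(&self, c: &Candidate) -> bool;
}

struct MlScoreFloor { floor: f64 }

impl GateLayer for MlScoreFloor {
    fn name(&self) -> &'static str { "ml_score_floor" }
    fn passes(&self, c: &Candidate) -> bool { c.ml_score >= self.floor }
}

fn run_gate(layers: &[Box<dyn GateLayer>], c: &Candidate) -> Result<(), &'static str> {
    for layer in layers {
        if !layer.passes(c) {
            return Err(layer.name()); // rejected: record which layer fired
        }
    }
    Ok(())
}

fn build_gate() -> Vec<Box<dyn GateLayer>> {
    // Pattern matching, leading-indicator confirmation, portfolio risk,
    // AI reasoning, and commentary layers would slot in after the score floor.
    vec![Box::new(MlScoreFloor { floor: 0.60 }) as Box<dyn GateLayer>]
}

fn main() {
    let gate = build_gate();
    let c = Candidate { symbol: "AAPL".into(), ml_score: 0.72 };
    match run_gate(&gate, &c) {
        Ok(()) => println!("{} passed all layers", c.symbol),
        Err(layer) => println!("{} rejected at {layer}", c.symbol),
    }
}
```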
The Stack We Ended Up With
The technology choices that survived the first 90 days:
- Rust: Every data collection daemon, signal computation binary, and batch processing job. 23 binaries in production, total. The performance characteristics are worth the learning curve: sustained throughput at a low memory footprint, no garbage collection pauses.
- ClickHouse: Primary analytical database. 43 tables, 2.5 billion rows, 14.19 GB compressed. Sub-100ms query times on full-universe aggregations. LZ4HC compression, MergeTree with (symbol, ts) sort keys.
- Redis: Cache layer and real-time signal store. 685K+ keys, sub-millisecond latency on cache hits. Regime classifications, latest signal cache, API response caching. 24h TTL on regime keys to bound memory growth (a minimal caching sketch follows this list).
- Next.js: Web platform. Server components for data fetching, edge-compatible API routes, streaming responses. The incremental static regeneration model works well for data that updates on a minute-level cadence.
- Python + Modal: ML training pipeline. Modal for on-demand GPU compute (A10G), Python for feature engineering and model training. The decision to use cloud GPUs for training rather than self-hosting was correct: the cost is low, and the operational overhead of maintaining GPU infrastructure would have been a distraction.
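As referenced in the Redis item above, here is a minimal sketch of how a regime classification lands in the cache with a 24-hour TTL, assuming the redis crate. The key layout is illustrative, not the production schema:

```rust
// SET key value EX 86400 -- the TTL bounds memory growth even if a symbol
// drops out of the universe and is never refreshed.
fn cache_regime(con: &mut redis::Connection, symbol: &str, regime: &str) -> redis::RedisResult<()> {
    let key = format!("regime:{symbol}");
    redis::cmd("SET")
        .arg(&key)
        .arg(regime)
        .arg("EX")
        .arg(86_400)
        .query(con)
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    cache_regime(&mut con, "BTC-USDT", "risk_on")
}
```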
Day 90: Where Things Stand
The production numbers at day 90:
Signal accuracy at the 0.80 confidence threshold: 91.9%, measured on out-of-sample validation data across the full signal history. At the standard 0.65 threshold: 71.4%. These numbers will shift as the model encounters new market conditions; we report rolling accuracy, not static historical figures.
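By "rolling accuracy" we mean accuracy over the most recent resolved signals rather than the whole history. A minimal sketch, with the window size left as an illustrative parameter:

```rust
// Accuracy over the trailing `window` resolved signals (true = hit, false = miss).
fn rolling_accuracy(outcomes: &[bool], window: usize) -> Option<f64> {
    if window == 0 || outcomes.len() < window {
        return None; // not enough resolved signals yet
    }
    let recent = &outcomes[outcomes.len() - window..];
    let hits = recent.iter().filter(|&&hit| hit).count();
    Some(hits as f64 / window as f64)
}

fn main() {
    let outcomes = vec![true, true, false, true, true, false, true, true];
    println!("{:?}", rolling_accuracy(&outcomes, 5));
}
```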
What We Are Still Building
Being honest about where the system is incomplete is part of building in public. Several things are still in progress:
- Macro data coverage: 50.8% of critical FRED series are currently ingested. Target is 100% coverage by mid-May.
- Exchange coverage: 6 of 10 crypto exchange ingests are active. Binance and Bybit real-time feeds are reconnecting after infrastructure changes.
- Order flow microstructure data: L2 order book data for leading indicators is designed but not yet in production.
- Monthly model retraining: the automation pipeline for scheduled model updates is in progress. Currently we retrain manually.
Lessons for Anyone Building Something Similar
Three things we got right that we would do again without hesitation: choosing Rust for the data collection and computation layer (the operational stability and performance have been worth every hour of the learning curve), choosing ClickHouse for the analytical database (it is hard to imagine delivering the query performance we need on a row-oriented alternative at this data volume), and implementing walk-forward validation from the beginning rather than retrofitting it.
One thing we would do differently: start with a narrower symbol universe and expand outward, rather than collecting data for 56,000 symbols from the beginning. The broad universe is a competitive advantage now, but the early infrastructure overhead of managing that many data streams slowed down iteration on the signal layer, which is the core product.
The general principle that held throughout: build the infrastructure to be correct before building it to be fast, and build it to be fast before building it to be feature-rich. A slow system with accurate data and accurate signals is more useful than a fast system with either compromised.
The platform is live
2.5 billion data points, 56,000+ symbols, live signals with documented accuracy. Start with a free trial and see what 90 days of infrastructure work enables.