We built a near-zero-cost downloader that collects 11,706 equity symbols across 19+ global exchanges, replacing $8,000 to $22,000 per month in vendor licensing.
11,706
Total symbols collected
128M+
bars_1d total rows
$8K-22K/mo
Commercial equivalent cost
$40/mo
Actual cost (proxies)
CHAPTER 01
Regime classification and cross-market correlation are only meaningful when the universe of securities is large enough to reveal structural relationships. A signal derived from correlations across US equities alone misses the most important context: what BSE mid-caps and Korean KOSPI blue chips are doing relative to the S&P 500 is often a leading indicator, not a lagging one.
The alternatives we evaluated fell into two categories. Commercial providers charged per-exchange licensing fees totaling $8,000 to $22,000 per month for the universe we needed. Per-symbol API providers did not carry Indian BSE/NSE symbols at all, and their coverage of European mid-caps was spotty beyond five years of history. We needed BSE (5,200+ symbols), NSE (2,500+ symbols), US equities (full OTC + exchange), European markets, and Asian exchanges: 19,000+ symbols in total.
The hard technical constraint was network-layer rate limiting from the data-center IP block. Yahoo Finance's fc.yahoo.com endpoint aggressively rate-limits non-residential IP ranges. From Hetzner's AS24940 network, direct requests would succeed for the first 200 to 400 requests per session before returning HTTP 429 or silently returning 0-row responses. The 0-row response is more dangerous than a 429: it looks like success, writes 0 rows to ClickHouse, and marks the symbol as downloaded.
CHAPTER 02
The downloader runs as a Rust binary with a pool of 50 concurrent HTTP session workers. Each worker maintains its own cookie jar, User-Agent string, and session state. Workers share a work queue stored in ClickHouse: a download_queue table with columns for symbol, exchange, last_fetched timestamp, error count, and next_retry_at. This design means the work queue is durable, queryable, and visible to operational tooling without a separate queue service.
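The queue table described above can be sketched as ClickHouse DDL. Column names come from the text; the types, ordering key, and engine choice are illustrative assumptions, not the production schema:

```sql
-- Sketch of the download_queue table (types and engine are assumptions).
CREATE TABLE download_queue (
    symbol        String,
    exchange      LowCardinality(String),
    last_fetched  DateTime,
    error_count   UInt32,
    next_retry_at DateTime
)
ENGINE = ReplacingMergeTree(last_fetched)
ORDER BY (exchange, symbol);
```

A ReplacingMergeTree keyed on (exchange, symbol) lets each re-fetch upsert the row for that symbol rather than accumulate duplicates, which keeps the queue directly queryable by operational tooling.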
Proxy rotation was the solution to the IP-rate-limiting problem. Each worker session routes through a different proxy, distributing requests across a pool of residential and data-center-residential endpoints. Per-proxy request counts are tracked in Redis with a 60-second sliding window. Workers that exceed 100 requests per proxy per minute are paused and rotated.
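The per-proxy sliding-window accounting can be sketched in Rust. This is an in-memory stand-in for the Redis window described above (the struct and method names are illustrative); a `false` return is the signal to pause and rotate the worker:

```rust
use std::collections::{HashMap, VecDeque};
use std::time::Duration;

/// In-memory sketch of the per-proxy 60-second sliding window
/// (the production counter lives in Redis, shared across workers).
struct ProxyLimiter {
    window: Duration,
    max_requests: usize,
    /// proxy id -> millisecond timestamps of recent requests
    hits: HashMap<String, VecDeque<u64>>,
}

impl ProxyLimiter {
    fn new(window: Duration, max_requests: usize) -> Self {
        Self { window, max_requests, hits: HashMap::new() }
    }

    /// True if `proxy` may take another request at `now_ms`;
    /// false means the worker should pause and rotate.
    fn try_acquire(&mut self, proxy: &str, now_ms: u64) -> bool {
        let q = self.hits.entry(proxy.to_string()).or_default();
        let cutoff = now_ms.saturating_sub(self.window.as_millis() as u64);
        // Drop timestamps that have aged out of the window.
        while q.front().map_or(false, |&t| t < cutoff) {
            q.pop_front();
        }
        if q.len() >= self.max_requests {
            return false;
        }
        q.push_back(now_ms);
        true
    }
}
```

With `ProxyLimiter::new(Duration::from_secs(60), 100)`, the 101st request on a proxy within a minute is rejected while other proxies remain available, matching the 100-requests-per-proxy-per-minute policy above.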
India symbols required a two-phase auth flow. Yahoo Finance uses a crumb token mechanism: the client first hits a page that sets a cookie, then extracts a crumb value from a JSON endpoint, then includes the crumb as a query parameter on every subsequent OHLCV request. The crumb is session-scoped and expires when the cookie expires, roughly 60 minutes. Each worker handles its own crumb lifecycle independently.
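The per-worker crumb lifecycle can be sketched as a small struct. The ~60-minute TTL comes from the text; the endpoint shape and field names are illustrative assumptions, and a real implementation would also percent-encode the crumb:

```rust
/// Sketch of one worker's crumb state (names are illustrative).
struct CrumbSession {
    crumb: String,
    fetched_at_ms: u64,
}

/// The crumb expires with the session cookie, roughly 60 minutes.
const CRUMB_TTL_MS: u64 = 60 * 60 * 1000;

impl CrumbSession {
    /// When this returns true, the worker re-runs the two-phase flow:
    /// hit the cookie-setting page, then fetch a fresh crumb.
    fn is_expired(&self, now_ms: u64) -> bool {
        now_ms.saturating_sub(self.fetched_at_ms) >= CRUMB_TTL_MS
    }

    /// Attach the crumb as a query parameter on the OHLCV request
    /// (URL shape is a sketch; real crumbs may need percent-encoding).
    fn chart_url(&self, symbol: &str) -> String {
        format!(
            "https://query1.finance.yahoo.com/v8/finance/chart/{}?interval=1d&crumb={}",
            symbol, self.crumb
        )
    }
}
```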
ARCHITECTURE OVERVIEW
- SOURCES: Yahoo Finance over HTTP/1.1 with per-session cookie jars (Rust 1.84, Tokio 1.40, 50-concurrent-session pool)
- TRANSFORM: validate + dedup
- STORE: ClickHouse 26.3, partitioned
- QUERY: systemd timer + cache
CHAPTER 03
The most expensive bug was the silent empty response. Yahoo Finance returns HTTP 200 with a JSON body containing a chart.result array. When rate-limited or when the symbol does not exist on that exchange, it returns HTTP 200 with chart.result: null. The first version of the parser treated a null result as equivalent to a legitimate zero-trading-day response. This caused 847 symbols to be marked as downloaded with 0 rows in ClickHouse.
Detection required writing a validation query against the download_queue: symbols with last_fetched in the past day but no rows in bars_1d. That query returned 847 symbols that the downloader believed it had processed but which had no rows. Re-running the downloader for those symbols with explicit null-result detection and a retry flag resolved the gap within 6 hours.
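The validation query can be sketched roughly as follows. Table and column names come from the text; the `NOT IN` form is one of several ways to express this anti-join in ClickHouse:

```sql
-- Symbols marked as fetched in the past day that nevertheless
-- have no rows in bars_1d: candidates for forced re-download.
SELECT symbol
FROM download_queue
WHERE last_fetched >= now() - INTERVAL 1 DAY
  AND symbol NOT IN (SELECT DISTINCT symbol FROM bars_1d);
```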
The crumb token expiry caused a second class of silent failures. When the session's cookie expires, subsequent requests using the old crumb return HTTP 200 with a different error payload. The parser was not checking the error field because the outer HTTP status was 200. The fix added a top-level check for the chart.error field before attempting to parse chart.result.
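Both silent-failure classes reduce to the same classification problem over an already-parsed response. A minimal sketch, assuming the response has been decoded into `chart.result` row count and an optional `chart.error` message (the enum and function names are illustrative):

```rust
/// Outcome of parsing a Yahoo chart response body.
#[derive(Debug, PartialEq)]
enum ChartOutcome {
    Rows(usize),       // legitimate data
    EmptyButValid,     // result present, zero trading days
    SilentFailure,     // HTTP 200 but chart.result is null: retry, never mark done
    ApiError(String),  // chart.error set (e.g. stale crumb) despite HTTP 200
}

/// `result_rows` is Some(n) when chart.result is present with n bars,
/// None when chart.result is null; `error` carries chart.error, if any.
fn classify(result_rows: Option<usize>, error: Option<&str>) -> ChartOutcome {
    match (error, result_rows) {
        // Check chart.error first, before touching chart.result --
        // this ordering is exactly the fix described above.
        (Some(e), _) => ChartOutcome::ApiError(e.to_string()),
        (None, None) => ChartOutcome::SilentFailure,
        (None, Some(0)) => ChartOutcome::EmptyButValid,
        (None, Some(n)) => ChartOutcome::Rows(n),
    }
}
```

The key property is that a null result and a zero-row result map to different variants, so only `EmptyButValid` and `Rows` may mark a symbol as downloaded.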
CHAPTER 04
The 50-worker concurrency limit was set empirically. At 100 workers, the Hetzner host's network interface showed packet loss at roughly 2% during peak download periods. At 50 workers, packet loss dropped to below 0.1% and download throughput held at 1,963 symbols per 7.5 minutes on the European universe, translating to roughly 261 symbols per minute.
A commercial equivalent for the same dataset would run $8,000 to $22,000 per month per our vendor quotes. The downloader itself runs on the same Hetzner host that runs ClickHouse, with proxy costs of approximately $40 per month for the residential proxy pool. Daily incremental updates add 11,706 HTTP requests per run, completing in roughly 45 minutes per day.
CHAPTER 05
DECISION · 01
Chose ClickHouse as the work queue rather than Redis. The tradeoff: writes to the queue table go through ClickHouse's async insert path, which introduces up to 100ms delay before a newly queued symbol becomes visible to workers. What ClickHouse gives us: the queue state is durable, survives host reboots, and is queryable with SQL for operational visibility without additional infrastructure.
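The operational-visibility benefit is plain SQL over the queue; a worker's claim step can be sketched as follows (the batch size and ordering are hypothetical, and column names come from the schema described earlier):

```sql
-- Next batch of due symbols, retry-friendly ordering.
SELECT symbol, exchange
FROM download_queue
WHERE next_retry_at <= now()
ORDER BY error_count ASC, next_retry_at ASC
LIMIT 100;
```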
DECISION · 02
Chose Yahoo Finance over alternative free sources. The tradeoff: Yahoo has no SLA, the auth flow changes without notice, and rate limiting is aggressive from data-center IPs. What it gave us: 19,000+ symbols with 10 to 20 years of history, zero licensing cost, and no usage-based pricing.
DECISION · 03
The India 1-minute bar gap (July to December 2024) is a real coverage gap affecting minute-resolution signals on Indian equities. The daily bar coverage is complete. The intraday gap requires either a different source or accepting the limitation.