Every data infrastructure has a moment where it stops being adequate. For us, that moment arrived around the 400 million row mark on QuestDB. This is the story of what broke, why we chose ClickHouse as the replacement, how the migration went, and what we learned building a financial time-series pipeline that now handles 2.5 billion rows across 56,000+ symbols from 13 data sources worldwide.
None of this was clean. Real migrations never are. But the lessons are applicable to any team building on time-series data at scale, financial or otherwise.
Where We Started: QuestDB at 400M Rows
QuestDB was an excellent early choice. The SQL interface was familiar, ingestion throughput was solid via the InfluxDB line protocol, and the time-series-native design meant that common market data queries ("give me all bars for AAPL between date X and date Y") were fast without our needing to think about indexing.
The problems emerged in stages as the dataset grew:
- Partition proliferation: At scale, QuestDB's partition management became fragile. We hit a state where over 2,800 garbage partitions had accumulated from incomplete writes, partial ingestions, and orphaned partition files. Querying across them was slow and occasionally incorrect.
- WAL pitfalls: QuestDB's Write-Ahead Log tables have a critical gotcha: a DROP TABLE on an active WAL table drops partition data before the WAL is fully replayed. We lost data this way once, during what seemed like a routine dedup operation. The fix was to never use CTAS+DROP on WAL tables, but this was not obvious from the documentation.
- Query performance degradation: Complex analytical queries (multi-symbol aggregations, cross-asset joins, percentile computations) became progressively slower as the row count grew. The single-node architecture was hitting its ceiling.
- Compression: QuestDB's default compression was adequate but not exceptional. At 400M rows, disk usage was already significant, and our growth trajectory made the cost of storage a real consideration.
The inflection point was when we started seeing query timeouts on the Avo platform for analytical endpoints. Users requesting screener results or multi-symbol comparisons were hitting 30-second+ response times. That was the migration trigger.
Choosing ClickHouse
The evaluation shortlist was ClickHouse, TimescaleDB, and Apache Druid. Each has legitimate strengths; we chose ClickHouse for three reasons.
Compression. ClickHouse's columnar storage with LZ4 compression delivered 5-10x better compression ratios than our QuestDB baseline on typical OHLCV financial data. For a dataset that was growing by 100M+ rows per month, the difference between 200GB and 20-40GB for the same data was not trivial. We measured this directly on representative samples before committing to the migration.
Query performance. ClickHouse's columnar architecture is specifically optimized for the analytical queries that matter in market research: aggregations across millions of rows, percentile calculations, multi-symbol GROUP BY operations. Our benchmark queries ran 2-5x faster on ClickHouse than on QuestDB at equivalent data volumes.
Ecosystem maturity. ClickHouse has a large, production-tested community and extensive documentation on exactly the use cases we care about. The Rust client library (clickhouse-rs) was stable and well-maintained. The HTTP interface made integration with our Next.js API layer straightforward.
Lesson 1: Compression Is Not Optional. It Is a Design Constraint
The 5-10x compression improvement sounds like a nice-to-have until you do the math. At 2.5 billion rows of OHLCV data with timestamps, symbols, and associated metadata, uncompressed storage would require hundreds of gigabytes on a single-server deployment. With ClickHouse's LZ4HC compression, our entire dataset fits in approximately 14-15GB.
The key insight is that financial market data compresses exceptionally well. Price data is highly repetitive: most symbols have many sessions with similar ranges. The columnar storage model stores all open prices together, all close prices together, all volumes together . which makes the patterns within each column highly compressible.
In practice, this means:
- Use CODEC(LZ4HC(9)) on price and volume columns for maximum compression at acceptable CPU cost.
- Use CODEC(Delta, LZ4) on timestamp columns: delta encoding on monotonically increasing timestamps dramatically improves compression.
- Use LowCardinality(String) for symbol columns. Most queries filter on a small subset of the possible symbol values, and LowCardinality enables dictionary encoding that is both smaller and faster.
We did not get all of these right on the first schema design. Retrofitting codec choices onto an existing 2.5B row table requires a full data reload. Design compression into the schema from day one.
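To make the pattern concrete, here is a minimal DDL sketch with those choices applied. The table and column names are illustrative, not our production schema:

```sql
-- Minimal sketch of a bars table using the codec choices above.
-- Table and column names are illustrative, not our production schema.
CREATE TABLE bars_1d
(
    symbol LowCardinality(String),           -- dictionary-encoded symbols
    ts     DateTime64(3) CODEC(Delta, LZ4),  -- delta-encoded timestamps
    open   Float64 CODEC(LZ4HC(9)),
    high   Float64 CODEC(LZ4HC(9)),
    low    Float64 CODEC(LZ4HC(9)),
    close  Float64 CODEC(LZ4HC(9)),
    volume UInt64  CODEC(LZ4HC(9))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)   -- monthly partitions (see Lesson 4)
ORDER BY (symbol, ts);      -- per-symbol time ranges stay contiguous (see Lesson 3)
```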
Lesson 2: Async Inserts Prevent Lock Contention
Our ingest architecture consists of 23+ Rust binaries ingesting data from 13 data sources simultaneously. At peak, this means dozens of concurrent write streams hitting the database. Synchronous inserts at this concurrency level cause lock contention, insert queue saturation, and periodic ingestion failures that are difficult to debug.
ClickHouse's async insert mode (enabled with async_insert=1, wait_for_async_insert=0) buffers small inserts in memory and flushes them as larger batches. This transforms dozens of tiny concurrent inserts into periodic bulk flushes, dramatically reducing merge pressure and eliminating the lock contention that was causing our ingestion failures.
The tradeoff: async inserts introduce a small latency between data arriving and data being queryable (typically 1-2 seconds at our flush interval settings). For a financial intelligence platform where most data is historical, this is completely acceptable. For use cases requiring sub-second write-to-read latency, synchronous inserts with careful batching are still the right approach.
Configuration that worked well for us:
    async_insert = 1
    wait_for_async_insert = 0
    async_insert_max_data_size = 10000000   # 10MB flush threshold
    async_insert_busy_timeout_ms = 1000     # 1s max wait
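The same settings can also be applied per query rather than in the server profile. A minimal sketch of the per-query form, with a hypothetical table name:

```sql
-- Per-query form of the same settings (hypothetical table name).
-- The INSERT returns once the data is buffered; the actual flush happens later.
INSERT INTO bars_1m
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES ('AAPL', '2024-06-03 14:30:00', 194.10, 194.35, 194.02, 194.28, 182345);
```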
Lesson 3: ReplacingMergeTree Is Your Dedup Strategy. Use It Correctly
Financial data has an inherent deduplication problem. Exchange feeds occasionally resend bars. Backfill operations may overlap with live ingestion. API retries on failure can insert the same record twice. Without a dedup mechanism, a table that should have one bar per symbol per timestamp ends up with two or three, which silently corrupts aggregation queries.
ClickHouse's ReplacingMergeTree engine handles this via background deduplication: when two rows have the same primary key, the merge process keeps the most recent version (or the highest version number if you specify a version column). This is exactly what financial data ingestion requires.
Critical caveats that took us time to learn:
- Merges are not immediate. Deduplication happens in the background, on ClickHouse's schedule. In the window between insert and merge completion, duplicate rows are visible. Queries that require exact dedup must use the FINAL modifier or argMax() aggregation patterns (sketched after this list). The FINAL modifier forces a merge on read, which is correct but expensive at scale.
- Primary key ordering matters enormously. The primary key in ClickHouse is not a uniqueness constraint; it is the sort order of the data on disk, which determines query performance. For time-series data, (symbol, ts) ordering is typically correct: it collocates all data for a given symbol, making per-symbol time-range queries fast.
- Partition by time, not by symbol. Partitioning by toYYYYMM(ts) keeps partition counts manageable as data grows. Partitioning by symbol creates one partition per symbol, and 56,000+ partitions is a filesystem nightmare.
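For the first caveat, the two read-time dedup patterns look roughly like this; the table name is hypothetical and inserted_at stands in for an ingestion-time version column:

```sql
-- Option 1: FINAL forces merge-on-read; exact, but expensive on large scans.
SELECT *
FROM bars_1d FINAL
WHERE symbol = 'AAPL' AND ts >= '2024-01-01';

-- Option 2: argMax keeps the latest version per (symbol, ts) without FINAL.
-- inserted_at is a hypothetical ingestion-time version column.
SELECT
    symbol,
    ts,
    argMax(close, inserted_at)  AS close,
    argMax(volume, inserted_at) AS volume
FROM bars_1d
WHERE symbol = 'AAPL' AND ts >= '2024-01-01'
GROUP BY symbol, ts;
```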
Lesson 4: Partition Management Is Ongoing Maintenance
One of the most operationally expensive lessons of the QuestDB era was discovering 2,810 garbage partitions from incomplete writes and abandoned ingestion jobs. ClickHouse's partition model is more robust (parts, the sub-partition unit, are atomic and either complete or rolled back), but partition management still requires active attention.
In production, we run nightly partition health checks that verify:
- Row counts per partition are within expected bounds (catches silent ingestion failures)
- No corrupt timestamps (we found rows with year-2299 timestamps from a downstream bug, easily caught by checking MAX(ts) per table)
- TTL enforcement is running (our bars_1m table has a 2-year TTL to prevent unbounded growth of the high-frequency data)
- Merge backlog is not growing (a growing parts count without corresponding merge activity indicates a problem)
We built a Rust binary (argus-validate-ch) that runs these checks on a schedule and alerts if any threshold is breached. The 30 minutes spent building that binary has saved hours of debugging time.
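The checks themselves reduce to simple queries against ClickHouse's system tables and our own tables. Two of them, sketched with hypothetical database and table names:

```sql
-- Active parts and row counts per partition; a rising parts count with
-- flat merge activity is the early warning we alert on.
SELECT partition, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE database = 'market' AND table = 'bars_1m' AND active
GROUP BY partition
ORDER BY partition;

-- Corrupt-timestamp check: anything far in the future points at a bad ingest.
SELECT max(ts) AS max_ts
FROM bars_1m;
```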
Real Numbers: Before and After
Concrete figures from the migration, based on measurements at equivalent data volumes (400M rows):
| Metric | QuestDB | ClickHouse |
|---|---|---|
| Storage (400M rows) | ~180 GB | ~22 GB |
| Single-symbol range query (p95) | ~120ms | ~35ms |
| Multi-symbol aggregation (1000 symbols, 1Y) | 28s | 6s |
| Concurrent ingest streams (stable) | ~8 | 23+ |
| Partition maintenance incidents (90 days) | 4 | 0 |
At 2.5 billion rows today, total storage is approximately 14.19 GB. That is the compression advantage compounding at scale.
The Migration Process
We ran QuestDB and ClickHouse in parallel for approximately 3 weeks during the migration. This dual-write period was operationally expensive but essential: it gave us confidence in the ClickHouse data quality before cutting over production read traffic.
The migration sequence was:
1. Schema design and validation on ClickHouse with a 10M row sample
2. Build a Rust binary to read from the QuestDB HTTP API and write to ClickHouse via HTTP bulk insert
3. Migrate historical data in monthly partitions (the slowest part: ~72 hours total)
4. Enable dual-write on all ingest binaries (write to both simultaneously)
5. Run row count and spot-check validation scripts for 1 week
6. Shift all read traffic to ClickHouse
7. Maintain QuestDB as a 72-hour fallback read replica for 2 weeks
8. Decommission QuestDB
The validation step was the most important. We wrote scripts that sampled random (symbol, date) combinations and compared OHLCV values between the two systems. Any discrepancy triggered an alert. We found and fixed three subtle ingestion bugs during this phase that we would not have caught any other way.
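For the row-count half of that validation, a query along these lines covers the ClickHouse side (table name hypothetical); an equivalent aggregation in QuestDB's dialect produces output to diff month by month:

```sql
-- Monthly row counts on the ClickHouse side (hypothetical table name).
-- The equivalent aggregation on the QuestDB side gives the comparison baseline.
SELECT toYYYYMM(ts) AS month, count() AS rows
FROM bars_1d
GROUP BY month
ORDER BY month;
```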
What We Would Do Differently
In roughly priority order:
Design for ClickHouse from the start. The migration cost was real. With the benefit of hindsight, we would have validated ClickHouse at 50M rows and committed to it earlier. The QuestDB era was not wasted (we learned a lot about our data model), but the migration itself was pure overhead.
Define partition strategy before writing any data. Partition strategy changes after the fact require full data reloads. Get it right on the first schema: PARTITION BY toYYYYMM(ts) for monthly partitions, TTL configured from day one, codec choices locked in.
Build the validation harness first. We built the data validation tooling reactively, during the migration. It should be the first thing built, running continuously from the moment the first row is ingested.
Use materialized views for derived aggregations earlier. Many of our API endpoints compute aggregations on the fly from the raw bars tables. Several of these could be pre-computed as materialized views and updated incrementally. We started doing this for the highest-traffic queries and the latency improvements were significant.
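For context, the general shape of such a view, assuming minute bars rolling up into daily aggregates; all names here are a hypothetical sketch, not our production definitions:

```sql
-- Hypothetical incremental rollup: minute bars -> daily OHLCV aggregates.
-- Each insert into bars_1m updates the partial aggregate states below.
CREATE MATERIALIZED VIEW bars_1d_rollup
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (symbol, day)
AS
SELECT
    symbol,
    toDate(ts)             AS day,
    argMinState(open, ts)  AS open,    -- first open of the session
    maxState(high)         AS high,
    minState(low)          AS low,
    argMaxState(close, ts) AS close,   -- last close of the session
    sumState(volume)       AS volume
FROM bars_1m
GROUP BY symbol, day;
```

Reads from the view then use the matching -Merge combinators (argMinMerge, maxMerge, sumMerge, and so on) grouped by symbol and day.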
Instrument everything before you need it. When ingestion problems happen at 2.5 billion rows, debugging requires knowing exactly which binary was writing to which table at what rate. We now have per-binary, per-table ingestion rate metrics in Prometheus. This visibility was missing for the first 12 months and we paid for it in debugging time.
Where We Are Now
The current state: 43 ClickHouse tables, 2.5 billion rows, 14.19 GB total storage, sub-100ms query times on the vast majority of Avo platform endpoints, and a 23-binary Rust ingest pipeline running continuously across 13 data sources. The infrastructure is stable. The engineering attention has shifted from “making the database not fall over” to “building more sophisticated intelligence on top of the data.”
That is the right kind of problem to have.
The lessons are not specific to finance. Any team building an analytics product on time-series data at scale will hit the same inflection points: storage costs, query performance degradation, concurrent write contention, deduplication complexity. The solutions (columnar storage with aggressive compression, async inserts, merge-tree dedup engines, partition discipline) apply broadly.
Start with the right database for where you are going, not just where you are.