91.9% Accuracy at High Confidence: How We Built the Model

The number that gets attention is 91.9%. That is Avo's directional accuracy on signals where model confidence exceeds 0.65. But the number that matters more for understanding how the system works is 56.9%, the overall accuracy across all signals before any confidence filtering. Understanding the gap between those two figures is the whole story.

Why Most ML Trading Systems Fail

Machine learning for stock prediction has a long history of promising backtests and disappointing live performance. The root causes fall into three categories, and they are well-documented enough that we can be direct about them:

Overfitting to historical structure: A model trained on 2010-2020 data implicitly learns the interest rate regime, the volatility regime, and the sector correlations of that era. When macro conditions shift, those embedded assumptions fail silently.
Survivorship bias in the training universe: If you train on "S&P 500 stocks" using today's index composition, you are training on winners. The stocks that were delisted, acquired, or went bankrupt during the training period are missing. The model learns patterns that look predictive but partly just reflect survivorship.
Look-ahead leakage: Subtle data pipeline bugs let future information contaminate training features or labels. This is the most common cause of backtests that cannot be reproduced live. A feature computed from data that was not actually available at prediction time produces unrealistically good training metrics.
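As an illustration of the survivorship point, one common remedy is to build the training universe from index membership as it stood on each date, rather than from today's composition. The sketch below assumes a hypothetical membership table of (symbol, added, removed) rows; the tickers and dates are made up, and this is not necessarily how Avo's pipeline does it.

```python
from datetime import date

def universe_as_of(memberships, as_of):
    """Return the symbols that were index members on `as_of`.

    memberships: iterable of (symbol, added, removed) where `removed`
    is None for current members. Delisted or acquired names are kept
    in the table, so they still appear for dates before their removal.
    """
    return sorted(
        symbol
        for symbol, added, removed in memberships
        if added <= as_of and (removed is None or as_of < removed)
    )

# Hypothetical membership history, for illustration only.
MEMBERSHIPS = [
    ("AAA", date(2005, 1, 3), None),
    ("BBB", date(2008, 6, 2), date(2016, 3, 1)),  # later delisted
    ("CCC", date(2018, 9, 4), None),              # added after 2012
]

print(universe_as_of(MEMBERSHIPS, date(2012, 1, 2)))
print(universe_as_of(MEMBERSHIPS, date(2020, 1, 2)))
```

Training on the 2012 universe includes the name that later disappeared and excludes the one that had not yet joined, which is what a model would actually have seen at the time.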

Avo's model architecture is built around preventing all three. Here is how.

The Feature Set: 49 Inputs

The LightGBM model takes 49 input features. They fall into five families:

Price and volume momentum (12 features): Returns at multiple horizons (1, 5, 10, 20, 60 bars), volume-weighted momentum, price position within the trailing range, and relative strength against the asset class index.
Volatility structure (8 features): Realized volatility at multiple lookbacks, the ratio of recent to long-run vol (vol regime), ATR-normalized price range, and Garman-Klass estimator for intraday variance.
Cross-asset context (11 features): Correlation to the broad market at multiple windows, sector relative strength, currency exposure adjustments for non-USD assets, and lead-lag signals from related instruments.
Macro regime inputs (9 features): Current HMM regime label and confidence, VIX term structure slope, credit spread level, yield curve position, and a composite macro stress indicator derived from FRED series.
Microstructure signals (9 features): Bid-ask spread estimate, volume profile position, order flow imbalance proxy, large-trade indicator, and overnight gap characteristics.
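Two of the features named above can be sketched in a few lines. This is a minimal illustration, assuming bars arrive as plain Python sequences; the function names are hypothetical, and the Garman-Klass expression is the standard textbook form, not a confirmed detail of Avo's implementation.

```python
import math

def horizon_returns(closes, horizons=(1, 5, 10, 20, 60)):
    """Simple return over each horizon, ending at the latest bar.

    Returns None for a horizon when there is not enough history.
    """
    last = closes[-1]
    return {
        h: (last / closes[-1 - h] - 1.0) if len(closes) > h else None
        for h in horizons
    }

def garman_klass_variance(bars):
    """Mean per-bar Garman-Klass variance from (open, high, low, close) tuples.

    Per bar: 0.5 * ln(H/L)^2 - (2 ln 2 - 1) * ln(C/O)^2
    """
    k = 2.0 * math.log(2.0) - 1.0
    total = 0.0
    for o, h, l, c in bars:
        total += 0.5 * math.log(h / l) ** 2 - k * math.log(c / o) ** 2
    return total / len(bars)
```

Both functions look only backward from the latest bar, which keeps them compatible with a point-in-time pipeline.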

Every feature is computed using only data that was available at the prediction timestamp. The data pipeline enforces point-in-time correctness with timestamped ingestion logs: a feature derived from data with a later timestamp than the prediction window is rejected at the pipeline level, not just at the model level.
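A pipeline-level rejection of this kind might look roughly like the following. It is a sketch under assumptions: the `Observation` record, the `LookAheadError` exception, and the function name are all hypothetical, and the real ingestion logs may carry more detail than a single timestamp.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Observation:
    name: str
    value: float
    ingested_at: datetime  # when the pipeline actually received the data

class LookAheadError(ValueError):
    """Raised when an input was not yet available at prediction time."""

def point_in_time_inputs(observations, prediction_time):
    """Return {name: value}, refusing any input ingested after prediction_time."""
    late = [o.name for o in observations if o.ingested_at > prediction_time]
    if late:
        raise LookAheadError(
            f"inputs ingested after {prediction_time.isoformat()}: {late}"
        )
    return {o.name: o.value for o in observations}
```

Raising instead of silently dropping the late input makes a leak visible as a failed run rather than as an unexplained jump in backtest accuracy.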

The 6-Layer Quality Gate

Before any signal reaches a user, it passes six checks:

Layer 1 - Data completeness: All 49 features must be computable. Symbols with data gaps exceeding 5% in the lookback window are excluded.
Layer 2 - Liquidity floor: Minimum 30-day average daily volume threshold by asset class. Illiquid instruments produce unreliable momentum signals.
Layer 3 - Regime consistency: The signal direction must be consistent with the current macro regime. A long signal in a crisis regime is downgraded automatically.
Layer 4 - Model confidence threshold: Raw model output probability must exceed 0.55. Below this, the signal is logged but not surfaced.
Layer 5 - Cross-asset confirmation: For equity signals, sector momentum must not be strongly opposed. A lone-stock breakout against a deteriorating sector is penalised.
Layer 6 - Recency gate: No duplicate signals on the same symbol within 4 bars. This prevents the same structural condition from generating repeated alerts.
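The six checks can be read as a short sequential filter. The sketch below is illustrative only: the `Candidate` fields, the regime label "crisis" as a string, and the per-asset-class `min_volume` value are assumptions, and the way downgrades and penalties are recorded as notes, without changing the probability, is a simplification of whatever the production system does.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    symbol: str
    direction: str               # "long" or "short"
    probability: float           # raw model output
    gap_fraction: float          # share of missing bars in the lookback window
    avg_daily_volume: float      # 30-day average
    min_volume: float            # hypothetical floor for the asset class
    regime: str                  # hypothetical label, e.g. "crisis"
    sector_opposed: bool         # sector momentum strongly against the signal
    bars_since_last_signal: int  # bars since the previous signal on this symbol

def run_gate(c):
    """Return (surfaced, notes). Hard failures block; soft ones only annotate."""
    notes = []
    if c.gap_fraction > 0.05:
        return False, ["layer 1: data gaps exceed 5%"]
    if c.avg_daily_volume < c.min_volume:
        return False, ["layer 2: below liquidity floor"]
    if c.direction == "long" and c.regime == "crisis":
        notes.append("layer 3: downgraded, long signal in crisis regime")
    if c.probability <= 0.55:
        return False, notes + ["layer 4: probability not above 0.55, logged only"]
    if c.sector_opposed:
        notes.append("layer 5: penalised, sector momentum opposed")
    if c.bars_since_last_signal < 4:
        return False, notes + ["layer 6: duplicate within 4 bars"]
    return True, notes
```

Ordering the cheap data and liquidity checks first means most unusable candidates are discarded before any regime or sector context is consulted.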

Walk-Forward Validation: The Only Honest Backtest

Standard train/test splits are not sufficient for time-series financial models. Future data must never touch the training window, and because market regimes shift, a model trained on 2015-2019 cannot be assumed valid for 2020-2022.

Avo uses expanding-window walk-forward validation. The model is trained on all data up to month M, then evaluated on month M+1, then retrained on all data up to M+1, evaluated on M+2, and so on. This produces a performance distribution across many out-of-sample windows rather than a single number, which is far more honest about model stability.
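The expanding-window loop described above can be sketched as follows. The `fit` and `score` callables are placeholders for whatever training and evaluation routines are in use; their names and signatures are assumptions made for this example.

```python
def expanding_window_splits(months, min_train=1):
    """Yield (train_months, test_month): train on every month before the test month."""
    for i in range(min_train, len(months)):
        yield months[:i], months[i]

def walk_forward_accuracy(months, fit, score, min_train=1):
    """Retrain before each test month and collect out-of-sample accuracy.

    fit(train_months) -> model
    score(model, test_month) -> accuracy in [0, 1]
    """
    per_window = [
        score(fit(train), test)
        for train, test in expanding_window_splits(months, min_train)
    ]
    return {
        "mean": sum(per_window) / len(per_window),
        "min": min(per_window),
        "max": max(per_window),
        "windows": per_window,
    }
```

Reporting the minimum and maximum alongside the mean keeps the spread across windows visible instead of collapsing it into one headline figure.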

The reported 56.9% overall accuracy is the mean directional accuracy across all walk-forward windows. It is not cherry-picked from the best window. The range across windows is 52.1% to 63.4%, meaningful variation that reflects genuine regime sensitivity. Any system claiming perfectly stable accuracy across all market conditions is probably overfitting.

Confidence Thresholds: The Real Lever

56.9% is the unconditional accuracy. But LightGBM outputs a probability, not just a direction. When the model is uncertain (probability close to 0.5), its directional calls are approximately coin flips. When the model is confident (probability above 0.65), something structurally different is happening: the feature set is aligned in a way that historically resolves in one direction at much higher rates.

At confidence above 0.65, the walk-forward accuracy rises to 91.9%. The cost of that accuracy is selectivity: only about 8% of all computed signals pass the 0.65 threshold. That is not a bug. That is the design. In quantitative signals, having fewer, higher-quality alerts is almost always preferable to more, noisier ones.
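The trade-off between selectivity and accuracy can be measured with a simple sweep over the confidence floor. The sketch below assumes the model emits a probability of an up move and that confidence means the larger of p and 1 - p; both are assumptions about the setup, and the sample inputs are made-up toy values, not Avo results.

```python
def coverage_and_accuracy(probs_up, went_up, floor):
    """Share of signals kept at a confidence floor, and accuracy on those kept.

    probs_up: predicted probability of an up move per signal
    went_up:  realised direction per signal (True if price rose)
    """
    kept = correct = 0
    for p, up in zip(probs_up, went_up):
        confidence = max(p, 1.0 - p)
        if confidence > floor:
            kept += 1
            predicted_up = p > 0.5
            correct += predicted_up == up
    coverage = kept / len(probs_up)
    accuracy = correct / kept if kept else None
    return coverage, accuracy

# Toy inputs for illustration only.
probs = [0.90, 0.10, 0.55, 0.45, 0.70]
outcomes = [True, False, False, False, False]
for floor in (0.50, 0.60, 0.65):
    print(floor, coverage_and_accuracy(probs, outcomes, floor))
```

Running the same sweep on each walk-forward window separately shows whether the high-confidence accuracy is stable or driven by a few favourable months.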

The threshold is not a fixed parameter. Avo exposes a configurable confidence floor in the intelligence dashboard, defaulting to 0.60 (which yields about 18% of all signals at approximately 78% accuracy). Users who want maximum selectivity can set the floor to 0.65 or higher.

The Thompson Sampling Meta-Learner

A single model trained once is static. Markets evolve. Avo runs a Thompson sampling meta-learner that tracks the recent performance of different signal families (momentum, mean-reversion, pattern-based, macro-driven) and dynamically weights their contribution to the composite confidence score.

Thompson sampling is a Bayesian approach to the multi-armed bandit problem. Each signal family has a Beta distribution representing the posterior belief about its current hit rate. At each inference step, the meta-learner samples from these distributions and weights the signal families accordingly. Families that have been performing well in recent market conditions draw high samples more often; families that have been underperforming are naturally downweighted without manual intervention.
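A minimal Beta-Bernoulli version of this idea fits in a small class. This is a sketch under assumptions: the class and method names are hypothetical, the prior is uniform, the weights are simply the normalised draws, and no recency decay is applied, so it will not match the production meta-learner in detail.

```python
import random

class FamilyWeights:
    """Thompson sampling over signal families with Beta posteriors."""

    def __init__(self, families, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior over each family's hit rate.
        self.alpha = {f: 1.0 for f in families}
        self.beta = {f: 1.0 for f in families}

    def update(self, family, hit):
        """Record one resolved signal: a hit raises alpha, a miss raises beta."""
        if hit:
            self.alpha[family] += 1.0
        else:
            self.beta[family] += 1.0

    def sample_weights(self):
        """Draw one hit-rate sample per family and normalise the draws to sum to 1."""
        draws = {
            f: self.rng.betavariate(self.alpha[f], self.beta[f])
            for f in self.alpha
        }
        total = sum(draws.values())
        return {f: d / total for f, d in draws.items()}

weights = FamilyWeights(["momentum", "mean_reversion", "pattern", "macro"], seed=7)
weights.update("momentum", hit=True)
weights.update("macro", hit=False)
print(weights.sample_weights())
```

Because the weights come from random draws rather than point estimates, a family with little evidence still gets sampled high occasionally, which is how the scheme keeps exploring. Emphasising recent outcomes would need an extra step, such as decaying the counts over time.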

This is the mechanism that allows Avo's model to adapt to regime shifts without a full retrain. The base model weights remain static between monthly retraining cycles, but the meta-learner continuously adjusts which signals to trust based on recent outcomes.

What We Are Honest About

91.9% accuracy at high confidence is not a license to trade recklessly. Several important caveats apply:

Directional accuracy does not equal profitability. A signal correct in direction but wrong in magnitude can still lose money with aggressive position sizing.
Accuracy is computed at the signal horizon (5 bars forward by default). Longer hold periods revert toward the unconditional mean.
The high-confidence threshold selects for conditions that are rare by design. You cannot run 100% of capital against 8% of signals.
Regime shifts are the primary risk. The walk-forward windows capture regime variation but cannot fully simulate a true out-of-sample future regime we have not yet seen.
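The first caveat is easy to check with arithmetic. The sketch below computes expected return per trade from a hit rate and average win and loss sizes; the 0.5% win and 6% loss figures are hypothetical, chosen only to show that a high hit rate does not guarantee a positive expectation.

```python
def expected_return_per_trade(hit_rate, avg_win, avg_loss):
    """Expected return of one trade: wins at hit_rate, losses otherwise."""
    return hit_rate * avg_win - (1.0 - hit_rate) * avg_loss

def breakeven_loss(hit_rate, avg_win):
    """Average loss size at which the expected return is exactly zero."""
    return hit_rate * avg_win / (1.0 - hit_rate)

# Hypothetical sizes: small frequent wins, rare large losses.
print(expected_return_per_trade(0.919, avg_win=0.005, avg_loss=0.06))
print(breakeven_loss(0.919, avg_win=0.005))
```

At a 91.9% hit rate with 0.5% average wins, an average loss a little under 5.7% is already enough to bring the expectation to zero, which is why position sizing and stop placement matter as much as direction.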

Avo publishes ongoing model performance metrics in the intelligence dashboard, not just the historical walk-forward numbers, but rolling 30-day live accuracy tracked against signals as they resolve. Transparency about model limitations is part of the product.

Get weekly intelligence delivered to your inbox

Curated signals, regime shifts, and anomaly highlights from Avo Intelligence. Every Monday. Free.
