A/B Testing & Experimentation
Statistical rigor. No per-seat fees. No vendor lock-in.
Custom experimentation platforms, feature flag systems, multi-armed bandit optimization, and conversion rate infrastructure. We build the analysis engine, the randomization layer, and the dashboards your team will run independently after handoff.
23%
Cumulative checkout conversion lift
< 1s
Feature flag kill switch propagation
$0
Per-seat fees on the owned platform
CAPABILITIES
What we build
01
Experimentation platform
Experiment registry with hypothesis tracking, pre-specified metric definitions, and minimum detectable effect calculations before any test runs. Assignment logic uses hashed user IDs with a configurable salt so the same user always sees the same variant and variant exposure is logged per impression.
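The salted-hash assignment described above fits in a few lines. This is a minimal sketch of the idea; the function and parameter names are illustrative, not the platform's actual API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, salt: str, variants: list[str]) -> str:
    """Deterministic bucketing: the same user, experiment, and salt
    always hash to the same variant. Rotating the salt reshuffles
    assignments for a fresh experiment."""
    key = f"{salt}:{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Repeated calls with the same inputs return the same variant,
# so a returning user never flips between arms mid-experiment.
variant = assign_variant("user-42", "checkout-cta", "salt-v1", ["control", "treatment"])
```

Logging the returned variant on every impression produces the exposure record the analysis engine joins against at experiment close.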
02
Feature flagging
Flag evaluation at the edge with Redis-backed targeting rules. Roll out to 1% of users on a specific plan, measure for 7 days, then promote to 100% without a deploy. Kill switch fires in under 1 second across all active sessions.
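In outline, a rule like the 1%-of-a-plan rollout above combines a kill switch that short-circuits everything, a plan check, and a hashed percentage bucket. This sketch evaluates the rule in-process with a plain dict; the rule shape and names are hypothetical stand-ins for the Redis-backed version.

```python
import hashlib

def flag_enabled(user_id: str, plan: str, rule: dict) -> bool:
    """Evaluate one targeting rule. The rule shape here is hypothetical:
    {"flag": str, "plans": [str], "rollout_pct": int, "kill_switch": bool}."""
    if rule["kill_switch"]:          # kill switch overrides all targeting
        return False
    if plan not in rule["plans"]:    # plan-based targeting
        return False
    # Stable percentage bucket per user per flag.
    key = f'{rule["flag"]}:{user_id}'.encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rule["rollout_pct"]

rule = {"flag": "new-dashboard", "plans": ["pro"], "rollout_pct": 1, "kill_switch": False}
```

Promoting to 100% is a data change (`rollout_pct: 100`), not a deploy, and flipping `kill_switch` disables the flag everywhere on the next evaluation.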
03
Multi-armed bandit optimization
Thompson sampling or UCB1 allocation that shifts traffic toward winning variants as data accumulates. Faster than classical A/B testing when you have many variants and limited patience for weeks-long ramp periods.
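Thompson sampling for conversion-style (Bernoulli) rewards is compact enough to sketch. This is the textbook Beta-Bernoulli version, assuming binary conversions, not the platform's exact implementation.

```python
import random

class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling: keep a Beta posterior per
    variant, sample a plausible conversion rate from each, and play
    the variant with the best draw."""
    def __init__(self, n_variants: int):
        self.successes = [1] * n_variants  # Beta(1, 1) uniform priors
        self.failures = [1] * n_variants

    def choose(self) -> int:
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, variant: int, converted: bool) -> None:
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1
```

Each impression calls `choose()` and each observed outcome calls `update()`; as the posteriors sharpen, traffic concentrates on the stronger variants, which is why this outpaces a fixed even split when there are many arms.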
04
Statistical analysis engine
Pre-registration of primary metric, guardrail metrics, and stopping rules before the test opens. Sequential testing with alpha spending so you can peek at results without inflating your false-positive rate. Automated significance report generated at experiment close.
DISCIPLINE
Pre-registration and stopping rules
Every experiment is pre-registered before exposure. Primary metric, guardrail metrics, minimum detectable effect, and the stopping rule are locked in the database. Peeking does not inflate the false-positive rate because the sequential boundaries governing each look were fixed before the data arrived.
Primary metric
Pre-declared
Locked before the first user is assigned. One metric per experiment. Changing it after exposure invalidates the test.
Guardrails
2 to 4 per test
Churn rate, support ticket volume, revenue per user. A primary win that breaks a guardrail is not a win.
MDE
Power 0.8
Sample size is computed from minimum detectable effect, baseline variance, and 80% power before exposure starts.
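That computation is the standard two-proportion power formula. A planning sketch, assuming a normal approximation and equal-size arms:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm n to detect an absolute lift of `mde` over a `baseline`
    conversion rate at the given significance level and power."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)            # 0.8 power -> z ~ 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # baseline variance term
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 2-point absolute lift on a 10% baseline takes a few
# thousand users per arm; halving the MDE roughly quadruples n.
n = sample_size_per_arm(baseline=0.10, mde=0.02)
```

This is why the MDE is locked before exposure: shrinking it after launch silently multiplies the required runtime.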
Alpha spending
O'Brien-Fleming
Sequential boundaries let you peek without inflating alpha past 0.05. Stop early on overwhelming wins or losses.
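The O'Brien-Fleming schedule can be written down directly. This sketch uses the Lan-DeMets spending-function form, one common way to implement it:

```python
import math
from statistics import NormalDist

def of_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t (0 < t <= 1)
    under the O'Brien-Fleming-type spending function:
    alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))

# Early looks spend almost nothing, so an early stop demands an
# overwhelming effect; the full 0.05 is only available at t = 1.
schedule = {t: of_alpha_spent(t) for t in (0.25, 0.5, 0.75, 1.0)}
```

Per-look budgets fall out as the increments alpha(t_k) - alpha(t_{k-1}); that difference is all each interim analysis is allowed to spend.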
Min runtime
1 to 2 weeks
Enforced by the platform. Prevents day-of-week and novelty effects from corrupting the result.
Holdout
5 to 10%
Permanent holdout group never sees any winning variant. Used to measure long-run cumulative lift across all tests.
PROCESS
How we deliver
Every engagement follows the same three phases. No surprises, no scope creep.
Experiment Design + Metric Definition
We define the hypothesis, primary metric, guardrail metrics, and minimum detectable effect. Sample size and assignment logic are locked before any test runs.
Instrumentation + Randomization Engine
Event tracking and feature flag infrastructure deployed. Consistent user assignment, exposure logging, and holdout groups configured to eliminate carryover bias.
Analysis Engine + Reporting Handoff
Statistical analysis pipeline runs automatically at experiment close. Results dashboard and documented methodology transferred so your team can run future tests independently.
APPLICATIONS
Where this applies
- 01 Pricing experiment infrastructure. A SaaS product ran 4 simultaneous pricing page experiments, including annual vs. monthly emphasis, price anchoring position, and guarantee copy. Each experiment had pre-specified guardrail metrics (churn rate, support ticket volume) to catch wins that hurt long-term retention.
- 02 Checkout funnel optimization. An e-commerce client ran 11 sequential experiments on their checkout flow, each with a 2-week minimum runtime enforced by the platform to rule out day-of-week and novelty effects. Cumulative conversion lift: 23% over 6 months.
- 03 Feature rollout for a B2B platform. A new dashboard shipped behind a flag targeting beta accounts. Engagement metrics were measured for 30 days, then the flag was promoted to 10%, then 100%. Two regressions were caught and rolled back before reaching full rollout.
- 04 Email subject line optimization with bandit allocation. 8 subject line variants for a weekly newsletter; Thompson sampling converged to a 2-variant runoff within 4 weeks. The winner produced a 31% higher open rate than the control.
TECHNOLOGY
Tech stack
METRICS
By the numbers
< 1s
Flag kill switch propagation
Unlimited
Concurrent experiments, no seat tax
100%
Platform IP ownership
2 wks
Platform to production
GET STARTED
Ready to build?
Most projects ship in 2 to 4 weeks. Fixed price. Full IP transfer.