Experimentation
GoGreen includes a built-in experimentation platform for running A/B tests, measuring feature impact, and making data-driven decisions.
How It Works
- Create an experiment linked to a feature flag with two or more variations.
- Define metrics — what you want to measure (e.g., conversion rate, revenue, latency).
- Start the experiment — GoGreen begins collecting impression and custom events from SDKs.
- View results — statistical analysis tells you which variation performs better with confidence intervals.
Experiment Lifecycle
Draft → Running → Stopped
- Draft: Configure the experiment, link a flag, and define metrics. No data is collected yet.
- Running: Data collection is active. SDKs send impression events (which variation a user saw) and custom events (what the user did).
- Stopped: Data collection freezes. Results are finalized and available for review.
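The lifecycle above is a simple forward-only state machine. A minimal sketch of it in Go (the `State` type, `Transition` function, and transition table are illustrative, not part of the GoGreen SDK):

```go
package main

import "fmt"

// State models the experiment lifecycle: Draft → Running → Stopped.
type State string

const (
	Draft   State = "draft"
	Running State = "running"
	Stopped State = "stopped"
)

// validNext encodes the only allowed forward transitions.
var validNext = map[State]State{
	Draft:   Running,
	Running: Stopped,
}

// Transition rejects anything other than the forward path; in particular,
// a stopped experiment cannot be restarted.
func Transition(from, to State) error {
	if validNext[from] != to {
		return fmt.Errorf("invalid transition: %s → %s", from, to)
	}
	return nil
}

func main() {
	fmt.Println(Transition(Draft, Running))  // <nil>
	fmt.Println(Transition(Stopped, Draft))  // invalid transition: stopped → draft
}
```

Because Stopped is terminal, results reviewed after stopping can never be mixed with newly collected data.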
Creating an Experiment
Via Dashboard
- Navigate to Experiments in the sidebar.
- Click Create Experiment.
- Select the flag and environment to experiment on.
- Define one or more metrics (event key + aggregation type).
- Click Create (starts in Draft state).
- When ready, click Start Experiment.
Via API
```bash
# Create an experiment
curl -X POST https://api.gogreenflags.com/v1/projects/{projectId}/experiments \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Checkout Flow Test",
    "flag_key": "new-checkout",
    "environment_id": "env-prod",
    "description": "Test new checkout flow vs legacy"
  }'
```

Tracking Events
SDKs automatically send impression events when a flag is evaluated. You can also send custom events to track business metrics:
Go

```go
client.Track("purchase", user, map[string]any{
    "revenue":  49.99,
    "currency": "USD",
    "items":    3,
})
```

TypeScript

```typescript
client.track('purchase', {
  revenue: 49.99,
  currency: 'USD',
  items: 3,
});
```

Statistical Analysis
GoGreen computes statistical significance automatically:
| Metric Type | Test | Output |
|---|---|---|
| Numeric (e.g., revenue) | Welch’s t-test | Mean difference, p-value, confidence interval |
| Categorical (e.g., conversion) | Chi-squared test | Proportion difference, p-value, confidence interval |
Results include:
- p-value: Probability of observing a difference at least this large if the variations truly performed the same. A p-value < 0.05 is treated as statistically significant.
- Confidence interval: Range within which the true difference lies with 95% confidence.
- Sample size: Number of users in each variation.
- Lift: Percentage improvement of the treatment over the control.
Event Pipeline
Events flow through a purpose-built analytics pipeline:
- SDKs send events to the Events Ingestion service.
- Events Ingestion deduplicates events, strips PII, and writes to ClickHouse.
- ClickHouse stores raw events and maintains materialized views for hourly, daily, and monthly aggregation rollups.
- Experimentation Service queries ClickHouse aggregations to compute results.
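The deduplication and PII-stripping step can be sketched as follows. The `Event` struct and `ingest` function are hypothetical stand-ins, not the actual ingestion schema:

```go
package main

import "fmt"

// Event is a simplified impression/custom event; field names are illustrative.
type Event struct {
	ID        string  // unique event ID used for deduplication
	Key       string  // event key, e.g. "purchase"
	UserEmail string  // PII that must be stripped before storage
	Value     float64
}

// ingest deduplicates by event ID and strips PII, mirroring what the
// Events Ingestion service does before writing to ClickHouse.
func ingest(events []Event) []Event {
	seen := make(map[string]bool)
	out := make([]Event, 0, len(events))
	for _, e := range events {
		if seen[e.ID] {
			continue // duplicate delivery, e.g. an SDK retry
		}
		seen[e.ID] = true
		e.UserEmail = "" // strip PII before the ClickHouse write
		out = append(out, e)
	}
	return out
}

func main() {
	in := []Event{
		{ID: "e1", Key: "purchase", UserEmail: "a@example.com", Value: 49.99},
		{ID: "e1", Key: "purchase", UserEmail: "a@example.com", Value: 49.99}, // retry
		{ID: "e2", Key: "purchase", UserEmail: "b@example.com", Value: 19.99},
	}
	fmt.Println(len(ingest(in))) // 2
}
```

Deduplicating on a client-generated event ID keeps SDK retries (common over flaky mobile networks) from inflating experiment counts.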
Data Retention
Event data is retained according to your plan’s retention period. Aggregation rollups (hourly → daily → monthly) ensure long-term trend analysis while managing storage costs.
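A toy illustration of the hourly → daily rollup idea (in practice this is done by ClickHouse materialized views, not application code; the key format and `rollupDaily` function are invented for this sketch):

```go
package main

import "fmt"

// rollupDaily sums hourly event counts keyed by "YYYY-MM-DDTHH" into
// daily totals keyed by "YYYY-MM-DD", so raw rows can be expired while
// long-term trends remain queryable.
func rollupDaily(hourly map[string]int64) map[string]int64 {
	daily := make(map[string]int64)
	for hourKey, count := range hourly {
		day := hourKey[:10] // "YYYY-MM-DD" prefix of the hour key
		daily[day] += count
	}
	return daily
}

func main() {
	hourly := map[string]int64{
		"2024-05-01T09": 120,
		"2024-05-01T10": 95,
		"2024-05-02T09": 130,
	}
	fmt.Println(rollupDaily(hourly)["2024-05-01"]) // 215
}
```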
Best Practices
- Run experiments for at least 1-2 weeks to account for day-of-week effects.
- Don’t peek at results and stop early — let the experiment run to the planned sample size for valid statistical conclusions.
- Use guardrail metrics alongside your primary metric to catch unintended negative effects (e.g., monitor error rate alongside conversion rate).
- One change at a time — avoid running overlapping experiments on the same flag to prevent interaction effects.