Experimentation
GoGreen includes a built-in experimentation platform for running A/B tests, measuring feature impact, and making data-driven decisions.
How It Works
- Create an experiment linked to a feature flag with two or more variations.
- Define metrics — what you want to measure (e.g., conversion rate, revenue, latency).
- Start the experiment — GoGreen begins collecting impression and custom events from SDKs.
- View results — statistical analysis tells you which variation performs better with confidence intervals.
Experiment Lifecycle
Draft → Running → Stopped
- Draft: Configure the experiment, link a flag, and define metrics. No data is collected yet.
- Running: Data collection is active. SDKs send impression events (which variation a user saw) and custom events (what the user did).
- Stopped: Data collection freezes. Results are finalized and available for review.
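The lifecycle above is a simple forward-only state machine. A minimal sketch of it in Go (the `State` type, `Transition` function, and transition table are illustrative, not part of the GoGreen SDK):

```go
package main

import "fmt"

// State models the experiment lifecycle: Draft → Running → Stopped.
type State string

const (
	Draft   State = "draft"
	Running State = "running"
	Stopped State = "stopped"
)

// validNext encodes the only allowed forward transitions.
var validNext = map[State]State{
	Draft:   Running,
	Running: Stopped,
}

// Transition rejects anything other than the forward path; in particular,
// a stopped experiment cannot be restarted.
func Transition(from, to State) error {
	if validNext[from] != to {
		return fmt.Errorf("invalid transition: %s → %s", from, to)
	}
	return nil
}

func main() {
	fmt.Println(Transition(Draft, Running))  // <nil>
	fmt.Println(Transition(Stopped, Draft))  // invalid transition: stopped → draft
}
```

Because Stopped is terminal, results reviewed after stopping can never be mixed with newly collected data.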
Creating an Experiment
Via Dashboard
- Navigate to Experiments in the sidebar.
- Click Create Experiment.
- Select the flag and environment to experiment on.
- Define one or more metrics (event key + aggregation type).
- Click Create (starts in Draft state).
- When ready, click Start Experiment.
Via API
```bash
# Create an experiment
curl -X POST https://api.gogreenflags.com/v1/projects/{projectId}/experiments \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Checkout Flow Test",
    "flag_key": "new-checkout",
    "environment_id": "env-prod",
    "description": "Test new checkout flow vs legacy"
  }'
```

Tracking Events
SDKs automatically send impression events when a flag is evaluated. You can also send custom events to track business metrics:
Go

```go
client.Track("purchase", user, map[string]any{
    "revenue":  49.99,
    "currency": "USD",
    "items":    3,
})
```

TypeScript

```typescript
client.track('purchase', {
  revenue: 49.99,
  currency: 'USD',
  items: 3,
});
```

Statistical Analysis
GoGreen computes statistical significance automatically:
| Metric Type | Test | Output |
|---|---|---|
| Numeric (e.g., revenue) | Welch’s t-test | Mean difference, p-value, confidence interval |
| Categorical (e.g., conversion) | Chi-squared test | Proportion difference, p-value, confidence interval |
Results include:
- p-value: Probability of observing a difference at least this large if the variations truly performed the same. A p-value < 0.05 is treated as statistically significant.
- Confidence interval: Range within which the true difference lies with 95% confidence.
- Sample size: Number of users in each variation.
- Lift: Percentage improvement of the treatment over the control.
Event Pipeline
Events flow through a purpose-built analytics pipeline:
- SDKs send events to the Events Ingestion service.
- Events Ingestion deduplicates events, strips PII, and writes to ClickHouse.
- ClickHouse stores raw events and maintains materialized views for hourly, daily, and monthly aggregation rollups.
- Experimentation Service queries ClickHouse aggregations to compute results.
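The deduplication and PII-stripping step can be sketched as follows. The `Event` struct and `ingest` function are hypothetical stand-ins, not the actual ingestion schema:

```go
package main

import "fmt"

// Event is a simplified impression/custom event; field names are illustrative.
type Event struct {
	ID        string  // unique event ID used for deduplication
	Key       string  // event key, e.g. "purchase"
	UserEmail string  // PII that must be stripped before storage
	Value     float64
}

// ingest deduplicates by event ID and strips PII, mirroring what the
// Events Ingestion service does before writing to ClickHouse.
func ingest(events []Event) []Event {
	seen := make(map[string]bool)
	out := make([]Event, 0, len(events))
	for _, e := range events {
		if seen[e.ID] {
			continue // duplicate delivery, e.g. an SDK retry
		}
		seen[e.ID] = true
		e.UserEmail = "" // strip PII before the ClickHouse write
		out = append(out, e)
	}
	return out
}

func main() {
	in := []Event{
		{ID: "e1", Key: "purchase", UserEmail: "a@example.com", Value: 49.99},
		{ID: "e1", Key: "purchase", UserEmail: "a@example.com", Value: 49.99}, // retry
		{ID: "e2", Key: "purchase", UserEmail: "b@example.com", Value: 19.99},
	}
	fmt.Println(len(ingest(in))) // 2
}
```

Deduplicating on a client-generated event ID keeps SDK retries (common over flaky mobile networks) from inflating experiment counts.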
Data Retention
Event data is retained according to your plan’s retention period. Aggregation rollups (hourly → daily → monthly) ensure long-term trend analysis while managing storage costs.
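A toy illustration of the hourly → daily rollup idea (in practice this is done by ClickHouse materialized views, not application code; the key format and `rollupDaily` function are invented for this sketch):

```go
package main

import "fmt"

// rollupDaily sums hourly event counts keyed by "YYYY-MM-DDTHH" into
// daily totals keyed by "YYYY-MM-DD", so raw rows can be expired while
// long-term trends remain queryable.
func rollupDaily(hourly map[string]int64) map[string]int64 {
	daily := make(map[string]int64)
	for hourKey, count := range hourly {
		day := hourKey[:10] // "YYYY-MM-DD" prefix of the hour key
		daily[day] += count
	}
	return daily
}

func main() {
	hourly := map[string]int64{
		"2024-05-01T09": 120,
		"2024-05-01T10": 95,
		"2024-05-02T09": 130,
	}
	fmt.Println(rollupDaily(hourly)["2024-05-01"]) // 215
}
```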
Best Practices
- Run experiments for at least 1-2 weeks to account for day-of-week effects.
- Don’t peek at results and stop early — let the experiment run to the planned sample size for valid statistical conclusions.
- Use guardrail metrics alongside your primary metric to catch unintended negative effects (e.g., monitor error rate alongside conversion rate).
- One change at a time — avoid running overlapping experiments on the same flag to prevent interaction effects.