Jan 22, 2026

Building my own A/B testing toolkit

Why I did this

I wanted to get comfortable with the real questions that come up in experimentation.

Also: I wanted a project that could stand on its own as a portfolio piece ... something you can open, run, and understand without needing a long explanation.

The plan

I planned the work as two connected projects, each with a distinct job:

1) A/B Experimentation Toolkit (ab-experimentation-toolkit)

This is the “toolbox.” It includes functions that take data from two groups (A and B) and return results in a consistent format.

My goal here was simplicity.
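To make "a consistent format" concrete, here is a minimal sketch of what one such function could look like: a two-proportion z-test for conversion rates. The function name and result fields are my own illustration, not necessarily the toolkit's actual API:

```python
import math

def proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns results in one consistent dict shape; the field names here
    are illustrative, not the toolkit's real schema.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "metric": "conversion_rate",
        "lift": p_b - p_a,
        "z": z,
        "p_value": p_value,
    }
```

The payoff of the consistent shape is that every downstream consumer (reports, guardrails, the simulation lab) can read `lift` and `p_value` the same way regardless of metric type.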

2) Simulation Lab (experimentation-sim-lab)

This is the proof. It runs many simulated experiments and checks how the toolkit behaves.

This was the key idea:

It’s easy to write A/B testing code that seems correct, but much harder to prove it behaves correctly across many situations.

So instead of trusting formulas blindly, I used simulation to validate that the toolkit actually behaves the way the theory promises.

This second repo is what turned the project from “I implemented some statistics” into “I understand experimentation like a platform problem.”

The execution

I treated this like a mini product build rather than a notebook exercise.

Step 1: Set up professional scaffolding early

Before writing much functionality, I set up CI, formatting, and tests.

It felt annoying at times, but it paid off. Once the workflow was in place, it became easy to make small improvements without breaking everything.

Step 2: Build the toolkit in small, testable pieces

I implemented the toolkit gradually, focusing on what real experimentation teams use day to day. I started with simpler metric types and built toward harder ones.

Each time I added functionality, I added tests alongside it.
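As a sketch of what "small, testable pieces" can look like in practice, here is a continuous-metric comparison (think revenue per user) paired with the kind of unit test that would ship with it. The names, and the large-sample normal approximation used to stay dependency-free, are my assumptions, not the toolkit's real implementation:

```python
import math
import statistics

def welch_ttest(values_a, values_b):
    """Welch-style test for a continuous metric (e.g. revenue per user).

    For large samples the statistic is compared against a normal
    approximation here to avoid dependencies; a real toolkit might use
    scipy.stats.ttest_ind(equal_var=False) instead.
    """
    mean_a = statistics.fmean(values_a)
    mean_b = statistics.fmean(values_b)
    var_a = statistics.variance(values_a)
    var_b = statistics.variance(values_b)
    # Welch standard error: per-group variances, no pooling assumption.
    se = math.sqrt(var_a / len(values_a) + var_b / len(values_b))
    t = (mean_b - mean_a) / se
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return {"metric": "continuous", "lift": mean_b - mean_a,
            "t": t, "p_value": p_value}

# The matching unit test, in the style pytest collects:
def test_welch_ttest_no_shift():
    same = [1.0, 2.0, 3.0, 4.0, 5.0]
    result = welch_ttest(same, same)
    assert result["lift"] == 0.0
    assert result["p_value"] == 1.0
```

Writing the test in the same commit as the function keeps each piece small enough to reason about on its own.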

Step 3: Add platform guardrails, not just analysis

This was a big mindset shift for me. In real experimentation platforms, the job is not only to compute a p-value; you also need guardrail checks around the experiment itself.

So I added features that make the toolkit feel like a tiny experimentation platform rather than a stats demo.
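A concrete example of such a guardrail is a sample ratio mismatch (SRM) check: if the observed traffic split deviates too far from the planned split, the assignment pipeline is probably broken and no p-value downstream can be trusted. The chi-square form and the 0.001 threshold below are common industry conventions I'm assuming here, not necessarily what the toolkit uses:

```python
import math

def srm_check(n_a, n_b, expected_ratio=0.5, alpha=0.001):
    """Sample ratio mismatch guardrail.

    Compares the observed A/B split against the planned split with a
    chi-square test (1 degree of freedom). A very small p-value means
    the randomization itself is suspect.
    """
    total = n_a + n_b
    expected_a = total * expected_ratio
    expected_b = total * (1 - expected_ratio)
    chi2 = ((n_a - expected_a) ** 2 / expected_a
            + (n_b - expected_b) ** 2 / expected_b)
    # Survival function of chi-square with 1 df, via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value,
            "srm_detected": p_value < alpha}
```

Running this before any metric analysis is what separates "a stats demo" from "a platform": the platform refuses to report results from a broken experiment.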

Step 4: Build a simulation lab to validate behavior

Once the toolkit existed, I built a separate repo that runs many simulated experiments against it and checks how it behaves.

This gave me a clear way to validate assumptions and build confidence in the toolkit.
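Besides false positive rates, the other property worth validating empirically is power. A sketch of that check, again my own illustration rather than the repo's code: bake a known true effect into the simulated data and count how often the test detects it:

```python
import math
import random

def simulated_power(p_control, p_treatment, n, n_sims=1000,
                    alpha=0.05, seed=1):
    """Estimate power empirically: simulate experiments with a real
    effect baked in and count how often a two-proportion z-test
    detects it at the given alpha."""
    rng = random.Random(seed)
    detections = 0
    for _ in range(n_sims):
        a = sum(rng.random() < p_control for _ in range(n))
        b = sum(rng.random() < p_treatment for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = math.sqrt(pooled * (1 - pooled) * (2 / n))
        z = (b - a) / se if se > 0 else 0.0
        detections += math.erfc(abs(z) / math.sqrt(2)) < alpha
    return detections / n_sims
```

Comparing this empirical number against the analytical power formula is exactly the kind of cross-check that builds confidence in both.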

What I learned (the real takeaways)

1) “Statistical significance” is the easiest part

The hardest part is everything around it.

2) Simulation is underrated

Simulation gave me a way to sanity-check my own thinking.

It also made the project feel grounded: not just math, but “does this behave the way a platform needs it to behave?”

3) Tooling discipline matters

CI, formatting, and tests are not just for show. They let you move faster without constantly breaking things.

And honestly, having clean repos made the whole project more enjoyable to work on.

Closing thoughts

This project was about building a system you can trust ... technically and culturally. If you’re learning experimentation and want a project that forces you to understand what’s going on under the hood, I highly recommend building something like this. The learning curve is real, but the payoff is even bigger.

If you’d like to explore the code, the two repos are ab-experimentation-toolkit and experimentation-sim-lab.

Arnav Jaitly

Hi, I am Arnav! If you liked this article, do consider leaving a comment below as it motivates me to publish more helpful content like this!
