Our homegrown A/B testing framework at 42Floors

Six months ago we created a homegrown A/B testing framework wherein we randomize traffic between three servers running different branches of our codebase.  Conversion rate has since increased 251%.

Original article published here: Using split testing for office space search

My goal in sharing our results is to encourage you to take bigger risks with your A/B testing. Please treat our particular designs with skepticism; the UX that worked for us probably won’t work for you.

The original

As a search engine for commercial real estate, our site has only one goal: for visitors to find at least one office space that they like enough to contact.

Version 1

So, the simplest measure of our success rate (our conversion rate) is the number of visitors who contact a space divided by the total number of visitors.  A contact is any phone call, email, or web-based tour request.  For the sake of simplicity, the data shown below is only for web-based tour requests from search traffic.
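As a minimal sketch of that definition, the calculation looks like this in Ruby. The method name `conversion_rate` and the visitor counts are made up for illustration; they are not our real traffic numbers.

```ruby
# Conversion rate as a percentage: contacts divided by total visitors.
# Returns 0.0 when there are no visitors, to avoid dividing by zero.
def conversion_rate(contacts, visitors)
  return 0.0 if visitors.zero?

  (contacts.to_f / visitors * 100).round(2)
end

# Hypothetical counts, chosen only to illustrate the arithmetic.
puts conversion_rate(41, 10_000) # prints 0.41
```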

Back in September 2013, our UX was in its third iteration since YC Demo Day; it was called Unified View. The design served us well: it was elegant, and power users liked it because we offered lots of filters and they could choose between photo, list, or map-based search results.

For the month of September, our conversion rate was 0.41%.

Unified View: index page (conversion rate: 0.41%)

Unified View: show page (conversion rate: 0.41%)

Baby steps

While we didn’t have good industry benchmarks to tell us if a 0.41% conversion rate was reasonable, we assumed we were on the low end. At the time, we’d been running our A/B tests with Optimizely, and they tended to be modest: the usual textual changes, image variations, and call-to-action tweaks. While we’d occasionally find a 15% gain, we wanted to run more aggressive experiments.

So we sketched out 8 wildly different landing pages in Photoshop and sent them off to PSD2HTML to render as static HTML. Then we used AdWords campaigns to randomize our traffic between the 8 landing pages. I wrote about that process a few months ago in a post called “How we test fake sites on live traffic”.

Seeing the wide performance variations between the static landing pages gave us confidence that we’d eventually find a killer UX if we kept running big experiments.

Homegrown A/B testing framework

As a data-driven search site, we could only push static landing page tests so far. What we needed was a way of serving up entirely different search experiences to cohorts of users and tracking their behavior. We considered using a split testing gem, but the idea of littering our codebase with inline conditionals like this one was repulsive:


<% ab_test("login_button", "/images/button1.jpg", "/images/button2.jpg") do |button_file| %>
  <%= img_tag(button_file, :alt => "Login!") %>
<% end %>

Plus, we’d still be constrained by the existing structure of our app.

After a few days of exploring options, Aaron and Julian volunteered to build a homegrown A/B testing solution.  Aaron’s working on a technical blog post explaining the implementation in detail, but suffice it to say we can now do things like this:

cap production-C deploy -s branch=hover4test

Since each app server is self-contained, branches are free to diverge wildly. We took that freedom to extremes. At one point, less than 10% of the codebase was the same between the branches on ProdA and ProdB. That much divergence makes the eventual git merge a nightmare, but a day of wrestling with merge conflicts is a small price to pay for unfettered experimentation.
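The routing layer isn't described in this post, so the following is only a rough Ruby sketch of one common way to keep a visitor pinned to the same app server: hash a visitor ID into a bucket, then map buckets to servers by weight. The `SPLIT` weights, the `ProdC` name, the visitor ID format, and the helpers `bucket_for` and `server_for` are all assumptions for illustration, and not necessarily what Aaron and Julian built.

```ruby
require "digest"

# Hypothetical traffic split in percent; names and weights are illustrative.
SPLIT = { "ProdA" => 40, "ProdB" => 30, "ProdC" => 30 }.freeze

# Map a visitor ID to a stable bucket in 0...100.
# The same ID always hashes to the same bucket, so assignment is sticky.
def bucket_for(visitor_id)
  Digest::MD5.hexdigest(visitor_id.to_s)[0, 8].to_i(16) % 100
end

# Walk the cumulative weights until the bucket falls inside a server's range.
def server_for(visitor_id, split = SPLIT)
  bucket = bucket_for(visitor_id)
  upper = 0
  split.each do |server, weight|
    upper += weight
    return server if bucket < upper
  end
  raise ArgumentError, "weights must sum to 100"
end

puts server_for("visitor-123")
```

Hashing rather than calling `rand` on each request means a returning visitor keeps seeing the same branch, provided the ID is stored somewhere durable such as a cookie.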

The first radical A/B test

Our static landing page tests suggested that the most promising UX was Google Hover Clone.  If you’re wondering what that looks like, think of the sidebar that appears in Google search results sometimes when you hover over a local business listing.  The idea was to eliminate clicks: users could get most of the information they needed about the properties just by hovering over listings in the index of results.

Using our new A/B testing framework, we served up Hover1 to 30% of our new users, followed quickly by Hover2, and eventually Hover3.  But none of them moved the needle.

Hover3

We decided to give the Hover UX paradigm one more shot with Hover4. Here’s the announcement email to the team:

For the two weeks in November that we ran the experiment, our conversion rate for the branch was, once again, 0.41%. Despite all the improvements we thought we’d made, our conversion rate hadn’t budged.

Hover4 (conversion rate: 0.41%)

Simple, No JavaScript

The day after we killed the Hover UX experiment, Aaron had a suggestion. Instead of looking to glossy sites like Airbnb for inspiration, what if we looked to Hacker News, Reddit, and 4chan?

Thus, the Craigslist branch was born:

CraigslistView: index page (conversion rate: 1.44%)

CraigslistView: show page (conversion rate: 1.44%)

The results

The Craigslist UX converted at 1.44%, which was a 251% improvement over our historical baseline of 0.41%. Supporting metrics like pages per visit (+51%) were way up, too, as you’d expect.
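For anyone checking the arithmetic, the 251% figure is the relative lift of the new rate over the baseline. A small Ruby sketch, where `relative_lift` is just an illustrative helper name:

```ruby
# Relative improvement of a variant over a baseline, as a whole percentage.
def relative_lift(baseline, variant)
  raise ArgumentError, "baseline must be non-zero" if baseline.zero?

  ((variant - baseline).to_f / baseline * 100).round
end

# The two conversion rates reported in this post, in percent.
puts relative_lift(0.41, 1.44) # prints 251
```

Note that this is a ratio of the two rates, not a statement about statistical significance; that would also depend on how many visitors each branch saw.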

We think that there’s still quite a bit of upside left so we’re actively running more experiments.  Ongoing experiments have a dedicated chart up on a monitor in the lunch area so everybody can see how they’re doing.  Here’s one of the charts from today.

The reason why we put so much effort into split testing is that we’re trying to find the global maximum.  We worry that making linear iterations will lead us to a local maximum that would be far less than our potential.  So, we force ourselves to try ideas that are radically different from past experiments.  At some point we’ll probably run out of crazy ideas and then we’ll settle into optimizing the winning UX instead of trashing and rewriting it periodically.

PS - are you curious what experiment is running right now?  Here it is.