Our homegrown A/B testing framework at 42Floors
Six months ago we created a homegrown A/B testing framework wherein we randomize traffic between three servers running different branches of our codebase. Conversion rate has since increased 251%.
Original article published here: Using split testing for office space search
My goal in sharing our results is to encourage you to take bigger risks with
your A/B testing. Please treat our particular designs with skepticism; the UX
that worked for us probably won’t work for you.
As a search engine for commercial real estate, our site has only one goal: for
visitors to find at least one office space that they
like enough to contact.
So, the simplest measure of our success rate (our conversion rate) is the
number of visitors who contact a space divided by the total number of
visitors. A contact is any phone call, email, or web-based tour request. For
the sake of simplicity, the data shown below is only for web-based tour
requests from search traffic.
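To make the definition concrete, here is a minimal sketch of the calculation. The method names and the visitor counts are made up for illustration; they are not 42Floors data.

```ruby
# Conversion rate as defined above: visitors who contact a space / total visitors.
# The counts passed in below are illustrative values, not real traffic numbers.
def conversion_rate(contacting_visitors, total_visitors)
  raise ArgumentError, "total_visitors must be positive" unless total_visitors.positive?

  contacting_visitors.to_f / total_visitors
end

# Format a rate (e.g. 0.0041) as a percentage string with two decimals.
def as_percent(rate)
  format("%.2f%%", rate * 100)
end

puts as_percent(conversion_rate(41, 10_000)) # prints "0.41%"
```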
Back in September 2013, our UX was in its third iteration since YC Demo Day;
it was called Unified View. The design served us well; it was elegant and
power-users liked it because we offered lots of filters and they could choose
between photo, list, or map-based search results.
For the month of September, our conversion rate was 0.41%.
Unified View: index page (conversion rate: 0.41%)
Unified View: show page (conversion rate: 0.41%)
While we didn’t have good industry benchmarks to tell us if a 0.41% conversion
rate was reasonable, we assumed we were on the low end. At the time, we’d
been running our A/B testing using Optimizely and our tests tended to be
modest: the usual textual changes, image variations, and call-to-action
tweaks. While we’d occasionally find a 15% gain, we wanted to run more
So we sketched out 8 wildly different landing pages in Photoshop and sent them
off to PSD2HTML to render as static HTML. Then we used AdWords campaigns to
randomize our traffic between the 8 landing pages. I wrote about that process
a few months ago in a post called “How we tests fake sites on live
Seeing the wide performance variations between the static landing pages gave
us confidence that we’d eventually find a killer UX if we kept running big
Homegrown A/B testing framework
As a data-driven search site, we were limited by how little we could stretch
our static landing page tests. What we needed was a way of serving up
entirely different search experiences to cohorts of users and tracking their
behavior. We considered using a split-testing gem, but the idea of littering
our codebase with inline conditions like this one was repulsive:
<% ab_test("login_button", "/images/button1.jpg", "/images/button2.jpg") do |button_file| %>
  <%= img_tag(button_file, :alt => "Login!") %>
<% end %>
Plus, we’d still be constrained by the existing structure of our app.
After a few days of exploring options, Aaron and Julian volunteered to build a homegrown A/B testing solution. Aaron’s working on a technical blog post
explaining the implementation in detail, but suffice it to say we can now do
things like this:
cap production-C deploy -s branch=hover4test
Since each app server is self-contained, branches are free to diverge wildly.
We took that freedom to extremes. At one point, less than 10% of the codebase was the same between the branches on ProdA and ProdB. Divergence that large makes the eventual git merge a nightmare, but a day of wrestling with merge conflicts is a small price to pay for unfettered experimentation.
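This post doesn't describe how visitors are actually routed to each server, so the following is only a sketch of one common approach: hash a visitor ID into a stable bucket and map buckets to servers by weight. The `ProdC` name, the traffic shares, and the method names are all assumptions made for illustration.

```ruby
require "digest/md5"

# Hypothetical server names and traffic shares (percentages summing to 100).
# These are assumptions for illustration, not the real 42Floors configuration.
SERVER_WEIGHTS = { "ProdA" => 40, "ProdB" => 30, "ProdC" => 30 }.freeze

# Map a visitor ID to a stable bucket in 0...100 by hashing the ID,
# so the same visitor always lands in the same bucket.
def bucket_for(visitor_id)
  Digest::MD5.hexdigest(visitor_id.to_s)[0, 8].to_i(16) % 100
end

# Walk the cumulative weights until the bucket falls inside a server's share.
def server_for(visitor_id, weights = SERVER_WEIGHTS)
  bucket = bucket_for(visitor_id)
  upper_bound = 0
  weights.each do |server, share|
    upper_bound += share
    return server if bucket < upper_bound
  end
  raise ArgumentError, "weights must sum to 100"
end

puts server_for("visitor-123")
```

Because the assignment depends only on the visitor ID, a returning visitor keeps seeing the same branch, which keeps each cohort's behavior attributable to a single experience.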
The first radical A/B test
Our static landing page tests suggested that the most promising UX was Google
Hover Clone. If you’re wondering what that looks like, think of the sidebar
that appears in Google search results sometimes when you hover over a local
business listing. The idea was to eliminate clicks: users could get most of
the information they needed about the properties just by hovering over
listings in the index of results.
Using our new A/B testing framework, we served up Hover1 to 30% of our new
users, followed quickly by Hover2, and eventually Hover3. But none of them
moved the needle.
We decided to give the Hover UX paradigm one more shot with Hover4. Here’s
the announcement email to the team:
For the two weeks in November that we
ran the experiment, our conversion rate for the branch was, once again, 0.41%.
Despite all the improvements we thought we’d made, our conversion rate hadn’t changed.
Hover4 (conversion rate: 0.41%)
The day after we killed the Hover UX experiment, Aaron had a suggestion.
Instead of looking to glossy sites like AirBnB for inspiration, what if we
looked to HackerNews, Reddit, and 4chan?
Thus, the Craigslist branch was born:
CraigslistView: index page (conversion rate: 1.44%)
CraigslistView: show page (conversion rate: 1.44%)
The Craigslist UX converted at 1.44% which, compared to our historical
baseline of 0.41%, was a 251% improvement. Supporting metrics like pages per
visit (+51%) were way up, too, as you’d expect.
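For anyone checking the arithmetic, the 251% figure is the relative lift over the baseline, not the ratio of the two rates. A small sketch, with a method name chosen here for illustration:

```ruby
# Relative improvement of a variant over a baseline, as a percentage.
def relative_lift_percent(baseline_rate, variant_rate)
  (variant_rate.to_f - baseline_rate) / baseline_rate * 100
end

# Rates from this post: 0.41% baseline, 1.44% for the Craigslist branch.
puts relative_lift_percent(0.41, 1.44).round # prints 251
```

Put another way, the Craigslist branch converted at roughly 3.5 times the baseline rate.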
We think that there’s still quite a bit of upside left so we’re actively
running more experiments. Ongoing experiments have a dedicated chart up on a
monitor in the lunch area so everybody can see how they’re doing. Here’s one
of the charts from today.
We put so much effort into split testing because we’re trying to find the
global maximum. We worry that purely incremental iterations will lead us to a
local maximum far below our potential. So, we force
ourselves to try ideas that are radically different from past experiments. At
some point we’ll probably run out of crazy ideas and then we’ll settle into
optimizing the winning UX instead of trashing and rewriting it periodically.
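The local-versus-global worry can be shown with a toy example. Everything below is made up: a one-dimensional "design space" with a small peak and a taller one, standing in for the far messier space of real UX designs.

```ruby
# Toy "conversion landscape" over a single design parameter x in 0..100.
# Entirely hypothetical: a small peak at x = 20 and a taller peak at x = 80.
def score(x)
  small_peak = 1.0 - ((x - 20).abs / 20.0)
  tall_peak = 3.0 - ((x - 80).abs / 10.0)
  [small_peak, tall_peak, 0.0].max
end

# Incremental optimization: move one step at a time while the score improves.
def hill_climb(start)
  x = start
  loop do
    neighbors = [x - 1, x + 1].select { |n| n.between?(0, 100) }
    best_neighbor = neighbors.max_by { |n| score(n) }
    return x if score(best_neighbor) <= score(x)

    x = best_neighbor
  end
end

# Radical experimentation: start from widely spaced designs, climb from each,
# and keep whichever end point scores best.
def climb_from_many(starts)
  starts.map { |s| hill_climb(s) }.max_by { |x| score(x) }
end

puts hill_climb(10)                # prints 20 (stuck on the small peak)
puts climb_from_many([10, 50, 90]) # prints 80 (finds the taller peak)
```

Small tweaks starting from a design near the small peak never leave it; only a jump to a very different starting point reaches the taller one.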
PS - are you curious what experiment is running right now? Here it is.