The Profitability of Proof

[Illustration: a pair of shoes at a crossroads, one side labelled "fact" and the other labelled "alternative"]

Generating fake news and acting on whims are all the rage these days. Political discussions in particular make headlines for assertions offered with little or no attempt to provide supporting evidence. But the lack of evidence is a serious issue in the corporate world as well.

Indeed, while day-to-day business discussions are not exactly a staple of media consumption, managers frequently issue or accept claims that lack factual support. And they were doing so long before media managers in the U.S. White House popularized the term “alternative facts.” As Thomas Davenport pointed out in a 2009 HBR article, corporate decision makers regularly ignore the need for facts when they “fiddle with offerings, try out distribution approaches, and alter how work gets done, usually acting on little more than gut feel or seeming common sense.”

To make matters worse, many business bets and guesstimates are wrapped up “in the language of science, creating an illusion of evidence.”

As Davenport notes, some big strategy decisions must be made without the benefit of perfect evidence because not all massive changes to the way firms operate can be tested. But unless they are psychic, anyone spouting statements such as “free shipping will boost sales sufficiently to compensate for the costs” or “redesigned product packaging will considerably increase sales” without having evidence to back up these claims is either overconfident or foolish.

Having the investigative rigour to gather the evidence required to back up a corporate claim has always been important, but with winging it so much in fashion, the case for effective evidence gathering has never been stronger. Knowing what you are doing simply tends to be more profitable.

Unfortunately, while the data organizations ordinarily collect on a regular basis can be useful, it is often insufficient to properly answer many relevant questions. In other words, the evidence required isn’t typically just ready to be analyzed. It often needs to be developed using real-world experimentation.

Let’s assume you just completed one year of sales for a redesigned product. Is a 2 per cent year-on-year increase in sales evidence that the redesign worked? What if sales of a similar product, which was not redesigned, showed a 1.8 per cent increase? Or what if an annual increase of 1.5 per cent in sales is typical, even without a product redesign? Maybe the redesign was responsible for a 0.2 per cent difference compared to the increase of similar products; or a 0.5 per cent increase compared to typical per annum sales increases; or the entire 2 per cent increase; or maybe none of the increases. Analyzing historical data is useful, but what exactly does this data indicate?
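To make the ambiguity concrete, here is a minimal sketch (in Python, using the hypothetical growth figures above) of how the estimated lift attributed to the redesign shifts depending on which baseline you pick:

```python
# Hypothetical year-on-year growth rates from the example above.
redesigned_growth = 0.020       # 2 per cent, redesigned product
similar_product_growth = 0.018  # 1.8 per cent, similar non-redesigned product
typical_growth = 0.015          # 1.5 per cent, typical annual increase

# The estimated lift depends entirely on which baseline you choose.
baselines = {
    "similar product": similar_product_growth,
    "typical annual increase": typical_growth,
    "no baseline at all": 0.0,
}
for name, baseline in baselines.items():
    lift = redesigned_growth - baseline
    print(f"vs {name}: estimated lift of {lift:.1%}")
```

The same 2 per cent result supports three very different conclusions, which is precisely why historical data alone cannot settle the question.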

Data can indicate the end results, but there is often a need to dig deeper. We need to know whether an action (e.g., the product redesign) changed the outcome enough that the results would have been meaningfully different without it. To confidently understand the impact of our actions, we must address three key questions: Is the relationship between the action and outcome meaningful? What would have happened if nothing had been done? Could the result be due to sheer chance?

When assessing medical treatments, medical researchers use randomized controlled trials—the gold standard of experiments, where individual test participants are randomly assigned to a condition. This gives them a better understanding when it really matters. This article examines how businesses can do the same. Even when a perfect experiment can’t be run, there are ways to gain a better understanding using evidence-gathering techniques. To help you conduct better testing for your own business, we will also address three more questions: What if you can’t run a randomized controlled trial in a lab? Which organizations should conduct tests? How should you conduct your testing?

THE RELATIONSHIP BETWEEN ACTION AND OUTCOME

A famous maxim states that correlation does not imply causation. (Correlation is the statistical relationship of two things moving together. For example, there is a positive correlation between height and weight; all else being equal, taller people tend to weigh more than shorter people.) The difference between correlation and causation is surprisingly important for business managers. If pet food sales increase in line with the population of toddlers in a specific location, should a pet food business target toddlers in its advertising? This seems unwise. It is likely that a growing city sees an increase in both toddlers and pets. The relationship between toddlers and pet food sales isn’t directly causal.

Managers are interested in the impact directly caused by their actions. After a boost in weekly sales, a manager may want to know if a given promotion “caused” the sales. After all, you can’t effectively work towards repeating a success, or avoiding a failure, if you don’t know what caused the outcome—whether positive or negative.

To assume that a certain correlation implies a causal relationship can lead to mistakes ranging from amusing to disastrous. Significant social problems have been caused by prejudices fed by mistaking correlation for causation. Harvard Law student Tyler Vigen compiled a website full of spurious correlations (correlations that are not the result of a direct causal relationship). For example, Vigen shows the remarkably high correlation between consumption of mozzarella and civil engineering doctorates. Before Kraft embarks on the funding of engineering schools, however, we should note that there is no reason to think that one might cause the other. Eating mozzarella does not help with engineering calculations, nor does studying engineering generate a taste for cheese. In this case, the lack of causal relationship is obvious, but without careful consideration, managers can draw equally spurious conclusions, albeit on less ludicrous subjects.

The potential for confusing correlation and causation only becomes greater in a world of big data. Massive amounts of data and computing power allow firms to extensively mine data. If you mine enough data you will find numerous correlations, many of which will be purely spurious. The solution is to specify what you expect to see before looking for it. Creating a real-world experiment forces the manager to specify what is being tested in advance of any action. This method helps business managers generate confidence in the meaningfulness of the results. When you predict in advance, your results are much more compelling than when you simply rationalize the results after the fact.
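The point is easy to demonstrate. The sketch below (our illustration, not drawn from any real dataset) generates 1,000 metrics that are pure random noise, mines them for correlations with an equally random "sales" series, and still finds several strong ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 unrelated "metrics", each observed over 24 months of pure noise.
metrics = rng.normal(size=(1000, 24))
sales = rng.normal(size=24)  # a "sales" series with no real driver

# Mine every metric for a correlation with sales.
correlations = np.array([np.corrcoef(m, sales)[0, 1] for m in metrics])
strong = np.abs(correlations) > 0.5

print(f"Metrics with |r| > 0.5 by pure chance: {strong.sum()} of 1000")
```

Every strong correlation found here is spurious by construction; a manager who went looking without a prior hypothesis would have no way to tell.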

Expert Advice: Specify in advance what you expect to find. Predicting the result before the test—especially if you are correct—is far more impressive than making up a story after the fact to fit the data.

WHAT IF NOTHING HAD BEEN DONE?

Historical data shows us what happened, but it does not show what would have happened if a certain action had not been taken.

Historical data alone is not enough. The impact of an action can only be uncovered by comparing the results of taking that action against a no-action baseline. Randomized controlled trials, where subjects are assigned to either a control or test condition, attempt to solve this problem by generating a baseline scenario (the control). The baseline scenario has no impact from the action being contemplated. For example, when testing whether a new website design will boost sales, marketers can establish a control by presenting the old version of the website to one test group and the new version to another. This type of test is common online, and is very effective. Online advertising lends itself to this type of study because different website visitors can see different versions of the ad, making it easy to compare ad effectiveness.

To establish an effective comparison, it is critical that the control and test groups are made up of similar types of people, or test subjects. For example, it would be pointless to have one group of all men see one ad and a group of all women see another ad. Similarly, the two groups should not be from different regions. After all, when groups are different at the start, it is difficult to identify the cause of any differences after seeing the advertisement. What accounted for the difference: the sex of the person, the person’s geographical location, the specifics of the advertisement, or some combination?

Results that differ between groups are easy to detect, but without a test there is little confidence in identifying the cause. Online testing is often effective because the test can very easily and randomly be administered—one version of a webpage is shown to the test group and another version to the control group, which establishes the baseline. The large number of visitors, from a wide variety of locations and assigned completely at random, provides a high probability of having two similar groups. The random assignment (in which subjects receive different versions of a test in a completely random fashion) ensures that the two groups are relatively comparable in terms of sex, location, age, and a wide variety of other potentially relevant factors. If the two groups are similar before the two ads are shown, the cause of the results can be more safely attributed to the ads.
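As a minimal sketch of how such random assignment might be implemented (the function name and visitor IDs are our own illustrations), hashing each visitor's ID splits traffic roughly evenly between the two versions while ensuring a returning visitor always sees the same one:

```python
import hashlib

def assign_variant(visitor_id: str) -> str:
    """Assign a visitor to the control or test group.

    Hashing the ID is effectively random across visitors, yet
    deterministic, so each visitor keeps seeing the same version.
    """
    digest = hashlib.sha256(visitor_id.encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"

# Example: assign a few hypothetical visitors.
for vid in ["visitor-001", "visitor-002", "visitor-003"]:
    print(vid, "->", assign_variant(vid))
```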

Expert Advice: If people are randomly chosen, then both groups—one receiving the action and one not receiving the action—are likely to be made up of similar people, which gives confidence that the results are due to the action.

WHAT ROLE DID CHANCE PLAY?

Sales fluctuate daily. If we seek meaning in every tiny variation, we will end up pursuing numerous pointless dead ends and assigning cause where there is none. This is where the complexity of statistics comes in. Thankfully, however, running a randomized controlled trial simplifies the math to the level of Microsoft Excel proficiency, although a wide range of specialized software for larger or more complex datasets is also available.

Statistical tests allow us to compare the distribution of a particular result (e.g., number of sales) from customers who saw a specific action (e.g., the new ad) versus those who saw the control (e.g., the old ad). The variability in the data, with a few standard assumptions, allows us to estimate the likelihood that any difference we observed was simply due to chance. If the difference was minor and likely due to chance, we can conclude that there was no impact from our action. However, if the difference was too pronounced to be caused by normal data fluctuations, we can conclude that our action had an impact.

The simplest statistical test, which is often effective enough to assess a randomized controlled trial, is a t-test (named for its use of the t-distribution, similar to a bell curve). A t-test can examine two sets of data, such as the spending of those seeing ad A or ad B, and assess the likelihood that the group spending patterns are similar regardless of the ad they viewed, ignoring minor chance variations.
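A minimal sketch of such a test, using Python's scipy library and entirely hypothetical per-customer spending figures, might look like this:

```python
from scipy import stats

# Hypothetical spending (in dollars) by customers who saw each ad.
spend_ad_a = [23.1, 18.4, 25.0, 21.7, 19.9, 24.3, 22.8, 20.5]
spend_ad_b = [26.2, 24.8, 27.5, 23.9, 28.1, 25.4, 26.9, 24.1]

# Welch's t-test; it does not assume equal variances between groups.
t_stat, p_value = stats.ttest_ind(spend_ad_a, spend_ad_b, equal_var=False)

alpha = 0.05  # significance level, chosen before running the test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Difference unlikely to be chance alone: the ads appear to differ.")
else:
    print("Difference could plausibly be chance: no effect demonstrated.")
```

The p-value is the probability of seeing a difference at least this large if the ads actually performed identically, which leads directly to the question of how small that probability must be.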

But how do we measure “likely”? The most widely used significance level (the chance, given the assumptions are correct, of wrongly claiming a relationship when none exists) is 5 per cent. Although the threshold itself is arbitrary, the convention is useful because it is widely accepted, and it can be changed if there are reasons to do so. A higher value is considered more liberal and a lower value more strict; business managers should adapt the significance level based on the severity of the consequences of being wrong.

To decide on the appropriate significance level before running the test, ask yourself, “What would happen if we incorrectly conclude that the test was successful?” If the negative consequences of this error are relatively modest, such as a minor difference in profit, you might require less stringent proof. If the consequences of a mistake are dramatic, where the future of your business is at stake, you need to use a strict significance level of 1 per cent or even 0.1 per cent. Of course, running more than one experiment reduces the chance of mistakes, which is one reason why medical trials require multiple stages of testing.

Expert Advice: Determine if an outcome is significant by using a simple statistical test such as the t-test, and specify your required level of significance before running the test.

WHAT IF YOU CAN’T RUN RANDOMIZED LAB TRIALS?

Academics love laboratory tests. And for good reason. A lab gives the tester the ability to remove the impact of extraneous influences. Outside the lab, even the weather can influence sales and confuse results—think of umbrella sales due to rain or swimsuit sales during heatwaves. (Some might argue that President Trump’s poor inauguration attendance was caused by the threat of mild drizzle.)

You might want to test the impact of a new point-of-purchase display on sales. This can be cleanly tested in a mock-up of a store housed in a lab. However, despite what we can learn from lab experiments, they never fully capture the effect of the real world. For example, it is important to know what will happen when consumers see an ad while carrying on normally rather than when they are paid to react to an ad in a lab. The need to remove extraneous influences while replicating the unpredictable real world is a challenge for lab experiments. Will the results of the lab experiment reflect the real-world experience?

One way to achieve the best of both worlds is to use a field experiment—an experiment run in the real world but with as many confounding influences as possible removed. The value of field experiments is obvious. Although messier than lab tests, field experiments inspire a greater level of confidence that the results will be relevant to the business, because the results occur in an everyday situation. Some field experiments, such as direct mail campaigns that target different offers at randomly selected consumers, can even rival the control of a lab experiment.

Although experiments can achieve a high level of control, there are many business instances in which experiments simply cannot be executed. For example, a physical store in a specific location cannot use an experiment to offer different prices to randomly chosen customers. In such situations, it is important to maintain a baseline for comparison. For example, a retailer with stores in different locations can test a newly designed display in one location while the other stores use the traditional display. However imperfect, this approach can be useful, provided the stores are as comparable as possible to ensure that untested differences are not driving any variation in sales results.

For example, CKE Restaurants, a U.S. fast food restaurant group that owns Hardee’s and Carl’s Jr., uses field tests for the introduction of new products. The company develops a variety of new product ideas, refines the options internally, and then tests the new products in its stores by measuring pre-defined metrics, and by using a test group versus a control group.

Random assignment by store, rather than by individual customer, is never ideal. If two stores are near each other, the same consumer could end up visiting both the test and control stores. Other uncontrollable events, such as environmental conditions or road works, can also affect results at one store and not another. Still, testing between stores can be effective, and it is certainly better than not testing at all. If, after the appropriate statistical tests, randomly chosen stores stocking products with newly designed packaging show significantly higher sales than those with the old packaging, the test can be considered a success.

Businesses often choose to compare tests to a specific baseline, such as the previous year’s sales, rather than a randomly selected control group. This may be a pragmatic compromise, but it is clearly inferior to an experiment with a test and a control group. Year-over-year differences can be attributed to a large number of possible causes. For example, in any given year sales could be down due to the closure of a major local employer. A new package design might be very effective yet only maintain sales at the prior year’s level if other factors would have reduced overall sales without the redesign. Though imperfect, we would not discourage such analysis if the only practical alternative is no investigation.

Experiments that lack formal randomization of subjects are known as quasi-experiments. When their limitations are recognized, these tests may still be useful for analysis, even when they occur outside the control of the business. For example, an automaker can test the effects of a provincial government tax credit on sales of energy-efficient cars by comparing sales in the province where the credit was launched against sales in provinces without the credit.
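One standard way to analyze such a quasi-experiment is a difference-in-differences comparison: measure the change over time in the province with the credit, subtract the change in provinces without it, and treat the remainder as the estimated effect. The sketch below, with entirely hypothetical sales figures, illustrates the idea:

```python
# Energy-efficient car sales (hypothetical units per quarter).
credit_province = {"before": 1000, "after": 1300}   # credit launched
other_provinces = {"before": 950, "after": 1050}    # no credit

# Change in each group over the same period.
change_credit = credit_province["after"] - credit_province["before"]  # +300
change_other = other_provinces["after"] - other_provinces["before"]   # +100

# Subtracting the no-credit trend isolates the credit's estimated effect,
# assuming both groups would otherwise have trended alike.
estimated_effect = change_credit - change_other
print(f"Estimated effect of the credit: {estimated_effect} extra units")
```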

In general, the further a test deviates from an ideal randomized controlled trial, the more complex the statistical analysis needed to compensate for the experiment’s deficiencies. Without random assignment, we must correct for factors that render the two groups incomparable. Randomized controlled trials are the gold standard, but don’t give up on the idea of trying to find out what is happening in your business just because random assignment is challenging in your industry.

When you are presented with evidence, remember that data alone does not equal evidence; without a baseline (control group) you can’t know what would have happened if you had not taken that action. Consider also that consumers often have little insight into their own decision-making, and that context changes their behaviour; surveys alone may have limited value. Generally, it is best to triangulate across multiple types of evidence. Errors can creep into any piece of evidence, but if the evidence comes from different methods and sources, the same error is less likely to reappear across all of them.

Remember, statistical tests determine whether your favourable results likely happened by chance. If you run the same test enough times, however, even an unlikely result will eventually occur. For example, you may be a poor shot, but after enough attempts you’re likely to eventually get the crumpled-up paper into the recycling basket.
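The arithmetic behind this warning is simple. Assuming independent tests, each run at a 5 per cent significance level, the chance of at least one fluke "success" grows quickly with the number of attempts:

```python
# Chance of at least one false positive across repeated independent tests,
# each using a 5 per cent significance level.
alpha = 0.05
for n_tests in [1, 5, 10, 20]:
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>2} tests: {p_any_false_positive:.0%} chance of a fluke result")
```

Run twenty tests and the odds of at least one spurious "win" are roughly two in three.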

Expert Advice: Beware of cherry-picking results; if you run a test many times, don’t look at the only positive result and claim success.

TO TEST OR NOT TO TEST?

Businesses have traditionally been wary of seeking formal evidence. But despite the many barriers to gaining evidence of causes, a surprisingly wide range of organizations are able to do so effectively.

Direct marketers such as Capital One have been at the forefront of testing. The credit card business uses high-volume, personalized offers in well-controlled randomized trials. Online marketers also test extensively, from charities seeking the most compelling request for donations to retailers such as Amazon determining the best price to offer its customers. Banks experiment with different financial product offers, and political parties test their campaign materials to see what works and what doesn’t. A variety of organizations consistently seek to understand the impact of any changes they make in their business.

High-level business strategies remain difficult to test but, fortunately, testing is becoming more widely applicable than ever. One of the most interesting developments in recent years has been the adoption of testing across a wide range of governmental agencies around the world. The United Kingdom’s Behavioural Insights Team was an early proponent of testing and the Organisation for Economic Co-operation and Development is actively advancing the idea that testing can be used to develop governmental policy based on people’s behaviour. Canada’s Privy Council Office is testing how to increase the uptake of government education incentives by lower-income families and is encouraging the use of testing by other agencies. The Canada Revenue Agency and Employment and Social Development Canada have initiated testing to help them improve their offerings, for the benefit of service users and taxpayers. Provincial governments such as Ontario (the Behavioural Insights Unit) and British Columbia (the Behavioural Insights Group) are applying behavioural insights and testing to improve the services they offer to their residents.

Expert Advice: From Capital One, to Amazon, to the Privy Council Office, diverse organizations have determined that they can benefit from testing—so can your business.

HOW SHOULD YOU CONDUCT TESTING?

To experiment is to admit that you don’t necessarily know what the best course of action is, but you are willing to find out. Executives often feel the need to claim they know more than they do in order to appear in control, but this backfires when results don’t go as expected. When a manager acknowledges that he or she doesn’t possess psychic abilities, a more reliable power can be deployed: testing. Appropriate tests can offer great insight into causes, allowing the company to benefit from smarter decisions.

A further ingredient for successful evidence collection is a willingness to be wrong. When accuracy of prediction is rewarded, a test may easily be manipulated to justify the manager’s prior beliefs. However, managers shouldn’t be assessed on how confident they are before the test—they should be judged on what was learned.

Embracing experimentation means embracing short-term inconveniences for long-term benefits. Dan Ariely, a professor at Duke University, reported on manager concerns that testing, by its nature, means treating people differently. Surely all customers have a right to the better offer right away, managers argue. The problem with this anti-testing logic is that it assumes that business managers already know what the better offer is. But without testing, how can they be sure what the consumers want?

Trials are used to determine the efficacy of new medicines, where one group receives a placebo rather than the newly proposed medical intervention. This process allows doctors to understand whether the medical intervention will actually be effective. Such testing is widely considered acceptable, even if it means that some people in the trial who are assigned to the control group do not receive a potentially life-saving drug. This is deemed preferable to the risk of letting potentially dangerous, or at least ineffective, new drugs onto the market.

In business, similarly, the long-term benefits that a test can provide outweigh the short-term risk that some customers may be temporarily denied the better deal. Tests allow organizations to understand whether it is feasible to give customers a better deal. Will the profits from additional sales compensate for lower margins per unit after a price cut? Without testing, that answer is unknown.

When recruiting, businesses are better served by hiring a candidate who encourages testing rather than a candidate who claims to know the answer without testing. Even if a claim to know what is better is correct in today’s environment, the industry dynamics will likely change without warning, leaving the candidate with outdated knowledge unless he or she is able to test and learn about the new environment.

Expert Advice: Being able to recognize what we don’t yet know, and determine the answers through testing, is useful throughout an entire career.

Testing is far from a panacea, especially with randomized controlled trials being a challenge in some business situations. However, there are numerous opportunities available to business managers to conduct tests today. And whenever it is feasible to randomize what you offer your customers, you have the potential to gain a much better idea of what works best.

Even if conditions are not perfect, a clever approach to testing will usually offer insight as to whether recommended actions are worth the effort. In other words, putting evidence where your mouth is can save your organization money while improving outcomes for your customers. You can, of course, argue the opposite, but not with compelling evidence to back up the claim.

About the Authors

Neil Bendle is an Associate Professor of Marketing at the Ivey Business School at Western University, London, Ontario, Canada. Contact nbendle@ivey.ca.

Katie Chen is an undergraduate student in Honours Psychology at Western University.

Dilip Soman is the Corus Chair in Communication Strategy at the Rotman School of Management, and co-director of the Behavioural Economics in Action Research (BEAR) centre.