As we have noted in recent blogs, data collection and analysis are essential to the long-term success of your business. Whether you are analysing profitability or looking for ways to improve processes, data guides the way. However, data analysis is not something to jump into blindly. While working on drawing meaningful conclusions from your data, you are likely to encounter many challenges. It may be difficult to collect or identify the right data streams. The data itself may be incomplete, inconsistent, or unavailable.
Even once you have identified the data you need, and why you need it, you may still not be out of the woods. Analysing data comes with its own pitfalls. In this blog, we explore 11 common pitfalls you may encounter during the data analysis process. After all, forewarned is forearmed!
1. Texas Sharpshooter Bias
The Texas Sharpshooter Bias typically arises where a person has a large amount of data, but only focuses on a small subset of it, in many cases because this subset leads to the most interesting conclusion. It is named after a fictitious sharpshooter who fires a lot of shots at the side of a barn, looks at the result, finds a tight grouping of hits, paints a target around it, and then claims to be a “great sharpshooter”.
This bias is related to the clustering illusion, which is the tendency in human cognition to interpret patterns where none actually exist. You can prevent falling for this fallacy by first formulating a hypothesis, and then testing it. In other words, do not use the same information to both construct and test your hypothesis.
2. Gambler’s Fallacy
The Gambler’s Fallacy, or the Monte Carlo Fallacy, is the belief that if something has happened more frequently than normal, it will happen less frequently in the future. Consider a coin toss with a friend. After fifteen “heads” in a row, you might feel there must be an end to the pattern. You may think the chances of “tails” on the next flip are higher than before. However, this is not true. A fair coin does not have any memory of past flips, and the chances of flipping “heads” or “tails” do not change over time. Make sure you evaluate whether your assumptions are based on statistical likelihood or on personal intuition.
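The claim that a coin has no memory can be checked with a small simulation. The sketch below is only an illustration: it assumes a fair coin, uses a streak of five heads instead of fifteen (so that streaks occur often enough to count), and the function name `heads_rate_after_streak` is made up for this example.

```python
import random

def heads_rate_after_streak(n_flips, streak_len, seed=0):
    """Return (share of heads right after a streak, number of such flips)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    run = 0        # consecutive heads seen immediately before the current flip
    followups = 0  # flips that came right after at least `streak_len` heads
    heads = 0      # how many of those follow-up flips were heads
    for _ in range(n_flips):
        is_heads = rng.random() < 0.5  # assumption: a fair coin
        if run >= streak_len:
            followups += 1
            heads += is_heads
        run = run + 1 if is_heads else 0
    return heads / followups, followups

rate, count = heads_rate_after_streak(n_flips=200_000, streak_len=5)
print(f"{count} flips followed a streak of 5+ heads; {rate:.1%} of them were heads")
```

If the Gambler’s Fallacy were true, the share of heads after a streak would sit well below 50%. In a simulation like this it should stay close to 50%.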
3. Simpson’s Paradox
Simpson’s Paradox, or the Yule-Simpson Effect, is a phenomenon in statistics in which a trend appears in different groups of data, but disappears or reverses when these groups are combined. A famous example occurred in the 1970s when UC Berkeley was accused of sexism. Overall, female applicants were less likely to be accepted than male applicants. However, when researchers tried to find the source of the problem, they noticed that for many individual departments, the acceptance rates for women were as good as or better than those for men. The paradox existed because a greater proportion of the female applicants were applying to highly competitive departments that had lower acceptance rates for both genders.
This paradox is difficult to guard against beforehand. However, if you ever encounter this weird phenomenon, by finding a trend that reverses when you look at different groups within your data, then know that you have not necessarily made a mistake. You may simply have found an example of Simpson’s Paradox.
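A small worked example may make the reversal easier to see. The numbers below are invented for illustration and are not the real Berkeley figures; the department names and the helper functions are likewise hypothetical.

```python
# Hypothetical numbers, chosen only to make the effect easy to see:
# (admitted, applied) per department and group.
applications = {
    "Department A": {"men": (80, 100), "women": (18, 20)},   # less competitive
    "Department B": {"men": (4, 20), "women": (30, 100)},    # highly competitive
}

def department_rate(department, group):
    """Acceptance rate for one group within one department."""
    admitted, applied = applications[department][group]
    return admitted / applied

def overall_rate(group):
    """Acceptance rate for one group with all departments combined."""
    admitted = sum(dept[group][0] for dept in applications.values())
    applied = sum(dept[group][1] for dept in applications.values())
    return admitted / applied

for department in applications:
    men = department_rate(department, "men")
    women = department_rate(department, "women")
    print(f"{department}: men {men:.0%}, women {women:.0%}")
print(f"Overall: men {overall_rate('men'):.0%}, women {overall_rate('women'):.0%}")
```

In this made-up data, women do better than men in both departments (90% against 80%, and 30% against 20%), yet worse overall (40% against 70%), because most of the women applied to the department that rejects most applicants.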
4. Cherry Picking
Cherry Picking is one of the most harmful examples of dishonesty in data analysis, and also one of the simplest to overcome. It is the practice of selecting results that fit your claim and excluding those that do not. This is often seen in public debate and politics, and can be either deliberate or accidental.
Never forget that conclusions drawn from skewed data may lead to poor choices and negative consequences for your firm. You want the conclusions from your data to be accurate, so make sure you use the entire body of results. This is a responsibility to be shared by all of us. When you are receiving data second hand, it is important to ask yourself what you are not being told.
5. Data Dredging
Data Dredging, Data Fishing, Data Snooping, Data Butchery. However you name it, it is the misuse of data analysis to find patterns when there is no real underlying causality. This is done by performing many tests and only looking at the ones that come back with an interesting result.
Data Dredging is the inverse of Cherry Picking. With Cherry Picking you pick the data that is most interesting, and with Data Dredging you pick the conclusion that is most interesting.
The Texas Sharpshooter Bias is actually a specific version of Data Dredging, and so the solution is the same: first formulate a hypothesis, and then test it. Do not use the same data to both construct and test your hypothesis.
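To see why testing on the same data is a problem, consider the rough sketch below. It assumes 1,000 hypothetical indicators that are all pure noise (each one is just 20 fair coin flips) and calls an indicator “interesting” if it shows at least 15 heads. A single noise indicator has only about a 2% chance of passing, but across 1,000 tests some are bound to pass. All names and thresholds are assumptions made for this illustration.

```python
import random

FLIPS_PER_TEST = 20
THRESHOLD = 15  # "interesting" means at least 15 heads out of 20

def count_heads(rng):
    """One test of one indicator: the number of heads in 20 fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(FLIPS_PER_TEST))

rng = random.Random(1)  # fixed seed so the run is repeatable
n_indicators = 1000     # hypothetical indicators, all of them pure noise

# Step 1 (dredging): test everything on one data set, keep what looks interesting.
flagged = [i for i in range(n_indicators) if count_heads(rng) >= THRESHOLD]

# Step 2 (the remedy): treat each flagged indicator as a hypothesis and
# test it again on fresh data that played no part in selecting it.
replicated = [i for i in flagged if count_heads(rng) >= THRESHOLD]

print(f"{len(flagged)} of {n_indicators} noise indicators looked interesting")
print(f"{len(replicated)} of those still looked interesting on fresh data")
```

The first number is what Data Dredging reports as a discovery. The second number shows what tends to be left once each hypothesis is tested on data that was not used to find it.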
6. Survivorship Bias
The Survivorship Bias, or Survival Bias, is the error of focusing on people or things that make it past a selection process. Or rather, it is the error of overlooking those that did not make it past a selection process, because of their lack of visibility. It is named after the common fallacy we all experience when sharing information following a dangerous incident. You could be fooled into thinking that the incident under discussion was not particularly dangerous because everyone you communicate with afterwards survived. However, it may be that a number of people died in the incident. These people would not be able to add their voice to the conversation, which leads to the bias.
The same bias can be seen in highly competitive careers. You often hear movie stars, athletes, and musicians tell the story of how the determined individual who pursues their dreams will beat the odds. However, there is much less focus on the people that are similarly skilled and determined but do not succeed due to factors beyond their control. This leads to the false perception that anyone can achieve great things, whereas the reality is often a lot less equal.
When concluding something about the data that has survived a selection process, make sure you do not generalise this conclusion for the entire population. The incident may not be harmless just because the survivors survived.
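The effect of looking only at survivors can be sketched with a toy simulation. It assumes 10,000 hypothetical ventures with identical skill whose results are pure luck, and a made-up survival bar; none of these numbers come from real data.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical setup: identical skill everywhere, each result is pure luck.
results = [rng.gauss(0, 1) for _ in range(10_000)]

# Only ventures that clear the bar "survive" and remain visible afterwards.
SURVIVAL_BAR = 1.0
survivors = [r for r in results if r > SURVIVAL_BAR]

print(f"Average result, everyone:       {statistics.mean(results):.2f}")
print(f"Average result, survivors only: {statistics.mean(survivors):.2f}")
print(f"Share that survived:            {len(survivors) / len(results):.0%}")
```

Judging by the survivors alone, the average result looks excellent, even though in this simulation nobody was more skilled than anybody else.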
7. False Causality
False Causality is a common mistake, known by the phrase “cum hoc ergo propter hoc” (“with this, therefore because of this”). Often when two variables correlate, our brains tend to make up a story and find causation. A statement like “Children who watch a lot of TV are the most violent” easily leads us to believe that TV makes children more violent, but it could just as well be the other way around (violent children like watching TV more than less violent children). There could also be an underlying cause (bored children tend to be more violent, and tend to watch more TV), or it could be a complete coincidence.
Correlation on its own does not allow you to conclude that a cause-and-effect relationship exists, or in which direction it runs. Luckily, in practice, you often do not even need to know about causality. In a lot of cases, knowing about correlation is enough. And if you do need to know about causation, then you need to do more research.
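The “underlying cause” scenario can be sketched in a few lines. The simulation below is an assumption-laden toy: boredom drives both TV hours and aggression, and TV has no effect on aggression at all. The variable names and the `pearson` helper are written for this example only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 5000

# Assumed world: boredom is the hidden cause of both variables.
boredom = [rng.gauss(0, 1) for _ in range(n)]
tv_hours = [b + rng.gauss(0, 1) for b in boredom]
aggression = [b + rng.gauss(0, 1) for b in boredom]

raw = pearson(tv_hours, aggression)

# In a simulation we know the hidden cause exactly, so we can remove it.
tv_rest = [t - b for t, b in zip(tv_hours, boredom)]
aggression_rest = [a - b for a, b in zip(aggression, boredom)]
adjusted = pearson(tv_rest, aggression_rest)

print(f"Correlation between TV and aggression:  {raw:.2f}")
print(f"Same, after removing the hidden cause:  {adjusted:.2f}")
```

The raw correlation is clearly positive even though neither variable influences the other. In real data the hidden cause is rarely known this precisely, which is why establishing causation takes more research.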
8. Sampling Bias
Sampling Bias occurs when your data sample does not accurately represent the population. A classic example is the 1948 US presidential election. The Chicago Tribune printed the headline “Dewey Defeats Truman”, expecting Thomas Dewey to become the next US president. A commonly cited explanation is that the survey behind the prediction was done via telephone, at a time when only the most prosperous part of the population owned one. This caused a bias in the data, and so the prediction of Dewey’s victory was mistaken.
To prevent yourself from falling for the Sampling Bias, make sure that your data sample represents the population accurately so that whatever conclusions you make about your sample, you can conclude the same about your population.
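A toy simulation shows how a telephone-only survey can point to the wrong winner. Every percentage below is invented for illustration and is not historical data: it assumes wealthier people are both more likely to own a telephone and more likely to back the candidate.

```python
import random

rng = random.Random(4)  # fixed seed so the run is repeatable

def make_person():
    """Return (has_phone, backs_candidate) for one hypothetical voter."""
    wealthy = rng.random() < 0.30
    has_phone = rng.random() < (0.80 if wealthy else 0.20)
    backs_candidate = rng.random() < (0.65 if wealthy else 0.40)
    return has_phone, backs_candidate

population = [make_person() for _ in range(50_000)]

def support(people):
    """Share of the given people who back the candidate."""
    return sum(backs for _, backs in people) / len(people)

true_support = support(population)
phone_poll = [person for person in population if person[0]]
phone_support = support(phone_poll)

print(f"Support in the whole population: {true_support:.1%}")
print(f"Support among telephone owners:  {phone_support:.1%}")
```

With these assumed numbers, the telephone poll shows a majority for the candidate while the population as a whole does not. The sample is large, but it is not representative.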
9. Hawthorne Effect
The Hawthorne Effect is named after experiments done at the Hawthorne factory of Western Electric. Scientists researched the influence of working conditions, like light and heating, on the productivity of the workers. To their surprise, both the group whose working conditions were changed and the group where nothing was changed showed better productivity during the experiment.
After some additional research, the scientists concluded that both groups’ productivity increased because they found the scientists’ interest motivating and stimulating.
The Hawthorne Effect, or Observer Effect, is the effect that something changes just because you are observing it. This is something you will encounter often when collecting data on human research subjects, and less often when collecting data on less influenceable subjects. However, it is still good practice to ask yourself whether you are in any way affecting the data by collecting it.
10. McNamara Fallacy
The McNamara Fallacy is the mistake of making a decision based solely on quantitative metrics while ignoring everything that cannot easily be measured. It is named after Robert McNamara, the US Secretary of Defense (1961–1968), who measured success in the Vietnam War by enemy body count. This meant that other relevant insights, like the mood of the US public and the feelings of the Vietnamese people, were largely ignored.
Although data and numbers can tell you a lot, you should not obsess over optimising numbers while ignoring all other information.
11. Overfitting
A complex explanation will often describe your data better than a simple one. However, a simple explanation is usually more representative of the actual world than a complex explanation. The more complex the explanation, the better it works for the data that you already have. However, it is then likely to be explaining random variations that happen to be captured in that data. As soon as you add more or new data, the explanation of those random variations breaks down. Simpler models are usually more robust, and so are a better fit for new data.
Overfitting is probably the best-known fallacy. It is also an easy one to fall for, and a difficult one to prevent. If you find yourself coming up with very good results on the data you used to build your theories, but not when you proceed to test these theories on new data, then you might be overfitting.
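The difference between a simple and an overly complex explanation can be sketched as follows. The example assumes a true relationship of y = 2x + 1 with random noise, fits a straight line as the simple model, and uses an extreme “complex” model that just memorises the existing data and repeats the nearest known answer. Both models and all numbers are illustrative choices, not a recommended method.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

def sample(n):
    """Noisy points around the assumed true relationship y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]

train, new_data = sample(200), sample(1000)

# Simple model: a straight line fitted by ordinary least squares.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def simple_model(x):
    return slope * x + intercept

# Complex model: memorise the existing data and repeat the nearest known answer.
def complex_model(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model's predictions on the given data."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("simple", simple_model), ("complex", complex_model)]:
    print(f"{name:8s} error on existing data: {mse(model, train):6.2f}, "
          f"on new data: {mse(model, new_data):6.2f}")
```

The memorising model is perfect on the data it has already seen, because it has also memorised the noise. On new data the noise is different, and the straight line is typically the better fit.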
So, as we mentioned at the beginning, forewarned is forearmed. At Clocktimizer we know that data is not, in and of itself, the key to success. However, when collected, analysed and used in a precise manner, with precise goals, it can make the difference between success and failure. To do that, you need to make sure that the entire process is free of bias. Now that you are aware of the pitfalls you may encounter, you are ready to tackle your data and start analysing. If you are looking to take the next step in data analysis, why not check out our free e-book? It offers a step-by-step guide to introducing Machine Learning in your firm.
Illustration by Tara Lingard