“War is ninety percent information,” Napoleon Bonaparte said in the nineteenth century. That is just as true as ever in the information age we live in today. The most successful companies, such as Google, Facebook, and Amazon, base their revenue models on sharing and selling information, and they get this information by analysing data. But the big fish are not the only ones able to make money on data: lots of companies benefit from leveraging data to obtain valuable insights. That is why Clocktimizer has introduced its own Machine Learning Engine, to give our customers the edge over the competition.
However, it can be difficult to know where to start with Machine Learning. How much data do I need? How do I collect it? What format should I store it in? In this four-part blog series, we introduce you to the art of data science. You will learn the best practices of collecting, cleaning, analysing, visualising, and interpreting data. We’ll also help you think of ways Clocktimizer’s Machine Learning can support your law firm. In the first part of the series we looked at collecting data. In part 2, we introduce you to data preparation.
Once you have collected your data, you probably want to dig in and start analysing right away. But we have to stop you there: unfortunately, your data is not perfect. It will contain mistakes and lack information. You will want to start by cleaning up your data, then take care of missing values, and finally handle missing attributes.
The data janitor
Cleaning your data means going through it, finding whatever is not right, and correcting it. Often you can automate the correction process, but you will have to find the mistakes yourself. Here is a small list of the kinds of mistakes to look for:
- There may be inconsistencies in the representation of your data. Sometimes a value may be stored as 1 or 0, at other times as Yes or No, or as True or False. Since these are practically the same, you will have to choose one representation and transform the anomalies to match that choice consistently.
- There may be differences in data types. You have probably encountered this in a simple Excel sheet already: 1 may be interpreted in one cell as the number 1, but in another cell as the text “1”. The computer will see these two values as different. You want the cells to be consistent, so you should change the data type of one of them.
- There may be inconsistently spelled categories. Data about gender may contain “male”, “Male”, and “female”. Obviously, “male” and “Male” should be regarded as the same category, so you have to use one spelling consistently.
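Much of this clean-up can be automated once you know which mistakes occur. Here is a minimal sketch in Python; the records, the column names (“active”, “gender”), and the normalisation rules are hypothetical, not taken from any real data set:

```python
def clean_record(record):
    """Normalise inconsistent representations in a single record."""
    cleaned = dict(record)

    # Map the many spellings of a boolean (1/0, Yes/No, True/False)
    # to one consistent representation.
    truthy = {"1", "yes", "true"}
    falsy = {"0", "no", "false"}
    value = str(cleaned["active"]).strip().lower()
    if value in truthy:
        cleaned["active"] = True
    elif value in falsy:
        cleaned["active"] = False

    # Use one spelling per category: here, lower-case the label.
    cleaned["gender"] = str(cleaned["gender"]).strip().lower()
    return cleaned

records = [
    {"active": "Yes", "gender": "Male"},
    {"active": 1, "gender": "male"},
    {"active": "False", "gender": "Female"},
]
cleaned = [clean_record(r) for r in records]
# All three records now use True/False and one spelling of each category.
```

The point of the sketch is the division of labour: a human decides which representations count as “the same”, and the computer applies that decision to every record.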
This is by no means an exhaustive list, but it should give you an indication of the kinds of mistakes to look for. Cleaning up data can be a boring task, but it is essential for your analysis. After all, the quality of your results depends on the quality of your data. Or, as the saying goes: garbage in, garbage out. So you had better get rid of as much garbage as possible.
One of the most annoying problems is that of missing values. Say you have data about the age of your employees, but some of them have not filled in their age. What should you do? In general, there are four ways to handle this: ignore the missing values, ignore the whole data point, ignore the attribute, or fill in the gaps.
The first solution, ignoring the missing values, can give you trouble during analysis. If you want to know the average age of your employees, or how many of them are younger than thirty, what are you going to do with the “ageless” employees? Some kinds of analysis do not work with missing data at all, so leaving the gaps open is not always an option.
The second solution, ignoring the whole data point, can be feasible: you simply throw data points with missing values out of your data set. Sometimes this is no problem at all. However, if those data points contain other important data, you may be throwing away valuable information. For example, if women are less likely to fill in their age than men, and you throw away all data points with the age missing, you create a bias towards men in your data set. Another unwanted consequence: if too many data points have missing values and you throw them all away, you may end up with a very small data set.
The third solution, ignoring the attribute, is another feasible option. You then ignore the age attribute altogether, whether it is missing or not. The disadvantage is that you throw away more information than necessary: most likely only a small number of data points have missing values, and the remaining data points may still hold valuable insights about this attribute.
The fourth solution, filling in the gaps, is more creative. You can fill in the blanks by randomly drawing values from the data points that do have them, or by filling in the average. This way you keep all your data points, have no missing values, and the data still represents your sample fairly well. But be careful: this method may lead to reasonably accurate analysis results, but it also changes some other properties. Imputing the average value, for example, reduces the variability of that attribute. That is not necessarily a problem, but you have to keep it in mind when interpreting your results.
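The last three strategies fit in a few lines of Python. The ages below are made up, with None standing in for a missing value:

```python
import random

ages = [34, None, 29, 41, None, 38]

# Option 1: ignore the whole data point — drop records with a missing age.
dropped = [a for a in ages if a is not None]

# Option 2: fill in the gaps with the average of the known values.
# Note this shrinks the variability of the attribute.
mean_age = sum(dropped) / len(dropped)
imputed_mean = [a if a is not None else mean_age for a in ages]

# Option 3: fill in the gaps by randomly drawing from the known values.
random.seed(0)  # seeded only so the example is reproducible
imputed_random = [a if a is not None else random.choice(dropped) for a in ages]
```

With these six values, `dropped` keeps four data points, while both imputed lists keep all six; `mean_age` works out to 35.5, so the two gaps in `imputed_mean` are filled with 35.5.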
Often it is not only values that are missing; whole attributes may be missing too. You may want the computer to analyse something that is not in the data yet. For example, if you want the computer to categorise some data automatically, it needs examples of how to categorise. You then have to extend your data manually. Say you have a million time cards and you want the computer to categorise them by matter. You must first teach the computer how to do that, and one way is to give it a number of time cards that have already been labelled by a human.
How many time cards are sufficient ranges from the hundreds to the tens of thousands. It all depends on how many categories there are, how predictable your data is, and how much time you are willing to dedicate to manual labelling. It may seem like a lot of work (and it is), but try to gather some nice colleagues on a wintry afternoon and work on it together. Before you know it, you will have a great collection of labelled data points, and you can let the computer take care of labelling the other 990,000 time cards.
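To illustrate how labelled examples teach a computer to categorise, here is a deliberately simplistic word-counting categoriser in Python. The time card texts and matter categories are invented, and a real system (such as Clocktimizer’s engine) is far more sophisticated; this toy only shows the principle:

```python
from collections import Counter, defaultdict

# A handful of human-labelled time cards: (description, matter category).
labelled = [
    ("drafted merger agreement", "M&A"),
    ("reviewed merger due diligence", "M&A"),
    ("prepared court filing", "Litigation"),
    ("attended court hearing", "Litigation"),
]

# "Training": count how often each word appears per category.
word_counts = defaultdict(Counter)
for text, category in labelled:
    for word in text.lower().split():
        word_counts[category][word] += 1

def categorise(text):
    """Pick the category whose labelled examples share the most words."""
    scores = {
        category: sum(counts[word] for word in text.lower().split())
        for category, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

categorise("merger negotiation call")  # → "M&A"
```

The more labelled examples the computer sees, the better it can score an unlabelled time card, which is why the afternoon of manual labelling pays off.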
Now that you have prepared your data, you can get started on the fun stuff. In the next blog we will talk about what kind of information you can get from your data analysis.