“War is ninety percent information,” said Napoleon Bonaparte in the nineteenth century. This is just as true as ever in the information age we live in today. The most successful companies, like Google, Facebook, and Amazon, base their revenue models on the sharing and selling of information. And they get this information by performing analysis on data. But it is not only the big fish who are able to make money on data: lots of companies benefit from leveraging data to obtain valuable insights. That is why Clocktimizer has introduced our own Machine Learning Engine, to give our customers the edge over the competition.
However, it can be difficult to know where to start with Machine Learning. How much data do I need? How do I collect it? What format should I store it in? In this four-part blog series, we will introduce you to the art of data science. You will learn the best practices of collecting, cleaning, analysing, visualising, and interpreting data. We’ll also help you think of the ways Clocktimizer’s Machine Learning can support your law firm. In the first part of the series we looked at collecting data. In part 2, we introduced you to data preparation. This week’s installment gets to the meat of the problem – data analysis.
Why do data analysis?
A good reason for analysing data is that it can improve your decision-making process. Instead of guessing or listening to your gut feeling, you can start testing your ideas and confirming your initial assumptions. It not only gives you a clearer view of reality, but it is also a great way to create arguments for your decisions.
Remember, the most important thing is to be curious. You want to dig further into the data to find causation or correlation. Try to find and explain unexpected results. Ask questions, and use the data to test your hypotheses.
The first thing you want to do is to explore the data. It is important to get a feeling for the information. How many data points do you have? What kind of values are there? Is your data distributed evenly over some values, or biased towards a group (are there just as many matters from each different practice group)? What is the range of values (lowest, highest, average)? Try to understand what kind of patterns you see in the data, and try to explain why these patterns are there.
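The exploratory questions above can be answered with a few lines of code. Here is a minimal sketch in Python, using a small invented list of (practice group, hours) records as the data set:

```python
# A minimal sketch of exploring a data set. The records are hypothetical:
# each is a (practice_group, hours) pair from a time-recording system.
from collections import Counter

records = [
    ("Corporate", 12.5), ("Corporate", 8.0), ("Litigation", 3.5),
    ("Litigation", 6.0), ("Litigation", 9.5), ("Tax", 4.0),
]

hours = [h for _, h in records]
print("Data points:", len(records))                  # how many do you have?
print("Range:", min(hours), "-", max(hours))         # lowest and highest
print("Average:", sum(hours) / len(hours))           # average value
print("Per group:", Counter(g for g, _ in records))  # evenly distributed?
```

Even this quick summary reveals patterns worth explaining – here, for instance, that Litigation contributes more records than the other practice groups.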
Types of questions
Now let us take a look at what types of questions you can answer with data analysis. There are many different types. Most common are regression, classification, clustering, and anomaly detection.
Regression answers the question of how strongly related two different concepts are. For example, how strongly related are seniority and hourly rate? More specifically: how much does your bill change when you involve more Senior Partners? Or it could be something more obscure: how strongly related are your recovery rates and the weather?
You might want to know what will give you more revenue over time: focusing on existing practice groups, or focusing on new clients. This is a regression problem. You want to fit a function of revenue over time for each case, and then project where each one leaves you in five or ten years.
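The seniority-and-rate example can be sketched as a simple linear regression. This is a toy illustration with invented numbers, fitting hourly rate against years of seniority using ordinary least squares:

```python
# A toy sketch of linear regression: how strongly are seniority and hourly
# rate related? The data points are invented for illustration.
years = [1, 3, 5, 8, 12]
rate = [150, 210, 260, 350, 470]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(rate) / n

# slope = covariance(years, rate) / variance(years)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rate)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

print(f"rate = {intercept:.1f} + {slope:.1f} * years")
print("Predicted rate at 10 years:", round(intercept + slope * 10, 1))
```

A positive slope confirms the relationship, and the fitted line lets you predict the rate for a seniority level you have not observed.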
Classification answers the question of what category something belongs to. Given that you have ten different practice groups, if you see a time card which has “Share Purchase Agreement” in the narrative, what practice group is this likely to fall under?
You could take this further and categorise potential clients into what service they need the most. Then you can target one specific category of client for a specific service that you provide.
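A toy version of the time-card example might look like this. The keyword lists are invented; a real classifier would learn such associations from labelled time cards rather than having them hard-coded:

```python
# A hypothetical sketch of classification: assign a time-card narrative to a
# practice group based on keyword overlap. The keyword sets are invented;
# a real model would learn them from labelled data.
KEYWORDS = {
    "Corporate":  {"share", "purchase", "agreement", "merger"},
    "Litigation": {"hearing", "claim", "court", "witness"},
    "Employment": {"dismissal", "contract", "employee"},
}

def classify(narrative: str) -> str:
    words = set(narrative.lower().split())
    # pick the practice group whose keywords overlap most with the narrative
    return max(KEYWORDS, key=lambda group: len(KEYWORDS[group] & words))

print(classify("Drafted Share Purchase Agreement for client"))  # → Corporate
```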
Clustering answers the question of how you can group your data. Say you want to analyse lots of emails. You may want to cluster emails based on the language used, to see how it affects the outcome of a matter. How can we create five groups, each containing emails with a similar language style?
Clustering seems similar to classification. The difference is that with classification, you already know what your categories are, and you only want to categorise new data points correctly. With clustering, you do not know what your categories are, or how your data should be grouped at all, and you want to find the best way to do that.
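Here is a toy sketch of one clustering algorithm, k-means, in one dimension. Instead of emails (which would first need their language turned into numeric features), it groups matters by hours billed, with invented numbers and k chosen as 2:

```python
# A toy sketch of k-means clustering in one dimension: group matters by
# hours billed, without knowing the groups in advance. Data is invented.
def kmeans(points, k, iterations=10):
    centres = points[:k]  # naive initialisation: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to the nearest cluster centre
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # move each centre to the mean of its cluster (keep it if empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

hours = [2, 3, 4, 40, 45, 50]
print(kmeans(hours, k=2))  # → [[2, 3, 4], [40, 45, 50]]
```

Note that we never told the algorithm which matters are “small” and which are “large” – it discovered those groups itself, which is exactly the difference from classification.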
Anomaly detection, as the name suggests, looks to identify what is ‘weird’ in your data. You can assume that most of your data is “normal”. Which data points seem abnormal compared to the rest? You might for example want to detect which bills are likely to be turned down by e-billing software.
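A simple sketch of this idea is to flag any bill whose amount lies far from the average, say more than two standard deviations. The bill amounts and the threshold here are invented for illustration:

```python
# A minimal sketch of anomaly detection: flag bills that lie more than two
# standard deviations from the mean. Amounts and threshold are invented.
import statistics

bills = [1200, 1350, 1280, 1310, 9800, 1250]

mean = statistics.mean(bills)
sd = statistics.stdev(bills)

anomalies = [b for b in bills if abs(b - mean) > 2 * sd]
print(anomalies)  # → [9800]
```

Real e-billing checks would look at more than the amount alone, but the principle is the same: model what “normal” looks like, then flag what deviates from it.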
Often you will find that your data is not rich enough to answer your questions. You might have to tweak the data a little bit by feature engineering to make sure that your analysis focuses on the important parts of your data. Feature engineering consists of two parts: feature selection and feature construction.
Feature selection is something you want to do to reduce the number of variables. Too many variables can cause your analysis to take too much time, or give you overly specific information. Note that the cost of analysis compounds: each additional variable multiplies the time it takes to analyse a data set, rather than simply adding another chunk of time. In practice, this could mean that if you have information on practice groups, but you are only interested in activities, you remove practice group as a variable.
Feature construction is the opposite of feature selection, and is something you want to do when your variables are not specific enough. For example, maybe you are not actually interested in someone’s age, but rather whether someone is an adult or not. Or you want to take the sum, difference, or product of different values. For example, you might not be interested in the hours each individual person logged under due diligence, but you do want to know the total of all hours logged under due diligence. These little changes can make your analysis much easier and more interesting.
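The due diligence example can be sketched in a few lines: collapsing per-person hours into a single constructed feature, the total per activity. The time-card records are invented:

```python
# A small sketch of feature construction: collapse per-person hours into one
# total per activity. The time-card records below are hypothetical.
from collections import defaultdict

timecards = [
    {"person": "A", "activity": "due diligence", "hours": 4.0},
    {"person": "B", "activity": "due diligence", "hours": 6.5},
    {"person": "A", "activity": "drafting",      "hours": 3.0},
]

total_hours = defaultdict(float)
for card in timecards:
    total_hours[card["activity"]] += card["hours"]  # the constructed feature

print(dict(total_hours))  # → {'due diligence': 10.5, 'drafting': 3.0}
```

Dropping the `person` column here is also a tiny act of feature selection: the same line of code does both at once.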
The cycle of data analysis
After the analysis you will probably have new questions, realise that you do not have the right data, or find that your data contains some mistakes. In that case, you should go back to the earlier steps of data collection or data preparation.
This is okay. You want to go back and forth between collecting, preparing, and analysing. With each pass you gain additional insights, and in the end you will have learned a lot.
Now that you have analysed your data, the next step is to actually do something with it. In our final section we look at what you can do with your analysed data.