October 30


How to Create an Efficient Data Cleaning Process That Can Handle the Deluge

In Lowell, Massachusetts, Dr Moses Greeley Parker invented the telephone number in 1879.  It was still early days for the fledgling telephone network. Operators made connections between people lucky enough to have access to a telephone using directories that were simply lists of subscribers' names.  Maintaining an accurate list was crucial to the success of the network.  Making a long-distance call required two operators at different exchanges to look up the callers' names and physically make the connections.  But there was a new threat to the stability of this still-fragile network. 

A measles epidemic had broken out, and Dr Parker feared that the telephone network would grind to a halt if Lowell's four switchboard operators became sick and could not work. Identifying subscribers by number instead of by name made the training of replacement operators faster and more reliable. The idea became wildly popular and was adopted by telephone exchanges around the world. The invention of the telephone number simplified the operation of telephone exchanges and allowed them to grow more rapidly as the number of subscribers swelled to a deluge.

To handle the deluge of data available for machine learning projects, we need to develop an efficient cleaning process.  Just as the invention of the telephone number helped organise telephone directories, data scientists must develop a process to clean enormous volumes of data efficiently and without mess. The phone number provided a clean way to uniquely identify subscribers and simplified the process of making phone calls. Similarly, data cleansing is the first step in building efficient machine learning models that can handle the data deluge.

This is the first article in the machine learning workflow series.  We will start with data cleansing, breaking it into the following three parts:

  • Part 1: Labelling your data
  • Part 2: Filtering good data from bad
  • Part 3: Building cleaned datasets

Part 1: Labelling Your Data

Before data cleansing let's agree on what data is

We hear the word so much that it is easy to assume we all know what we are talking about, but in machine learning this assumption is a big mistake.  Your data may be classification-type data, which can be purely categorical, such as colours, or ordinal, such as rankings from 1 to 10.  You might have time series data, which bring their own considerations.  For this article we will assume we have continuous numerical data.  The other thing to consider is what you are trying to do with all of this data. Sometimes we are only concerned with prediction, but we may also be interested in optimisation.  For optimisation problems it is important to understand the difference between independent and dependent variables.

The thing we want to optimise we call the dependent variable 

The dependent variable is often a key performance indicator, or KPI.  All of the variables that affect the value of this dependent variable can be thought of as independent.  Even if the data are not strictly independent, it helps to have labels that indicate the relationships between the different variables in our datasets.  For example, in optimisation some data are useful to know because they influence the dependent variable.

The data we use to predict our KPI are the independent variables

Many of the input variables that are causally related to our dependent variable may not be easy, or even possible, to control.  We might call this type of data upstream forecasting variables. In contrast, some of our independent variables we can control directly.  The essential feature of both types of independent variable is that they are not influenced by our dependent variable.  In a sequential manufacturing process, variables measured upstream of our dependent variable can usually be considered independent of the variable to be optimised.  In biological systems with feedback loops this is not always strictly correct.

Control variables are important for optimisation because, as the name suggests, they can be controlled.  For example, if we are baking bread, a control variable would be the temperature of the oven.  The ambient air pressure, on the other hand, would be called an upstream forecasting variable.  Once you have spent some time labelling your data, it is time to decide how you might clean it.
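One lightweight way to record these labels is a simple mapping from variable name to role.  Here is a minimal sketch using hypothetical columns from the bread-baking example above (the names `loaf_weight`, `oven_temp` and `air_pressure` are invented for illustration):

```python
# Hypothetical labels for the bread-baking example: one dependent
# variable (the KPI), one control variable, one upstream variable.
variable_labels = {
    "loaf_weight":  "dependent",             # the KPI we want to optimise
    "oven_temp":    "control",               # we can set this directly
    "air_pressure": "upstream_forecasting",  # measurable, not controllable
}

def columns_by_role(labels, role):
    """Return the variable names tagged with a given role."""
    return [name for name, tag in labels.items() if tag == role]
```

Keeping the labels in one explicit structure means later steps, such as model building, can look up the control variables programmatically instead of relying on memory.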

Part 2: Filtering Good Data from Bad

Data cleansing is the process of removing bad data

This requires some understanding of the source of the data itself.  How you define 'bad data' depends on how you want to use it.  If you are using inputs from an instrument in a process, it is likely that there will be gaps in the signal.  Missing values for some of your variables (or NULLs, as we will refer to them) are one example of bad data.   What do you do when many other values were measured at the same timestamp?  Do we only accept data that are complete for every variable at each timestamp?  Nulls are very common in process datasets, and accepting nothing less than perfect data at each timestamp would result in the loss of a very large amount of data.  Such waste is unnecessary, and a simple solution to this common problem is to take an average of each variable over time.  Averaging can address the problem of nulls but can introduce a new problem. 
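The averaging idea can be sketched with pandas: resampling to a coarser time interval computes a mean over each window, and nulls within a window are simply ignored.  The sensor name and timestamps below are invented for illustration:

```python
import pandas as pd

# A hypothetical sensor signal sampled every 10 minutes, with gaps.
readings = pd.DataFrame(
    {"flow_rate": [5.0, None, 6.0, None, 4.0, 5.0]},
    index=pd.date_range("2021-01-01 00:00", periods=6, freq="10min"),
)

# Resample to 30-minute means: NaNs are skipped within each window,
# so any window with at least one real reading still gets a value.
averaged = readings.resample("30min").mean()
```

Instead of throwing away every timestamp with a null, each 30-minute window keeps the average of whatever readings it does have.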

Even trickier is when the data don't make any sense at all

For example, imagine we are measuring weights on a scale and vibration is causing the reading to bounce around, giving negative values at some time points.  In this case keeping the negative weights in your dataset does not make any sense.  If we average data that fluctuate between positive and negative, the negatives may cancel out the positives.  Even worse, a machine learning model will treat the negative values as real data and part of a possible solution space.  If negatives don't represent reality, we generally need to remove them before applying averages.
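A tiny worked example shows the cancellation problem.  The scale readings below are invented: a naive mean over the vibrating signal lands near zero, while averaging only the positive readings recovers something close to the true weight:

```python
# Hypothetical vibrating-scale readings: the true weight is about 2.1,
# but vibration produces spurious negative values.
readings = [2.1, -1.9, 2.0, -2.2, 2.2]

# Naive mean: negatives cancel positives, giving a meaningless value.
naive_mean = sum(readings) / len(readings)

# Drop the physically impossible negatives first, then average.
positive = [w for w in readings if w > 0]
cleaned_mean = sum(positive) / len(positive)
```

The naive mean is dragged down toward zero by values that never represented a real weight, while the cleaned mean reflects the actual measurement.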

Another common example of bad data in a manufacturing setting is flatlining data that doesn't seem to change over time.  Some values, such as set-points, may be constant for long periods, but this is not typical for measured process variables.  Flatlining can be caused by problems with your instruments or with how you are measuring the data.  So how do you know you have bad data?  This is where it helps to get a better feel for the numbers.
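One simple way to flag a flatlining signal is to look for runs of identical consecutive readings.  This is a sketch, not a definitive test; the window length of five is an arbitrary threshold chosen for illustration:

```python
def is_flatlining(values, window=5):
    """Return True if any `window` consecutive values are identical.

    A crude flatline check: real measured process variables normally
    show at least small fluctuations between readings.
    """
    run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        if run >= window:
            return True
    return False
```

Remember the caveat from above: a set-point is supposed to flatline, so a check like this should only be applied to measured variables.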

We need a way to make sense of a screen full of numbers

In the movie The Matrix you might recall seeing screens filled with symbols representing the code behind the visual images.  To convert this sea of data into something that makes sense, we need to generate some simple statistics and data plots.  The statistics can simply be the mean, maximum, minimum, number of missing data points, and the upper and lower quartiles.  How to calculate these simple statistics is covered very well in general statistics blogs and textbooks.  The main idea at this stage is to develop an intuitive sense for the data.  If your minimum values are negative, that is generally a sign that you need to remove some negatives.   Another way to develop a feel for your data is to simply plot it.
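In pandas, most of these statistics come from a single call.  A minimal sketch, with invented column names (`weight` and `temp`):

```python
import pandas as pd

# A tiny hypothetical process dataset with nulls and a negative value.
df = pd.DataFrame({
    "weight": [2.1, -0.3, 2.0, None, 2.2],
    "temp":   [180, 181, None, 179, 180],
})

# count, mean, std, min, quartiles and max in one table,
# plus a column for the number of missing points per variable.
summary = df.describe().T
summary["n_missing"] = df.isna().sum()

# A negative minimum is a quick signal that filtering is needed.
suspect = summary[summary["min"] < 0].index.tolist()
```

Scanning the `min`, `max` and `n_missing` columns of a table like this is often enough to spot which variables need attention before any modelling starts.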

Drawing graphs is simple and can save you trouble down the road

If you have time series data, plotting them over time can be very instructive.  Another common step is to plot the thing you want to predict or model, often called the dependent variable, against the independent variables.  Visualising your data is the surest way to identify where you might have data problems.  Missing and flatlining data are just two types of problem data.  It is here that your subject matter expertise will come in most handy.  The main thing you need to be concerned with is whether the data look real.  If you have no reason to believe the data are not real, it is best to keep them.  An exception is data that you know don't represent the normal process you want to model.  For example, when a continuous process is shut down, the data collected during the shutdown period might be real but are best removed because they don't represent the normal process that you want to model.
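The two plots suggested above take only a few lines with matplotlib.  This sketch uses invented numbers from the bread-baking example; the column names and values are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical readings: the KPI (weight) and a control variable.
timestamps = list(range(10))
oven_temp  = [180, 181, 180, 179, 182, 180, 180, 181, 179, 180]
weight     = [2.0, 2.1, 2.0, 1.9, 2.2, 2.0, 2.0, 2.1, 1.9, 2.0]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Plot 1: the dependent variable over time (spot flatlines, gaps).
ax1.plot(timestamps, weight)
ax1.set(title="KPI over time", xlabel="time", ylabel="weight")

# Plot 2: the dependent variable against an independent variable.
ax2.scatter(oven_temp, weight)
ax2.set(title="KPI vs control variable", xlabel="oven_temp", ylabel="weight")

fig.savefig("first_look.png")
```

Even these two quick charts will usually reveal negatives, flatlines, and shutdown periods faster than scanning a table of numbers.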

Part 3: Building Clean Datasets

We generally clean data with the use of filters

In the example above, a simple filter might exclude negative values of weight as well as data where successive values are not changing over time.  But be careful not to apply blanket rules to all of your data.  Each variable needs to be considered separately.  For example, as mentioned earlier, some data are expected to flatline on a regular basis because they are set-points and will remain at one value until someone changes them.  Data cleansing can be done in many different ways.  A combination of SQL and code in a language such as R or Python will generally do the job quite nicely.
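A reusable filter for the example above might look like the sketch below.  The column name and the one-step repeat rule are invented for illustration; a real flatline filter would likely use a longer window:

```python
import pandas as pd

def clean(df):
    """Drop negative weights, then drop readings that exactly repeat
    the previous surviving value (a crude one-step flatline filter)."""
    out = df[df["weight"] > 0]
    return out[out["weight"].diff() != 0]
```

Because the rules live in one named function rather than scattered spreadsheet formulas, the same cleaning can be re-applied every time new raw data arrives.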

You create alternative datasets that have different levels of data cleaning

Starting with one raw dataset, you might create two or three additional cleaned datasets.  Each dataset will have successively fewer data points as you apply more stringent filters.  There is a tradeoff with data cleansing.  Machine learning involves some data-hungry algorithms and procedures, and if your data cleansing is too aggressive you might not have enough data to train your models.  Rather than creating just one cleaned dataset, create several.  In later stages of our workflow we will train our candidate machine learning models on each cleaned dataset and select the best combination of model and dataset for further work. 
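The idea of successively stricter filters can be sketched in a few lines.  The thresholds and the (weight, temp) records below are invented examples; each dataset keeps fewer rows than the last:

```python
# Hypothetical raw records as (weight, temp) pairs, including a
# negative weight, a null temp and an implausible outlier.
raw = [(-0.5, 1.0), (2.0, 1.0), (2.1, None), (2.2, 3.0), (150.0, 4.0)]

datasets = {
    # drop negatives only
    "light":  [r for r in raw if r[0] > 0],
    # also drop rows with nulls
    "medium": [r for r in raw if r[0] > 0 and r[1] is not None],
    # also drop implausible outliers (threshold is arbitrary here)
    "strict": [r for r in raw if r[0] > 0 and r[1] is not None
               and r[0] < 100],
}

sizes = {name: len(d) for name, d in datasets.items()}
```

Keeping all three versions lets the later model-selection stage decide whether the extra cleanliness is worth the lost training data.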

The first step is to develop a set of rules and filters to clean a raw dataset

A good filter can be applied repeatedly to raw data for model training, testing and optimisation.  This is something that can quickly become a nightmare in Excel, where the rules end up hidden in countless fields across a spreadsheet.  Data filtration is easily handled at the point of data ingestion with simple SQL or other programmatic filters.
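Filtering at ingestion can be as simple as a WHERE clause.  The sketch below uses Python's built-in sqlite3 module with an invented table so it is self-contained; in practice the same query would run against your process database:

```python
import sqlite3

# A throwaway in-memory database standing in for a real process store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (ts INTEGER, weight REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, 2.0), (2, -1.5), (3, 2.2), (4, None)])

# The cleaning rules live in one explicit query: no nulls, no negatives.
clean_rows = con.execute(
    "SELECT ts, weight FROM readings "
    "WHERE weight IS NOT NULL AND weight > 0"
).fetchall()
```

Unlike spreadsheet formulas, the query states the rules in one visible place and applies them identically on every run.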


In summary:

  • Understanding the relationships and associations between your data is important.
  • Identify the dimension you want to understand, predict or optimise.  This is our dependent variable, and its value might depend on the values and relationships of many other variables.
  • Our independent variables are all the other data considered to be either causally related to, or at least correlated with, our dependent variable.
  • Independent variables that cannot be controlled are called upstream forecasting variables.
  • Independent variables that can be controlled are our process control variables.
  • Averaging variables over time can help address nulls.
  • Remove negatives prior to averaging.
  • Plotting data can be useful for identifying data that do not represent the normal process.
  • An algorithmic filter applied during data ingestion is simple and explicit.

Next Steps

Once we have created our cleaned datasets, the next step in the machine learning workflow is feature selection.  But in the next article we will make a detour into regression.


data cleansing, machine learning


Subscribe here to receive updates about machine learning and new posts in your email.  You can also email me directly to ask for help and advice.