When starting out in data science, it is a great idea to practice with well-characterised data sets. The Ames housing data set is one such example and can be used for building machine learning models with packages such as tidymodels in R. The data set contains both categorical and numeric data and has a large number of features, which can quickly make exploratory data analysis (EDA) rather time consuming.
Instead of spending all morning on EDA, you can use the correlationfunnel package to uncover the most important features much faster.
This article introduces a simple three-step process, using the correlationfunnel package in R, for plotting all of your categorical and numerical features to see how well they correlate with the variable you want to predict. These steps are:
- Prepare the data as binary features
- Prepare table of correlation coefficients
- Plot the data
Step 1 - Prepare Data as Binary Features Using binarize()
A binary feature is just what it sounds like. It's either a 0 or a 1. This is easiest to understand with categorical data. For example, a house that is two stories high may be assigned a value of 1 and one that is not may be assigned a value of 0. It's either one or zero. That's binary information. But what about continuous numerical features such as sales price, which can take many values along a continuous scale? The answer is to use binning. A continuous numerical variable can, for example, be binned into 5 different ranges. Each house is then assigned a value of 1 for the range that matches its sales price and 0 for the other 4 bins.
Another problem can arise with categorical variables when there are a large number of low-frequency categories. The binarize() function in the correlationfunnel package lets you set a threshold for low-frequency categories to prevent the number of binary features getting out of hand. The solution involves lumping the less frequent categories into an "other" category.
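To make the binning idea concrete, here is a small base R sketch. It is only an illustration of the concept, not how binarize() works internally, and the ten sale prices are made-up values rather than rows from the Ames data.

```r
# Hypothetical sale prices for ten houses (illustrative values only)
sale_price <- c(105000, 129500, 160000, 189000, 215000,
                250000, 310000, 98000, 142000, 475000)

# Cut points at the 0%, 20%, ..., 100% quantiles give 5 bins of similar size
breaks <- quantile(sale_price, probs = seq(0, 1, by = 0.2))
bins <- cut(sale_price, breaks = breaks, include.lowest = TRUE)

# One 0/1 column per bin: every house gets a 1 in exactly one column
binary <- model.matrix(~ bins - 1)

head(binary)
```

Each row of `binary` contains a single 1, which marks the price range that house falls into.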
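The lumping idea can be sketched in base R as well. Again, this is just an illustration under my own assumptions: the neighbourhood labels and the 10% threshold are hypothetical, and binarize() handles this step for you through its own threshold argument.

```r
# Hypothetical neighbourhood labels for 20 houses (illustrative only)
neighbourhood <- c(rep("North", 9), rep("South", 8), "East", "West", "Central")

# Share of rows that falls into each category
share <- prop.table(table(neighbourhood))

# Categories below the (assumed) 10% threshold are lumped into "OTHER"
threshold <- 0.10
rare <- names(share)[share < threshold]
lumped <- ifelse(neighbourhood %in% rare, "OTHER", neighbourhood)

table(lumped)
```

Three rare categories collapse into one, so one-hot encoding would produce three binary columns instead of five.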
The code below shows you how to load the Ames data set into your RStudio session and binarise the data yourself.
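A minimal sketch of this step is shown here. It assumes the Ames data is loaded from the modeldata package and that the correlationfunnel, modeldata and dplyr packages are installed. The choice of 5 bins follows the example above, and 0.01 is, as far as I know, the package default for the low-frequency threshold, so treat both as starting points rather than recommendations.

```r
library(correlationfunnel)
library(dplyr)

# The Ames housing data ships with the modeldata package
data(ames, package = "modeldata")

# Convert every feature to 0/1 columns:
# - numeric features are split into n_bins ranges
# - categories covering less than thresh_infreq of the rows are lumped together
ames_binarized <- ames %>%
  binarize(n_bins = 5, thresh_infreq = 0.01, one_hot = TRUE)

# Inspect the new column names, including the bins created for Sale_Price
glimpse(ames_binarized)
```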
Step 2 - Correlate Your Binarised Data Using correlate()
The next step is really simple because the bulk of the work was already done in Step 1. Don't you feel glad about that! The main thing to note about this step is that if your target variable is numerical, you have to supply the variable name created by the binning in Step 1. All you need to do is inspect the output from the glimpse() command above and copy and paste that name into the correlate() function from the correlationfunnel package.
Now pipe your binarised data into the correlate function to create a summary of the correlation coefficients.
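Here is a sketch of the correlation step, assuming the Ames data from modeldata and 5 bins. Instead of copying and pasting the column name, it looks the target up with grep(). That shortcut is my own and assumes binarize() names its columns as feature__bin with the top price bin ending in _Inf. If glimpse() shows something different on your machine, paste the name in by hand instead.

```r
library(correlationfunnel)
library(dplyr)

data(ames, package = "modeldata")
ames_binarized <- ames %>% binarize(n_bins = 5, thresh_infreq = 0.01)

# Find the binary column for the highest Sale_Price bin
# (assumes the "feature__lower_Inf" naming pattern)
target_col <- grep("^Sale_Price__.*_Inf$", names(ames_binarized), value = TRUE)

# correlate() expects an unquoted column name, so the string is
# converted to a symbol and injected with !!
ames_corr <- ames_binarized %>%
  correlate(target = !!rlang::sym(target_col))

ames_corr
```

The result is a table with one row per binary feature and its correlation with the chosen target bin.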
Step 3 - Plotting With plot_correlation_funnel()
Now that you have a table of correlation coefficients, it is just a matter of plotting that data. The plot_correlation_funnel() function from the correlationfunnel package makes this simple and orders the features from highest to lowest correlation to create the desired funnel effect. Because of the large number of features in the Ames dataset, I have filtered by the absolute value of the correlation coefficients to keep the plot manageable.
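The plotting step might look like the sketch below. It repeats the earlier steps so it runs on its own. The 0.2 filter cut-off, the plot title and the mirrored reference line at -0.3 are my own placeholder choices, and the grep() lookup assumes the feature__bin column naming, so adjust these to suit your data.

```r
library(correlationfunnel)
library(dplyr)
library(ggplot2)

data(ames, package = "modeldata")
ames_binarized <- ames %>% binarize(n_bins = 5, thresh_infreq = 0.01)

# Assumed naming pattern for the top Sale_Price bin
target_col <- grep("^Sale_Price__.*_Inf$", names(ames_binarized), value = TRUE)

ames_corr <- ames_binarized %>%
  correlate(target = !!rlang::sym(target_col))

# Keep only the stronger correlations (0.2 is a placeholder cut-off),
# then draw a static ggplot and add reference lines at +/- 0.3
p <- ames_corr %>%
  filter(abs(correlation) >= 0.2) %>%
  plot_correlation_funnel(interactive = FALSE) +
  geom_vline(xintercept = c(-0.3, 0.3), linetype = "dashed") +
  labs(title = "Ames housing: correlation funnel for the top sale price bin")

p
```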
Those 3 steps are all you need to do to create your first correlation funnel plot.
I've added vertical lines at 0.3 to highlight the strongest correlations. Note that you should expect to see a correlation coefficient of 1.0 between the target variable and itself. You can see that the Sale_Price bin of '230278.4_Inf' has a correlation coefficient of 1. This is the binary sales price bin created in Step 1 for all houses with sales prices of roughly $230k and above.
You are well on your way to uncovering the most important features to select for your machine learning model without breaking a sweat. Let's summarise the process, shall we?
- First you need to install and load the correlationfunnel package from Matt Dancho along with your data.
- The modeldata package has many different datasets to experiment with.
- Prepare your data by getting rid of any non-relational data such as timestamps and id columns.
- Binarise your dataset using binarize() and remember to play around with the number of bins and the threshold for lumping categorical variables together.
- Correlate your binarised dataset using correlate().
- Plot your correlation summary using plot_correlation_funnel() and include standard ggplot labels and titles.
Once you have selected the features to model, the next step is to build your model. My favourite machine learning package is tidymodels for R. In the next article we will step through how businesses can make better decisions when informed by insights and predictions from machine learning models.