Machine Learning is an in-demand skill that can seem intimidating, but understanding the process to follow isn’t that difficult. This Machine Learning overview will help aspiring data scientists understand the workflow, as well as several common algorithms. From there it makes sense to get hands-on experience with popular tools.
- The Machine Learning Process
- Problem Definition
- Data Preparation
- Choosing a Machine Learning Algorithm
- Model Scoring
The Machine Learning Process
Machine Learning models are developed through the iterative process outlined below. Since it might take several passes before the data pipeline works the way it should and the best algorithm is selected, a sub-optimal model on the first pass is nothing unusual.
Once the problem has been adequately defined and the desired outcome understood, the data is prepared and an algorithm is chosen. The data is divided into training and test sets (a holdout split, the simplest form of validation). The algorithm is trained with the training data and then the resulting model is used on the test data set. The results from the test set are then scored to determine how suitable the model would be for the purpose. This process repeats, sometimes with changes to the data and sometimes with a different algorithm, until the results from scoring show that the model is good enough to be put into production.
Problem Definition
In order to solve a problem it is first necessary to define it as clearly as possible. What is the desired output of the model? If that outcome can be defined as a data point, that data point becomes the Target (which is simply what we want the model to predict).
It’s also necessary to identify the data that will be fed into the algorithm to predict the Target. These inputs are called Features. Each candidate Feature should be checked for relevance to the Target through a process called Feature Selection.
Data Preparation
Preparing data for machine learning is not an easy task; in fact, some estimates say it consumes 80% of the time spent on a machine learning project. Unfortunately, most tutorials (and even college courses) on machine learning gloss over this area without giving it the attention it deserves. The following are some of the most common tasks in data preparation.
Correcting Data Types
Database columns have a header and a type. Types are generally either numbers (Decimal, Integer, Floating, Double, etc.), Text (String, Varchar, etc.), or booleans (True or False, Yes or No). Booleans frequently show up as either text types or numbers, depending on the database.
It is not unusual to work with a dataset that has incorrect data types for some columns. ZIP codes, for example, should be text (string, char, varchar, etc.) instead of integers. This is because they are identifiers of a place, not measures of quantity.
Another common mistake is confusing Decimal, Float, and Double-precision types. While they all involve a decimal point, the Decimal type stores exact decimal values at the precision you specify and is the only one of the three that should be used for currency. Floats and Doubles are binary approximations built on the same idea, but whereas floats are typically 32-bit, doubles are 64-bit.
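The difference between exact decimals and binary floating point is easy to see in Python. The following is a minimal sketch using the standard library's `decimal` module; the amounts and the half-up rounding rule are illustrative assumptions, not values from any real dataset.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum picks up a tiny error.
float_total = 0.1 + 0.2

# Decimals built from strings keep the exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Round a price to two places (cents) with an explicit rounding rule.
price = Decimal("19.995").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
print(price)                             # 20.00
```

This is why currency columns are safer as Decimal: the float comparison fails even though the arithmetic looks trivial.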
Converting Text to Numbers
Most Machine Learning algorithms do not accept text as input. Consequently, the text has to be converted to a number in order to be used. For booleans like “yes” or “no” the process is simple; we can convert to 0 or 1. Categorical data like the various regions of a country can be represented by phone or postal codes, or if that isn’t sufficient they can be labeled with sequential numbers, keeping in mind that some algorithms will read those numbers as an ordering.
Python (and, by extension, PySpark) has a number of functions to convert text into numbers.
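As a rough illustration of the two conversions described above, here is a sketch in plain Python. The helper names `encode_boolean` and `label_encode` are hypothetical and written for this example; they are not functions from Python or PySpark.

```python
def encode_boolean(value):
    """Map yes/no style text to 1/0; return None for anything else."""
    mapping = {"yes": 1, "true": 1, "no": 0, "false": 0}
    return mapping.get(value.strip().lower())


def label_encode(values):
    """Assign sequential integers to categories in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


answers = ["Yes", "no", " YES "]
encoded_answers = [encode_boolean(a) for a in answers]  # [1, 0, 1]

regions = ["North", "South", "East", "North"]
encoded_regions, region_codes = label_encode(regions)   # [0, 1, 2, 0]
```

Keeping the `region_codes` mapping around matters in practice, because the same codes have to be applied to any new data the model sees later.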
There are several options for filling in missing numerical data. Based on the nature of the data and the values present in the same column, we might choose:
- Mean – what most people call the “average”
- Median – the middle value when the column’s values are sorted
- Mode – the value that occurs most often in the column
- Zero – if that makes sense in the context of the data
- Leave the null if there aren’t too many nulls in the column and the algorithm can work with them
For text data it is usually best to convert to numbers first and then choose one of the above methods for filling in missing fields.
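The fill options listed above can be sketched with Python's built-in `statistics` module. The function name `fill_missing`, the use of `None` to mark a missing value, and the sample `ages` column are all assumptions made for this example.

```python
from statistics import mean, median, mode


def fill_missing(column, strategy="mean"):
    """Replace None entries in a numeric column using the chosen strategy."""
    known = [v for v in column if v is not None]
    strategies = {
        "mean": lambda: mean(known),
        "median": lambda: median(known),
        "mode": lambda: mode(known),
        "zero": lambda: 0,
    }
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    fill = strategies[strategy]()
    return [fill if v is None else v for v in column]


ages = [22, None, 35, 35, None, 58]
filled_mean = fill_missing(ages, "mean")      # nulls become 37.5
filled_median = fill_missing(ages, "median")  # nulls become 35.0
filled_mode = fill_missing(ages, "mode")      # nulls become 35
```

Note how the mean (37.5) is pulled upward by the single high value, while the median and mode are not. That difference is often the deciding factor when choosing a fill strategy.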
Different Features in our dataset will frequently have radically different minimum and maximum values. One column may have a range from 0-100 while another’s range may be 3.00 to 1,498,667.22. If left alone, the feature with the larger range can carry more weight in the calculation and skew the results. Normalization will convert both sets of values to a range of 0-1 (or a range you specify), making them equally influential and preventing one feature from dominating.
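One common way to do this is min-max scaling, where each value is rescaled by its position between the column's minimum and maximum. The sketch below assumes that technique; the function name `min_max_scale` and the sample columns are made up for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


scores = [0, 25, 100]
revenue = [3.00, 500000.00, 1498667.22]

scaled_scores = min_max_scale(scores)    # [0.0, 0.25, 1.0]
scaled_revenue = min_max_scale(revenue)  # first value 0.0, last value 1.0
```

After scaling, both columns run from 0 to 1, so neither dominates simply because of its units.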
Not all data is useful as a Feature for an algorithm to predict a Target. To determine which candidate Features make the cut, we need to apply criteria for eliminating those with major problems. An excessive number of nulls, a high correlation to other Features, and a low amount of variability in the column’s values are all reasons to disregard a potential Feature.
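Two of these criteria, excessive nulls and low variability, are simple to check in code. The following is a sketch only: the function name `screen_feature` and both thresholds are hypothetical and would need tuning for real data. The check for correlation between Features is left out to keep the example short.

```python
from statistics import pvariance


def screen_feature(column, max_null_fraction=0.4, min_variance=1e-6):
    """Return a reason to drop the column, or None if it passes both checks."""
    null_fraction = sum(v is None for v in column) / len(column)
    if null_fraction > max_null_fraction:
        return "too many nulls"
    known = [v for v in column if v is not None]
    if len(known) < 2 or pvariance(known) < min_variance:
        return "too little variability"
    return None


mostly_missing = [1, None, None, None, 5]
constant = [7, 7, 7, 7]
usable = [1, 2, 3, 4]

results = {
    "mostly_missing": screen_feature(mostly_missing),  # "too many nulls"
    "constant": screen_feature(constant),              # "too little variability"
    "usable": screen_feature(usable),                  # None
}
```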
Choosing a Machine Learning Algorithm
There are many algorithms to choose from, but they fall into only a handful of major categories. Let’s examine those categories.
Correlation has a number of uses in Machine Learning, and is most often used in the process of feature selection. However, Correlation also has a place as an ML algorithm itself and can be used to find contributing factors to Targets. For example, if sales margins are not what we want them to be we can label high-margin transactions as 0 and low-margin as 1 in a new Target column. We can then use Correlation to determine which Features contribute most to the Target, which is low-margin deals. The Features with the highest correlation to the Target should be investigated as potential causes.
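To make the margin example concrete, here is a sketch that ranks Features by the strength of their Pearson correlation with a 0/1 Target. The feature names and all of the numbers are invented for illustration, and Pearson correlation is one assumed choice among several correlation measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Target: 1 = low-margin deal, 0 = high-margin deal (hypothetical data).
low_margin = [0, 0, 1, 1, 0, 1]
features = {
    "discount_pct": [5, 0, 30, 25, 10, 35],
    "order_size": [10, 12, 11, 9, 10, 12],
}

# Rank Features by absolute correlation with the Target, strongest first.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], low_margin)),
    reverse=True,
)
```

In this made-up data `discount_pct` ranks first, which would make discounting the first thing to investigate. A high correlation is a lead to follow up on, not proof of cause.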
Linear Regression is one of the most common algorithms for predicting a continuous number. This is useful for sales forecasting, production line prediction, and other uses that require numbers rather than categories.
To forecast sales, for example, we could use the past 18 months of sales data to predict next month’s sales. The results of such a simple analysis are often usable with no other Features, but if we want to add in other data (CRM data, customer sentiment scores, etc.) to improve the accuracy of the model we can do so. This is called Multiple Linear Regression (it is often loosely called multivariate regression).
Linear Regression is based on the algebraic idea of finding the slope of a line, which is taught in many middle-school math curricula. This makes it one of the simplest algorithms to learn.
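The slope-of-a-line idea can be shown in a few lines. This sketch fits a line by ordinary least squares, the usual fitting method for Linear Regression, to six months of invented sales figures and then extends the line one month forward. The function name `fit_line` and the data are assumptions for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]  # hypothetical, perfectly linear

slope, intercept = fit_line(months, sales)  # slope 10, intercept 90
forecast_month_7 = slope * 7 + intercept    # 160
```

Real sales data will not fall exactly on a line, so the fitted line will pass through the middle of the points rather than through each one.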
A friend of yours has a fever and red spots on her face and arms. What is the likelihood that she has measles?
You have a king and an ace on the table in front of you while playing blackjack. The dealer has a ten and a two. What are the chances that the next card will be a face card?
Both of these are examples of probability. Doctors use probability when diagnosing diseases, usually without even realizing it. Top card players study probability to improve their skills. In both cases, the probability of an unknown is based upon what is known.
Logistic Regression and Naive Bayes are two of the most widely used algorithms for probability, and both use known Features to predict an unknown Target. The output of the model is a number between 0 and 1, with 0 representing an impossibility and 1 representing a certainty. A score of 0.5 is the same as a coin flip.
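For Logistic Regression, the 0-to-1 output comes from passing a weighted sum of the Features through the logistic (sigmoid) function. The sketch below shows only that final step. The weights and bias are invented for illustration; in a real model they would be learned from training data.

```python
from math import exp


def sigmoid(z):
    """Logistic function: squashes any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Weighted sum of the Features passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical Features: has_fever, has_red_spots (1 = yes, 0 = no).
weights = [1.2, 2.0]  # made-up values, not learned from data
bias = -2.5

both_symptoms = predict_probability([1, 1], weights, bias)
no_symptoms = predict_probability([0, 0], weights, bias)
coin_flip = sigmoid(0)  # 0.5
```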
Classification is simply deciding what category the data fits best. Suppose you have divided your customer list into different segments for marketing purposes, and you want to determine into which segment a new customer should be placed. A classification model would look at Features such as geographic location, age, and annual income to determine that new customer’s ideal segment.
There are many classification algorithms to choose from, including Support Vector Machines, Decision Trees, Random Forests (which are really just multiple Decision Trees, hence the name), and Neural Networks. As a result, iteration within a classification problem will involve trying different algorithms to find the one that provides the best result.
Using our customer segmentation example above, how would you first create the segments? In the past this was done by using a single arbitrary criterion like location or income. Thanks to Machine Learning, segmentation can now be done automatically with multiple criteria at once.
The most common clustering algorithm is K-means, in which items are divided into the number of categories you specify (the ‘K’ in K-means is that number) based on the most relevant Features. Once the clusters (categories) are determined by the algorithm, classification can be used to identify the best category for new items.
Iteration in clustering problems involves varying the number of clusters to find the optimal number.
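The assign-then-update loop at the heart of K-means can be sketched in plain Python. This is a simplified version under stated assumptions: it starts from the first K points instead of a random or smarter initialization, it runs a fixed number of iterations, and the customer data (age, income in thousands) is invented.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def k_means(points, k, iterations=10):
    """Simplified K-means: naive initialization from the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


customers = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]
centroids, labels = k_means(customers, k=2)  # labels: [0, 0, 0, 1, 1, 1]
```

Because the result depends on the starting centroids and on the scale of each Feature, production implementations typically use better initialization and normalized inputs.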
Some things go well together, like chocolate and peanut butter. Affinity algorithms put items together based on historical data.
Recommenders fall into this category, as do marketing concepts like Market Basket Analysis. Grocers have long known the importance of identifying items that are frequently bought together and mastered the technique before the term “data science” was coined. Amazon built a thriving online business around the idea.
Affinity algorithms like Collaborative Filtering work by looking for items that frequently occur together. If people usually buy chocolate at the same time they buy coffee, for instance, the algorithm will learn that behavior. Once these relationships are identified you can then write code to suggest chocolate to someone who has coffee in their online shopping cart.
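The coffee-and-chocolate idea can be illustrated with simple co-occurrence counting. This is a much simpler technique than full Collaborative Filtering, but it shows the underlying principle. The function names and the basket contents are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def suggest(item, counts):
    """Return the item most often bought together with `item`, or None."""
    best, best_count = None, 0
    for (a, b), n in counts.items():
        if item in (a, b) and n > best_count:
            best, best_count = (b if a == item else a), n
    return best


baskets = [
    ["coffee", "chocolate", "milk"],
    ["coffee", "chocolate"],
    ["coffee", "bread"],
    ["milk", "bread"],
]
counts = pair_counts(baskets)
suggestion = suggest("coffee", counts)  # "chocolate"
```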
Model Scoring
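Validation is the process by which models are trained and then tested on data they have not seen. In its simplest form, a holdout split, historical data is divided into two sets, the training dataset and the test dataset. The training set is larger than the test set, with splits usually ranging from 60-40 to 80-20. Cross-validation repeats this idea across several different splits (folds) and averages the scores.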
The training data is fed into the algorithm, and the algorithm looks for patterns in that data to produce the model. Once the training phase is complete the resulting model is used on the test set. Since the Target values in the test set are known, the model’s predictions can be compared to the actual results and the model can be scored.
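A basic split of this kind can be sketched as follows. The function name `train_test_split`, the 80-20 default, and the fixed seed are assumptions chosen for the example; rows are shuffled first so that the test set is not just the most recent records.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten historical records
train, test = train_test_split(rows, test_fraction=0.2)
```

With ten rows and a 20% test fraction, eight rows go to training and two are held back for scoring. For time-ordered data such as sales history, shuffling may not be appropriate, and a cut by date is often used instead.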
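The percentage of times a model correctly predicted the Target is its accuracy. Accuracy applies to models that predict categories; a Linear Regression model, which predicts a continuous number, is scored instead by the size of its errors (for example, mean absolute error or R-squared). Even for category predictions, accuracy alone can mislead, particularly when one outcome is much more common than the other. Let’s examine two other ways to evaluate the usefulness of a model.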
To take scoring a step further than accuracy, the Confusion Matrix is often employed. This is a representation of true positives, true negatives, false positives, and false negatives that resulted from testing a model.
The best result has a low number of false positives and false negatives. The graph below shows a poor model, as the number of false positives is far higher than any other category.
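Counting the four outcomes takes only a few lines. This sketch assumes a binary Target where 1 is the positive class; the `actual` and `predicted` values are invented and are unrelated to the graph.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for a binary Target (1 = positive)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
# {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}

accuracy = (matrix["tp"] + matrix["tn"]) / len(actual)  # 0.75
```

Accuracy falls out of the same counts, but the matrix also shows which kind of mistake the model makes, which accuracy hides.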
Area Under Curve
The Receiver Operating Characteristic (“ROC”) curve came into being during World War II, when radar engineers and technicians used it to measure how well a receiver could distinguish enemy assets from noise. ROC is a way of graphically comparing the true positive rate of a model against its false positive rate as the decision threshold varies. As such, it is extremely useful for judging that model.
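The area under the ROC curve summarizes how well the model ranks positives above negatives: 1.0 means every positive scores higher than every negative, and 0.5 is no better than chance. An ideal curve therefore bows significantly to the top left. A curve that bows significantly to the bottom right also provides useful feedback, since it means the predictions are consistently inverted. The least informative result is a straight line from bottom left to top right, which is equivalent to random guessing.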
The graph above shows a typical ROC curve in blue, with the baseline in black. Since the curve bows well away from the baseline, this would signify a good result and thus a useful model.
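The area under the curve can be computed without drawing anything. It equals the fraction of positive-negative pairs in which the positive example receives the higher score. The sketch below uses that pairwise definition, which is fine for small datasets; the labels and scores are invented and are unrelated to the graph.

```python
def roc_auc(actual, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # model's predicted probabilities

auc = roc_auc(actual, scores)                          # 8 of 9 pairs ranked correctly
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])  # 1.0
inverted = roc_auc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9]) # 0.0
```

The `inverted` case corresponds to a curve bowing to the bottom right: the model separates the classes perfectly but has them backwards.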
The basic process behind machine learning isn’t complicated at all. After preparing the data you choose an appropriate algorithm, train the model with part of the data and then test it with the rest, and finally score the model to determine its usefulness. Although the details in each of these steps can be quite technical, the high-level workflow is not.
If you’d like to learn more about how to implement these steps, look in the blog for posts that include explanations and even code for Python and Spark.