Data has proliferated over the last few years and is now available in large quantities at a much lower cost. As a result, more individuals and companies are interested in the potential benefits of machine learning. This interest generally leads to the question, “How is machine learning different from statistics?”
What is Machine Learning?
Machine learning is a subset of artificial intelligence. It isn't brand-new technology: it has been around for a while but suffered in the past from a lack of computing power. It is now back in focus, given the volume and complexity of newly available data. Machine learning uses algorithms and models that improve as they are exposed to more data over time. The rules aren't explicitly programmed; instead, the model learns them from the data.
Statistics and Machine Learning
Statistics and machine learning differ in purpose and general intent. These differences become more pronounced as datasets and the number of variables of interest grow. In traditional statistics, the number of inputs typically does not exceed the number of subjects (a condition known as “long data”). Machine learning, on the other hand, is frequently used to analyze “wide data,” where the number of input variables exceeds the number of subjects. There are other fundamental differences, including:
- Purpose of data collection
- How data is collected
- Data assumptions to be met before analysis
- The intent of the analysis
A statistician designs the study in response to the research question before any data collection begins. Data is collected in a controlled, purposeful manner, using sampling techniques appropriate to the target population and research design, and instruments with established psychometric properties, meaning they have been validated over time and found to be reliable. In short, the data is collected for the specific purpose of answering a particular question. While there is no question that machine learning algorithms benefit from purposeful data collection, they are very often trained on whatever data is available. In other words, the data was not necessarily collected for the purpose to which it is put.
Statistics v. Machine Learning: Intent of the Analysis
It helps to first describe where statistics and machine learning are similar. Pattern detection is as much a goal in traditional statistics as it is in machine learning. Both also aim to discover knowledge and dig into the data for insights, and both draw on statistical learning theory, which uses a statistical framework to study inference and prediction. The intent of the analysis, however, is where the significant differences lie.
Machine learning is generally focused on building predictive models. Typical models might predict:
- Consumer purchasing intent
- How certain stocks will perform
- Books you might enjoy
Many of the data assumptions that guide traditional statistical procedures are not needed. What is needed to create the model is a learning algorithm that minimizes prediction error. The predictions need to be repeatable. Interpretability of the underpinnings is not necessarily required; what is critical is the accuracy and predictive power of the model.
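To make "a learning algorithm that minimizes error" concrete, here is a minimal sketch in Python. It assumes a simple linear model and uses gradient descent on mean squared error, which is one common choice among many. The data, the function names, the learning rate, and the step count are all made up for illustration.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Repeatedly nudge w and b in the direction that lowers the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 2 for x in xs]

w, b = gradient_descent(xs, ys)
print(f"learned w = {w:.3f}, b = {b:.3f}, error = {mse(w, b, xs, ys):.6f}")
```

Nothing in the loop knows the rule that generated the data. The algorithm only sees the error and adjusts the parameters to shrink it, which is the sense in which the model is learned rather than explicitly programmed.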
The predictive model is then validated using a subset of the data that was not used in training. During training, a model can become overfitted, meaning it no longer separates actual patterns and potential relationships from the noise in the data. The held-out test data shows how reliable the model’s predictions are: if the model was overfitted in training, its predictive ability will decrease in testing.
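The hold-out idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended workflow: the data is synthetic, the 150/50 split is arbitrary, and the "memorizer" (a 1-nearest-neighbour lookup) is a deliberately overfitted model chosen to make the effect visible.

```python
import random

def make_data(n, rng):
    """Noisy samples of the line y = 2x + 1 (synthetic, for illustration)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]

def fit_line(data):
    """Least-squares line through the training points."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(data):
    """Return the y of the closest training x, so the noise is memorized too."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(200, rng)
rng.shuffle(data)
train, test = data[:150], data[150:]  # the test rows are never used for fitting

line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name}: train MSE = {mse(model, train):.2f}, "
          f"test MSE = {mse(model, test):.2f}")
```

The memorizer reproduces its training data perfectly, so its training error is zero, but that says nothing about new data. Its error on the held-out rows is what reveals that it learned the noise along with the pattern.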
So how does the predictive purpose of machine learning differ from statistics? Are there statistical procedures used primarily for prediction? Definitely yes. However, in traditional statistical analysis, there is interest beyond the accuracy of the predictions; much of the interest is in interpretability. The model is typically fitted to the full dataset rather than trained and tested on separate subsets, and statisticians examine goodness-of-fit indicators and residuals. The focus is on understanding the relationships between the predictors and the outcome, not only on the model’s predictive power. The predictors and how they affect the results are what answer the research question and permit conclusions to be drawn.
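As a rough illustration of that statistical workflow, the Python sketch below fits a simple linear regression to a full (hypothetical) dataset and then inspects the pieces a statistician would look at: the coefficient, the residuals, and R-squared as one goodness-of-fit indicator. The variable names and the numbers are invented for the example.

```python
# Hypothetical data: hours studied (predictor) and exam score (outcome).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 71]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares estimates for score = intercept + slope * hours.
sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: what the model leaves unexplained for each subject.
fitted = [intercept + slope * x for x in hours]
residuals = [y - f for y, f in zip(scores, fitted)]

# R-squared: share of the variation in scores explained by hours.
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in scores)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f} points per extra hour, intercept = {intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
print(f"R-squared = {r_squared:.3f}")
```

Here the slope itself is the answer of interest: it states how the outcome changes with the predictor, which is the kind of conclusion a research question usually asks for.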
Should I Bother Learning Statistics if I’m Interested in Machine Learning?
It may be argued that it is going too far to classify machine learning as "glorified statistics". However, the more statistical knowledge used in designing the applications, the better the machine learning capabilities will be. Concepts you would want to learn include:
- Probability (particularly joint probability)
- Measures of central tendency (mean, median, mode)
- Measures of dispersion (range, variance, standard deviation)
- Measures of position (percentiles, z-scores)
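Each of the concepts listed above can be tried out with Python's standard library. The sketch below uses a small made-up sample and a two-dice example for joint probability. It assumes Python 3.8 or later for `statistics.quantiles`, and it uses the population forms of variance and standard deviation, which is a choice made for the example rather than a rule.

```python
import statistics
from fractions import Fraction
from itertools import product

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion (population variance and standard deviation)
data_range = max(data) - min(data)
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Measures of position
quartiles = statistics.quantiles(data, n=4)  # three cut points
z_score = (9 - mean) / std_dev               # how many std devs 9 lies above the mean

# Joint probability: two fair dice, P(first die is even AND the sum exceeds 8)
rolls = list(product(range(1, 7), repeat=2))
both = [r for r in rolls if r[0] % 2 == 0 and sum(r) > 8]
p_joint = Fraction(len(both), len(rolls))

print("mean, median, mode:", mean, median, mode)
print("range, variance, std dev:", data_range, variance, std_dev)
print("quartiles:", quartiles, "z-score of 9:", z_score)
print("joint probability:", p_joint)
```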
If you are interested in statistics and/or machine learning, look into Michigan Technological University’s Online MS in Applied Statistics program. This robust program can be completed entirely online in a reasonable time period. The degree will help you move forward into the role of statistician or data scientist and give you a hand in shaping the future of the data revolution.