The Computer Scientist’s Use of Statistics

If you are contemplating or enrolled in a computer science program, you probably know that statistics play a role in computer science. What you may not fully appreciate is how deep the relationship between the two fields runs. Michigan Technological University’s blog “Why Computer Scientists can Benefit from Studying Applied Statistics” describes many of the uses of statistics, particularly in data science. Statistics (particularly applied statistics) are so critical to the work of a computer scientist that it is helpful to describe how they regularly come into play.

Computer Scientist: A Day in the Life

Let’s take a look at what a typical routine for a computer scientist might entail. Depending on the specific area of computer science (e.g., software development/engineering, programming, or security), a computer scientist is apt to engage in the following:

  • Designing/maintaining information systems, hardware, and software
  • Data mining
  • Programming/Coding
  • Reporting on findings

It is fair to say that a day in the life of a computer scientist is one of knowledge discovery. But where do statistics come in? At very nearly every point, as knowledge discovery is central to statistics as well. Consider the endeavors listed above: statistics inform most of them. For example, designing computer systems generally involves modeling and simulation.

Statistics in Modeling and Simulation

Very often, computer science models are based on probability, so a solid understanding of probability theory is essential. These models are built on statistical algorithms and are usually used either to predict or to describe. Statistics used in predictive modeling typically include regression (examining the effect of one or more variables on an outcome and the magnitude of the relationship between them) and classification techniques (such as decision trees and neural networks). These models are seen across industries. One example is marketing, where predictive models are built to better understand consumers' buying behavior. Consumers might also be categorized (seasonal, impulse, designer-driven, etc.). Once the model is built, the computer scientist should know how to employ statistical resampling methods, such as cross-validation or the bootstrap, to test its predictive power.
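To make the idea of fitting a predictive model and then testing it by resampling concrete, here is a minimal sketch in Python using only the standard library. It fits a one-variable least-squares regression and estimates its out-of-sample error with k-fold cross-validation. The data are synthetic, and the names ad_spend, purchases, fit_line, and k_fold_rmse are illustrative assumptions, not taken from any real marketing dataset or library.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Average root-mean-square error on held-out folds (k-fold cross-validation)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        fold_errors.append((sum(squared) / len(squared)) ** 0.5)
    return sum(fold_errors) / k


# Synthetic data, made up for illustration: purchases rise with ad spend, plus noise.
rng = random.Random(42)
ad_spend = [rng.uniform(0, 100) for _ in range(200)]
purchases = [5.0 + 0.3 * x + rng.gauss(0, 2.0) for x in ad_spend]

intercept, slope = fit_line(ad_spend, purchases)
cv_rmse = k_fold_rmse(ad_spend, purchases)
print(f"slope={slope:.3f} intercept={intercept:.3f} cross-validated RMSE={cv_rmse:.3f}")
```

Because each fold is scored by a model that never saw it during fitting, the cross-validated error is a more honest estimate of predictive power than the error on the training data alone.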

Computer scientists design and deploy simulations to test new systems before they are released into production. Such simulations may be used to test for specification conformance, system reliability, and performance optimization. Statistical measures (probability, patterns, etc.) are at the heart of these simulations, which are often used to answer “what if” questions (e.g., “What if I add another component to the system?”).
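A “what if” question of this kind can be sketched as a small Monte Carlo simulation. The example below assumes a simplified system in which components fail independently and the system works only if every component works; the reliability figures and the simulate_uptime helper are hypothetical and chosen purely for illustration.

```python
import math
import random


def simulate_uptime(component_reliabilities, trials=100_000, seed=1):
    """Estimate the probability that every component in a series system works."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # The system is up in this trial only if all components are up.
        if all(rng.random() < p for p in component_reliabilities):
            successes += 1
    return successes / trials


current = [0.99, 0.98, 0.995]   # hypothetical per-component reliabilities
proposed = current + [0.97]     # "What if I add another component?"

current_estimate = simulate_uptime(current)
proposed_estimate = simulate_uptime(proposed)

# For independent components the exact answer is the product, which checks the simulation.
current_exact = math.prod(current)
proposed_exact = math.prod(proposed)

print(f"current:  simulated={current_estimate:.4f} exact={current_exact:.4f}")
print(f"proposed: simulated={proposed_estimate:.4f} exact={proposed_exact:.4f}")
```

In this toy case the exact answer is available in closed form, so the simulation is not strictly needed. The same pattern carries over to systems with dependencies or repair times, where no simple formula exists and simulation is the practical option.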

Statistics in Data Mining

Among the most critical statistics for computer science are those employed in data mining. Data mining is used to explore relationships and patterns within datasets. We are immersed in the era of big data: volumes of data from multiple sources in varying formats, most of it readily and inexpensively accessible. With such access to vast stores of data, data literacy becomes key to a successful data mining mission. Anyone looking to collect or otherwise produce a meaningful dataset must understand the components of the data and the business or research question they intend to address.

Too often, data are collected without sampling theory informing the collection technique. This typically results in data that are inappropriate for the business question or insufficient for proper analysis. It is also essential to understand the distribution of the data. The data distribution clues us in to the shape and variability of the dataset. Without such understanding, the wrong kind of analytical technique could be used, resulting in inaccurate or misleading results.
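The cost of ignoring sampling theory can be illustrated with a small sketch. It assumes a hypothetical population of order values that drift upward over time, then compares a convenience sample (the first 500 records) with a simple random sample of the same size. All numbers and variable names here are invented for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: 10,000 order values that trend upward over time.
population = [20 + 0.005 * i + rng.gauss(0, 5) for i in range(10_000)]
true_mean = statistics.fmean(population)

convenience_sample = population[:500]         # just the oldest 500 records
random_sample = rng.sample(population, 500)   # simple random sample, no replacement

convenience_error = abs(statistics.fmean(convenience_sample) - true_mean)
random_error = abs(statistics.fmean(random_sample) - true_mean)

print(f"population mean:          {true_mean:.2f}")
print(f"convenience sample error: {convenience_error:.2f}")
print(f"random sample error:      {random_error:.2f}")
```

Both samples contain the same number of records, yet the convenience sample misses the population mean by a wide margin because it covers only the earliest part of the trend.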

Descriptive statistics are frequently employed to investigate and understand a dataset prior to any form of analysis or modeling. Such statistics include:

  • Central tendency (mean, mode, median)
  • Dispersion (variance, standard deviation, range)
  • The distribution

A computer scientist would want to be very comfortable with producing these statistics and understanding what they mean before proceeding with any type of analysis. The impact of a single undetected outlier on the results could be more significant than imagined.
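The descriptive statistics listed above, and the effect of one outlier on them, can be sketched with Python's standard statistics module. The response-time values below are hypothetical, and describe is an illustrative helper rather than a library function.

```python
import statistics


def describe(values):
    """Return basic measures of central tendency and dispersion."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }


# Hypothetical response times in milliseconds.
response_times_ms = [102, 98, 105, 97, 101, 99, 103, 100, 98, 104]
with_outlier = response_times_ms + [5000]  # one bad measurement slips in

clean = describe(response_times_ms)
dirty = describe(with_outlier)

for name in clean:
    print(f"{name:>8}: clean={clean[name]:.1f}  with outlier={dirty[name]:.1f}")
```

A single extreme value pulls the mean and variance far from their original values while the median barely moves, which is why it helps to inspect several summary measures together before modeling.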

Programming/Coding

As with understanding the data being analyzed, statistics inform the type of programming necessary to answer business questions. Statistical knowledge guides the programming itself (such as the algorithms developed), along with its testing and validation. Statistical know-how aids the computer scientist in creating more efficient and accurate programming for meaningful results.

Representing and Reporting

A frequently overlooked aspect of computer science is the representation and reporting of findings. However, this is just as crucial as the technical work of producing the results. This is another case where statistics and computer science merge: how best to visually display, summarize, and report results. A weak representation of even the most robust analysis will diminish its impact. Reporting and communicating are a significant part of any statistician's training, which is yet another reason why a computer scientist would want to reap the benefits of statistical study.

How to Learn Statistics for Computer Science

While some computer science programs may require a course in statistics, often that course alone is insufficient to provide the computer scientist with the well-rounded statistical skill set needed. The computer scientist who is serious about their career should strongly consider an additional program or degree in statistics. An applied statistics program with an emphasis on solving real-world problems that computer scientists face daily would be particularly relevant.

The good news is that there are online graduate programs such as Michigan Tech’s Master of Science in Applied Statistics. Michigan Tech’s program is geared toward professionals who are looking for a robust program that won’t take years to complete but offers the required knowledge to further a computer science career.
