Data Mining and Statistics
We recently received the following question from a reader. (Hi Brad!)
What is the difference between data mining and statistics, and should I care?
The way I think about it, data mining is the process of using data to figure stuff out. Statistics is a collection of tools used for understanding data. We explicitly use statistical tools all the time to answer questions such as "Is the observed change in conversion rate (or response, or order size, ...) significant, or might it be just due to chance?" We also use statistics implicitly when, for example, a chi-square test inside a decision tree algorithm decides which of several candidate splits will make it into a model. When I make a histogram showing number of orders by order size bin, I am really exploring a distribution, although I may not choose to describe it that way. So, data miners use statistics, and much of what statisticians do might be called data mining.
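To make the "significant or just due to chance?" question concrete, here is a minimal sketch of a Pearson chi-square test on a 2x2 table of conversions, written with only the Python standard library. The function name `chi_square_2x2` and the counts in the usage lines are made up for illustration. The sketch applies no continuity correction, and it assumes every expected cell count is greater than zero.

```python
import math

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test comparing two conversion rates.

    conv_a, conv_b: number of conversions in groups A and B (hypothetical inputs)
    n_a, n_b:       number of visitors in groups A and B
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    Assumes every expected cell count is greater than zero.
    """
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if both groups shared one conversion rate
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Illustrative counts only: 200 of 10,000 convert before, 260 of 10,000 after
chi2, p_value = chi_square_2x2(200, 10_000, 260, 10_000)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference would be unusual if the two groups really converted at the same rate. A decision tree that uses chi-square to choose splits is applying the same kind of comparison to each candidate split.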
There is, however, a cultural difference between people who call themselves statisticians and people who call themselves data miners. This difference has its origins in different expectations about data size. Statistics grew up in an era of small data, and many statisticians still live in that world. There are strong practical and budgetary limits to how many patients you can recruit for a clinical trial, for instance. Statisticians have to extract every last drop of information from their small data sets, and so they have developed a lot of clever tools for doing that.
Data miners tend to live in a big data world. With big data, we can often replace cleverness with more data. Gordon's most recent post on oversampling is an example. If you have enough data that you can throw away most of the common cases and still have plenty to work with, that really is easier than keeping track of lots of weights. Similarly, with enough data, it is much easier (and more accurate) to estimate the probability that a subscriber will cancel at a tenure of 100 days by counting the many people who quit at that tenure and dividing by the even larger number of people who reached that tenure and so could have quit, than to make assumptions about the shape of the hazard function.
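The counting approach to the cancellation probability can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method from any particular post. The `(tenure_days, has_stopped)` record layout and the function name `empirical_hazard` are hypothetical. The sketch also assumes that every subscriber whose tenure is at least t counts as "at risk" at tenure t.

```python
def empirical_hazard(subscribers, t):
    """Estimate the probability of cancelling at tenure t by counting.

    subscribers: iterable of (tenure_days, has_stopped) pairs (assumed layout),
                 where tenure_days is the tenure at cancellation for stopped
                 subscribers and the current tenure for active ones.
    Returns stopped-at-t divided by at-risk-at-t, or None if nobody is at risk.
    """
    at_risk = 0   # subscribers who reached tenure t and so could have quit
    stopped = 0   # subscribers who actually quit at exactly tenure t
    for tenure_days, has_stopped in subscribers:
        if tenure_days >= t:
            at_risk += 1
            if has_stopped and tenure_days == t:
                stopped += 1
    if at_risk == 0:
        return None
    return stopped / at_risk

# Tiny made-up sample; a real estimate needs many subscribers at each tenure
sample = [(100, True), (100, True), (100, False), (150, False), (200, True), (50, True)]
print(empirical_hazard(sample, 100))
```

No assumption about the shape of the hazard function is needed here. The price is that the estimate is only stable when many subscribers reach each tenure, which is the big data situation described above.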