Posted on 30 Apr 2012 13:06:42.
Intelligent algorithms and machine learning perform tasks that are beyond the human brain. Statistical modelling of vast information sources is a must in decision making in numerous fields of society, from medicine to traffic planning and from analyses of stock markets to spam identification. However, interpreting and evaluating how well the models work still requires a critically minded human being.
In their most complex form, statistical models are based on thousands of explanatory variables. It is therefore important to find in the material the features that describe the target of investigation as precisely as possible. This task, known as feature selection, appears trivial, yet researchers have been trying for decades to solve this basic pattern recognition problem.
Juha Reunanen, in his doctoral dissertation, Overfitting in Feature Selection: Pitfalls and Solutions, at the Aalto University School of Science, shows that feature selection methods are often compared and evaluated on unjustified grounds. Two misleading conclusions are commonly drawn: that computationally intensive, slow search algorithms produce the most accurate results, and that fine-grained feature selection does so as well.
– Everyone wants to introduce a new, and the best, feature selection method. Nevertheless, comparing and choosing between these methods is not as easy as is often believed in research, says Reunanen, summarizing a key problem in his field of science.
Results that are too good to be true
According to Reunanen, the problem lies more in the overfitting of models that use machine learning methods to classify statistical data than in the simplicity of the algorithms or in the fact that irrelevant or redundant features have not been pruned.
A statistical model has been overfitted when it is capable only of repeating the information that has been fed to it but is unable to classify new data. The model learns that each pike swimming in the pond is a pike, but not why it is a pike.
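The definition above can be made concrete with a small sketch. It is only an illustration on synthetic data, not an experiment from the dissertation: the labels are random coin flips, so there is nothing to learn, and the classifier (a one-nearest-neighbour rule, chosen here purely for simplicity) merely memorizes its training points. All function names are made up for this example.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the stored training point closest to x."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    """Share of points in (xs, ys) that the memorizing model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)

def make_noise_data(rng, n, n_features):
    """Assumption for illustration: features and labels are independent noise."""
    xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_noise_data(rng, 100, 5)
test_x, test_y = make_noise_data(rng, 200, 5)

# Accuracy on the data the model has already seen versus on new data.
train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"training accuracy: {train_acc:.2f}, new-data accuracy: {test_acc:.2f}")
```

On the training points the model is always right, because each point is its own nearest neighbour. On new points it can only guess, so its accuracy should hover around one half.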
– This basic overfitting, of course, has been recognized for a long time in the field of statistical modelling, but the "overfitting of the second kind" is less often taken into account, Reunanen observes.
Suppose a model with a certain variable set identifies a particular fish species swimming in a particular pond with 95 percent accuracy. If we then believe that the model would perform just as well with different fish species in lakes, rivers and oceans, we commit the interpretation error discovered by Reunanen. The mistake is not necessarily noticed, because the results of overfitted models are often deceptively good.
– It is hard to draw valid conclusions. It is far too easy to get excited over some good results obtained by statistical models and pattern recognition methods. The researcher should exercise self-criticism and take a hard look whenever a promising explanation emerges.
Thus, rather than having discovered an optimal set of variables or created a brilliant algorithm, we might have fallen victim to a statistical bias.
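One way such a bias can arise is sketched here, as an assumption-laden illustration rather than a reproduction of Reunanen's experiments. The data are pure noise, so no variable set is truly better than another. A greedy forward search (a hypothetical `forward_select` helper) nevertheless picks the features that happen to score best on a small validation set, and the accuracy reported by the search itself looks far better than the accuracy on fresh data. The nearest-centroid classifier and all sample sizes are arbitrary choices for the example.

```python
import random

def make_noise_data(rng, n, n_features):
    """Features are pure noise; labels alternate 0/1 and are unrelated to them."""
    xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    ys = [i % 2 for i in range(n)]
    return xs, ys

def centroid_accuracy(fit_x, fit_y, eval_x, eval_y, features):
    """Fit a nearest-centroid classifier on `features`; return accuracy on eval data."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(fit_x, fit_y) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    hits = 0
    for x, y in zip(eval_x, eval_y):
        dist = {label: sum((x[f] - c) ** 2 for f, c in zip(features, cen))
                for label, cen in centroids.items()}
        hits += (min(dist, key=dist.get) == y)
    return hits / len(eval_x)

def forward_select(fit_x, fit_y, val_x, val_y, n_features, k):
    """Greedily add, k times, the feature giving the highest validation accuracy."""
    chosen, score = [], 0.0
    for _ in range(k):
        candidates = [f for f in range(n_features) if f not in chosen]
        scored = [(centroid_accuracy(fit_x, fit_y, val_x, val_y, chosen + [f]), f)
                  for f in candidates]
        score, best = max(scored)
        chosen.append(best)
    return chosen, score

rng = random.Random(1)
n_features = 50
xs, ys = make_noise_data(rng, 40, n_features)
fit_x, fit_y, val_x, val_y = xs[:20], ys[:20], xs[20:], ys[20:]
test_x, test_y = make_noise_data(rng, 500, n_features)  # never seen by the search

chosen, selection_acc = forward_select(fit_x, fit_y, val_x, val_y, n_features, k=3)
test_acc = centroid_accuracy(fit_x, fit_y, test_x, test_y, chosen)
print(f"accuracy seen during selection: {selection_acc:.2f}")
print(f"accuracy on fresh data:         {test_acc:.2f}")
```

The gap between the two printed figures is the point: the validation data guided the search, so its score is no longer an honest estimate. Keeping a separate set of data that the search never touches is one common way to expose the bias.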
– Accuracy and prediction capabilities of multivariate models are especially important in fields where the difference between accuracies of 85 and 95 percent matters. For example, if a model meant as a tool for a physician can diagnose a rare illness with 95 percent accuracy, it pays to use it.
However, we would not want the tool to rely on variables that require biopsies that are dangerous and painful to the patient, nor would we want an exhausting wait for the test results.
– To dramatize a bit, a model selected with proper methods should not need to include variables that require drilling a hole in one's skull in order to make the measurements possible.
juha.reunanen [at] iki [dot] fi
tel. +358 50 375 4475