A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization.

Publication record


Publication date

November 2015

Journal

BMC Medical Research Methodology

Authors

Identified members of Cancéropôle Est:
Ms. TRUNTZER Caroline


All authors:
Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL

Abstract

In applications of supervised statistical learning in the biomedical field, it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper, we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.
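The distinction between incomplete and complete CV can be made concrete with a small sketch. The code below is an illustration only, not the authors' method or software: it uses per-variable standardization as a simple stand-in for a data preparation step, a nearest-centroid rule as the classifier, and simulated toy data. All function names (`standardize_params`, `cv_error`, and so on) are hypothetical and chosen for this example.

```python
import random
import statistics


def standardize_params(rows):
    """Per-column mean and standard deviation estimated from `rows`."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return means, sds


def apply_standardize(rows, means, sds):
    """Standardize `rows` with previously estimated parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Misclassification rate of a nearest-centroid rule (illustrative classifier)."""
    centroids = {}
    for label in set(y_train):
        members = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]

    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    errors = sum(predict(x) != y for x, y in zip(X_test, y_test))
    return errors / len(y_test)


def cv_error(X, y, k, complete, seed=0):
    """k-fold CV error; `complete` decides where the preparation step is fitted."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    if not complete:
        # Incomplete CV: parameters are estimated once on ALL observations,
        # so each test fold has already influenced the preparation step.
        means, sds = standardize_params(X)
        X_all = apply_standardize(X, means, sds)

    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        if complete:
            # Complete CV: parameters come from the training fold only and
            # are then applied unchanged to the test fold.
            train_rows = [X[i] for i in train_idx]
            means, sds = standardize_params(train_rows)
            X_tr = apply_standardize(train_rows, means, sds)
            X_te = apply_standardize([X[i] for i in test_idx], means, sds)
        else:
            X_tr = [X_all[i] for i in train_idx]
            X_te = [X_all[i] for i in test_idx]
        fold_errors.append(
            nearest_centroid_error(
                X_tr, [y[i] for i in train_idx], X_te, [y[i] for i in test_idx]
            )
        )
    return statistics.fmean(fold_errors)


# Simulated toy data (an assumption of this sketch, not data from the paper).
rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0.5 * label, 1.0) for _ in range(10)] for label in y]

err_incomplete = cv_error(X, y, k=5, complete=False)
err_complete = cv_error(X, y, k=5, complete=True)
print(f"incomplete CV error: {err_incomplete:.3f}")
print(f"complete CV error:   {err_complete:.3f}")
```

Both estimates use the same fold assignment, so any difference between them comes only from where the preparation step is fitted. On a single small dataset the gap may be zero or even have the opposite sign; as the abstract notes, whether a systematic optimistic bias arises depends on the preparation step considered.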

Keywords

Algorithms; Data Interpretation, Statistical; Humans; Oligonucleotide Array Sequence Analysis; Principal Component Analysis; Regression Analysis; Selection Bias

Reference

BMC Med Res Methodol. 2015 Nov;15:95