On the increase of predictive performance with high-level data fusion

The combination of the different data sources for classification purposes, also called data fusion, can be done at different levels: low-level, i.e. concatenating data matrices, medium-level, i.e. concatenating data matrices after feature selection and high-level, i.e. combining model outputs. In this paper the predictive performance of high-level data fusion is investigated. Partial least squares is used on each of the data sets and dummy variables representing the classes are used as response variables. Based on the estimated responses ŷ(j) for data set j and class k, a Gaussian distribution p(g(k)|ŷ(j)) is fitted. A simulation study is performed that shows the theoretical performance of high-level data fusion for two classes and two data sets. Within group correlations of the predicted responses of the two models and differences between the predictive ability of each of the separate models and the fused models are studied. Results show that the error rate is always less than or equal to the best performing subset and can theoretically approach zero. Negative within group correlations always improve the predictive performance. However, if the data sets have a joint basis, as with metabolomics data, this is not likely to happen. For equally performing individual classifiers the best results are expected for small within group correlations. Fusion of a non-predictive classifier with a classifier that exhibits discriminative ability lead to increased predictive performance if the within group correlations are strong. An example with real life data shows the applicability of the simulation results.

T.G. Doeswijk, A.K. Smilde, J.A. Hageman, J.A. Westerhuis, F.A. van Eeuwijk
Publication data (text): 
2011, 705, 1-2, 41-7
2011; 705 (1-2): 41-47
Published in: 
Analytica Chimica Acta
Date of publication: 
October, 2011
Status of the publication: