J Supercomput

DOI 10.1007/s11227-015-1421-0

Critical data points-based unsupervised linear dimension reduction technology for science data

Di Wu1 · Naixue Xiong2 · Jinrong He3 ·

Chuanhe Huang1 © Springer Science+Business Media New York 2015

Abstract Recent advances in machine learning and data mining have led to powerful methods for the analysis and visualization of high dimensional data. This paper proposes an unsupervised linear dimension reduction algorithm named critical points preserving projection (CPPP). Selecting some key data points to represent the others has become more and more popular owing to its effectiveness and efficiency. Rather than considering all data points equally, the proposed algorithm just preserves both local neighborhood structure and global variance of critical data points. We explore a joint modification of locality preserving projection and principal component analysis to achieve these objectives. Experimental results on the UCI data sets show its good performance on pattern classification.

Keywords Dimension reduction · Critical data points · Unsupervised learning

B Chuanhe Huang huangch@whu.edu.cn

Di Wu wudi2012@whu.edu.cn

Naixue Xiong nxiong@coloradotech.edu

Jinrong He hejingrong@163.com 1 School of Computer, Wuhan University, Wuhan, China 2 School of Computer Science, Colorado Technical University,

Colorado Springs, CO, USA 3 State Key Laboratory of Software Engineering, School of Computer,

Wuhan University, Wuhan 430072, China 123

D. Wu et al. 1 Introduction

Technology advances have made data collection easier and faster, usually these data are represented by vectors which can be modeled as points in high-dimensional Euclidean spaces. Traditional distance-based similarity search algorithms over high-dimensional data do not work well due to the phenomenon of “curse of dimensionality” [1].

Although there are already some algorithms that can handle high-dimensional data directly (e.g., support vector machines [2] and naive Bayes models [3]), it is still a good practice to reduce the number of input features, aiming at finding a more meaningful low-dimensional representation which can also reduce computational cost and lead us to better models for inference. Thus, dimension reduction is an important data preparation step for many data mining and database applications. To extract useful information from data, we need to determine the type of intrinsic geometry structures (either locally or globally) needed to be preserved after the data embedding, and it is the core issue of dimension reduction.

Since the intrinsic information needed for preservation is not clear, parts of which are relevant to the user, this problem is inherently ill-posed. Usually these information can be expressed as many general priors about the world, depending on the specific data domain and the situation at hand. Therefore, a variety of different methods has been proposed which try to preserve different properties of the data. For example, principal component analysis (PCA) [4] is an orthonormal transformation of data to a low-dimensional space such that maximum variance of the original data is preserved. Turki et al. [5] proposed a generalized PCA named weighted maximum margin (WMV). Locality preserving projection (LPP) [6] is a locality preserving algorithm, and there is an assumption that nearby points are likely to have the same embedding. Neighborhood preserving embedding (NPE) [7] is aimed to preserving the local neighborhood structure on the data manifold, which is based on the assumption that each data point in the reduced space can be linearly reconstructed from its neighbors by the same weights used in the input space. Orthogonal locality preserving projection (OLPP) [8] and orthogonal neighborhood preserving projection (ONPP) [9] impose orthogonality constraints on projection matrix in LPP and NPP, respectively, to preserve distances and so the overall geometry will be preserved. Since the process of dimensionality reduction can be viewed as a linear transformation from high-dimensional data to its low-dimensional representation, Christos Boutsidis et al. [10] present randomized dimensionality reduction algorithms for k-means clustering.

To deal with small sample size problem, Wang et al. [11] first compute exponential of the scatter matrix, then the singularity of scatter matrix is overcome.

The main idea in our models is to preserve global and local relationship between critical data points which are selected from original sample sets. By analogy to LPP, we call our approach critical points preserving projection (CPPP). The proposed method is based on LPP criterion and PCA criterion, combining global discriminant measure and local clustering information together to achieve good discrimination for better classification, thus can eliminate or reduce the overlap of local structures and is robust to outliers. Like other linear methods, the proposed algorithm involves few tuning parameters, makes no strong distributional assumptions and can be efficiently solved with global optima. 123

Critical data points-based unsupervised...

The remaining sections of this paper are organized as follows: Sect. 2 introduces related work. In Sect. 3, we presented a novel linear dimension reduction algorithm based on critical data points. Experimental study of the performance of the proposed algorithm is presented in Sect. 4, and extensive experimental results show the superiority of the algorithm. We conclude in Sect. 5 with a discussion of future work. 2 Problem formulation and related works

The problem of dimension reduction can be formulated as: given high-dimensional inputs, {x1, x2, . . . , xN }, compute low-dimensional outputs {y1, y2, . . . , yN } that preserve certain intrinsic properties or information. The linear methods consist of replacing the original data X by a matrix of the form

Y = V T X (1) here V ∈ Rd×r is called projection matrix, X and Y denote data matrix whose columns are data points of high- and low- dimensional spaces, respectively. While for the nonlinear dimension reduction, it is usually difficult to provide an explicit mapping to transform measurements from a high-dimensional space to a low-dimensional subspace.