A Sparse Nearest Mean Classifier for High Dimensional Multi-Class Problems

Cor J. Veenman*, Annabel Bolck

Digital Technology and Biometrics Department, Netherlands Forensic Institute, The Hague

Abstract

The analysis of small datasets in high dimensional spaces is inherently difficult. For two-class classification problems there are a few methods that are able to face the so-called curse of dimensionality. However, for multi-class sparsely sampled datasets there are hardly any specific methods. In this paper, we propose four multi-class classifier alternatives that effectively deal with this type of data. Moreover, these methods implicitly select a feature subset optimized for class separation. Accordingly, they are especially interesting for domains where an explanation of the problem in terms of the original features is desired. In the experiments, we applied the proposed methods to an MDMA powders dataset, where the problem was to recognize the production process. It turns out that the proposed multi-class classifiers perform well, while the few utilized features correspond to known MDMA synthesis ingredients. In addition, to show the general applicability of the methods, we applied them to several other sparse datasets, ranging from bioinformatics to chemometrics datasets having as few as tens of samples in tens to even thousands of dimensions and three to four classes. The proposed methods had the best average performance, while very few dimensions were effectively utilized.

Keywords: Classification, multi-class, support vector machine, high dimensional, mass spectrometry, chemometrics, gene expression profiles, bioinformatics

*Corresponding author. Tel: +31-70-8886654, Fax: +31-70-8886559. Address: Laan van Ypenburg 6, 2497 GB, The Hague, The Netherlands. Email addresses: [email protected] (Cor J. Veenman), [email protected] (Annabel Bolck)

Preprint submitted to Pattern Recognition Letters, December 17, 2010
1. Introduction

The dimensionality of the problems that data analyzers are faced with is growing enormously. In many application areas, so-called high-throughput technologies, such as DNA microarrays Alon et al. (1999), Golub et al. (1999), van 't Veer et al. (2002), proteomics Pandey and Mann (2000), Patterson and Aebersold (2003) and mass spectrometry, are used to measure the state of processes and systems or the composition of substances.
We are studying mass spectrometry data to find a common structure in illegal synthetic drugs. The drugs that we are analyzing are MDMA (3,4-methylenedioxymethamphetamine) powders from dismantled production sites. MDMA, also known as XTC, is one of the most widely spread illegal synthetic drugs in Europe. By estimate, 4300 kg of MDMA was seized in Europe in 2004 UNODC (2006). To fight the organizations behind the production and trade of MDMA, comparative analysis could be of great aid for intelligence purposes. For instance, the type and possibly the number of operational illicit production sites can be derived from recognizing MDMA powders or tablets as the product of one of the known synthesis methods. That is, in the Netherlands MDMA is mostly synthesized in three variations of a reductive amination procedure Cason (1990). The problem of recognizing the synthesis method of MDMA powders has been addressed before in Koper et al. (2007), where the powders were analyzed using knowledge of the synthesis methods and substances seized in illicit labs. The results were certainly promising, but it took a lot of 'manual' effort to establish the decision rules based on process knowledge. Further, with 15% of the samples recognized wrongly, there is room for improvement. Finally, automatic methods are required, since similar problems will be addressed for other synthetic drugs, where suitable domain knowledge is not yet available. These automatic methods, however, should be transparent, so forensic experts can understand the automatic recognition process. This understanding is desirable from an analytical and intelligence point of view. More importantly, it is required when the forensic expert needs to explain his findings in the court of justice.
In this research we focus on the following questions: 1) can the three known synthesis methods for MDMA powders be identified automatically and accurately using statistical pattern recognition methods, and 2) which chemical elements are important to distinguish the synthesis methods? Because the dataset that we have access to is collected from MDMA powders found at discovered production sites, it is limited in size. We have mass spectrometry measurements of 53 chemical elements from the powders of only 57 MDMA samples Koper et al. (2007). As a consequence, we are faced with a small sample size multi-class classification problem. Recently, we proposed a sparse classifier model for two-class classification problems Veenman and Tax (2005). This so-called LESS classifier is a nearest mean classifier with integrated dimension reduction in a linear programming model. Here, we propose four extensions of the model that enable multi-class classification, while the number of effectively used dimensions is still minimized. To explore the general applicability of the method, we also apply it to several other small sample size multi-class problems from the bioinformatics and chemometrics domain.
In the next section, we first put our work in perspective by describing related work on high dimensional multi-class recognition problems. Then, we describe the previously published LESS classifier Veenman and Tax (2005) and we propose four multi-class extensions. In the following section, we describe the MDMA mass spectrometry dataset and report the experimental results of applying the proposed multi-class classifiers to the MDMA and other small sample size datasets. We conclude with a discussion of the results.
2. Related Work

When dealing with high dimensional multi-class small sample size classification problems, the main problem is clearly the sparseness of the data, or the emptiness of the high-dimensional data space. The problem is probably stronger for generative classifiers that try to model the high dimensional data distributions of the classes involved and use these models to define the inter-class decision boundaries. However, discriminative classifiers, which directly model the decision boundary, also suffer from the sparseness of the data. In both cases, lack of data, that is, insufficient sampling of the high dimensional distributions or of the high dimensional boundary between the distributions, hinders the modeling and results in models that easily overfit on the training data. Below we list related work on dimension reduction and high-dimensional multi-class methods.
Dimension reduction as a pre-processing step

The typical way of dealing with high-dimensional data is to reduce the dimensionality of the dataset in a pre-processing step. This can be done in two different ways: either through projection to a lower dimensional space, e.g. Cox and Cox (1994), Jolliffe (1986), Roweis and Saul (2000), Tenenbaum et al. (2000), or through the selection of a feature subset with a suitable class separation criterion Guyon and Elisseeff (2003), Peng et al. (2005). After the dimensionality has been reduced, any multi-class classifier can be used. While such an approach is straightforward and offers great flexibility in choosing any preferred classifier in the lower dimensional space, it has some fundamental problems. Because the dimension reduction step is independent of the classification, the model that is used for dimension reduction generally interferes with the classifier model that is used in the reduced space. This holds especially when non-linear classifiers are applied. Further, most projection methods are unsupervised, so they may harm the class separation by ignoring the class distributions. Another disadvantage of projection methods is that the resulting features are not a subset of the original features. Consequently, an explanation of the classification results in terms of a subset of the input features is not possible. This is often desirable, as in our case of drug synthesis method classification.
Regularized methods and sparse models

Instead of pre-processing methods for dimension reduction, regularized classification methods can handle high dimensional datasets directly by reducing the variance of the model Friedman (1989), Hastie et al. (2001), Hoerl and Kennard (1970), Tibshirani et al. (2002), Vapnik (1998). For our problems these classifiers have, however, two important shortcomings. First, these classifiers are designed for two-class classification problems. Second, they do not yield a sparse model. That is, all features are in principle used, so an explanation of the underlying phenomenon in terms of a subset of the measured features is not possible. Several methods have been reported that integrate feature subset selection Bhattacharyya et al. (2003), Bradley and Mangasarian (1998), Mika et al. (2001), Tibshirani (1996), Veenman and Tax (2005), Zhu et al. (2003), but these are also designed for two-class classification problems.
Multi-class methods through a post-processing step

To deal with multi-class problems, generic post-processing approaches are available. That is, several two-class classifiers can be trained on subproblems, either one-versus-rest or one-versus-one problems Kittler et al. (1998), Platt et al. (2000). The resulting classifiers can be converted into a multi-class classifier by fusing the outputs of the classifiers of the respective subproblems. This approach has, however, similar problems to the generic pre-processing dimension reduction methods. That is, the classifiers for the subproblems have no information about the other subproblems that need to be solved. Accordingly, dependencies between classes cannot be exploited.
Regularized multi-class methods

Finally, several regularized integrated multi-class classification methods have been reported that are all derived from the support vector machine Crammer and Singer (2000), Hsu and Lin (2002), Lee et al. (2004), Vapnik (1998), Weston and Watkins (1999). For these methods, the number of parameters increases at least linearly in the number of classes. Further, these methods still lack the ability to yield a sparse classification model. In Weston et al. (2003) multi-class sparse linear and non-linear models are proposed. With respect to design objectives, these are the most closely related to the work proposed here. The main differences from our work are that 1) we impose a simple data model based on the class means, 2) our models are linear, 3) the resulting model can be efficiently optimized (a linear program), and, most interestingly, 4) in contrast with the SVM multi-class implementations, we introduce multi-class versions that solve all inter-class boundaries in one subspace. We explain our model in the next section.
3. Multi-Class Generalization of the LESS Model

We first formalize the problem. Let $X = \{\vec{x}_1, \vec{x}_2, ..., \vec{x}_n\}$ be a data set with $n$ objects, where $\vec{x}_i$ is a feature vector in a $p$-dimensional metric space. Each object has a corresponding class label $y_i \in \{1, ..., c\}$, where $c$ is the number of classes. Let $X_k \subset X$ be the subset of $n_k$ objects from $X$ with label $k$, and $X_{\neq k} = X \setminus X_k$ the objects with a label unequal to $k$. We denote the mean vector of a set of objects $U \subset X$ with $\mu(U)$, the mean vector of $X_k$ with $\mu_k$, and the predicted label for object $\vec{x}$ with $f(\vec{x})$.
The rationale behind the LESS classification model is that all training objects should be closer to the mean vector of their class, while at the same time all dimensions are weighted in the distance computation and the total sum of these weights should be minimal Veenman and Tax (2005). To enable the modeling of non-separable classes, slack variables are introduced as in the support vector machine Vapnik (1998). As a result, some training vectors may not be closer to their class mean vector. The resulting problem is a linear minimization problem with linear constraints (a linear program) that can be solved efficiently. For two-class classification problems the LESS model is defined as follows Veenman and Tax (2005):
$$\min \;\; \sum_{j=1}^{p} w_j + C \sum_{i=1}^{n} \xi_i, \tag{1}$$

subject to:
$$\begin{cases} \sum_{j=1}^{p} w_j\,\phi(\vec{x}, j) \geq 1 - \xi_i, & \vec{x} \in X_1,\\ \sum_{j=1}^{p} w_j\,\phi(\vec{x}, j) < -1 + \xi_i, & \vec{x} \in X_2, \end{cases} \tag{2}$$

where
$$\phi(\vec{x}, j) = \left(x_j - \mu_{2j}\right)^2 - \left(x_j - \mu_{1j}\right)^2 \tag{3}$$
and
$$w_j \geq 0, \qquad \xi_i \geq 0. \tag{4}$$
In these equations, $\vec{w}$ is a vector with weights $w_j$ per dimension, and $\xi_i$ is the slack variable for object $\vec{x}_i$ that enables the modeling of inseparable data. $C$ is a tunable parameter that balances model sparseness against model accuracy.

Since in this model the weight entries $w_j$ are naturally non-negative and summed in the minimization term (1), the solution to this linear program is typically sparse in the number of effectively used dimensions. This is a phenomenon that was first noticed for L1 regularized regression models Breiman (1995), Tibshirani (1996). In Veenman and Tax (2005) it has been shown that LESS needs substantially fewer dimensions for similar or better performance than related models on several sparsely sampled high-dimensional two-class problems.
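To make the optimization concrete, the following minimal sketch sets up the two-class LESS linear program of Eqs. (1)-(4) and solves it with a general-purpose LP solver. It is an illustration under our own naming, with SciPy's linprog standing in for the GLPK solver mentioned later in the experiments; the strict inequality of Eq. (2) is relaxed to a non-strict one, as is usual in linear programming.

```python
import numpy as np
from scipy.optimize import linprog

def fit_less_two_class(X, y, C=1.0):
    """Solve the two-class LESS LP of Eqs. (1)-(4); y contains labels 1 and 2."""
    n, p = X.shape
    mu1, mu2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
    phi = (X - mu2) ** 2 - (X - mu1) ** 2        # Eq. (3), one row per object
    # Decision variables z = [w_1..w_p, xi_1..xi_n]; objective of Eq. (1).
    c = np.concatenate([np.ones(p), C * np.ones(n)])
    # Rewrite Eq. (2) in A_ub @ z <= b_ub form:
    #   class 1:  -sum_j w_j phi_ij - xi_i <= -1
    #   class 2:   sum_j w_j phi_ij - xi_i <= -1
    sign = np.where(y == 1, -1.0, 1.0)[:, None]
    A_ub = np.hstack([sign * phi, -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (p + n))
    w = res.x[:p]                                # typically sparse, Eq. (4) bounds
    return w, mu1, mu2

def predict_two_class(X, w, mu1, mu2):
    """Assign label 1 if the weighted discriminant of Eq. (3) is non-negative."""
    score = ((X - mu2) ** 2 - (X - mu1) ** 2) @ w
    return np.where(score >= 0, 1, 2)
```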
3.1. Multi-class Model

To generalize the LESS model for multi-class problems, we need to establish subproblems to learn the additional inter-class boundaries. To define a subproblem for a boundary, two data subsets $U$ and $V$ and a weight vector $\vec{w}$ are required. The means of the two data subsets define the line on which the vectors involved in the subproblem are projected. The weight vector determines in which subspace the projection line is established. The composition of subproblems using the data subsets $S = \{(U, V)\}$ and the set of weight vectors $W = \{\vec{w}\}$ yields the following general model:
$$\min_{\forall \vec{w} \in W} \;\; \sum_{j=1}^{p} w_j + C \sum_{i=1}^{n} \xi_i, \tag{5}$$

subject to:
$$\forall (U, V) \in S: \quad \begin{cases} \sum_{j=1}^{p} \phi(\vec{x}, U, V, \vec{w}, j) \geq 1 - \xi_i, & \vec{x} \in U,\\ \sum_{j=1}^{p} \phi(\vec{x}, U, V, \vec{w}, j) < -1 + \xi_i, & \vec{x} \in V, \end{cases} \tag{6}$$

where
$$\phi(\vec{x}, U, V, \vec{w}, j) = w_j \cdot \left[ \left(x_j - \mu(V)_j\right)^2 - \left(x_j - \mu(U)_j\right)^2 \right]. \tag{7}$$
For the selection of the data subsets $S$ and the use of the weight vectors $W$, we have two alternatives each. The combinations of these alternatives yield four versions of the multi-class LESS generalization. In all cases, the multi-class problem is modeled such that all inter-class boundaries are dependent. This contrasts with the post-processing approaches to multi-class combiners, where all sub-problems (inter-class boundaries) are independent. Below, we list the alternatives for the data subsets and weight vectors.
One-versus-rest constraints

For the SVM, the first true multi-class implementations, that is, through a single optimization problem, used one-versus-rest constraints Vapnik (1998), Weston and Watkins (1999). In our model (5)-(7), the subproblems can be implemented similarly by defining the set $S$ from (6) as follows:
$$S = \{(X_k, X_{\neq k})\} \quad \text{where} \quad k \in \{1, ..., c\}. \tag{8}$$

Accordingly, the data vectors are projected on the lines between the mean of a class and the mean of the remaining data vectors.

One-versus-one constraints¹

Alternatively, we can establish the constraints in a one-versus-one way. Then, the data subsets for the subproblems are defined as follows:

$$S = \{(X_k, X_h)\} \quad \text{where} \quad k, h \in \{1, ..., c\}; \; h > k. \tag{9}$$

¹ With one-versus-one constraints, for two classes the multi-class LESS reduces to the normal LESS.
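As a small illustration of Eqs. (8) and (9), the helpers below compose the subproblem set S for both schemes. The function names, and the assumption that the labels are given as an array alongside the data, are ours.

```python
from itertools import combinations
import numpy as np

def subsets_one_vs_rest(X, y, classes):
    """S = {(X_k, X_{!=k})}, Eq. (8): one subproblem per class."""
    return [(X[y == k], X[y != k]) for k in classes]

def subsets_one_vs_one(X, y, classes):
    """S = {(X_k, X_h)} with h > k, Eq. (9): one subproblem per class pair."""
    return [(X[y == k], X[y == h]) for k, h in combinations(classes, 2)]
```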
Multi-subspace

Similar to the multi-class SVM proposals Vapnik (1998), Weston and Watkins (1999), the subproblems can be solved simultaneously, each in a separate subspace. That is, for each pair of data subsets $(U, V)$ a corresponding weight vector $\vec{w} \in W$ is used.
Single-subspace

Because the data vectors are projected on the line between the means of the data subsets, either the one-versus-rest or the one-versus-one subsets, the subproblems can also all be resolved in the same subspace. Then, $W$ contains only one weight vector that is used for all subproblems. This is an important difference from the SVM implementations for multi-class problems Vapnik (1998), Weston and Watkins (1999). Both computationally and qualitatively this difference has advantages. The computational advantages result from a lower number of variables, see Table 1. Additionally, because the model has fewer parameters it also has fewer degrees of freedom, which is advantageous when only minimal data is available.
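To show how a single weight vector can be shared across all subproblems, the sketch below assembles the one-versus-one, single-subspace program (the configuration referred to as LESS(ovo/ss) in the experiments) as one linear program. This is a hedged illustration with our own names and SciPy's linprog in place of GLPK; the variable count matches the (c - 1)·n + p of Table 1 below.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def fit_less_ovo_single_subspace(X, y, C=1.0):
    """One-versus-one constraints, one shared weight vector (LESS(ovo/ss))."""
    classes = np.unique(y)
    n, p = X.shape
    mu = {k: X[y == k].mean(axis=0) for k in classes}
    blocks = []
    for k, h in combinations(classes, 2):            # pairs of Eq. (9)
        for own, other in ((k, h), (h, k)):
            # Objects on each side of the pair must lie at least a unit margin
            # closer (in weighted squared distance) to their own class mean
            # than to the other class mean, Eq. (6).
            phi = (X[y == own] - mu[other]) ** 2 - (X[y == own] - mu[own]) ** 2
            blocks.append(-phi)                      # move to A_ub @ z <= b form
    phi_rows = np.vstack(blocks)                     # (c - 1) * n rows in total
    m = phi_rows.shape[0]                            # one slack per constraint
    c = np.concatenate([np.ones(p), C * np.ones(m)])
    A_ub = np.hstack([phi_rows, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (p + m))
    return res.x[:p], mu                             # shared, typically sparse w
```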
Problem sizes

Depending on the model setup in terms of subproblems and subspaces, the model has a varying complexity in the number of constraints and variables, as can be seen in Table 1.

                  #subproblems   #constraints   #variables (multi-subspace)   #variables (single-subspace)
one-versus-one    c·(c−1)/2      (c−1)·n        (c−1)·(n + p·c/2)             (c−1)·n + p
one-versus-rest   c              c·n            c·(n + p)                     c·n + p

Table 1: Complexity of the models denoted in the number of constraints and variables.
In addition to the constraints that are required to impose class separation, all variables ($w_j$ and $\xi_i$) must be non-negative. We ignored these constraints in the table.
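As a quick arithmetic check of the formulas in Table 1, the snippet below computes the problem sizes for given c, n and p; for the MDMA data (c = 3, n = 57, p = 53) it reproduces the MDMA row of Table 5. The helper name is ours.

```python
def less_problem_sizes(c, n, p):
    """Constraint and variable counts per Table 1 (non-negativity ignored)."""
    return {
        "ovr constraints": c * n,
        "ovo constraints": (c - 1) * n,
        "ovr/ss variables": c * n + p,
        "ovr/ms variables": c * (n + p),
        "ovo/ss variables": (c - 1) * n + p,
        "ovo/ms variables": (c - 1) * n + p * c * (c - 1) // 2,  # = (c-1)(n + pc/2)
    }

# MDMA: c = 3 classes, n = 57 samples, p = 53 dimensions
# -> {'ovr constraints': 171, 'ovo constraints': 114, 'ovr/ss variables': 224,
#     'ovr/ms variables': 330, 'ovo/ss variables': 167, 'ovo/ms variables': 273}
print(less_problem_sizes(3, 57, 53))
```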
3.2. Classification

To classify unseen objects, these objects must be mapped on all subproblems.

One-versus-rest subproblems

In case of one-versus-rest subproblems, the unseen object obtains the label of the class that yields the highest distance toward its class mean:

$$\forall (X_k, X_{\neq k}) \in S: \quad \lambda(k) = \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_{\neq k}, \vec{w}, j). \tag{10}$$

One-versus-one subproblems

With one-versus-one subproblems, all subproblems and thus all classes receive many contributions for the unseen object, both positive and negative:

$$\forall (X_k, X_h) \in S: \quad \lambda(k) = \lambda(k) + \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_h, \vec{w}, j), \qquad \lambda(h) = \lambda(h) - \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_h, \vec{w}, j). \tag{11}$$

The label $f(\vec{x})$ of the unseen object $\vec{x}$ equals the index of $\lambda(i)$ with the highest total contribution:

$$f(\vec{x}) = \arg\max_i \lambda(i). \tag{12}$$
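A minimal sketch of the one-versus-rest decision rule of Eqs. (10) and (12) in the single-subspace setting: the unseen object is assigned to the class with the largest λ(k). The helper and its names are ours and assume that the class means are recomputed from the training data.

```python
import numpy as np

def predict_one_vs_rest(x, w, X, y, classes):
    """Assign x to arg max_k lambda(k), Eqs. (10) and (12), single subspace."""
    scores = []
    for k in classes:
        mu_k = X[y == k].mean(axis=0)         # class mean
        mu_rest = X[y != k].mean(axis=0)      # mean of the remaining objects
        # lambda(k) = sum_j w_j [(x_j - mu_rest_j)^2 - (x_j - mu_k_j)^2]
        scores.append(np.dot(w, (x - mu_rest) ** 2 - (x - mu_k) ** 2))
    return classes[int(np.argmax(scores))]
```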
4. Experiments

To evaluate the general effectiveness of the proposed method, we applied it to several high dimensional datasets in addition to the MDMA dataset. Moreover, we compared the performance with two methods that are typically applied in these high dimensional domains.
Dataset   Ref.                          #classes   #dimensions   #samples
MDMA      Koper et al. (2007)           3          53            57
GLIOMA    Nutt et al. (2003)            4          4433          50
LUNG      Bhattacharjee et al. (2001)   5          3312          203
MLL       Armstrong et al. (2002)       3          8685          72
OLITOS    Armanino et al. (1989)        4          25            120
SRBCT     Khan et al. (2001)            3          2308          83
WINE      Alon et al. (1999)            4          2700          44

Table 2: Overview of the characteristics of the data sets used in the experiments.
The data sets used in the experiments are biological tissue classification datasets measured with gene expression micro-arrays and chemometrics data sets measured through gas chromatography - mass spectrometry (GC-MS). The gene expression data sets SRBCT, GLIOMA and LUNG have been preprocessed as in Yang et al. (2006). In Table 2, we list the characteristics of the data.
We compared the LESS multi-class extensions with the support vector machine with linear kernel (LIN-SVM) Vapnik (1998) and partial least squares discriminant analysis (PLS-DA) Barker and Rayens (2003). The SVM has shown competitive performance in many application areas, especially with high dimensional problems, e.g. Statnikov et al. (2008), Yang et al. (2006), Zadora (2007). For the small sample size problems that we consider here, the linear kernel is the most appropriate. PLS-DA is a popular technique in the chemometrics community, where mass spectrometry devices yield high dimensional recognition problems, e.g. Ballabio et al. (2008), Bankefors et al. (2008).

The LESS classifier has been implemented with the general purpose linear program solver GLPK GLPK (2004). We used the SVM implementation libSVM Chang and Lin (2001). In this toolbox, multi-class problems are handled by solving and combining several independent one-versus-all SVM problems. Finally, the PLS-DA classifier is from PLS Toolbox PLS Toolbox (2004).
Parameter tuning and validation

The LESS extensions, the SVM, and PLS-DA all have a parameter that constrains the complexity of the model.
Dataset   LESS(ovr/ss)   LESS(ovr/ms)   LESS(ovo/ss)   LESS(ovo/ms)   LIN-SVM       PLS-DA
MDMA      19.3 ± 17.8    8.8 ± 10.8     9.3 ± 11.2     8.2 ± 15.0     12.1 ± 12.5   17.7 ± 13.8
GLIOMA    35.4 ± 16.6    23.4 ± 18.1    21.7 ± 13.8    18.8 ± 12.2    24.1 ± 17.0   17.2 ± 16.5
LUNG      2.9 ± 3.2      3.4 ± 5.2      2.5 ± 3.3      5.8 ± 6.3      2.8 ± 4.1     2.6 ± 3.6
MLL       7.2 ± 9.3      6.7 ± 9.2      6.7 ± 9.2      5.6 ± 7.9      5.0 ± 8.8     6.1 ± 10.1
OLITOS    17.6 ± 12.5    17.0 ± 11.7    15.6 ± 10.8    30.3 ± 14.6    27.4 ± 15.6   17.8 ± 14.3
SRBCT     1.7 ± 5.0      1.4 ± 3.1      2.6 ± 4.1      1.7 ± 3.8      0.0 ± 0.0     0.0 ± 0.0
WINE      29.7 ± 18.4    29.8 ± 20.0    30.8 ± 21.6    39.0 ± 19.4    27.3 ± 18.8   27.2 ± 17.7
Average   14.5           12.7           12.3           16.4           13.1          15.3

Table 3: Overview of the estimated class averaged generalisation errors in percentages (through 10-fold cross-validation repeated three times).

Dataset   LESS(ovr/ss)   LESS(ovr/ms)   LESS(ovo/ss)   LESS(ovo/ms)   LIN-SVM   PLS-DA
MDMA      5.7 ± 1.7      4.9 ± 1.5      6.2 ± 1.4      3.9 ± 1.1      53        (4.72 ± 3.5)
GLIOMA    42.4 ± 6.1     61.9 ± 5.1     32.5 ± 2.7     52.9 ± 7.0     4434      (9.8 ± 2.3)
LUNG      44.7 ± 11.9    63.6 ± 8.0     30.2 ± 9.3     21.6 ± 9.5     3312      (24.4 ± 2.3)
MLL       10.6 ± 7.0     27.4 ± 5.8     7.9 ± 2.5      9.3 ± 4.9      5848      (7.0 ± 2.7)
OLITOS    20.2 ± 1.7     22.0 ± 2.9     18.3 ± 1.4     17.3 ± 3.1     25        (23.00 ± 2.8)
SRBCT     21.5 ± 2.5     27.8 ± 3.2     19.4 ± 2.8     24.2 ± 2.2     2308      (20.1 ± 2.4)
WINE      22.4 ± 6.6     32.7 ± 3.0     13.7 ± 2.5     15.8 ± 4.2     2700      (24.3 ± 2.6)
Average   26.97          39.23          20.33          23.52          3100      (16.2)

Table 4: Overview of the average number of utilized dimensions in the cross-validation folds. For PLS-DA the numbers refer to the optimised number of latent variables. Therefore, for PLS-DA the number of dimensions are printed in brackets.
For LESS and the SVM, the regularization parameter (typically called 'C') balances model complexity and accuracy. The parameter of PLS-DA sets the number of latent variables. For all methods, we tune the respective parameter through stratified cross-validation. Because the performance of the methods is also estimated through stratified cross-validation, a nested cross-validation is required to include the parameter tuning sensitivity in the performance estimation. For both the parameter tuning (inner loop) and the performance testing (outer loop), we used stratified 10-fold cross-validation. In order to restrict the influence of imbalanced classes, we computed the averaged class error (balanced error) in the 10-fold cross-validation. We repeated the cross-validation three times to obtain a more reliable performance estimate.
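The evaluation protocol described above can be sketched as follows: an inner stratified 10-fold loop tunes the complexity parameter, while an outer stratified 10-fold loop (repeated three times) estimates the class-averaged (balanced) error. This is an illustration assuming scikit-learn utilities and an estimator with the usual fit/predict interface, not the authors' original setup.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def nested_cv_balanced_error(estimator, X, y, param_name, param_grid):
    """Nested stratified CV: inner 10-fold tuning, outer 10-fold x 3 repeats."""
    outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
    errors = []
    for train, test in outer.split(X, y):
        # Inner loop: pick the parameter value with the lowest balanced error.
        best_value, best_err = None, np.inf
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for value in param_grid:
            errs = []
            for itr, ival in inner.split(X[train], y[train]):
                model = clone(estimator).set_params(**{param_name: value})
                model.fit(X[train][itr], y[train][itr])
                pred = model.predict(X[train][ival])
                errs.append(1.0 - balanced_accuracy_score(y[train][ival], pred))
            if np.mean(errs) < best_err:
                best_value, best_err = value, np.mean(errs)
        # Outer loop: refit with the tuned parameter and test once.
        model = clone(estimator).set_params(**{param_name: best_value})
        model.fit(X[train], y[train])
        err = 1.0 - balanced_accuracy_score(y[test], model.predict(X[test]))
        errors.append(err)
    return np.mean(errors), np.std(errors)
```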
MDMA dataset

We now turn to the high-dimensional MDMA powder classification problem that we started with. We have two objectives for the model validation. First, the model should be able to predict the class labels accurately. Second, we are interested in the features that the model effectively used, so we know which elements contribute to the separation of the classes. We start with a description of the data and after that we describe the experimental results.
In total 57 MDMA powders were collected at 12 illegal production sites. These powders were produced using one of three different synthesis methods. The first method is the high pressure method. This method is performed at elevated pressure (3 bar) with hydrogen and platinum (Pt) as the catalysts. Another procedure is called the cold method. Here, sodium borohydride (NaBH4) serves as a reducing agent. The name cold method refers to the synthesis being performed at lower temperatures by cooling the reagents in a fridge to suppress unwanted reactions. The third method is the so-called aluminium amalgam method, where aluminium foil and mercury chloride (HgCl2) are used to form an aluminium amalgam, which serves as the reducing agent. Per lab, the synthesis method is established based on the type of reaction vessels, precursors and chemicals present at the site. Consequently, all sample powders collected at one site are assigned the same synthesis method label. From the collected MDMA powder samples, 29 powders (from five different labs) were synthesized by the high pressure method, 18 powders (from four different labs) were synthesized by the cold method and 10 powders (from three different labs) were produced by the aluminium amalgam method. The concentration levels of 53 chemical elements in the powders were measured with Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES) and Inductively Coupled Plasma Mass Spectrometry (ICP-MS) devices. For details on the measurement techniques and detection limits, see Koper et al. (2007).
In Table 3, we show the results of the cross-validation runs. In the first row of this table, it can be seen that the LESS multi-class version with one-versus-one constraints in multiple subspaces, LESS(ovo/ms), has the best performance with 8.2% error.
           #constraints        #variables
Dataset    ovr      ovo        ovr/ss    ovr/ms    ovo/ss    ovo/ms
MDMA       171      114        224       330       167       273
GLIOMA     200      150        4634      17936     4584      26754
LUNG       1015     812        4327      17575     4124      33932
MLL        171      114        6019      17715     5962      17658
OLITOS     480      360        505       580       385       510
SRBCT      332      249        2640      9564      2557      14097
WINE       176      132        2876      10976     2832      16332
Total      2545     1931       21225     74676     20611     109556

Table 5: Overview of the number of constraints and variables of the four multi-class LESS models for the various datasets.
In contrast, the analytical method using chemical knowledge resulted in 9 samples for which the synthesis method could not be determined correctly (an error of 15.8%) Koper et al. (2007). On average, LESS(ovo/ms) used approximately 4 dimensions in the cross-validation loops, see Table 4, first row. These dimensions correspond to the concentrations of the chemical elements B, Zn, Pt, and Hg. Of these elements, Zn had not been recognized as a marker for the production process in the previous study Koper et al. (2007).
Additional small sample size datasets

In Table 3, we show the results of applying the four LESS alternatives, the LIN-SVM and PLS-DA to the MDMA dataset and 6 other high dimensional small sample size datasets. Several observations can be made from the table. Overall, LESS(ovo/ss) has the lowest average error. Remarkably, LESS(ovo/ss) is the LESS model with the fewest parameters, both in the number of constraints and in the number of variables, see Table 5. Furthermore, among the LESS models, LESS(ovo/ms) has the highest average error and also the most parameters. This could indicate a relation between model complexity and data sparseness. LIN-SVM has the best performance for three datasets and also has a competitive performance on average. In addition to the performance measured in error, the transparency of the model in terms of the utilized dimensions is important. Neither LIN-SVM nor PLS-DA effectively reduces the number of utilized dimensions in the model. PLS-DA reduces the complexity of the model through a restricted number of latent variables. However, the latent variables are composed of a linear combination of, in general, all original dimensions. For this reason the number of utilized dimensions (i.e. latent variables) is printed in brackets for PLS-DA in Table 4. The table shows that all LESS methods use only a fraction of the original dimensions, especially for the very high dimensional datasets. Moreover, LESS(ovo/ss), which has the lowest average error, also consistently used the fewest dimensions.
5. Conclusions

In this paper, we proposed a method for the classification of high dimensional sparse multi-class datasets. The method is a multi-class extension of the previously reported two-class LESS method. We proposed four versions that differ in the number of parameters and in the complexity. The method aims at utilizing a subset of the original dimensions to deal with data sparseness and to obtain relevant dimensions for the given classification task.

With the proposed method, we studied a dataset consisting of MDMA powders that were produced with one of three known synthesis methods. The task was to predict the synthesis method based on the element concentrations of the powders as estimated with mass spectrometry. Additionally, it should be possible to inspect the utilized element concentrations in order to obtain problem knowledge from the model. The proposed method was able to predict the synthesis method with 8.2% average cross-validation error. That is, the pattern recognition approach improves on the analytical method using domain knowledge (15.8% error) Koper et al. (2007). Further, since the method is a sparse classification model, we could inspect the elements utilized by the model. It turned out that these correspond to the three elements previously reported as significant and related to the production process Koper et al. (2007), while an additional marker element (Zn) has been established.

To show the general applicability of the proposed method, we also applied it to other small sample size problems. Further, we compared the proposed models to other methods that are typically applied to this type of dataset, i.e. the linear support vector machine (LIN-SVM) and partial least squares discriminant analysis (PLS-DA). Overall, one of the LESS extensions had the lowest average error. Moreover, only the proposed LESS extensions modeled the inter-class boundaries in a true subspace. It turned out that for all datasets only a small subset of the original data dimensions was effectively used. The obtained classification model itself learns which features are relevant for the separation of the classes. Accordingly, besides competitive classification performance, the LESS models make it possible to gain domain knowledge.
References

Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A. J., 1999. Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences (PNAS) 96 (12), 6745–6750.

Armanino, C., Leardi, R., Lanteri, S., Modi, G., 1989. Chemometric analysis of Tuscan olive oils. Chemometrics and Intelligent Laboratory Systems 5, 343–354.

Armstrong, S., Staunton, J., Silverman, L., Pieters, R., den Boer, M., Minden, M., Sallan, S., Lander, E., Golub, T., Korsmeyer, S., 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30, 41–47.

Ballabio, D., Skov, T., Leardi, R., Bro, R., 2008. Classification of GC-MS measurements of wines by combining data dimension reduction and variable selection techniques. Journal of Chemometrics 22, 457–463.

Bankefors, J., Nord, L., Kenne, L., 2008. Structural classification of acyl-substituted quillaja saponins by electrospray ionisation ion trap multiple-stage mass spectrometry in combination with multivariate analysis. Rapid Communications in Mass Spectrometry 22 (23), 3851–3860.

Barker, M., Rayens, W., 2003. Partial least squares for discrimination. Journal of Chemometrics 17, 166–173.

Bhattacharjee, A., Richards, W., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E., Lander, E., Wong, W., Johnson, B., Golub, T., Sugarbaker, D., Meyerson, M., 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences 98 (24), 13790–13795.

Bhattacharyya, C., Grate, L., Rizki, A., Radisky, D., Molina, F., Jordan, M., Bissel, M., Mian, I., 2003. Simultaneous classification and relevant feature identification in high-dimensional spaces: Application to molecular profiling data. Signal Processing 83, 729–743.

Bradley, P., Mangasarian, O., 1998. Feature selection via concave minimization and support vector machines. In: Proc. 15th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, pp. 82–90.

Breiman, L., Nov. 1995. Better subset regression using the nonnegative garrote. Technometrics 37 (4), 373–384.

Cason, T., 1990. An evaluation of the potential for clandestine manufacture of 3,4-methylenedioxyamphetamine analogs and homologs. Journal of Forensic Science 35 (3), 675–697.

Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cox, T., Cox, M., 1994. Multidimensional Scaling. Chapman and Hall, London.

Crammer, K., Singer, Y., 2000. On the learnability and design of output codes for multiclass problems. In: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. pp. 35–46.
Friedman, J., 1989. Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), 165–175.

GLPK, 2004. GNU Linear Programming Kit. www.gnu.org.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531–537.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Hoerl, A., Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67.

Hsu, C.-W., Lin, C.-J., Mar. 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Jolliffe, I. T., 1986. Principal Component Analysis. Springer-Verlag.

Khan, J., Wei, J., Ringnér, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., Meltzer, P., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679.

Kittler, J., Hatef, M., Duin, R., Matas, J., Mar. 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3), 226–239.

Koper, C., van den Boom, C., Wiarda, W., Schrader, M., de Joode, P., van der Peijl, G., Bolck, A., Sep. 2007. Elemental analysis of 3,4-methylenedioxymethamphetamine (MDMA): A tool to determine the synthesis method and trace links. Forensic Science International 171 (2-3), 171–179.

Lee, Y., Lin, Y., Wahba, G., 2004. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81.

Mika, S., Rätsch, G., Müller, K., 2001. A mathematical programming approach to the kernel Fisher algorithm. In: Advances in Neural Information Processing Systems (NIPS). Vol. 13. MIT Press, pp. 591–597.

Nutt, C., Mani, D., Betensky, R., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M., Batchelor, T., Black, P., von Deimling, A., Pomeroy, S., Golub, T., Louis, D., 2003. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63, 1602–1607.

Pandey, A., Mann, M., 2000. Proteomics to study genes and genomes. Nature 405, 837–846.

Patterson, S., Aebersold, R., 2003. Proteomics: the first decade and beyond. Nature Genetics 33, 311–323.

Peng, H., Long, F., Ding, C., Aug. 2005. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8), 1226–1238.
Platt, J., Cristianini, N., Shawe-Taylor, J., 2000. Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems. Vol. 12. MIT Press, Cambridge, MA, pp. 547–553.

PLS Toolbox, 2004. PLS Toolbox Version 3.5. www.eigenvector.com.

Roweis, S., Saul, L., Dec. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326.

Statnikov, A., Wang, L., Aliferis, C., Jul. 2008. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9, 319.

Tenenbaum, J., de Silva, V., Langford, J., Dec. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323.

Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B 58 (1), 267–288.

Tibshirani, R., Hastie, T., Balasubramanian, B., Chu, G., 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences (PNAS) 99 (10), 6567–6572.

UNODC, 2006. World drug report. Tech. rep., United Nations Office on Drugs and Crime.

van 't Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R., Friend, S., Jan. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

Vapnik, V., 1998. Statistical Learning Theory. Wiley.

Veenman, C., Tax, D., Sep. 2005. LESS: a model-based classifier for sparse subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (9), 1496–1500.

Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M., 2003. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3, 1439–1461.

Weston, J., Watkins, C., 1999. Support vector machines for multi-class pattern recognition. In: Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN). Bruges, pp. 219–224.

Yang, K., Cai, Z., Li, J., Lin, G., 2006. A stable gene selection in microarray data analysis. BMC Bioinformatics 7 (1), 228.

Zadora, G., 2007. Glass analysis for forensic purposes – a comparison of classification methods. Journal of Chemometrics 21, 174–186.

Zhu, J., Rosset, S., Hastie, T., Tibshirani, R., 2003. 1-norm support vector machines. In: Thrun, S., Saul, L., Schölkopf, B. (Eds.), Advances in Neural Information Processing Systems (NIPS) 16. MIT Press, Cambridge, MA, p. 16.