A Sparse Nearest Mean Classifier for High Dimensional Multi-Class Problems

Cor J. Veenman*, Annabel Bolck

Digital Technology and Biometrics Department, Netherlands Forensic Institute, The Hague

Abstract

The analysis of small datasets in high dimensional spaces is inherently difficult. For two-class classification problems there are a few methods that are able to face the so-called curse of dimensionality. However, for multi-class sparsely sampled datasets there are hardly any specific methods. In this paper, we propose four multi-class classifier alternatives that effectively deal with this type of data. Moreover, these methods implicitly select a feature subset optimized for class separation. Accordingly, they are especially interesting for domains where an explanation of the problem in terms of the original features is desired. In the experiments, we applied the proposed methods to an MDMA powders dataset, where the problem was to recognize the production process. It turns out that the proposed multi-class classifiers perform well, while the few utilized features correspond to known MDMA synthesis ingredients. In addition, to show the general applicability of the methods, we applied them to several other sparse datasets, ranging from bioinformatics to chemometrics datasets having as few as tens of samples in tens to even thousands of dimensions and three to four classes. The proposed methods had the best average performance, while very few dimensions were effectively utilized.

Keywords: Classification, multi-class, support vector machine, high dimensional, mass spectrometry, chemometrics, gene expression profiles, bioinformatics

*Corresponding author. Tel: +31-70-8886654, Fax: +31-70-8886559. Address: Laan van Ypenburg 6, 2497 GB, The Hague, The Netherlands. Email addresses: [email protected] (Cor J. Veenman), [email protected] (Annabel Bolck)

Preprint submitted to Pattern Recognition Letters, December 17, 2010
1. Introduction

The dimensionality of the problems that data analyzers are faced with is growing enormously. In many application areas, so-called high-throughput technologies, such as DNA microarrays Alon et al. (1999), Golub et al. (1999), van 't Veer et al. (2002), proteomics Pandey and Mann (2000), Patterson and Aebersold (2003) and mass spectrometry, are used to measure the state of processes and systems or the composition of substances.
We are studying mass spectrometry data to find a common structure in illegal synthetic drugs. The drugs that we are analyzing are MDMA (3,4-methylenedioxymethamphetamine) powders from dismantled production sites. MDMA, also known as XTC, is one of the most widely spread illegal synthetic drugs in Europe. By estimate, 4300 kg of MDMA was seized in Europe in 2004 UNODC (2006). To fight the organizations behind the production and trade of MDMA, comparative analysis could be of great aid for intelligence purposes. For instance, the type and possibly the number of operational illicit production sites can be derived from recognizing MDMA powders or tablets as the product of one of the known synthesis methods. That is, in the Netherlands MDMA is mostly synthesized in three variations of a reductive amination procedure Cason (1990). The problem of recognizing the synthesis method of MDMA powders has been addressed before in Koper et al. (2007), where the powders were analyzed using knowledge of the synthesis methods and substances seized in illicit labs. The results were certainly promising, but it took a lot of 'manual' effort to establish the decision rules based on process knowledge. Further, with 15% of the samples recognized wrongly, there is room for improvement. Finally, automatic methods are required, since similar problems will be addressed for other synthetic drugs, where suitable domain knowledge is not yet available. These automatic methods, however, should be transparent, so forensic experts can understand the automatic recognition process. This understanding is desirable from an analytical and intelligence point of view. More importantly, it is required when the forensic expert needs to explain his findings in the court of justice.
In this research we focus on the following questions: 1) can the three known synthesis methods for MDMA powders be identified automatically and accurately using statistical pattern recognition methods, and 2) which chemical elements are important to distinguish the synthesis methods? Because the dataset that we have access to is collected from MDMA powders found at discovered production sites, it is limited in size. We have mass spectrometry measurements of 53 chemical elements from the powders of only 57 MDMA samples Koper et al. (2007). As a consequence, we are faced with a small sample size multi-class classification problem. Recently, we proposed a sparse classifier model for two-class classification problems Veenman and Tax (2005). This so-called LESS classifier is a nearest mean classifier with integrated dimension reduction in a linear programming model. Here, we propose four extensions of the model that enable multi-class classification, while the number of effectively used dimensions is still minimized. To explore the general applicability of the method, we also apply it to several other small sample size multi-class problems from the bioinformatics and chemometrics domain.
In the next section, we first put our work in perspective by describing related work on high dimensional multi-class recognition problems. Then, we describe the previously published LESS classifier Veenman and Tax (2005) and we propose four multi-class extensions. In the following section, we describe the MDMA mass spectrometry dataset and report the experimental results of applying the proposed multi-class classifiers to the MDMA and other small sample size datasets. We conclude with a discussion of the results.
2. Related Work

When dealing with high dimensional multi-class small sample size classification problems, the main problem is clearly the sparseness of the data, or the emptiness of the high-dimensional data space. The problem is probably stronger for generative classifiers that try to model the high dimensional data distributions of the classes involved and use these models to define the inter-class decision boundaries. However, discriminative classifiers, which directly model the decision boundary, also suffer from the sparseness of the data. In both cases, lack of data, that is, insufficient sampling of the high dimensional distributions or of the high dimensional boundary between the distributions, hinders the modeling and results in models that easily overfit on the training data. Below we list related work on dimension reduction and high-dimensional multi-class methods.
Dimension reduction as a pre-processing step

The typical way of dealing with high-dimensional data is to reduce the dimensionality of the dataset in a pre-processing step. This can be done in two different ways: either through projection to a lower dimensional space, e.g. Cox and Cox (1994), Jolliffe (1986), Roweis and Saul (2000), Tenenbaum et al. (2000), or through the selection of a feature subset with a suitable class separation criterion Guyon and Elisseeff (2003), Peng et al. (2005). After the dimensionality has been reduced, any multi-class classifier can be used. While such an approach is straightforward and offers great flexibility in choosing any preferred classifier in the lower dimensional space, it has some fundamental problems. Because the dimension reduction step is independent of the classification, the model that is used for dimension reduction generally interferes with the classifier model that is used in the reduced space. This holds especially when non-linear classifiers are applied. Further, most projection methods are unsupervised, so they may harm the class separation by ignoring the class distributions. Another disadvantage of projection methods is that the resulting features are not a subset of the original features. Consequently, an explanation of the classification results in terms of a subset of the input features is not possible. This is often desirable, as in our case of drug synthesis method classification.
Regularized methods and sparse models

Instead of pre-processing methods for dimension reduction, regularized classification methods can handle high dimensional datasets directly by reducing the variance of the model Friedman (1989), Hastie et al. (2001), Hoerl and Kennard (1970), Tibshirani et al. (2002), Vapnik (1998). For our problems these classifiers have, however, two important shortcomings. First, these classifiers are designed for two-class classification problems. Second, they do not yield a sparse model. That is, all features are in principle used, so an explanation of the underlying phenomenon in terms of a subset of the measured features is not possible. Several methods have been reported that integrate feature subset selection Bhattacharyya et al. (2003), Bradley and Mangasarian (1998), Mika et al. (2001), Tibshirani (1996), Veenman and Tax (2005), Zhu et al. (2003), but these are also designed for two-class classification problems.
Multi-class methods through a post-processing step

To deal with multi-class problems, generic post-processing approaches are available. That is, several two-class classifiers can be trained on subproblems, either one-versus-rest or one-versus-one problems Kittler et al. (1998), Platt et al. (2000). The resulting classifiers can be converted into a multi-class classifier by fusing the outputs of the classifiers of the respective subproblems. This approach has, however, similar problems to the generic pre-processing dimension reduction methods. That is, the classifiers for the subproblems have no information about the other subproblems that need to be solved. Accordingly, dependencies between classes cannot be exploited.
Regularized multi-class methods

Finally, several regularized integrated multi-class classification methods have been reported that are all derived from the support vector machine Crammer and Singer (2000), Hsu and Lin (2002), Lee et al. (2004), Vapnik (1998), Weston and Watkins (1999). For these methods, the number of parameters increases at least linearly in the number of classes. Further, these methods still lack the ability to yield a sparse classification model. In Weston et al. (2003) multi-class sparse linear and non-linear models are proposed. With respect to design objectives, these are the most closely related to the work proposed here. The main differences from our work are that 1) we impose a simple data model based on the class means, 2) our models are linear, 3) the resulting model can be efficiently optimized (a linear program), and, most interestingly, 4) in contrast with the SVM multi-class implementations, we introduce multi-class versions that solve all inter-class boundaries in one subspace. We explain our model in the next section.
3. Multi-Class Generalization of the LESS Model

We first formalize the problem. Let $X = \{\vec{x}_1, \vec{x}_2, ..., \vec{x}_n\}$ be a data set with $n$ objects, where $\vec{x}_i$ is a feature vector in a $p$-dimensional metric space. Each object has a corresponding class label $y_i \in \{1, ..., c\}$, where $c$ is the number of classes. Let $X_k \subset X$ be the subset of $n_k$ objects from $X$ with label $k$, and $X_{\neq k} = X \setminus X_k$ the objects with a label unequal to $k$. We denote the mean vector of a set of objects $U \subset X$ with $\mu(U)$, the mean vector of $X_k$ with $\mu_k$, and the predicted label for object $\vec{x}$ with $f(\vec{x})$.
The rationale behind the LESS classification model is that all training objects should be closer to the mean vector of their class, while at the same time all dimensions are weighted in the distance computation and the total sum of these weights should be minimal Veenman and Tax (2005). To enable the modeling of non-separable classes, slack variables are introduced as in the support vector machine Vapnik (1998). As a result, some training vectors may not be closer to their class mean vector. The resulting problem is a linear minimization problem with linear constraints (a linear program) that can be solved efficiently. For two-class classification problems the LESS model is defined as follows Veenman and Tax (2005):
$$\min \;\; \sum_{j=1}^{p} w_j + C \sum_{i=1}^{n} \xi_i, \tag{1}$$

subject to:
$$\begin{cases} \sum_{j=1}^{p} w_j\,\phi(\vec{x}, j) \geq 1 - \xi_i, & \vec{x} \in X_1,\\ \sum_{j=1}^{p} w_j\,\phi(\vec{x}, j) < -1 + \xi_i, & \vec{x} \in X_2, \end{cases} \tag{2}$$

where
$$\phi(\vec{x}, j) = \left(x_j - \mu_{2j}\right)^2 - \left(x_j - \mu_{1j}\right)^2 \tag{3}$$
and
$$w_j \geq 0, \qquad \xi_i \geq 0. \tag{4}$$
In these equations, $\vec{w}$ is a vector with weights $w_j$ per dimension, and $\xi_i$ is the slack variable for object $\vec{x}_i$ that enables the modeling of inseparable data. $C$ is a tunable parameter that balances model sparseness against model accuracy.

Since in this model the weight entries $w_j$ are naturally non-negative and summed in the minimization term (1), the solution to this linear program is typically sparse in the number of effectively used dimensions. This is a phenomenon that was first noticed for L1 regularized regression models Breiman (1995), Tibshirani (1996). In Veenman and Tax (2005) it has been shown that LESS needs substantially fewer dimensions for similar or better performance than related models on several sparsely sampled high-dimensional two-class problems.
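To make the optimization concrete, the following minimal sketch sets up the two-class LESS linear program of Eqs. (1)-(4) and solves it with a general-purpose LP solver. It is an illustration under our own naming, with SciPy's linprog standing in for the GLPK solver mentioned later in the experiments; the strict inequality of Eq. (2) is relaxed to a non-strict one, as is usual in linear programming.

```python
import numpy as np
from scipy.optimize import linprog

def fit_less_two_class(X, y, C=1.0):
    """Solve the two-class LESS LP of Eqs. (1)-(4); y contains labels 1 and 2."""
    n, p = X.shape
    mu1, mu2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
    phi = (X - mu2) ** 2 - (X - mu1) ** 2        # Eq. (3), one row per object
    # Decision variables z = [w_1..w_p, xi_1..xi_n]; objective of Eq. (1).
    c = np.concatenate([np.ones(p), C * np.ones(n)])
    # Rewrite Eq. (2) in A_ub @ z <= b_ub form:
    #   class 1:  -sum_j w_j phi_ij - xi_i <= -1
    #   class 2:   sum_j w_j phi_ij - xi_i <= -1
    sign = np.where(y == 1, -1.0, 1.0)[:, None]
    A_ub = np.hstack([sign * phi, -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (p + n))
    w = res.x[:p]                                # typically sparse, Eq. (4) bounds
    return w, mu1, mu2

def predict_two_class(X, w, mu1, mu2):
    """Assign label 1 if the weighted discriminant of Eq. (3) is non-negative."""
    score = ((X - mu2) ** 2 - (X - mu1) ** 2) @ w
    return np.where(score >= 0, 1, 2)
```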
3.1. Multi-class Model

To generalize the LESS model for multi-class problems, we need to establish subproblems to learn the additional inter-class boundaries. To define a subproblem for a boundary, two data subsets $U$ and $V$ and a weight vector $\vec{w}$ are required. The means of the two data subsets define the line on which the vectors involved in the subproblem are projected. The weight vector determines in which subspace the projection line is established. The composition of subproblems using the data subsets $S = \{(U, V)\}$ and the set of weight vectors $W = \{\vec{w}\}$ yields the following general model:
$$\min_{\forall \vec{w} \in W} \;\; \sum_{j=1}^{p} w_j + C \sum_{i=1}^{n} \xi_i, \tag{5}$$

subject to:
$$\forall (U, V) \in S: \quad \begin{cases} \sum_{j=1}^{p} \phi(\vec{x}, U, V, \vec{w}, j) \geq 1 - \xi_i, & \vec{x} \in U,\\ \sum_{j=1}^{p} \phi(\vec{x}, U, V, \vec{w}, j) < -1 + \xi_i, & \vec{x} \in V, \end{cases} \tag{6}$$

where
$$\phi(\vec{x}, U, V, \vec{w}, j) = w_j \cdot \left[ \left(x_j - \mu(V)_j\right)^2 - \left(x_j - \mu(U)_j\right)^2 \right]. \tag{7}$$
For the selection of the data subsets $S$ and the use of the weight vectors $W$, we have two alternatives each. The combinations of these alternatives yield four versions of the multi-class LESS generalization. In all cases, the multi-class problem is modeled such that all inter-class boundaries are dependent. This contrasts with the post-processing approaches to multi-class combiners, where all sub-problems (inter-class boundaries) are independent. Below, we list the alternatives for the data subsets and weight vectors.
One-versus-rest constraints

For the SVM, the first true multi-class implementations, that is, through a single optimization problem, used one-versus-rest constraints Vapnik (1998), Weston and Watkins (1999). In our model (5)-(7), the subproblems can be implemented similarly by defining the set $S$ from (6) as follows:
$$S = \{(X_k, X_{\neq k})\} \quad \text{where} \quad k \in \{1, ..., c\}. \tag{8}$$

Accordingly, the data vectors are projected on the lines between the mean of a class and the mean of the remaining data vectors.

One-versus-one constraints¹

Alternatively, we can establish the constraints in a one-versus-one way. Then, the data subsets for the subproblems are defined as follows:

$$S = \{(X_k, X_h)\} \quad \text{where} \quad k, h \in \{1, ..., c\}; \; h > k. \tag{9}$$

¹ With one-versus-one constraints, for two classes the multi-class LESS reduces to the normal LESS.
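As a small illustration of Eqs. (8) and (9), the helpers below compose the subproblem set S for both schemes. The function names, and the assumption that the labels are given as an array alongside the data, are ours.

```python
from itertools import combinations
import numpy as np

def subsets_one_vs_rest(X, y, classes):
    """S = {(X_k, X_{!=k})}, Eq. (8): one subproblem per class."""
    return [(X[y == k], X[y != k]) for k in classes]

def subsets_one_vs_one(X, y, classes):
    """S = {(X_k, X_h)} with h > k, Eq. (9): one subproblem per class pair."""
    return [(X[y == k], X[y == h]) for k, h in combinations(classes, 2)]
```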
Multi-subspace

Similar to the multi-class SVM proposals Vapnik (1998), Weston and Watkins (1999), the subproblems can be solved simultaneously, each in a separate subspace. That is, for each pair of data subsets $(U, V)$ a corresponding weight vector $\vec{w} \in W$ is used.
Single-subspace

Because the data vectors are projected on the line between the means of the data subsets, either the one-versus-rest or the one-versus-one subsets, the subproblems can also all be resolved in the same subspace. Then, $W$ contains only one weight vector that is used for all subproblems. This is an important difference from the SVM implementations for multi-class problems Vapnik (1998), Weston and Watkins (1999). Both computationally and qualitatively this difference has advantages. The computational advantages result from a lower number of variables, see Table 1. Additionally, because the model has fewer parameters it also has fewer degrees of freedom, which is advantageous when only minimal data is available.
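To show how a single weight vector can be shared across all subproblems, the sketch below assembles the one-versus-one, single-subspace program (the configuration referred to as LESS(ovo/ss) in the experiments) as one linear program. This is a hedged illustration with our own names and SciPy's linprog in place of GLPK; the variable count matches the (c - 1)·n + p of Table 1 below.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def fit_less_ovo_single_subspace(X, y, C=1.0):
    """One-versus-one constraints, one shared weight vector (LESS(ovo/ss))."""
    classes = np.unique(y)
    n, p = X.shape
    mu = {k: X[y == k].mean(axis=0) for k in classes}
    blocks = []
    for k, h in combinations(classes, 2):            # pairs of Eq. (9)
        for own, other in ((k, h), (h, k)):
            # Objects on each side of the pair must lie at least a unit margin
            # closer (in weighted squared distance) to their own class mean
            # than to the other class mean, Eq. (6).
            phi = (X[y == own] - mu[other]) ** 2 - (X[y == own] - mu[own]) ** 2
            blocks.append(-phi)                      # move to A_ub @ z <= b form
    phi_rows = np.vstack(blocks)                     # (c - 1) * n rows in total
    m = phi_rows.shape[0]                            # one slack per constraint
    c = np.concatenate([np.ones(p), C * np.ones(m)])
    A_ub = np.hstack([phi_rows, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (p + m))
    return res.x[:p], mu                             # shared, typically sparse w
```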
Problem sizes

Depending on the model setup in terms of subproblems and subspaces, the model has a varying complexity in the number of constraints and variables, as can be seen in Table 1.

                  #subproblems   #constraints   #variables (multi-subspace)   #variables (single-subspace)
one-versus-one    c·(c−1)/2      (c−1)·n        (c−1)·(n + p·c/2)             (c−1)·n + p
one-versus-rest   c              c·n            c·(n + p)                     c·n + p

Table 1: Complexity of the models denoted in the number of constraints and variables.
In addition to the constraints that are required to impose class separation, all variables ($w_j$ and $\xi_i$) must be non-negative. We ignored these constraints in the table.
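As a quick arithmetic check of the formulas in Table 1, the snippet below computes the problem sizes for given c, n and p; for the MDMA data (c = 3, n = 57, p = 53) it reproduces the MDMA row of Table 5. The helper name is ours.

```python
def less_problem_sizes(c, n, p):
    """Constraint and variable counts per Table 1 (non-negativity ignored)."""
    return {
        "ovr constraints": c * n,
        "ovo constraints": (c - 1) * n,
        "ovr/ss variables": c * n + p,
        "ovr/ms variables": c * (n + p),
        "ovo/ss variables": (c - 1) * n + p,
        "ovo/ms variables": (c - 1) * n + p * c * (c - 1) // 2,  # = (c-1)(n + pc/2)
    }

# MDMA: c = 3 classes, n = 57 samples, p = 53 dimensions
# -> {'ovr constraints': 171, 'ovo constraints': 114, 'ovr/ss variables': 224,
#     'ovr/ms variables': 330, 'ovo/ss variables': 167, 'ovo/ms variables': 273}
print(less_problem_sizes(3, 57, 53))
```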
3.2. Classification

To classify unseen objects, these objects must be mapped on all subproblems.

One-versus-rest subproblems

In case of one-versus-rest subproblems, the unseen object obtains the label of the class that yields the highest distance toward its class mean:

$$\forall (X_k, X_{\neq k}) \in S: \quad \lambda(k) = \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_{\neq k}, \vec{w}, j). \tag{10}$$

One-versus-one subproblems

With one-versus-one subproblems, all subproblems and thus all classes receive many contributions for the unseen object, both positive and negative:

$$\forall (X_k, X_h) \in S: \quad \lambda(k) = \lambda(k) + \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_h, \vec{w}, j), \qquad \lambda(h) = \lambda(h) - \sum_{j=1}^{p} \phi(\vec{x}, X_k, X_h, \vec{w}, j). \tag{11}$$

The label $f(\vec{x})$ of the unseen object $\vec{x}$ equals the index of $\lambda(i)$ with the highest total contribution:

$$f(\vec{x}) = \arg\max_i \lambda(i). \tag{12}$$
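A minimal sketch of the one-versus-rest decision rule of Eqs. (10) and (12) in the single-subspace setting: the unseen object is assigned to the class with the largest λ(k). The helper and its names are ours and assume that the class means are recomputed from the training data.

```python
import numpy as np

def predict_one_vs_rest(x, w, X, y, classes):
    """Assign x to arg max_k lambda(k), Eqs. (10) and (12), single subspace."""
    scores = []
    for k in classes:
        mu_k = X[y == k].mean(axis=0)         # class mean
        mu_rest = X[y != k].mean(axis=0)      # mean of the remaining objects
        # lambda(k) = sum_j w_j [(x_j - mu_rest_j)^2 - (x_j - mu_k_j)^2]
        scores.append(np.dot(w, (x - mu_rest) ** 2 - (x - mu_k) ** 2))
    return classes[int(np.argmax(scores))]
```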
4. Experiments

To evaluate the general effectiveness of the proposed method, we applied it to several high dimensional datasets in addition to the MDMA dataset. Moreover, we compared the performance with two methods that are typically applied in these high dimensional domains.
Dataset   Ref.                          #classes   #dimensions   #samples
MDMA      Koper et al. (2007)           3          53            57
GLIOMA    Nutt et al. (2003)            4          4433          50
LUNG      Bhattacharjee et al. (2001)   5          3312          203
MLL       Armstrong et al. (2002)       3          8685          72
OLITOS    Armanino et al. (1989)        4          25            120
SRBCT     Khan et al. (2001)            3          2308          83
WINE      Alon et al. (1999)            4          2700          44

Table 2: Overview of the characteristics of the data sets used in the experiments.
The data sets used in the experiments are biological tissue classification datasets measured with gene expression micro-arrays and chemometrics data sets measured through gas chromatography - mass spectrometry (GC-MS). The gene expression data sets SRBCT, GLIOMA and LUNG have been preprocessed as in Yang et al. (2006). In Table 2, we list the characteristics of the data.
We compared the LESS multi-class extensions with the support vector machine with linear kernel (LIN-SVM) Vapnik (1998) and partial least squares discriminant analysis (PLS-DA) Barker and Rayens (2003). The SVM has shown competitive performance in many application areas, especially with high dimensional problems, e.g. Statnikov et al. (2008), Yang et al. (2006), Zadora (2007). For the small sample size problems that we consider here, the linear kernel is the most appropriate. PLS-DA is a popular technique in the chemometrics community, where mass spectrometry devices yield high dimensional recognition problems, e.g. Ballabio et al. (2008), Bankefors et al. (2008).

The LESS classifier has been implemented with the general purpose linear program solver GLPK GLPK (2004). We used the SVM implementation libSVM Chang and Lin (2001). In this toolbox, multi-class problems are handled by solving and combining several independent one-versus-all SVM problems. Finally, the PLS-DA classifier is from PLS Toolbox PLS Toolbox (2004).
Parameter tuning and validation

The LESS extensions, the SVM, and PLS-DA all have a parameter that constrains the complexity of the model.
Dataset   LESS(ovr/ss)   LESS(ovr/ms)   LESS(ovo/ss)   LESS(ovo/ms)   LIN-SVM       PLS-DA
MDMA      19.3 ± 17.8    8.8 ± 10.8     9.3 ± 11.2     8.2 ± 15.0     12.1 ± 12.5   17.7 ± 13.8
GLIOMA    35.4 ± 16.6    23.4 ± 18.1    21.7 ± 13.8    18.8 ± 12.2    24.1 ± 17.0   17.2 ± 16.5
LUNG      2.9 ± 3.2      3.4 ± 5.2      2.5 ± 3.3      5.8 ± 6.3      2.8 ± 4.1     2.6 ± 3.6
MLL       7.2 ± 9.3      6.7 ± 9.2      6.7 ± 9.2      5.6 ± 7.9      5.0 ± 8.8     6.1 ± 10.1
OLITOS    17.6 ± 12.5    17.0 ± 11.7    15.6 ± 10.8    30.3 ± 14.6    27.4 ± 15.6   17.8 ± 14.3
SRBCT     1.7 ± 5.0      1.4 ± 3.1      2.6 ± 4.1      1.7 ± 3.8      0.0 ± 0.0     0.0 ± 0.0
WINE      29.7 ± 18.4    29.8 ± 20.0    30.8 ± 21.6    39.0 ± 19.4    27.3 ± 18.8   27.2 ± 17.7
Average   14.5           12.7           12.3           16.4           13.1          15.3

Table 3: Overview of the estimated class averaged generalisation errors in percentages (through 10-fold cross-validation repeated three times).

Dataset   LESS(ovr/ss)   LESS(ovr/ms)   LESS(ovo/ss)   LESS(ovo/ms)   LIN-SVM   PLS-DA
MDMA      5.7 ± 1.7      4.9 ± 1.5      6.2 ± 1.4      3.9 ± 1.1      53        (4.72 ± 3.5)
GLIOMA    42.4 ± 6.1     61.9 ± 5.1     32.5 ± 2.7     52.9 ± 7.0     4434      (9.8 ± 2.3)
LUNG      44.7 ± 11.9    63.6 ± 8.0     30.2 ± 9.3     21.6 ± 9.5     3312      (24.4 ± 2.3)
MLL       10.6 ± 7.0     27.4 ± 5.8     7.9 ± 2.5      9.3 ± 4.9      5848      (7.0 ± 2.7)
OLITOS    20.2 ± 1.7     22.0 ± 2.9     18.3 ± 1.4     17.3 ± 3.1     25        (23.00 ± 2.8)
SRBCT     21.5 ± 2.5     27.8 ± 3.2     19.4 ± 2.8     24.2 ± 2.2     2308      (20.1 ± 2.4)
WINE      22.4 ± 6.6     32.7 ± 3.0     13.7 ± 2.5     15.8 ± 4.2     2700      (24.3 ± 2.6)
Average   26.97          39.23          20.33          23.52          3100      (16.2)

Table 4: Overview of the average number of utilized dimensions in the cross-validation folds. For PLS-DA the numbers refer to the optimised number of latent variables. Therefore, for PLS-DA the number of dimensions are printed in brackets.
For LESS and the SVM, the regularization parameter (typically called 'C') balances model complexity and accuracy. The parameter of PLS-DA sets the number of latent variables. For all methods, we tune the respective parameter through stratified cross-validation. Because the performance of the methods is also estimated through stratified cross-validation, a nested cross-validation is required to include the parameter tuning sensitivity in the performance estimation. For both the parameter tuning (inner loop) and the performance testing (outer loop), we used stratified 10-fold cross-validation. In order to restrict the influence of imbalanced classes, we computed the averaged class error (balanced error) in the 10-fold cross-validation. We repeated the cross-validation three times to obtain a more reliable performance estimate.
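The evaluation protocol described above can be sketched as follows: an inner stratified 10-fold loop tunes the complexity parameter, while an outer stratified 10-fold loop (repeated three times) estimates the class-averaged (balanced) error. This is an illustration assuming scikit-learn utilities and an estimator with the usual fit/predict interface, not the authors' original setup.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def nested_cv_balanced_error(estimator, X, y, param_name, param_grid):
    """Nested stratified CV: inner 10-fold tuning, outer 10-fold x 3 repeats."""
    outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
    errors = []
    for train, test in outer.split(X, y):
        # Inner loop: pick the parameter value with the lowest balanced error.
        best_value, best_err = None, np.inf
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for value in param_grid:
            errs = []
            for itr, ival in inner.split(X[train], y[train]):
                model = clone(estimator).set_params(**{param_name: value})
                model.fit(X[train][itr], y[train][itr])
                pred = model.predict(X[train][ival])
                errs.append(1.0 - balanced_accuracy_score(y[train][ival], pred))
            if np.mean(errs) < best_err:
                best_value, best_err = value, np.mean(errs)
        # Outer loop: refit with the tuned parameter and test once.
        model = clone(estimator).set_params(**{param_name: best_value})
        model.fit(X[train], y[train])
        err = 1.0 - balanced_accuracy_score(y[test], model.predict(X[test]))
        errors.append(err)
    return np.mean(errors), np.std(errors)
```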
MDMA dataset

We now turn to the high-dimensional MDMA powder classification problem that we started with. We have two objectives for the model validation. First, the model should be able to predict the class labels accurately. Second, we are interested in the features that the model effectively used, so we know which elements contribute to the separation of the classes. We start with a description of the data and after that we describe the experimental results.
In total 57 MDMA powders were collected at 12 illegal production sites. These powders were produced using one of three different synthesis methods. The first method is the high pressure method. This method is performed at elevated pressure (3 bar) with hydrogen and platinum (Pt) as the catalysts. Another procedure is called the cold method. Here, sodium borohydride (NaBH4) serves as a reducing agent. The name cold method refers to the synthesis being performed at lower temperatures by cooling the reagents in a fridge to suppress unwanted reactions. The third method is the so-called aluminium amalgam method, where aluminium foil and mercury chloride (HgCl2) are used to form an aluminium amalgam, which serves as the reducing agent. Per lab, the synthesis method is established based on the type of reaction vessels, precursors and chemicals present at the site. Consequently, all sample powders collected at one site are assigned the same synthesis method label. From the collected MDMA powder samples, 29 powders (from five different labs) were synthesized by the high pressure method, 18 powders (from four different labs) were synthesized by the cold method and 10 powders (from three different labs) were produced by the aluminium amalgam method. The concentration levels of 53 chemical elements in the powders were measured with Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES) and Inductively Coupled Plasma Mass Spectrometry (ICP-MS) devices. For details on the measurement techniques and detection limits, see Koper et al. (2007).
In Table 3, we show the results of the cross-validation runs. In the first row of this table, it can be seen that the LESS multi-class version with one-versus-one constraints in multiple subspaces, LESS(ovo/ms), has the best performance with 8.2% error.
           #constraints        #variables
Dataset    ovr      ovo        ovr/ss    ovr/ms    ovo/ss    ovo/ms
MDMA       171      114        224       330       167       273
GLIOMA     200      150        4634      17936     4584      26754
LUNG       1015     812        4327      17575     4124      33932
MLL        171      114        6019      17715     5962      17658
OLITOS     480      360        505       580       385       510
SRBCT      332      249        2640      9564      2557      14097
WINE       176      132        2876      10976     2832      16332
Total      2545     1931       21225     74676     20611     109556

Table 5: Overview of the number of constraints and variables of the four multi-class LESS models for the various datasets.
In contrast, the analytical method using chemical knowledge resulted in 9 samples for which the synthesis method could not be determined correctly (an error of 15.8%) Koper et al. (2007). On average, LESS(ovo/ms) used approximately 4 dimensions in the cross-validation loops, see Table 4, first row. These dimensions correspond to the concentrations of the chemical elements B, Zn, Pt, and Hg. Of these elements, Zn had not been recognized as a marker for the production process in the previous study Koper et al. (2007).
Additional small sample size datasets

In Table 3, we show the results of applying the four LESS alternatives, the LIN-SVM and PLS-DA to the MDMA dataset and 6 other high dimensional small sample size datasets. Several observations can be made from the table. Overall, LESS(ovo/ss) has the lowest average error. Remarkably, LESS(ovo/ss) is the LESS model with the fewest parameters, both in the number of constraints and in the number of variables, see Table 5. Furthermore, among the LESS models, LESS(ovo/ms) has the highest average error and also the most parameters. This could indicate a relation between model complexity and data sparseness. LIN-SVM has the best performance for three datasets and also has a competitive performance on average. In addition to the performance measured in error, the transparency of the model in terms of the utilized dimensions is important. Neither LIN-SVM nor PLS-DA effectively reduces the number of utilized dimensions in the model. PLS-DA reduces the complexity of the model through a restricted number of latent variables. However, the latent variables are composed of a linear combination of, in general, all original dimensions. For this reason the number of utilized dimensions (i.e. latent variables) is printed in brackets for PLS-DA in Table 4. The table shows that all LESS methods use only a fraction of the original dimensions, especially for the very high dimensional datasets. Moreover, LESS(ovo/ss), which has the lowest average error, also consistently used the fewest dimensions.
5. Conclusions

In this paper, we proposed a method for the classification of high dimensional sparse multi-class datasets. The method is a multi-class extension of the previously reported two-class LESS method. We proposed four versions that differ in the number of parameters and in the complexity. The method aims at utilizing a subset of the original dimensions to deal with data sparseness and to obtain relevant dimensions for the given classification task.

With the proposed method, we studied a dataset consisting of MDMA powders that were produced with one of three known synthesis methods. The task was to predict the synthesis method based on the element concentrations of the powders as estimated with mass spectrometry. Additionally, it should be possible to inspect the utilized element concentrations in order to obtain problem knowledge from the model. The proposed method was able to predict the synthesis method with 8.2% average cross-validation error. That is, the pattern recognition approach improves on the analytical method using domain knowledge (15.8% error) Koper et al. (2007). Further, since the method is a sparse classification model, we could inspect the elements utilized by the model. It turned out that these correspond to the three elements previously reported as significant and related to the production process Koper et al. (2007), while an additional marker element (Zn) has been established.

To show the general applicability of the proposed method, we also applied it to other small sample size problems. Further, we compared the proposed models to other methods that are typically applied to this type of dataset, i.e. the linear support vector machine (LIN-SVM) and partial least squares discriminant analysis (PLS-DA). Overall, one of the LESS extensions had the lowest average error. Moreover, only the proposed LESS extensions modeled the inter-class boundaries in a true subspace. It turned out that for all datasets only a small subset of the original data dimensions was effectively used. The obtained classification model itself learns which features are relevant for the separation of the classes. Accordingly, besides competitive classification performance, the LESS models make it possible to gain domain knowledge.
References

Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A. J., 1999. Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences (PNAS) 96 (12), 6745–6750.

Armanino, C., Leardi, R., Lanteri, S., Modi, G., 1989. Chemometric analysis of Tuscan olive oils. Chemometrics and Intelligent Laboratory Systems 5, 343–354.

Armstrong, S., Staunton, J., Silverman, L., Pieters, R., den Boer, M., Minden, M., Sallan, S., Lander, E., Golub, T., Korsmeyer, S., 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30, 41–47.

Ballabio, D., Skov, T., Leardi, R., Bro, R., 2008. Classification of GC-MS measurements of wines by combining data dimension reduction and variable selection techniques. Journal of Chemometrics 22, 457–463.

Bankefors, J., Nord, L., Kenne, L., 2008. Structural classification of acyl-substituted quillaja saponins by electrospray ionisation ion trap multiple-stage mass spectrometry in combination with multivariate analysis. Rapid Communications in Mass Spectrometry 22 (23), 3851–3860.

Barker, M., Rayens, W., 2003. Partial least squares for discrimination. Journal of Chemometrics 17, 166–173.

Bhattacharjee, A., Richards, W., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E., Lander, E., Wong, W., Johnson, B., Golub, T., Sugarbaker, D., Meyerson, M., 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences 98 (24), 13790–13795.

Bhattacharyya, C., Grate, L., Rizki, A., Radisky, D., Molina, F., Jordan, M., Bissel, M., Mian, I., 2003. Simultaneous classification and relevant feature identification in high-dimensional spaces: Application to molecular profiling data. Signal Processing 83, 729–743.

Bradley, P., Mangasarian, O., 1998. Feature selection via concave minimization and support vector machines. In: Proc. 15th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, pp. 82–90.

Breiman, L., Nov. 1995. Better subset regression using the nonnegative garrote. Technometrics 37 (4), 373–384.

Cason, T., 1990. An evaluation of the potential for clandestine manufacture of 3,4-methylenedioxyamphetamine analogs and homologs. Journal of Forensic Science 35 (3), 675–697.

Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cox, T., Cox, M., 1994. Multidimensional Scaling. Chapman and Hall, London.

Crammer, K., Singer, Y., 2000. On the learnability and design of output codes for multiclass problems. In: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. pp. 35–46.
Friedman, J., 1989. Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), 165–175.

GLPK, 2004. GNU Linear Programming Kit. www.gnu.org.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531–537.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Hoerl, A., Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67.

Hsu, C.-W., Lin, C.-J., Mar. 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Jolliffe, I. T., 1986. Principal Component Analysis. Springer-Verlag.

Khan, J., Wei, J., Ringnér, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., Meltzer, P., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679.

Kittler, J., Hatef, M., Duin, R., Matas, J., Mar. 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3), 226–239.

Koper, C., van den Boom, C., Wiarda, W., Schrader, M., de Joode, P., van der Peijl, G., Bolck, A., Sep. 2007. Elemental analysis of 3,4-methylenedioxymethamphetamine (MDMA): A tool to determine the synthesis method and trace links. Forensic Science International 171 (2-3), 171–179.

Lee, Y., Lin, Y., Wahba, G., 2004. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81.

Mika, S., Rätsch, G., Müller, K., 2001. A mathematical programming approach to the kernel Fisher algorithm. In: Advances in Neural Information Processing Systems (NIPS). Vol. 13. MIT Press, pp. 591–597.

Nutt, C., Mani, D., Betensky, R., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M., Batchelor, T., Black, P., von Deimling, A., Pomeroy, S., Golub, T., Louis, D., 2003. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63, 1602–1607.

Pandey, A., Mann, M., 2000. Proteomics to study genes and genomes. Nature 405, 837–846.

Patterson, S., Aebersold, R., 2003. Proteomics: the first decade and beyond. Nature Genetics 33, 311–323.

Peng, H., Long, F., Ding, C., Aug. 2005. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8), 1226–1238.
Platt, J., Cristianini, N., Shawe-Taylor, J., 2000. Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems. Vol. 12. MIT Press, Cambridge, MA, pp. 547–553.

PLS Toolbox, 2004. PLS Toolbox Version 3.5. www.eigenvector.com.

Roweis, S., Saul, L., Dec. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326.

Statnikov, A., Wang, L., Aliferis, C., Jul. 2008. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9, 319.

Tenenbaum, J., de Silva, V., Langford, J., Dec. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323.

Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B 58 (1), 267–288.

Tibshirani, R., Hastie, T., Balasubramanian, B., Chu, G., 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences (PNAS) 99 (10), 6567–6572.

UNODC, 2006. World drug report. Tech. rep., United Nations Office on Drugs and Crime.

van 't Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R., Friend, S., Jan. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

Vapnik, V., 1998. Statistical Learning Theory. Wiley.

Veenman, C., Tax, D., Sep. 2005. LESS: a model-based classifier for sparse subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (9), 1496–1500.

Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M., 2003. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3, 1439–1461.

Weston, J., Watkins, C., 1999. Support vector machines for multi-class pattern recognition. In: Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN). Bruges, pp. 219–224.

Yang, K., Cai, Z., Li, J., Lin, G., 2006. A stable gene selection in microarray data analysis. BMC Bioinformatics 7 (1), 228.

Zadora, G., 2007. Glass analysis for forensic purposes – a comparison of classification methods. Journal of Chemometrics 21, 174–186.

Zhu, J., Rosset, S., Hastie, T., Tibshirani, R., 2003. 1-norm support vector machines. In: Thrun, S., Saul, L., Schölkopf, B. (Eds.), Advances in Neural Information Processing Systems (NIPS) 16. MIT Press, Cambridge, MA, p. 16.