Identifying splice-junction sequences by hierarchical multiclassifier


Pattern Recognition Letters 27 (2006) 1390–1396
www.elsevier.com/locate/patrec

Alessandra Lumini, Loris Nanni *

DEIS, IEIIT – CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Received 13 September 2004; received in revised form 27 April 2005; available online 2 May 2006

Communicated by L. Goldfarb

Abstract

Splice sites are locations in DNA which separate protein-coding regions (exons) from noncoding regions (introns). Recently, several works have approached the splice-junction problem with machine learning methods. In this work, we propose a hierarchical multiclassifier (HM) architecture whose results show a drastic error reduction with respect to the performance of methods proposed in the literature.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Hierarchical multiclassifier; Machine learning; Splice-junction

1. Introduction

Splice sites are locations in DNA which separate protein-coding regions (exons) from noncoding regions (introns). Accurate splice site detectors thus form important components of computational gene finders. Splice junctions (Maisheng and Wang, 1998) are points on a DNA sequence at which 'superfluous' DNA is removed during the process of protein creation in higher organisms. The problem posed is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). The computational identification of genes and coding regions in DNA sequences is a major goal for molecular biology. The reliable detection of genes and protein-coding regions in DNA sequences is critical for the success of computational gene discovery (Noordewier et al., 1991). According to the central dogma of molecular biology, DNA codes for the production of messenger RNA (mRNA) during the transcription process.

* Corresponding author. Fax: +39 0547 338890. E-mail addresses: [email protected] (A. Lumini), lnanni@deis.unibo.it (L. Nanni).

0167-8655/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2006.01.013

Messenger RNA carries coded information to ribosomes. The ribosomes ''read'' this information and use it for protein synthesis during the translation process. This problem can also be posed as finding a classification rule: given a position in the middle of a window of DNA sequence elements (nucleotides), decide whether it is an 'intron–exon' boundary ('IE'), an 'exon–intron' boundary ('EI'), or 'neither' ('N'). Many works (Maisheng and Wang, 1998; Thivierge and Schultz, 2002; Rampone, 1998) show that machine learning techniques perform better than classification based on canonical pattern matching. In (Noordewier et al., 1991) the application of a knowledge-based artificial neural network (KBANN) approach is described; KBANN is a hybrid method that combines explanation-based and empirical learning methods. In (Pertea et al., 2001) a probabilistic model was presented; this method estimates position-specific probabilities of splice sites by computing likelihoods of candidate signal sequences. At present, the most successful splice site prediction techniques are (Ho and Rajapakse, 2003; Rampone, 1998). In (Rampone, 1998) the authors present BRAIN, an algorithm that infers Boolean formulae from examples and considers the splicing rules as disjunctive normal form (DNF) formulae.


The result is refined by means of a neural network and combined with a discriminant analysis procedure. In (Ho and Rajapakse, 2003) the authors present a hybrid approach consisting of two first-order and two second-order Markov chain models at the first stage and of a three-layer neural network at the second stage.

In this work we show that the splice-junction problem can be efficiently solved using a hierarchical multiclassifier (HM) architecture. The proposed method is hierarchical in that at each step a classifier is employed which processes some patterns and rejects the others: only the patterns rejected at each step are passed to the next classifier. The system is a multiclassifier since it integrates several data-driven models for the same problem. In the first step of our HM we adopt a subspace classifier (Oja, 1983); patterns rejected at this step are classified using an edit distance classifier (Levenshtein, 1965); finally, the remaining patterns are classified using a combination of linear support vector machines (Duda et al., 2001).

There are many ways to build a system of hierarchical multiclassifiers. A typical example is a system where the set of classes is organized in a hierarchical way and different classifiers are trained in order to drive the input pattern towards the most specific classifier to be used for the final classification (Simon et al., 1999). Another example is a system where only a subset of the input features is given as input to a first-level classifier and, if the final classification cannot be performed with a sufficient level of confidence, the remaining features are used at further classification stages (Rokui and Shimodaira, 1999). Another approach, where a hierarchical system is used in order to improve performance in terms of both accuracy and response time, requires that at each step a classifier with rejection is used to classify the patterns with high confidence; rejected patterns are forwarded to a more complex and usually slower classifier for classification (or further rejection to the next stage). Typically, the system as a whole achieves better classification performance than the single classifiers, at the cost of a slower response time (Giusti et al., 2002).

In this work, we define a four-stage pattern recognition scheme consisting of a hierarchy of classifiers with rejection. It has been proven (Giusti et al., 2002) that the probability of error for a hierarchical system, given a rejection threshold for the classifiers at each level, can be expressed as the sum of the optimal Bayes error and the error rates of the different classifiers (computed on the patterns effectively classified, i.e. not rejected). This means that, in principle, we do not lose the possibility of reaching the optimal Bayes error even if the stand-alone classifiers are not optimal.

The rest of the paper is organized as follows: in Section 2, we review and compare several machine learning methods tested in this work; in Section 3, the new hierarchical multiclassifier is detailed; in Section 4, the results of the experiments are discussed; and finally, in Section 5, we draw some conclusions.


2. Machine learning approaches

In the following, a brief description is given of the feature extraction methodologies, feature transformations, classifiers and ensemble methods combined and tested in this work.

2.1. Feature extraction

Feature extraction is a process that extracts a set of new features from the original pattern representation through some functional mapping. In this case a pattern is a DNA sequence of length m = 60 over an alphabet of eight symbolic variables: the four kinds of nucleotides ('A', 'G', 'T', 'C') and four characters ('D', 'N', 'S', 'R') which indicate ambiguity among the first four characters according to the following rules:

• 'D' = 'A', 'G' or 'T'
• 'N' = 'A', 'G', 'C' or 'T'
• 'S' = 'C' or 'G'
• 'R' = 'A' or 'G'

In the orthonormal encoding the symbolic variables representing the nucleotides are replaced by eight binary indicator variables: each symbolic variable is then represented by an 8-bit vector with seven bits set to zero and one bit set to one, so that each vector is orthogonal to all the others (see Table 1).

Table 1
The orthonormal sparse encoding

Characters   8-bit vector
'A'          10000000
'G'          01000000
'T'          00100000
'C'          00010000
'D'          00001000
'N'          00000100
'S'          00000010
'R'          00000001

Table 2
Codons that must not be contained in class 'neither'

TGG   GGG   GGA   TTA   CAG   AGG   GGT
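To make the encoding concrete, the sketch below shows one way to turn a symbol window into the binary vector described above. It is a minimal illustration, not the authors' code; the alphabet order follows Table 1 and the function name `orthonormal_encode` is ours.

```python
import numpy as np

# Alphabet order taken from Table 1; each symbol maps to one of 8 orthogonal bits.
ALPHABET = ['A', 'G', 'T', 'C', 'D', 'N', 'S', 'R']
INDEX = {s: i for i, s in enumerate(ALPHABET)}

def orthonormal_encode(sequence: str) -> np.ndarray:
    """Encode a DNA window as a concatenation of 8-bit one-hot vectors;
    a 60-symbol window gives a 60 x 8 = 480-dimensional binary vector."""
    vec = np.zeros(len(sequence) * len(ALPHABET))
    for pos, symbol in enumerate(sequence.upper()):
        vec[pos * len(ALPHABET) + INDEX[symbol]] = 1.0
    return vec

# Example with a truncated window; real patterns in the IP data set are 60 symbols long.
x = orthonormal_encode("CAGGTAAGT")
print(x.shape)  # (72,) for this 9-symbol example
```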


Feature ranking is a process for sorting the features and retaining only a subset of the initial features. In this paper we use a simple feature ranking which sorts the features so as to maximize the distance between the centroids of the different classes of patterns (Duda et al., 2001). For each feature f (f = 1, ..., n) we compute the scalar Mahalanobis distance d_{M_{c,f}} = d_M(\mu_{c,f}, S_f) between the mean \mu_{c,f} of the training patterns of each class c, out of a total of N_c classes (in our case N_c = 3: 'IE', 'EI', 'N'), and the set of all the training patterns S. The Mahalanobis distance between a pattern x (\mu_{c,f} in our case) and a set S of patterns with mean \mu_S and covariance matrix C_S is given by

d_M(x, S) = (x - \mu_S)^T C_S^{-1} (x - \mu_S)    (1)

Features are then ranked according to the following inter-class separability measure:

DistF(f) = \sum_{i=1}^{N_c} \sum_{j=i}^{N_c} | d_{M_{i,f}} + d_{M_{j,f}} |    (2)

In this paper, we retain only the features corresponding to the 100 highest values of DistF.

Feature transformation is a process through which a new set of features is created from the original features through some functional mapping. In this work we adopt the Karhunen–Loève (KL) transform (Duda et al., 2001): this transform projects n-dimensional data onto a lower-dimensional subspace in a way that is optimal in a sum-squared sense. It is known that KL is the best linear transform for dimensionality reduction. Given n-dimensional data points x, the goal of the KL transform is to reduce the dimensionality of the observed vector by finding k principal axes, denoted as principal components, which are given by the eigenvectors \Phi = [\phi_1, \phi_2, ..., \phi_k] associated with the k largest eigenvalues of the covariance matrix C_S of the training set. A map between the original space and the reduced eigenspace is performed by means of the projection operator, where the projection z of a vector x \in R^n into the reduced space R^k is

z = \Phi^T (x - \mu_S)    (3)

where \mu_S is the mean of the training set S.
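The per-feature ranking of Eqs. (1) and (2) can be sketched as follows. This is an illustrative reading under our own naming (`dist_f`, `rank_features`) in which the scalar, per-feature Mahalanobis distance reduces to a variance-normalized squared distance of the class mean from the global mean; it is not the authors' implementation.

```python
import numpy as np

def dist_f(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Score each feature by summing, over class pairs, the scalar Mahalanobis
    distances of the class means from the whole training set, as in Eq. (2).
    X: (n_samples, n_features) encoded patterns; y: class labels."""
    mu = X.mean(axis=0)                      # global mean of each feature
    var = X.var(axis=0) + 1e-12              # per-feature variance (scalar C_S)
    classes = np.unique(y)
    # d_m[c, f]: scalar Mahalanobis distance of class c's mean from S, per feature
    d_m = np.array([(X[y == c].mean(axis=0) - mu) ** 2 / var for c in classes])
    n_c = len(classes)
    score = np.zeros(X.shape[1])
    for i in range(n_c):
        for j in range(i, n_c):
            score += np.abs(d_m[i] + d_m[j])
    return score

def rank_features(X: np.ndarray, y: np.ndarray, keep: int = 100) -> np.ndarray:
    """Indices of the `keep` features with the highest DistF values."""
    return np.argsort(dist_f(X, y))[::-1][:keep]
```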

2.2. Classification

A classifier is a component that uses the feature vector provided by the feature extraction or transformation to assign a pattern to a class. In our hierarchical multiclassifier the following classifiers are used:

Fig. 1. Projection z of a vector x in a two-dimensional subspace.

• SubSpace classifier (Oja, 1983). For each class we create one KL subspace, with the aim of capturing the intra-class variance. Therefore we have three KL subspaces, one for each of the subsets (S_EI, S_IE, S_N) of the training set related to the three classes ('EI', 'IE', 'N'). All the objects in each subspace are normalized such that their Euclidean distance to the origin is one; this normalization makes it possible to employ the norm of the projection of a pattern onto a subspace as a similarity measure. The projection z_j of a vector x \in R^n into the subspace of class j is evaluated by the projection operator defined in Eq. (3). The norm of the projection (Fig. 1) of a pattern onto each eigenspace is used as the similarity measure between the input vector and the class related to the subspace. The input vector is then classified according to the maximal similarity value

\arg\max_{j=1,\ldots,N_c} \left( \log(\|z_j\|^2) - \log(1 - \|z_j\|^2) \right)    (4)

where N_c is the number of classes.

• Linear support vector machines (LSVM). The goal of this two-class classifier is to establish the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. We briefly recall the basic notions of the theory of LSVMs (Cristianini and Shawe-Taylor, 2000). Each pattern x_i belongs to one of two classes and thus is given a label y_i. The set S is linearly separable if there exists w such that

y_i (w \cdot x_i + b) \geq 1, \quad i = 1, 2, \ldots, \#S    (5)

Given a linearly separable set S, the optimal separating hyperplane is the separating hyperplane for which the distance of the closest point of S is maximum. Since the distance of the closest point equals 1/\|w\|, the optimal separating hyperplane can be regarded as the solution of the problem

\min_{w,b} \ \tfrac{1}{2}\, w \cdot w \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \ i = 1, 2, \ldots, \#S    (6)

It is important to note that the parameter b enters the constraints but not the function to be minimized. The quantity 2/\|w\|, the lower bound of the minimum distance between points of different classes, is called the ''margin''.

• Edit distance classifier (EDC). The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is a substitution, insertion or deletion of a letter. The edit distance is coupled with a nearest-neighbour classifier in order to classify a new pattern.

3. Hierarchical multiclassifier

We develop a four-phase hierarchical structure for the classification of patterns, where only the patterns rejected at one step are evaluated by the classifier of the next step.
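The overall control flow of such a rejection cascade can be sketched as follows. This is a minimal sketch under our own naming (`Stage`, `classify_cascade`); the classifier objects are placeholders and the rule-based checks described below are omitted, so it should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class Stage:
    """One level of the hierarchy: a classifier plus a rejection threshold."""
    name: str
    predict: Callable[[Sequence[float]], str]        # returns a class label
    confidence: Callable[[Sequence[float]], float]   # higher means more reliable
    threshold: Optional[float] = None                # None: never reject (last stage)

def classify_cascade(x: Sequence[float], stages: List[Stage]) -> str:
    """Classify a pattern with the first stage whose confidence exceeds its
    threshold; otherwise pass (reject) the pattern to the next stage."""
    for stage in stages:
        if stage.threshold is None or stage.confidence(x) >= stage.threshold:
            return stage.predict(x)
    raise RuntimeError("the last stage must have threshold=None")

# Illustrative wiring of the four phases (classifier objects and the edit-distance
# stage's acceptance test are placeholders; t1 = 0.125 and t2 = 1 follow Section 3):
# stages = [Stage("subspace", subspace.predict, subspace.confidence, 0.125),
#           Stage("edit distance + rule", edc.predict, edc.confidence, 0.5),
#           Stage("LSVM 2-most", lsvm2most.predict, lsvm2most.confidence, 1.0),
#           Stage("LSVM", lsvm.predict, lsvm.confidence, None)]
# label = classify_cascade(x, stages)
```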


To determine an order for the classifiers we adopt the following greedy rule: we rank the classifiers by their precision in the classification of the patterns with a high confidence value. The first classifier is a subspace classifier, due to its better performance with respect to LSVM on patterns classified with a high confidence value (see Section 4.3). The edit distance classifier combined with a syntactic rule, employed in the second phase of our hierarchical multiclassifier, is useful mainly to classify a limited number of patterns with a very low error rate (see Table 6). In phase 3 we find the two most likely classes for each pattern, and we discriminate between those three pairs of classes by ad hoc LSVMs coupled with some rules for splice-junction determination (Thivierge and Schultz, 2002). Considering that an LSVM classifier gives higher performance than a subspace classifier on patterns classified with low confidence, we adopt this classifier in the last phase, where no rejection is allowed.

Fig. 2. Hierarchical classifier proposed. [Block diagram: orthonormal encoding + feature ranking feeds the SubSpace classifier (phase 1); rejected patterns go to the edit distance classifier plus rule (phase 2); rejected patterns go through KL to LSVM 2-most with LSVM 1, LSVM 2 and LSVM 3 (phase 3); the remaining rejected patterns go to the final LSVM (phase 4). Legend: feature extraction, feature transformation, classifier, sequence arrow, classified patterns, rejected patterns.]

The algorithm defining the structure, detailed in Fig. 2, is composed of the following phases:

• The first phase is composed of the subspace classifier. We classify only the patterns whose similarity is larger than a fixed threshold t1 (we use t1 = 0.125 in our experiments), and reject the others. The confidence value adopted in our classification approach is given by the difference between the two highest similarity values given by the classifier.

• The second phase is composed of the edit distance classifier and a rule. For this particular problem the class 'neither' is shift invariant, while the other classes are not. Shift invariance means that the category remains unchanged if we shift the pattern left or right by one position. For instance, if the sequence ''CGT...AAT'' belongs to class 'neither', then all its subsequences belong to the same category. The other categories, however, are not shift invariant, because we believe the splice sites to occur at one specific site and not in nearby sites.


The edit distance classifier gives good performance for patterns belonging to the shift-invariant class, while it is not reliable when it assigns a pattern to a shift-variant class. For example, if a new pattern is near (with respect to the edit distance) to a pattern of the shift-invariant class we can reasonably assume that it belongs to the same class; on the contrary, if the new pattern is near to a pattern of a shift-variant class, we cannot make any assumption with a high degree of certainty. Starting from this consideration, we design a classifier that assigns to the class 'neither' each pattern classified as 'neither' by EDC, and rejects the others. The dataset documentation indicates that much better performance is generally observed if the attributes closest to the junction are used. In our case, this means that in this step we use only the features between @-5 and @+5, where @-5 indicates 5 symbols before the potential splice point. A possible method to reduce the error of this step is to reject the patterns that violate the following rule: ''Patterns classified as 'N' must not contain the codons (sequences of three bases) shown in Table 2 in positions @-3 @-2 @-1 @+1''. From a statistical study on the training set, we noted that the codons listed in Table 2 are rarely contained in the 'neither' class.

• The third phase is composed of LSVM 2-most. We trained an LSVM classifier (using the 1-vs-1 approach (Duda et al., 2001) to solve this multi-class problem), using orthonormal encoding as feature extractor and KL as feature transformation (reducing the feature vector to 50 dimensions), and storing, for each pattern, the 2 most likely classes (obtained using the confidence value of the LSVM trained in this step). Then we discriminate between the two most likely classes of each pattern by:



– LSVM1 (when the 2 most likely classes are 'EI' and 'N'). We train an LSVM classifier on the feature vectors reduced to 50 dimensions by KL. We classify only the patterns whose similarity is larger than a fixed threshold t2 (t2 = 1 in our experiments).

– LSVM2 (when the 2 most likely classes are 'EI' and 'IE'). We train an LSVM classifier as in the step LSVM1. In this step we classify only the patterns whose similarity is larger than the fixed threshold t2. Moreover, the following rules, partially matching those shown in (Thivierge and Schultz, 2002), are applied: ''Patterns classified as 'IE' but with a similarity lower than half of the threshold t2 are classified only if between @-16 and @-5 there are more than 6 'C' or 'T''' and ''All the patterns containing the codons ''TAA'' or ''TAG'' in positions @+1 @+2 @+3 are classified as 'IE'''.

– LSVM3 (when the 2 most likely classes are 'IE' and 'N'). We train an LSVM classifier as in the step LSVM1. In this step we classify only the patterns whose similarity is larger than the fixed threshold t2. Moreover, the following rules, partially matching those shown in (Thivierge and Schultz, 2002), are applied: ''Patterns classified as 'IE' but with a similarity lower than half of the threshold t2 are classified only if between @-16 and @-5 there are at least 5 'C''', ''Patterns classified as 'N' but with a similarity lower than a prefixed threshold (t3 = 0.35 in our experiments) are classified only if between @-16 and @-5 there are less than 3 'C''' and ''All patterns containing the codons ''TAA'' or ''TAG'' in positions @+1 @+2 @+3 are classified as 'IE'''.

• The fourth phase is composed of LSVM. The patterns rejected by the previous steps are finally classified by an LSVM classifier trained on the feature vectors reduced to 50 dimensions by KL.
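As an illustration of how the hand-crafted rules above can be expressed, the sketch below checks the codon and pyrimidine-count conditions on a 60-symbol window. It is a minimal sketch: the assumption that the splice point @ falls between positions 30 and 31 of the window (0-based index 29/30), and the helper names (`at`, `count_in_range`, `violates_neither_codon_rule`), are ours, not taken from the paper.

```python
# Codons from Table 2 that are rarely observed in class 'neither'.
FORBIDDEN_N_CODONS = {"TGG", "GGG", "GGA", "TTA", "CAG", "AGG", "GGT"}

def at(window: str, offset: int, splice_index: int = 30) -> int:
    """Map an @-relative offset (e.g. -5, +1) to a 0-based string index,
    assuming the splice point sits between window[29] and window[30]."""
    return splice_index + offset - 1 if offset > 0 else splice_index + offset

def violates_neither_codon_rule(window: str) -> bool:
    """True if any codon starting in positions @-3 .. @+1 is in Table 2
    (such a pattern should not be accepted as 'neither' in phase 2)."""
    starts = [at(window, o) for o in (-3, -2, -1, +1)]
    return any(window[s:s + 3] in FORBIDDEN_N_CODONS for s in starts)

def count_in_range(window: str, lo: int, hi: int, letters: str) -> int:
    """Count occurrences of the given letters between @lo and @hi inclusive."""
    idx = [at(window, o) for o in range(lo, hi + 1) if o != 0]
    return sum(1 for i in idx if window[i] in letters)

# Example checks corresponding to the phase-3 rules (LSVM2/LSVM3):
# count_in_range(w, -16, -5, "CT") > 6           -> accept a low-confidence 'IE'
# w[at(w, +1):at(w, +1) + 3] in {"TAA", "TAG"}   -> force class 'IE'
```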

4. Experimental results

The method has been tested on the ''IP data set'', a benchmark set of human splice site data from the UCI Machine Learning Repository (Blake and Merz, 1998). It contains 765 acceptor sites, 767 donor sites and 1654 decoys. The task is donor/acceptor classification, given a position in the middle of a window of 60 DNA letters as input. There are no missing attribute values. Each instance belongs to one of 3 classes ('EI', 'IE' or 'N') and each of the 60 fields is filled by one of the eight characters listed in Section 2. On this dataset, we performed 10 tests, each time randomly resampling training and test sets (containing 2000 and 1186 patterns, respectively) as in (Rampone, 1998), but maintaining the distribution of the patterns in the three classes. The results reported refer to the average classification accuracy achieved throughout the 10 experiments.

We use as indicators of the predictive performance the error rate and the correlation coefficient (CC), a measure that takes into account the relationship between correctly predicted positives and negatives as well as false positives and false negatives:

CC_j = \frac{P_j N_j - \bar{P}_j \bar{N}_j}{\sqrt{(N_j + \bar{N}_j)(N_j + \bar{P}_j)(P_j + \bar{N}_j)(P_j + \bar{P}_j)}}

where P_j and N_j are the correctly predicted splicing and non-splicing sites for class j, respectively, \bar{P}_j and \bar{N}_j are similarly the incorrectly predicted sites, and j ∈ {'IE', 'EI'} (the class 'N' is not considered for this indicator).
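The correlation coefficient above is the Matthews correlation computed from the per-class confusion counts; a direct transcription might look like the following (a sketch with our own argument names, not the authors' evaluation code).

```python
from math import sqrt

def correlation_coefficient(p: int, n: int, p_bar: int, n_bar: int) -> float:
    """CC for one class: p/n are correctly predicted splicing/non-splicing sites,
    p_bar/n_bar the incorrectly predicted ones (false positives/negatives)."""
    denom = sqrt((n + n_bar) * (n + p_bar) * (p + n_bar) * (p + p_bar))
    return (p * n - p_bar * n_bar) / denom if denom > 0 else 0.0

# Example with arbitrary counts (not taken from the paper's experiments):
print(round(correlation_coefficient(740, 1100, 20, 25), 3))
```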

4.1. Literature comparison

In Tables 3–5 we compare the classification error rate, the correlation coefficient and the variance of the performance obtained by our hierarchical method (HM), by the systems proposed in (Thivierge and Schultz, 2002) (KBANN, MLP, ID3, NN) and (Rampone, 1998) (NN-BRAIN, BRAIN), and by some stand-alone classifiers (used in our HM): a subspace classifier (SUBSPACE); a linear SVM classifier (RANKSVM) trained using only the best 100 features corresponding to the highest values of DistF (see Section 2.1), reduced to a vector of 50 dimensions by KL; and another linear SVM classifier (SVM) trained using orthonormal encoding as feature extractor (without the feature ranking) and KL as feature transformation (reducing the feature vector to 50 dimensions).

Table 3
Error rates produced by HM and by various machine learning algorithms

System      'EI' (%)   'IE' (%)   'N' (%)
HM          0.76       1.59       1.60
SUBSPACE    1.68       6.25       1.61
RANKSVM     0.86       1.86       1.71
SVM         1.65       1.99       1.90
NN-BRAIN    2.6        4.3        n.d.
BRAIN       5          4          4
KBANN       7.6        8.5        4.6
MLP         5.7        10.7       5.3
ID3         10.6       14         8.8
NN          11.6       9.1        31.1

Table 4
Correlation coefficient produced by HM, NN-BRAIN and the stand-alone classifiers

System      'EI'    'IE'
HM          0.939   0.908
SUBSPACE    0.83    0.85
RANKSVM     0.93    0.904
SVM         0.90    0.89
NN-BRAIN    0.93    0.87


Table 5
Variance of the performance produced by HM and the stand-alone classifiers

System      Variance
HM          0.47
SUBSPACE    0.5
RANKSVM     0.5
SVM         0.8

Table 7
Classification performance of RANKSVM, with a fixed rejection rate (100% - GC)

GC (%)   Global error rate (%)   Global CC EI   Global CC IE
34.35    0.9                     0.947          0.88
60.8     1.51                    0.94           0.96
84.8     3                       0.94           0.94
100      4.4                     0.903          0.904

This last classifier has been tested in order to prove that feature ranking is a major step for improving performance. These results show that our hierarchical multiclassifier outperforms all the methods proposed in the literature by Thivierge and Schultz (2002) and Rampone (1998) as well as the considered stand-alone classifiers. In Table 6 the classification performance of each step of the new hierarchical classifier is summarized: the local error rate is evaluated considering only the patterns effectively classified at each step, while the global error rate is the cumulative error obtained considering all the patterns classified up to that step; analogously, the ''local classified'' (LC) value is the percentage of the whole set of patterns classified at each step, while ''global classified'' (GC) is the cumulative percentage of classified patterns up to each step. It is interesting to note that about 15% of the patterns can be considered ''difficult'', since they generate the largest part of the total error rate. The introduced method allows a slight improvement (as shown in Tables 3 and 6) in comparison with the best ''stand-alone'' method, RANKSVM. The greater advantage in the use of our HM is a lower error rate and a higher correlation coefficient, at the same rejection rate, with respect to RANKSVM (as shown in Table 7): if we classify the 34.35% of patterns with the highest confidence, the error rate is 0.9% for RANKSVM while it is 0.44% for HM.

4.2. Tests varying the dimensionality of the training set and of the validation set

In Tables 8 and 9 the classification performances of each step of our classifier are reported for different sizes of the training and test sets. These results show that both the global error rate and the percentage of patterns classified in each phase (LC) of the HM classifier are similar for different sizes of the training and test sets.

Table 8
Classification performance of HM and RANKSVM (only the global error rate) using a training set of size 1686 and a test set of size 1500

Steps   Global error rate (%)   Local error rate (%)   GC (%)   LC (%)   RANKSVM global error rate (%)
1       0.2                     0.2                    35.54    35.54    2.34
2       0.8                     1.6                    59.42    23.88    2.1
3       1.3                     2.5                    84.8     25.38    2.95
4       4.7                     21.7                   100      15.2     4.7

Table 9
Classification performance of HM and RANKSVM (only the global error rate) using a training set of size 2436 and a test set of size 750

Steps   Global error rate (%)   Local error rate (%)   GC (%)   LC (%)   RANKSVM global error rate (%)
1       0.8                     0.8                    33.7     33.7     1.41
2       1.2                     1.7                    57.5     23.8     1.7
3       1.3                     1.6                    83.5     26       2.63
4       4.2                     20.2                   100      16.5     4.5

Moreover, these tests confirm that the size of the training set does not affect the conclusion about the advantage of our HM classifier versus RANKSVM: a lower error rate at the same rejection value.

All parameters have been set using a validation set (200 patterns per class) extracted from the training set. The size of the validation set has been chosen by experimental considerations: a lower number of patterns per class would cause overfitting, while a larger number would reduce too much the size of the training set (fixed to 2000 in these experiments). In Table 10 the global error rate of HM for different sizes of the validation set is reported.

4.3. Rejection curve for SUBSPACE and RANKSVM

In Fig. 3 the rejection curves (global error rate) of two methods used in our multiclassifier (SUBSPACE and RANKSVM) are reported.

Table 6
Classification performance at each phase of HM

Steps   GC (%)   Global error rate (%)   LC (%)   Local error rate (%)   Global CC EI   Global CC IE
1       34.35    0.44                    34.35    0.44                   0.99           0.985
2       60.8     0.85                    26.45    1.4                    0.989          0.987
3       84.8     1.27                    24       2.3                    0.98           0.981
4       100      3.95                    15.2     28.85                  0.939          0.908


Table 10
Classification performance of HM using different sizes for the validation set

Validation set size   Global error rate (%)
50                    4.8
100                   4.25
150                   3.95
200                   3.95
250                   3.95
300                   3.95
350                   4.2
200                   4.5
300                   4.8
350                   5
400                   5.2

Fig. 3. Rejection curves for SUBSPACE and RANKSVM. [Plot of global error rate (0–4%) versus rejection rate (90% down to 0%) for the RankSVM and Subspace classifiers.]
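A rejection curve of this kind can be traced by sweeping a confidence threshold and measuring the error only on the accepted patterns; the sketch below shows one way to do this (our own helper name and threshold grid, assuming per-pattern confidence scores and predictions are available).

```python
import numpy as np

def rejection_curve(confidence, y_pred, y_true, n_points: int = 10):
    """For a grid of rejection rates, keep only the most confident patterns
    and report the error rate measured on the accepted ones."""
    confidence = np.asarray(confidence)
    order = np.argsort(confidence)[::-1]          # most confident first
    y_pred = np.asarray(y_pred)[order]
    y_true = np.asarray(y_true)[order]
    curve = []
    for reject in np.linspace(0.9, 0.0, n_points):
        kept = int(round(len(y_true) * (1.0 - reject)))
        if kept == 0:
            curve.append((reject, 0.0))
            continue
        err = np.mean(y_pred[:kept] != y_true[:kept])
        curve.append((reject, 100.0 * err))       # error rate in percent
    return curve
```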

It is possible to note the better performance of the SUBSPACE classifier at a higher rejection rate: this result justifies our choice of using SUBSPACE before RANKSVM in HM.

5. Conclusions

The problem addressed in this paper is to recognize splice-junction sites, given a sequence of DNA letters. In this work, we proposed a hierarchical multiclassifier and showed by experimental results that our method is able to improve the classification accuracy and the correlation coefficient with respect to other state-of-the-art methods. Moreover, our tests show that the size of the training set does not affect the conclusion about the advantage of our HM classifier versus other stand-alone classifiers: a lower error rate at the same rejection value.

References

Blake, C.L., Merz, C.J., 1998. The UCI Repository of Machine Learning Databases (Also Contains Software). University of California, Department of Information and Computer Science, Irvine, CA.


Available from: .

Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.

Duda, R., Hart, P., Stork, D., 2001. Pattern Classification. Wiley, New York.

Giusti, N., Masulli, F., Sperduti, A., 2002. Theoretical and experimental analysis of a two-stage system for classification. IEEE Trans. PAMI 24 (7), 893–904.

Ho, L., Rajapakse, J., 2003. Splice site detection with a higher-order Markov model implemented on a neural network. Genome Inform. 14, 64–72.

Levenshtein, V.I., 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163 (4), 845–848.

Maisheng, Y., Wang, J., 1998. Algorithms for splicing junction donor recognition in genomic DNA sequences. IEEE Internat. Joint Symp. Intell. Syst., 169–176.

Noordewier, M.O., Towell, G.G., Shavlik, J.W., 1991. Training knowledge-based neural networks to recognize genes in DNA sequences. Adv. Neural Inform. Process. Syst., vol. 3. Morgan Kaufmann.

Oja, E., 1983. Subspace Methods of Pattern Recognition. Research Studies Press Ltd., Letchworth, England.

Pertea, M., XiaoYing, L., Salzberg, M., 2001. GeneSplicer: a new computational method for splice site detection. Nucleic Acids Res. 29, 1185–1195.

Rampone, S., 1998. Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 14 (8), 676–684.

Rokui, J., Shimodaira, H., 1999. Multistage building learning based on misclassification measure. Proc. Internat. Conf. Artificial Neural Networks, 221–226.

Simon, S., Kestler, H.A., Baune, A., Schwenker, F., Palm, G., 1999. Object classification with simple visual attention and a hierarchical neural network for subsymbolic–symbolic coupling. Proc. IEEE Internat. Sympos. Computat. Intell. Robotics Automation, 244–249.

Thivierge, J.P., Schultz, T.R., 2002. Finding relevant knowledge: KBCC applied to DNA splice junction determination. IEEE Internat. Joint Conf. Neural Networks, 1401–1405.
