Identifying splice-junction sequences by hierarchical multiclassifier


Pattern Recognition Letters 27 (2006) 1390–1396
www.elsevier.com/locate/patrec

Alessandra Lumini, Loris Nanni *

DEIS, IEIIT – CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Received 13 September 2004; received in revised form 27 April 2005; available online 2 May 2006

Communicated by L. Goldfarb

Abstract

Splice sites are locations in DNA which separate protein-coding regions (exons) from noncoding regions (introns). Recently, several works have approached the splice-junction problem with machine learning methods. In this work, we propose a hierarchical multiclassifier (HM) architecture whose results show a drastic error reduction with respect to the performance of methods proposed in the literature.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Hierarchical multiclassifier; Machine learning; Splice-junction

1. Introduction

Splice sites are locations in DNA which separate protein-coding regions (exons) from noncoding regions (introns). Accurate splice site detectors thus form important components of computational gene finders. Splice junctions (Maisheng and Wang, 1998) are points on a DNA sequence at which 'superfluous' DNA is removed during the process of protein creation in higher organisms. The problem posed is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). The computational identification of genes and coding regions in DNA sequences is a major goal for molecular biology. The reliable detection of genes and protein-coding regions in DNA sequences is critical for the success of computational gene discovery (Noordewier et al., 1991). According to the central dogma of molecular biology, DNA codes for the production of messenger RNA (mRNA) during the transcription process.

* Corresponding author. Fax: +39 0547 338890. E-mail addresses: [email protected] (A. Lumini), lnanni@deis.unibo.it (L. Nanni).

0167-8655/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2006.01.013

Messenger RNA carries coded information to ribosomes. The ribosomes ''read'' this information and use it for protein synthesis during the translation process. This problem can also be posed as finding a classification rule: given a position in the middle of a window of DNA sequence elements (nucleotides), decide whether it is an 'intron–exon' boundary ('IE'), an 'exon–intron' boundary ('EI'), or 'neither' ('N'). Many works (Maisheng and Wang, 1998; Thivierge and Schultz, 2002; Rampone, 1998) show that machine learning techniques perform better than classification based on canonical pattern matching. In (Noordewier et al., 1991) the application of a knowledge-based artificial neural network (KBANN) approach is described; KBANN is a hybrid method that combines explanation-based and empirical learning methods. In (Pertea et al., 2001) a probabilistic model was presented; this method estimates position-specific probabilities of splice sites by computing likelihoods of candidate signal sequences. At present, the most successful splice site prediction techniques are (Ho and Rajapakse, 2003; Rampone, 1998). In (Rampone, 1998) the authors present BRAIN, an algorithm that infers Boolean formulae from examples and considers the splicing rules as disjunctive normal form (DNF) formulae.


The result is refined by means of a neural network and combined with a discriminant analysis procedure. In (Ho and Rajapakse, 2003) the authors present a hybrid approach consisting of two first-order and two second-order Markov chain models at the first stage and of a three-layer neural network at the second stage.

In this work we show that the splice-junction problem can be efficiently solved using a hierarchical multiclassifier (HM) architecture. The proposed method is hierarchical in that at each step a classifier is employed which processes some patterns and rejects the others: only the patterns rejected at each step are passed to the next classifier. The system is a multiclassifier since it integrates several data-driven models for the same problem. In the first step of our HM we adopt a subspace classifier (Oja, 1983); patterns rejected at this step are classified using an edit distance classifier (Levenshtein, 1965); finally, the remaining patterns are classified using a combination of linear support vector machines (Duda et al., 2001).

There are many ways to build a system of hierarchical multiclassifiers. A typical example is a system where the set of classes is organized in a hierarchical way and different classifiers are trained in order to drive the input pattern towards the most specific classifier to be used for the final classification (Simon et al., 1999). Another example is a system where only a subset of the input features is given as input to a first-level classifier and, if the final classification cannot be performed with a sufficient level of confidence, the remaining features are used at further classification stages (Rokui and Shimodaira, 1999). Another approach, where a hierarchical system is used in order to improve performance in terms of both accuracy and response time, requires that at each step a classifier with rejection is used to classify the patterns with high confidence; rejected patterns are forwarded to a more complex and usually slower classifier for classification (or further rejection to the next stage). Typically, the system as a whole achieves better classification performance than the single classifiers, at the cost of a slower response time (Giusti et al., 2002).

In this work, we define a four-stage pattern recognition scheme consisting of a hierarchy of classifiers with rejection. It has been proven (Giusti et al., 2002) that the probability of error for a hierarchical system, given a rejection threshold for the classifiers at each level, can be expressed as the sum of the optimal Bayes error and the error rates of the different classifiers (computed on the patterns effectively classified, i.e. not rejected). This means that, in principle, we do not lose the possibility of reaching the optimal Bayes error even if the stand-alone classifiers are not optimal.

The rest of the paper is organized as follows: in Section 2, we review and compare several machine learning methods tested in this work; in Section 3, the new hierarchical multiclassifier is detailed; in Section 4, the results of the experiments are discussed; and finally, in Section 5, we draw some conclusions.


2. Machine learning approaches

In the following, a brief description is given of the feature extraction methodologies, feature transformations, classifiers and ensemble methods combined and tested in this work.

2.1. Feature extraction

Feature extraction is a process that extracts a set of new features from the original pattern representation through some functional mapping. In this case a pattern is a DNA sequence of length m = 60 over an alphabet of eight symbolic variables: the four kinds of nucleotides ('A', 'G', 'T', 'C') and four characters ('D', 'N', 'S', 'R') which indicate ambiguity among the first four characters according to the following rules:

• 'D' = 'A', 'G' or 'T'
• 'N' = 'A', 'G', 'C' or 'T'
• 'S' = 'C' or 'G'
• 'R' = 'A' or 'G'

In the orthonormal encoding the symbolic variables representing the nucleotides are replaced by eight binary indicator variables: each symbolic variable is then represented by an 8-bit vector with seven bits set to zero and one bit set to one, so that each vector is orthogonal to all the others (see Table 1).

Table 1
The orthonormal sparse encoding

Characters   8-bit vector
'A'          10000000
'G'          01000000
'T'          00100000
'C'          00010000
'D'          00001000
'N'          00000100
'S'          00000010
'R'          00000001

Table 2
Codons that must not be contained in class 'neither'

TGG   GGG   GGA   TTA   CAG   AGG   GGT
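To make the encoding concrete, the sketch below shows one way to turn a symbol window into the binary vector described above. It is a minimal illustration, not the authors' code; the alphabet order follows Table 1 and the function name `orthonormal_encode` is ours.

```python
import numpy as np

# Alphabet order taken from Table 1; each symbol maps to one of 8 orthogonal bits.
ALPHABET = ['A', 'G', 'T', 'C', 'D', 'N', 'S', 'R']
INDEX = {s: i for i, s in enumerate(ALPHABET)}

def orthonormal_encode(sequence: str) -> np.ndarray:
    """Encode a DNA window as a concatenation of 8-bit one-hot vectors;
    a 60-symbol window gives a 60 x 8 = 480-dimensional binary vector."""
    vec = np.zeros(len(sequence) * len(ALPHABET))
    for pos, symbol in enumerate(sequence.upper()):
        vec[pos * len(ALPHABET) + INDEX[symbol]] = 1.0
    return vec

# Example with a truncated window; real patterns in the IP data set are 60 symbols long.
x = orthonormal_encode("CAGGTAAGT")
print(x.shape)  # (72,) for this 9-symbol example
```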


Feature ranking is a process for sorting the features and retaining only a subset of the initial features. In this paper we use a simple feature ranking which sorts the features so as to maximize the distance between the centroids of the different classes of patterns (Duda et al., 2001). For each feature f (f = 1, ..., n) we compute the scalar Mahalanobis distance d_{M_{c,f}} = d_M(\mu_{c,f}, S_f) between the mean \mu_{c,f} of the training patterns of each class c, out of a total of N_c classes (in our case N_c = 3: 'IE', 'EI', 'N'), and the set of all the training patterns S. The Mahalanobis distance between a pattern x (\mu_{c,f} in our case) and a set S of patterns with mean \mu_S and covariance matrix C_S is given by

d_M(x, S) = (x - \mu_S)^T C_S^{-1} (x - \mu_S)    (1)

Features are then ranked according to the following inter-class separability measure:

DistF(f) = \sum_{i=1}^{N_c} \sum_{j=i}^{N_c} | d_{M_{i,f}} + d_{M_{j,f}} |    (2)

In this paper, we retain only the features corresponding to the 100 highest values of DistF.

Feature transformation is a process through which a new set of features is created from the original features through some functional mapping. In this work we adopt the Karhunen–Loève (KL) transform (Duda et al., 2001): this transform projects n-dimensional data onto a lower-dimensional subspace in a way that is optimal in a sum-squared sense. It is known that KL is the best linear transform for dimensionality reduction. Given n-dimensional data points x, the goal of the KL transform is to reduce the dimensionality of the observed vector by finding k principal axes, denoted as principal components, which are given by the eigenvectors \Phi = [\phi_1, \phi_2, ..., \phi_k] associated with the k largest eigenvalues of the covariance matrix C_S of the training set. A map between the original space and the reduced eigenspace is performed by means of the projection operator, where the projection z of a vector x \in R^n into the reduced space R^k is

z = \Phi^T (x - \mu_S)    (3)

where \mu_S is the mean of the training set S.
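The per-feature ranking of Eqs. (1) and (2) can be sketched as follows. This is an illustrative reading under our own naming (`dist_f`, `rank_features`) in which the scalar, per-feature Mahalanobis distance reduces to a variance-normalized squared distance of the class mean from the global mean; it is not the authors' implementation.

```python
import numpy as np

def dist_f(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Score each feature by summing, over class pairs, the scalar Mahalanobis
    distances of the class means from the whole training set, as in Eq. (2).
    X: (n_samples, n_features) encoded patterns; y: class labels."""
    mu = X.mean(axis=0)                      # global mean of each feature
    var = X.var(axis=0) + 1e-12              # per-feature variance (scalar C_S)
    classes = np.unique(y)
    # d_m[c, f]: scalar Mahalanobis distance of class c's mean from S, per feature
    d_m = np.array([(X[y == c].mean(axis=0) - mu) ** 2 / var for c in classes])
    n_c = len(classes)
    score = np.zeros(X.shape[1])
    for i in range(n_c):
        for j in range(i, n_c):
            score += np.abs(d_m[i] + d_m[j])
    return score

def rank_features(X: np.ndarray, y: np.ndarray, keep: int = 100) -> np.ndarray:
    """Indices of the `keep` features with the highest DistF values."""
    return np.argsort(dist_f(X, y))[::-1][:keep]
```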

2.2. Classification

A classifier is a component that uses the feature vector provided by the feature extraction or transformation to assign a pattern to a class. In our hierarchical multiclassifier the following classifiers are used:

Fig. 1. Projection z of a vector x in a two-dimensional subspace.

• SubSpace classifier (Oja, 1983). For each class we create one KL subspace, with the aim of capturing the intra-class variance. Therefore we have three KL subspaces, one for each of the subsets (S_EI, S_IE, S_N) of the training set related to the three classes ('EI', 'IE', 'N'). All the objects in each subspace are normalized such that their Euclidean distance to the origin is one; this normalization makes it possible to employ the norm of the projection of a pattern onto a subspace as a similarity measure. The projection z_j of a vector x \in R^n into the subspace of class j is evaluated by the projection operator defined in Eq. (3). The norm of the projection (Fig. 1) of a pattern onto each eigenspace is used as the similarity measure between the input vector and the class related to the subspace. The input vector is then classified according to the maximal similarity value

\arg\max_{j=1,\ldots,N_c} \left( \log(\|z_j\|^2) - \log(1 - \|z_j\|^2) \right)    (4)

where N_c is the number of classes.

• Linear support vector machines (LSVM). The goal of this two-class classifier is to establish the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. We briefly recall the basic notions of the theory of LSVMs (Cristianini and Shawe-Taylor, 2000). Each pattern x_i belongs to one of two classes and thus is given a label y_i. The set S is linearly separable if there exists w such that

y_i (w \cdot x_i + b) \geq 1, \quad i = 1, 2, \ldots, \#S    (5)

Given a linearly separable set S, the optimal separating hyperplane is the separating hyperplane for which the distance of the closest point of S is maximum. Since the distance of the closest point equals 1/\|w\|, the optimal separating hyperplane can be regarded as the solution of the problem

\min_{w,b} \ \tfrac{1}{2}\, w \cdot w \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \ i = 1, 2, \ldots, \#S    (6)

It is important to note that the parameter b enters the constraints but not the function to be minimized. The quantity 2/\|w\|, the lower bound of the minimum distance between points of different classes, is called the ''margin''.

• Edit distance classifier (EDC). The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is a substitution, insertion or deletion of a letter. The edit distance is coupled with a nearest-neighbour classifier in order to classify a new pattern.

3. Hierarchical multiclassifier

We develop a four-phase hierarchical structure for the classification of patterns, where only the patterns rejected at one step are evaluated by the classifier of the next step.
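The overall control flow of such a rejection cascade can be sketched as follows. This is a minimal sketch under our own naming (`Stage`, `classify_cascade`); the classifier objects are placeholders and the rule-based checks described below are omitted, so it should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class Stage:
    """One level of the hierarchy: a classifier plus a rejection threshold."""
    name: str
    predict: Callable[[Sequence[float]], str]        # returns a class label
    confidence: Callable[[Sequence[float]], float]   # higher means more reliable
    threshold: Optional[float] = None                # None: never reject (last stage)

def classify_cascade(x: Sequence[float], stages: List[Stage]) -> str:
    """Classify a pattern with the first stage whose confidence exceeds its
    threshold; otherwise pass (reject) the pattern to the next stage."""
    for stage in stages:
        if stage.threshold is None or stage.confidence(x) >= stage.threshold:
            return stage.predict(x)
    raise RuntimeError("the last stage must have threshold=None")

# Illustrative wiring of the four phases (classifier objects and the edit-distance
# stage's acceptance test are placeholders; t1 = 0.125 and t2 = 1 follow Section 3):
# stages = [Stage("subspace", subspace.predict, subspace.confidence, 0.125),
#           Stage("edit distance + rule", edc.predict, edc.confidence, 0.5),
#           Stage("LSVM 2-most", lsvm2most.predict, lsvm2most.confidence, 1.0),
#           Stage("LSVM", lsvm.predict, lsvm.confidence, None)]
# label = classify_cascade(x, stages)
```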


To determine an order for the classifiers we adopt the following greedy rule: we rank the classifiers by their precision in the classification of the patterns with a high confidence value. The first classifier is a subspace classifier, due to its better performance with respect to LSVM on patterns classified with a high confidence value (see Section 4.3). The edit distance classifier combined with a syntactic rule, employed in the second phase of our hierarchical multiclassifier, is useful mainly to classify a limited number of patterns with a very low error rate (see Table 6). In phase 3 we find the two most likely classes for each pattern, and we discriminate between those three pairs of classes by ad hoc LSVMs coupled with some rules for splice-junction determination (Thivierge and Schultz, 2002). Considering that an LSVM classifier gives higher performance than a subspace classifier on patterns classified with low confidence, we adopt this classifier in the last phase, where no rejection is allowed.

Fig. 2. Hierarchical classifier proposed. [Block diagram: orthonormal encoding + feature ranking feeds the SubSpace classifier (phase 1); rejected patterns go to the edit distance classifier plus rule (phase 2); rejected patterns go through KL to LSVM 2-most with LSVM 1, LSVM 2 and LSVM 3 (phase 3); the remaining rejected patterns go to the final LSVM (phase 4). Legend: feature extraction, feature transformation, classifier, sequence arrow, classified patterns, rejected patterns.]

The algorithm defining the structure, detailed in Fig. 2, is composed of the following phases:

• The first phase is composed of the subspace classifier. We classify only the patterns whose similarity is larger than a fixed threshold t1 (we use t1 = 0.125 in our experiments), and reject the others. The confidence value adopted in our classification approach is given by the difference between the two highest similarity values given by the classifier.

• The second phase is composed of the edit distance classifier and a rule. For this particular problem the class 'neither' is shift invariant, while the other classes are not. Shift invariance means that the category remains unchanged if we shift the pattern left or right by one position. For instance, if the sequence ''CGT...AAT'' belongs to class 'neither', then all its subsequences belong to the same category. The other categories, however, are not shift invariant, because we believe the splice sites to occur at one specific site and not in nearby sites.


The edit distance classifier gives good performance for patterns belonging to the shift-invariant class, while it is not reliable when it assigns a pattern to a shift-variant class. For example, if a new pattern is near (with respect to the edit distance) to a pattern of the shift-invariant class we can reasonably assume that it belongs to the same class; on the contrary, if the new pattern is near to a pattern of a shift-variant class, we cannot make any assumption with a high degree of certainty. Starting from this consideration, we design a classifier that assigns to the class 'neither' each pattern classified as 'neither' by EDC, and rejects the others. The dataset documentation indicates that much better performance is generally observed if the attributes closest to the junction are used. In our case, this means that in this step we use only the features between @-5 and @+5, where @-5 indicates 5 symbols before the potential splice point. A possible method to reduce the error of this step is to reject the patterns that violate the following rule: ''Patterns classified as 'N' must not contain the codons (sequences of three bases) shown in Table 2 in positions @-3 @-2 @-1 @+1''. From a statistical study on the training set, we noted that the codons listed in Table 2 are rarely contained in the 'neither' class.

• The third phase is composed of LSVM 2-most. We trained an LSVM classifier (using the 1-vs-1 approach (Duda et al., 2001) to solve this multi-class problem), using orthonormal encoding as feature extractor and KL as feature transformation (reducing the feature vector to 50 dimensions), and storing, for each pattern, the 2 most likely classes (obtained using the confidence value of the LSVM trained in this step). Then we discriminate between the two most likely classes of each pattern by:



– LSVM1 (when the 2 most likely classes are 'EI' and 'N'). We train an LSVM classifier on the feature vectors reduced to 50 dimensions by KL. We classify only the patterns whose similarity is larger than a fixed threshold t2 (t2 = 1 in our experiments).

– LSVM2 (when the 2 most likely classes are 'EI' and 'IE'). We train an LSVM classifier as in the step LSVM1. In this step we classify only the patterns whose similarity is larger than the fixed threshold t2. Moreover, the following rules, partially matching those shown in (Thivierge and Schultz, 2002), are applied: ''Patterns classified as 'IE' but with a similarity lower than half of the threshold t2 are classified only if between @-16 and @-5 there are more than 6 'C' or 'T''' and ''All the patterns containing the codons ''TAA'' or ''TAG'' in positions @+1 @+2 @+3 are classified as 'IE'''.

– LSVM3 (when the 2 most likely classes are 'IE' and 'N'). We train an LSVM classifier as in the step LSVM1. In this step we classify only the patterns whose similarity is larger than the fixed threshold t2. Moreover, the following rules, partially matching those shown in (Thivierge and Schultz, 2002), are applied: ''Patterns classified as 'IE' but with a similarity lower than half of the threshold t2 are classified only if between @-16 and @-5 there are at least 5 'C''', ''Patterns classified as 'N' but with a similarity lower than a prefixed threshold (t3 = 0.35 in our experiments) are classified only if between @-16 and @-5 there are less than 3 'C''' and ''All patterns containing the codons ''TAA'' or ''TAG'' in positions @+1 @+2 @+3 are classified as 'IE'''.

• The fourth phase is composed of LSVM. The patterns rejected by the previous steps are finally classified by an LSVM classifier trained on the feature vectors reduced to 50 dimensions by KL.
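As an illustration of how the hand-crafted rules above can be expressed, the sketch below checks the codon and pyrimidine-count conditions on a 60-symbol window. It is a minimal sketch: the assumption that the splice point @ falls between positions 30 and 31 of the window (0-based index 29/30), and the helper names (`at`, `count_in_range`, `violates_neither_codon_rule`), are ours, not taken from the paper.

```python
# Codons from Table 2 that are rarely observed in class 'neither'.
FORBIDDEN_N_CODONS = {"TGG", "GGG", "GGA", "TTA", "CAG", "AGG", "GGT"}

def at(window: str, offset: int, splice_index: int = 30) -> int:
    """Map an @-relative offset (e.g. -5, +1) to a 0-based string index,
    assuming the splice point sits between window[29] and window[30]."""
    return splice_index + offset - 1 if offset > 0 else splice_index + offset

def violates_neither_codon_rule(window: str) -> bool:
    """True if any codon starting in positions @-3 .. @+1 is in Table 2
    (such a pattern should not be accepted as 'neither' in phase 2)."""
    starts = [at(window, o) for o in (-3, -2, -1, +1)]
    return any(window[s:s + 3] in FORBIDDEN_N_CODONS for s in starts)

def count_in_range(window: str, lo: int, hi: int, letters: str) -> int:
    """Count occurrences of the given letters between @lo and @hi inclusive."""
    idx = [at(window, o) for o in range(lo, hi + 1) if o != 0]
    return sum(1 for i in idx if window[i] in letters)

# Example checks corresponding to the phase-3 rules (LSVM2/LSVM3):
# count_in_range(w, -16, -5, "CT") > 6           -> accept a low-confidence 'IE'
# w[at(w, +1):at(w, +1) + 3] in {"TAA", "TAG"}   -> force class 'IE'
```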

4. Experimental results

The method has been tested on the ''IP data set'', a benchmark set of human splice site data from the UCI Machine Learning Repository (Blake and Merz, 1998). It contains 765 acceptor sites, 767 donor sites and 1654 decoys. The task is donor/acceptor classification, given a position in the middle of a window of 60 DNA letters as input. There are no missing attribute values. Each instance belongs to one of 3 classes ('EI', 'IE' or 'N') and each of the 60 fields is filled by one of the eight characters listed in Section 2. On this dataset, we performed 10 tests, each time randomly resampling training and test sets (containing 2000 and 1186 patterns, respectively) as in (Rampone, 1998), but maintaining the distribution of the patterns in the three classes. The results reported refer to the average classification accuracy achieved throughout the 10 experiments.

We use as indicators of the predictive performance the error rate and the correlation coefficient (CC), a measure that takes into account the relationship between correctly predicted positives and negatives as well as false positives and false negatives:

CC_j = \frac{P_j N_j - \bar{P}_j \bar{N}_j}{\sqrt{(N_j + \bar{N}_j)(N_j + \bar{P}_j)(P_j + \bar{N}_j)(P_j + \bar{P}_j)}}

where P_j and N_j are the correctly predicted splicing and non-splicing sites for class j, respectively, \bar{P}_j and \bar{N}_j are similarly the incorrectly predicted sites, and j ∈ {'IE', 'EI'} (the class 'N' is not considered for this indicator).
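The correlation coefficient above is the Matthews correlation computed from the per-class confusion counts; a direct transcription might look like the following (a sketch with our own argument names, not the authors' evaluation code).

```python
from math import sqrt

def correlation_coefficient(p: int, n: int, p_bar: int, n_bar: int) -> float:
    """CC for one class: p/n are correctly predicted splicing/non-splicing sites,
    p_bar/n_bar the incorrectly predicted ones (false positives/negatives)."""
    denom = sqrt((n + n_bar) * (n + p_bar) * (p + n_bar) * (p + p_bar))
    return (p * n - p_bar * n_bar) / denom if denom > 0 else 0.0

# Example with arbitrary counts (not taken from the paper's experiments):
print(round(correlation_coefficient(740, 1100, 20, 25), 3))
```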

4.1. Literature comparison

In Tables 3–5 we compare the classification error rate, the correlation coefficient and the variance of the performance obtained by our hierarchical method (HM), by the systems proposed in (Thivierge and Schultz, 2002) (KBANN, MLP, ID3, NN) and (Rampone, 1998) (NN-BRAIN, BRAIN), and by some stand-alone classifiers (used in our HM): a subspace classifier (SUBSPACE); a linear SVM classifier (RANKSVM) trained using only the best 100 features corresponding to the highest values of DistF (see Section 2.1), reduced to a vector of 50 dimensions by KL; and another linear SVM classifier (SVM) trained using orthonormal encoding as feature extractor (without the feature ranking) and KL as feature transformation (reducing the feature vector to 50 dimensions).

Table 3
Error rates produced by HM and by various machine learning algorithms

System      'EI' (%)   'IE' (%)   'N' (%)
HM          0.76       1.59       1.60
SUBSPACE    1.68       6.25       1.61
RANKSVM     0.86       1.86       1.71
SVM         1.65       1.99       1.90
NN-BRAIN    2.6        4.3        n.d.
BRAIN       5          4          4
KBANN       7.6        8.5        4.6
MLP         5.7        10.7       5.3
ID3         10.6       14         8.8
NN          11.6       9.1        31.1

Table 4
Correlation coefficient produced by HM, NN-BRAIN and the stand-alone classifiers

System      'EI'    'IE'
HM          0.939   0.908
SUBSPACE    0.83    0.85
RANKSVM     0.93    0.904
SVM         0.90    0.89
NN-BRAIN    0.93    0.87


Table 5
Variance of the performance produced by HM and the stand-alone classifiers

System      Variance
HM          0.47
SUBSPACE    0.5
RANKSVM     0.5
SVM         0.8

Table 7
Classification performance of RANKSVM, with a fixed rejection rate (100% - GC)

GC (%)   Global error rate (%)   Global CC EI   Global CC IE
34.35    0.9                     0.947          0.88
60.8     1.51                    0.94           0.96
84.8     3                       0.94           0.94
100      4.4                     0.903          0.904

This last classifier has been tested in order to prove that feature ranking is a major step for improving performance. These results show that our hierarchical multiclassifier outperforms all the methods proposed in the literature by Thivierge and Schultz (2002) and Rampone (1998) as well as the considered stand-alone classifiers. In Table 6 the classification performance of each step of the new hierarchical classifier is summarized: the local error rate is evaluated considering only the patterns effectively classified at each step, while the global error rate is the cumulative error obtained considering all the patterns classified up to that step; analogously, the ''local classified'' (LC) value is the percentage of the whole set of patterns classified at each step, while ''global classified'' (GC) is the cumulative percentage of classified patterns up to each step. It is interesting to note that about 15% of the patterns can be considered ''difficult'', since they generate the largest part of the total error rate. The introduced method allows a slight improvement (as shown in Tables 3 and 6) in comparison with the best ''stand-alone'' method, RANKSVM. The greater advantage in the use of our HM is a lower error rate and a higher correlation coefficient, at the same rejection rate, with respect to RANKSVM (as shown in Table 7): if we classify the 34.35% of patterns with the highest confidence, the error rate is 0.9% for RANKSVM while it is 0.44% for HM.

4.2. Tests varying the dimensionality of the training set and of the validation set

In Tables 8 and 9 the classification performances of each step of our classifier are reported for different sizes of the training and test sets. These results show that both the global error rate and the percentage of patterns classified in each phase (LC) of the HM classifier are similar for different sizes of the training and test sets.

Table 8
Classification performance of HM and RANKSVM (only the global error rate) using a training set of size 1686 and a test set of size 1500

Steps   Global error rate (%)   Local error rate (%)   GC (%)   LC (%)   RANKSVM global error rate (%)
1       0.2                     0.2                    35.54    35.54    2.34
2       0.8                     1.6                    59.42    23.88    2.1
3       1.3                     2.5                    84.8     25.38    2.95
4       4.7                     21.7                   100      15.2     4.7

Table 9
Classification performance of HM and RANKSVM (only the global error rate) using a training set of size 2436 and a test set of size 750

Steps   Global error rate (%)   Local error rate (%)   GC (%)   LC (%)   RANKSVM global error rate (%)
1       0.8                     0.8                    33.7     33.7     1.41
2       1.2                     1.7                    57.5     23.8     1.7
3       1.3                     1.6                    83.5     26       2.63
4       4.2                     20.2                   100      16.5     4.5

Moreover, these tests confirm that the size of the training set does not affect the conclusion about the advantage of our HM classifier versus RANKSVM: a lower error rate at the same rejection value.

All parameters have been set using a validation set (200 patterns per class) extracted from the training set. The size of the validation set has been chosen by experimental considerations: a lower number of patterns per class would cause overfitting, while a larger number would reduce too much the size of the training set (fixed to 2000 in these experiments). In Table 10 the global error rate of HM for different sizes of the validation set is reported.

4.3. Rejection curve for SUBSPACE and RANKSVM

In Fig. 3 the rejection curves (global error rate) of two methods used in our multiclassifier (SUBSPACE and RANKSVM) are reported.

Table 6
Classification performance at each phase of HM

Steps   GC (%)   Global error rate (%)   LC (%)   Local error rate (%)   Global CC EI   Global CC IE
1       34.35    0.44                    34.35    0.44                   0.99           0.985
2       60.8     0.85                    26.45    1.4                    0.989          0.987
3       84.8     1.27                    24       2.3                    0.98           0.981
4       100      3.95                    15.2     28.85                  0.939          0.908


Table 10
Classification performance of HM using different sizes for the validation set

Validation set size   Global error rate (%)
50                    4.8
100                   4.25
150                   3.95
200                   3.95
250                   3.95
300                   3.95
350                   4.2
200                   4.5
300                   4.8
350                   5
400                   5.2

Fig. 3. Rejection curves for SUBSPACE and RANKSVM. [Plot of global error rate (0–4%) versus rejection rate (90% down to 0%) for the RankSVM and Subspace classifiers.]
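A rejection curve of this kind can be traced by sweeping a confidence threshold and measuring the error only on the accepted patterns; the sketch below shows one way to do this (our own helper name and threshold grid, assuming per-pattern confidence scores and predictions are available).

```python
import numpy as np

def rejection_curve(confidence, y_pred, y_true, n_points: int = 10):
    """For a grid of rejection rates, keep only the most confident patterns
    and report the error rate measured on the accepted ones."""
    confidence = np.asarray(confidence)
    order = np.argsort(confidence)[::-1]          # most confident first
    y_pred = np.asarray(y_pred)[order]
    y_true = np.asarray(y_true)[order]
    curve = []
    for reject in np.linspace(0.9, 0.0, n_points):
        kept = int(round(len(y_true) * (1.0 - reject)))
        if kept == 0:
            curve.append((reject, 0.0))
            continue
        err = np.mean(y_pred[:kept] != y_true[:kept])
        curve.append((reject, 100.0 * err))       # error rate in percent
    return curve
```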

It is possible to note the better performance of the SUBSPACE classifier at a higher rejection rate: this result justifies our choice of using SUBSPACE before RANKSVM in HM.

5. Conclusions

The problem addressed in this paper is to recognize splice-junction sites, given a sequence of DNA letters. In this work, we proposed a hierarchical multiclassifier and showed by experimental results that our method is able to improve the classification accuracy and the correlation coefficient with respect to other state-of-the-art methods. Moreover, our tests show that the size of the training set does not affect the conclusion about the advantage of our HM classifier versus other stand-alone classifiers: a lower error rate at the same rejection value.

References

Blake, C.L., Merz, C.J., 1998. The UCI Repository of Machine Learning Databases (Also Contains Software). University of California, Department of Information and Computer Science, Irvine, CA.


Available from: .

Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.

Duda, R., Hart, P., Stork, D., 2001. Pattern Classification. Wiley, New York.

Giusti, N., Masulli, F., Sperduti, A., 2002. Theoretical and experimental analysis of a two-stage system for classification. IEEE Trans. PAMI 24 (7), 893–904.

Ho, L., Rajapakse, J., 2003. Splice site detection with a higher-order Markov model implemented on a neural network. Genome Inform. 14, 64–72.

Levenshtein, V.I., 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163 (4), 845–848.

Maisheng, Y., Wang, J., 1998. Algorithms for splicing junction donor recognition in genomic DNA sequences. IEEE Internat. Joint Symp. Intell. Syst., 169–176.

Noordewier, M.O., Towell, G.G., Shavlik, J.W., 1991. Training knowledge-based neural networks to recognize genes in DNA sequences. Adv. Neural Inform. Process. Syst., vol. 3. Morgan Kaufmann.

Oja, E., 1983. Subspace Methods of Pattern Recognition. Research Studies Press Ltd., Letchworth, England.

Pertea, M., XiaoYing, L., Salzberg, M., 2001. GeneSplicer: a new computational method for splice site detection. Nucleic Acids Res. 29, 1185–1195.

Rampone, S., 1998. Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 14 (8), 676–684.

Rokui, J., Shimodaira, H., 1999. Multistage building learning based on misclassification measure. Proc. Internat. Conf. Artificial Neural Networks, 221–226.

Simon, S., Kestler, H.A., Baune, A., Schwenker, F., Palm, G., 1999. Object classification with simple visual attention and a hierarchical neural network for subsymbolic–symbolic coupling. Proc. IEEE Internat. Sympos. Computat. Intell. Robotics Automation, 244–249.

Thivierge, J.P., Schultz, T.R., 2002. Finding relevant knowledge: KBCC applied to DNA splice junction determination. IEEE Internat. Joint Conf. Neural Networks, 1401–1405.
