Spectral Pattern Comparison Methods for Cancer Classification Based on Microarray Gene Expression Data


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 53, NO. 11, NOVEMBER 2006


Tuan D. Pham, Senior Member, IEEE, Dominik Beck, and Hong Yan, Fellow, IEEE

Abstract—We present, in this paper, two spectral pattern comparison methods for cancer classification using microarray gene expression data. The proposed methods differ from other current classifiers in the ways features are selected and pattern similarities are measured. In addition, these spectral methods do not require any data preprocessing, which is necessary for many other classification techniques. Experimental results using three popular microarray data sets demonstrate the robustness and effectiveness of the spectral pattern classifiers.

Index Terms—Classification, feature selection, microarrays, spectral distortions, vector quantization.

I. INTRODUCTION

Microarrays are a relatively new technology that provides novel insights into gene expression and gene regulation [1]–[4]. Microarray technology has been applied in diverse areas ranging from genetics and drug discovery to disciplines such as virology, microbiology, immunology, endocrinology, and neurobiology. Microarray-based methods are the most widely used technology for large-scale analysis of gene expression because they allow the simultaneous study of mRNA abundance for thousands of genes in a single experiment [5]. The generation of DNA microarray image spots involves the hybridization of two probes, one labelled with a fluorescent red dye and the other with a fluorescent green dye. The relative image intensity values of the red dye and the green dye on a particular spot of the arrays indicate the expression ratio for the corresponding gene of the two samples from which the mRNAs have been extracted. Thus, robust image processing of microarray spots plays an important role in microarray technology [6]–[8].

DNA microarray data consist of a large number of genes and a relatively small number of experimental samples. The number of genes on an array is in the order of thousands, and because this far exceeds the number of samples, dimension reduction is needed to allow efficient analysis by data classification techniques.

Manuscript received November 13, 2005; revised July 27, 2006. This paper was recommended by Guest Editor P.-C. Chung. T. D. Pham and D. Beck are with the Bioinformatics Applications Research Center and the School of Information Technology, James Cook University, Townsville QLD 4811, Australia (e-mail: [email protected]). H. Yan is with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, and also with the School of Electrical and Information Engineering, University of Sydney, Sydney, NSW 2006, Australia (e-mail: [email protected]). Digital Object Identifier 10.1109/TCSI.2006.884407

Many statistical and machine-learning techniques based on different computational methodologies have been applied

for cancer classification in microarray experiments. These techniques include linear discriminant analysis, k-nearest-neighbor algorithms, Bayes classifiers, decision trees, neural networks, and support vector machines [9]–[11]. Nevertheless, the common tasks of most classifiers are to perform feature selection and decision logic. Motivated by the observation that conventional statistical methods for pattern classification break down when there are more variables (genes) than there are samples, Nguyen and Rocke [12] proposed a partial least-squares method for classifying human tumor samples using microarray gene expression data. Zhou et al. [13] proposed a Bayesian approach for selecting the strongest genes based on microarray gene expression data and the logistic regression model for classifying and predicting cancer genes. Yeung et al. [14] reported that conventional methods for gene selection and classification do not take into account model uncertainty and use a single set of selected genes for prediction, and introduced a Bayesian model averaging (BMA) method, which considers the uncertainty by averaging over multiple sets of overlapping relevant genes. Furey et al. [15] applied support vector machines for the classification of cancer tissue samples or cell types using microarrays. Lee et al. [16] proposed a Bayesian model for gene selection for cancer classification using microarray data. Statnikov et al. [17] carried out a comprehensive evaluation of classification methods for cancer diagnosis based on microarray gene expression data.

In this study, we transform microarray data into spectral vectors. We then use the spectral difference, or spectral distortion, between a pair of spectra for pattern comparison. The proposed analysis and classification of microarrays involve the concept of pattern comparison, in which measures of similarity play the central role. Most measures applied to pattern comparison or cluster analysis are metric, distance-, or correlation-based functions [18]–[21]. There is rarely any research work on microarray data using the concept of distortion measures for pattern comparison. The spectral distortion measure approach that we discuss herein is distinct from spectral clustering [22]; cluster analysis is only one aspect covered in the design of a vector quantizer in the former approach. Based on this motivation, and having investigated the application of spectral distortion measures for protein sequence comparison [23], we address herein the applications of spectral distortion measures and spectral pattern classification methods as a potential approach for cancer classification based on microarray gene expression data.

This paper is organized as follows. In Section II, we describe how to extract spectral features of microarray gene expres-




sion data. In Section III, we briefly present the concepts of spectral distortion measures. In Section IV, we then discuss how to classify microarray data using a VQ-based decision rule. Section V illustrates the performance of the proposed approach, and comparisons with other methods, on cancer classification using three microarray data sets. Finally, Section VI summarizes the findings on the general aspects of the proposed method and suggests some issues for further research.

II. SPECTRAL FEATURES OF MICROARRAY DATA

A. Linear Prediction Coefficients

Let a microarray gene expression data matrix be denoted as X = [x_{gt}], g = 1, ..., G, t = 1, ..., T, where G is the number of genes and T the number of experiments. To make the subsequent mathematical expressions more convenient, we will write x(t) to refer to the experimental value of a particular gene at position (or time) t of the microarray matrix. The estimated experimental value of a particular gene at t, denoted as \tilde{x}(t), can be calculated as a linear combination of the past p microarray samples. This linear prediction can be expressed as

    \tilde{x}(t) = \sum_{i=1}^{p} a_i x(t - i)    (1)

where the terms a_i, i = 1, ..., p, are called the linear prediction coefficients (LPCs). The prediction error e(t) between the observed sample x(t) and the predicted value \tilde{x}(t) can be defined as

    e(t) = x(t) - \tilde{x}(t) = x(t) - \sum_{i=1}^{p} a_i x(t - i).    (2)

The prediction coefficients can be optimally determined by minimizing the sum of squared errors

    E = \sum_{t} e^2(t) = \sum_{t} \Big[ x(t) - \sum_{i=1}^{p} a_i x(t - i) \Big]^2.    (3)

To solve (3) for the prediction coefficients, we differentiate E with respect to each a_i and equate the result to zero. The result is a set of p linear equations with p unknowns, which can be expressed in matrix form as

    R a = r    (4)

where R is a p x p autocorrelation matrix, r is a p x 1 autocorrelation vector, and a is a p x 1 vector of prediction coefficients.

Thus, we have introduced an approach for extracting a spectral feature of microarray gene expression data, which will be used for data classification. We will discuss another kind of spectral feature for microarray gene expression data in the following subsection.

B. LPC Cepstral Coefficients

If we can determine the linear prediction coefficients for a microarray gene, then we can also extract another feature for gene expression data using the cepstral coefficients c_m, which are directly derived from the LPC coefficients. The LPC cepstral coefficients can be determined by the following recursion [24]:

    c_0 = \ln G_{LPC}^2    (5)

    c_m = a_m + \sum_{k=1}^{m-1} (k/m) c_k a_{m-k},  1 \le m \le p    (6)

    c_m = \sum_{k=m-p}^{m-1} (k/m) c_k a_{m-k},  m > p    (7)

where G_{LPC} is the LPC gain, whose squared term is given as

    G_{LPC}^2 = r(0) - \sum_{i=1}^{p} a_i r(i).    (8)

III. SPECTRAL DISTORTION MEASURES

Measuring the similarity or dissimilarity between two vectors or sequences is one of the most important tasks in the field of pattern comparison and recognition. The calculation of vector similarity is based on various developments of distance and distortion measures. In general, to calculate a distortion measure d(x, y) between two vectors x and y is to calculate the cost of reproducing any input vector x as a reproduction of vector y.

Consider passing a sequence x through the inverse LPC system with its LPC coefficient vector a. This will yield the prediction error e(t), which can be alternatively defined by

    e(t) = x(t) - \sum_{i=1}^{p} a_i x(t - i)    (9)

where a = (a_1, ..., a_p). The sum of squared errors can be determined as

    E = \sum_{t} e^2(t).    (10)

Similarly, consider passing another sequence y through the inverse LPC system with the same LPC coefficients a. The prediction error, e'(t), is expressed as

    e'(t) = y(t) - \sum_{i=1}^{p} a_i y(t - i)    (11)

where y is the second sequence. The sum of squared errors for y is

    E' = \sum_{t} e'^2(t).    (12)
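As a concrete illustration of (1)–(8), the sketch below (ours, not from the paper; Python with NumPy, and the function names are our own) computes the LPC coefficients of a one-dimensional expression profile by solving the normal equations (4) with the Levinson–Durbin recursion, then derives cepstral coefficients via the recursion (6)–(7).

```python
import numpy as np

def autocorr(x, p):
    """Autocorrelation lags r(0), ..., r(p) of a 1-D profile."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])

def levinson_durbin(r):
    """Solve the normal equations R a = r of eq. (4) by the
    Levinson-Durbin recursion. Returns (a, E): prediction
    coefficients a_1..a_p and the final prediction-error energy."""
    p = len(r) - 1
    a = np.zeros(p)
    e = r[0]
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = (r[i] - np.dot(a[:i - 1], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i - 1] = k
        if i > 1:
            a_new[:i - 1] = a[:i - 1] - k * a[i - 2::-1]
        a = a_new
        e *= 1.0 - k * k
    return a, e

def lpc_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n from the LPCs via (6)-(7);
    c_0 = ln(gain^2) of eq. (5) is kept separate."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c
```

Levinson–Durbin exploits the Toeplitz structure of R, so the result agrees with a direct solve of (4) while using O(p^2) operations instead of O(p^3).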


It can be seen that E' must be greater than or equal to E, because E is the minimum prediction error for the LPC system with the LPC coefficients a. Thus, the ratio of the two prediction errors, denoted as d(x, y), can now be defined by

    d(x, y) = E' / E.    (13)

By now it can be seen that the derivation of the above distortion is based on the concept of the error matching measure.

A. LPC Likelihood Distortion

Consider the two spectra, i.e., the magnitude-squared Fourier transforms S(\omega) and S'(\omega) of the two signals x and y, where \omega is the normalized frequency ranging from -\pi to \pi. The log spectral difference between the two spectra is defined by [24]

    V(\omega) = \log S(\omega) - \log S'(\omega)    (14)

which is the basis for the distortion measure proposed by Itakura and Saito in their formulation of linear prediction as an approximate maximum likelihood estimation. The Itakura–Saito distortion measure d_{IS} is defined as [25]

    d_{IS}(S, S') = \frac{1}{2\pi} \int_{-\pi}^{\pi} \big[ e^{V(\omega)} - V(\omega) - 1 \big] \, d\omega = \frac{a'^T R a'}{\alpha'} + \ln\frac{\alpha'}{\alpha} - 1    (15)

where \alpha and \alpha' are the one-step prediction errors of S and S', respectively. The LPC likelihood ratio distortion between two signals x and y is derived from the Itakura–Saito distortion by discarding the gain terms and is expressed as [24]

    d_{LR}(x, y) = \frac{a'^T R a'}{a^T R a} - 1    (16)

where R is the autocorrelation matrix of sequence x associated with its LPC coefficient vector a, and a' is the LPC coefficient vector of signal y (with the convention that the coefficient vectors are augmented as (1, -a_1, ..., -a_p), so that a^T R a equals the prediction-error energy).

B. LPC Cepstral Distortion

The complex cepstrum of a signal is defined as the Fourier transform of the log of the signal spectrum

    \log S(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\omega}    (17)

where the c_n = c_{-n} are real and referred to as the cepstral coefficients. Applying Parseval's theorem [26], the L_2-norm cepstral distance between S and S' can be related to the root-mean-square log spectral distance as [24]

    d_c^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \big| \log S(\omega) - \log S'(\omega) \big|^2 \, d\omega = \sum_{n=-\infty}^{\infty} (c_n - c'_n)^2    (18)

where c_n and c'_n are the cepstral coefficients of S and S', respectively. Since the cepstrum is a decaying sequence, the infinite number of terms in (18) can be truncated to some finite number L, that is

    d_c^2(L) = \sum_{n=-L}^{L} (c_n - c'_n)^2.    (19)

In the next section, we will discuss the application of the VQ approach, coupled with these spectral vectors, for classifying microarray data.

IV. VQ-BASED CLASSIFICATION OF MICROARRAY DATA

Vector quantization (VQ) is a data reduction method [27], which is used to convert a feature vector set into a small set of distinct vectors using a clustering technique. The advantages of this reduction are reduced storage and efficient computation. The distinct vectors are called codevectors, and the set of codevectors that best represents the training set is called the codebook. Since there is only a finite number of codevectors, the process of choosing the best representation of a given feature vector is equivalent to quantizing the vector and leads to a certain level of quantization error. This error decreases as the size of the codebook increases; however, the storage required for a large codebook is nontrivial. The VQ codebook can be used as a model in pattern recognition.

A. VQ Procedure and LBG Partition

Given a training set of K spectral feature vectors X = {x_1, ..., x_K}, where each source vector x_k has d dimensions, let C = {c_1, ..., c_M} represent the codebook of size M, where the c_m are the codewords. Each codeword is assigned to an encoding region S_m in the partition {S_1, ..., S_M}. The source vector x_k can be represented by the encoding region S_m and expressed by

    Q(x_k) = c_m,  if x_k \in S_m.    (20)
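To make the two distortions of (13)–(19) concrete, here is a small sketch (ours, not from the paper; Python with NumPy). It computes the LPC likelihood ratio of (16) from residual energies a^T R a with the augmented filter vector [1, -a_1, ..., -a_p], and the truncated cepstral distance of (19); all function names are our own.

```python
import numpy as np

def lpc_and_autocorr(x, p):
    """LPC coefficients by solving the normal equations (4) directly;
    also returns the autocorrelation lags r(0..p)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:]), r

def residual_energy(a, r):
    """a_hat^T R a_hat with a_hat = [1, -a_1, ..., -a_p]: the summed
    squared one-step prediction error, as in (10) and (12)."""
    a_hat = np.concatenate(([1.0], -np.asarray(a)))
    m = len(a_hat)
    R = np.array([[r[abs(i - j)] for j in range(m)] for i in range(m)])
    return float(a_hat @ R @ a_hat)

def lpc_likelihood_ratio(a_x, r_x, a_y):
    """Eq. (16): score y's model on x's autocorrelation against
    x's own minimum prediction error; equals E'/E - 1 >= 0."""
    return residual_energy(a_y, r_x) / residual_energy(a_x, r_x) - 1.0

def cepstral_distance(c, c_prime, L):
    """Truncated cepstral distance of eq. (19), using L terms."""
    c, c_prime = np.asarray(c[:L]), np.asarray(c_prime[:L])
    return float(np.sqrt(np.sum((c - c_prime) ** 2)))
```

Note that the likelihood ratio of a sequence against its own model is exactly zero, and against any other model it is nonnegative, because the residual energy is a quadratic form minimized by the sequence's own LPC vector.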

In general, the VQ design can be stated as follows: given a training set X and the size M of the codebook, we seek the codebook C and the partition {S_m} such that the average distortion is minimized. Using the L_2 norm for a squared-error measure, the average distortion D is defined by

    D = \frac{1}{K} \sum_{k=1}^{K} \| x_k - Q(x_k) \|^2.    (21)

A popular method for VQ partition is the LBG algorithm [28]. This algorithm requires an initial codebook and iteratively bi-partitions the codevectors, based on the optimality criteria of the nearest-neighbor and centroid conditions, until the desired number of codevectors is reached. The optimal codebook is then used as a basis for a decision rule for data classification. We describe the VQ-based decision rule in the following section.
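A minimal LBG design along these lines can be sketched as follows (an illustrative sketch of ours with Euclidean distortion; the paper applies the same scheme to LPC-derived feature vectors, and the codebook size is assumed to be a power of two because of the binary splitting):

```python
import numpy as np

def lbg(train, size, eps=1e-3, max_iter=50):
    """LBG codebook design: start from the global centroid and
    binary-split until `size` codevectors are reached, refining each
    stage with nearest-neighbour / centroid iterations.
    `train` is (K, d); `size` is assumed to be a power of two."""
    codebook = train.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # binary split: perturb each codevector in two directions
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(max_iter):
            # nearest-neighbour condition: assign vectors to codewords
            dists = np.linalg.norm(train[:, None, :] - codebook[None, :, :],
                                   axis=2)
            labels = dists.argmin(axis=1)
            # centroid condition: re-estimate each codeword
            new = np.array([train[labels == i].mean(axis=0)
                            if np.any(labels == i) else codebook[i]
                            for i in range(len(codebook))])
            if np.allclose(new, codebook):
                break
            codebook = new
    return codebook
```

For well-separated clusters the refinement converges in a few iterations; the empty-cell guard simply keeps the old codeword when no training vector falls in its region.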



B. VQ-Based Decision Rule

The average spectral distortion between a gene x of an unknown class and a particular cancer class \Omega is defined as

    D(x, C^\Omega) = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le j \le M} d(x_i, c_j^\Omega)    (22)

where d is a spectral distortion measure, D is the average distortion, x_i is an LPC vector of x, N is the number of LPC vectors of x, c_j^\Omega is an LPC-VQ codevector of a particular class \Omega represented by codebook C^\Omega, and M is the size of C^\Omega. The gene x is assigned to class \Omega^* if the average distortion between its spectral feature vectors and the spectral feature codebook is minimum, that is

    \Omega^* = \arg\min_{\Omega} D(x, C^\Omega).    (23)
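The decision rule of (22)–(23) reduces to a few lines; the sketch below is ours, where `dist` would be the LPC-LR or cepstral distortion and each class codebook the output of LBG training on that class:

```python
import numpy as np

def average_distortion(vectors, codebook, dist):
    """Eq. (22): average, over a sample's feature vectors, of the
    distortion to the nearest codevector of one class codebook."""
    return float(np.mean([min(dist(v, c) for c in codebook)
                          for v in vectors]))

def classify(vectors, codebooks, dist):
    """Eq. (23): pick the class whose codebook yields the minimum
    average distortion. `codebooks` maps class label -> codevectors."""
    return min(codebooks,
               key=lambda label: average_distortion(vectors,
                                                    codebooks[label], dist))
```

The class labels and two-codeword codebooks in the usage below are hypothetical, chosen only to exercise the rule.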

V. EXPERIMENTS

A. Data Sets

We tested the LPC-likelihood and the LPC-cepstral classification methods with three well-known microarray data sets: acute leukemia, hereditary breast cancer, and small round blue-cell tumors. We also compared the classification results of the two spectral techniques with several other classification approaches using the same data sets.

1) Acute Leukemia: The leukemia data set [10] has become a benchmark for the evaluation and comparison of different algorithms for the classification of gene expression cancer data. This data set consists of 6817 human genes, where bone marrow or peripheral blood samples were taken from 72 patients with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). The gene expression levels were obtained using Affymetrix high-density oligonucleotide microarrays. The data set is split into a training set and a test set. The training set has 38 samples, of which 27 are ALL and the other 11 are AML. The test set consists of 34 samples, of which 20 are ALL and the other 14 are AML. This data set involves two classes for the classification: AML and ALL.

2) Hereditary Breast Cancer: The hereditary breast cancer data set [29] consists of 3226 genes. This data set was obtained from patients carrying mutations in the predisposing genes BRCA1 or BRCA2, or from patients not expected to carry a hereditary predisposing mutation. It has 7 samples of cancer with the BRCA1 mutation, 8 samples with the BRCA2 mutation, and 7 samples of sporadic cases of breast cancer across the 3226 genes. Thus, this is a three-class classification problem: BRCA1, BRCA2, and sporadic cases.

3) Small Round Blue-Cell Tumors: This data set [30] contains the small round blue-cell tumors (SRBCT) of childhood, including neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS). The classification of these cancers is difficult because of their close similarity in routine histology. The expression profiles of 6567 genes for these four types of malignancies

were obtained using cDNA microarrays and filtered down to 2308 genes. The samples are split into 63 for training and 20 for testing. The training set consists of tumor biopsy materials (13 EWS and 10 RMS) and cell lines (10 EWS, 10 RMS, 12 NB, and 8 BL, where BL denotes Burkitt lymphoma, a subset of NHL). The test set consists of tumors (5 EWS, 5 RMS, and 4 NB) and cell lines (1 EWS, 2 NB, and 3 BL). Thus, there are four classes to be identified using this data set: EWS, RMS, NB, and BL.

B. Testing and Comparisons

We applied both the LPC likelihood ratio (LPC-LR) and the LPC cepstral (LPC-C) distortion measures to analyze the three microarray gene expression data sets without further data preprocessing. Owing to the large number of genes in each data set, we applied the VQ method to extract prototypes from the training set for effective classification decisions. Because the three data sets have different sample sizes, we selected a smaller number of poles for the smaller sample sizes: the numbers of poles for acute leukemia, hereditary breast cancer, and small round blue-cell tumors are 20, 10, and 16, respectively. Because of the different numbers of genes in the three data sets, we selected larger codebook sizes for the larger gene sets; owing to the binary splitting strategy of the LBG algorithm, the codebook sizes are doubled at each increment. Since the number of genes in the acute leukemia data set is much larger than those in the hereditary breast cancer and small round blue-cell tumor data sets, we selected the three VQ codebook sizes for acute leukemia, hereditary breast cancer, and small round blue-cell tumors as (32, 64, 128), (16, 32, 64), and (16, 32, 64), respectively.
We applied the leave-one-out method to validate all three data sets with their respective training samples, and obtained 100% correct classification for all three data sets using both spectral distortion measures, where the classifications were made by the VQ-based decision rule. As there is no test set for the breast cancer data, only the validation was carried out for this data set. We tested the performance of the proposed methods on the test set of the acute leukemia data and found that the LPC-LR approach made 3, 1, and 2 classification errors for the codebook sizes 32, 64, and 128, respectively, whereas the LPC-C made 2, 1, and 1 errors for the same codebook sizes. Both LPC-LR and LPC-C correctly classified all samples in the test set of the SRBCT data with all three codebooks.

Golub et al. [10] classified the acute leukemia data using a weighted voting (WV) scheme, which produced 2 errors on the training set and 5 errors on the test set. For the same acute leukemia data, the total principal component regression (TPCR) method proposed by Tan et al. [31] made 1 classification error in the ALL samples when the 38 training and 34 test samples were pooled together and classified. The same authors also reported that, in the study carried out by Nguyen and Rocke [12], at least 6 samples were misclassified using the A2 procedure (A2P). For the hereditary breast cancer data, the TPCR produced 4 misclassified samples. For the SRBCT data set, all training and test sets were pooled together (83 samples) for the classification and the


TPCR made no misclassification, whereas Khan et al. [30] applied neural networks (NNs), which correctly classified all 88 samples, including 5 non-SRBCT samples. Yeung et al. [14] filtered the 7129 genes of the leukemia data set down to 3051 genes, which were then thresholded and log-transformed. These authors applied their proposed Bayesian model averaging (BMA) method to classify this data set with respect to the ALL and AML types and found that the BMA produced 2 classification errors. They carried out further testing of the BMA method on the hereditary breast cancer data set and found that 6 samples were misclassified. An earlier study on the classification of the leukemia data set was carried out by Nguyen and Rocke [12] using logistic discrimination (LD) and quadratic discriminant analysis (QDA) together with the dimension reduction methods of partial least squares (PLS) and principal component analysis (PCA). With different numbers of selected genes, the LD based on PLS and PCA misclassified from 1 to 4 samples, whereas the QDA based on PLS and PCA produced 2 to 6 misclassified samples. Bae and Mallick [32] applied a two-level hierarchical Bayesian (2L-HB) method for gene selection, together with three prior probability distribution models, to perform the classification of the same acute leukemia data set. These authors reported 2 misclassifications on the test set using probability models I and II. In a second experiment, the same authors applied their method to the hereditary breast cancer data to classify the BRCA1, BRCA2, and sporadic cases, and found that all samples were correctly classified by models I and II, whereas 2 samples were misclassified by model III. Furey et al. [15] applied support vector machines (SVMs) to validate the training set of the acute leukemia data using the leave-one-out method and found that the SVM produced 100% correct classification. On the test set, the same authors reported that the SVM misclassified 2 to 4 samples. Lee et al. [16] applied a hierarchical Bayesian (HB) model for a variable (gene) selection approach to the hereditary breast cancer data set to classify BRCA1, BRCA2, and sporadic cancers. Based on the leave-one-out cross-validation test, their Bayesian model correctly classified all the samples. These authors also reported the classification results obtained from the same data set by nine other methods [16, Table III], including feedforward neural networks (1.5 average errors), the Gaussian kernel (1 error), the Epanechnikov kernel (1 error), the moving window kernel (2 errors), probabilistic neural networks (4 errors), linear SVM (4 errors), k-NN (3 errors), the perceptron (5 errors), and nonlinear SVM (6 errors).

Based on the validation and test results of the three data sets obtained from many different methods for the classification of microarray cancer data, it can be seen that while the LPC likelihood distortion measure is better than or competitive with other methods, the LPC cepstral distortion measure is the most favorable for the classification task. Regarding the acute leukemia data set, it is noted that although the total principal component regression (TPCR) method proposed by Tan et al. [31] also made only 1 classification error, all 38 training and 34 test samples were pooled together for training.
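Several of the comparisons above rely on leave-one-out validation. The protocol is generic; a plain-Python sketch of it (ours, with a hypothetical `fit`/`predict` pair standing in for codebook training and the VQ decision rule) is:

```python
def leave_one_out(samples, labels, fit, predict):
    """Leave-one-out validation: hold out each sample in turn, train on
    the remainder, and report the fraction classified correctly."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        hits += int(predict(model, samples[i]) == labels[i])
    return hits / len(samples)
```

With N training samples this trains N models, which is affordable here because the training sets contain at most a few dozen samples.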


TABLE I CLASSIFICATION RESULTS ON ACUTE LEUKEMIA DATA

TABLE II CLASSIFICATION RESULTS ON BREAST CANCER DATA

TABLE III CLASSIFICATION RESULTS ON SRBCT DATA

We list comparable results obtained from different methods using the three data sets in Tables I–III, where the three results in brackets for the LPC-LR and LPC-C methods indicate the numbers of misclassified samples using the three codebook sizes of 32, 64, and 128 for the acute leukemia data set, and 16, 32, and 64 for both the breast cancer and SRBCT data sets. It appears that the codebook consisting of 64 codewords is a reasonable choice for the tradeoff between accuracy and data storage.

VI. CONCLUSION

Other classification methods select individual genes out of the microarray data, which serve both as a means for data reduction and as features for classification. The proposed approach, in contrast, extracts features as either the LPC coefficients or the LPC cepstral coefficients, and then uses the LBG-VQ approach to generate feature prototypes for decision making. In doing so, we have measured the difference between two gene patterns in terms of spectral distortion. The methodology of our proposed approach appears to be a pioneering work that departs from conventional procedures for microarray gene expression classification. Our future investigations will involve a parametric study to determine appropriate ranges for the number of poles in LPC analysis and for the codebook sizes in VQ design, using more extensive data sets. We will also seek to implement more sophisticated pattern classification methods using these LPC spectral features in the hope of further improving the classification.

ACKNOWLEDGMENT

The authors wish to thank M. Brandl, an intern with the JCU Bioinformatics Applications Research Centre, for her help in



collecting additional literature on the application of microarray gene expression data.

REFERENCES

[1] A. Brazma and J. Vilo, "Gene expression data analysis," FEBS Lett., vol. 480, pp. 17–24, 2000.
[2] A. K. Whitchurch, "Gene expression microarrays," IEEE Potentials, vol. 21, pp. 30–34, 2002.
[3] X. Y. Zhang, F. Chen, Y. T. Zhang, S. G. Agner, M. Akay, Z. H. Lu, M. M. Y. Waye, and S. K. W. Tsui, "Signal processing techniques in genomic engineering," Proc. IEEE, vol. 90, no. 12, pp. 1822–1833, Dec. 2002.
[4] T. D. Pham, C. Wells, and D. I. Crane, "Analysis of microarray gene expression data," Current Bioinformatics, vol. 1, no. 1, pp. 37–53, 2006.
[5] P. Kellam and X. Liu, "Experimental use of DNA arrays," in Bioinformatics: Genes, Proteins Struct., C. A. Orengo, D. T. Jones, and J. M. Thornton, Eds. Oxford, U.K.: Bios, 2003.
[6] R. Nagarajan, "Intensity-based segmentation of microarray images," IEEE Trans. Med. Imag., vol. 22, no. 7, pp. 882–889, Jul. 2003.
[7] A. W.-C. Liew, H. Yan, and M. Yang, "Robust adaptive spot segmentation of DNA microarray images," Pattern Recogn., vol. 36, pp. 1251–1254, 2003.
[8] R. Lukac, K. N. Plataniotis, B. Smolka, and A. N. Venetsanopoulos, "A multichannel order-statistic technique for cDNA microarray image processing," IEEE Trans. Nanobiosci., vol. 3, pp. 272–285, 2004.
[9] S. Dudoit and J. Fridlyand, "Classification in microarray experiments," in Statistical Analysis of Gene Expression Microarray Data, T. Speed, Ed. Boca Raton, FL: Chapman & Hall, ch. 3, pp. 93–158.
[10] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531–537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389–422, 2002.
[12] D. V. Nguyen and D. M. Rocke, "Tumor classification by partial least squares using microarray gene expression data," Bioinformatics, vol. 18, pp. 39–50, 2002.
[13] X. Zhou, K.-Y. Liu, and S. T. C. Wong, "Cancer classification and prediction using logistic regression with Bayesian gene selection," J. Biomed. Inform., vol. 37, pp. 249–259, 2004.
[14] K. Y. Yeung, R. E. Bumgarner, and A. E. Raftery, "Bayesian model averaging: Development of an improved multi-class, gene selection and classification tool for microarray data," Bioinformatics, vol. 21, pp. 2394–2402, 2005.
[15] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, pp. 906–914, 2000.
[16] K. E. Lee, N. Sha, E. R. Dougherty, M. Vannucci, and B. K. Mallick, "Gene selection: A Bayesian variable selection approach," Bioinformatics, vol. 19, pp. 90–97, 2003.
[17] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, "A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis," Bioinformatics, vol. 21, pp. 631–643, 2005.
[18] H. Yan, "Efficient matching and retrieval of gene expression time series data based on spectral information," in LNCS, O. Gervasi, Ed. New York: Springer-Verlag, 2005, vol. 3482, pp. 357–373.
[19] R. Lukac, B. Smolka, K. Martin, K. N. Plataniotis, and A. N. Venetsanopoulos, "Vector filtering for color imaging," IEEE Signal Processing Mag., vol. 22, no. 1, pp. 74–86, Jan. 2005.
[20] B. Smolka, R. Lukac, A. Chydzinski, K. N. Plataniotis, and K. Wojciechowski, "Fast adaptive similarity based impulsive noise reduction filter," Real-Time Imag., vol. 9, pp. 261–276, 2003.
[21] R. M. Nosofsky, "Choice, similarity and the context theory of classification," J. Experimental Psych.: Learning, Memory, Cognition, vol. 10, pp. 104–114, 1984.
[22] D. Verma and M. Meila, "A comparison of spectral clustering algorithms," Univ. of Washington, Seattle, WA, Tech. Rep. UW-CSE-03-05-01, 2003.

[23] T. D. Pham, "LPC cepstral distortion measure for protein sequence comparison," IEEE Trans. NanoBiosci., vol. 5, no. 2, pp. 83–88, Feb. 2006.
[24] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[25] F. Itakura and S. Saito, "A statistical method for estimation of speech spectral density and formant frequencies," Electron. Commun. Japan, vol. 53A, pp. 36–43, 1970.
[26] D. O'Shaughnessy, Speech Communication—Human and Machine. Reading, MA: Addison-Wesley, 1987.
[27] R. M. Gray, "Vector quantization," IEEE Acoust. Speech Signal Process. Mag., vol. 1, pp. 4–29, 1984.
[28] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84–95, Jan. 1980.
[29] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O.-P. Kallioniemi, A. Borg, and J. Trent, "Gene-expression profiles in hereditary breast cancer," The New England J. Medicine, vol. 344, pp. 539–548, 2001.
[30] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, pp. 673–679, 2001.
[31] Y. Tan, L. Shi, W. Tong, and C. Wang, "Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data," Nucleic Acids Res., vol. 33, pp. 56–65, 2005.
[32] K. Bae and B. K. Mallick, "Gene selection using a two-level hierarchical Bayesian model," Bioinformatics, vol. 20, pp. 3423–3430, 2004.

Tuan D. Pham (M'95–SM'01) received the Ph.D. degree in civil engineering from the University of New South Wales, Sydney, Australia, in 1995. He is an Associate Professor in the School of Information Technology, James Cook University, Townsville, Australia, where he is the Director of the Bioinformatics Applications Research Centre. His research interests include image processing, pattern recognition, fuzzy-set algorithms, geostatistics, and bioinformatics. He has published two books and more than 100 papers in edited books, peer-reviewed journals, and conference proceedings.

Dominik Beck received the M.Sc. degree in bioinformatics from the University of Applied Sciences Weihenstephan, Munich, Germany, in 2006. He was a Trainee at the Bioinformatics Applications Research Centre (BARC), James Cook University, Townsville, Australia, the Genomics Research Centre, Griffith University, Australia, and the Biomathematics Group, Universidade Nova de Lisboa, Lisbon, Portugal. He is now with BARC as a Research Assistant.

Hong Yan (M’89–SM’93–F’06) received the Ph.D. degree from Yale University, New Haven, CT. He has been a Professor of Imaging Science at the University of Sydney and is currently Professor of Computer Engineering at City University of Hong Kong. His research interests include image processing, pattern recognition, and bioinformatics. Dr. Yan was elected a Fellow of the IEEE for contributions to image recognition techniques and applications and is a Fellow of the IAPR for contributions to document image analysis.
