A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining


Amir R. Razavi1, Hans Gill1, Hans Åhlfeldt1, and Nosrat Shahsavar1,2

1 Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden
2 Regional Oncology Centre, University Hospital, Linköping, Sweden
{amirreza.razavi, hans.gill, hans.ahlfeldt, nosrat.shahsavar}@imt.liu.se
http://www.imt.liu.se

Abstract. In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extracting rules to predict the outcomes of new patients. However, incompleteness and high dimensionality of stored data are a problem. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique to preserve the character of the original data by omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by ten-fold cross validation and the effect of pre-processing was analysed by applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.

S. Miksch et al. (Eds.): AIME 2005, LNAI 3581, pp. 434-443, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

In recent years, huge amounts of information in the area of medicine have been saved every day in different electronic forms such as Electronic Health Records (EHRs) [1] and registers. These data are collected and used for different purposes. Data stored in registers are used mainly for monitoring and analysing health and social conditions in the population. In Sweden, the long tradition of collecting information on the health and social conditions of the population has provided an excellent base for monitoring disease and social problems. The unique personal identification number of every inhabitant enables linkage of exposure and outcome data spanning several decades and obtained from different sources [2]. The existence of accurate epidemiological registers is a basic prerequisite for monitoring and analysing health and social conditions in the population. Some registers are nation-wide, cover the whole Swedish population, and have been collecting data for decades. They are frequently used for research, evaluation, planning and other purposes by a variety of users [3].

Given their wide availability and comprehensiveness, registers can serve as a good source for knowledge extraction. Data mining methods can be applied to them in order to extract rules for predicting the outcomes of new patients. Extraction of hidden predictive information from large databases is a potent method with extensive capability to help physicians with their decisions [4].

A decision tree is a classifier in the form of a tree structure that is used to classify cases in a dataset [5]. A set of training cases with their correct classifications is used to generate a decision tree that, ideally, classifies each case in the test set correctly. The resulting tree is a representation that can be verified by humans and can be used by either humans or computer programs [6]. Decision Tree Induction (DTI) has been used in different areas of medicine, including oncology [7] and respiratory diseases [8].

Before the data undergo data mining, they must be prepared in a pre-processing step that removes or reduces noise and handles missing values. Relevance analyses for omitting unnecessary and redundant data, as well as data transformation, are needed for generalising the data to higher-level concepts [9]. Pre-processing takes the most effort and time, almost 80% of the whole project time for knowledge discovery in databases (KDD) [10]. However, replacing the missing values and finding a proper method for selecting important variables prior to data mining can make the mining process faster and even more stable and accurate.

In this study the Expectation Maximization (EM) method was used for replacing the missing values. This algorithm is an iterative optimisation method for parameter estimation that falls within the general framework of maximum likelihood estimation [11].

Hotelling (1936) developed Canonical Correlation Analysis (CCA) as a method for evaluating linear correlation between sets of variables [12]. The method allows investigation of the relationship between two sets of variables and identification of the important ones.
It can be used as a dimension reduction technique that preserves the character of the original data stored in the registers while omitting data that are non-essential. The objective is to find a subset of variables with predictive performance comparable to that of the full set of variables [13].

In this study, CCA was applied in the pre-processing step before Decision Tree Induction. DTI was trained on the cases with known outcomes, and the resulting model was validated with ten-fold cross validation. In order to show the benefits of pre-processing, DTI was also applied to the same database without pre-processing, as well as after handling missing values only. For performance evaluation, the accuracy, sensitivity and specificity of the three models were compared.

2 Materials and Methods

Preparing the data for mining consisted of several steps: choosing proper databases, integrating them into one dataset, cleaning the data and replacing missing values, data transformation, and dimension reduction by CCA. To see the effect of the pre-processing, DTI was also applied to the same dataset without prior handling of missing values and dimension reduction.


2.1 Dataset

In this study, data from 3949 female patients, mean age 62.7 years, were analysed. The earliest patient was diagnosed in January 1986 and the last in September 1995, and the last follow-up was performed in June 2003. The data were retrieved from a regional breast cancer register operating in south-east Sweden. In order to cover more predictors and obtain a better assessment of the outcomes, data were retrieved and combined from two other registers, namely the tumour markers and cause of death registers. After combining the information from these registers, the data were anonymised for security reasons and to maintain patient confidentiality. If patients developed symptoms following treatment they were referred to the hospital; otherwise follow-up visits occurred at fixed time intervals for all patients.

There were more than 150 variables in the dataset that resulted from combining the databases. The first criterion for selecting appropriate variables for prediction was consultation with domain experts. In this step, sets of suggested predictors and outcomes (Table 1) were selected. Age of the patient and variables concerning tumour specifications based on pathology reports, physical examination and tumour markers were selected as predictors. There were two variables in the outcome set, distant metastasis and loco-regional recurrence, both observed at different time intervals after diagnosis, indicating early and late recurrence.

2.2 Data Pre-processing

After selecting appropriate variables, the raw data were cleaned and outliers were removed by running a set of logical rules. For handling missing values, the Expectation Maximization (EM) algorithm was used. The EM algorithm is a computational method for efficient estimation from incomplete data. In any incomplete dataset, the observed values provide indirect evidence about the likely values of the unobserved ones.
This evidence, when combined with some assumptions, yields a predictive probability distribution for the missing values that should be averaged over in the statistical analysis. The EM algorithm is a general technique for fitting models to incomplete data. EM capitalises on the relationship between the missing data and the unknown parameters of the data model: when the parameters of the data model are known, unbiased predictions can be obtained for the missing values [14]. All continuous and ordinal variables except age were transformed into dichotomised variables.

The fundamental principle behind CCA is the creation of a number of canonical solutions [15], each consisting of a linear combination of one set of variables, Ui, and a linear combination of the other set of variables, Vi. The goal is to determine the coefficients that maximise the correlation between the canonical variates Ui and Vi. The number of solutions equals the number of variables in the smaller set. The first canonical correlation is the highest possible correlation between any linear combination of the variables in the predictor set and any linear combination of the variables in the outcome set. Only the first CCA solution was used, since it describes most of the variation. The most important variables in each canonical variate were identified from the magnitude of their structure coefficients (loadings). Using loadings as the criterion for finding the important variables has some advantages [16]. As a rule of thumb for meaningful loadings, an absolute value equal to or greater than 0.3 is often used [17]. SPSS version 11 was used for transforming the data, handling missing values, and implementing CCA [18].

Table 1. List of variables in both sets

Predictor Set              Outcome Set ‡
Age                        DM, first five years
Quadrant                   DM, more than 5 years
Side                       LRR, first five years *
Tumor size                 LRR, more than 5 years *
LN involvement
LN involvement †
Periglandular growth *
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

Abbreviations: LN: lymph node, DM: distant metastasis, LRR: loco-regional recurrence.
* from pathology report; † N0: not palpable LN metastasis; ‡ all periods are time after diagnosis.
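The EM-based replacement of missing values described in this section can be illustrated with a small sketch. This is not the SPSS routine used in the study; it is a minimal EM loop under the simplifying assumption of a bivariate normal model with one fully observed variable x and one partially observed variable y, where the E-step replaces each missing y with its conditional expectation given x and the M-step re-estimates the model parameters from the completed data.

```python
import numpy as np

def em_impute(x, y, n_iter=50):
    """Impute missing y-values (np.nan) under a bivariate normal model.

    E-step: replace each missing y with its conditional expectation
            E[y | x] = mu_y + (cov_xy / var_x) * (x - mu_x).
    M-step: re-estimate means, variance and covariance from the
            completed data.  Iterate until the estimates stabilise.
    """
    y = y.copy()
    miss = np.isnan(y)
    y[miss] = np.nanmean(y)            # crude starting values for missing cells
    for _ in range(n_iter):
        mu_x, mu_y = x.mean(), y.mean()
        var_x = x.var()
        cov_xy = ((x - mu_x) * (y - mu_y)).mean()
        # E-step: conditional expectation of each missing y given its x
        y[miss] = mu_y + (cov_xy / var_x) * (x[miss] - mu_x)
    return y

# Hypothetical data: y depends linearly on x; 20% of y is missing.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_true = 2.0 * x + rng.normal(scale=0.5, size=200)
y = y_true.copy()
y[:40] = np.nan
y_imp = em_impute(x, y)
```

Because the observed x-values carry information about the missing y-values, the imputations track the true values far better than a constant mean fill would.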

2.3 Decision Tree Induction

One of the classification methods in data mining is decision tree induction (DTI). In a decision tree, each internal node denotes a test on a variable, each branch stands for an outcome of the test, each leaf node represents an outcome class, and the uppermost node in a tree is the root node [19]. The algorithm uses information gain as a heuristic for selecting the variable that best separates the cases by outcome [20]. Understandable representations of the acquired knowledge and fast processing make decision trees one of the most frequently used data mining techniques [21].

Because of the approach used in constructing decision trees, there is a risk of overfitting the training data, which leads to poor accuracy in future predictions. The solution is pruning of the tree, and the most common method is post-pruning: the tree is grown from a dataset until all possible leaf nodes have been reached, and then particular subtrees are removed. Post-pruning yields smaller and more accurate trees [22].

In this study, DTI was applied to the reduced model resulting from CCA after handling missing values, to obtain a predictive model for new cases. For the training phase of DTI, the predictors with an absolute loading value ≥ 0.3 in CCA were used as input. As the outcome in DTI, the important outcome with an absolute loading value ≥ 0.3 in CCA was used, i.e. the occurrence of distant metastasis during the five-year period after diagnosis.
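The information-gain heuristic that DTI uses to choose a splitting variable can be sketched as follows: the gain of an attribute is the entropy of the outcome minus the weighted entropy remaining after splitting on that attribute. The attribute names and toy cases below are hypothetical, not taken from the study's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(cases, attr, target="recurrence"):
    """Entropy of the target minus the weighted entropy of the
    subsets obtained by splitting the cases on one attribute."""
    base = entropy([c[target] for c in cases])
    remainder = 0.0
    for value in {c[attr] for c in cases}:
        subset = [c[target] for c in cases if c[attr] == value]
        remainder += len(subset) / len(cases) * entropy(subset)
    return base - remainder

# Hypothetical toy cases: one perfectly informative attribute, one useless one.
cases = [
    {"ln_involved": 1, "side": "L", "recurrence": 1},
    {"ln_involved": 1, "side": "R", "recurrence": 1},
    {"ln_involved": 0, "side": "L", "recurrence": 0},
    {"ln_involved": 0, "side": "R", "recurrence": 0},
]
print(information_gain(cases, "ln_involved"))  # perfect split -> 1.0 bit
print(information_gain(cases, "side"))         # uninformative -> 0.0 bit
```

A tree-growing algorithm such as C4.5/J48 repeats this calculation at every node (C4.5 actually uses the gain ratio, a normalised variant) and splits on the highest-scoring attribute.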

Fig. 1. Canonical structure matrix and loadings for the first solution

As a comparison, DTI was also applied to the dataset without handling of missing values and without dimension reduction by CCA (Table 1). In this analysis, any recurrence (either distant or loco-regional) at any time during the follow-up after diagnosis was used as the outcome. Another, similar analysis was done after only handling missing values with the EM algorithm. DTI was carried out using the J48 algorithm in WEKA, a collection of machine learning algorithms for data mining tasks [23]. Post-pruning was done to trim the resulting tree. The J48 algorithm is the equivalent of the C4.5 algorithm written by Quinlan [5].

2.4 Performance Comparison

Ten-fold cross validation was done to assess the performance of the predictive models. This method estimates the error that would be produced by a model. All cases were randomly re-ordered, and the set of all cases was then divided into ten mutually disjoint subsets of approximately equal size. The model was then trained and tested ten times: each time it was trained on all but one subset and tested on the remaining subset. The estimate of the overall accuracy was the average of the ten individual accuracy measures [24]. The accuracy, sensitivity and specificity were used for comparing the performance of the DTI models, i.e. the one following full pre-processing and those without the dimension reduction procedure [25]. In addition, the size of the tree and the number of leaf nodes in each tree were compared.

3 Results

CCA was applied to the dataset, and in each solution the loadings for the predictor and outcome sets were calculated. The first solution and the loadings (in parentheses) for the variables are illustrated in Figure 1. The canonical correlation coefficient (rc) is 0.49 (p ≤ .001). The important variables are shown in bold type.
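A minimal sketch of how a first canonical solution and its structure coefficients (loadings) can be computed is given below. It uses a standard QR/SVD formulation rather than SPSS, and the simulated data, in which two predictors and one outcome share a latent factor, are hypothetical; the |loading| ≥ 0.3 rule of thumb from Section 2.2 then picks out the related variables.

```python
import numpy as np

def first_canonical_solution(X, Y):
    """First canonical correlation of two variable sets, plus the
    structure coefficients (loadings): the correlation of each
    original variable with its own set's first canonical variate.
    Assumes both centred matrices have full column rank."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    Qx, _ = np.linalg.qr(Xc)               # orthonormal basis of X-space
    Qy, _ = np.linalg.qr(Yc)               # orthonormal basis of Y-space
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    u = Qx @ U[:, 0]                       # first canonical variate, X set
    v = Qy @ Vt[0, :]                      # first canonical variate, Y set
    rc = s[0]                              # first canonical correlation
    load_x = [np.corrcoef(Xc[:, j], u)[0, 1] for j in range(X.shape[1])]
    load_y = [np.corrcoef(Yc[:, j], v)[0, 1] for j in range(Y.shape[1])]
    return rc, load_x, load_y

# Hypothetical data: a shared latent factor drives 2 of 4 predictors
# and 1 of 2 outcomes; the remaining variables are pure noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 1))
X = np.hstack([latent + 0.5 * rng.normal(size=(500, 2)),   # related predictors
               rng.normal(size=(500, 2))])                 # noise predictors
Y = np.hstack([latent + 0.5 * rng.normal(size=(500, 1)),   # related outcome
               rng.normal(size=(500, 1))])                 # noise outcome
rc, lx, ly = first_canonical_solution(X, Y)
important = [j for j, l in enumerate(lx) if abs(l) >= 0.3]
# 'important' picks out exactly the two latent-driven predictors
```

The singular values of Qx.T @ Qy are the canonical correlations (the cosines of the principal angles between the two column spaces), which is why no explicit covariance inversion is needed.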

Fig. 2. Extracted model from CCA in the pre-processing step

After the primary hypothetical model was refined by CCA and the important variables were identified from their loadings, the data were ready to be analysed by DTI in the mining step. The important outcome was distant metastasis during the first five years, and it was used as a dichotomous outcome in DTI. This model is illustrated in Fig. 2. In all, three different pre-processing approaches were used before applying DTI, and the accuracy, sensitivity, specificity, number of leaves and tree size were calculated for the predictive models. The results are compared in Table 2.

Table 2. Results for the different approaches

DTI                    Without            With replacing     With
                       pre-processing     missing values     pre-processing
Accuracy               54%                57%                67%
Sensitivity            83%                82%                80%
Specificity            41%                46%                63%
Number of Leaves       137                196                14
Tree Size              273                391                27


In the analysis done after handling missing values and dimension reduction with CCA, the accuracy and specificity improved, but the sensitivity was somewhat lower. The tree built after pre-processing was also markedly smaller, with far fewer leaves, while being more accurate overall.
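The accuracy, sensitivity and specificity reported in Table 2 are all derived from the confusion matrix of actual versus predicted outcomes; a minimal sketch, with hypothetical labels rather than the study's data, is:

```python
def confusion_metrics(actual, predicted):
    """Accuracy, sensitivity and specificity from paired binary
    outcome labels (1 = recurrence, 0 = no recurrence)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # recurrences correctly flagged
        "specificity": tn / (tn + fp),   # non-recurrences correctly cleared
    }

# Hypothetical example: 3 recurrences, 5 non-recurrences.
m = confusion_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                      [1, 1, 0, 0, 0, 0, 1, 1])
# accuracy 0.625, sensitivity 2/3, specificity 0.6
```

The trade-off seen in Table 2 corresponds to fewer false positives (higher specificity) at the cost of slightly more false negatives (lower sensitivity).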

4 Discussion

An important step in discovering the hidden knowledge in databases is effective data pre-processing, since real-world data are usually incomplete, noisy and inconsistent [9]. Data pre-processing consists of several steps: data cleaning, data integration, data transformation and data reduction. It is often the case that a large amount of information is stored in databases, and the problem is how to analyse this massive quantity of data effectively, especially when the data were not collected for data mining purposes.

For extracting specific knowledge from a database, such as predictions of the recurrence of a disease or of the risk of complications in certain patients, a set of important predictors is assumed to be of central importance, while other variables are not essential. These unimportant variables simply cause problems in the main analysis: they may contain noise, may increase the prediction error, and increase the need for computational resources [26]. Dimension reduction results in a subset of the original variables, which simplifies the model and reduces the risk of overfitting.

The first step in reducing the number of unimportant variables is discussion with the domain experts who have been involved in collecting the data. Their involvement in the knowledge discovery and data mining process is essential [27]. In this study, the selection by domain experts was based on their knowledge and experience concerning variables related to the recurrence of breast cancer. In this step the number of variables was reduced from more than 150 to 21 (17 in the predictor set and 4 in the outcome set).

One of the problems in real-life databases is low data quality, such as missing values. This is frequently encountered in medical databases, since most medical data are collected for purposes other than data mining [28].
One of the most common reasons for values to be missing is not that they could not be measured or recorded, but that they were not needed for the main purpose of the database. There are different methods for dealing with missing values, the most common of which is simply to omit any case or patient with missing values. However, this is not a good approach because it causes a loss of information: a case with just one missing value still has valid values for its other variables that can be useful in the analysis. Deleting such cases is the default in most statistical packages. The disadvantage is that this may result in a very small dataset and can lead to serious biases if the data are not missing at random. Methods such as imputation, weighting and model-based techniques can instead be used to replace the missing values [29]. In this study, the missing values were replaced after the variables had been selected by the domain experts. This requires fewer calculations and less effort, and gives a better analysis with CCA in the dimension reduction step.
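The information loss caused by omitting every case with a missing value can be made concrete with a small simulation; the missingness rate and dataset dimensions below are hypothetical, chosen only to illustrate how quickly listwise deletion shrinks a dataset.

```python
import random

def listwise_deletion(cases):
    """Keep only the cases with no missing (None) values."""
    return [c for c in cases if None not in c]

# Hypothetical dataset: 1000 cases, 20 variables, each value
# independently missing with probability 5%.
rng = random.Random(42)
n_cases, n_vars, p_missing = 1000, 20, 0.05
cases = [[None if rng.random() < p_missing else 1 for _ in range(n_vars)]
         for _ in range(n_cases)]
kept = listwise_deletion(cases)
# Only about 0.95**20, roughly 36%, of the cases survive, even though
# no single variable is missing for more than about 5% of patients.
print(len(kept) / n_cases)
```

This is why imputation methods such as EM, which keep every case, are preferred over deletion when many variables each have a little missingness.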


In the CCA, unimportant variables were removed from the dataset using the loadings as criteria. Loadings are not affected by the presence of strong correlations among variables and can also handle two sets of variables. After the dimension reduction by CCA, the number of variables suggested by the domain experts was reduced to 9 (8 predictors + 1 outcome). The benefits can be even more apparent when the number of variables is much higher. This small subset of the initial variables contains most of the information required for predicting recurrence of the cancer.

DTI is a popular classification method because its results can easily be converted into rules or illustrated graphically. In the case of small and simple trees, even a paper copy of the tree can be used for prediction. The results can be presented and explained to domain experts and can easily be modified by them for a higher degree of explainability.

As an important step in KDD, pre-processing is vital for successful data mining. In some papers, no specific pre-processing was noted before mining the data with DTI [8], and in others data discretisation was performed [21]. We combined different techniques: substitution of missing values, categorisation and dimension reduction. The results of the three DTIs that were performed (Table 2) show that the accuracy and specificity of the analysis with replacement of missing values and CCA prior to DTI are considerably better, although the sensitivity has decreased. The number of leaves and the tree size show a considerable decrease. This simplifies the use of the tree by medical personnel and makes it easier for domain experts to study the tree for possible changes. This simplicity is gained along with an increase in the overall accuracy of the prediction, which shows the benefits of a well-designed pre-processing step. In this study, the role of dimension reduction by CCA in pre-processing is more apparent than that of handling missing values.
This is shown by the better accuracy and the reduced size of the tree after adding CCA to the pre-processing. Combining proper methods for handling missing values with a dimension reduction technique before analysing the dataset with DTI is an effective approach for predicting the recurrence of breast cancer.
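The point that a decision tree's results can easily be converted into rules can be illustrated: each root-to-leaf path is one IF-THEN rule. The tree fragment below is hypothetical, not the tree induced in this study.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a decision tree (nested dicts) into IF-THEN rules,
    one per root-to-leaf path."""
    if not isinstance(node, dict):                 # leaf: an outcome label
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child,
                               conditions + (f"{node['test']} = {value}",))
    return rules

# Hypothetical tree fragment for five-year distant metastasis:
tree = {
    "test": "LN involvement",
    "branches": {
        "yes": {"test": "Tumor size > 20 mm",
                "branches": {"yes": "DM within 5 years",
                             "no": "no DM within 5 years"}},
        "no": "no DM within 5 years",
    },
}
for rule in tree_to_rules(tree):
    print(rule)
```

A 27-node tree like the pruned one in Table 2 yields only 14 such rules, which is short enough to review on paper with domain experts.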

5 Conclusion

High data quality is a primary factor for successful knowledge discovery. Analysing data containing missing values and redundant and irrelevant variables requires proper pre-processing before the application of data mining techniques. In this study, a pre-processing step including CCA is suggested. Data on breast cancer patients stored in the registers of south-east Sweden were analysed. We have presented a method that consists of consulting domain experts, replacing missing values and dimension reduction prior to applying DTI. The results show an increased accuracy in predicting the recurrence of breast cancer. Using the described pre-processing method prior to DTI results in a simpler decision tree and increases the efficiency of predictions. This helps oncologists to identify high-risk patients.


Acknowledgments

This study was performed in the framework of Semantic Mining, a Network of Excellence funded by EC FP6. It was also supported by grant No. F2003-513 from FORSS, the Health Research Council in the South-East of Sweden. Special thanks to the South-East Swedish Breast Cancer Study Group for fruitful collaboration and support in this study.

References

[1] Uckert, F., Ataian, M., Gorz, M., Prokosch, H. U.: Functions of an electronic health record. Int J Comput Dent 5 (2002) 125-32
[2] Sandblom, G., Dufmats, M., Nordenskjold, K., Varenhorst, E.: Prostate carcinoma trends in three counties in Sweden 1987-1996: results from a population-based national cancer register. South-East Region Prostate Cancer Group. Cancer 88 (2000) 1445-53
[3] Rosen, M.: National Health Data Registers: a Nordic heritage to public health. Scand J Public Health 30 (2002) 81-5
[4] Windle, P. E.: Data mining: an excellent research tool. J Perianesth Nurs 19 (2004) 355-6
[5] Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
[6] Podgorelec, V., Kokol, P., Stiglic, B., Rozman, I.: Decision trees: an overview and their use in medicine. J Med Syst 26 (2002) 445-63
[7] Vlahou, A., Schorge, J. O., Gregory, B. W., Coleman, R. L.: Diagnosis of ovarian cancer using decision tree classification of mass spectral data. J Biomed Biotechnol 2003 (2003) 308-314
[8] Gerald, L. B., Tang, S., Bruce, F., Redden, D., Kimerling, M. E., Brook, N., Dunlap, N., Bailey, W. C.: A decision tree for tuberculosis contact investigation. Am J Respir Crit Care Med 166 (2002) 1122-7
[9] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2001)
[10] Duhamel, A., Nuttens, M. C., Devos, P., Picavet, M., Beuscart, R.: A preprocessing method for improving data mining techniques. Application to a large medical diabetes database. Stud Health Technol Inform 95 (2003) 269-74
[11] McLachlan, G. J., Krishnan, T.: The EM Algorithm and Extensions. John Wiley & Sons (1997)
[12] Silva Cardoso, E., Blalock, K., Allen, C. A., Chan, F., Rubin, S. E.: Life skills and subjective well-being of people with disabilities: a canonical correlation analysis. Int J Rehabil Res 27 (2004) 331-4
[13] Antoniadis, A., Lambert-Lacroix, S., Leblanc, F.: Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19 (2003) 563-70
[14] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39 (1977) 1-38
[15] Vogel, R. L., Ackermann, R. J.: Is primary care physician supply correlated with health outcomes? Int J Health Serv 28 (1998) 183-96
[16] Dunlap, W., Landis, R.: Interpretations of multiple regression borrowed from factor analysis and canonical correlation. J Gen Psychol 125 (1998) 397-407
[17] Thompson, B.: Canonical Correlation Analysis: Uses and Interpretation. Sage, Thousand Oaks, CA (1984)
[18] SPSS Inc.: SPSS for Windows. SPSS Inc. (2001)


[19] Pavlopoulos, S. A., Stasis, A. C., Loukis, E. N.: A decision tree-based method for the differential diagnosis of aortic stenosis from mitral regurgitation using heart sounds. Biomed Eng Online 3 (2004) 21
[20] Luo, Y., Lin, S.: Information gain for genetic parameter estimation with incorporation of marker data. Biometrics 59 (2003) 393-401
[21] Zorman, M., Eich, H. P., Stiglic, B., Ohmann, C., Lenic, M.: Does size really matter - using a decision tree approach for comparison of three different databases from the medical field of acute appendicitis. J Med Syst 26 (2002) 465-77
[22] Esposito, F., Malerba, D., Semeraro, G., Kay, J.: A comparative analysis of methods for pruning decision trees. IEEE Trans Pattern Anal Mach Intell 19 (1997) 476-491
[23] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco (2000)
[24] Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. International Joint Conference on Artificial Intelligence (1995) 1137-1145
[25] Delen, D., Walker, G., Kadam, A.: Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med, in press (2004)
[26] Pfaff, M., Weller, K., Woetzel, D., Guthke, R., Schroeder, K., Stein, G., Pohlmeier, R., Vienken, J.: Prediction of cardiovascular risk in hemodialysis patients by data mining. Methods Inf Med 43 (2004) 106-113
[27] Babic, A.: Knowledge discovery for advanced clinical data management and analysis. Stud Health Technol Inform 68 (1999) 409-13
[28] Cios, K. J., Moore, G. W.: Uniqueness of medical data mining. Artif Intell Med 26 (2002) 1-24
[29] Myrtveit, I., Stensrud, E., Olsson, U. H.: Analyzing data sets with missing data: an empirical evaluation of imputation methods and likelihood-based methods. IEEE Trans Softw Eng 27 (2001) 999-1013
