An Experimental Comparison of Data Mining Techniques for Paleoecology.


Mritula Chandrasekaran
Department of Computer Science,
Christ University,
Bangalore, India
[email protected]


Ramesh A
Department of Computer Science,
Christ University,
Bangalore, India
[email protected]


Chandra J
Associate Professor,
Department of Computer Science,
Christ University,
Bangalore, India
[email protected]


Abstract- Classification is a data mining technique that generates models assigning items to target classes. These models act as a source for knowledge discovery and various analyses. Paleoecology, a subfield of archaeology, deals with the study of fossil organisms and their past environments. This paper aims to classify fossil faunal species data from the tDAR (The Digital Archaeological Record) repository. Classification models are generated using several machine learning algorithms, the performance of the chosen algorithms is analysed, and the algorithms appropriate for the considered data set are identified. The experiment shows that the decision tree classifier (J48) is the best fit for the considered data set.
Keywords- Naive Bayes, Machine Learning, J48, Paleoecology, Fauna
I. INTRODUCTION
Dodd and Stanton describe paleoecology as the study of the interactions of organisms with one another and with their environment in the geologic past; specifically, it is the branch of biology that studies ecosystems of the past on the basis of fossils and subfossils [1]. It also comprises the study of fossil organisms and their associated remains, which in turn assists in understanding the species' life, environment, reproduction, and mode of death and burial. The aim of paleoecology is to build a plausible model of the life environment of organisms of the ancient past that exist as fossils in the present. Excavated fossil documentation has been studied to understand the relationship animals had with their environment and to compare the current state of biodiversity with that of the ancient past. Sahney, Benton and Ferry note that a close link can be determined between vertebrate taxonomic and ecological diversity, which assists in exploring the details of several species and the habitats in which they would have survived [2]. Therefore, classifying paleoecology data can help reveal the major groups of organisms excavated and the closely related attributes they share.
Machine learning is a multidisciplinary field with footprints in major arenas such as biology, neuroscience, finance and accounting. This paper focuses on applying machine learning techniques to discover patterns in a paleoecology data set. Classification plays a major role in segregating data under specific objectives so that useful information can be extracted. For a data set of extinct faunal species, taxonomies can help in studying the various classes of animals that existed in the historical past, the threat categories associated with them and their modes of extinction. Numerous species have become extinct for reasons such as ageing processes, environmental changes, natural calamities and human hunting. To enable study of these factors, several machine learning algorithms can be used for classification; the results obtained can be compared and the best fit chosen for further analysis and investigation.
II. RELATED WORK
There are many works in the literature that discuss applications of data mining. Machine learning enhances the data mining process by enabling a learning model to be built from training data; the model can then serve as a basic component for categorizing new test data. Data mining focuses on streamlining the data and extracting knowledge that serves as useful learning and enables various valuable outcomes. Two main goals can be recognized in the data mining process: a description objective, concerned with identifying the influential variables and their effects, and a prediction objective, which is the more important of the two. Jiawei Han and Kamber note that various techniques can be used to discover patterns hidden in large data sets, while the process depends on the feasibility of the outcome, the usefulness and effectiveness of the task, and the related scalability [11]. Since data in any field accumulate over time, constructing a prototype of knowledge over the data helps in handling incoming information and thus smooths the processing of similar data in the future.
Tom Mitchell describes machine learning as the technical field that asks how to construct systems that automatically improve with experience [12]. Approaches implemented with machine learning algorithms assist in building such systems, and classification techniques contribute to discovering models or patterns over existing data. Kothari and Keskar suggest that the model thus acquired can be utilized for future predictions and for categorizing newly arriving data [9]. Various algorithms and techniques can be used to build such models; specific algorithms prove appropriate for specific data sets, which can be gauged with the help of various attributes and analysis methods.
The Waikato Environment for Knowledge Analysis (Weka) is a Java-based machine learning tool with GUI features. Frank et al. describe the tool's machine learning workbench as a general-purpose environment for automatic feature selection, classification, clustering, regression and many other functions [3]. These features support the implementation of various classifiers. Using classification, automatic labelling can also be performed for example sets that carry no label; Witten and Frank state that labels drawn from a finite domain can be used to generate a procedure for labelling unseen samples [7]. In paleoecology, as more excavations take place and more elements and their parts are recovered, labelling them with the relevant class or species titles can be made easier using machine learning implementations.
Kothari and Keskar state that an intelligent system deals with the nature of inferential mechanisms and with how computer programs discover and produce inferences [9]. Applying intelligence or building intelligent systems is not confined to robotics, programming or cloud computing alone; such systems can also assist work in paleoecology and the researchers exploring the details of organisms and their interactions with the environment. Analysis of these data offers a window into past life and informs research on the ancient world.
IV. METHODOLOGY
For the supervised data set considered for analysis, certain types of classifiers are both simple and fast. Considering the nature of the data set and the expected outcomes, four algorithms are selected and the values obtained by executing them are compared. The algorithms considered are the tree-based classifier J48, the rule-based classifier OneR, the regression-based classifier ClassificationViaRegression and the Bayes classifier Naive Bayes.
The J48 classifier works on the basis of the decision tree algorithm. Coyte et al. describe a decision tree as a predictive machine learning model that calculates the target value (the dependent variable) of a test instance from the attribute values learned from the training data set [13]. J48 first constructs a decision tree from the attribute values of the training data: the algorithm scans the data set and, at each step, detects the attribute that best discriminates between the instances. When a scan over the instances reaching a branch finds no remaining ambiguity about the target value, the branch is terminated and assigned that target value.
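As a minimal illustration only (the paper does not list its J48 settings; the ARFF file name and the choice of class column below are assumptions), a pruned J48 tree can be configured and trained through the Weka Java API roughly as follows:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Sketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the pre-processed faunal sheet.
        Instances data = DataSource.read("fauna_cleaned.arff");
        data.setClassIndex(data.numAttributes() - 1);   // assume the species class is the last column

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);   // pruning confidence (Weka's default)
        tree.setMinNumObj(2);              // minimum number of instances per leaf
        tree.buildClassifier(data);        // induces and prunes the decision tree
        System.out.println(tree);          // prints the induced tree
    }
}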
The One Rule algorithm, referred to in short as OneR, is a basic yet precise classification algorithm that creates one rule for every single predictor in the data and then picks the rule with the smallest total error as its "one rule". OneR works by creating a frequency table of each predictor's values against the target and deriving a rule from it. The total error values are determined from the frequency tables; a lower error indicates a higher contribution towards the predictability of the model.
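To make the frequency-table idea concrete, the sketch below (plain Java over a tiny hypothetical example, not the tDAR sheet) tallies class counts per attribute value, scores each attribute by the number of instances its majority-class rule would misclassify, and keeps the attribute with the smallest total error:

import java.util.HashMap;
import java.util.Map;

public class OneRSketch {
    public static void main(String[] args) {
        // Toy data: two nominal predictors (size, group) and a class label per instance.
        String[][] attrs = {
            {"large", "mammal"}, {"large", "mammal"}, {"small", "bird"},
            {"small", "mammal"}, {"large", "bird"}
        };
        String[] labels = {"Canis", "Canis", "Accipitridae", "Canis", "Accipitridae"};

        int bestAttr = -1, bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < attrs[0].length; a++) {
            // Frequency table: attribute value -> (class -> count).
            Map<String, Map<String, Integer>> freq = new HashMap<>();
            for (int i = 0; i < labels.length; i++) {
                freq.computeIfAbsent(attrs[i][a], k -> new HashMap<>())
                    .merge(labels[i], 1, Integer::sum);
            }
            // Total error = instances not covered by the majority class of their value.
            int errors = 0;
            for (Map<String, Integer> classCounts : freq.values()) {
                int total = 0, majority = 0;
                for (int c : classCounts.values()) { total += c; majority = Math.max(majority, c); }
                errors += total - majority;
            }
            if (errors < bestErrors) { bestErrors = errors; bestAttr = a; }
        }
        System.out.println("OneR picks attribute " + bestAttr + " with " + bestErrors + " errors");
    }
}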
Several machine learning systems address regression problems, and applying regression techniques to classification problems can work well when complex data sets are encountered. Linear regression and logistic regression fit higher-order functions whose outputs vary continuously; the classifier works by discretizing the continuous target into a set of intervals and classifying instances on the basis of the fitted function. Using regression also makes it easier to produce predictions based on the classified data.
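Weka packages this idea as the ClassificationViaRegression meta classifier, which binarizes the nominal class and fits one regression model per class value, assigning a new instance to the class whose model produces the largest output. A hedged usage sketch (the base learner, file name and class column are assumptions; Weka's default base learner is the M5P model tree):

import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.ClassificationViaRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionClassifierSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("fauna_cleaned.arff");   // hypothetical ARFF export
        data.setClassIndex(data.numAttributes() - 1);              // assumed species column

        ClassificationViaRegression cvr = new ClassificationViaRegression();
        cvr.setClassifier(new LinearRegression());   // linear base regressor chosen for illustration
        cvr.buildClassifier(data);                   // fits one regression model per class value
        System.out.println(cvr);
    }
}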
The Bayes classifier Naive Bayes works on the basis of Bayes' rule of conditional probability. It considers all the attributes contained in the data set and analyses them individually, assuming they are equally important and independent of each other. When a new instance is to be classified, the classifier considers each attribute in the data set separately. The rule of thumb of the classifier is the assumption that every single attribute acts independently of the others in both the training and the test data set.
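In standard notation (not quoted from the paper), for a class C and attribute values a_1, ..., a_n this independence assumption lets the classifier score each class by the product of the class prior and the per-attribute likelihoods:

P(C \mid a_1, \ldots, a_n) \;\propto\; P(C) \prod_{i=1}^{n} P(a_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(a_i \mid C).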
The experiment compares the performance of the four chosen algorithms on the considered faunal paleoecology data set. The interpretation of the results identifies the best-fit algorithm for this data set and suggests a scope for applying that algorithm to similar data sets across various domains.
V. EXPERIMENTAL ANALYSIS
The classification analysis is based on a data set taken from tDAR in April 2014. The faunal (FAUNAL) data sheet contains data on the excavated faunal remains recovered during SSI's testing and data recovery projects (SSI Project No. 97-02) at Pueblo Grande, Unit 11, documented during 1997 [4]. From the data sheet, the documented information about each specimen's species designation, anatomical data, condition, identified modifications and size is considered for analysis, and classification is attempted on the species type. Information about species such as Canis sp. (mammals) and Accipitridae genus indeterminate (birds) and their corresponding sizes (large or small) is classified, and the equivalent anatomical data such as element types and mode of extinction are studied. Classification further helps in building knowledge about the conditions in which the species were excavated, the various element types discovered during excavation and the groups to which they may belong.
The data set was studied in detail and refined to make it suitable for analysis. The excavation records were pre-processed, analysed and rearranged in order to understand the characteristics of the species and to build species classifier models. The original data set contained 151 rows and 36 columns. Columns such as FRONT_HIND and PROXIMAL_DISTAL had more than 90% missing values and were dropped.
The remaining missing data were examined for the nearest relevant value: where such a value was found, the missing entry was replaced with it; where no relevant value was found, the entry was discarded to keep the analysis clean.
The resulting data set had 151 rows and 15 columns, with the species column used as the class attribute for classification.
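As a hedged illustration only (the paper's replacement step used manual nearest-relevance matching rather than a stock filter, and the file name is a placeholder), the column pruning and a simple stand-in for the missing-value handling can be approximated with the Weka API as follows:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("faunal_raw.arff");   // hypothetical export of the 151 x 36 sheet

        // Collect columns that are more than 90% missing (e.g. FRONT_HIND, PROXIMAL_DISTAL).
        StringBuilder drop = new StringBuilder();
        for (int i = 0; i < raw.numAttributes(); i++) {
            int missing = 0;
            for (int j = 0; j < raw.numInstances(); j++) {
                if (raw.instance(j).isMissing(i)) missing++;
            }
            if ((double) missing / raw.numInstances() > 0.9) {
                if (drop.length() > 0) drop.append(",");
                drop.append(i + 1);                            // Remove uses 1-based attribute indices
            }
        }
        Instances reduced = raw;
        if (drop.length() > 0) {
            Remove remove = new Remove();
            remove.setAttributeIndices(drop.toString());
            remove.setInputFormat(raw);
            reduced = Filter.useFilter(raw, remove);
        }

        // Stand-in for the paper's nearest-relevance replacement: fill remaining gaps
        // with the attribute's mode (nominal) or mean (numeric).
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(reduced);
        Instances cleaned = Filter.useFilter(reduced, fill);
        System.out.println(cleaned.numInstances() + " x " + cleaned.numAttributes());
    }
}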
Machine learning algorithms are applied to the pre-processed data in order to classify the excavated species. The data set is classified using Weka 3.6, which assists in implementing several machine learning algorithms.
The paleoecology data set of excavated species is tested with the Bayes classifier, the pruned classification tree (J48) and the rule- and regression-based models described above. Building a classifier model enhances the ability to predict future data. Classifying the data set and building a classification model on it raises the question, "Will the machine learning classifier label upcoming data of this kind accurately and automatically?" Applying machine learning techniques to such a database enhances the discovery of valuable knowledge.
The various measures of the classifier output are analysed in order to compare the algorithms and identify the one best suited to the considered data set.
The correctly classified instances show the number and percentage of test instances that were classified correctly; this measure indicates the strength of the classifier on the test instances and its likely accuracy relative to the training set. The kappa statistic measures agreement with the true class; Viera and Garrett define kappa as a quantitative measure of the magnitude of agreement between observers [6].
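In the standard formulation (not quoted from the paper), with p_o the observed agreement between the predicted and true classes and p_e the agreement expected by chance,

\kappa = \frac{p_o - p_e}{1 - p_e},

so that a kappa of 1.0 indicates complete agreement and a kappa of 0 indicates agreement no better than chance.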
The classification results are also interpreted in terms of error: the mean absolute error, the root mean squared error, and the relative absolute error and root relative squared error in percent are observed.
VI. OBSERVATION
The performance of the decision tree, rule-based, regression-based and Bayes classifiers is tested on the considered paleoecology data set.
VI.1 Accuracy
To investigate classifier performance on the paleoecology data set, 10-fold cross validation is used for every classifier. The results of the experiments under various parameters are presented in the tables below, indicating the percentage of correctly classified instances, the kappa statistic and the time taken in seconds.
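A minimal sketch of an evaluation harness that would produce the measures in Table 1, assuming the pre-processed sheet has been exported to ARFF with the species designation as the last (class) attribute; the file name, class column and random seed are assumptions, not taken from the paper:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.ClassificationViaRegression;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("fauna_cleaned.arff");  // hypothetical 151 x 15 data set
        data.setClassIndex(data.numAttributes() - 1);             // assumed species column

        Classifier[] models = {
            new J48(), new OneR(), new ClassificationViaRegression(), new NaiveBayes()
        };
        for (Classifier model : models) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));  // 10-fold cross validation
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%-30s accuracy=%.2f%% kappa=%.2f time=%dms%n",
                    model.getClass().getSimpleName(), eval.pctCorrect(), eval.kappa(), elapsed);
        }
    }
}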




TABLE 1: CLASSIFICATION RESULTS

Algorithm       Correctly Classified Instances (%)    Kappa statistic    Time Taken (sec)
J48             81.33                                 0.79               0.05
OneR            73.33                                 0.70               0.02
Regression      60.67                                 0.56               0.58
Naive Bayes     54.00                                 0.49               0.00

From Table 1 it can be inferred that the classification algorithm with pruning gives the highest accuracy. The time taken by J48 is 0.05 seconds and approximately 81% of the instances are correctly classified. For the kappa statistic (kappa coefficient), a value of 1.0 signifies complete agreement with the true class; J48 achieves a kappa of 0.79, which is close to one and therefore highly acceptable.
Next to J48, the rule-based OneR algorithm is considered. Examining its accuracy, 73% of the instances are correctly classified in 0.02 seconds, which leaves an appreciable number of correctly classified instances for building the classifier model. The kappa statistic for OneR is 0.70, which is also reasonably close to one and within an acceptable range.
Building a classifier with the Classification via Regression algorithm, 60.67% of the instances are correctly classified in 0.58 seconds, with a kappa coefficient of 0.56. These values are somewhat lower than those of J48 and OneR on the given data set.
Interpreting the accuracy of the model built with Naive Bayes, an implementation of Bayes' theorem, it is evident that this algorithm does not hold up well on the considered paleoecology data set. Naive Bayes correctly classifies 54% of the instances, while the remaining 46% are incorrectly classified; its kappa statistic is only 0.49, which is not close to one.


Figure 1: Classification accuracy among classifiers.
Figure 1 is constructed from the data in Table 1. The graph illustrates the accuracy percentages of the various classifiers. It is evident that the decision tree algorithm J48 gives the most accurate classification, while the Bayes classifier is the least accurate on the paleoecology data.
VI.2 Training Errors

TABLE 2: TRAINING ERRORS

Algorithm       Mean Absolute Error    Root Mean Squared Error    Relative Absolute Error (%)    Root Relative Squared Error (%)
J48             0.01                   0.09                       20.47                          56.73
OneR            0.01                   0.13                       28.76                          76.10
Regression      0.04                   0.14                       65.66                          79.54
Naive Bayes     0.03                   0.16                       51.11                          90.36

Optimizing the classification rate without considering the cost of the errors often leads to strange results, and the success of numeric predictions depends on evaluating the various possible errors. The error values are calculated by predicting the classes' prior probabilities using a Laplace estimator over the training data set. Witten and Frank note that the mean squared error exaggerates the effect of outliers (instances whose prediction error is larger than the others), whereas the absolute error does not: all sizes of error are treated evenly according to their magnitude [7].
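With p_i the predicted value, a_i the actual value and \bar{a} the mean of the actual values over the n test instances, the four reported error measures follow the standard definitions used by Weka (notation assumed, not quoted from the paper):

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2},

\mathrm{RAE} = \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|} \times 100\%, \qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}} \times 100\%.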
TABLE 3: ACCURACY PARAMETERS

Algorithm       TP Rate    FP Rate    F-Measure    ROC Area
J48             0.81       0.01       0.78         0.93
OneR            0.73       0.03       0.64         0.97
Regression      0.61       0.61       0.56         0.79
Naive Bayes     0.54       0.05       0.51         0.81

Sensitivity and specificity are based on the TP rate and the FP rate. Sensitivity is the probability that the test declares an event present when it actually is present, and thus corresponds to the true positive rate. Specificity is the probability that the test declares an event absent when it actually is absent, and corresponds to the true negative rate; a higher true negative rate implies more accurate classification. On these measures, the pruned tree algorithm and the OneR algorithm show the higher accuracy values.
Musicant, Kumar and Ozgur describe the F-measure as a relevant goal in any machine learning scenario where data from one class are present in much greater quantities than data from the other class [8].
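In terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), the rates and the F-measure reported in Table 3 follow the standard definitions:

\mathrm{TP\ Rate} = \frac{TP}{TP + FN}, \qquad
\mathrm{FP\ Rate} = \frac{FP}{FP + TN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},

where \mathrm{Precision} = TP / (TP + FP) and \mathrm{Recall} is the TP rate.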


Figure 2: ROC area of the classifiers.

The Receiver Operating Characteristic (ROC) area is one of the most widely adopted measures of accuracy in machine learning. The ROC values illustrated in Figure 2 correspond to the accuracy levels of the classifiers: the closer the value is to one, the higher the accuracy. From Table 3 and Figure 2 it can be inferred that the ROC values of all the considered algorithms are fairly close to one.
VII. CONCLUSION
Research in paleoecology using data mining and machine learning techniques is still evolving. Further analysis and new approaches to applying machine learning to excavated species data sets can assist current research, enabling practical applications of the techniques in both prediction and description, the basic needs of any analysis. The study also demonstrates that paleoecology combined with machine learning algorithms has considerable potential for the classification of faunal remains from Pueblo Grande.
The species recorded in the data set are one of the main focuses of researchers, and classifying the records on this attribute helps them group future excavations. From the considered algorithms and their various features and implementation rules, it can be concluded that the decision tree algorithm J48 is the better fit for building classifier models, delivering appreciable outcomes.
This line of data mining research with machine learning algorithms on paleoecology data will be continued with another technique, clustering, and a parallel analysis of excavated floral data sets is also planned. The further research will assist in deciding which algorithms fit similar data types from diverse categories.



REFERENCES
[1] Dodd, J. Robert, and Robert J. Stanton. Paleoecology: Concepts and Applications. John Wiley & Sons, 1990.
[2] Sahney, S., Benton, M. J., and Ferry, P. A. "Links between global taxonomic diversity, ecological diversity and the expansion of vertebrates on land." 2010.
[3] Frank, Eibe, Mark Hall, Len Trigg, Geoffrey Holmes, and Ian H. Witten. "Data mining in bioinformatics using Weka." Bioinformatics 20, no. 15 (2004): 2479-2481.
[4] tDAR data set, http://core.tdar.org/dataset/366759
[5] WEKA, http://www.cs.waikato.ac.nz/~ml/weka
[6] Viera, Anthony J., and Joanne M. Garrett. "Understanding interobserver agreement: the kappa statistic." Fam Med 37, no. 5 (2005): 360-363.
[7] Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[8] Musicant, David R., Vipin Kumar, and Aysel Ozgur. "Optimizing F-measure with support vector machines." In FLAIRS Conference, pp. 356-360. 2003.
[9] Kothari, A., and A. Keskar. "Rough set approach for overall performance improvement of an unsupervised ANN-based pattern classifier." Journal on Advanced Computational Intelligence and Intelligent Information 13, no. 4 (2009): 434-440.
[10] Rusu, L., and V. P. Brefelean. "Management prototype for universities." Annals of the Tiberiu Popoviciu Seminar, International Workshop in Collaborative Systems, vol. 4, pp. 287-295. Mediamira, Cluj-Napoca, Romania, 2006.
[11] Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2001.
[12] Mitchell, Tom M. The Discipline of Machine Learning. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.
[13] Coyte, James L., Boyuan Li, Haiping Du, Weihua Li, David Stirling, and Montserrat Ros. "Decision tree assisted EKF for vehicle slip angle estimation using inertial motion sensors." In Neural Networks (IJCNN), 2014 International Joint Conference on, pp. 940-946. IEEE, 2014.
[14] Cufoglu, Ayse, Mahi Lohi, and Kambiz Madani. "A comparative study of selected classifiers with classification accuracy in user profiling." In Computer Science and Information Engineering, 2009 WRI World Congress on, vol. 3, pp. 708-712. IEEE, 2009.
[15] Ahmad, Nor Bahiah Hj, and Siti Mariyam Shamsuddin. "A comparative analysis of mining techniques for automatic detection of student's learning style." In Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on, pp. 877-882. IEEE, 2010.
[16] Gansell, Amy Rebecca, Irene K. Tamaru, Aleks Jakulin, and Chris H. Wiggins. "Predicting regional classification of Levantine ivory sculptures: a machine learning approach." arXiv preprint arXiv:0806.4642 (2008).
[17] Barceló, Juan A. "Computational intelligence in archaeology. A state-of-the-art." Expert Systems With Applications 25 (2003): 155-164.
[18] Kilany, Rania M. "Efficient Classification and Prediction Algorithms for Biomedical Information." (2013).
[19] Yadav, Arvind R., R. S. Anand, M. L. Dewal, and Sangeeta Gupta. "Analysis and classification of hardwood species based on Coiflet DWT feature extraction and WEKA workbench." In Signal Processing and Integrated Networks (SPIN), 2014 International Conference on, pp. 9-13. IEEE, 2014.








