Classification cost: An empirical comparison among traditional classifier, Cost-Sensitive Classifier, and MetaCost


Expert Systems with Applications 39 (2012) 4013–4019


Jungeun Kim a,1, Keunho Choi b,2, Gunwoo Kim c,*, Yongmoo Suh b,2

a Hyundai Autoever Corp., 576 Sam-dong, Uiwang-si, Gyeonggi-Do 437-070, Republic of Korea
b Business School, Korea University, Anam-dong 5-Ga, Sungbuk-Gu, Seoul 136-701, Republic of Korea
c Department of Business Administration, College of Business and Economics, Hanbat National University, San 16-1, Dukmyung-dong, Yuseong-Gu, Daejeon 305-719, Republic of Korea

Keywords: Fraud detection; Cost-sensitive learning; Cost-Sensitive Classifier; MetaCost

Abstract

Loan fraud is a critical factor in the insolvency of financial institutions, so companies strive to reduce losses from fraud by building models for proactive fraud prediction. However, two critical problems remain unresolved in fraud detection: (1) the lack of cost sensitivity between type I and type II errors in most prediction models, and (2) the highly skewed class distribution of the datasets used for fraud detection, owing to the sparseness of fraud-related data. The objective of this paper is to examine whether classification cost is affected both by the cost-sensitive approach and by the skewed class distribution. To that end, we compare the classification costs incurred by a traditional cost-insensitive classification approach and by two cost-sensitive classification approaches, Cost-Sensitive Classifier (CSC) and MetaCost. Experiments were conducted on a credit loan dataset from a major financial institution in Korea, while varying the class distribution of the dataset and the number of input variables. The experiments showed that the lowest classification cost was incurred with the MetaCost approach and when non-fraud data and fraud data were balanced. In addition, the dataset that includes all delinquency variables was shown to be the most effective in reducing the classification cost.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Most domestic and international financial institutions suffer significant losses from loan fraud. According to Mortgage Banking (2006),3 financial institutions normally lose approximately 35% of total loan principal, and almost 20% of the principal of loan products sold to the secondary financial market, apart from commissions and other transaction costs, as a result of loan fraud. The Federal Bureau of Investigation (2009)4 in the US reported that the number of commercial loan fraud accounts increased from 2126 in 2005 to 4514 in 2009, and that the amount of loss grew from 711 million dollars in 2005 to 1698 million dollars in 2009. In addition, the means of borrowing money fraudulently from a financial institution are becoming more diverse. For example, loan fraudsters may submit loan applications that include fake contracts for building leases or tax bills, thereby establishing a paper company.

* Corresponding author. Tel.: +82 42 821 1290; mobile: +82 10 5274 2138; fax: +82 42 821 1288. E-mail addresses: [email protected] (J. Kim), [email protected] (K. Choi), [email protected] (G. Kim), [email protected] (Y. Suh).
1 Tel.: +82 2 6296 6359; mobile: +82 10 9064 8301.
2 Tel.: +82 2 3290 1945; fax: +82 2 922 7200.
3 http://www.mortgagebankingmagazine.com/files/2006articleindex.pdf
4 http://www.fbi.gov/publications/financial/fcs_report2009/financial_crime_2009.htm
doi:10.1016/j.eswa.2011.09.071

Another typical type of loan fraud is broker-facilitated: most such frauds are perpetrated by a small number of brokers, but the amount of loss from them is considerable. These diverse methods of perpetrating loan fraud and the rapid growth in the amount of loss cause serious financial damage and arouse deep distrust of financial institutions, ultimately driving them toward insolvency. Therefore, in order to manage financial risks successfully, financial institutions should reinforce their qualifications for lending and consider building sophisticated prediction models for proactive detection of loan fraud (Desai, Crook, & Overstreet, 1996; Malhotra & Malhotra, 2002; Wheeler & Aitken, 2000). However, existing loan fraud detection models leave something to be desired. First, most traditional classification models have used hit ratio as a measure of classification accuracy, but hit ratio takes into account only the number of cases that are correctly classified and assumes that the costs of misclassification resulting from type I and type II errors are equal. Since they are not equal in many real-world classification problems, hit ratio does not represent the performance of classification models accurately (Ballis, Falaschi, Ferri, Hernandez-Orallo, & Ramirez-Quintana, 2003; Ciraco, Rogalewski, & Weiss, 2005; Jiang & Cukic, 2009; Ling, Sheng, & Yang, 2006; Lu & Wang, 2008; Zadrozny, 2005).


For example, the cost of predicting a cancer patient to be a non-cancer patient is clearly different from the cost of the opposite prediction. The results of prediction models that do not consider classification cost can mislead financial institutions into making incorrect decisions and thus incurring significant financial loss. So far, few studies have taken the importance of classification cost into account when building prediction models (Domingos, 1999; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, & Madhavji, 2006; Witten & Frank, 2005). Second, the class distribution of the datasets used for fraud detection is highly skewed because the ratio of fraud data to non-fraud data is considerably imbalanced, which may cause poor performance in detecting loan fraud (Domingos, 1999; Ghosh & Reilly, 1994; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy, Zabar, & Weiss, 2005; Phua, Alahakoon, & Lee, 2004).

The objective of this paper is to examine whether classification cost is actually affected both by the cost-sensitive approach and by the skewed distribution of class. We hypothesize that (1) a cost-sensitive approach will result in less classification cost than the traditional, cost-insensitive approach, and (2) adjusting the ratio of fraud data to non-fraud data will reduce the classification cost. To test these hypotheses, we compare the classification costs incurred by a traditional approach and by two cost-sensitive approaches, Cost-Sensitive Classifier (CSC) (Witten & Frank, 2005) and MetaCost (Domingos, 1999). Experiments are conducted with a credit loan dataset from a major financial institution in Korea, while varying the class distribution of the dataset and the number of input variables for each approach.

The rest of this paper is organized as follows. Section 2 reviews the literature on data mining techniques for fraud detection and on cost-sensitive learning. Section 3 explains our method, including the research framework and model building. Section 4 presents the experimental design, including the dataset, preprocessing, feature selection, and adjustment of the class distribution. Section 5 compares the experimental results and discusses their implications. The last section concludes.

2. Literature review

2.1. Fraud detection research using data mining techniques

Fraud detection is an important issue in many domains, including credit loan (Desai et al., 1996; Malhotra & Malhotra, 2002; Wheeler & Aitken, 2000), credit card (Brause, Langsdorf, & Hepp, 1999; Chung & Suh, 2009; Ghosh & Reilly, 1994; Stolfo, Fan, Lee, Prodromidis, & Chan, 1997; Yeh & Lien, 2009), telecommunication (Almeida et al., 2008; Estevez, Held, & Perez, 2006; Fawcett, 2006), and insurance (Phua et al., 2004; Viaene, Dedene, & Derrig, 2005). Several studies have attempted to solve the fraud detection problem using data mining techniques; this section reviews them.

In the credit loan domain, several studies have addressed issues related to fraud detection. Desai et al. (1996) explored the accuracy rate of credit scoring models using the personal loan information of three credit unions in the US. They empirically compared the performance of various data mining techniques such as Logistic Regression, Discriminant Analysis and two types of Neural Network (i.e., Multilayer Perceptron and Modular Neural Network), and reported that different techniques performed differently in predicting different groups. Wheeler and Aitken (2000) described an adaptive Case-Based Reasoning (CBR) technique for reducing the number of fraud investigations during the credit approval process. Their results showed that multi-algorithmic CBR had the highest accuracy rate, but other adaptive CBR techniques performed similarly in comparable problem areas.

Malhotra and Malhotra (2002) used the loan information of credit unions in the US to build a consumer loan assessment model utilizing the Adaptive Neuro-Fuzzy Inference System (ANFIS) and the Back Propagation Network (BPN). Their model showed better performance than Multiple Discriminant Analysis.

In the credit card domain, Ghosh and Reilly (1994) constructed a fraud detection system based on a Neural Network. The training set consisted of fraudulent credit card transactions and a sample of good credit card transactions, roughly in the ratio of 30 good accounts to one fraud account. Their system improved both the accuracy and the timeliness of fraud detection compared to the systems in previous research. Stolfo et al. (1997) conducted experiments with meta-learning techniques to learn models for detecting fraudulent credit card transactions. They argued that, in the fraud detection domain, the fraud-catching rate (true positive) and the false-alarm rate (false positive) are better measures than overall accuracy when evaluating the performance of fraud detection models. The meta-classifier using a Bayesian Network as the meta-learner generated the best performance. Brause et al. (1999) applied the Association Rule technique to a categorical dataset (to obtain rules from all misused credit card transactions) and a Neural Network to a numeric dataset for prediction, and showed that the combined technique performed better than either one alone. Chung and Suh (2009) used classification models built with a Neural Network (NN) and a Decision Tree (DT) to classify customers into three groups of credit card delinquents (i.e., good, bad or potentially good). They then constructed NN and DT models to estimate the utility of individual credit card delinquents and demonstrated that the classification model with the best hit rate does not necessarily yield the best utility value. Yeh and Lien (2009) utilized K-Nearest Neighbor, Logistic Regression, Naïve Bayesian, and Neural Network models to predict the probability of credit card default and found that the Neural Network performed best among them.

In the telecommunication domain, Fawcett (2006) utilized JRip, a rule-learning classifier, to detect fraudulent phone calls, and suggested ROC graphs with instance-varying costs for estimating performance. Estevez et al. (2006) proposed a system using fuzzy rules and Neural Network models to prevent subscription fraud in fixed telecommunications. Their system identified 3.5% of subscribers as potentially fraudulent, a group that contained 56.2% of the true fraudsters. Almeida et al. (2008) devised a system that helps fraud analysts make decisions faster and more accurately using CBR.

In the insurance domain, Phua et al. (2004) used meta-learning techniques, applying algorithms such as the back-propagation neural network, Naïve Bayesian, and C4.5 to an automobile insurance dataset in order to detect fraud while manipulating the class distribution. Their stacking-bagging method achieved the highest cost savings. To detect automobile insurance claim fraud, Viaene et al. (2005) compared Bayesian Networks with Logistic Regression and Decision Tree classifiers. These fraud detection studies are summarized in Table 1.

This review of previous fraud detection studies reveals that only a few have taken into account the difference in misclassification cost between type I and type II errors when building their classification or prediction models; most have assumed that the costs of the two errors are equal, even though they are not in real business domains.

Table 1. Fraud detection studies in various domains (Subject | Technique | Reference).

Credit loan
- Credit scoring | Logistic Regression, Discriminant Analysis, Multilayer Perceptron Network, Modular Neural Network | Desai et al. (1996)
- Credit approval process | Case-Based Reasoning | Wheeler and Aitken (2000)
- Credit loan assessment | Neural Network, Neuro-Fuzzy Inference System | Malhotra and Malhotra (2002)

Credit card
- Credit card fraud detection | Neural Network | Ghosh and Reilly (1994)
- Detecting fraudulent credit card transactions | Meta-Learning Technique using Bayesian Network | Stolfo et al. (1997)
- Credit card fraud detection | Association Rule, Neural Network | Brause et al. (1999)
- Utility value of individual credit card delinquents | Neural Network, Decision Tree | Chung and Suh (2009)
- Probability of default of credit card clients | K-Nearest Neighbor, Logistic Regression, Naïve Bayesian, Neural Network | Yeh and Lien (2009)

Telecommunication
- Fraudulent phone call detection | JRip | Fawcett (2006)
- Subscription fraud | Fuzzy Rule, Neural Network | Estevez et al. (2006)
- Fraud analysis | Case-Based Reasoning | Almeida et al. (2008)

Insurance
- Detecting fraudulent automobile insurance | Meta-Learning Technique using Neural Network, Naïve Bayesian, and C4.5; Stacking; Bagging | Phua et al. (2004)
- Fraud in automobile insurance claims | Bayesian Network, Logistic Regression, Decision Tree | Viaene et al. (2005)

2.2. Cost-sensitive learning

Recognizing that the misclassification costs of type I and type II errors can differ, several studies have proposed cost-sensitive learning (Bianca, John, & Naoki, 2003; Ciraco et al., 2005; Domingos, 1999; Fawcett, 2006; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy et al., 2005; Witten & Frank, 2005). This section introduces the theory of cost-sensitive learning, in particular CSC (Witten & Frank, 2005) and MetaCost (Domingos, 1999), and reviews previous studies related to them.

In classification problems with two classes, the classification cost can be represented in a cost matrix, shown in Table 2, where the two types of error, false positives (FP) and false negatives (FN), and the two types of correct classification, true positives (TP) and true negatives (TN), have different costs and benefits. In the table, Cost(i, j) represents the cost of classifying an instance belonging to class j into class i. Cost(0, 0) and Cost(1, 1) are usually considered a benefit (i.e., a negated cost), while the other two cases are considered a cost.

Table 2. An example of a cost matrix for the two-class case.

                   Predict negative   Predict positive
Actual negative    Cost(0, 0)         Cost(1, 0)
Actual positive    Cost(0, 1)         Cost(1, 1)

Given the cost matrix, an instance should be classified into the class that generates the minimum expected cost. The expected cost R(i|x) of classifying an instance x into class i can be expressed as

$$R(i \mid x) = \sum_{j} P(j \mid x)\,\mathrm{Cost}(i, j), \qquad (1)$$

where P(j|x) is the probability of class j being the actual class of instance x (Ling & Sheng, 2008). The classifier will then classify instance x into the positive class (assuming class 1 is positive) if and only if

$$P(0 \mid x)\,\mathrm{Cost}(1, 0) + P(1 \mid x)\,\mathrm{Cost}(1, 1) \le P(0 \mid x)\,\mathrm{Cost}(0, 0) + P(1 \mid x)\,\mathrm{Cost}(0, 1). \qquad (2)$$
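This decision rule is straightforward to implement once a classifier supplies class-membership probabilities. The following Python sketch (our own illustration, not code from the paper) computes R(i|x) of Eq. (1) for every class and returns the minimum-cost label, using for the example the cost matrix the paper later defines in Table 5 (unit: 10,000 won):

```python
import numpy as np

def min_expected_cost_labels(class_probs, cost):
    """class_probs: (n_samples, n_classes) array of P(j|x).
    cost: (n_classes, n_classes) matrix, cost[i][j] = Cost(i, j),
    i.e. the cost of predicting class i when the true class is j
    (negative entries are benefits, as in Table 2).
    Returns the label minimizing R(i|x) = sum_j P(j|x) * Cost(i, j)."""
    cost = np.asarray(cost)
    expected_cost = np.asarray(class_probs) @ cost.T  # [n, i] = R(i|x_n)
    return expected_cost.argmin(axis=1)

# 0 = non-fraud (negative class), 1 = fraud (positive class).
cost = [[-145.0, 645.0],    # predict non-fraud: TN benefit, FN cost
        [ 145.0, -645.0]]   # predict fraud:     FP cost,    TP benefit
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])
print(min_expected_cost_labels(probs, cost))  # prints [0 1]
```

With these costs, the rule flags an applicant as fraudulent whenever P(1|x) exceeds roughly 0.18, which illustrates how asymmetric costs shift the usual 0.5 decision threshold.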

CSC can convert traditional cost-insensitive classification models into cost-sensitive ones using the expected cost of Eq. (1) (Witten & Frank, 2005). One limitation, however, is that the underlying cost-insensitive classifier has to yield a class probability for each test instance, as a Naïve Bayes classifier does. Once the probability of each test instance has been obtained, Eq. (1) is used to predict the class label of the test examples.

MetaCost also turns traditional cost-insensitive classification models into cost-sensitive ones by utilizing Eq. (1) (Domingos, 1999), but it differs from CSC in that, while the classifiers used in CSC must estimate class probabilities themselves, MetaCost can make use of classifiers that do not produce class probabilities, learning the probabilities instead from multiple classifiers. In brief, MetaCost works as follows: (1) sample multiple bootstrap replicates of the training set; (2) learn a classifier on each replicate; (3) average the class probabilities yielded directly by the classifiers, or compute the fraction of votes received from the ensemble; (4) use Eq. (1) to relabel each training example; (5) reapply the classifier to the relabeled training set. The model learned by MetaCost is more robust and more generally applicable because of the ensemble approach.
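A minimal Python sketch of these five steps for the two-class case is given below. It is our own reconstruction with scikit-learn, not the paper's implementation (the paper used Weka), and it assumes every bootstrap replicate contains both classes:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def metacost_fit(base, X, y, cost, n_replicates=10, seed=0):
    """Simplified MetaCost (Domingos, 1999). X, y: numpy arrays;
    cost[i][j] = Cost(i, j) as in Table 2."""
    rng = np.random.RandomState(seed)
    n, cost = len(X), np.asarray(cost)
    probs = np.zeros((n, cost.shape[0]))
    for _ in range(n_replicates):            # steps (1)-(2): bootstrap + fit
        idx = rng.randint(0, n, size=n)
        model = clone(base).fit(X[idx], y[idx])
        probs += model.predict_proba(X)      # step (3): accumulate votes
    probs /= n_replicates                    # average class probabilities
    y_relabeled = (probs @ cost.T).argmin(axis=1)  # step (4): Eq. (1)
    return clone(base).fit(X, y_relabeled)   # step (5): refit on relabeled set

# Usage (illustrative): metacost_fit(DecisionTreeClassifier(), X, y, cost)
```

Note that step (5) fits the base learner on the original features with cost-minimizing labels, which is what lets MetaCost wrap classifiers that never expose probabilities at prediction time.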

There have been many studies taking a cost-sensitive learning approach, as summarized in Table 3. Bianca et al. (2003) proposed methods for converting classifier learning algorithms and classification theory into cost-sensitive algorithms and theory, based on cost-proportionate weighting of the training examples; a method based on cost-proportionate rejection sampling and ensemble aggregation achieved excellent predictive performance. Ciraco et al. (2005) considered the misclassification cost ratio of the two-class case in order to improve classifier utility. They varied the cost ratio in two stages, the learning stage and the testing stage, and found that the utility of a classifier can be maximized by applying different cost ratios in the two stages. McCarthy et al. (2005) noted that a classifier learned from a highly skewed class distribution usually predicts the majority class much more often than the minority class. Combining the cost-sensitive learning approach with two sampling techniques, up-sampling and down-sampling, they found no clear or consistent winner for maximizing classifier performance. Ling, Sheng, Bruckhaus, et al. (2006) suggested a data-mining solution for predicting the escalation risks of defects in order to aid human experts in the review process of software development. The distinct feature of their study is that, although the general objective of cost-sensitive learning is to minimize the expected cost, they studied the maximum net profit problem and showed that a cost-sensitive decision tree produced the highest positive net profit. To estimate the performance of classifiers, Fawcett (2006) proposed ROC graphs with Instance-Varying costs (ROCIV) and demonstrated the method on three domains; the results of ROCIV differed considerably from those of traditional ROC. In the development of software fault prediction models, Jiang and Cukic (2009) analyzed the benefits and misclassification costs of various techniques using project information from a public repository. They showed that cost-sensitive modeling does not outperform overall classification models in terms of hit ratio, but it minimizes the overall cost of misclassification.

It is interesting that, as this review shows, many studies have attempted to apply a utility mining approach to diverse problems. It seems worthwhile to examine other problems, in many application areas, for which the utility mining approach is the best fit.

Table 3. Previous research taking a cost-sensitive learning approach (Subject | Reference).

- Suggestion of MetaCost | Domingos (1999)
- Conversion methods by cost-proportionate example weighting | Bianca et al. (2003)
- Improvement of the utility of the classification model by altering the misclassification cost ratio | Ciraco et al. (2005)
- A comparison of two cost-sensitive learning approaches | McCarthy et al. (2005)
- Minimization of the total cost for test examples | Ling et al. (2006)
- Adapted ROC graphs accommodating individual-varying costs in the domain of fraudulent phone call detection | Fawcett (2006)
- Cost-sensitive classification of faulty modules in software development | Jiang and Cukic (2009)

3. Research method

As mentioned earlier, this paper aims to examine whether classification cost is actually affected both by the cost-sensitive approach and by the skewed distribution of class. This section explains our research framework and builds three classification models, each of which takes either the traditional cost-insensitive approach (i.e., a C4.5 decision tree) or a cost-sensitive approach (i.e., CSC or MetaCost).

3.1. Research framework

Fig. 1. Research framework.

As shown in Fig. 1, our proposed research framework consists of several steps. In the first step, we preprocess data acquired from a major loan company in Korea by removing outliers and missing data. In the second step, we select sets of features associated with delinquency. In the third step, we alter the class distribution by adjusting the sampling ratio of fraud data to non-fraud data to obtain nine different datasets (i.e., three datasets with different ratios for each of the three feature sets generated in the second step). In the fourth step, we first define a simple cost matrix with the help of domain experts and then build three different classification models using the matrix: one traditional cost-insensitive classification model using C4.5 and two cost-sensitive classification models using CSC and MetaCost. In the final step, we compare the resulting classification costs among those three models.

3.2. Building models using a cost matrix

Prior to building a cost-sensitive classification model, a cost matrix should be defined in order to measure the degree to which the model is cost-sensitive. Table 4 shows a matrix of simple formulas for calculating classification cost. Although many more factors would have to be taken into account to define classification cost in a real business, we defined the formulas as simply as possible, with help from the domain experts, in order to simplify the comparison of misclassification cost between the cost-sensitive and cost-insensitive models. In Table 4, fraudulent and non-fraudulent customers are regarded as the positive class and the negative class, respectively. If the model classifies a fraudulent customer (i.e., actual positive) as a non-fraudulent customer (i.e., predicted negative), the amount of the loan and its interest are incurred as the misclassification cost. If the model classifies a non-fraudulent customer (i.e., actual negative) as a fraudulent customer (i.e., predicted positive), the institution will not approve the loan request, and the misclassification cost is the interest that would have been earned on the loan. In the cases of correct classification, the corresponding profits are generated as negated misclassification costs.

To define the actual costs and benefits used to measure the cost-sensitivity of a classification model, we used the average credit loan amount (5 million Korean won) and the average interest rate (29%) of the company that provided our dataset. Table 5 shows the resulting cost matrix, where a cost is presented as a positive number and a benefit as a negative number. Referring to the cost matrix, we built two cost-sensitive classification models using CSC and MetaCost, and a cost-insensitive classification model using C4.5.5 We evaluated the effect of the cost-sensitive classification models in our experiments by comparing the classification costs of the three models. The next section provides a detailed description of the experiments.

5 To build these three models, we used the data mining tool Weka, version 3.6 (http://www.cs.waikato.ac.nz/ml/weka/).

Table 4. Formula matrix.

                   Predict negative                 Predict positive
Actual negative    -(Loan × interest rate)          Loan × interest rate
Actual positive    Loan + (Loan × interest rate)    -{Loan + (Loan × interest rate)}

Table 5. Cost matrix (unit: 10,000 Korean won).

                   Predict negative   Predict positive
Actual negative    -145               145
Actual positive    645                -645
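For concreteness, the Table 5 entries follow directly from the Table 4 formulas with the company's averages (a 5-million-won loan and a 29% interest rate). The sketch below (our own illustration; the paper built its models in Weka) reproduces the matrix in units of 10,000 won:

```python
def cost_matrix(loan, rate):
    """Build the 2x2 cost matrix of Tables 4 and 5.
    Convention: Cost(i, j) = cost of predicting class i for true class j,
    with 0 = non-fraud and 1 = fraud; benefits are negative numbers."""
    interest = loan * rate
    return [[-interest,       loan + interest],    # predict non-fraud: TN, FN
            [ interest,     -(loan + interest)]]   # predict fraud:     FP, TP

print(cost_matrix(500, 0.29))
# prints [[-145.0, 645.0], [145.0, -645.0]], matching Table 5
```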

4. Experimental design

4.1. Dataset

The raw data used in our experiments is credit loan data acquired from a major loan company in Korea. The data contains the loan transactions of 48,525 customers from November 2007 to May 2008 and includes the various kinds of information managers use to decide on loan approval: customer profile information (e.g., home address, office address), loan information (e.g., date of loan, loan amounts), and credit information (e.g., credit inquiries, open credit accounts). The distribution between fraudulent and non-fraudulent customers was highly skewed.

4.2. Pre-processing

When pre-processing the dataset, we eliminated 459 transactions that had many missing values or were outliers, and we also eliminated highly correlated and duplicated features by analyzing the correlation coefficients among them. Some features were derived from existing features with the help of domain experts. For example, the number of times an applicant changed address, the number of times an applicant changed jobs, and the number of loan inquiries made were derived from the information about customers' home addresses, office addresses, and credit inspection requests, respectively. In the end, 48,066 transactions remained, and 112 features were obtained from them for the experiments.

4.3. Feature selection

After a discussion with domain experts, we prepared three sets of features. The first contains 112 features from all categories of delinquency information (i.e., total delinquency, credit card delinquency and loan delinquency), as shown in Table 6. The second contains 99 features from two categories (i.e., credit card delinquency and loan delinquency). The third contains 78 features from only a single category (i.e., total delinquency). Next, we evaluated each set of features using the chi-square feature selection method to select the variables most influential on the target variable. In the end, we obtained 33 features from the first set, 31 from the second, and 24 from the third.

Table 6. Some representative features from each category of delinquency information.

Total delinquency:
- The number of total delinquencies
- The number of total delinquency accounts
- The longest period of delinquency among total delinquencies
- The longest period of delinquency account among total delinquency accounts
- The largest amount of delinquency among total delinquencies
- The largest amount of delinquency among remaining total delinquencies

Credit card delinquency:
- The number of credit card delinquencies
- The number of credit card delinquency accounts
- The longest period of delinquency among credit card delinquencies
- The longest period of delinquency account among credit card delinquency accounts
- The largest amount of delinquency among credit card delinquencies
- The largest amount of delinquency among remaining credit card delinquencies

Loan delinquency:
- The number of loan delinquencies
- The number of loan delinquency accounts
- The longest period of delinquency among loan delinquencies
- The longest period of delinquency account among loan delinquency accounts
- The largest amount of delinquency among loan delinquencies

4.4. Adjusting class distribution

Our dataset exhibited a highly skewed class distribution between fraud and non-fraud data: only 226 transactions belong to the class of fraudulent customers, while the remaining 47,840 transactions belong to the class of non-fraudulent customers. Such a highly skewed distribution normally causes a serious problem: when a classification model learns from skewed data, it will usually predict the majority class correctly more often than the minority class, because most classification models are designed to maximize accuracy (McCarthy et al., 2005). To mitigate this problem, we employed an under-sampling method, varying the ratio of non-fraudulent to fraudulent customers (i.e., 40:60, 50:50 and 60:40), even though under-sampling is known to have several disadvantages (Domingos, 1999; Ghosh & Reilly, 1994; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy et al., 2005; Phua et al., 2004) and over-sampling is likewise known to suffer from over-fitting. As a consequence, we obtained nine datasets by generating three datasets with different non-fraud-to-fraud ratios in each category of features, as shown in Table 7.
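The two dataset-preparation steps just described, chi-square feature selection (Section 4.3) and under-sampling to a target class ratio (Section 4.4), can be sketched as follows. The code is our own illustration with pandas and scikit-learn; the column name "fraud", the helper name, and the default parameter values are assumptions, not the paper's code:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def select_and_balance(df, target="fraud", k=33, fraud_share=0.5, seed=0):
    """Keep the k features with the highest chi-square scores, then
    under-sample the majority (non-fraud) class so frauds make up
    fraud_share of the result (0.6, 0.5 and 0.4 give the paper's
    40:60, 50:50 and 60:40 non-fraud:fraud datasets)."""
    X, y = df.drop(columns=[target]), df[target]
    # chi2 requires non-negative feature values; shift/scale beforehand.
    selector = SelectKBest(chi2, k=k).fit(X, y)
    cols = X.columns[selector.get_support()]

    fraud = df[y == 1]
    n_nonfraud = round(len(fraud) * (1 - fraud_share) / fraud_share)
    nonfraud = df[y == 0].sample(n=n_nonfraud, random_state=seed)
    balanced = pd.concat([fraud, nonfraud]).sample(frac=1, random_state=seed)
    return balanced[list(cols) + [target]]
```

With the paper's 226 fraud transactions, fraud_share=0.6 would keep only about 151 non-fraud transactions, which shows how drastically under-sampling shrinks the majority class.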

5. Experimental results

This section presents the results of our experiments on how the cost-sensitive approach, the datasets with varied class ratios, and the number of features affect the classification cost of credit fraud detection models, and it discusses the implications of those results.

Experiment 1. To evaluate whether a cost-sensitive classification model generates less classification cost than a cost-insensitive one, we calculated the classification costs of the three classification models, that is, a cost-insensitive model using C4.5 (hereafter, the C4.5 model) and two cost-sensitive models using CSC and MetaCost (hereafter, the CSC model and the MetaCost model, respectively), for each of the nine datasets described in Section 4.4.


Table 7. The nine datasets.

Dataset   Features                                       Non-fraud:fraud
1         33 variables (from three feature categories)   40:60
2         33 variables (from three feature categories)   50:50
3         33 variables (from three feature categories)   60:40
4         31 variables (from two feature categories)     40:60
5         31 variables (from two feature categories)     50:50
6         31 variables (from two feature categories)     60:40
7         24 variables (from one feature category)       40:60
8         24 variables (from one feature category)       50:50
9         24 variables (from one feature category)       60:40

Table 8. Classification costs of some instances from the C4.5 model (unit: 10,000 Korean won; benefits are negative).

No.   Actual   Prediction   Loan   TP       FN      TN      FP      Cost
1     N        N            205    0        0       -59.5   0       -59.5
2     Y        N            239    0        308.3   0       0       308.3
3     Y        Y            259    -334.1   0       0       0       -334.1
4     N        Y            683    0        0       0       198.1   198.1

Table 9. The classification costs of the three models for each dataset (unit: Korean won; negative values are net benefits).

Dataset   C4.5     MetaCost   CSC
1         -28025   -38605     -29315
2         -27570   -45730     -27570
3         -17845   -28945     -19395
4         -23155   -35605     -39475
5         -21830   -30670     -26410
6         -14525   -26945     -15365
7         -24155   -32605     -25155
8         -20380   -34830     -25800
9         -19585   -28265     -19075

To calculate the classification cost of the C4.5 model, we randomly assigned7 a loan amount to each customer using a beta distribution with 1 million Korean won as the minimum loan amount, 50 million Korean won as the maximum, and 5 million Korean won as the average. To calculate the classification costs of the CSC and MetaCost models, we used the cost matrix defined in Table 5, since both models need only the average loan amount as an input value for the matrix. All three models used the average interest rate of 29% when calculating classification cost. As an example, Table 8 shows the classification costs of four instances under the C4.5 model, where the costs of the two misclassified instances and the benefits of the two correctly classified instances were calculated using the formulas given in Table 4. After calculating the cost or benefit for each customer as illustrated above, we computed the average over all customers as the cost of the C4.5 model. The costs of the CSC and MetaCost models were then calculated using the cost matrix defined in Table 5. Table 9 shows the classification costs of the three models for each dataset, and Fig. 2 compares them graphically; the MetaCost model generates the lowest classification cost (that is, the highest benefit) on every dataset except dataset 4. We tested whether the difference between each pair of models was statistically significant at the 5% significance level and found a significant difference between the C4.5 model and the MetaCost model. However, neither the difference between the C4.5 and CSC models nor the difference between the CSC and MetaCost models was significant.

7 We could not get the actual loan amount of each customer because of issues related to customer confidentiality.
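The per-customer cost computation described above can be sketched in Python as follows. This is our own reconstruction; in particular, the beta shape parameters a = 2 and b = 22.5 are assumptions chosen only so that the mean of the scaled distribution lands at 5 million won, since the paper does not report them:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_loans(n, lo=100, hi=5000, a=2.0, b=22.5):
    """Loan amounts in units of 10,000 won, drawn from a scaled beta
    distribution: min 100 (1M won), max 5000 (50M won); a and b are
    assumed shapes giving mean lo + (hi - lo) * a / (a + b) = 500."""
    return lo + (hi - lo) * rng.beta(a, b, size=n)

def instance_cost(actual, predicted, loan, rate=0.29):
    """Per-instance cost from the Table 4 formulas; benefits are negative.
    Classes: 0 = non-fraud, 1 = fraud."""
    interest = loan * rate
    if actual == 0 and predicted == 0:
        return -interest            # TN: interest earned
    if actual == 1 and predicted == 0:
        return loan + interest      # FN: principal and interest lost
    if actual == 0 and predicted == 1:
        return interest             # FP: forgone interest
    return -(loan + interest)       # TP: loss avoided

# Row 2 of Table 8: a fraud (1) predicted as non-fraud (0), loan of 239.
print(round(instance_cost(1, 0, 239), 1))  # prints 308.3
```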

Fig. 2. Comparison of classification costs among the three models (x-axis: dataset number; y-axis: classification cost in Korean won).

Table 10. Average classification costs according to the ratio of non-fraud data to fraud data (unit: Korean won; negative values are net benefits).

Ratio (non-fraud:fraud)   C4.5     MetaCost   CSC
40:60                     -25117   -35605     -31315
50:50                     -23260   -37077     -26593
60:40                     -17318   -28052     -17945

Table 11. Average classification costs according to the number of features (unit: Korean won; negative values are net benefits).

Number of features   C4.5     MetaCost   CSC
33 variables         -24480   -37760     -25427
31 variables         -19837   -31073     -27083
24 variables         -21373   -31900     -23343

Consequently, the results of this experiment suggest that a classification model built with a cost-sensitive approach can reduce the misclassification cost, or increase the benefit, compared to a model built with a cost-insensitive approach. In particular, since the MetaCost model performed better than the CSC model on most datasets, we suggest that MetaCost be the first cost-sensitive learning technique considered when building credit loan fraud detection models.

Experiment 2. To evaluate the effect of the class distribution on classification cost, we calculated the average classification costs of the three classification models over the datasets with different ratios between the fraud and non-fraud classes. Table 10 shows that the C4.5 and CSC models produced the lowest classification cost when the non-fraud:fraud ratio was 40:60, while the MetaCost model yielded its lowest classification cost when the ratio was 50:50. The results weakly indicate that classification cost is lower when the dataset contains more fraud data than non-fraud data, or when the two classes are balanced.

Experiment 3. To evaluate the effect on classification cost of the number of features used in building the classification models, we calculated the average classification cost of the three classification models for each of the three feature sets. Table 11 shows that the C4.5 and MetaCost models produced the lowest classification cost with 33 features, while CSC generated its lowest classification cost with 31 features. Using the 33 features drawn from all categories of delinquency information (i.e., total delinquency, credit card delinquency and loan delinquency) was thus the most effective in reducing the classification cost. The results weakly indicate that classification cost decreases when as many features, and as many kinds of delinquency information, as possible are used.


6. Conclusions

The loss to financial institutions from various kinds of fraud is so enormous that studies on fraud detection have been conducted in many business domains, such as credit loan, credit card, insurance and telecommunication. In particular, most loan companies suffer losses incurred by fraudulent customers. In this study, we built three fraud detection models from a dataset obtained from a loan company in Korea. One was built using a decision tree algorithm, C4.5, where we assumed, as is usual, that the classification cost was equal in all cases of the confusion matrix. The other two were built using two cost-sensitive algorithms, CSC and MetaCost, where it was assumed that the classification cost was not equal across the cases of the confusion matrix. As is common with other fraud detection datasets, our dataset had a highly skewed class distribution. We conducted experiments using nine datasets, each differing in the number of variables, the category of variables, and the ratio between the fraud and non-fraud classes.

The objective of this study was three-fold: first, to verify that a cost-sensitive approach to fraud detection results in less classification cost than a cost-insensitive approach; second, to examine whether the distribution between the fraud class and the non-fraud class in the training dataset affects the classification cost; and third, to find the set of features that results in the least classification cost. The experiments showed that (1) MetaCost returned the lowest classification cost on eight of the nine datasets; (2) classification cost is lower when the dataset has more fraud data than non-fraud data; and (3) a feature set covering all delinquency information was the most effective in reducing the classification cost.

These findings have several implications. First, MetaCost should be considered an important cost-sensitive learning technique when building fraud detection models. Second, by using the credit loan fraud detection model introduced in this study, financial institutions can reduce the financial loss caused by fraudsters. Finally, the misclassification cost, rather than the accuracy rate of the classifiers, should be the criterion for selecting fraud detection models. The cost-sensitive approach taken in this study will help managers in financial institutions make more effective loan approval decisions.

However, this study has some limitations. First, the assumption we made to simplify the calculation of classification cost is quite strong; it would therefore be worthwhile to repeat the experiment with a cost formula actually used in financial institutions. Second, the data used in this study was acquired from only one financial institution in Korea; conducting similar experiments with datasets from more financial institutions would allow the experimental results to be generalized.

References

Almeida, P., Jorge, M., Cortesão, L., Martins, F., Vieira, M., & Gomes, P. (2008). Supporting fraud analysis in mobile telecommunications using case-based reasoning. In Proceedings of the 9th European conference on advances in case-based reasoning (pp. 562–572).
Ballis, D., Falaschi, M., Ferri, C., Hernandez-Orallo, J., & Ramirez-Quintana, M. J. (2003). Cost-sensitive diagnosis of declarative programs. Electronic Notes in Theoretical Computer Science, 86(3), 85–104.
Bianca, Z., John, L., & Naoki, A. (2003). Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE international conference on data mining (pp. 435–442).
Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud detection. In Proceedings of the 11th IEEE international conference on tools with artificial intelligence (pp. 103–106).
Chung, S. H., & Suh, Y. (2009). Estimating the utility value of individual credit card delinquents. Expert Systems with Applications, 36(2), 3975–3981. doi:10.1016/j.eswa.2008.02.031
Ciraco, M., Rogalewski, M., & Weiss, G. (2005). Improving classifier utility by altering the misclassification cost ratio. In Proceedings of the 1st international workshop on utility-based data mining (pp. 46–52).
Desai, V. S., Crook, J. N., & Overstreet, G. A. J. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95(1), 24–37.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 155–164).
Estevez, P. A., Held, C. M., & Perez, C. A. (2006). Subscription fraud prevention in telecommunications using fuzzy rules and neural networks. Expert Systems with Applications, 31(2), 337–344. doi:10.1016/j.eswa.2005.09.028
Fawcett, T. (2006). ROC graphs with instance-varying costs. Pattern Recognition Letters, 27(8), 882–891. doi:10.1016/j.patrec.2005.10.012
Ghosh, S., & Reilly, D. L. (1994). Credit card fraud detection with a neural-network. Proceedings of the Annual International Conference on System Science, 3, 621–630.
Jiang, Y., & Cukic, B. (2009). Misclassification cost-sensitive fault prediction models. In Proceedings of the 5th international conference on predictor models in software engineering (pp. 1–10).
Ling, C. X., & Sheng, V. S. (2008). Cost-sensitive learning and the class imbalance problem. Springer.
Ling, C. X., Sheng, V. S., Bruckhaus, T., & Madhavji, N. H. (2006). Maximum profit mining and its application in software development. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 929–934).
Ling, C. X., Sheng, V. S., & Yang, Q. (2006). Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1055–1067.
Lu, W.-Z., & Wang, D. (2008). Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Science of the Total Environment, 395, 109–116.
Malhotra, R., & Malhotra, D. K. (2002). Differentiating between good credits and bad credits using neuro-fuzzy systems. European Journal of Operational Research, 136, 190–211.
McCarthy, K., Zabar, B., & Weiss, G. (2005). Does cost-sensitive learning beat sampling for classifying rare classes? In Proceedings of the 1st international workshop on utility-based data mining (pp. 69–77).
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1), 50–59.
Stolfo, S., Fan, W., Lee, W., Prodromidis, A., & Chan, P. (1997). Credit card fraud detection using meta-learning: Issues and initial results. In Proceedings of the AAAI workshop on AI approaches to fraud detection and risk management (pp. 83–90).
Viaene, S., Dedene, G., & Derrig, R. A. (2005). Auto claim fraud detection using Bayesian learning neural networks. Expert Systems with Applications, 29(3), 653–666.
Wheeler, R., & Aitken, S. (2000). Multiple algorithms for fraud detection. Knowledge-Based Systems, 13(2–3), 93–99.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473–2480. doi:10.1016/j.eswa.2007.12.020
Zadrozny, B. (2005). One-benefit learning: Cost-sensitive learning with restricted cost information. In Proceedings of the KDD workshop on utility-based data mining, Chicago.