Classification cost: An empirical comparison among traditional classifier, Cost-Sensitive Classifier, and MetaCost


Expert Systems with Applications 39 (2012) 4013–4019


Jungeun Kim a,1, Keunho Choi b,2, Gunwoo Kim c,*, Yongmoo Suh b,2

a Hyundai Autoever Corp., 576 Sam-dong, Uiwang-si, Gyeonggi-Do 437-070, Republic of Korea
b Business School, Korea University, Anam-dong 5-Ga, Sungbuk-Gu, Seoul 136-701, Republic of Korea
c Department of Business Administration, College of Business and Economics, Hanbat National University, San 16-1, Dukmyung-dong, Yuseong-Gu, Daejeon 305-719, Republic of Korea

Keywords: Fraud detection; Cost-sensitive learning; Cost-Sensitive Classifier; MetaCost

Abstract

Loan fraud is a critical factor in the insolvency of financial institutions, so companies strive to reduce losses from fraud by building models for proactive fraud prediction. However, two critical problems remain unresolved in fraud detection: (1) the lack of cost sensitivity between type I and type II errors in most prediction models, and (2) the highly skewed class distribution of the datasets used for fraud detection, owing to the sparseness of fraud-related data. The objective of this paper is to examine whether classification cost is affected both by the cost-sensitive approach and by the skewed class distribution. To that end, we compare the classification costs incurred by a traditional cost-insensitive classification approach and by two cost-sensitive classification approaches, Cost-Sensitive Classifier (CSC) and MetaCost. Experiments were conducted on a credit loan dataset from a major financial institution in Korea, while varying the class distribution of the dataset and the number of input variables. The experiments showed that the lowest classification cost was incurred with the MetaCost approach and when non-fraud data and fraud data were balanced. In addition, the dataset that includes all delinquency variables was shown to be the most effective in reducing the classification cost.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Most domestic and international financial institutions suffer significant losses from loan fraud. According to Mortgage Banking (2006),3 financial institutions normally lose approximately 35% of total loan principal, and almost 20% of the principal of loan products sold to the secondary financial market, apart from commissions and other transaction costs, as a result of loan fraud. The Federal Bureau of Investigation (2009)4 in the US reported that the number of commercial loan fraud accounts increased from 2126 in 2005 to 4514 in 2009, and that the amount of loss grew from 711 million dollars in 2005 to 1698 million dollars in 2009. In addition, the means of borrowing money fraudulently from a financial institution are becoming more diverse. For example, loan fraudsters may submit loan applications that include fake contracts for building leases or tax bills, thereby establishing a paper company.

* Corresponding author. Tel.: +82 42 821 1290; mobile: +82 10 5274 2138; fax: +82 42 821 1288. E-mail addresses: [email protected] (J. Kim), [email protected] (K. Choi), [email protected] (G. Kim), [email protected] (Y. Suh).
1 Tel.: +82 2 6296 6359; mobile: +82 10 9064 8301.
2 Tel.: +82 2 3290 1945; fax: +82 2 922 7200.
3 http://www.mortgagebankingmagazine.com/files/2006articleindex.pdf
4 http://www.fbi.gov/publications/financial/fcs_report2009/financial_crime_2009.htm
doi:10.1016/j.eswa.2011.09.071

Another typical type of loan fraud is broker-facilitated: most such frauds are perpetrated by a small number of brokers, but the amount of loss from them is considerable. These diverse methods of perpetrating loan fraud and the rapid growth in the amount of loss cause serious financial damage and arouse deep distrust of financial institutions, ultimately driving them toward insolvency. Therefore, in order to manage financial risks successfully, financial institutions should reinforce their qualifications for lending and consider building sophisticated prediction models for proactive detection of loan fraud (Desai, Crook, & Overstreet, 1996; Malhotra & Malhotra, 2002; Wheeler & Aitken, 2000). However, existing loan fraud detection models leave something to be desired. First, most traditional classification models have used hit ratio as a measure of classification accuracy, but hit ratio takes into account only the number of cases that are correctly classified and assumes that the costs of misclassification resulting from type I and type II errors are equal. Since they are not equal in many real-world classification problems, hit ratio does not represent the performance of classification models accurately (Ballis, Falaschi, Ferri, Hernandez-Orallo, & Ramirez-Quintana, 2003; Ciraco, Rogalewski, & Weiss, 2005; Jiang & Cukic, 2009; Ling, Sheng, & Yang, 2006; Lu & Wang, 2008; Zadrozny, 2005).


For example, the cost of predicting a cancer patient to be a non-cancer patient is clearly different from the cost of the opposite prediction. The results of prediction models that do not consider classification cost can mislead financial institutions into making incorrect decisions and thus incurring significant financial loss. So far, few studies have taken the importance of classification cost into account when building prediction models (Domingos, 1999; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, & Madhavji, 2006; Witten & Frank, 2005). Second, the class distribution of the datasets used for fraud detection is highly skewed because the ratio of fraud data to non-fraud data is considerably imbalanced, which may cause poor performance in detecting loan fraud (Domingos, 1999; Ghosh & Reilly, 1994; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy, Zabar, & Weiss, 2005; Phua, Alahakoon, & Lee, 2004).

The objective of this paper is to examine whether classification cost is actually affected both by the cost-sensitive approach and by the skewed distribution of class. We hypothesize that (1) a cost-sensitive approach will result in less classification cost than the traditional, cost-insensitive approach, and (2) adjusting the ratio of fraud data to non-fraud data will reduce the classification cost. To test these hypotheses, we compare the classification costs incurred by a traditional approach and by two cost-sensitive approaches, Cost-Sensitive Classifier (CSC) (Witten & Frank, 2005) and MetaCost (Domingos, 1999). Experiments are conducted with a credit loan dataset from a major financial institution in Korea, while varying the class distribution of the dataset and the number of input variables for each approach.

The rest of this paper is organized as follows. Section 2 reviews the literature on data mining techniques for fraud detection and on cost-sensitive learning. Section 3 explains our method, including the research framework and model building. Section 4 presents the experimental design, including the dataset, preprocessing, feature selection, and adjustment of the class distribution. Section 5 compares the experimental results and discusses their implications. The last section concludes.

2. Literature review

2.1. Fraud detection research using data mining techniques

Fraud detection is an important issue in many domains, including credit loan (Desai et al., 1996; Malhotra & Malhotra, 2002; Wheeler & Aitken, 2000), credit card (Brause, Langsdorf, & Hepp, 1999; Chung & Suh, 2009; Ghosh & Reilly, 1994; Stolfo, Fan, Lee, Prodromidis, & Chan, 1997; Yeh & Lien, 2009), telecommunication (Almeida et al., 2008; Estevez, Held, & Perez, 2006; Fawcett, 2006), and insurance (Phua et al., 2004; Viaene, Dedene, & Derrig, 2005). Several studies have attempted to solve the fraud detection problem using data mining techniques; this section reviews them.

In the credit loan domain, several studies have addressed issues related to fraud detection. Desai et al. (1996) explored the accuracy rate of credit scoring models using the personal loan information of three credit unions in the US. They empirically compared the performance of various data mining techniques such as Logistic Regression, Discriminant Analysis and two types of Neural Network (i.e., Multilayer Perceptron and Modular Neural Network), and reported that different techniques performed differently in predicting different groups. Wheeler and Aitken (2000) described an adaptive Case-Based Reasoning (CBR) technique for reducing the number of fraud investigations during the credit approval process. Their results showed that multi-algorithmic CBR had the highest accuracy rate, but other adaptive CBR techniques performed similarly in comparable problem areas.

Malhotra and Malhotra (2002) used the loan information of credit unions in the US to build a consumer loan assessment model utilizing the Adaptive Neuro-Fuzzy Inference System (ANFIS) and the Back Propagation Network (BPN). Their model showed better performance than Multiple Discriminant Analysis.

In the credit card domain, Ghosh and Reilly (1994) constructed a fraud detection system based on a Neural Network. The training set consisted of fraudulent credit card transactions and a sample of good credit card transactions, roughly in the ratio of 30 good accounts to one fraud account. Their system improved both the accuracy and the timeliness of fraud detection compared to the systems in previous research. Stolfo et al. (1997) conducted experiments with meta-learning techniques to learn models for detecting fraudulent credit card transactions. They argued that, in the fraud detection domain, the fraud-catching rate (true positive) and the false-alarm rate (false positive) are better measures than overall accuracy when evaluating the performance of fraud detection models. The meta-classifier using a Bayesian Network as the meta-learner generated the best performance. Brause et al. (1999) applied the Association Rule technique to a categorical dataset (to obtain rules from all misused credit card transactions) and a Neural Network to a numeric dataset for prediction, and showed that the combined technique performed better than either one alone. Chung and Suh (2009) used classification models built with a Neural Network (NN) and a Decision Tree (DT) to classify customers into three groups of credit card delinquents (i.e., good, bad or potentially good). They then constructed NN and DT models to estimate the utility of individual credit card delinquents and demonstrated that the classification model with the best hit rate does not necessarily yield the best utility value. Yeh and Lien (2009) utilized K-Nearest Neighbor, Logistic Regression, Naïve Bayesian, and Neural Network models to predict the probability of credit card default and found that the Neural Network performed best among them.

In the telecommunication domain, Fawcett (2006) utilized JRip, a rule-learning classifier, to detect fraudulent phone calls, and suggested ROC graphs with instance-varying costs for estimating performance. Estevez et al. (2006) proposed a system using fuzzy rules and Neural Network models to prevent subscription fraud in fixed telecommunications. Their system identified 3.5% of subscribers as potentially fraudulent, a group that contained 56.2% of the true fraudsters. Almeida et al. (2008) devised a system that helps fraud analysts make decisions faster and more accurately using CBR.

In the insurance domain, Phua et al. (2004) used meta-learning techniques, applying algorithms such as the back-propagation neural network, Naïve Bayesian, and C4.5 to an automobile insurance dataset in order to detect fraud while manipulating the class distribution. Their stacking-bagging method achieved the highest cost savings. To detect automobile insurance claim fraud, Viaene et al. (2005) compared Bayesian Networks with Logistic Regression and Decision Tree classifiers. These fraud detection studies are summarized in Table 1.

This review of previous fraud detection studies reveals that only a few have taken into account the difference in misclassification cost between type I and type II errors when building their classification or prediction models; most have assumed that the costs of the two errors are equal, even though they are not in real business domains.

Table 1. Fraud detection studies in various domains (Subject | Technique | Reference).

Credit loan
- Credit scoring | Logistic Regression, Discriminant Analysis, Multilayer Perceptron Network, Modular Neural Network | Desai et al. (1996)
- Credit approval process | Case-Based Reasoning | Wheeler and Aitken (2000)
- Credit loan assessment | Neural Network, Neuro-Fuzzy Inference System | Malhotra and Malhotra (2002)

Credit card
- Credit card fraud detection | Neural Network | Ghosh and Reilly (1994)
- Detecting fraudulent credit card transactions | Meta-Learning Technique using Bayesian Network | Stolfo et al. (1997)
- Credit card fraud detection | Association Rule, Neural Network | Brause et al. (1999)
- Utility value of individual credit card delinquents | Neural Network, Decision Tree | Chung and Suh (2009)
- Probability of default of credit card clients | K-Nearest Neighbor, Logistic Regression, Naïve Bayesian, Neural Network | Yeh and Lien (2009)

Telecommunication
- Fraudulent phone call detection | JRip | Fawcett (2006)
- Subscription fraud | Fuzzy Rule, Neural Network | Estevez et al. (2006)
- Fraud analysis | Case-Based Reasoning | Almeida et al. (2008)

Insurance
- Detecting fraudulent automobile insurance | Meta-Learning Technique using Neural Network, Naïve Bayesian, and C4.5; Stacking; Bagging | Phua et al. (2004)
- Fraud in automobile insurance claims | Bayesian Network, Logistic Regression, Decision Tree | Viaene et al. (2005)

2.2. Cost-sensitive learning

Recognizing that the misclassification costs of type I and type II errors can differ, several studies have proposed cost-sensitive learning (Bianca, John, & Naoki, 2003; Ciraco et al., 2005; Domingos, 1999; Fawcett, 2006; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy et al., 2005; Witten & Frank, 2005). This section introduces the theory of cost-sensitive learning, in particular CSC (Witten & Frank, 2005) and MetaCost (Domingos, 1999), and reviews previous studies related to them.

In classification problems with two classes, the classification cost can be represented in a cost matrix, shown in Table 2, where the two types of error, false positives (FP) and false negatives (FN), and the two types of correct classification, true positives (TP) and true negatives (TN), have different costs and benefits. In the table, Cost(i, j) represents the cost of classifying an instance belonging to class j into class i. Cost(0, 0) and Cost(1, 1) are usually considered a benefit (i.e., a negated cost), while the other two cases are considered a cost.

Table 2. An example of a cost matrix for the two-class case.

                   Predict negative   Predict positive
Actual negative    Cost(0, 0)         Cost(1, 0)
Actual positive    Cost(0, 1)         Cost(1, 1)

Given the cost matrix, an instance should be classified into the class that generates the minimum expected cost. The expected cost R(i|x) of classifying an instance x into class i can be expressed as

$$R(i \mid x) = \sum_{j} P(j \mid x)\,\mathrm{Cost}(i, j), \qquad (1)$$

where P(j|x) is the probability of class j being the actual class of instance x (Ling & Sheng, 2008). The classifier will then classify instance x into the positive class (assuming class 1 is positive) if and only if

$$P(0 \mid x)\,\mathrm{Cost}(1, 0) + P(1 \mid x)\,\mathrm{Cost}(1, 1) \le P(0 \mid x)\,\mathrm{Cost}(0, 0) + P(1 \mid x)\,\mathrm{Cost}(0, 1). \qquad (2)$$
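This decision rule is straightforward to implement once a classifier supplies class-membership probabilities. The following Python sketch (our own illustration, not code from the paper) computes R(i|x) of Eq. (1) for every class and returns the minimum-cost label, using for the example the cost matrix the paper later defines in Table 5 (unit: 10,000 won):

```python
import numpy as np

def min_expected_cost_labels(class_probs, cost):
    """class_probs: (n_samples, n_classes) array of P(j|x).
    cost: (n_classes, n_classes) matrix, cost[i][j] = Cost(i, j),
    i.e. the cost of predicting class i when the true class is j
    (negative entries are benefits, as in Table 2).
    Returns the label minimizing R(i|x) = sum_j P(j|x) * Cost(i, j)."""
    cost = np.asarray(cost)
    expected_cost = np.asarray(class_probs) @ cost.T  # [n, i] = R(i|x_n)
    return expected_cost.argmin(axis=1)

# 0 = non-fraud (negative class), 1 = fraud (positive class).
cost = [[-145.0, 645.0],    # predict non-fraud: TN benefit, FN cost
        [ 145.0, -645.0]]   # predict fraud:     FP cost,    TP benefit
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])
print(min_expected_cost_labels(probs, cost))  # prints [0 1]
```

With these costs, the rule flags an applicant as fraudulent whenever P(1|x) exceeds roughly 0.18, which illustrates how asymmetric costs shift the usual 0.5 decision threshold.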

CSC can convert traditional cost-insensitive classification models into cost-sensitive ones using the expected cost of Eq. (1) (Witten & Frank, 2005). One limitation, however, is that the underlying cost-insensitive classifier has to yield a class probability for each test instance, as a Naïve Bayes classifier does. Once the probability of each test instance has been obtained, Eq. (1) is used to predict the class label of the test examples.

MetaCost also turns traditional cost-insensitive classification models into cost-sensitive ones by utilizing Eq. (1) (Domingos, 1999), but it differs from CSC in that, while the classifiers used in CSC must estimate class probabilities themselves, MetaCost can make use of classifiers that do not produce class probabilities, learning the probabilities instead from multiple classifiers. In brief, MetaCost works as follows: (1) sample multiple bootstrap replicates of the training set; (2) learn a classifier on each replicate; (3) average the class probabilities yielded directly by the classifiers, or compute the fraction of votes received from the ensemble; (4) use Eq. (1) to relabel each training example; (5) reapply the classifier to the relabeled training set. The model learned by MetaCost is more robust and more generally applicable because of the ensemble approach.
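A minimal Python sketch of these five steps for the two-class case is given below. It is our own reconstruction with scikit-learn, not the paper's implementation (the paper used Weka), and it assumes every bootstrap replicate contains both classes:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def metacost_fit(base, X, y, cost, n_replicates=10, seed=0):
    """Simplified MetaCost (Domingos, 1999). X, y: numpy arrays;
    cost[i][j] = Cost(i, j) as in Table 2."""
    rng = np.random.RandomState(seed)
    n, cost = len(X), np.asarray(cost)
    probs = np.zeros((n, cost.shape[0]))
    for _ in range(n_replicates):            # steps (1)-(2): bootstrap + fit
        idx = rng.randint(0, n, size=n)
        model = clone(base).fit(X[idx], y[idx])
        probs += model.predict_proba(X)      # step (3): accumulate votes
    probs /= n_replicates                    # average class probabilities
    y_relabeled = (probs @ cost.T).argmin(axis=1)  # step (4): Eq. (1)
    return clone(base).fit(X, y_relabeled)   # step (5): refit on relabeled set

# Usage (illustrative): metacost_fit(DecisionTreeClassifier(), X, y, cost)
```

Note that step (5) fits the base learner on the original features with cost-minimizing labels, which is what lets MetaCost wrap classifiers that never expose probabilities at prediction time.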

There have been many studies taking a cost-sensitive learning approach, as summarized in Table 3. Bianca et al. (2003) proposed methods for converting classifier learning algorithms and classification theory into cost-sensitive algorithms and theory, based on cost-proportionate weighting of the training examples; a method based on cost-proportionate rejection sampling and ensemble aggregation achieved excellent predictive performance. Ciraco et al. (2005) considered the misclassification cost ratio of the two-class case in order to improve classifier utility. They varied the cost ratio in two stages, the learning stage and the testing stage, and found that the utility of a classifier can be maximized by applying different cost ratios in the two stages. McCarthy et al. (2005) noted that a classifier learned from a highly skewed class distribution usually predicts the majority class much more often than the minority class. Combining the cost-sensitive learning approach with two sampling techniques, up-sampling and down-sampling, they found no clear or consistent winner for maximizing classifier performance. Ling, Sheng, Bruckhaus, et al. (2006) suggested a data-mining solution for predicting the escalation risks of defects in order to aid human experts in the review process of software development. The distinct feature of their study is that, although the general objective of cost-sensitive learning is to minimize the expected cost, they studied the maximum net profit problem and showed that a cost-sensitive decision tree produced the highest positive net profit. To estimate the performance of classifiers, Fawcett (2006) proposed ROC graphs with Instance-Varying costs (ROCIV) and demonstrated the method on three domains; the results of ROCIV differed considerably from those of traditional ROC. In the development of software fault prediction models, Jiang and Cukic (2009) analyzed the benefits and misclassification costs of various techniques using project information from a public repository. They showed that cost-sensitive modeling does not outperform overall classification models in terms of hit ratio, but it minimizes the overall cost of misclassification.

It is interesting that, as this review shows, many studies have attempted to apply a utility mining approach to diverse problems. It seems worthwhile to examine other problems, in many application areas, for which the utility mining approach is the best fit.

Table 3. Previous research taking a cost-sensitive learning approach (Subject | Reference).

- Suggestion of MetaCost | Domingos (1999)
- Conversion methods by cost-proportionate example weighting | Bianca et al. (2003)
- Improvement of the utility of the classification model by altering the misclassification cost ratio | Ciraco et al. (2005)
- A comparison of two cost-sensitive learning approaches | McCarthy et al. (2005)
- Minimization of the total cost for test examples | Ling et al. (2006)
- Adapted ROC graphs accommodating individual-varying costs in the domain of fraudulent phone call detection | Fawcett (2006)
- Cost-sensitive classification of faulty modules in software development | Jiang and Cukic (2009)

3. Research method

As mentioned earlier, this paper aims to examine whether classification cost is actually affected both by the cost-sensitive approach and by the skewed distribution of class. This section explains our research framework and builds three classification models, each of which takes either the traditional cost-insensitive approach (i.e., a C4.5 decision tree) or a cost-sensitive approach (i.e., CSC or MetaCost).

3.1. Research framework

Fig. 1. Research framework.

As shown in Fig. 1, our proposed research framework consists of several steps. In the first step, we preprocess data acquired from a major loan company in Korea by removing outliers and missing data. In the second step, we select sets of features associated with delinquency. In the third step, we alter the class distribution by adjusting the sampling ratio of fraud data to non-fraud data to obtain nine different datasets (i.e., three datasets with different ratios for each of the three feature sets generated in the second step). In the fourth step, we first define a simple cost matrix with the help of domain experts and then build three different classification models using the matrix: one traditional cost-insensitive classification model using C4.5 and two cost-sensitive classification models using CSC and MetaCost. In the final step, we compare the resulting classification costs among those three models.

3.2. Building models using a cost matrix

Prior to building a cost-sensitive classification model, a cost matrix should be defined in order to measure the degree to which the model is cost-sensitive. Table 4 shows a matrix of simple formulas for calculating classification cost. Although many more factors would have to be taken into account to define classification cost in a real business, we defined the formulas as simply as possible, with help from the domain experts, in order to simplify the comparison of misclassification cost between the cost-sensitive and cost-insensitive models. In Table 4, fraudulent and non-fraudulent customers are regarded as the positive class and the negative class, respectively. If the model classifies a fraudulent customer (i.e., actual positive) as a non-fraudulent customer (i.e., predicted negative), the amount of the loan and its interest are incurred as the misclassification cost. If the model classifies a non-fraudulent customer (i.e., actual negative) as a fraudulent customer (i.e., predicted positive), the institution will not approve the loan request, and the misclassification cost is the interest that would have been earned on the loan. In the cases of correct classification, the corresponding profits are generated as negated misclassification costs.

To define the actual costs and benefits used to measure the cost-sensitivity of a classification model, we used the average credit loan amount (5 million Korean won) and the average interest rate (29%) of the company that provided our dataset. Table 5 shows the resulting cost matrix, where a cost is presented as a positive number and a benefit as a negative number. Referring to the cost matrix, we built two cost-sensitive classification models using CSC and MetaCost, and a cost-insensitive classification model using C4.5.5 We evaluated the effect of the cost-sensitive classification models in our experiments by comparing the classification costs of the three models. The next section provides a detailed description of the experiments.

5 To build these three models, we used the data mining tool Weka, version 3.6 (http://www.cs.waikato.ac.nz/ml/weka/).

Table 4. Formula matrix.

                   Predict negative                 Predict positive
Actual negative    -(Loan × interest rate)          Loan × interest rate
Actual positive    Loan + (Loan × interest rate)    -{Loan + (Loan × interest rate)}

Table 5. Cost matrix (unit: 10,000 Korean won).

                   Predict negative   Predict positive
Actual negative    -145               145
Actual positive    645                -645
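For concreteness, the Table 5 entries follow directly from the Table 4 formulas with the company's averages (a 5-million-won loan and a 29% interest rate). The sketch below (our own illustration; the paper built its models in Weka) reproduces the matrix in units of 10,000 won:

```python
def cost_matrix(loan, rate):
    """Build the 2x2 cost matrix of Tables 4 and 5.
    Convention: Cost(i, j) = cost of predicting class i for true class j,
    with 0 = non-fraud and 1 = fraud; benefits are negative numbers."""
    interest = loan * rate
    return [[-interest,       loan + interest],    # predict non-fraud: TN, FN
            [ interest,     -(loan + interest)]]   # predict fraud:     FP, TP

print(cost_matrix(500, 0.29))
# prints [[-145.0, 645.0], [145.0, -645.0]], matching Table 5
```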

4. Experimental design

4.1. Dataset

The raw data used in our experiments is credit loan data acquired from a major loan company in Korea. The data contains the loan transactions of 48,525 customers from November 2007 to May 2008 and includes the various kinds of information managers use to decide on loan approval: customer profile information (e.g., home address, office address), loan information (e.g., date of loan, loan amounts), and credit information (e.g., credit inquiries, open credit accounts). The distribution between fraudulent and non-fraudulent customers was highly skewed.

4.2. Pre-processing

When pre-processing the dataset, we eliminated 459 transactions that had many missing values or were outliers, and we also eliminated highly correlated and duplicated features by analyzing the correlation coefficients among them. Some features were derived from existing features with the help of domain experts. For example, the number of times an applicant changed address, the number of times an applicant changed jobs, and the number of loan inquiries made were derived from the information about customers' home addresses, office addresses, and credit inspection requests, respectively. In the end, 48,066 transactions remained, and 112 features were obtained from them for the experiments.

4.3. Feature selection

After a discussion with domain experts, we prepared three sets of features. The first contains 112 features from all categories of delinquency information (i.e., total delinquency, credit card delinquency and loan delinquency), as shown in Table 6. The second contains 99 features from two categories (i.e., credit card delinquency and loan delinquency). The third contains 78 features from only a single category (i.e., total delinquency). Next, we evaluated each set of features using the chi-square feature selection method to select the variables most influential on the target variable. In the end, we obtained 33 features from the first set, 31 from the second, and 24 from the third.

Table 6. Some representative features from each category of delinquency information.

Total delinquency:
- The number of total delinquencies
- The number of total delinquency accounts
- The longest period of delinquency among total delinquencies
- The longest period of delinquency account among total delinquency accounts
- The largest amount of delinquency among total delinquencies
- The largest amount of delinquency among remaining total delinquencies

Credit card delinquency:
- The number of credit card delinquencies
- The number of credit card delinquency accounts
- The longest period of delinquency among credit card delinquencies
- The longest period of delinquency account among credit card delinquency accounts
- The largest amount of delinquency among credit card delinquencies
- The largest amount of delinquency among remaining credit card delinquencies

Loan delinquency:
- The number of loan delinquencies
- The number of loan delinquency accounts
- The longest period of delinquency among loan delinquencies
- The longest period of delinquency account among loan delinquency accounts
- The largest amount of delinquency among loan delinquencies

4.4. Adjusting class distribution

Our dataset exhibited a highly skewed class distribution between fraud and non-fraud data: only 226 transactions belong to the class of fraudulent customers, while the remaining 47,840 transactions belong to the class of non-fraudulent customers. Such a highly skewed distribution normally causes a serious problem: when a classification model learns from skewed data, it will usually predict the majority class correctly more often than the minority class, because most classification models are designed to maximize accuracy (McCarthy et al., 2005). To mitigate this problem, we employed an under-sampling method, varying the ratio of non-fraudulent to fraudulent customers (i.e., 40:60, 50:50 and 60:40), even though under-sampling is known to have several disadvantages (Domingos, 1999; Ghosh & Reilly, 1994; Jiang & Cukic, 2009; Ling & Sheng, 2008; Ling, Sheng, Bruckhaus, et al., 2006; McCarthy et al., 2005; Phua et al., 2004) and over-sampling is likewise known to suffer from over-fitting. As a consequence, we obtained nine datasets by generating three datasets with different non-fraud-to-fraud ratios in each category of features, as shown in Table 7.
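The two dataset-preparation steps just described, chi-square feature selection (Section 4.3) and under-sampling to a target class ratio (Section 4.4), can be sketched as follows. The code is our own illustration with pandas and scikit-learn; the column name "fraud", the helper name, and the default parameter values are assumptions, not the paper's code:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def select_and_balance(df, target="fraud", k=33, fraud_share=0.5, seed=0):
    """Keep the k features with the highest chi-square scores, then
    under-sample the majority (non-fraud) class so frauds make up
    fraud_share of the result (0.6, 0.5 and 0.4 give the paper's
    40:60, 50:50 and 60:40 non-fraud:fraud datasets)."""
    X, y = df.drop(columns=[target]), df[target]
    # chi2 requires non-negative feature values; shift/scale beforehand.
    selector = SelectKBest(chi2, k=k).fit(X, y)
    cols = X.columns[selector.get_support()]

    fraud = df[y == 1]
    n_nonfraud = round(len(fraud) * (1 - fraud_share) / fraud_share)
    nonfraud = df[y == 0].sample(n=n_nonfraud, random_state=seed)
    balanced = pd.concat([fraud, nonfraud]).sample(frac=1, random_state=seed)
    return balanced[list(cols) + [target]]
```

With the paper's 226 fraud transactions, fraud_share=0.6 would keep only about 151 non-fraud transactions, which shows how drastically under-sampling shrinks the majority class.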

5. Experimental results

This section presents the results of our experiments on how the cost-sensitive approach, the datasets with varied class ratios, and the number of features affect the classification cost of credit fraud detection models, and it discusses the implications of those results.

Experiment 1. To evaluate whether a cost-sensitive classification model generates less classification cost than a cost-insensitive one, we calculated the classification costs of the three classification models, that is, a cost-insensitive model using C4.5 (hereafter, the C4.5 model) and two cost-sensitive models using CSC and MetaCost (hereafter, the CSC model and the MetaCost model, respectively), for each of the nine datasets described in Section 4.4.


Table 7. The nine datasets.

Dataset   Features                                       Non-fraud:fraud
1         33 variables (from three feature categories)   40:60
2         33 variables (from three feature categories)   50:50
3         33 variables (from three feature categories)   60:40
4         31 variables (from two feature categories)     40:60
5         31 variables (from two feature categories)     50:50
6         31 variables (from two feature categories)     60:40
7         24 variables (from one feature category)       40:60
8         24 variables (from one feature category)       50:50
9         24 variables (from one feature category)       60:40

Table 8. Classification costs of some instances from the C4.5 model (unit: 10,000 Korean won; benefits are negative).

No.   Actual   Prediction   Loan   TP       FN      TN      FP      Cost
1     N        N            205    0        0       -59.5   0       -59.5
2     Y        N            239    0        308.3   0       0       308.3
3     Y        Y            259    -334.1   0       0       0       -334.1
4     N        Y            683    0        0       0       198.1   198.1

Table 9. The classification costs of the three models for each dataset (unit: Korean won; negative values are net benefits).

Dataset   C4.5     MetaCost   CSC
1         -28025   -38605     -29315
2         -27570   -45730     -27570
3         -17845   -28945     -19395
4         -23155   -35605     -39475
5         -21830   -30670     -26410
6         -14525   -26945     -15365
7         -24155   -32605     -25155
8         -20380   -34830     -25800
9         -19585   -28265     -19075

To calculate the classification cost of the C4.5 model, we randomly assigned7 a loan amount to each customer using a beta distribution with 1 million Korean won as the minimum loan amount, 50 million Korean won as the maximum, and 5 million Korean won as the average. To calculate the classification costs of the CSC and MetaCost models, we used the cost matrix defined in Table 5, since both models need only the average loan amount as an input value for the matrix. All three models used the average interest rate of 29% when calculating classification cost. As an example, Table 8 shows the classification costs of four instances under the C4.5 model, where the costs of the two misclassified instances and the benefits of the two correctly classified instances were calculated using the formulas given in Table 4. After calculating the cost or benefit for each customer as illustrated above, we computed the average over all customers as the cost of the C4.5 model. The costs of the CSC and MetaCost models were then calculated using the cost matrix defined in Table 5. Table 9 shows the classification costs of the three models for each dataset, and Fig. 2 compares them graphically; the MetaCost model generates the lowest classification cost (that is, the highest benefit) on every dataset except dataset 4. We tested whether the difference between each pair of models was statistically significant at the 5% significance level and found a significant difference between the C4.5 model and the MetaCost model. However, neither the difference between the C4.5 and CSC models nor the difference between the CSC and MetaCost models was significant.

7 We could not get the actual loan amount of each customer because of issues related to customer confidentiality.
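The per-customer cost computation described above can be sketched in Python as follows. This is our own reconstruction; in particular, the beta shape parameters a = 2 and b = 22.5 are assumptions chosen only so that the mean of the scaled distribution lands at 5 million won, since the paper does not report them:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_loans(n, lo=100, hi=5000, a=2.0, b=22.5):
    """Loan amounts in units of 10,000 won, drawn from a scaled beta
    distribution: min 100 (1M won), max 5000 (50M won); a and b are
    assumed shapes giving mean lo + (hi - lo) * a / (a + b) = 500."""
    return lo + (hi - lo) * rng.beta(a, b, size=n)

def instance_cost(actual, predicted, loan, rate=0.29):
    """Per-instance cost from the Table 4 formulas; benefits are negative.
    Classes: 0 = non-fraud, 1 = fraud."""
    interest = loan * rate
    if actual == 0 and predicted == 0:
        return -interest            # TN: interest earned
    if actual == 1 and predicted == 0:
        return loan + interest      # FN: principal and interest lost
    if actual == 0 and predicted == 1:
        return interest             # FP: forgone interest
    return -(loan + interest)       # TP: loss avoided

# Row 2 of Table 8: a fraud (1) predicted as non-fraud (0), loan of 239.
print(round(instance_cost(1, 0, 239), 1))  # prints 308.3
```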

Fig. 2. Comparison of classification costs among the three models (x-axis: dataset number; y-axis: classification cost in Korean won).

Table 10. Average classification costs according to the ratio of non-fraud data to fraud data (unit: Korean won; negative values are net benefits).

Ratio (non-fraud:fraud)   C4.5     MetaCost   CSC
40:60                     -25117   -35605     -31315
50:50                     -23260   -37077     -26593
60:40                     -17318   -28052     -17945

Table 11. Average classification costs according to the number of features (unit: Korean won; negative values are net benefits).

Number of features   C4.5     MetaCost   CSC
33 variables         -24480   -37760     -25427
31 variables         -19837   -31073     -27083
24 variables         -21373   -31900     -23343

Consequently, the results of this experiment suggest that a classification model built with a cost-sensitive approach can reduce the misclassification cost, or increase the benefit, compared to a model built with a cost-insensitive approach. In particular, since the MetaCost model performed better than the CSC model on most datasets, we suggest that MetaCost be the first cost-sensitive learning technique considered when building credit loan fraud detection models.

Experiment 2. To evaluate the effect of the class distribution on classification cost, we calculated the average classification costs of the three classification models over the datasets with different ratios between the fraud and non-fraud classes. Table 10 shows that the C4.5 and CSC models produced the lowest classification cost when the non-fraud:fraud ratio was 40:60, while the MetaCost model yielded its lowest classification cost when the ratio was 50:50. The results weakly indicate that classification cost is lower when the dataset contains more fraud data than non-fraud data, or when the two classes are balanced.

Experiment 3. To evaluate the effect on classification cost of the number of features used in building the classification models, we calculated the average classification cost of the three classification models for each of the three feature sets. Table 11 shows that the C4.5 and MetaCost models produced the lowest classification cost with 33 features, while CSC generated its lowest classification cost with 31 features. Using the 33 features drawn from all categories of delinquency information (i.e., total delinquency, credit card delinquency and loan delinquency) was thus the most effective in reducing the classification cost. The results weakly indicate that classification cost decreases when as many features, and as many kinds of delinquency information, as possible are used.


6. Conclusions

The loss to financial institutions from various kinds of fraud is so enormous that studies on fraud detection have been conducted in many business domains, such as credit loan, credit card, insurance and telecommunication. In particular, most loan companies suffer losses incurred by fraudulent customers. In this study, we built three fraud detection models from a dataset obtained from a loan company in Korea. One was built using a decision tree algorithm, C4.5, where we assumed, as is usual, that the classification cost was equal in all cases of the confusion matrix. The other two were built using two cost-sensitive algorithms, CSC and MetaCost, where it was assumed that the classification cost was not equal across the cases of the confusion matrix. As is common with other fraud detection datasets, our dataset had a highly skewed class distribution. We conducted experiments using nine datasets, each differing in the number of variables, the category of variables, and the ratio between the fraud and non-fraud classes.

The objective of this study was three-fold: first, to verify that a cost-sensitive approach to fraud detection results in less classification cost than a cost-insensitive approach; second, to examine whether the distribution between the fraud class and the non-fraud class in the training dataset affects the classification cost; and third, to find the set of features that results in the least classification cost. The experiments showed that (1) MetaCost returned the lowest classification cost on eight of the nine datasets; (2) classification cost is lower when the dataset has more fraud data than non-fraud data; and (3) a feature set covering all delinquency information was the most effective in reducing the classification cost.

These findings have several implications. First, MetaCost should be considered an important cost-sensitive learning technique when building fraud detection models. Second, by using the credit loan fraud detection model introduced in this study, financial institutions can reduce the financial loss caused by fraudsters. Finally, the misclassification cost, rather than the accuracy rate of the classifiers, should be the criterion for selecting fraud detection models. The cost-sensitive approach taken in this study will help managers in financial institutions make more effective loan approval decisions.

However, this study has some limitations. First, the assumption we made to simplify the calculation of classification cost is quite strong; it would therefore be worthwhile to repeat the experiment with a cost formula actually used in financial institutions. Second, the data used in this study was acquired from only one financial institution in Korea; conducting similar experiments with datasets from more financial institutions would allow the experimental results to be generalized.

References

Almeida, P., Jorge, M., Cortesão, L., Martins, F., Vieira, M., & Gomes, P. (2008). Supporting fraud analysis in mobile telecommunications using case-based reasoning. In Proceedings of the 9th European conference on advances in case-based reasoning (pp. 562–572).
Ballis, D., Falaschi, M., Ferri, C., Hernandez-Orallo, J., & Ramirez-Quintana, M. J. (2003). Cost-sensitive diagnosis of declarative programs. Electronic Notes in Theoretical Computer Science, 86(3), 85–104.
Bianca, Z., John, L., & Naoki, A. (2003). Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE international conference on data mining (pp. 435–442).
Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud detection. In Proceedings of the 11th IEEE international conference on tools with artificial intelligence (pp. 103–106).
Chung, S. H., & Suh, Y. (2009). Estimating the utility value of individual credit card delinquents. Expert Systems with Applications, 36(2), 3975–3981. doi:10.1016/j.eswa.2008.02.031
Ciraco, M., Rogalewski, M., & Weiss, G. (2005). Improving classifier utility by altering the misclassification cost ratio. In Proceedings of the 1st international workshop on utility-based data mining (pp. 46–52).
Desai, V. S., Crook, J. N., & Overstreet, G. A. J. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95(1), 24–37.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 155–164).
Estevez, P. A., Held, C. M., & Perez, C. A. (2006). Subscription fraud prevention in telecommunications using fuzzy rules and neural networks. Expert Systems with Applications, 31(2), 337–344. doi:10.1016/j.eswa.2005.09.028
Fawcett, T. (2006). ROC graphs with instance-varying costs. Pattern Recognition Letters, 27(8), 882–891. doi:10.1016/j.patrec.2005.10.012
Ghosh, S., & Reilly, D. L. (1994). Credit card fraud detection with a neural-network. Proceedings of the Annual International Conference on System Science, 3, 621–630.
Jiang, Y., & Cukic, B. (2009). Misclassification cost-sensitive fault prediction models. In Proceedings of the 5th international conference on predictor models in software engineering (pp. 1–10).
Ling, C. X., & Sheng, V. S. (2008). Cost-sensitive learning and the class imbalance problem. Springer.
Ling, C. X., Sheng, V. S., Bruckhaus, T., & Madhavji, N. H. (2006). Maximum profit mining and its application in software development. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 929–934).
Ling, C. X., Sheng, V. S., & Yang, Q. (2006). Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1055–1067.
Lu, W.-Z., & Wang, D. (2008). Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Science of the Total Environment, 395, 109–116.
Malhotra, R., & Malhotra, D. K. (2002). Differentiating between good credits and bad credits using neuro-fuzzy systems. European Journal of Operational Research, 136, 190–211.
McCarthy, K., Zabar, B., & Weiss, G. (2005). Does cost-sensitive learning beat sampling for classifying rare classes? In Proceedings of the 1st international workshop on utility-based data mining (pp. 69–77).
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1), 50–59.
Stolfo, S., Fan, W., Lee, W., Prodromidis, A., & Chan, P. (1997). Credit card fraud detection using meta-learning: Issues and initial results. In Proceedings of the AAAI workshop on AI approaches to fraud detection and risk management (pp. 83–90).
Viaene, S., Dedene, G., & Derrig, R. A. (2005). Auto claim fraud detection using Bayesian learning neural networks. Expert Systems with Applications, 29(3), 653–666.
Wheeler, R., & Aitken, S. (2000). Multiple algorithms for fraud detection. Knowledge-Based Systems, 13(2–3), 93–99.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473–2480. doi:10.1016/j.eswa.2007.12.020
Zadrozny, B. (2005). One-benefit learning: Cost-sensitive learning with restricted cost information. In Proceedings of the KDD workshop on utility-based data mining, Chicago.