Bayesian kernel based classification for financial distress detection


FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
Hoveniersberg 24, B-9000 Gent
Tel.: +32 (0)9 264.34.61  Fax: +32 (0)9 264.35.92

WORKING PAPER 2004/247 (May 2004)
Bayesian Kernel-Based Classification for Financial Distress Detection
Tony Van Gestel, Bart Baesens, Johan A.K. Suykens, Dirk Van den Poel, Dirk-Emma Baestaens, Marleen Willekens

D/2004/7012/33

Bayesian Kernel Based Classification for Financial Distress Detection

Tony Van Gestel a,b,1, Bart Baesens c,1, Johan A.K. Suykens b,1, Dirk Van den Poel d,1, Dirk-Emma Baestaens e, and Marleen Willekens c,1

a DEXIA Group, Credit Risk Modelling, RMG, Square Meeus 1, B-1000 Brussel, Belgium. E-mail: [email protected].
b Katholieke Universiteit Leuven, Dept. of Electrical Engineering, ESAT, SCD-SISTA, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium. E-mail: {tony.vangestel;johan.suykens}@esat.kuleuven.ac.be.
c Katholieke Universiteit Leuven, Dept. of Applied Economic Sciences, LIRIS, Naamsestraat 69, B-3000 Leuven, Belgium. E-mail: {bart.baesens;marleen.willekens}@econ.kuleuven.ac.be.
d Ghent University, Dept. of Marketing, Hoveniersberg 24, B-9000 Gent, Belgium. E-mail: [email protected].
e Fortis Bank Brussels, Financial Markets Research, Warandeberg 3, B-1000 Brussels, Belgium. E-mail: [email protected].

Abstract

Corporate credit granting is a key commercial activity of financial institutions nowadays. A critical first step in the credit granting process usually involves a careful financial analysis of the creditworthiness of the potential client. Wrong decisions result either in foregoing valuable clients or, more severely, in substantial capital losses if the client subsequently defaults. It is thus of crucial importance to develop models that estimate the probability of corporate bankruptcy with a high degree of accuracy. Many studies have focused on the use of financial ratios in linear statistical models, such as linear discriminant analysis and logistic regression, but the obtained error rates are often high. In this paper, Least Squares Support Vector Machine (LS-SVM) classifiers, also known as kernel Fisher discriminant analysis, are applied within the Bayesian evidence framework in order to automatically infer and analyze the creditworthiness of potential corporate clients. The inferred posterior class probabilities of bankruptcy are then used to analyze the sensitivity of the classifier output with respect to the given inputs and to assist in the credit assignment decision making process. The suggested nonlinear kernel based classifiers yield better performance than linear discriminant analysis and logistic regression when applied to a real-life data set concerning commercial credit granting to mid-cap Belgian and Dutch firms.

Key words: Credit Scoring, Kernel Fisher Discriminant Analysis, Least Squares Support Vector Machine Classifiers, Bayesian Inference

1 Introduction

Corporate bankruptcy causes substantial losses not only to the business community, but also to society as a whole. Accurate bankruptcy prediction models are therefore of critical importance to various stakeholders (i.e. management, investors, employees, shareholders and other interested parties), as they provide timely warnings. From a managerial perspective, financial failure forecasting tools allow timely strategic actions to be taken so that financial distress can be avoided. For other stakeholders, such as banks, efficient and automated credit rating tools allow clients that are likely to default on their obligations to be detected at an early stage. Hence, accurate bankruptcy prediction tools enable them to increase the efficiency of one of their core activities, i.e. commercial credit assignment. Financial failure occurs when a firm suffers chronic and serious losses and/or when the firm becomes insolvent, with liabilities that are disproportionate to assets. Widely identified causes and symptoms of financial failure include poor management, autocratic leadership and difficulties in operating successfully in the market. The common assumption underlying bankruptcy prediction is that a firm's financial statements appropriately reflect all these characteristics. Several classification techniques have been suggested to predict financial distress using ratios and data originating from these statements. While early univariate approaches used ratio analysis, multivariate approaches combine multiple ratios and characteristics to predict potential financial distress [1-3]. Linear multiple discriminant approaches (LDA), like Altman's Z-scores, attempt to identify the most efficient hyperplane to linearly separate successful from non-successful firms. At the same time, the most significant combination of predictors is identified by using a stepwise selection procedure. However, these techniques typically rely on the linear separability assumption, as well as on normality assumptions. Motivated by their universal approximation property, multilayer perceptron (MLP) neural networks [4] have been applied to model nonlinear decision boundaries in bankruptcy prediction and credit assignment problems [5-11]. Although advanced learning methods like Bayesian inference [12,13] have been developed for MLPs, their practical design suffers from drawbacks like the non-convex optimization problem and the choice of the number of hidden units.

1 This research was supported by Dexia, Fortis, the K.U.Leuven, the Belgian federal government (IUAP V, GOA-Mefisto 666) and the national science foundation (FWO) with project G.0407.02. This research was initiated when TVG was at the K.U.Leuven and continued at Dexia. TVG is an honorary postdoctoral researcher and JS is a postdoctoral researcher with the FWO-Flanders. The authors wish to thank Peter Van Dijcke, Joao Garcia, Luc Leonard, Eric Hermann, Thomas Alderweireld, Marc Itterbeek, Daniel Saks, Daniel Feremans, Geert Kindt and Jos De Brabanter for helpful comments.


In Support Vector Machines (SVMs), Least Squares SVMs (LS-SVMs) and related kernel based learning techniques [14-17], the inputs are first mapped into a high dimensional, kernel induced feature space in which the regressor or classifier is constructed by minimizing an appropriate convex cost function. Applying Mercer's theorem, the solution is obtained in the dual space from a finite dimensional convex quadratic programming problem for SVMs, or from a linear Karush-Kuhn-Tucker system in the case of LS-SVMs, avoiding explicit knowledge of the high dimensional mapping and using only the related positive (semi-) definite kernel function. In this paper, we apply LS-SVM classifiers [16,18], also known as kernel Fisher Discriminant Analysis [19,20], within the Bayesian evidence framework [20,21] to predict financial distress of Belgian and Dutch firms with middle market capitalization. After having inferred the hyperparameters of the LS-SVM classifier on the different levels of inference, we apply a backward input selection procedure by ranking the model evidence of the different input sets. Posterior class probabilities are obtained by marginalizing over the model parameters in order to infer the probability of making a correct decision and to detect difficult cases that should be referred for further investigation. The obtained results are compared with linear discriminant analysis and logistic regression using leave-one-out cross-validation [22]. This paper is organized as follows. The linear and nonlinear kernel based classification techniques are reviewed in Sections 2, 3 and 4. Bayesian learning for LS-SVMs is outlined in Section 5. Empirical results on financial distress prediction are reported in Section 6. Conclusions are drawn in Section 7.

2 Empirical Linear Discriminant Analysis

Given a number $n$ of explanatory variables or inputs $x = [x_1; \ldots; x_n] \in \mathbb{R}^n$ of a firm, the problem we are concerned with is to predict whether this firm will default on its obligations ($y = -1$) or not ($y = +1$). This corresponds to a binary classification problem with class $C_-$ ($y = -1$) denoting the class of (future) bankrupt firms and class $C_+$ ($y = +1$) the class of solvent firms. Let $p(x|y)$ denote the class probability density of observing the inputs $x$ given the class label $y$, and let $\pi_+ = P(y=+1)$ and $\pi_- = P(y=-1)$ denote the prior class probabilities. The Bayesian decision rule to predict $\hat y$ is then

$$\hat y = \mathrm{sign}[P(y=+1|x) - P(y=-1|x)] \qquad (1)$$
$$\hat y = \mathrm{sign}[\log P(y=+1|x) - \log P(y=-1|x)] \qquad (2)$$
$$\hat y = \mathrm{sign}[\log p(x|y=+1) - \log p(x|y=-1) + \log(\pi_+/\pi_-)], \qquad (3)$$

where the third expression is obtained by applying Bayes' formula

$$p(y|x) = \frac{p(y)\,p(x|y)}{P(y=+1)\,p(x|y=+1) + P(y=-1)\,p(x|y=-1)}$$

and omitting the normalizing constant in the denominator. This Bayesian decision rule is known to yield optimal performance as it minimizes the risk of misclassification for each instance $x$. In the case of Gaussian class densities with means $m_-$, $m_+$ and equal covariance matrix $\Sigma_x$, the Bayesian decision rule becomes [4,23,24]

$$\hat y = \mathrm{sign}[w^T x + b] = \mathrm{sign}[z], \qquad (4)$$

with latent variable $z = w^T x + b$, and where $w = \Sigma_x^{-1}(m_+ - m_-)$ and $b = -w^T(m_+ + m_-)/2 + \log(\pi_+/\pi_-)$. This is known as Linear Discriminant Analysis (LDA). In the case of unequal class covariance matrices, a quadratic discriminant is obtained. As the class densities $p(x|y)$ are typically unknown in practice, one has to estimate the decision rule from given training data $D = \{(x_i, y_i)\}_{i=1}^N$. A common way to estimate the linear discriminant (4) is by solving

$$(\hat w, \hat b) = \arg\min_{w,b}\; \frac{1}{2}\sum_{i=1}^N \big(y_i - (w^T x_i + b)\big)^2. \qquad (5)$$

The solution $(\hat w, \hat b)$ follows from a linear set of equations of dimension $(n+1)\times(n+1)$ and corresponds² to the Fisher Discriminant solution [25], which has been used in the pioneering paper of Altman [1]. The least squares formulation with binary targets $(-1, +1)$ has the additional interpretation of an asymptotically optimal least squares approximation to the Bayesian discriminant function $P(y=+1|x) - P(y=-1|x)$ [23]. This formulation is also often used for training neural network classifiers [4,16]. Instead of minimizing a least squares cost function or estimating the covariance matrices, one may also relate the probability $P(y=+1)$ to the latent variable $z$ via the logistic link function [26]. The probabilistic interpretation of the inverse link function $P(y=+1) = 1/(1+\exp(-z))$ allows $\hat w$ and $\hat b$ to be estimated by maximum likelihood [26]:

$$(\hat w, \hat b) = \arg\min_{w,b}\; \sum_{i=1}^N \log\big(1 + \exp(-y_i(w^T x_i + b))\big). \qquad (6)$$

2 More precisely, Fisher related the maximization of the Rayleigh quotient to a regression approach with targets $(-N/n_D^-, N/n_D^+)$, with $n_D^+$ and $n_D^-$ the number of positive and negative training instances. The solution only differs in the choice of the bias term $b$ and a scaling of the coefficients $w$.


No analytic solution exists, but the solution can be obtained by applying Newton's method, which corresponds to an iteratively reweighted least squares algorithm [24]. The first use of logistic regression for bankruptcy prediction was reported in [27].
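Both estimators admit compact implementations. The sketch below is ours (not the authors'); the variable names X, y, w_hat and b_hat are illustrative. It solves the least squares problem (5) in closed form and the logistic regression problem (6) by general-purpose numerical minimization with SciPy:

```python
import numpy as np
from scipy.optimize import minimize

def least_squares_discriminant(X, y):
    """Solve (5): minimize 0.5 * sum (y_i - (w'x_i + b))^2 in closed form."""
    N, n = X.shape
    A = np.hstack([X, np.ones((N, 1))])           # augment with a column for the bias b
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)   # (n+1)-dimensional least squares solution
    return sol[:n], sol[n]                        # (w_hat, b_hat)

def logistic_discriminant(X, y):
    """Solve (6): minimize sum log(1 + exp(-y_i (w'x_i + b)))."""
    N, n = X.shape
    A = np.hstack([X, np.ones((N, 1))])

    def nll(theta):
        z = A @ theta
        return np.sum(np.logaddexp(0.0, -y * z))  # numerically stable log(1 + exp(.))

    res = minimize(nll, x0=np.zeros(n + 1), method="BFGS")
    return res.x[:n], res.x[n]

# toy usage with two Gaussian classes and labels -1/+1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 3)), rng.normal(+1, 1, (50, 3))])
y = np.hstack([-np.ones(50), np.ones(50)])
w_ls, b_ls = least_squares_discriminant(X, y)
w_lr, b_lr = logistic_discriminant(X, y)
print(np.mean(np.sign(X @ w_ls + b_ls) == y), np.mean(np.sign(X @ w_lr + b_lr) == y))
```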

3 Support Vector Machines and Kernel Based Learning

The Multilayer Perceptron (MLP) is a popular neural network for both regression and classification and has often been used for bankruptcy prediction and credit scoring in general [28-30,6]. Although good training algorithms exist (e.g. Bayesian inference) to design the MLP, there are still a number of drawbacks, like the choice of the architecture of the MLP and the existence of multiple local minima, which implies that the estimated parameters may not be uniquely determined. Recently, a new learning technique emerged, called Support Vector Machines (SVMs), and kernel based learning in general, in which the solution is unique and follows from a convex optimization problem [16,31,32,15]. The regression formulations are also related to Gaussian processes and regularization networks [33], where the latter have been applied to modelling option prices [34]. Although the general nonlinear version of Support Vector Machines (SVM) is quite recent, the roots of the SVM approach for constructing an optimal separating hyperplane for pattern recognition date back to 1963 and 1964 [35,36].

3.1 Linear SVM Classifier: separable case

Consider a training set of $N$ data points $\{(x_i, y_i)\}_{i=1}^N$, with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$. When the data of the two classes are separable (Figure 1a), one can require that

$$w^T x_i + b \ge +1, \quad \text{if } y_i = +1,$$
$$w^T x_i + b \le -1, \quad \text{if } y_i = -1.$$

This set of two inequalities can be combined into one single set as follows:

$$y_i\big(w^T x_i + b\big) \ge +1, \qquad i = 1, \ldots, N. \qquad (7)$$

As can be seen from Figure 1a, multiple solutions are possible. From a generalization perspective, it is best to choose the solution with the largest margin $2/\|w\|_2$.

Fig. 1. Illustration of linear SVM classification in a two dimensional input space. Left: (a) separable case; right: (b) non-separable case.

Support vector machines are modelled within the context of convex optimization theory [37]. The general methodology is to first formulate the problem in the primal weight space as a constrained optimization problem, next formulate the Lagrangian, take the conditions for optimality, and finally solve the problem in the dual space of Lagrange multipliers, which are also called support values. The optimization problem for the separable case aims at maximizing the margin $2/\|w\|_2$ subject to the constraint that all training data points are correctly classified. This gives the following primal (P) problem in $w$:

$$\min_{w,b}\; J_P(w) = \frac{1}{2}w^T w \quad \text{such that} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1,\ldots,N. \qquad (8)$$

The Lagrangian for this constrained optimization problem is $L(w,b;\alpha) = \frac{1}{2}w^T w - \sum_{i=1}^N \alpha_i\big(y_i(w^T x_i + b) - 1\big)$, with Lagrange multipliers $\alpha_i \ge 0$ ($i = 1,\ldots,N$). The solution is the saddle point of the Lagrangian:

$$\max_\alpha \min_{w,b} L. \qquad (9)$$

The conditions for optimality for $w$ and $b$ are

$$\frac{\partial L}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0. \qquad (10)$$

From the first condition in (10), the classifier (4) expressed in terms of the Lagrange multipliers (support values) becomes

$$y(x) = \mathrm{sign}\Big(\sum_{i=1}^N \alpha_i y_i x_i^T x + b\Big). \qquad (11)$$

Replacing (10) into (9), the dual (D) problem in the Lagrange multipliers $\alpha$ is the following Quadratic Programming (QP) problem:

$$\max_\alpha\; J_D(\alpha) = -\frac{1}{2}\sum_{i,j=1}^N y_i y_j x_i^T x_j\,\alpha_i\alpha_j + \sum_{i=1}^N \alpha_i = -\frac{1}{2}\alpha^T\Omega\alpha + 1^T\alpha \qquad (12)$$
$$\text{such that} \quad \sum_{i=1}^N \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1,\ldots,N,$$

with $\alpha = [\alpha_1, \ldots, \alpha_N]^T$, $1 = [1, \ldots, 1]^T \in \mathbb{R}^N$ and $\Omega \in \mathbb{R}^{N\times N}$, where $\Omega_{ij} = y_i y_j x_i^T x_j$ ($i,j = 1,\ldots,N$). The matrix $\Omega$ is positive (semi-) definite by construction. In the case of a positive definite matrix, the solution to this QP problem is global and unique. In the case of a positive semi-definite matrix, the solution is global, but not necessarily unique in terms of the Lagrange multipliers $\alpha_i$, while a unique solution in terms of $w = \sum_{i=1}^N \alpha_i y_i x_i$ is still obtained [37]. An interesting property, called the sparseness property, is that many of the resulting $\alpha_i$ values are equal to zero. The training data points $x_i$ corresponding to non-zero $\alpha_i$ are called support vectors. These support vectors are located close to the decision boundary. From a non-zero support value $\alpha_i > 0$, $b$ is obtained from $y_i(w^T x_i + b) - 1 = 0$.

3.2 Linear SVM Classifier: non-separable case

In most practical, real-life classification problems, the data are non-separable in a linear or nonlinear sense, due to the overlap between the two classes (see Figure 1b). In such cases, one aims at finding a classifier that separates the data as well as possible. The SVM classifier formulation (8) is extended to the non-separable case by introducing slack variables $\xi_i > 0$ in order to tolerate misclassifications [38]. The inequalities are changed into

$$y_i\big(w^T x_i + b\big) \ge 1 - \xi_i, \qquad i = 1,\ldots,N, \qquad (13)$$

where the $i$-th inequality is violated when $\xi_i > 0$. In the primal weight space, the optimization problem becomes

$$\min_{w,b,\xi}\; J_P(w) = \frac{1}{2}w^T w + c\sum_{i=1}^N \xi_i \qquad (14)$$
$$\text{such that} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\ldots,N,$$

where $c$ is a positive real constant that determines the trade-off between the large margin term $\frac{1}{2}w^T w$ and the error term $\sum_{i=1}^N \xi_i$. The Lagrangian is equal to $L = \frac{1}{2}w^T w + c\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i\big(y_i(w^T x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^N \nu_i\xi_i$, with Lagrange multipliers $\alpha_i \ge 0$, $\nu_i \ge 0$ ($i = 1,\ldots,N$). The solution is given by the saddle point of the Lagrangian $\max_{\alpha,\nu}\min_{w,b,\xi} L(w,b,\xi;\alpha,\nu)$, with conditions for optimality

$$\frac{\partial L}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\rightarrow\; 0 \le \alpha_i \le c, \quad i = 1,\ldots,N. \qquad (15)$$

Replacing (15) in (14) yields the following dual QP problem:

$$\max_\alpha\; J_D(\alpha) = -\frac{1}{2}\sum_{i,j=1}^N y_i y_j x_i^T x_j\,\alpha_i\alpha_j + \sum_{i=1}^N \alpha_i = -\frac{1}{2}\alpha^T\Omega\alpha + 1^T\alpha \qquad (16)$$
$$\text{such that} \quad \sum_{i=1}^N \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le c, \quad i = 1,\ldots,N.$$

The bias term b is obtained as a by-product of the QP-calculation or from a non-zero support value.
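As an illustration, the box-constrained dual problem (16) can be handed to an off-the-shelf quadratic programming solver. The sketch below is ours, not the paper's implementation; it assumes the cvxopt package and a linear kernel, and merely shows one way to assemble the matrices of (16):

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_dual(X, y, c=1.0):
    """Solve (16): min 0.5 a'Omega a - 1'a  s.t.  0 <= a_i <= c and y'a = 0."""
    N = X.shape[0]
    y = y.astype(float)
    Omega = (y[:, None] * y[None, :]) * (X @ X.T)      # Omega_ij = y_i y_j x_i' x_j
    P, q = matrix(Omega), matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))     # encodes -a <= 0 and a <= c
    h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)       # equality constraint y'a = 0
    solvers.options["show_progress"] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    sv = (alpha > 1e-6) & (alpha < c - 1e-6)           # unbounded support vectors
    bias = float(np.mean(y[sv] - X[sv] @ w))           # from y_i (w'x_i + b) = 1
    return w, bias, alpha
```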

3.3 Kernel Trick and Mercer Condition

The linear SVM classifier is extended to a nonlinear SVM classifier by first mapping the inputs in a nonlinear way $x \mapsto \varphi(x)$ into a high dimensional space, called the feature space in SVM terminology. In this high dimensional feature space, a linear separating hyperplane $w^T\varphi(x) + b = 0$ is constructed using (12), as depicted in Figure 2. A key element of nonlinear SVMs is that the nonlinear mapping $\varphi(\cdot): x \mapsto \varphi(x)$ need not be explicitly known, but is defined implicitly in terms of a positive (semi-) definite kernel function satisfying the Mercer condition

$$K(x_1, x_2) = \varphi(x_1)^T\varphi(x_2). \qquad (17)$$

Given the kernel function $K(x_1, x_2)$, the nonlinear classifier is obtained by solving the dual QP problem in which the product $x_i^T x_j$ is replaced by $\varphi(x_i)^T\varphi(x_j) = K(x_i, x_j)$, e.g., $\Omega = [y_i y_j\,\varphi(x_i)^T\varphi(x_j)]$. The nonlinear SVM classifier is then obtained as

$$y(x) = \mathrm{sign}[w^T\varphi(x) + b] = \mathrm{sign}\Big[\sum_{i=1}^N \alpha_i y_i K(x_i, x) + b\Big]. \qquad (18)$$

In the dual space, the score $z = \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b$ is obtained as a weighted sum of the kernel functions evaluated in the support vectors and the evaluation point $x$, with weights $\alpha_i y_i$.

Fig. 2. Illustration of kernel based classification. The inputs are first mapped in a nonlinear way to a high-dimensional feature space ($x \mapsto \varphi(x)$), in which a linear separating hyperplane is constructed. Applying the Mercer condition ($K(x_1, x_2) = \varphi(x_1)^T\varphi(x_2)$), a nonlinear classifier in the input space is obtained.

A popular choice for the kernel function is the radial basis function (RBF) kernel $K(x_i, x_j) = \exp\{-\|x_i - x_j\|_2^2/\sigma^2\}$, where $\sigma$ is a tuning parameter. Other typical kernel functions are the linear kernel $K(x_i, x_j) = x_i^T x_j$; the polynomial kernel $K(x_i, x_j) = (\tau + x_i^T x_j)^d$ with degree $d$ and tuning parameter $\tau \ge 0$; and the MLP kernel $K(x_i, x_j) = \tanh(\kappa_1 x_i^T x_j + \kappa_2)$. The latter is not positive semi-definite for all choices of the tuning parameters $\kappa_1$ and $\kappa_2$.
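For reference, the kernel functions listed above can be evaluated on data matrices with a few lines of numpy; the helper names below are ours:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """K(x_i, x_j) = exp(-||x_i - x_j||_2^2 / sigma^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

def linear_kernel(X1, X2):
    return X1 @ X2.T

def polynomial_kernel(X1, X2, tau=1.0, d=2):
    return (tau + X1 @ X2.T) ** d
```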

4 Least Squares Support Vector Machines

The LS-SVM classifier formulation can be obtained by modifying the SVM classifier formulation as follows:

$$\min_{w,b,e}\; J_P(w) = \frac{1}{2}w^T w + \frac{\gamma}{2}\sum_{i=1}^N e_{C,i}^2 \qquad (19)$$
$$\text{subject to} \quad y_i\big[w^T\varphi(x_i) + b\big] = 1 - e_{C,i}, \qquad i = 1,\ldots,N. \qquad (20)$$

Besides the quadratic cost function, an important difference with standard SVMs is that the formulation now consists of equality instead of inequality constraints [16]. The LS-SVM classifier formulation (19)-(20) implicitly corresponds to a regression interpretation (22)-(23) with binary targets $y_i = \pm 1$. By multiplying the error $e_{C,i}$ with $y_i$ and using $y_i^2 = 1$, the sum of squared errors $\sum_{i=1}^N e_{C,i}^2$ becomes

$$\sum_{i=1}^N e_{C,i}^2 = \sum_{i=1}^N (y_i e_{C,i})^2 = \sum_{i=1}^N e_i^2 = \sum_{i=1}^N\big(y_i - (w^T\varphi(x_i) + b)\big)^2, \qquad (21)$$

with the regression error $e_i = y_i - (w^T\varphi(x_i) + b) = y_i e_{C,i}$. The LS-SVM classifier is then constructed as follows:

$$\min_{w,b,e}\; J_P = \frac{1}{2}w^T w + \gamma\,\frac{1}{2}\sum_{i=1}^N e_i^2 \qquad (22)$$
$$\text{s.t.} \quad e_i = y_i - (w^T\varphi(x_i) + b), \qquad i = 1,\ldots,N. \qquad (23)$$

Observe that the cost function is a weighted sum of a regularization term $J_w = \frac{1}{2}w^T w$ and an error term $J_e = \frac{1}{2}\sum_{i=1}^N e_i^2$.

One then solves the constrained optimization problem (22)-(23) by constructing the Lagrangian $L(w,b,e;\alpha) = \frac{1}{2}w^T w + \gamma\,\frac{1}{2}\sum_{i=1}^N e_i^2 - \sum_{i=1}^N\alpha_i\big(w^T\varphi(x_i) + b + e_i - y_i\big)$, with Lagrange multipliers $\alpha_i \in \mathbb{R}$ ($i = 1,\ldots,N$). The conditions for optimality are given by

$$\frac{\partial L}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^N \alpha_i\varphi(x_i), \qquad \frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^N \alpha_i = 0, \qquad \frac{\partial L}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \gamma e_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \;\rightarrow\; w^T\varphi(x_i) + b + e_i - y_i = 0, \quad i = 1,\ldots,N. \qquad (24)$$

After elimination of the variables $w$ and $e$, one obtains the following linear Karush-Kuhn-Tucker (KKT) system of dimension $(N+1)\times(N+1)$ in the dual space [16,18,20]:

$$\begin{bmatrix} 0 & 1^T \\ 1 & \Omega + \gamma^{-1}I_N \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \qquad (25)$$

with $y = [y_1; \ldots; y_N]$, $1 = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N] \in \mathbb{R}^N$, and where Mercer's theorem [14,15,17] is applied within the $\Omega$ matrix: $\Omega_{ij} = \varphi(x_i)^T\varphi(x_j) = K(x_i, x_j)$. The LS-SVM classifier is then obtained as

$$\hat y = \mathrm{sign}[w^T\varphi(x) + b] = \mathrm{sign}\Big[\sum_{i=1}^N \alpha_i K(x, x_i) + b\Big], \qquad (26)$$

with latent variable $z = \sum_{i=1}^N \alpha_i K(x, x_i) + b$. The support values $\alpha_i$ ($i = 1,\ldots,N$) in the dual classifier formulation determine the relative weight of each data point $x_i$ in the classifier decision (26).
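The dual LS-SVM solution therefore amounts to a single linear solve. The following sketch is our illustration (with an RBF kernel and given values of γ and σ; in the paper these are actually selected by the Bayesian inference of Section 5). It builds and solves the KKT system (25) and evaluates the latent score of (26):

```python
import numpy as np

def _rbf(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_train(X, y, gamma, sigma):
    """Solve (25): [[0, 1'], [1, Omega + I/gamma]] [b; alpha] = [0; y]."""
    N = X.shape[0]
    Omega = _rbf(X, X, sigma)
    KKT = np.zeros((N + 1, N + 1))
    KKT[0, 1:] = 1.0
    KKT[1:, 0] = 1.0
    KKT[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate([[0.0], y.astype(float)])
    sol = np.linalg.solve(KKT, rhs)
    return sol[1:], sol[0]                        # (alpha, b)

def lssvm_score(X_train, alpha, b, sigma, X_new):
    """Latent score z = sum_i alpha_i K(x, x_i) + b of (26); the class label is sign(z)."""
    return _rbf(X_new, X_train, sigma) @ alpha + b
```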

5 Bayesian Interpretation and Inference

The LS-SVM classifier formulation allows one to estimate the classifier support values $\alpha$ and bias term $b$ from the data $D$, given the regularization parameter $\gamma$ and the kernel function $K$, e.g., an RBF kernel with parameter $\sigma$. Together with the set of explanatory ratios/inputs $I \subseteq \{1,\ldots,n\}$, the kernel function and its parameters define the model structure $M$. These regularization and kernel parameters and the input set need to be estimated from the data as well. This is achieved within the Bayesian evidence framework [12,13,20,21], which applies Bayes' formula on three levels of inference [20,21]:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}. \qquad (27)$$

(1) The primal and dual model parameters $w, b$ and $\alpha, b$ are inferred on the first level.
(2) The regularization parameter $\gamma = \zeta/\mu$ is inferred on the second level, where $\mu$ and $\zeta$ are additional parameters in the probabilistic inference.
(3) The parameter of the kernel function, e.g., $\sigma$, the (choice of the) kernel function $K$ and the optimal input set are represented in the structural model description $M$, which is inferred on level 3.

A schematic overview of the three levels of inference is depicted in Figure 3, which shows the hierarchical approach in which the likelihood of level $i$ is obtained from level $i-1$ ($i = 2, 3$). Given the least squares formulation, the model parameters are multivariate normally distributed, allowing for analytic expressions³ on all levels of inference. In each subsection, Bayes' formula is explained first, while practical expressions, computations and interpretations are given afterwards. All complex derivations are given in the appendix.

3 Matlab implementations for the dual space expressions are available from http://www.esat.kuleuven.ac.be/lssvmlab.



Fig. 3. Different levels of Bayesian inference. The posterior probability of the model parameters w and b is inferred from the data D by applying Bayes formula on the first level for given hyperparameters µ (prior) and ζ (likelihood) and the model structure M. The model parameters are obtained by maximizing the posterior. The evidence on the first level becomes the likelihood on the second level when applying Bayes formula to infer µ and ζ (with γ = ζ/µ) from the given data D. The optimal hyperparameters µmp and ζmp are obtained by maximizing the corresponding posterior on level 2. Model comparison is performed on the third level in order to compare different model structures, e.g., with different candidate input sets and/or different kernel parameters. The likelihood on the third level is equal to the evidence from level 2. Comparing different model structures M, that model structure with the highest posterior probability is selected.


5.1 Inference of model parameters (Level 1)

5.1.1 Bayes' formula

Applying Bayes' formula on level 1, one obtains the posterior probability of the model parameters $w$ and $b$:

$$p(w,b|D,\log\mu,\log\zeta,M) = \frac{p(D|w,b,\log\mu,\log\zeta,M)\,p(w,b|\log\mu,\log\zeta,M)}{p(D|\log\mu,\log\zeta,M)} \propto p(D|w,b,\log\mu,\log\zeta,M)\,p(w,b|\log\mu,\log\zeta,M), \qquad (28)$$

where the last step is obtained since the evidence $p(D|\log\mu,\log\zeta,M)$ is a normalizing constant that does not depend upon $w$ and $b$. For the prior, no correlation between $w$ and $b$ is assumed: $p(w,b|\log\mu,M) = p(w|\log\mu,M)\,p(b|M) \propto p(w|\log\mu,M)$, with a multivariate Gaussian prior on $w$ with zero mean and covariance matrix $\mu^{-1}I_{n_\varphi}$ and an uninformative, flat prior on $b$:

$$p(w|\log\mu,M) = \Big(\frac{\mu}{2\pi}\Big)^{\frac{n_\varphi}{2}}\exp\Big(-\frac{\mu}{2}w^T w\Big), \qquad p(b|M) = \text{constant}. \qquad (29)$$

The uniform prior distribution on $b$ can be approximated by a Gaussian distribution with standard deviation $\sigma_b \to \infty$. The prior states the belief that, without any learning from the data, the coefficients are zero with an uncertainty given by the variance $1/\mu$. It is assumed that the data are independently identically distributed when expressing the likelihood:

$$p(D|w,b,\log\zeta,M) \propto \prod_{i=1}^N p(y_i,x_i|w,b,\log\zeta,M) \propto \prod_{i=1}^N p(e_i|w,b,\log\zeta,M) \propto \Big(\frac{\zeta}{2\pi}\Big)^{\frac{N}{2}}\exp\Big(-\frac{\zeta}{2}\sum_{i=1}^N e_i^2\Big), \qquad (30)$$

where the last step is by assumption. This corresponds to the assumption that the z-score $w^T\varphi(x) + b$ is Gaussian distributed around the targets $+1$ and $-1$. Given that the prior (29) and likelihood (30) are multivariate normal distributions, the posterior (28) is a multivariate normal distribution⁴ in $[w; b]$ with mean $[w_{mp}; b_{mp}] \in \mathbb{R}^{n_\varphi+1}$ and covariance matrix $Q \in \mathbb{R}^{(n_\varphi+1)\times(n_\varphi+1)}$. An alternative expression for the posterior is obtained by substituting (29) and (30) into (28). These approaches yield

$$p(w,b|\log\mu,\log\zeta,M) = \sqrt{\frac{\det(Q^{-1})}{(2\pi)^{n_\varphi+1}}}\,\exp\Big(-\tfrac{1}{2}[w-w_{mp}; b-b_{mp}]^T Q^{-1}[w-w_{mp}; b-b_{mp}]\Big) \qquad (31)$$
$$\propto \Big(\frac{\mu}{2\pi}\Big)^{\frac{n_\varphi}{2}}\exp\Big(-\frac{\mu}{2}w^T w\Big)\Big(\frac{\zeta}{2\pi}\Big)^{\frac{N}{2}}\exp\Big(-\frac{\zeta}{2}\sum_{i=1}^N e_i^2\Big), \qquad (32)$$

respectively. The evidence is a normalizing constant in (28), independent of $w$ and $b$, such that $\int\cdots\int p(w,b|D,\log\mu,\log\zeta,M)\,dw_1\cdots dw_{n_\varphi}\,db = 1$. Substituting the expressions for the prior (29), likelihood (30) and posterior (32) into (28), one obtains

$$p(D|\log\mu,\log\zeta,M) = \frac{p(w_{mp}|\log\mu,M)\,p(D|w_{mp},b_{mp},\log\zeta,M)}{p(w_{mp},b_{mp}|D,\log\mu,\log\zeta,M)}. \qquad (33)$$

4 The notation $[x; y] = [x, y]^T$ is used here.

5.1.2 Computation and Interpretation

The model parameters with maximum posterior probability are obtained by minimizing the negative logarithm of (31) and (32):

$$(w_{mp}, b_{mp}) = \arg\min_{w,b}\; J_{P,1}(w,b),$$
$$J_{P,1}(w,b) = J_{P,1}(w_{mp},b_{mp}) + \tfrac{1}{2}[w-w_{mp}; b-b_{mp}]^T Q^{-1}[w-w_{mp}; b-b_{mp}] \qquad (34)$$
$$J_{P,1}(w,b) = \frac{\mu}{2}w^T w + \frac{\zeta}{2}\sum_{i=1}^N e_i^2, \qquad (35)$$

where constants are neglected in the optimization problem. Both expressions yield the same optimization problem, and the covariance matrix $Q$ is equal to the inverse of the Hessian $H$ of $J_{P,1}$. The Hessian is expressed in terms of the matrix $\Phi = [\varphi(x_1), \ldots, \varphi(x_N)]^T$ of regressors, as derived in the appendix. Comparing (35) with (22), one obtains the same optimization problem for $\gamma = \zeta/\mu$, up to a constant scaling. The optimal $w_{mp}$ and $b_{mp}$ are computed in the dual space from the linear KKT system (25) with $\gamma = \zeta/\mu$, and the scoring function $z = w_{mp}^T\varphi(x) + b_{mp}$ is expressed in terms of the dual parameters $\alpha$ and the bias term $b_{mp}$ via (26). Substituting (29), (30) and (32) into (33), one obtains

$$p(D|\log\mu,\log\zeta,M) \propto \Big(\frac{\mu^{n_\varphi}\zeta^N}{\det H}\Big)^{\frac{1}{2}}\exp\big(-J_{P,1}(w_{mp},b_{mp})\big). \qquad (36)$$

As $J_{P,1}(w,b) = \mu J_w(w) + \zeta J_e(w,b)$, the evidence can be rewritten as

$$\underbrace{p(D|\log\mu,\log\zeta,M)}_{\text{evidence}} \propto \underbrace{p(D|w_{mp},b_{mp},\log\zeta,M)}_{\text{likelihood}|_{w_{mp},b_{mp}}}\;\underbrace{p(w_{mp}|\log\mu,M)\,(\det H)^{-1/2}}_{\text{Occam factor}}.$$

The model evidence consists of the likelihood of the data and an Occam factor that penalizes overly complex models. The Occam factor consists of the regularization term $\frac{1}{2}w_{mp}^T w_{mp}$ and the ratio $(\mu^{n_\varphi}/\det H)^{1/2}$, which is a measure of the volume of the posterior probability divided by the volume of the prior probability. A strong contraction of the posterior versus the prior space indicates too many free parameters and, hence, overfitting on the training data. The evidence will be maximized on level 2, where dual space expressions are also derived.

5.2 Inference of hyper-parameters (Level 2)

5.2.1 Bayes' formula

The optimal regularization parameters $\mu$ and $\zeta$ are inferred from the given data $D$ by applying Bayes' rule on the second level [20,21]:

$$p(\log\mu,\log\zeta|D,M) = \frac{p(D|\log\mu,\log\zeta,M)\,p(\log\mu,\log\zeta)}{p(D|M)}. \qquad (37)$$

The prior $p(\log\mu,\log\zeta|M) = p(\log\mu|M)\,p(\log\zeta|M) = \text{constant}$ is taken to be a flat, uninformative prior ($\sigma_{\log\mu}, \sigma_{\log\zeta} \to \infty$). The level 2 likelihood $p(D|\log\mu,\log\zeta,M)$ is equal to the level 1 evidence (36). In this way, Bayesian inference implicitly embodies Occam's razor: on level 2 the evidence of level 1 is optimized so as to find a trade-off between the model fit and a complexity term that avoids overfitting [12,13]. The level 2 evidence is obtained in a similar way as on level 1, as the likelihood at the maximum a posteriori estimate times the ratio of the volume of the posterior probability and the volume of the prior probability:

$$p(D|M) \simeq p(D|\log\mu_{mp},\log\zeta_{mp},M)\,\frac{\sigma_{\log\mu|D}\,\sigma_{\log\zeta|D}}{\sigma_{\log\mu}\,\sigma_{\log\zeta}}, \qquad (38)$$

where one typically approximates the posterior probability by a multivariate normal probability function with diagonal covariance matrix $\mathrm{diag}([\sigma^2_{\log\mu|D}, \sigma^2_{\log\zeta|D}]) \in \mathbb{R}^{2\times 2}$. Neglecting all constants, Bayes' formula (37) becomes

$$p(\log\mu,\log\zeta|D,M) \propto p(D|\log\mu,\log\zeta,M), \qquad (39)$$

where the expressions for the level 1 evidence are given by (33) and (36).

5.2.2 Computation and Interpretation

In the primal space, the hyperparameters are obtained by minimizing the negative logarithm of (36) and (39):

$$(\mu_{mp},\zeta_{mp}) = \arg\min_{\mu,\zeta}\; J_{P,2}(\mu,\zeta) = \mu J_w(w_{mp}) + \zeta J_e(w_{mp},b_{mp}) + \tfrac{1}{2}\log\det H - \frac{n_\varphi}{2}\log\mu - \frac{N}{2}\log\zeta. \qquad (40)$$

Observe that in order to evaluate (40) one also needs to calculate $w_{mp}$ and $b_{mp}$ for the given $\mu$ and $\zeta$ and to evaluate the level 1 cost function. The determinant of $H$ is equal to (see the appendix for details) $\det(H) = (\zeta N)\det(\mu I_{n_\varphi} + \zeta\Phi^T M_c\Phi)$, with the idempotent centering matrix $M_c = I_N - \frac{1}{N}11^T = M_c^2 \in \mathbb{R}^{N\times N}$. The determinant is also equal to the product of the eigenvalues. The $n_e$ non-zero eigenvalues $\lambda_1,\ldots,\lambda_{n_e}$ of $\Phi^T M_c\Phi$ are equal to the $n_e$ non-zero eigenvalues of $M_c\Phi\Phi^T M_c = M_c\Omega M_c \in \mathbb{R}^{N\times N}$, which can be calculated in the dual space. Substituting the determinant $\det(H) = \zeta N\mu^{n_\varphi - n_e}\prod_{i=1}^{n_e}(\mu + \zeta\lambda_i)$ into (40), one obtains the optimization problem in the dual space

$$J_{D,2}(\mu,\zeta) = \mu J_w(w_{mp}) + \zeta J_e(w_{mp},b_{mp}) + \frac{1}{2}\sum_{i=1}^{n_e}\log(\mu + \zeta\lambda_i) - \frac{n_e}{2}\log\mu - \frac{N-1}{2}\log\zeta, \qquad (41)$$

where it can be shown by matrix algebra that $\mu J_w(w_{mp}) + \zeta J_e(w_{mp},b_{mp}) = \frac{1}{2}y^T M_c\big(\frac{1}{\mu}M_c\Omega M_c + \frac{1}{\zeta}I_N\big)^{-1}M_c y$. An important concept in neural networks and Bayesian learning in general is the effective number of parameters. Although there are $n_\varphi + 1$ free parameters $w_1,\ldots,w_{n_\varphi},b$ in the primal space, their use in (35) is restricted by the regularization term $\frac{1}{2}w^T w$. The effective number of parameters $d_{eff}$ is equal to $d_{eff} = \sum_i \lambda_{i,u}/\lambda_{i,r}$, where $\lambda_{i,u}$ and $\lambda_{i,r}$ denote the eigenvalues of the Hessian of the unregularized cost function $J_{1,u} = \zeta E_D$ and of the regularized cost function $J_{1,r} = \mu E_W + \zeta E_D$, respectively [4,12]. For LS-SVMs, the effective number of parameters is equal to

$$d_{eff} = 1 + \sum_{i=1}^{n_e}\frac{\zeta\lambda_i}{\mu + \zeta\lambda_i} = 1 + \sum_{i=1}^{n_e}\frac{\gamma\lambda_i}{1 + \gamma\lambda_i}, \qquad (42)$$

with $\gamma = \zeta/\mu \in \mathbb{R}^+$. The term $+1$ appears because no regularization is applied to the bias term $b$. As shown in the appendix, one has that $n_e \le N-1$ and, hence, also that $d_{eff} \le N$, even in the case of high dimensional feature spaces. The conditions for optimality for (41) are obtained by putting $\partial J_2/\partial\mu = \partial J_2/\partial\zeta = 0$. One obtains⁵

$$\partial J_2/\partial\mu = 0 \;\rightarrow\; 2\mu_{mp}J_w(w_{mp};\mu_{mp},\zeta_{mp}) = d_{eff}(\mu_{mp},\zeta_{mp}) - 1, \qquad (43)$$
$$\partial J_2/\partial\zeta = 0 \;\rightarrow\; 2\zeta_{mp}J_e(w_{mp},b_{mp};\mu_{mp},\zeta_{mp}) = N - d_{eff}, \qquad (44)$$

where the latter equation corresponds to the unbiased estimate of the noise variance $1/\zeta_{mp} = \sum_{i=1}^N e_i^2/(N - d_{eff})$.

5 In this derivation, one uses that $\partial J_{P,1}(w_{mp},b_{mp})/\partial\mu = \delta J_{P,1}(w_{mp},b_{mp})/\delta\mu + \delta J_{P,1}(w_{mp},b_{mp})/\delta[w;b]|_{[w_{mp};b_{mp}]}\,\times\,\delta[w_{mp};b_{mp}]/\delta\mu = J_w(w_{mp})$, since $\delta J_{P,1}(w_{mp},b_{mp})/\delta[w;b]|_{[w_{mp};b_{mp}]} = 0$ [13,16,31].

Instead of solving the optimization problem in $\mu$ and $\zeta$, one may also reformulate (41) using (43)-(44) in terms of $\gamma = \zeta/\mu$ and solve the following scalar optimization problem:

$$\min_\gamma\; \sum_{i=1}^{N-1}\log\Big(\lambda_i + \frac{1}{\gamma}\Big) + (N-1)\log\big(J_w(w_{mp}) + \gamma J_e(w_{mp},b_{mp})\big), \qquad (45)$$

with

$$J_e(w_{mp},b_{mp}) = \frac{1}{2\gamma^2}\,y^T M_c V(\Lambda + I_N/\gamma)^{-2}V^T M_c y, \qquad (46)$$
$$J_w(w_{mp}) = \frac{1}{2}\,y^T M_c V\Lambda(\Lambda + I_N/\gamma)^{-2}V^T M_c y, \qquad (47)$$
$$J_w(w_{mp}) + \gamma J_e(w_{mp},b_{mp}) = \frac{1}{2}\,y^T M_c V(\Lambda + I_N/\gamma)^{-1}V^T M_c y, \qquad (48)$$

with the eigenvalue decomposition $M_c\Omega M_c = V\Lambda V^T$. Given the optimal $\gamma_{mp}$ from (45), one finds the effective number of parameters $d_{eff}$ from $d_{eff} = 1 + \sum_{i=1}^{n_e}\gamma\lambda_i/(1 + \gamma\lambda_i)$. The optimal $\mu_{mp}$ and $\zeta_{mp}$ are obtained from $\mu_{mp} = (d_{eff}-1)/(2J_w(w_{mp}))$ and $\zeta_{mp} = (N - d_{eff})/(2J_e(w_{mp},b_{mp}))$.
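A simple way to carry out this level 2 inference numerically is to evaluate the scalar cost (45) on a grid of γ values, using the eigenvalue decomposition of M_c Ω M_c, and then to recover d_eff, µ_mp and ζ_mp as above. The sketch below is ours; the grid and function names are illustrative, not from the paper:

```python
import numpy as np

def level2_infer(Omega, y, gamma_grid=None):
    """Level 2 inference for an LS-SVM with kernel matrix Omega and targets y (+/-1)."""
    if gamma_grid is None:
        gamma_grid = np.logspace(-3, 4, 200)
    N = len(y)
    Mc = np.eye(N) - np.ones((N, N)) / N                  # centering matrix M_c
    lam, V = np.linalg.eigh(Mc @ Omega @ Mc)              # eigen-decomposition of M_c Omega M_c
    lam = np.clip(lam, 0.0, None)                         # remove tiny negative round-off
    v2 = (V.T @ (Mc @ y.astype(float))) ** 2              # squared projections (V' M_c y)_i^2
    lam_top = np.sort(lam)[1:]                            # the N-1 eigenvalues entering (45)
    best = None
    for g in gamma_grid:
        Jw = 0.5 * np.sum(lam * v2 / (lam + 1.0 / g) ** 2)           # (47)
        Je = 0.5 / g ** 2 * np.sum(v2 / (lam + 1.0 / g) ** 2)        # (46)
        cost = np.sum(np.log(lam_top + 1.0 / g)) + (N - 1) * np.log(Jw + g * Je)  # (45)
        if best is None or cost < best[0]:
            best = (cost, g, Jw, Je)
    _, g, Jw, Je = best
    d_eff = 1.0 + np.sum(g * lam / (1.0 + g * lam))       # (42)
    mu = (d_eff - 1.0) / (2.0 * Jw)                       # from (43)
    zeta = (N - d_eff) / (2.0 * Je)                       # from (44)
    return g, mu, zeta, d_eff
```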

5.3 Model Comparison (Level 3)

5.3.1 Bayes' formula

The model structure $M$ determines the remaining parameters of the kernel based model: the selected kernel function (linear, RBF, ...), the kernel parameter (RBF kernel parameter $\sigma$) and the selected explanatory inputs. The model structure is inferred on level 3. Consider, e.g., the inference of the RBF kernel parameter $\sigma$, where the model structure is denoted by $M_\sigma$. Bayes' formula for the inference of $M_\sigma$ is equal to


$$p(M_\sigma|D) \propto p(D|M_\sigma)\,p(M_\sigma), \qquad (49)$$

where no evidence p(D) is used in the expression on level 3 as it is in practice impossible to integrate over all model structures. The prior probability p(Mσ ) is assumed to be constant. The likelihood is equal to the level 2 evidence (38).

5.3.2 Computation and Interpretation

Substituting the evidence (38) into (49) and taking into account the constant prior, Bayes' rule (49) becomes

$$p(M|D) \simeq p(D|\log\mu_{mp},\log\zeta_{mp},M)\,\frac{\sigma_{\log\mu|D}\,\sigma_{\log\zeta|D}}{\sigma_{\log\mu}\,\sigma_{\log\zeta}}. \qquad (50)$$

As uninformative priors are used on level 2, the standard deviations $\sigma_{\log\mu}$ and $\sigma_{\log\zeta}$ of the prior distribution both tend to infinity and are omitted in the comparison of different models in (50). The posterior error bars can be approximated analytically as $\sigma^2_{\log\mu|D} \simeq 2/(d_{eff}-1)$ and $\sigma^2_{\log\zeta|D} \simeq 2/(N-d_{eff})$, respectively [13]. The level 3 posterior becomes

$$p(M_\sigma|D) \simeq p(D|\log\mu_{mp},\log\zeta_{mp},M_\sigma)\,\frac{\sigma_{\log\mu|D}\,\sigma_{\log\zeta|D}}{\sigma_{\log\mu}\,\sigma_{\log\zeta}} \propto \sqrt{\frac{\mu_{mp}^{n_e}\,\zeta_{mp}^{N-1}}{(d_{eff}-1)(N-d_{eff})\prod_{i=1}^{n_e}(\mu_{mp}+\zeta_{mp}\lambda_i)}}, \qquad (51)$$

where all expressions can be calculated in the dual space. A practical way to infer the kernel parameter $\sigma$ is to calculate (51) for a grid of possible kernel parameters $\sigma_1,\ldots,\sigma_m$ and to compare the corresponding posterior model probabilities $p(M_{\sigma_1}|D),\ldots,p(M_{\sigma_m}|D)$. An additional observation is that the RBF LS-SVM classifier may not always yield a monotonic relation between the evolution of a ratio (e.g., the solvency ratio) and the default risk. This is due to the nonlinearity of the classifier and/or multivariate correlations. In case monotonic relations are important, one may choose to use a combined kernel function $K(x_1,x_2) = \kappa K_{lin}(x_1,x_2) + (1-\kappa)K_{RBF}(x_1,x_2)$, where the parameter $\kappa \in [0,1]$ can be determined on level 3. In this paper, the use of an RBF kernel is illustrated. Model comparison is also used to infer the set of most relevant inputs [21] out of the given set of candidate explanatory variables by making pairwise comparisons of models with different input sets. In a backward input selection procedure, one starts from the full candidate input set and removes in each pruning step the input that yields the best model improvement (or smallest decrease) in terms of the model probability (51). The procedure is stopped when no significant decrease of the model probability is observed.

Table 1
Evidence against H0 (no improvement of Mi over Mj) for different values of the Bayes factor Bij [39].

2 ln Bij    Bij          Evidence against H0
0 to 2      1 to 3       Not worth more than a bare mention
2 to 5      3 to 12      Positive
5 to 10     12 to 150    Strong
> 10        > 150        Decisive

In the case of equal prior model probabilities $p(M_i) = p(M_j)$ ($\forall i,j$), the models $M_i$ and $M_j$ are compared according to their Bayes factor

$$B_{ij} = \frac{p(D|M_i)}{p(D|M_j)} = \frac{p(D|\log\mu_i,\log\zeta_i,M_i)}{p(D|\log\mu_j,\log\zeta_j,M_j)}\;\frac{\sigma_{\log\mu_i|D}\,\sigma_{\log\zeta_i|D}}{\sigma_{\log\mu_j|D}\,\sigma_{\log\zeta_j|D}}. \qquad (52)$$

According to [39], one uses the values in Table 1 in order to report and interpret the significance of model Mi improving on model Mj .

5.4 Moderated Output of the Classifier

5.4.1 Moderated Output

Based on the Bayesian interpretation, an expression is derived for the likelihood $p(x|y,w,b,\zeta,M)$ of observing $x$ given the class label $y$ and the parameters $w, b, \zeta, M$. However, the parameters⁶ $w$ and $b$ are multivariate normally distributed. Hence, the moderated likelihood is obtained as

$$p(x|y,\zeta,M) = \int p(x|y,w,b,\zeta,M)\,p(w,b|y,\mu,\zeta,M)\,dw_1\cdots dw_{n_\varphi}\,db. \qquad (53)$$

This expression will then be used in Bayes’ rule (3).

6 The uncertainty on $\zeta$ only has a minor influence in a limited number of directions [13] and is neglected.


5.4.2 Computation and Interpretation

In the level 1 formulation, it was assumed that the errors $e$ are normally distributed around the targets $\pm 1$ with variance $\zeta^{-1}$, i.e.,

$$p(x|y=+1,w,b,\zeta,M) = (2\pi/\zeta)^{-1/2}\exp(-\tfrac{1}{2}\zeta e_+^2), \qquad (54)$$
$$p(x|y=-1,w,b,\zeta,M) = (2\pi/\zeta)^{-1/2}\exp(-\tfrac{1}{2}\zeta e_-^2), \qquad (55)$$

with $e_+ = +1 - (w^T\varphi(x) + b)$ and $e_- = -1 - (w^T\varphi(x) + b)$, respectively. The assumption that the mean z-scores per class are equal to $+1$ and $-1$ will be relaxed; for the calculation of the moderated output, it is assumed that the scores $z$ are normally distributed with centers $t_+$ (class $C_+$) and $t_-$ (class $C_-$). Define the boolean vectors $1_+ = [y_i = +1] \in \mathbb{R}^N$ and $1_- = [y_i = -1] \in \mathbb{R}^N$, whose elements are 1 or 0 according to whether observation $i$ belongs to $C_+$ or not for $1_+$, and vice versa for $1_-$. The centers are estimated as $t_+ = w^T m_{\varphi+} + b$ and $t_- = w^T m_{\varphi-} + b$, with the feature vector class means $m_{\varphi,+} = \frac{1}{N_+}\sum_{y_i=+1}\varphi(x_i) = \frac{1}{N_+}\Phi^T 1_+$ and $m_{\varphi,-} = \frac{1}{N_-}\sum_{y_i=-1}\varphi(x_i) = \frac{1}{N_-}\Phi^T 1_-$. The variances are denoted by $1/\zeta_+$ and $1/\zeta_-$, respectively, and represent the uncertainty around the projected class centers $t_+$ and $t_-$. It is typically assumed that $\zeta_+ = \zeta_- = \zeta_\pm$. The parameters $w$ and $b$ are estimated from the data with resulting probability density function (31). Due to the uncertainty on $w$ (and $b$), the errors $e_+$ and $e_-$ have expected value⁷

$$\hat e_\bullet = w_{mp}^T\big(\varphi(x) - m_{\varphi\bullet}\big) = \sum_{i=1}^N \alpha_i K(x,x_i) - \hat t_\bullet,$$

where $\hat t_\bullet = w_{mp}^T m_{\varphi\bullet}$ is obtained in the dual space as $\hat t_\bullet = \frac{1}{N_\bullet}\alpha^T\Omega 1_\bullet$. The expression for the variance is

$$\sigma_{e\bullet}^2 = [\varphi(x) - m_{\varphi\bullet}]^T Q_{11}[\varphi(x) - m_{\varphi\bullet}]. \qquad (56)$$

7 The $\bullet$ notation is used to denote either $+$ or $-$, since analogous expressions are obtained for the classes $C_+$ and $C_-$, respectively.

The dual formulation of the variance is derived in the appendix, based on the singular value decomposition (A.7) of $Q_{11}$, and is equal to

$$\sigma_{e\bullet}^2 = \frac{1}{\mu}K(x,x) - \frac{2}{\mu N_\bullet}\theta(x)^T 1_\bullet + \frac{1}{\mu N_\bullet^2}1_\bullet^T\Omega 1_\bullet - \frac{\zeta}{\mu}\Big(\theta(x) - \frac{1}{N_\bullet}\Omega 1_\bullet\Big)^T M_c\big(\mu I_N + \zeta M_c\Omega M_c\big)^{-1}M_c\Big(\theta(x) - \frac{1}{N_\bullet}\Omega 1_\bullet\Big), \qquad (57)$$

with $\bullet$ denoting either $+$ or $-$. The vector $\theta(x) \in \mathbb{R}^N$ has elements $\theta_i(x) = K(x,x_i)$. Applying Bayes' formula, the posterior class probability of the LS-SVM classifier is obtained as

$$p(y|x,D,M) = \frac{p(y)\,p(x|y,D,M)}{P(y=+1)\,p(x|y=+1,D,M) + P(y=-1)\,p(x|y=-1,D,M)},$$

where we omitted the hyperparameters $\mu, \zeta, \zeta_\pm$ for notational convenience. Approximate analytic expressions exist for marginalizing over the hyperparameters, but they can be neglected in practice as the additional variance is rather small [13]. The moderated likelihood (53) is then equal to

$$p(x|y=\bullet 1,\zeta,M) = \big(2\pi(\zeta_\pm^{-1} + \sigma_{e\bullet}^2)\big)^{-1/2}\exp\big(-\tfrac{1}{2}\,\hat e_\bullet^2/(\zeta_\pm^{-1} + \sigma_{e\bullet}^2)\big). \qquad (58)$$

Substituting (58) into the Bayesian decision rule (3), one obtains a quadratic decision rule, as the class variances $\zeta_\pm^{-1} + \sigma_{e-}^2$ and $\zeta_\pm^{-1} + \sigma_{e+}^2$ are not equal. Assuming that $\sigma_{e+}^2 \simeq \sigma_{e-}^2$ and defining $\sigma_e = \sqrt{\sigma_{e+}\sigma_{e-}}$, the Bayesian decision rule becomes

$$\hat y = \mathrm{sign}\Big[\frac{1}{\mu}\sum_{i=1}^N \alpha_i K(x,x_i) - \frac{m_{d+} + m_{d-}}{2} + \frac{\zeta_\pm^{-1} + \sigma_e^2(x)}{m_{d+} - m_{d-}}\,\log\frac{P(y=+1)}{P(y=-1)}\Big]. \qquad (59)$$

The variance $\zeta_\pm^{-1} = \sum_{i=1}^N e_{\pm,i}^2/(N - d_{eff})$ is estimated in the same way as $\zeta_{mp}$ on level 2.

The prior probabilities $P(y=+1)$ and $P(y=-1)$ are typically estimated as $\hat\pi_+ = N_+/(N_+ + N_-)$ and $\hat\pi_- = N_-/(N_+ + N_-)$, but they can also be adjusted to reject a given percentage of applicants or to optimize the total profit, taking into account misclassification costs. As (59) depends explicitly on the prior probabilities, it also allows point-in-time credit decisions, where the default probabilities and recovery rates depend upon the point in the business cycle. Difficult cases having almost equal posterior class probabilities $P(y=+1|x,D,M) \simeq P(y=-1|x,D,M)$ can be excluded from automatic processing and referred to a human expert for further investigation.

5.5 Bayesian Classifier Design

Based on the previous theory, the following practical design scheme for the LS-SVM classifier in the Bayesian framework is suggested:

(1) Preprocess the data by completing missing values and handling outliers. Standardize the inputs to zero mean and unit variance.
(2) Define models $M_i$ by choosing a candidate input set $I_i$, a kernel function $K_i$ and a kernel parameter, e.g., $\sigma_i$ in the RBF kernel case. For all models $M_i$, with $i = 1,\ldots,n_M$ (with $n_M$ the number of models to be compared), compute the level 3 posterior:

(a) Find the optimal hyperparameters $\mu_{mp}$ and $\zeta_{mp}$ by solving the scalar optimization problem (45) in $\gamma = \zeta/\mu$ related to maximizing the level 2 posterior.⁸ With the resulting $\gamma_{mp}$, compute the effective number of parameters and the hyperparameters $\mu_{mp}$ and $\zeta_{mp}$.
(b) Evaluate the level 3 posterior (51) for model comparison.
(3) Select the model $M_i$ with maximal evidence. If desired, refine the model tuning parameters $K_i$, $\sigma_i$, $I_i$ to further optimize the classifier and go back to Step 2; else, go to Step 4.
(4) Given the optimal $M_i^\star$, calculate $\alpha$ and $b$ from (25), with kernel $K_i$, parameter $\sigma_i$ and input set $I_i$. Calculate $\zeta_\star$ and select $\hat\pi_+$ and $\hat\pi_-$ to evaluate (59).

For illustrative purposes, the design scheme is described for a kernel function with one parameter $\sigma$, like the RBF kernel. The design scheme is easily extended to other kernel functions or combinations of kernel functions.

8 Observe that this implies maximizing the level 1 posterior in $w$ and $b$ in each iteration step.

6 Financial Distress Prediction for Mid-Cap Firms in the Benelux

6.1 Data Set Description

The bankruptcy data, obtained from a major Benelux financial institution, were used to build an internal rating system [40] for firms with middle-market capitalization (mid-cap firms) in the Benelux countries (Belgium, The Netherlands, Luxembourg) using linear modelling techniques. Firms in the mid-cap segment are defined as follows: they are not stock listed, the book value of their total assets exceeds 10 mln euro, and they generate a turnover that is smaller than 0.25 bln euro. Note that more advanced methods like option based valuation models are not applicable since these companies are not listed. Together with small and medium enterprises, mid-cap firms represent a large proportion of the economy in the Benelux. The mid-cap market segment is especially important as it reflects an important business orientation of the bank. The data set consists of $N = 422$ observations, $n_D^- = 74$ bankrupt and $n_D^+ = 348$ solvent companies. The data on the bankrupt firms were collected from 1991-1997, while the other data were extracted from the period 1997 only (for reasons of data retrieval difficulties). One out of 5 non-bankrupt observations of the 1997 database was used to train the model. Observe that a larger sample of solvent firms could have been selected, but this would involve training on an even more unbalanced⁹ training set.


A total number of 40 candidate input variables was selected from financial statement data, using standard liquidity, profitability and solvency measures. As can be seen from Table 2, both ratios and trends of ratios are considered. The data were preprocessed as follows. Median imputation was applied to missing values. Outliers outside the interval $[\hat m - 2.5s, \hat m + 2.5s]$ were set equal to the upper and lower limit, respectively, where $\hat m$ is the sample mean and $s$ the sample standard deviation. A similar procedure is used, e.g., in the calculation of the Winsorized mean [41]. The log transformation was applied to size variables.
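A sketch of this preprocessing pipeline (ours; the indices of the size variables are illustrative) could look as follows:

```python
import numpy as np

def preprocess(X, size_columns=()):
    """Median imputation, log transform of size variables, winsorization, standardization."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmedian(col)            # median imputation of missing values
        if j in size_columns:
            col = np.log(col)                             # log transform of size variables
        m, s = col.mean(), col.std()
        col = np.clip(col, m - 2.5 * s, m + 2.5 * s)      # truncate outliers at m_hat +/- 2.5 s
        X[:, j] = (col - col.mean()) / col.std()          # zero mean, unit variance
    return X
```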

6.2 Performance Measures

The performance of all classifiers will be quantified using both the classification accuracy and the area under the receiver operating characteristic curve (AUROC). The classification accuracy simply measures the percentage of correctly classified (PCC) observations. Two closely related performance measures are the sensitivity, which is the percentage of positive observations being classified as positive (PCCp), and the specificity, which is the percentage of negative observations being classified as negative (PCCn). The receiver operating characteristic curve (ROC) is a two-dimensional graphical illustration of the sensitivity on the y-axis versus 1-specificity on the x-axis for various values of the classifier threshold [42]. It basically illustrates the behaviour of a classifier without regard to class distribution or misclassification cost. The AUROC then provides a simple figure-of-merit for the performance of the constructed classifier. We will use McNemar's test to compare the PCC, PCCp and PCCn of different classifiers [43] and the test of De Long, De Long and Clarke-Pearson [44] to compare the AUROCs. The ROC curve is also closely related to the Cumulative Accuracy Profile, which is in turn related to the power statistic and the Gini coefficient [45].
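For completeness, these performance measures can be computed directly from the leave-one-out scores; the sketch below (ours) obtains the AUROC from the rank statistic rather than from an explicit ROC curve:

```python
import numpy as np
from scipy.stats import rankdata

def classification_report(y_true, scores, threshold=0.0):
    """PCC, sensitivity (PCCp), specificity (PCCn) and AUROC for labels in {-1, +1}."""
    y_pred = np.where(scores > threshold, 1, -1)
    pcc = np.mean(y_pred == y_true)
    pcc_p = np.mean(y_pred[y_true == +1] == +1)           # sensitivity
    pcc_n = np.mean(y_pred[y_true == -1] == -1)           # specificity
    n_pos, n_neg = np.sum(y_true == +1), np.sum(y_true == -1)
    ranks = rankdata(scores)                              # AUROC via the Mann-Whitney statistic
    auroc = (ranks[y_true == +1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return pcc, pcc_p, pcc_n, auroc
```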

6.3 Models with Full Candidate Input Set

The Bayesian framework was applied to infer the hyper- and kernel parameters. The kernel parameter $\sigma$ of the RBF kernel¹⁰ was inferred on level 3 by selecting the parameter from the grid $\sqrt{n}\times[0.1, 0.5, 1, 1.2, 1.5, 2, 3, 4, 10]$.

9 In practice, one typically observes that the percentage of defaults in training databases varies from 50% to about 70% or 80% [29].
10 The use of an RBF kernel is illustrated here because of its consistently good performance on 20 benchmark data sets [31]. The other kernel functions can be applied in a similar way.


Table 2
Benelux data set: description of the 40 candidate inputs. The inputs include various liquidity (L), solvency (S), profitability (P) and size (V) measures. Trends (Tr) are used to describe the evolution of the ratios (R). The results of backward input selection are presented by reporting the number of remaining inputs in the LDA, LOGIT and LS-SVM model when the input is removed. Hence, inputs with low importance have a high number, while the most important input has rank 1.

Input Variable Description                                             LDA   LOGIT   LS-SVM
L: Current ratio (R)                                                    36      1       23
L: Current ratio (Tr)                                                   34     27       28
L: Quick ratio (R)                                                      22     26       24
L: Quick ratio (Tr)                                                     35     30       29
L: Numbers of days to customer credit (R)                               29     19       11
L: Numbers of days to customer credit (Tr)                               6     14       19
L: Numbers of days of supplier credit (R)                               21     21       27
L: Numbers of days of supplier credit (Tr)                              25     33       21
S: Capital and reserves (% TA)                                           5      5        2
S: Capital and reserves (Tr)                                            20     18       35
S: Financial debt payable after one year (% TA)                         37     37       31
S: Financial debt payable after one year (Tr)                           40     39        8
S: Financial debt payable within one year (% TA)                        38     38       18
S: Financial debt payable within one year (Tr)                          39     40       17
S: Solvency ratio (%) (R)                                                3      2        1
S: Solvency ratio (%) (Tr)                                              14     16       10
P: Turnover (% TA)                                                       2      4        5
P: Turnover (Tr)                                                        19     12       32
P: Added value (% TA)                                                   18     28       13
P: Added value (Tr)                                                     24     36       40
V: Total assets (Log)                                                    4      6        3
P: Total assets (Tr)                                                     7     11       20
P: Current profit/Current loss before taxes (R)                         28     25       38
P: Current profit/Current loss before taxes (Tr)                        33     31       30
P: Gross operation margin (%) (R)                                       32      3       25
P: Gross operation margin (%) (Tr)                                      15     23        7
P: Current profit/Current loss (R)                                      27     35       36
P: Current profit/Current loss (Tr)                                     30     34       37
P: Net operation margin (%) (R)                                         31     20       26
P: Net operation margin (%) (Tr)                                        26     32       15
P: Added value/sales (%) (R)                                            13     17        6
P: Added value/sales (%) (Tr)                                           10      9        9
P: Added value/pers. employed (R)                                       23     29       39
P: Added value/pers. employed (Tr)                                      17     10       34
P: Cash-flow/equity (%) (R)                                             16      8       33
P: Cash-flow/equity (%) (Tr)                                            11     24       14
P: Return on equity (%) (R)                                              8      7        4
P: Return on equity (%) (Tr)                                             9     22       12
P: Net return on total assets before taxes and debt charges (%) (R)     1     13       16
P: Net return on total assets before taxes and debt charges (%) (Tr)   12     15       22

For each of these bandwidth parameters, the kernel matrix was constructed and its eigenvalue decomposition computed. The optimal hyperparameter $\gamma$ was determined from the scalar optimization problem (45), and then $\mu$, $\zeta$, $d_{eff}$ and the level 3 cost were calculated. As the number of default data is low, no separate test data set was used. The generalization performance is assessed by means of the leave-one-out cross-validation error, which is a common measure in the bankruptcy prediction literature [22]. In Table 3, we have contrasted the PCC, PCCp, PCCn and AUROC performance of the LS-SVM (26) and the Bayesian LS-SVM decision rule (59) classifiers with the performance of the linear LDA and Logit classifiers.

Table 3
Leave-one-out classification performances (percentages) for the LDA, Logit and LS-SVM models using the full candidate input set. The corresponding p-values (percentages) are denoted between parentheses.

         LDA              LOGIT            LS-SVM           LS-SVMBay
PCC      84.83% (0.13%)   85.78% (6.33%)   88.39% (100%)    88.39% (100%)
PCCp     95.98% (0.77%)   93.97% (0.02%)   98.56% (100%)    98.56% (100%)
PCCn     32.43% (0.01%)   47.30% (100%)    40.54% (26.7%)   40.54% (26.7%)
AUROC    79.51% (0.02%)   80.07% (0.36%)   86.58% (43.27%)  86.65% (100%)

The numbers between parentheses represent the p-values of the tests between each classifier and the classifier scoring best on the particular performance measure. It is easily observed that both the LS-SVM and LS-SVMBay classifiers yield very good performance when compared to the LDA and Logit classifiers. The corresponding ROC curves are depicted in the left pane of Figure 4.

6.4 Models with Optimized Input Set


Given the models with the full candidate input set, a backward input selection procedure is applied to infer the most relevant inputs from the data. For the LDA and Logit classifiers, at each step the input was removed whose coefficient had the highest p-value in the test of whether the coefficient differs significantly from zero. The procedure was stopped when all coefficients were significantly different from zero at the 1% level. A backward input selection procedure was also applied with the LS-SVM model, computing each time the model probability (on level 3) with one of the inputs removed. The input that yielded the best decrease (or smallest increase) in the level 3 cost function was then selected.


Fig. 4. Receiver Operating Characteristic curves for the full input set (left) and pruned input set (right): LS-SVM (full line), Logit (dashed-dotted line) and LDA (dashed line).


Table 4
Leave-one-out classification performances for the LDA, Logit and LS-SVM models using the optimized input sets.

         LDA              LOGIT            LS-SVM           LS-SVMBay
PCC      86.49% (3.76%)   86.49% (4.46%)   89.34% (100%)    89.34% (100%)
PCCp     98.28% (100%)    97.13% (34.28%)  98.28% (100%)    98.28% (100%)
PCCn     31.08% (1.39%)   36.49% (9.90%)   47.30% (100%)    47.30% (100%)
AUROC    83.32% (0.81%)   83.13% (0.58%)   89.46% (100%)    89.35% (47.38%)

The procedure was stopped just before the difference with the optimal model became decisive according to Table 1. In order to reduce the number of inputs as much as possible, but still retain a liquidity ratio in the model, 11 inputs are selected, which is one step before the limit of becoming decisively different. The level 3 cost function and the corresponding leave-one-out PCC are depicted in Figure 5 with respect to the number of removed inputs. Notice the similarities between both curves during the input removal process. Table 4 reports the performances of all classifiers using the optimally pruned sets of inputs. Again it can be observed that the LS-SVM and LS-SVMBay classifiers yield very good performances when compared to the LDA and Logit classifiers. The ROC curves on the optimized input sets are reported in the right pane of Figure 4. The order of input removal is reported in Table 2. It can be seen that the pruned LS-SVM classifier has 11 inputs, the pruned LDA classifier 10 inputs and the pruned Logit classifier 6 inputs. Starting from a total set of 40 inputs, this clearly illustrates the efficiency of the suggested input selection procedure. All classifiers seem to agree on the importance of the turnover variable and the solvency variable. Consistent with prior studies [1,2], the inputs of the LS-SVM classifier consist of a mixture of profitability, solvency and liquidity ratios, but the exact ratios that are selected differ. Also, liquidity ratios seem to be less decisive as compared to prior bankruptcy studies. The number of days to customer credit is the only liquidity ratio that is retained and only ranks as the 11th input; its trend is the second most important liquidity input in the backward input selection procedure. The five most important inputs for the LS-SVM classifier are the two solvency measures (solvency ratio, capital and reserves (percentage of total assets)), the size variable total assets, and the profitability measures return on equity and turnover (percentage of total assets). Note that these five most important inputs for the LS-SVM classifier are also present in the optimally pruned LDA classifier. The posterior class probabilities were computed for the evaluation of the decision rule (59) in a leave-one-out procedure, as mentioned above. These probabilities can also be used to identify the most difficult cases, which can be classified in an alternative way requiring, e.g., human intervention.


Fig. 5. Evolution of the level 3 cost function − log p(M|D) and the leave-one-out cross-validation classification performance. The dashed line denotes where the model becomes different from the optimal model in a decisive way.

The posterior class probabilities were computed for the evaluation of the decision rule (59) in a leave-one-out procedure, as mentioned above. These probabilities can also be used to identify the most difficult cases, which can then be handled in an alternative way, e.g. with human intervention. Referring the 10% most difficult cases to further analysis, the following classification performances were obtained on the remaining cases: PCC 93.12%, PCCp 99.69%, PCCn 52.83%. When 25% of the cases were referred, we obtained PCC 94.64%, PCCp 99.65%, PCCn 52.94%. These results clearly motivate the use of posterior class probabilities to let the system flag decisions that are too uncertain and require further investigation.

In order to gain insight into the performance improvements of the different models, the full data sample was used, oversampling the non-defaults 7 times so as to obtain a more realistic sample, because 7 years of defaults were combined with 1 year of non-defaults. The corresponding average default/bankruptcy rate equals 0.60%, or 60 bps (basis points). Figure 6 reports the remaining default rate on the full portfolio as a function of the percentage of refused counterparts, ordered from worst to best score. In the ideal case, the curve would be a straight line from (0%, 60 bps) to (0.6%, 0 bps); a random scoring function that does not succeed in discriminating between weak and strong firms results in a diagonal line. The slope of the curve is a measure of the default rate at that point. Consider, for example, the case where one decides not to grant credit to the 10% of counterparts with the worst scores. The default rates on the full 100% portfolio (with 10% held as liquidities) are 26 bps (LDA), 27 bps (Logit) and 16 bps (LS-SVM), respectively. Taking into account that the number of counterparts is reduced from 100% to 90%, the default rates on the invested part of the portfolio are obtained by multiplying by 1/0.90, which yields 29 bps (LDA), 30 bps (Logit) and 18 bps (LS-SVM), respectively, corresponding to the slope between the points at 10% and 100% on the x-axis. This graph makes the better performance of the LS-SVM classifier obvious from a practical perspective.
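As an illustration of how such a reject option can be implemented, the following sketch refers the cases whose posterior probability is closest to 0.5 to a human analyst and recomputes the performance measures on the remaining cases. Function and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def accuracy_after_referral(p_pos, y, refer_fraction=0.10):
    """p_pos: leave-one-out posterior probabilities P(y = +1 | x); y: labels in {-1, +1}.
    The refer_fraction most uncertain cases are removed before scoring."""
    p_pos = np.asarray(p_pos, dtype=float)
    y = np.asarray(y)
    order = np.argsort(np.abs(p_pos - 0.5))          # most uncertain cases first
    keep = order[int(round(refer_fraction * len(p_pos))):]
    y_hat = np.where(p_pos[keep] >= 0.5, 1, -1)
    pcc = np.mean(y_hat == y[keep])                   # overall accuracy
    pcc_p = np.mean(y_hat[y[keep] == 1] == 1)         # accuracy on the positive class
    pcc_n = np.mean(y_hat[y[keep] == -1] == -1)       # accuracy on the negative class
    return pcc, pcc_p, pcc_n
```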


Fig. 6. Default rates (leave-one-out) on the full portfolio as a function of the percentage of refused counterparts for the LDA (dotted line), Logit (dashed line) and LS-SVM (full line).
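The conversion used above, from the remaining default rate on the full portfolio to the default rate on the invested part of the portfolio, is a one-line computation. The snippet below (function name illustrative) reproduces the numbers quoted in the text.

```python
def invested_default_rate(remaining_bps_on_full_portfolio, refused_fraction):
    """If a fraction of counterparts is refused, the default rate on the invested
    part equals the remaining rate on the full portfolio divided by (1 - fraction)."""
    return remaining_bps_on_full_portfolio / (1.0 - refused_fraction)

for name, bps in [("LDA", 26), ("Logit", 27), ("LS-SVM", 16)]:
    print(f"{name}: about {invested_default_rate(bps, 0.10):.0f} bps on the 90% invested portfolio")
# prints roughly 29, 30 and 18 bps, matching the numbers quoted above
```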

7

Conclusions

Prediction of business failure is increasingly becoming a key component of risk management for financial institutions. In this paper, we illustrated and evaluated the added value of Bayesian LS-SVM classifiers in this context, using a bankruptcy data set on the Benelux mid-cap market. The suggested Bayesian nonlinear kernel based classifiers yield better performances than the more traditional methods, such as logistic regression and linear discriminant analysis, in terms of both classification accuracy and area under the receiver operating characteristic curve. The set of relevant explanatory variables was inferred from the data by applying Bayesian model comparison in a backward input selection procedure. By adopting the Bayesian way of reasoning, one also obtains posterior class probabilities, which are of great value to credit managers for analysing the sensitivity of the classifier decisions with respect to the given inputs and for flagging cases that require further investigation.

References

[1] E. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, Journal of Finance 23 (1968) 589–609.
[2] E. Altman, Corporate Financial Distress and Bankruptcy: a complete guide to predicting and avoiding distress and profiting from bankruptcy, Wiley finance edition, 1993.
[3] W. Beaver, Financial ratios as predictors of failure, Empirical Research in Accounting Selected Studies, supplement to the Journal of Accounting Research 5 (1966) 71–111.
[4] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[5] E. Altman, G. Marco, F. Varetto, Corporate distress diagnosis: comparisons using linear discriminant analysis and neural networks (the Italian experience), Journal of Banking and Finance 18 (1994) 505–529.
[6] A. Atiya, Bankruptcy prediction for credit risk using neural networks: A survey and new results, IEEE Transactions on Neural Networks 12 (4) (2001) 929–935.
[7] D.-E. Baestaens, W.-M. van den Bergh, D. Wood, Neural Network Solutions for Trading in Financial Markets, Pitman, London, 1994.
[8] C. Kun, H. Ingoo, K. Youngsig, Hybrid neural network models for bankruptcy predictions, Decision Support Systems 18 (1996) 63–72.
[9] S. Piramuthu, H. Ragavan, M. Shaw, Using feature construction to improve the performance of neural networks, Management Science 44 (3) (1998) 416–430.
[10] C. Serrano Cinca, Self organizing neural networks for financial diagnosis, Decision Support Systems 17 (1996) 227–238.
[11] B. Wong, T. Bodnovich, Y. Selvi, Neural network applications in business: a review and analysis of the literature (1988-1995), Decision Support Systems 19 (4) (1997) 301–320.
[12] D. MacKay, Bayesian interpolation, Neural Computation 4 (1992) 415–447.
[13] D. MacKay, Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, Network: Computation in Neural Systems 6 (1995) 469–505.
[14] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[15] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[16] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, New Jersey, 2002.
[17] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[18] J. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (3) (1999) 293–300.
[19] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (2000) 2385–2404.
[20] T. Van Gestel, J. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, A Bayesian framework for Least Squares Support Vector Machine Classifiers, Gaussian Processes and kernel Fisher Discriminant Analysis, Neural Computation 14 (2002) 1115–1147.
[21] T. Van Gestel, J. Suykens, D.-E. Baestaens, A. Lambrechts, G. Lanckriet, B. Vandaele, B. De Moor, J. Vandewalle, Predicting financial time series using least squares support vector machines within the evidence framework, IEEE Transactions on Neural Networks (Special Issue on Financial Engineering) 12 (2001) 809–821.
[22] R. Eisenbeis, Pitfalls in the application of discriminant analysis in business, The Journal of Finance 32 (3) (1977) 875–900.
[23] R. Duda, P. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, 1973.
[24] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[25] R. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188.
[26] P. McCullagh, J. Nelder, Generalized Linear Models, Chapman & Hall, London, 1989.
[27] J. Ohlson, Financial ratios and the probabilistic prediction of bankruptcy, Journal of Accounting Research 18 (1980) 109–131.
[28] B. Baesens, R. Setiono, C. Mues, J. Vanthienen, Using neural network rule extraction and decision tables for credit-risk evaluation, Management Science 49 (3) (2003) 312–329.
[29] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen, Benchmarking state of the art classification algorithms for credit scoring, Journal of the Operational Research Society 54 (6) (2003) 627–635.
[30] B. Baesens, Developing intelligent systems for credit scoring using machine learning techniques, Ph.D. thesis, Department of Applied Economic Sciences, Katholieke Universiteit Leuven (2003).
[31] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (2004) 5–32.
[32] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.
[33] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics 13 (2001) 1–50.
[34] J. Hutchinson, A. Lo, T. Poggio, A nonparametric approach to pricing and hedging derivative securities via learning networks, Journal of Finance 49 (1994) 851–889.
[35] V. Vapnik, A. Lerner, Pattern recognition using generalized portrait method, Automation and Remote Control 24 (1963) 774–780.
[36] V. Vapnik, A. J. Chervonenkis, On the one class of the algorithms of pattern recognition, Automation and Remote Control 25 (6).
[37] R. Fletcher, Practical Methods of Optimization, John Wiley & Sons, Chichester and New York, 1987.
[38] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
[39] H. Jeffreys, Theory of Probability, Oxford University Press, 1961.
[40] D.-E. Baestaens, Credit risk modelling strategies: The road to serfdom, International Journal of Intelligent Systems in Accounting, Finance & Management 8 (1999) 225–235.
[41] A. Van der Vaart, Asymptotic Statistics, Cambridge University Press, 1998.
[42] J. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, Academic Press, New York, 1975.
[43] B. Everitt, The Analysis of Contingency Tables, Chapman and Hall, London, 1977.
[44] E. De Long, D. De Long, D. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics 44 (1988) 837–845.
[45] J. Sobehart, S. Keenan, R. Stein, Validation methodologies for default risk models, Credit Magazine 1 (4) (2000) 51–56.

A

Primal-Dual Formulations for Bayesian Inference

A.1 Expression for the Hessian and Covariance Matrix

The level 1 posterior probability p([w; b]|D, µ, ζ, M) is a multivariate normal distribution in R^{nϕ+1} with mean [w_mp; b_mp] and covariance matrix Q = H^{-1}, where H is the Hessian of the least squares cost function (19). Defining the matrix of regressors Φ^T = [ϕ(x_1), ..., ϕ(x_N)], the identity matrix I and the vector of all ones 1 of appropriate dimension, the Hessian is equal to

$$
H = \begin{bmatrix} \mu I_{n_\varphi} + \zeta \Phi^T \Phi & \zeta \Phi^T 1 \\ \zeta 1^T \Phi & \zeta N \end{bmatrix}
  = \begin{bmatrix} H_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix},
\qquad (A.1)
$$

with corresponding block matrices H_{11} = µI_{nϕ} + ζΦ^TΦ, h_{12} = h_{21}^T = ζΦ^T1 and h_{22} = ζN. The inverse Hessian H^{-1} is then obtained via a Schur complement type argument:

$$
H^{-1} = \left( \begin{bmatrix} I_{n_\varphi} & X \\ 0^T & 1 \end{bmatrix}
\begin{bmatrix} I_{n_\varphi} & -X \\ 0^T & 1 \end{bmatrix}
\begin{bmatrix} H_{11} & h_{12} \\ h_{12}^T & h_{22} \end{bmatrix}
\begin{bmatrix} I_{n_\varphi} & 0 \\ -X^T & 1 \end{bmatrix}
\begin{bmatrix} I_{n_\varphi} & 0 \\ X^T & 1 \end{bmatrix} \right)^{-1}
$$

$$
= \left( \begin{bmatrix} I_{n_\varphi} & X \\ 0^T & 1 \end{bmatrix}
\begin{bmatrix} H_{11} - h_{12} h_{22}^{-1} h_{12}^T & 0 \\ 0^T & h_{22} \end{bmatrix}
\begin{bmatrix} I_{n_\varphi} & 0 \\ X^T & 1 \end{bmatrix} \right)^{-1}
\qquad (A.2)
$$

$$
= \begin{bmatrix} (H_{11} - h_{12} h_{22}^{-1} h_{12}^T)^{-1} & -F_{11}^{-1} h_{12} h_{22}^{-1} \\
-h_{22}^{-1} h_{12}^T F_{11}^{-1} & h_{22}^{-1} + h_{22}^{-1} h_{12}^T F_{11}^{-1} h_{12} h_{22}^{-1} \end{bmatrix},
\qquad (A.3)
$$

with X = h_{12} h_{22}^{-1} and F_{11} = H_{11} − h_{12} h_{22}^{-1} h_{12}^T. In matrix expressions, it is useful to express Φ^TΦ − (1/N)Φ^T 1 1^T Φ as Φ^T M_c Φ, with the idempotent centering matrix M_c = I_N − (1/N) 1 1^T ∈ R^{N×N} satisfying M_c = M_c^2. Given that F_{11}^{-1} = (µI_{nϕ} + ζΦ^T M_c Φ)^{-1}, the inverse Hessian H^{-1} = Q is equal to

$$
Q = \begin{bmatrix}
(\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi)^{-1} &
-\tfrac{1}{N} (\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi)^{-1} \Phi^T 1 \\
-\tfrac{1}{N} 1^T \Phi (\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi)^{-1} &
\tfrac{1}{\zeta N} + \tfrac{1}{N^2} 1^T \Phi (\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi)^{-1} \Phi^T 1
\end{bmatrix}.
$$

A.2 Expression for the determinant

The determinant of H is obtained from (A.2), using the fact that the determinant of a product is equal to the product of the determinants, and is thus equal to

$$
\det(H) = \det(H_{11} - h_{12} h_{22}^{-1} h_{12}^T) \times \det(h_{22})
        = \det(\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi) \times (\zeta N),
\qquad (A.4)
$$

which is obtained as the product of ζN and the eigenvalues λ_i (i = 1, ..., nϕ) of µI_{nϕ} + ζΦ^T M_c Φ, noted as λ_i(µI_{nϕ} + ζΦ^T M_c Φ). Because the matrix Φ^T M_c Φ ∈ R^{nϕ×nϕ} is rank deficient with rank n_e ≤ N − 1, nϕ − n_e eigenvalues are equal to µ.

The dual space expressions can be obtained in terms of the singular value decomposition

$$
\Phi^T M_c = U S V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix}
\begin{bmatrix} S_1 & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_1 & V_2 \end{bmatrix}^T,
\qquad (A.5)
$$

with U ∈ R^{nϕ×nϕ}, S ∈ R^{nϕ×N}, V ∈ R^{N×N} and with the block matrices U_1 ∈ R^{nϕ×n_e}, U_2 ∈ R^{nϕ×(nϕ−n_e)}, S_1 = diag([s_1, s_2, ..., s_{n_e}]) ∈ R^{n_e×n_e}, V_1 ∈ R^{N×n_e} and V_2 ∈ R^{N×(N−n_e)}, with 0 ≤ n_e ≤ N − 1. Due to the orthonormality property we have U U^T = U_1 U_1^T + U_2 U_2^T = I_{nϕ} and V V^T = V_1 V_1^T + V_2 V_2^T = I_N. Hence, one obtains the primal and dual eigenvalue decompositions

$$
\Phi^T M_c \Phi = U_1 S_1^2 U_1^T, \qquad (A.6)
$$
$$
M_c \Phi \Phi^T M_c = M_c \Omega M_c = V_1 S_1^2 V_1^T. \qquad (A.7)
$$

The nϕ eigenvalues of µI_{nϕ} + ζΦ^T M_c Φ are equal to λ_1 = µ + ζs_1^2, ..., λ_{n_e} = µ + ζs_{n_e}^2, λ_{n_e+1} = µ, ..., λ_{nϕ} = µ, where the non-zero eigenvalues s_i^2 (i = 1, ..., n_e) are obtained from the eigenvalue decomposition of M_c Φ Φ^T M_c in (A.7). The determinant is therefore equal to

$$
\det(H) = \zeta N \, \mu^{n_\varphi - n_e} \prod_{i=1}^{n_e} \bigl(\mu + \zeta \lambda_i(M_c \Omega M_c)\bigr),
$$

with M_c Ω M_c = V_1 diag([λ_1, ..., λ_{n_e}]) V_1^T and λ_i = s_i^2, i = 1, ..., n_e.
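This eigenvalue based expression for the determinant can also be verified numerically. The following sketch (synthetic data, linear kernel so that Ω = ΦΦ^T, more features than observations) computes the non-zero eigenvalues of M_cΩM_c in the dual space and compares ζN µ^{nϕ−n_e} ∏(µ + ζλ_i) with det(H).

```python
import numpy as np

rng = np.random.default_rng(1)
N, nphi = 6, 10                                   # more features than observations
Phi = rng.normal(size=(N, nphi))
one = np.ones((N, 1))
mu, zeta = 0.7, 1.3

H11 = mu * np.eye(nphi) + zeta * Phi.T @ Phi
h12 = zeta * Phi.T @ one
H = np.block([[H11, h12], [h12.T, np.array([[zeta * N]])]])

Mc = np.eye(N) - np.ones((N, N)) / N
Omega = Phi @ Phi.T                               # Gram (kernel) matrix, linear case
lam = np.linalg.eigvalsh(Mc @ Omega @ Mc)
lam_nz = lam[lam > 1e-10]                         # the n_e non-zero eigenvalues s_i^2
ne = lam_nz.size

det_dual = zeta * N * mu ** (nphi - ne) * np.prod(mu + zeta * lam_nz)
assert np.isclose(np.linalg.det(H), det_dual)
print(f"n_e = {ne}, det(H) = {det_dual:.4f}")
```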

A.3 Expression for the level 1 cost function

The dual space expression for J_1(w_mp, b_mp) is obtained by substituting [w_mp; b_mp] = H^{-1}[ζΦ^T y; ζ1^T y] in (19). Applying a similar reasoning and algebra as for the calculation of the determinant, one obtains the dual space expression

$$
J_1(w_{mp}, b_{mp}) = \mu J_w(w_{mp}) + \zeta J_e(w_{mp}, b_{mp})
 = \tfrac{1}{2} y^T M_c \left( \mu^{-1} M_c \Omega M_c + \zeta^{-1} I_N \right)^{-1} M_c y.
\qquad (A.8)
$$

Given that M_c Ω M_c = V Λ V^T, with Λ = diag([s_1^2, ..., s_{n_e}^2, 0, ..., 0]), one obtains (48). In a similar way, one obtains (46) and (47).
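The equality between the primal cost at the optimum and the dual space expression (A.8) can be checked on a toy example. The sketch assumes that the level 1 cost (19) is the regularized least squares form J_1 = (µ/2) w^T w + (ζ/2) Σ_i (y_i − w^T ϕ(x_i) − b)^2; this reading of (19) is our assumption, since the equation itself is not repeated here.

```python
import numpy as np

rng = np.random.default_rng(2)
N, nphi = 6, 10
Phi = rng.normal(size=(N, nphi))
y = rng.choice([-1.0, 1.0], size=(N, 1))          # class labels used as regression targets
one = np.ones((N, 1))
mu, zeta = 0.7, 1.3

# Primal optimum from the linear system H [w; b] = zeta [Phi^T y; 1^T y]
H = np.block([
    [mu * np.eye(nphi) + zeta * Phi.T @ Phi, zeta * Phi.T @ one],
    [zeta * one.T @ Phi, np.array([[zeta * N]])],
])
wb = np.linalg.solve(H, zeta * np.vstack([Phi.T @ y, one.T @ y]))
w, b = wb[:nphi], wb[nphi, 0]

e = y - Phi @ w - b
J1_primal = 0.5 * mu * (w.T @ w).item() + 0.5 * zeta * (e.T @ e).item()

# Dual space expression (A.8)
Mc = np.eye(N) - np.ones((N, N)) / N
Omega = Phi @ Phi.T
J1_dual = 0.5 * (y.T @ Mc @ np.linalg.solve(Mc @ Omega @ Mc / mu + np.eye(N) / zeta, Mc @ y)).item()
assert np.isclose(J1_primal, J1_dual)
print(f"J1 at the optimum = {J1_primal:.6f}")
```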

A.4 Expression for the moderated likelihood

The primal space expression for the variance in the moderated output is obtained from (56) and is equal to

$$
\sigma_{e\bullet}^2 = \bigl[\varphi(x) - \tfrac{1}{N_\bullet}\Phi^T 1_\bullet\bigr]^T Q_{11} \bigl[\varphi(x) - \tfrac{1}{N_\bullet}\Phi^T 1_\bullet\bigr].
\qquad (A.9)
$$

Substituting (A.5) into the expression for Q_{11} from (A.3), one can write Q_{11} as

$$
\begin{aligned}
Q_{11} &= (\mu I_{n_\varphi} + \zeta \Phi^T M_c \Phi)^{-1} \\
&= \bigl(\mu U_2 U_2^T + U_1 (\mu I_{n_e} + \zeta S_1^2) U_1^T\bigr)^{-1} \\
&= \mu^{-1} U_2 U_2^T + U_1 (\mu I_{n_e} + \zeta S_1^2)^{-1} U_1^T \\
&= \mu^{-1} I_{n_\varphi} + \Phi^T M_c V_1 S_1^{-1} \bigl((\mu I_{n_e} + \zeta S_1^2)^{-1} - \mu^{-1} I_{n_e}\bigr) U_1^T \\
&= \mu^{-1} I_{n_\varphi} + \Phi^T M_c V_1 S_1^{-1} \bigl((\mu I_{n_e} + \zeta S_1^2)^{-1} - \mu^{-1} I_{n_e}\bigr) S_1^{-1} V_1^T M_c \Phi \\
&= \tfrac{1}{\mu} I_{n_\varphi} - \tfrac{\zeta}{\mu} \Phi^T M_c (\mu I_N + \zeta M_c \Omega M_c)^{-1} M_c \Phi.
\end{aligned}
\qquad (A.10)
$$

Substituting (A.10) into (A.9), one obtains (57), given that ΦΦ^T = Ω, ϕ(x_i)^T ϕ(x_j) = K(x_i, x_j) and Φϕ(x) = θ(x).
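The primal/dual identity (A.10) and the variance expression (A.9) can be checked in the same way. In the sketch below the class-specific mean over N_• points is replaced, purely for illustration, by the mean over all N points.

```python
import numpy as np

rng = np.random.default_rng(3)
N, nphi = 6, 10
Phi = rng.normal(size=(N, nphi))
mu, zeta = 0.7, 1.3

Mc = np.eye(N) - np.ones((N, N)) / N
Omega = Phi @ Phi.T

# Identity (A.10): primal form versus dual form of Q11
Q11_primal = np.linalg.inv(mu * np.eye(nphi) + zeta * Phi.T @ Mc @ Phi)
Q11_dual = (np.eye(nphi) / mu
            - (zeta / mu) * Phi.T @ Mc
            @ np.linalg.inv(mu * np.eye(N) + zeta * Mc @ Omega @ Mc) @ Mc @ Phi)
assert np.allclose(Q11_primal, Q11_dual)

# Variance of the moderated output, cf. (A.9), for a new point x
phi_x = rng.normal(size=(nphi, 1))                # stands in for phi(x)
v = phi_x - Phi.T @ np.ones((N, 1)) / N           # class mean replaced by overall mean
sigma2 = (v.T @ Q11_dual @ v).item()
print(f"sigma_e^2 = {sigma2:.4f}")
```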

