The Similarity between the Square of the Coefficient of Variation and the Gini Index of a General Random Variable




REVISTA DE MÉTODOS CUANTITATIVOS PARA LA ECONOMÍA Y LA EMPRESA (10). Páginas 5–18. Diciembre de 2010. ISSN: 1886-516X. D.L.: SE-2927-06. URL: http://www.upo.es/RevMetCuant/art40.pdf

González Abril, Luis
Departamento de Economía Aplicada I
Universidad de Sevilla (España)
Correo electrónico: [email protected]

Velasco Morente, Francisco
Departamento de Economía Aplicada I
Universidad de Sevilla (España)
Correo electrónico: [email protected]

Gavilán Ruiz, José Manuel
Departamento de Economía Aplicada I
Universidad de Sevilla (España)
Correo electrónico: [email protected]

Sánchez-Reyes Fernández, Luis María
Departamento de Economía Aplicada I
Universidad de Sevilla (España)
Correo electrónico: [email protected]

ABSTRACT

In this paper, several identities concerning expectation, variance, covariance, cumulative distribution functions, the coefficient of variation, and the Lorenz curve are obtained and they are used in establishing theoretical results. Furthermore, a graphical representation of the variance is proposed which, together with the aforementioned identities, enables the square of the coefficient of variation to be considered as an equality measure in the same way as is the Gini index. A study of the similarities between the theoretical expression of the Gini index and the square of the coefficient of variation is also carried out in this paper.

Keywords: concentration measures; cumulative distribution function; Lorenz curve; mean difference.
JEL classification: C100; C190.
MSC2010: 62-09; 62P20; 91B02.

Art´ıculo recibido el 12 de abril de 2010 y aceptado el 22 de octubre de 2010.


Similitud entre el cuadrado del coeficiente de variación y el índice de Gini en una variable aleatoria general

RESUMEN

En este trabajo se obtienen diversas identidades relativas a la esperanza, varianza, covarianza, función de distribución acumulada, coeficiente de variación y curva de Lorenz, que se usarán para obtener resultados teóricos interesantes. Se construye, además, una representación gráfica de la varianza, la cual, utilizando las propiedades obtenidas, nos indica que el cuadrado del coeficiente de variación se puede considerar como una medida de igualdad, de igual forma que se considera al índice de Gini. En este artículo también se lleva a cabo un estudio de las similitudes entre la expresión teórica del índice de Gini y el cuadrado del coeficiente de variación.

Palabras clave: medidas de concentración; función de distribución; curva de Lorenz; diferencia media.
Clasificación JEL: C100; C190.
MSC2010: 62-09; 62P20; 91B02.


1 INTRODUCTION

Powerful tools, specifically designed for certain increasingly difficult problems, are currently under development. Nevertheless, it is not always necessary to design new tools; sometimes it suffices to give a new interpretation to known tools. Thus, there are simple relationships between the main characteristics of a random variable which are widely known but remain unused. In this paper, several identities are obtained from a very simple but powerful result. One particular result leads us to study the square of the coefficient of variation and the Gini index.

The Gini index or Gini coefficient (Gini 1912) is perhaps one of the main inequality measures in the discipline of Economics, and it has been applied in many studies. Furthermore, this index can be used to measure the dispersion of a distribution of income, consumption, wealth, or a distribution of any other kind (Xu 2004) since, from the statistical point of view, it is a function of the mean difference. Its attractiveness to many economists is that it has an intuitive geometric interpretation: it can be defined as twice the ratio of two regions defined by the line of perfect equality (the 45-degree line) and the Lorenz curve in the unit box. Furthermore, it is an important component of the Sen index of poverty intensity (Xu and Osberg 2002).

There are two main approaches for analyzing theoretical results on the Gini index: one is based on discrete distributions, the other on continuous distributions. Both approaches can be unified (Dorfman 1979), but for some purposes the continuous formulation is more convenient, yielding insights that are not as accessible when the random variable is discrete (Yitzhaki and Schechtman 2005). For this reason, a continuous formulation is considered in this paper.
The major drawback of the Gini index is that two very different distributions can have the same value of this index and, therefore, it is not possible to declare which distribution is more equitable. This problem has been addressed in the literature by means of stochastic dominance (Fishburn 1980) and inverse stochastic dominance (Muliere and Scarsini 1989). It is worth noting that a more general study is carried out in (Núñez 2006), where several approaches are presented. In this paper, to avoid this situation, it is proved that the square of the coefficient of variation can be thought of as the ratio of the area that lies between the curve of equality and the Lorenz curve, in the same way as can the Gini index, and, therefore, it can be used as "the most natural" measure to discriminate between two distributions when their Gini indices are the same. Let us note that the square of the coefficient of variation¹ was first proposed as a transfer measure in (Shorrocks and Foster 1987) and later in (Davies and Hoy 1994); other possibilities were set out in (Ramos and Sordo 2003). Furthermore, it will also be shown that both coefficients have

¹ The main drawback of this coefficient is that it is very sensitive to extreme values (Bartels 1977).


a similar definition. Hence, by using the definition of the coefficient of variation, the Gini index can be defined for any random variable with a non-zero expectation, and not only for non-negative random variables.

The rest of the paper is organized as follows. Section 2 presents a result which forms the basis of later developments, since it provides identities in probability theory. Notes on the mean difference, independence, covariance, and variance are given in Section 3. In Section 4, two equality measures of a non-negative random variable, the Gini index and the square of the coefficient of variation, are obtained from the previous identities, and a relationship between the variance, the expectation, the cumulative distribution function, and the Lorenz curve is given, which provides a graphical interpretation of the variance. In Section 5, the identities are generalized and the Gini index is considered for any random variable. Finally, conclusions are drawn.

2 MAIN RESULT

Let us see a simple but important result:

Theorem 1 Let $g(x)$ be a function such that $\int_{-\infty}^{\infty} |x|^r |g(x)|\,dx < \infty$ for $r = 0, 1$. Hence

\[
\int_{-\infty}^{\infty} x\,g(x)\,dx = \int_{0}^{\infty} \left( G^*(x) + G^*(-x) \right) dx, \tag{1}
\]

where

\[
G^*(x) = G^*_{g(\cdot)}(x) = I(x)\int_{x}^{\infty} g(u)\,du \,-\, I(-x)\int_{-\infty}^{x} g(u)\,du, \tag{2}
\]

and $I(x) = I_{(0,+\infty)}(x)$ is the indicator function of the interval $(0,+\infty)$.

Proof. It is straightforward by integration by parts. ∎

Its generalization to two variables is an immediate consequence of this result.

Corollary 2 Let $g(x,y)$ be a function such that $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |x|^r |y|^s |g(x,y)|\,dx\,dy < \infty$ for $r, s = 0, 1$. Hence

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,y\,g(x,y)\,dx\,dy = \int_{0}^{\infty}\int_{0}^{\infty} \left( G^*(x,y) + G^*(x,-y) + G^*(-x,y) + G^*(-x,-y) \right) dx\,dy, \tag{3}
\]

where $G^*(x,y) = G^*_{G^*_{g(x,\cdot)}(y)}(x)$. ∎

The expression of $G^*(x)$ is useful to simplify the thesis of Corollary 2; nevertheless, an even simpler expression can be used. If $G = \int_{-\infty}^{\infty} g(u)\,du$ and $G(x) = \int_{-\infty}^{x} g(u)\,du$ are defined, then (1) can be written as:

\[
\int_{-\infty}^{\infty} x\,g(x)\,dx = \int_{0}^{\infty} \left( G - G(x) - G(-x) \right) dx. \tag{4}
\]


Let $g(x)$ and $g(x,y)$ be the marginal probability density function (pdf) of a random variable $X$ and the joint pdf of a continuous random vector $(X,Y)$, respectively. Hence, from (2) and (3):

\[
G^*(x) = \begin{cases} -F_X(x) & \text{if } x < 0 \\ 1 - F_X(x) & \text{if } x > 0 \end{cases} \tag{5}
\]

and

\[
G^*(x,y) = F(x,y) - I(x)F_Y(y) - I(y)F_X(x) + I(x)I(y), \tag{6}
\]

where $F(x,y)$ is the joint cumulative distribution function (cdf) of $(X,Y)$, and $F_X(x)$ and $F_Y(y)$ are the marginal cdfs of $X$ and $Y$, respectively. Therefore, if $E(X)$ denotes the expectation of $X$ and $\sigma_{XY}$ the covariance of $(X,Y)$, the following result can be stated:

Lemma 3 Let $(X,Y)$ be a continuous random vector with $\sigma_{XY} < \infty$. Hence

\[
E(X) = \int_{0}^{\infty} \left( 1 - F_X(x) - F_X(-x) \right) dx, \tag{7}
\]
\[
E(XY) = \int_{0}^{\infty}\int_{0}^{\infty} \left( 1 - F_X(x) - F_Y(y) - F_X(-x) - F_Y(-y) + F(x,y) + F(-x,y) + F(x,-y) + F(-x,-y) \right) dx\,dy, \tag{8}
\]
\[
\sigma_{XY} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \left( F(x,y) - F_X(x)F_Y(y) \right) dx\,dy. \tag{9}
\]

Proof. Identity (7) is obtained from identities (1) and (5). Identity (8) is given by identities (3) and (6). Identity (7) implies

\[
E(X)\,E(Y) = \int_{0}^{\infty}\int_{0}^{\infty} \big( 1 - F_X(x) - F_Y(y) - F_X(-x) - F_Y(-y) + F_X(x)F_Y(y) + F_X(-x)F_Y(y) + F_X(x)F_Y(-y) + F_X(-x)F_Y(-y) \big)\,dx\,dy
\]

and, therefore,

\[
E(XY) - E(X)\,E(Y) = \int_{0}^{\infty}\int_{0}^{\infty} \big( (F(x,y) - F_X(x)F_Y(y)) + (F(-x,y) - F_X(-x)F_Y(y)) + (F(x,-y) - F_X(x)F_Y(-y)) + (F(-x,-y) - F_X(-x)F_Y(-y)) \big)\,dx\,dy.
\]

Taking into account that

\[
\int_{0}^{\infty}\int_{0}^{\infty} (F(x,-y) - F_X(x)F_Y(-y))\,dy\,dx = \int_{0}^{\infty}\int_{-\infty}^{0} (F(x,y) - F_X(x)F_Y(y))\,dy\,dx,
\]
\[
\int_{0}^{\infty}\int_{0}^{\infty} (F(-x,y) - F_X(-x)F_Y(y))\,dx\,dy = \int_{-\infty}^{0}\int_{0}^{\infty} (F(x,y) - F_X(x)F_Y(y))\,dx\,dy,
\]
\[
\int_{0}^{\infty}\int_{0}^{\infty} (F(-x,-y) - F_X(-x)F_Y(-y))\,dx\,dy = \int_{-\infty}^{0}\int_{-\infty}^{0} (F(x,y) - F_X(x)F_Y(y))\,dx\,dy,
\]

then (9) is obtained. ∎

Let us see, in the next section, how Lemma 3 is useful in establishing theoretical results.
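Before moving on, identity (7) is easy to sanity-check numerically. The following Python sketch (not part of the paper; the Normal(1.5, 2) example and the midpoint quadrature settings are illustrative assumptions) integrates $1 - F_X(x) - F_X(-x)$ over $(0,\infty)$ and compares the result with $E(X)$:

```python
# Numerical check of identity (7): E(X) = int_0^inf (1 - F_X(x) - F_X(-x)) dx.
# Illustrative choice (not from the paper): X ~ Normal(1.5, 2), so E(X) = 1.5.
import math

MU, SIGMA = 1.5, 2.0

def F(x):
    # cdf of Normal(MU, SIGMA) via the error function
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

# Midpoint rule on [0, 40]; the integrand decays rapidly beyond a few sigmas.
n, upper = 200_000, 40.0
h = upper / n
integral = sum((1.0 - F((k + 0.5) * h) - F(-(k + 0.5) * h)) * h
               for k in range(n))

print(round(integral, 4))  # close to E(X) = 1.5
```

The same quadrature applied to any other distribution with a known mean gives the same agreement, which is a convenient way to catch sign errors when using (7).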


3 NOTES ON RANGE, MEAN DIFFERENCE, INDEPENDENCE, AND COVARIANCE OF RANDOM VARIABLES

Note 4 Result (7) can easily be generalized as follows:

\[
E(X^{2r+1}) = (2r+1)\int_{0}^{\infty} x^{2r} \left( 1 - F_X(x) - F_X(-x) \right) dx, \qquad \forall r = 0, 1, 2, \ldots
\]

and, if $X$ is non-negative, then

\[
E(X^{r+1}) = (r+1)\int_{0}^{\infty} x^{r} \left( 1 - F_X(x) \right) dx, \qquad \forall r = 0, 1, 2, \ldots
\]

That is, the $r$-th moment about the origin of a non-negative random variable can be obtained directly from the cdf $F(x)$ instead of from the pdf $f(x)$.

Note 5 Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (iid) random variables with the same distribution as $X$. If the transformations given by $U_n = \max\{X_1, X_2, \ldots, X_n\}$ and $V_n = \min\{X_1, X_2, \ldots, X_n\}$ are considered, then their cdfs are $F_{U_n}(u) = F^n(u)$ and $F_{V_n}(v) = 1 - (1 - F(v))^n$. By using (7),

\[
E(V_n) = \int_{0}^{\infty} \left( -1 + (1 - F(x))^n + (1 - F(-x))^n \right) dx, \qquad E(U_n) = \int_{0}^{\infty} \left( 1 - F^n(x) - F^n(-x) \right) dx.
\]

Hence,

\[
E(U_n - V_n) = \int_{-\infty}^{\infty} \left( 1 - F^n(x) - (1 - F(x))^n \right) dx.
\]

Furthermore, as a particular case, the mean difference of two iid random variables, $\Delta = E(|X_1 - X_2|)$, can be written as:

\[
\Delta = E(U_2 - V_2) = \int_{-\infty}^{\infty} \left( 1 - F^2(x) - (1 - F(x))^2 \right) dx = 2\int_{-\infty}^{\infty} F(x)\left( 1 - F(x) \right) dx.
\]
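The closed form for $\Delta$ in Note 5 can be checked in the same spirit. A minimal sketch, assuming $X \sim \mathrm{Uniform}(0,1)$ (so $F(x) = x$ on $[0,1]$ and the exact mean difference is $1/3$):

```python
# Check of Note 5's identity: Delta = 2 * int F(x)(1 - F(x)) dx.
# Assumed example: X ~ Uniform(0, 1), where F(x) = x and Delta = 1/3 exactly.
n = 100_000
h = 1.0 / n
delta = 2.0 * sum(((k + 0.5) * h) * (1.0 - (k + 0.5) * h) * h
                  for k in range(n))
print(round(delta, 6))  # close to 1/3
```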

Note 6 Usually, the covariance is defined as $\mathrm{Cov}(X,Y) = E[(X - E(X))\cdot(Y - E(Y))]$, and an interpretation of its meaning with respect to the independence or dependence between $X$ and $Y$ is given a posteriori. From (9), it is possible to give a new introduction to the covariance as follows. Given a random vector $(X,Y)$, the variables $X$ and $Y$ are said to be independent if $F(x,y) = F_X(x)\,F_Y(y)$ for every $x, y \in \mathbb{R}$. Hence, there is dependence between $X$ and $Y$ if there exist $x, y \in \mathbb{R}$ such that $F(x,y) - F_X(x)\,F_Y(y) \ne 0$. Therefore, a first measure of dependence or covariation between two random variables can be considered as:

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \left( F(x,y) - F_X(x)\,F_Y(y) \right) dx\,dy,
\]

which is named the "covariance" between $X$ and $Y$, and denoted by $\mathrm{Cov}(X,Y)$. Once the moments of a random vector are defined, it can be proved that $\mathrm{Cov}(X,Y) = E[(X - E(X))\cdot(Y - E(Y))]$. Thus, the covariance is introduced from the concept of independence.

Note 7 From (9), $\mathrm{Var}(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (F(x,y) - F_X(x)F_X(y))\,dx\,dy$, where $\mathrm{Var}(X)$ denotes the variance of $X$; and since $F(x,y) = P[X \le x, X \le y] = P[X \le \min(x,y)]$, the variance can be rewritten as:

\[
\mathrm{Var}(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F_X(\min(x,y)) \left( 1 - F_X(\max(x,y)) \right) dx\,dy,
\]

and, therefore, it is straightforward to prove, by taking the properties of the cdf into account, that:

\[
\frac{1}{2}\Delta^2 = 2\left( \int_{-\infty}^{\infty} F(x)(1 - F(x))\,dx \right)^{2} \le \mathrm{Var}(X) \le \left( \int_{-\infty}^{\infty} \sqrt{F(x)}\,\sqrt{1 - F(x)}\,dx \right)^{2},
\]

which provides us with a lower and an upper bound of the variance.
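The variance bounds of Note 7 can be illustrated numerically. A sketch for the assumed example $X \sim \mathrm{Uniform}(0,1)$, where $\mathrm{Var}(X) = 1/12$, $\int F(1-F)\,dx = 1/6$ and $\int \sqrt{F(1-F)}\,dx = \pi/8$:

```python
# Numeric check of Note 7's variance bounds for X ~ Uniform(0, 1):
# 2*(int F(1-F))^2 <= Var(X) <= (int sqrt(F(1-F)))^2, with F(x) = x on [0, 1].
import math

n = 200_000
h = 1.0 / n
xs = [(k + 0.5) * h for k in range(n)]
i1 = sum(x * (1.0 - x) for x in xs) * h             # lower-bound integral, exactly 1/6
i2 = sum(math.sqrt(x * (1.0 - x)) for x in xs) * h  # upper-bound integral, exactly pi/8

lower, var, upper = 2.0 * i1 * i1, 1.0 / 12.0, i2 * i2
print(lower <= var <= upper)  # True: 1/18 <= 1/12 <= pi^2/64
```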

4 GINI INDEX, COEFFICIENT OF VARIATION, AND A GRAPHICAL REPRESENTATION OF THE VARIANCE

Let $X$ be a non-negative continuous random variable with cdf $F(x)$, pdf $f(x)$ and finite variance. From Note 4, the expectation of $X$ can be written as $E(X) = \mu = \int_0^{\infty} (1 - F(x))\,dx$. Furthermore, the Lorenz function $L(x) = \frac{1}{\mu}\int_0^x t\,f(t)\,dt$ can be considered analogous to the cdf of a non-negative random variable $U_X$, and by considering $g(x) = \frac{1}{\mu}\,x\,f(x)$ in (1), then:

\[
\int_0^{\infty} x\,g(x)\,dx = \int_0^{\infty} (1 - L(x))\,dx \;\Rightarrow\; E(X^2) = \mu\,E(U_X) \;\Rightarrow\; \mathrm{Var}(X) = \mu\left( E(U_X) - \mu \right).
\]

However, $E(U_X) - \mu = \int_0^{\infty} (F(x) - L(x))\,dx$. Therefore,

\[
\int_0^{\infty} \left( F(x) - L(x) \right) dx = \frac{\mathrm{Var}(X)}{E(X)}. \tag{10}
\]

It should be pointed out that result (10) provides us with a relationship between some of the most important characteristics of a non-negative random variable: the expectation, the variance, the cumulative distribution function and the Lorenz curve. Moreover, result (10) gives a new interpretation of the variance of a non-negative random variable as the product of $\mu$ and the area enclosed by the cdf $F(x)$ and the Lorenz curve $L(x)$; that is, the variance is the product of $A$ ($= E(X)$) and $B$ in Figure 1.

Let us now introduce an equality measure from the area enclosed between the curves $y = F(x)$ and $y = L(x)$, that is, area $B$ in Figure 1. From the previous result, $E(U_X) = \mu + \frac{\mathrm{Var}(X)}{\mu}$; it follows that area $B$ is equal to $E(U_X) - \mu$. In order to eliminate the units of the variable and to achieve a relative measure, this value is divided by $\mu$, thereby obtaining $\frac{B}{\mu} = E\left( \frac{1}{\mu} U_X - 1 \right)$. From (10) (let us denote $\mu$ by $\mu_X$),

\[
CV^2(X) = E\left( \frac{1}{\mu_X} U_X - 1 \right), \tag{11}
\]

Figure 1: Graphical representation of the mean, the variance and the square of the coefficient of variation of a non-negative random variable. The curves $y = F(x)$ and $y = L(x)$ delimit the regions $A$ and $B$, with $A = E(X)$, $A \cdot B = \mathrm{Var}(X)$ and $B/A = CV^2(X)$.

where $CV(X)$ is the coefficient of variation of $X$. Hence, the square of the coefficient of variation has an intuitive geometric interpretation as the ratio of two regions.

It is worth noting that the construction in Figure 1 is similar to that of the Gini index. In order to study this similarity, the transformation $U = F(X)$ is considered, and the Lorenz curve can be written as $L(u) = \frac{1}{\mu}\int_0^u F^{-1}(t)\,dt$, $0 \le u \le 1$, where $F^{-1}$ is the left inverse of $F$. Hence, the area enclosed between the curves $y = u$ and $y = L(u)$, that is, area $B$ in Figure 2, is an equality measure. In the same way as for $L(x)$, the function $L(u)$ can be considered analogous to the cdf of a non-negative random variable $U_{F(X)}$, and from (7), $E(U_{F(X)}) = \int_0^1 (1 - L(u))\,du = \int_0^1 (u - L(u))\,du + \frac{1}{2} \le 1$ (note that $U = F(X)$ follows a uniform distribution and $F_U(u) = u$, $0 < u < 1$). Hence, $0 \le E(U_{F(X)}) - \frac{1}{2} = \int_0^1 (u - L(u))\,du \le \frac{1}{2}$, and multiplying by 2 in order to normalize this expression results in $0 \le E(2U_{F(X)} - 1) = 2\int_0^1 (u - L(u))\,du \le 1$. Furthermore, it is well-known that the Gini index is $IG(X) = 2\int_0^1 (u - L(u))\,du$ and that $E(U) = E(F(X)) = \mu_{F(X)} = \frac{1}{2}$, and hence an expression similar to the square of the coefficient of variation (11) is given by the Gini index:

\[
IG(X) = E\left( \frac{1}{\mu_{F(X)}} U_{F(X)} - 1 \right). \tag{12}
\]

Hence, the Gini index can be seen as a "normalization" of the square of the coefficient of variation, by using the transformation $U = F(X)$, from (11) and (12). Therefore, the square of the coefficient of variation of $X$ is an equality measure in the same way as is the Gini index. Another two similar expressions for $IG(X)$ and $CV^2(X)$, which are straightforward to obtain, are given in the following:
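Identity (10) lends itself to a direct numerical check. A sketch assuming $X \sim \mathrm{Exponential}(1)$, for which $\mu = \mathrm{Var}(X) = 1$ and the Lorenz function has the closed form $L(x) = 1 - (1+x)e^{-x}$:

```python
# Numeric check of identity (10) for X ~ Exponential(1): here mu = Var(X) = 1,
# so int_0^inf (F(x) - L(x)) dx should equal Var(X)/E(X) = 1.
import math

def F(x):
    # cdf of Exponential(1)
    return 1.0 - math.exp(-x)

def L(x):
    # Lorenz function (1/mu) * int_0^x t f(t) dt = 1 - (1 + x) e^{-x}
    return 1.0 - (1.0 + x) * math.exp(-x)

n, upper = 100_000, 60.0
h = upper / n
area = sum((F((k + 0.5) * h) - L((k + 0.5) * h)) * h for k in range(n))

print(round(area, 4))  # close to 1.0 = Var(X)/E(X)
```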

Figure 2: Graphical representation of the Gini index of a non-negative random variable.

In terms of integrals:

\[
IG(X) = \frac{1}{E(U)} \int_0^1 (u - L(u))\,du, \qquad CV^2(X) = \frac{1}{E(F^{-1}(U))} \int_0^1 (u - L(u))\,dF^{-1}(u) \quad \text{(from (10))}.
\]

In terms of covariances:

\[
IG(X) = \mathrm{Cov}\left( \frac{X}{\mu_X}, \frac{F(X)}{\mu_{F(X)}} \right) \quad \text{(given in (Lerman and Yitzhaki 1984))}, \qquad CV^2(X) = \mathrm{Cov}\left( \frac{X}{\mu_X}, \frac{X}{\mu_X} \right).
\]

Note 8 It is worth bearing in mind that the square of the coefficient of variation, as an inequality measure of a distribution of income (or consumption, or wealth, or a distribution of any other kind), satisfies the four properties which are generally postulated in the economic literature on inequality (for the sake of simplicity, let us interpret this coefficient on countries): Anonymity (it does not matter who the high and low earners are); Scale Independence (it does not consider the size of the economy, the way it is measured, or whether it is a rich or poor country on average); Population Independence (it does not matter how large the population of the country is); and the Transfer Principle (if an income less than the difference is transferred from a rich person to a poor person, then the resulting distribution is more equal) (Dalton 1920).

Example 4.1 If $X \in U(a,b)$ (uniform distribution), then $F(x) = u = \frac{x-a}{b-a}$, with $a \le x \le b$, and $dF^{-1}(u) = (b-a)\,du$. Hence,

\[
IG(X) = 2\int_0^1 (u - L(u))\,du = \frac{2}{b-a}\int_0^1 (u - L(u))\,dF^{-1}(u) = \frac{2\mu}{b-a}\,CV^2(X) = \frac{b+a}{b-a}\,CV^2(X) = \frac{1}{3}\cdot\frac{b-a}{b+a}. \;\; ∎
\]

The major drawback when the Gini index is used is that there are non-negative random variables $X$ and $Y$ such that $IG(X) = IG(Y)$ and, therefore, it is impossible to quantify which distribution is more equitable. To avoid this situation, and by following the above results, the most natural solution is obtained by calculating the square of the coefficient of variation. Let us see an example:

Example 4.2 Let $X \in U(\frac{1}{49}, 1)$. The square of the coefficient of variation is straightforward to calculate: $CV^2(X) = \frac{1}{3}\frac{(b-a)^2}{(b+a)^2} = 0.3072$, and, from Example 4.1, the Gini index is $IG(X) = \frac{1}{3}\frac{b-a}{b+a} = \frac{8}{25}$.

Let us consider the random variable $Y$ with values and probabilities given by $\{0, 0.5, 1\}$ and $\{0.2, 0.6, 0.2\}$, respectively. In this case, the Gini index is $IG(Y) = \frac{8}{25} = IG(X)$; nevertheless, $CV^2(Y) = 0.4000$ is greater than $CV^2(X)$. Thus, it can be concluded that the distribution of $X$ is more equitable than the distribution of $Y$. ∎

Another expression with regard to the integrals can be given. Let $X_1$ and $X_2$ be independent and identically distributed (iid) random variables with the same distribution as $X$; then:

\[
\int_0^1 (u - L(u))\,du = \frac{E|X_1 - X_2|}{4\mu}; \qquad \int_0^1 (u - L(u))\,dF^{-1}(u) = \frac{E(X_1 - X_2)^2}{2\mu}.
\]

The main advantage of the Gini index over the square of the coefficient of variation is that the Gini index is bounded, that is, $0 \le IG(X) \le 1$, while the square of the coefficient of variation has no upper bound, that is, $0 \le CV^2(X)$. Nevertheless, the Gini index is only defined for non-negative random variables, and this condition is not required by the coefficient of variation. In both cases, by the definition of the $L(\cdot)$ function, it is necessary that $\mu \ne 0$. The condition $X \ge 0$ leads to a bounded Gini index, but it is also possible to define the Gini index for any random variable $X$. This is studied in the following section.
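The comparison in Example 4.2 is easy to reproduce. The sketch below uses the closed forms of Example 4.1 for $X \sim U(1/49, 1)$ and computes $IG(Y)$ from the mean difference of the discrete distribution:

```python
# Reproducing Example 4.2: X ~ U(1/49, 1) and the discrete Y share the same
# Gini index 8/25 = 0.32, but CV^2(X) = 0.3072 < CV^2(Y) = 0.4.
a, b = 1.0 / 49.0, 1.0

# Closed forms from Example 4.1 for the uniform case.
ig_x = (b - a) / (3.0 * (b + a))
cv2_x = (b - a) ** 2 / (3.0 * (b + a) ** 2)

# Discrete Y: values {0, 0.5, 1} with probabilities {0.2, 0.6, 0.2}.
vals, probs = [0.0, 0.5, 1.0], [0.2, 0.6, 0.2]
mu = sum(v * p for v, p in zip(vals, probs))
var = sum((v - mu) ** 2 * p for v, p in zip(vals, probs))
delta = sum(abs(v - w) * p * q
            for v, p in zip(vals, probs)
            for w, q in zip(vals, probs))
ig_y = delta / (2.0 * mu)  # Gini index as mean difference over 2*mu
cv2_y = var / mu ** 2

print(ig_x, ig_y)    # both approximately 0.32
print(cv2_x, cv2_y)  # approximately 0.3072 and 0.4
```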

5 THE GINI INDEX OF ANY RANDOM VARIABLE

Let $X$ be a continuous random variable with cdf $F(x)$, pdf $f(x)$, $\mu \ne 0$ and finite variance. Clearly, the Lorenz function $L(x) = \frac{1}{\mu}\int_{-\infty}^x t\,f(t)\,dt$ cannot be considered as analogous to the cdf of a random variable, since $L(x)$ can take negative values. Nevertheless, it is possible to consider $g(x) = \frac{1}{\mu}\,x\,f(x)$ in (1) and hence, by using (4):

\[
E(X^2) = \int_{-\infty}^{\infty} x^2\,f(x)\,dx = \mu \int_0^{\infty} \left( 1 - L(x) - L(-x) \right) dx.
\]



Figure 3: Graphical representation of the cdf $F(x)$ and the Lorenz function $L(x)$ for the cases $\mu > 0$ and $\mu < 0$; in both cases, the area enclosed between the two curves is $B = \mathrm{Var}(X)/\mu$.

If the set $\{x \in \mathbb{R} : f(x) > 0\} = (a,b)$, with $-\infty \le a < b \le \infty$, and $R(x) = F(x) - L(x)$ are considered for any $x \in (a,b)$ (see Figure 3), then:

1. If $\mu > 0$, then $R(x) > 0$, and the maximum is attained at $x = \mu$.
2. If $\mu < 0$, then $R(x) < 0$, and the minimum is attained at $x = \mu$.

Hence, in the same way as for the non-negative random variable $X$, the square of the coefficient of variation can be considered as an equality measure, since:

\[
0 \le \frac{1}{\mu} \int_{-\infty}^{\infty} \left( F(x) - L(x) \right) dx = \frac{1}{\mu} \int_0^1 (u - L(u))\,dF^{-1}(u) = CV^2(X).
\]

The only difference between the general random variable case and the non-negative random variable case is that the graphical interpretation of this coefficient as the ratio between two areas is not possible.
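The claim that $\frac{1}{\mu}\int (F(x) - L(x))\,dx$ still equals $CV^2(X)$ for a variable taking negative values can be checked numerically. A sketch assuming $X \sim \mathrm{Normal}(1, 1)$ (so $CV^2(X) = \sigma^2/\mu^2 = 1$), using the closed form $\int_{-\infty}^x t\,f(t)\,dt = \mu F(x) - \sigma^2 f(x)$ for the normal pdf:

```python
# Section 5 check: X ~ Normal(1, 1) takes negative values, yet
# (1/mu) * int (F(x) - L(x)) dx should still equal CV^2(X) = sigma^2/mu^2 = 1.
import math

MU, SIGMA = 1.0, 1.0

def f(x):
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def F(x):
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

def L(x):
    # For the normal pdf, int_{-inf}^x t f(t) dt = mu*F(x) - sigma^2*f(x),
    # so L(x) = F(x) - (sigma^2/mu) f(x).
    return (MU * F(x) - SIGMA ** 2 * f(x)) / MU

n, lo, hi = 200_000, -12.0, 14.0
h = (hi - lo) / n
area = sum((F(lo + (k + 0.5) * h) - L(lo + (k + 0.5) * h)) * h for k in range(n))

print(round(area / MU, 3))  # close to 1.0 = CV^2(X)
```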

Figure 4: Graphical representation of the Gini index of a general random variable; in the case $\mu > 0$, the enclosed area is $B = IG/2$.