Bayesian Model Determination in Complex Systems


PhD thesis

to obtain the degree of PhD at the University of Groningen on the authority of the Rector Magnificus Prof. E. Sterken and in accordance with the decision by the College of Deans. This thesis will be defended in public on 24 April 2015 at 12.45 hours

by

Abdolreza Mohammadi born on 6 February 1982 in Shiraz, Iran

Supervisor Prof. E. C. Wit

Assessment Committee Prof. N. Friel Prof. N. Petkov Prof. G. Kauermann

ISBN: 978-90-367-7831-2

To my wife and my family

Contents

Chapter 1: Introduction
  1.1 Motivation
  1.2 Bayesian model determination in graphical models
  1.3 Trans-dimensional Markov chain Monte Carlo
  1.4 Outline of thesis contribution
  References

Chapter 2: Bayesian Structure Learning in Sparse Gaussian Graphical Models
  2.1 Abstract
  2.2 Introduction
  2.3 Bayesian Gaussian graphical models
  2.4 The birth-death MCMC method
    2.4.1 Proposed BDMCMC algorithm
    2.4.2 Step 1: Computing the birth and death rates
    2.4.3 Step 2: Direct sampler from precision matrix
  2.5 Statistical performance
    2.5.1 Simulation study
    2.5.2 Application to human gene expression data
    2.5.3 Extension to time course data
  2.6 Discussion
  References

Chapter 3: Bayesian Modelling of Dupuytren Disease Using Gaussian Copula Graphical Models
  3.1 Abstract
  3.2 Introduction
  3.3 Methodology
    3.3.1 Gaussian graphical models
    3.3.2 Gaussian copula graphical models
    3.3.3 Bayesian Gaussian copula graphical models
  3.4 Analysis of Dupuytren disease dataset
    3.4.1 Inference for Dupuytren disease with risk factors
    3.4.2 Severity of Dupuytren disease between pairs of fingers
    3.4.3 Fit of model to Dupuytren data
  3.5 Conclusion
  References

Chapter 4: BDgraph: An R Package for Bayesian Structure Learning in Graphical Models
  4.1 Abstract
  4.2 Introduction
  4.3 User interface
  4.4 Methodological background
    4.4.1 Bayesian Gaussian graphical models
    4.4.2 Gaussian copula graphical models
  4.5 The BDgraph environment
    4.5.1 Description of the bdgraph function
    4.5.2 Description of the bdgraph.sim function
    4.5.3 Description of the plotcoda and traceplot functions
    4.5.4 Description of the phat and select functions
    4.5.5 Description of the compare and plotroc functions
  4.6 User interface by toy example
    4.6.1 Running the BDMCMC algorithm
    4.6.2 Convergence check
    4.6.3 Comparison and goodness-of-fit
  4.7 Application to real data sets
    4.7.1 Application to labor force survey data
    4.7.2 Application to human gene expression
  4.8 Conclusion
  References

Chapter 5: Bayesian Modelling of the Exponential Random Graph Models with Non-observed Networks
  5.1 Abstract
  5.2 Introduction
  5.3 Exponential families and graphical models
    5.3.1 Exponential random graph models
    5.3.2 Graphical models
  5.4 Bayesian hierarchical model for ERGMs with non-observed networks
    5.4.1 Model for prior specification on graph
    5.4.2 Prior specification on precision matrix
    5.4.3 MCMC sampling scheme
  5.5 Discussion
  References

Chapter 6: Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service
  6.1 Abstract
  6.2 Introduction
  6.3 The M/G/1 queuing system with optional second service
    6.3.1 Some performance measures in this queuing system
  6.4 Bayesian inference
    6.4.1 Bayesian inference for finite Gamma mixture
    6.4.2 BD-MCMC algorithm
    6.4.3 Model identification
  6.5 Predictive densities
  6.6 Estimation of some performance measures via the BD-MCMC output
  6.7 Simulations
    6.7.1 Simulation study: mixture of Gammas
    6.7.2 Simulation study: mixture of truncated Normal and Log-Normal
  6.8 Discussion and future directions
  References

Summary

Samenvatting

Chapter 1: Introduction

1.1 Motivation

One of the main challenges in modern science is to model complex systems. The difficulty of modeling complex systems lies partly in their topology: how their components form rather complex networks. For example, in neuroscience we want to understand how various regions of the brain interact with one another. In genetics, the cell is a network of chemicals linked by chemical reactions, and we wish to model this complex network. From this perspective, our interest in modeling networks is part of a broader current of research on complex systems. The main contribution of this thesis is to develop Bayesian statistical methods that jointly model the underlying network (or graph) and the relationships it represents among the variables in the system. Graphical models provide powerful tools to model, and make statistical inference about, complex relationships among variables. The key feature of graphical models is the close relationship between their probabilistic properties and the topology of the underlying graph, which allows an intuitive understanding of complex systems.

1.2 Bayesian model determination in graphical models

In graphical models, nodes represent variables and edges represent pairwise dependencies, with the edge set defining the global conditional independence structure of the distribution. The methodological issues faced as the dimension grows include questions of the nature and consistency of prior specification: priors over the graph space, and over the parameters of any single, specified graph. The challenging problem is then searching over the space of graphs to identify subsets of interest under the theoretically implied posterior distributions. This represents a complex model selection problem: the analysis challenge is inherently one of model uncertainty and model selection, and inference on variable dependencies and prediction is then based on parametric inference within a set of selected graphs. In the Bayesian paradigm, we are interested in exploring the graph space and identifying the graphs that are most appropriate for a given dataset. To this end, we need to calculate the posterior distribution of the graph G conditional on the data,

    Pr(G | data) = Pr(G) Pr(data | G) / Σ_{G'∈G} Pr(G') Pr(data | G'),    (1.1)

in which G denotes the graph space. Computing this posterior distribution is computationally infeasible, since the denominator requires a sum over all possible graphs, and the size of the graph space grows super-exponentially with the number of variables: p nodes admit p(p − 1)/2 possible edges, and hence 2^{p(p−1)/2} different possible graphs, corresponding to all combinations of individual edges being in or out of the graph. For example, a graph with only 10 nodes already yields more than 35 × 10^{12} possible graphical models. This motivates us to develop effective search algorithms for exploring graphical model uncertainty. To be accurate and scalable, the key is to design search algorithms that quickly move towards regions of high posterior probability and that take advantage of local computation. One solution is the trans-dimensional MCMC methodology (4).
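The super-exponential growth of the graph space is easy to verify numerically; the following sketch (plain Python, not part of the thesis software) reproduces the count quoted above:

```python
def num_graphs(p):
    """Number of undirected graphs on p labelled nodes:
    each of the p(p-1)/2 possible edges is either present or absent."""
    return 2 ** (p * (p - 1) // 2)

print(num_graphs(10))  # 35184372088832, i.e. more than 35 x 10^12
```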

1.3 Trans-dimensional Markov chain Monte Carlo

This section focuses on Bayesian model selection via Markov chain Monte Carlo (MCMC) methods for what can be called trans-dimensional problems: those where the dynamic variable of the simulation, the unknowns in the Bayesian set-up, does not have fixed dimension. One conceptually elegant method is reversible jump Markov chain Monte Carlo (RJMCMC), proposed by (3) and also termed trans-dimensional MCMC, in which the model itself is conceived as another unknown and the MCMC algorithm is enlarged to allow 'jumps' between all possible models. A prior is required over the model space, but with judicious selection of jumps the number of models does not need to be specified in advance and each model does not require separate estimation. These moves require (reversible) bridges to be built between the parameters of models of different dimensions. The posterior probability of a model is then estimated by the proportion of times that the particular model is accepted in the MCMC run. This method has been employed and discussed for model selection in many contexts. In particular, (2) used this approach for model selection in decomposable undirected Gaussian graphical models. More recently, (1, 6, 5) used this method for model selection in the more general case of non-decomposable graphical models.

An alternative trans-dimensional MCMC approach is the birth-death MCMC (BDMCMC) algorithm, which is based on a continuous-time Markov birth-death process. In this method, the time between jumps to a larger dimension (birth) or a smaller one (death) is taken to be a random variable with a specific rate. The choice of birth and death rates determines the birth-death process and is made in such a way that the stationary distribution is precisely the posterior distribution of interest. Contrary to the RJMCMC approach, moves between models are always accepted, which makes the BDMCMC approach extremely efficient. In the context of finite mixture distributions with variable dimension, this method has been used by (15).
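A minimal Gillespie-style sketch may make the mechanics of such a continuous-time birth-death process concrete. This toy example (hypothetical rates, not the thesis algorithm) tracks only the model dimension k: births occur at rate β(k), deaths at rate δ(k), the waiting time in each state is exponential with mean 1/(β(k) + δ(k)), and every jump is accepted:

```python
import random

def simulate_birth_death(beta, delta, k0, k_max, n_jumps, seed=0):
    """Simulate a continuous-time birth-death process on the dimension k.

    beta, delta : functions returning the birth and death rate of state k.
    Returns a list of (state, waiting_time) pairs; the waiting times act
    as weights when averaging over the visited states.
    """
    rng = random.Random(seed)
    k, path = k0, []
    for _ in range(n_jumps):
        b = beta(k) if k < k_max else 0.0   # no birth beyond k_max
        d = delta(k) if k > 0 else 0.0      # no death below 0
        total = b + d
        wait = rng.expovariate(total)       # exponential waiting time
        path.append((k, wait))
        # every jump is "accepted"; only its direction is random
        k = k + 1 if rng.random() < b / total else k - 1
    return path

path = simulate_birth_death(beta=lambda k: 1.0, delta=lambda k: 0.5 * k,
                            k0=1, k_max=10, n_jumps=1000)
```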

1.4 Outline of thesis contribution

In this thesis we consider the problem of Bayesian inference in the following statistical models: graphical models (Chapters 2, 3 and 4), exponential random graph models (Chapter 5) and queuing systems (Chapter 6).

In Chapter 2 we introduce a novel and efficient Bayesian framework for Gaussian graphical model determination and cover the theory and computational details of the proposed method. We carry out the posterior inference using an efficient sampling scheme, a trans-dimensional MCMC approach based on a birth-death process. It is easy to implement and computationally feasible for high-dimensional graphs. We show that our method outperforms alternative Bayesian approaches in terms of convergence and computing time. Unlike frequentist approaches, it gives a principled and, in practice, sensible approach to structure learning. We apply the method to large-scale real applications from human and mammary gland gene expression studies to show its empirical usefulness. The results of this chapter are published in (13).

The method proposed in Chapter 2 is limited to data that satisfy the Gaussian assumption. In Chapter 3 we therefore propose a Bayesian approach for graphical model determination based on a Gaussian copula that can deal with continuous, discrete, or mixed data. We embed a graph selection procedure inside a semi-parametric Gaussian copula, and again carry out the posterior inference using an efficient trans-dimensional MCMC sampling scheme based on the birth-death process. We apply our approach to discovering potential risk factors related to Dupuytren disease. The contents of this chapter correspond to the manuscripts (7, 8) and the paper (11).

In Chapter 4 we introduce the R package BDgraph (12), which contains functions to perform Bayesian structure learning in high-dimensional graphical models with either continuous or discrete variables. This package efficiently implements the Bayesian approaches proposed in Chapters 2 and 3. The core of the BDgraph package is implemented in C++ to maximize computational speed. The contents of this chapter correspond to the manuscript (14).

In Chapter 5 we introduce a comprehensive Bayesian graphical modelling framework for exponential random graph models (ERGMs) with non-observed networks. The method increases the range and applicability of the ERGM as a tool for statistical inference in network structure learning.

In Chapter 6 we introduce a Bayesian framework for an M/G/1 queuing system with an optional second service. A semi-parametric model based on a finite mixture of Gamma distributions is used to approximate both the general service and re-service time densities in this queuing system. We estimate the system parameters, predictive densities and some performance measures related to this queuing system, such as the stationary system size and waiting time. The results of this chapter are published in (9, 10).

References


[1] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418–1433.

[2] Giudici, P. and Green, P. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86(4):785–801.

[3] Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

[4] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. Oxford Statistical Science Series, pages 179–198.

[5] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1):119–128.

[6] Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics, 20(1):140–157.

[7] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modeling of Dupuytren disease using Gaussian copula graphical models. arXiv preprint arXiv:1501.04849v2.

[8] Mohammadi, A., Abegaz Yazew, F., and Wit, E. C. (2014). Bayesian copula Gaussian graphical modelling. Proceedings of the 29th International Workshop on Statistical Modelling (IWSM'14), 1:225–230.

[9] Mohammadi, A., Salehi-Rad, M., and Wit, E. (2013). Using mixture of Gamma distributions for Bayesian analysis in an M/G/1 queue with optional second service. Computational Statistics, 28(2):683–700.

[10] Mohammadi, A. and Salehi-Rad, M. R. (2012). Bayesian inference and prediction in an M/G/1 with optional second service. Communications in Statistics – Simulation and Computation, 41(3):419–435.

[11] Mohammadi, A. and Wit, E. (2014). Contributed discussion on article by Finegold and Drton. Bayesian Analysis, 9(3):577–579.


[12] Mohammadi, A. and Wit, E. (2015a). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package version 2.17.

[13] Mohammadi, A. and Wit, E. C. (2015b). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138.

[14] Mohammadi, A. and Wit, E. C. (2015c). BDgraph: Bayesian structure learning of graphs in R. arXiv preprint arXiv:1501.05108v2.

[15] Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics, 28(1):40–74.

Chapter 2: Bayesian Structure Learning in Sparse Gaussian Graphical Models¹

2.1 Abstract

Decoding complex relationships among large numbers of variables with relatively few observations is one of the crucial issues in science. One approach to this problem is Gaussian graphical modeling, which describes conditional independence of variables through the presence or absence of edges in the underlying graph. In this paper, we introduce a novel and efficient Bayesian framework for Gaussian graphical model determination which is a trans-dimensional Markov chain Monte Carlo (MCMC) approach based on a continuous-time birth-death process. We cover the theory and computational details of the method. It is easy to implement and computationally feasible for high-dimensional graphs. We show our method outperforms alternative Bayesian approaches in terms of convergence, mixing in the graph space and computing time. Unlike frequentist approaches, it gives a principled and, in practice, sensible approach for structure learning. We illustrate the efficiency of the method on a broad range of simulated data. We then apply the method to large-scale real applications from human and mammary gland gene expression studies to show its empirical usefulness. In addition, we have implemented the method in the R package BDgraph, which is freely available at http://CRAN.R-project.org/package=BDgraph.

Key words: Bayesian model selection; Sparse Gaussian graphical models; Non-decomposable graphs; Birth-death process; Markov chain Monte Carlo; G-Wishart.

¹ Published as: Mohammadi, A. and E. C. Wit (2015). Bayesian Structure Learning in Sparse Gaussian Graphical Models, Bayesian Analysis, 10(1), 109–138.


2.2 Introduction

Statistical inference of complex relationships among large numbers of variables with a relatively small number of observations appears in many circumstances. Biologists want to recover the underlying genomic network between thousands of genes based on at most a few hundred observations. In market basket analysis, analysts try to find relationships between only a small number of purchases of individual customers (17). One approach to these tasks is probabilistic graphical modeling (25), which is based on the conditional independencies between variables. Graphical models offer fundamental tools to describe the underlying conditional correlation structure. They have recently gained in popularity in both statistics and machine learning with the rise of high-dimensional data (22, 13, 31, 49, 15, 38, 52, 47, 48). For the purpose of performing structure learning, Bayesian approaches provide a straightforward tool, explicitly incorporating underlying graph uncertainty.

In this paper, we focus on Bayesian structure learning in Gaussian graphical models for both decomposable and non-decomposable cases. Gaussian graphical model determination can be viewed as a covariance selection problem (11), where the non-zero off-diagonal entries of the precision matrix correspond to the edges in the graph. For a p-dimensional variable there are in total 2^{p(p−1)/2} possible conditional independence graphs. Even with a moderate number of variables, the model space is astronomical in size. The methodological problem as the dimension grows includes searching over the graph space to identify regions of high posterior probability. High-dimensional regimes, such as genetic networks, have hundreds of nodes, resulting in over 10^{100} possible graphs. This motivates us to construct an efficient search algorithm which explores the graph space to distinguish important edges from irrelevant ones and detect the underlying graph with high accuracy. One solution is the trans-dimensional MCMC methodology (20).
In the trans-dimensional MCMC methodology, the MCMC algorithm explores the model space to identify models with high posterior probability and estimate their parameters simultaneously. A special case is the reversible-jump MCMC (RJMCMC) approach, proposed by Green (19). This method constructs an ergodic discrete-time Markov chain whose stationary distribution is the joint posterior distribution of the model and the parameters. The process transits among models using an acceptance probability, which guarantees convergence to the target posterior distribution. If this probability is high, the process explores the model space efficiently; in the high-dimensional regime, however, this is not always the case. Giudici and Green (18) extended this method to decomposable Gaussian graphical models. Dobra et al. (13) developed it based on the Cholesky decomposition of the precision matrix. Lenkoski (26), Wang and Li (49), and Cheng et al. (9) developed RJMCMC algorithms which combined the exchange algorithm (34) and the double Metropolis-Hastings algorithm (29) to avoid the intractable normalizing constant calculation.

An alternative trans-dimensional MCMC methodology is the birth-death MCMC (BDMCMC) approach, which is based on a continuous-time Markov process. In this method, the time between jumps to a larger dimension (birth) or a smaller one (death) is taken to be a random variable with a specific rate. The choice of birth and death rates determines the birth-death process, and is made in such a way that the stationary distribution is precisely the posterior distribution of interest. Contrary to the RJMCMC approach, moves between models are always accepted, which makes the BDMCMC approach extremely efficient. In the context of finite mixture distributions with variable dimension this method has been used by (45), following earlier proposals by Ripley (39) and Geyer and Møller (16).

The main contribution of this paper is to introduce a novel Bayesian framework for Gaussian graphical model determination and to design a BDMCMC algorithm that performs both structure learning (graph estimation) and parameter learning (parameter estimation). In our BDMCMC method, we add or remove edges via birth or death events. The birth and death events are modeled as independent Poisson processes; therefore, the time between two successive birth or death events has an exponential distribution. The birth and death events occur in continuous time, and the relative rates at which they occur determine the stationary distribution of the process. The relationship between these rates and the stationary distribution is formalized in Section 2.4 (Theorem 2.1).

The outline of this paper is as follows.
In Section 2.3, we introduce the notation and preliminary background material, such as suitable prior distributions for the graph and the precision matrix. In Section 2.4, we propose our Bayesian framework and design our BDMCMC algorithm. In addition, this section contains the specific implementation of our method, including an efficient way to compute the birth and death rates of our algorithm and a direct sampler from the G-Wishart distribution for the precision matrix. In Section 2.5, we show the performance of the proposed method in several comprehensive simulation studies and on large-scale real-world examples from human gene expression data and a mouse mammary gland microarray experiment.


2.3 Bayesian Gaussian graphical models

We introduce some notation and the structure of undirected Gaussian graphical models; for a comprehensive introduction see Lauritzen (25). Let G = (V, E) be an undirected graph, where V = {1, 2, ..., p} is the set of nodes and E ⊂ V × V is the set of existing edges. Let

    W = { (i, j) | i, j ∈ V, i < j },

and let Ē = W \ E denote the set of non-existing edges. We define a zero-mean Gaussian graphical model with respect to the graph G as

    M_G = { N_p(0, Σ) | K = Σ^{−1} ∈ P_G },

where P_G denotes the space of p × p positive definite matrices with entries (i, j) equal to zero whenever (i, j) ∈ Ē. Let x = (x_1, ..., x_n) be an independent and identically distributed sample of size n from model M_G. Then the likelihood is

    P(x | K, G) ∝ |K|^{n/2} exp{ −(1/2) tr(KS) },    (2.1)

where S = x'x. The joint posterior distribution is given by

    P(G, K | x) ∝ P(x | G, K) P(K | G) P(G).    (2.2)
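As a quick numerical sanity check (not part of the thesis), the likelihood (2.1) depends on the data only through S = x'x: summing the quadratic forms x_i' K x_i over the n observations gives exactly tr(KS). A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
x = rng.standard_normal((n, p))   # rows are observations x_1, ..., x_n
K = np.eye(p) + 0.1               # a symmetric positive definite precision matrix

S = x.T @ x                               # sufficient statistic S = x'x
via_trace = np.trace(K @ S)               # tr(KS)
via_sum = sum(xi @ K @ xi for xi in x)    # sum of quadratic forms x_i' K x_i

print(np.allclose(via_trace, via_sum))    # True
```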

For the prior distribution of the graph there are many options, of which we propose two. In the absence of any prior beliefs about the graph structure, one choice is a discrete uniform distribution over the graph space G,

    P(G) = 1 / |G|,    for each G ∈ G.

Alternatively, we propose a truncated Poisson distribution on the graph size |E| with parameter γ,

    P(G) ∝ γ^{|E|} / |E|!,    for each G = (V, E) ∈ G.
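Under this truncated Poisson prior the normalizing constant cancels in the prior ratios a search algorithm actually needs: adding one edge to a graph with k = |E| edges changes the prior by a factor γ/(k + 1). A small illustrative sketch (the helper names are ours, not from the thesis):

```python
from math import exp, lgamma, log

def log_graph_prior(k, gamma):
    """Unnormalized log prior of a graph with k edges: k*log(gamma) - log(k!)."""
    return k * log(gamma) - lgamma(k + 1)

def birth_prior_ratio(k, gamma):
    """P(graph with k+1 edges) / P(graph with k edges) = gamma / (k + 1)."""
    return exp(log_graph_prior(k + 1, gamma) - log_graph_prior(k, gamma))

print(round(birth_prior_ratio(4, gamma=2.0), 6))  # 0.4, i.e. 2 / 5
```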

Other choices of priors for the graph structure involve modelling the joint state of the edges as a multivariate discrete distribution (7, 43), encouraging sparse graphs (22), or having multiple testing correction properties (42).

For the prior distribution of the precision matrix, we use the G-Wishart distribution (40, 28), which is attractive since it represents the conjugate prior for normally distributed data. It places no probability mass on the zero entries of the precision matrix. A zero-constrained random matrix K ∈ P_G has the G-Wishart distribution W_G(b, D) if

    P(K | G) = (1 / I_G(b, D)) |K|^{(b−2)/2} exp{ −(1/2) tr(DK) },

where b > 2 is the number of degrees of freedom, D is a symmetric positive definite matrix, and I_G(b, D) is the normalizing constant,

    I_G(b, D) = ∫_{P_G} |K|^{(b−2)/2} exp{ −(1/2) tr(DK) } dK.

When G is complete, the G-Wishart distribution W_G(b, D) reduces to the Wishart distribution W_p(b, D); hence, its normalizing constant has an explicit form (33). If G is decomposable, we can calculate I_G(b, D) explicitly (40). For non-decomposable graphs, however, I_G(b, D) does not have an explicit form; we can approximate it numerically by the Monte Carlo method (3) or a Laplace approximation (27). The G-Wishart prior is conjugate to the likelihood (2.1); hence, conditional on graph G and observed data x, the posterior distribution of K is

    P(K | x, G) = (1 / I_G(b*, D*)) |K|^{(b*−2)/2} exp{ −(1/2) tr(D*K) },

where b* = b + n and D* = D + S; that is, W_G(b*, D*).

Other choices of priors for the precision matrix belong to a class of shrinkage priors (50) based on the graphical lasso approach (47, 48). They place constant priors on the non-zero entries of the precision matrix and no probability mass on the zero entries. In the following section, we describe an efficient trans-dimensional MCMC sampling scheme for our joint posterior distribution (2.2).
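The conjugate update b* = b + n, D* = D + S described above is a one-liner in code. The sketch below is illustrative only (the function name and hyperparameter choices are ours), showing how the posterior hyperparameters of the G-Wishart are obtained from data:

```python
import numpy as np

def gwishart_posterior_update(b, D, x):
    """Conjugate update for the G-Wishart prior W_G(b, D):
    with n observations in the rows of x, the posterior is
    W_G(b + n, D + S), where S = x'x."""
    n = x.shape[0]
    S = x.T @ x
    return b + n, D + S

rng = np.random.default_rng(1)
x = rng.standard_normal((30, 3))
b_star, D_star = gwishart_posterior_update(b=3, D=np.eye(3), x=x)
print(b_star)  # 33
```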

2.4 The birth-death MCMC method

Here, we construct a continuous-time birth-death Markov process specifically for Gaussian graphical model selection. The process explores the graph space by adding or removing single edges in birth or death events. Births and deaths occur in continuous time, with rates determined by the stationary distribution of the process.


Suppose the birth-death process at time t is in state (G, K), in which G = (V, E) with precision matrix K ∈ P_G. Let Ω = ∪_{G∈G} {(G, K) : K ∈ P_G}, where G denotes the set of all possible graphs. We consider the following continuous-time birth-death Markov process on Ω:

Death: Each edge e ∈ E dies independently of the others as a Poisson process with rate δ_e(K). Thus, the overall death rate is δ(K) = Σ_{e∈E} δ_e(K). If the death of an edge e = (i, j) ∈ E occurs, then the process jumps to a new state (G^{−e}, K^{−e}), in which G^{−e} = (V, E \ {e}) and K^{−e} ∈ P_{G^{−e}}. We assume K^{−e} is equal to the matrix K except for the entries in positions {(i, j), (j, i), (j, j)}. Note that we can distinguish i from j since, by our definition of an edge, i < j.

Birth: A new edge e ∈ Ē is born independently of the others as a Poisson process with rate β_e(K). Thus, the overall birth rate is β(K) = Σ_{e∈Ē} β_e(K). If the birth of an edge e = (i, j) ∈ Ē occurs, then the process jumps to a new state (G^{+e}, K^{+e}), in which G^{+e} = (V, E ∪ {e}) and K^{+e} ∈ P_{G^{+e}}. We assume K^{+e} is equal to the matrix K except for the entries in positions {(i, j), (j, i), (j, j)}.

The birth and death processes are independent Poisson processes. Thus, the time between two successive events is exponentially distributed with mean 1/(β(K) + δ(K)), and the probability of the next birth/death event is

    P(birth for edge e) = β_e(K) / (β(K) + δ(K)),    for each e ∈ Ē,    (2.3)

    P(death for edge e) = δ_e(K) / (β(K) + δ(K)),    for each e ∈ E.    (2.4)
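Equations (2.3) and (2.4) simply normalize the individual edge rates by the total rate β(K) + δ(K). One might simulate the type of the next jump as follows (hypothetical helper, plain Python; the rate values are made up for illustration):

```python
import random

def next_jump(birth_rates, death_rates, rng=random.Random(2)):
    """Pick the next event: each candidate edge e is chosen with probability
    rate_e / (beta + delta), as in (2.3) and (2.4); the waiting time is
    exponential with mean 1 / (beta + delta)."""
    events = ([("birth", e, r) for e, r in birth_rates.items()]
              + [("death", e, r) for e, r in death_rates.items()])
    total = sum(r for _, _, r in events)
    waiting_time = rng.expovariate(total)
    u, acc = rng.random() * total, 0.0
    for kind, edge, r in events:           # categorical draw over all events
        acc += r
        if u < acc:
            return kind, edge, waiting_time
    return events[-1][0], events[-1][1], waiting_time

kind, edge, w = next_jump({(1, 2): 0.7, (1, 3): 0.3}, {(2, 3): 1.0})
```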

The following theorem gives a sufficient condition under which the stationary distribution of our birth-death process is precisely the joint posterior distribution of the graph and precision matrix.

Theorem 2.1. The above birth-death process has stationary distribution P(K, G | x) if, for each e ∈ W,

    δ_e(K) P(G, K \ (k_ij, k_jj) | x) = β_e(K^{−e}) P(G^{−e}, K^{−e} \ k_jj | x).    (2.5)

Proof. Our proof is based on the theory derived by Preston (37, Sections 7 and 8). Preston proposed a special birth-death process in which the birth and death rates are functions of the state. The process evolves by two types of jumps: a birth is defined as the appearance of a single individual, whereas a death is the removal of a single individual. This process converges to a unique stationary distribution if the balance conditions hold (37, Theorem 7.1). We construct our method in such a way that the stationary distribution equals the joint posterior distribution of the graph and the precision matrix. See the appendix for a detailed proof.

2.4.1 Proposed BDMCMC algorithm

Our proposed BDMCMC algorithm is based on a specific choice of birth and death rates that satisfies Theorem 2.1. We take the birth and death rates to be

βe(K) = P(G+e, K+e \ (kij, kjj) | x) / P(G, K \ kjj | x),  for each e ∈ Ē,    (2.6)

δe(K) = P(G−e, K−e \ kjj | x) / P(G, K \ (kij, kjj) | x),  for each e ∈ E.    (2.7)

Based on the above rates, we determine our BDMCMC algorithm as below.

Algorithm 2.1. BDMCMC algorithm. Given a graph G = (V, E) with a precision matrix K, iterate the following steps:
Step 1. Birth and death process
  1.1. Calculate the birth rates by (2.6) and β(K) = Σe∈Ē βe(K),
  1.2. Calculate the death rates by (2.7) and δ(K) = Σe∈E δe(K),
  1.3. Calculate the waiting time by w(K) = 1/(β(K) + δ(K)),
  1.4. Simulate the type of jump (birth or death) by (2.3) and (2.4).
Step 2. According to the type of jump, sample from the new precision matrix.

The main computational parts of our BDMCMC algorithm are computing the birth and death rates (steps 1.1 and 1.2) and sampling from the posterior distribution of the precision matrix (step 2). In Section 2.4.2, we illustrate how to calculate the birth and death rates. In Section 2.4.3, we explain a direct sampling algorithm from the G-Wishart distribution for sampling from the posterior distribution of the precision matrix.

In our continuous-time BDMCMC algorithm we sample at each jump to a new state (e.g. {t1, t2, t3, ...} in Figure 2.1). For inference, we put a weight on each state to compute the sample mean as a Rao-Blackwellized estimator (6, subsection 2.5); see e.g. (2.13). The weights are equal to the length of the waiting time in each state (e.g. {w1, w2, w3, ...} in Figure 2.1). Based on these waiting times, we estimate the posterior distribution of the graphs as the proportion of the total waiting time spent in each graph (see Figure 2.1, right, and Figure 2.3). For more detail about sampling from continuous-time Markov processes, see Cappé et al. (6, subsection 2.5).

Fig. 2.1 (Left) Continuous time BDMCMC algorithm where {t1 , t2 , t3 , ...} are jumping times and {w1 , w2 , w3 , ...} are waiting times. (Right) Posterior probability estimation of the graphs based on the proportions of their waiting times.
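To make the waiting-time weighting concrete, here is a minimal sketch (not the BDgraph implementation; function names are illustrative) of turning a list of visited graphs and their waiting times into the two Rao-Blackwellized estimates described above: the posterior probability of each graph, and the posterior inclusion probability of each edge.

```python
from collections import defaultdict

def posterior_graph_probs(samples):
    """samples: list of (graph, waiting_time) pairs, where a graph is a
    frozenset of edges. Returns the estimated posterior probability of each
    visited graph: its share of the total waiting time."""
    total = sum(w for _, w in samples)
    time_in = defaultdict(float)
    for g, w in samples:
        time_in[g] += w
    return {g: t / total for g, t in time_in.items()}

def edge_inclusion_probs(samples):
    """Rao-Blackwellized posterior inclusion probability of each edge:
    total waiting time of graphs containing the edge, over total time."""
    total = sum(w for _, w in samples)
    p = defaultdict(float)
    for g, w in samples:
        for e in g:
            p[e] += w / total
    return dict(p)
```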

2.4.2 Step 1: Computing the birth and death rates

In step 1 of our BDMCMC algorithm, the main task is calculating the birth and death rates (steps 1.1 and 1.2); the other steps are straightforward. Here we illustrate how to calculate the death rates; the birth rates are calculated in a similar manner, since both birth and death rates (2.6) and (2.7) are ratios of conditional posterior densities. For each e = (i, j) ∈ E, the numerator of the death rate is

P(G−e, K−e \ kjj | x) = P(G−e, K−e | x) / P(kjj | K−e \ kjj, G−e, x).

The full conditional posterior for kjj is (see Roverato 40, Lemma 1)

kjj − c | K−e \ kjj, G−e, x ∼ W(b∗, D∗jj),

where c = Kj,V\j (KV\j,V\j)−1 KV\j,j. Following Wang and Li (49) and some simplification, we have

P(G−e, K−e \ kjj | x) = [P(G−e) / P(x)] [I(b∗, D∗jj) / IG−e(b, D)] |K0 V\j,V\j|^((b∗−2)/2) exp{ −(1/2) tr(K0 D∗) },    (2.8)

where I(b∗, D∗jj) is the normalizing constant of a G-Wishart distribution for p = 1, and K0 = K except for an entry 0 in the positions (i, j) and (j, i) and an entry c in the position (j, j).

For the denominator of the death rate we have

P(G, K \ (kij, kjj) | x) = P(G, K | x) / P((kij, kjj) | K \ (kij, kjj), G, x),

in which we need the full conditional distribution of (kij, kjj). We can obtain the conditional posterior of (kii, kij, kjj) and, by further conditioning on kii and using the proposition in the appendix, evaluate the full conditional distribution of (kij, kjj). Following Wang and Li (49) and some simplification, we have

P(G, K \ (kij, kjj) | x) = [P(G) / P(x)] [J(b∗, D∗ee, K) / IG(b, D)] |K1 V\e,V\e|^((b∗−2)/2) exp{ −(1/2) tr(K1 D∗) },    (2.9)

where

J(b∗, D∗ee, K) = (2π / D∗jj)^(1/2) I(b∗, D∗jj) (kii − kii1)^((b∗−2)/2) exp{ −(1/2) (D∗ii − D∗ij² / D∗jj)(kii − kii1) },

and K1 = K except for the entries Ke,V\e (KV\e,V\e)−1 KV\e,e in the positions corresponding to e = (i, j). By plugging (2.8) and (2.9) into the death rates (2.7), we obtain

δe(K) = [P(G−e) / P(G)] [IG(b, D) / IG−e(b, D)] H(K, D∗, e),    (2.10)

in which

H(K, D∗, e) = ( D∗jj / (2π(kii − kii1)) )^(1/2) exp{ −(1/2) [ tr(D∗(K0 − K1)) − (D∗ii − D∗ij² / D∗jj)(kii − kii1) ] }.    (2.11)
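For concreteness, a direct numpy transcription of (2.11) might look as follows. This is a sketch under our reading of the index notation in this subsection (the function name `H` and the ordering e = (i, j) with i < j are assumptions), not code from the BDgraph package.

```python
import numpy as np

def H(K, Dstar, e):
    """Evaluate H(K, D*, e) from (2.11) for an edge e = (i, j), i < j."""
    p = K.shape[0]
    i, j = e
    not_j = [v for v in range(p) if v != j]
    not_e = [v for v in range(p) if v not in e]
    # c = K_{j,V\j} (K_{V\j,V\j})^{-1} K_{V\j,j}
    c = K[j, not_j] @ np.linalg.solve(K[np.ix_(not_j, not_j)], K[not_j, j])
    # K0: K with 0 at (i,j),(j,i) and c at (j,j)
    K0 = K.copy(); K0[i, j] = K0[j, i] = 0.0; K0[j, j] = c
    # K1: K with K_{e,V\e}(K_{V\e,V\e})^{-1}K_{V\e,e} in the e-block
    block = K[np.ix_(e, not_e)] @ np.linalg.solve(
        K[np.ix_(not_e, not_e)], K[np.ix_(not_e, e)])
    K1 = K.copy(); K1[np.ix_(e, e)] = block
    gap = K[i, i] - block[0, 0]   # kii - kii1 > 0 since K is positive definite
    quad = Dstar[i, i] - Dstar[i, j] ** 2 / Dstar[j, j]
    return np.sqrt(Dstar[j, j] / (2 * np.pi * gap)) * \
        np.exp(-0.5 * (np.trace(Dstar @ (K0 - K1)) - quad * gap))
```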

For computing the above death rates, we require the prior normalizing constants, which is the main computational burden. Calculation of the remaining terms is extremely fast.

Coping with evaluation of prior normalizing constants

Murray et al. (34) proved that the exchange algorithm based on exact sampling is a powerful tool for general MCMC algorithms whose likelihoods have additional parameter-dependent normalization terms, such as the posterior over parameters of an undirected graphical model. Wang and Li (49) and Lenkoski (26) illustrate how to use the idea behind the exchange algorithm to circumvent intractable normalizing constants as in (2.10). Given the existence of a direct sampler for the G-Wishart distribution, Lenkoski (26) used a modification of the exchange algorithm to cope with the ratio of prior normalizing constants.

Suppose that (G, K) is the current state of our algorithm and we would like to calculate the death rates (2.10). First we sample K̃ according to WG(b, D) via the exact sampler, Algorithm 2.2 below. Then we replace the death rates with

δe(K) = [P(G−e) / P(G)] [H(K, D∗, e) / H(K̃, D, e)],    (2.12)

in which the intractable prior normalizing constants have been replaced by an evaluation of H (given in (2.11)) in the prior, evaluated at K̃. For theoretical justifications of this procedure, see Murray et al. (34) and Liang (29).

2.4.3 Step 2: Direct sampler from precision matrix

Lenkoski (26) developed an exact sampling method for the precision matrix, which borrows ideas from Hastie et al. (21); see Algorithm 2.2. Throughout, we use this direct sampler for sampling from the precision matrix K.

2.5 Statistical performance

Here we present results for three comprehensive simulation studies and two applications to real data sets. In Section 2.5.1, we show that our method outperforms alternative Bayesian approaches in terms of convergence, mixing in the graph space, and computing time; moreover, its model selection properties compare favorably with frequentist alternatives. In Section 2.5.2, we illustrate our method on a large-scale real data set related to

the human gene expression data. In Section 2.5.3, we demonstrate the extension of our method to graphical models which involve time series data, showing how graphs can be useful in modeling real-world problems such as gene expression time-course data. We performed all computations with the R package BDgraph (32).

Algorithm 2.2. Direct sampler from precision matrix, Lenkoski (26). Given a graph G = (V, E) with precision matrix K and Σ = K−1:
Step 1. Set Ω = Σ.
Step 2. Repeat for i = 1, ..., p, until convergence:
  2.1. Let Ni ⊂ V be the set of neighbors of node i in graph G. Form ΩNi,Ni and ΣNi,i and solve β̂∗i = Ω−1Ni,Ni ΣNi,i,
  2.2. Form β̂i ∈ Rp−1 by padding the elements of β̂∗i into the appropriate locations and zeroes in those locations not connected to i in graph G,
  2.3. Update Ωi,−i and Ω−i,i with Ω−i,−i β̂i.
Step 3. Return K = Ω−1.
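A numpy sketch of Algorithm 2.2 is given below, under two stated assumptions: `adj` is a boolean adjacency matrix of G, and a fixed number of sweeps stands in for the convergence check of Step 2. The function name is illustrative.

```python
import numpy as np

def gwishart_direct_step(Sigma, adj, sweeps=200):
    """Complete Sigma = K^{-1} (an unconstrained draw) to a G-Wishart sample,
    following Algorithm 2.2: cycle over nodes, refitting row/column i of
    Omega so that the resulting precision matrix respects the graph."""
    p = Sigma.shape[0]
    Omega = Sigma.copy()
    for _ in range(sweeps):
        for i in range(p):
            rest = [v for v in range(p) if v != i]
            Ni = np.where(adj[i])[0]              # neighbors of node i in G
            if Ni.size == 0:
                Omega[i, rest] = 0.0
                Omega[rest, i] = 0.0
                continue
            # Step 2.1: beta*_i = Omega_{Ni,Ni}^{-1} Sigma_{Ni,i}
            beta_star = np.linalg.solve(Omega[np.ix_(Ni, Ni)], Sigma[Ni, i])
            # Step 2.2: pad with zeros at non-neighbors
            beta = np.zeros(p)
            beta[Ni] = beta_star
            # Step 2.3: update row/column i of Omega
            w = Omega[np.ix_(rest, rest)] @ beta[rest]
            Omega[i, rest] = w
            Omega[rest, i] = w
    return np.linalg.inv(Omega)                   # Step 3: K = Omega^{-1}
```

At convergence, the returned precision matrix has (numerically) exact zeros at all non-edges of G while matching Σ on the free entries.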

2.5.1 Simulation study

Graph with 6 nodes

We illustrate the performance of our methodology, and compare it with two alternative Bayesian methods, on a concrete small toy simulation example which comes from Wang and Li (49). We consider a data-generating mechanism with p = 6 within MG = { N6(0, Σ) | K = Σ−1 ∈ PG }, in which the precision matrix is

        1   0.5   0     0     0    0.4
             1   0.5    0     0     0
                  1    0.5    0     0
K =                     1    0.5    0
                              1    0.5
                                    1

(the matrix is symmetric; only the upper triangle is shown). Just like Wang and Li (49), we let S = nK−1 where n = 18, which represents 18 samples from the true model MG. As a non-informative prior, we take a uniform distribution for


the graph and a G-Wishart WG(3, I6) for the precision matrix. To evaluate the performance of our BDMCMC algorithm, we run it for 60,000 iterations with 30,000 as burn-in. All the computations for this example were carried out on an Intel(R) Core(TM) i5 CPU 2.67GHz processor.

We calculate the posterior pairwise edge inclusion probabilities based on Rao-Blackwellization (6, subsection 2.5) as

p̂e = [ Σt=1..N I(e ∈ G(t)) w(K(t)) ] / [ Σt=1..N w(K(t)) ],  for each e ∈ W,    (2.13)

where N is the number of iterations, I(e ∈ G(t)) is an indicator function such that I(e ∈ G(t)) = 1 if e ∈ G(t) and zero otherwise, and w(K(t)) is the waiting time in the graph G(t) with precision matrix K(t); see Figure 2.1. The posterior pairwise edge inclusion probabilities for all the edges e = (i, j) ∈ W are

        0   0.98  0.05  0.02  0.03  0.92
             0    0.99  0.04  0.01  0.04
                   0    0.99  0.04  0.02
p̂e =                    0    0.99  0.06
                               0    0.98
                                     0

The posterior mean of the precision matrix is

        1.16  0.58  −0.01  0.00  −0.01  0.44
              1.18  0.58  −0.01  0.00  −0.01
                    1.18  0.58  −0.01  0.00
K̂ =                      1.18   0.58  −0.01
                                 1.17   0.57
                                        1.16

(both matrices are symmetric; only the upper triangles are shown). We compare the performance of our BDMCMC algorithm with two recently proposed trans-dimensional MCMC approaches: the algorithm proposed by Lenkoski (26), which we call "Lenkoski", and the algorithm proposed by Wang and Li (49), which we call "WL". The R code for the WL approach is available at http://r-forge.r-project.org/projects/bgraph/.

Compared to the other Bayesian approaches, our BDMCMC algorithm is highly efficient due to its fast convergence. One useful test of convergence is given by the plot of

the cumulative occupancy fraction for all possible edges, shown in Figure 2.2. Figure 2.2a shows that our BDMCMC algorithm converges after approximately 10,000 iterations. Figures 2.2b and 2.2c show that the Lenkoski algorithm converges after approximately 30,000 iterations, whereas the WL algorithm still has not converged after 60,000 iterations.

Fig. 2.2 Plot of the cumulative occupancy fractions of all possible edges to check convergence in simulation example 2.5.1; BDMCMC algorithm in 2.2a, Lenkoski algorithm (26) in 2.2b, and WL algorithm of Wang and Li (49) in 2.2c.
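The diagnostic plotted in Figure 2.2 is straightforward to reproduce for any sampler output; a minimal (unweighted, per-iteration) version is:

```python
import numpy as np

def cumulative_occupancy(edge, graphs):
    """Cumulative occupancy fraction of one edge: after t iterations, the
    fraction of the graphs sampled so far that contain the edge.
    `graphs` is the list of sampled edge sets, one per iteration."""
    hits = np.array([edge in g for g in graphs], dtype=float)
    return np.cumsum(hits) / np.arange(1, len(graphs) + 1)
```

For the continuous-time BDMCMC sampler one would additionally weight each iteration by its waiting time, as in (2.13).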

Figure 2.3 reports the estimated posterior distribution of the graphs for the BDMCMC, Lenkoski, and WL algorithms, respectively. Figure 2.3a indicates that our algorithm visited around 450 different graphs, and the estimated posterior probability of the true graph is 0.66, the highest posterior probability among the visited graphs. Figure 2.3b shows that the Lenkoski algorithm visited around 400 different graphs, and the estimated posterior probability of the true graph is 0.40. Figure 2.3c shows that the WL algorithm visited only 23 different graphs, and the estimated posterior probability of the true graph is 0.35.

Fig. 2.3 Plot of the estimated posterior probability of graphs in simulation example 2.5.1; BDMCMC algorithm in 2.3a, Lenkoski algorithm (26) in 2.3b, and WL algorithm (49) in 2.3c.


To assess the performance of the graph structure estimation, we compute the posterior probability of the true graph and the calibration error (CE), defined as

CE = Σe∈W | p̂e − I(e ∈ Gtrue) |,    (2.14)

where, for each e ∈ W, p̂e is the posterior pairwise edge inclusion probability in (2.13) and Gtrue is the true graph. The CE is non-negative, with a minimum of 0, and smaller is better. Table 2.1 compares our method with the two other Bayesian approaches (WL and Lenkoski), reporting mean values over 50 replications of the entire simulation, with standard errors in parentheses. The first and second columns show that our algorithm performs better, owing to its faster convergence. The third column shows the acceptance probability (α), which is the probability of moving to a new graphical model. The fourth column shows that our algorithm is slower than the Lenkoski algorithm and faster than the WL approach. For a fair comparison, however, note that our method obtains 60,000 samples in 25 minutes, while the Lenkoski algorithm in 14 minutes yields only 60,000 × 0.058 = 3,480 effective samples. For fairness we performed all simulations in R; our package BDgraph, however, implements the algorithm efficiently in C++ linked to R. For 60,000 iterations, the C++ code takes only 17 seconds instead of 25 minutes in R, around 90 times faster than the R code. This makes our algorithm computationally feasible for high-dimensional graphs.
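The CE of (2.14) is a one-liner. In this sketch, `p_hat` maps edges to estimated inclusion probabilities, and edges absent from the map count as probability 0 (an assumption of this sketch, matching (2.13) for never-visited edges).

```python
def calibration_error(p_hat, true_edges, all_edges):
    """Calibration error (2.14): sum over all possible edges of
    |p_hat_e - I(e in G_true)|."""
    return sum(abs(p_hat.get(e, 0.0) - (1.0 if e in true_edges else 0.0))
               for e in all_edges)
```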

            P(true G | data)   CE            α                CPU time (min)
BDMCMC      0.66 (0.00)        0.47 (0.01)   1                25 (0.14)
Lenkoski    0.36 (0.02)        1.17 (0.08)   0.058 (0.001)    14 (0.13)
WL          0.33 (0.12)        1.25 (0.46)   0.003 (0.0003)   37 (0.64)

Table 2.1 Summary of performance measures in simulation example 2.5.1 for the BDMCMC approach, Lenkoski (26), and WL (49). The table presents the average posterior probability of the true graph, the average calibration error (CE) defined in (2.14), the average acceptance probability (α), and the average computing time in minutes, with 50 replications and standard deviations in parentheses.

Extensive comparison with Bayesian methods

We now perform a comprehensive simulation across different graph structures to evaluate the performance of our Bayesian method and compare it with the two recently proposed trans-dimensional MCMC algorithms, WL (49) and Lenkoski (26). Corresponding to different sparsity patterns, we consider 7 different kinds of synthetic graphical models:

1. Circle: A graph with kii = 1, ki,i−1 = ki−1,i = 0.5, k1p = kp1 = 0.4, and kij = 0 otherwise.
2. Star: A graph in which every node is connected to the first node, with kii = 1, k1i = ki1 = 0.1, and kij = 0 otherwise.
3. AR(1): A graph with σij = 0.7^|i−j|.
4. AR(2): A graph with kii = 1, ki,i−1 = ki−1,i = 0.5, ki,i−2 = ki−2,i = 0.25, and kij = 0 otherwise.
5. Random: A graph in which the edge set E is randomly generated from independent Bernoulli distributions with probability 2/(p − 1); the corresponding precision matrix is generated from K ∼ WG(3, Ip).
6. Cluster: A graph in which the number of clusters is max{2, p/20}. Each cluster has the same structure as a random graph. The corresponding precision matrix is generated from K ∼ WG(3, Ip).
7. Scale-free: A graph generated by the B-A algorithm (2). The resulting graph has p − 1 edges. The corresponding precision matrix is generated from K ∼ WG(3, Ip).

For each graphical model, we consider four different scenarios: (1) dimension p = 10 and sample size n = 30, (2) p = 10 and n = 100, (3) p = 50 and n = 100, (4) p = 50 and n = 500. For each generated sample, we fit our Bayesian method and the two other Bayesian approaches (WL and Lenkoski) with a uniform prior for the graph and the G-Wishart prior WG(3, Ip) for the precision matrix. We run the three algorithms from the same starting points, with 60,000 iterations and 30,000 as burn-in. Computation for this example was performed in parallel on 235 batch nodes with 12 cores and 24 GB of memory, running Linux.

To assess the performance of the graph structure estimation, we compute the calibration error (CE) defined in (2.14) and the F1-score (4, 36), defined as

2TP , 2TP + FP + FN

(2.15)

where TP, FP, and FN are the number of true positives, false positives, and false negatives, respectively. The F1 -score lies between 0 and 1, where 1 stands for perfect identification and 0 for bad identification. Table 2.2 reports comparisons of our method with two other Bayesian approaches, where we repeat the experiments 50 times and report the average F1 -score and CE with their standard errors in parentheses. Our method performs well overall as its F1 -score and


its CE are the best in most cases, mainly because of its fast convergence rate. Both our method and the Lenkoski approach perform better than the WL approach. The main reason is that the WL approach uses a double Metropolis-Hastings step (based on a block Gibbs sampler), which is an approximation of the exchange algorithm, whereas both our method and the Lenkoski approach use the exchange algorithm based on exact sampling from the precision matrix. As expected, the Lenkoski approach converges more slowly than our method. The reason appears to be the dependence of the Lenkoski approach on the choice of the tuning parameter σg² (26, step 3 of the algorithm on p. 124). In our simulation, we found that the convergence rate (as well as the acceptance probability) of the Lenkoski algorithm depends on the choice of σg²; here we use the default σg² = 0.1. From a theoretical point of view, both the BDMCMC and Lenkoski algorithms converge to the true posterior distribution if run sufficiently long; the results in this table therefore indicate how quickly the algorithms converge.

Table 2.3 reports the average running time and acceptance probability (α), with standard errors in parentheses, across all 7 graphs and their 50 replications. It shows that our method is slower than the Lenkoski approach. The reason is that our method scans all possible edges to calculate the birth/death rates, which is computationally expensive, whereas in the Lenkoski algorithm a new graph is proposed by randomly choosing one edge, which is computationally fast but not efficient. The table also shows that the acceptance probability (α) is small for both WL and Lenkoski, especially for the WL approach. Note that α here is the probability that the algorithm moves to a new graphical model; it is not related to the double Metropolis-Hastings algorithm.
The α in the WL approach is extremely small and it should be the cause of the approximation which has been used for the ratio of prior normalizing constants. As Murray et al. (34) pointed out these kinds of algorithms can suffer high rejection rates. For the Lenkoski approach the α is relatively small, but compared with the WL method is much better. As in the Lenkoski approach, a new graph is proposed by randomly choosing one edge, yielding a relatively small acceptance probability. Comparison with frequentist methods We also compare the performance of our Bayesian method with two popular frequentist methods, the graphical lasso (glasso) (15) and Meinshausen-Buhlmann graph estimation (mb) (31). We consider the same 7 graphical models with the same scenarios in the previous example. For each generated sample, we fit our Bayesian method with a uniform prior for the

2.5 Statistical performance

23 F1 -score

BDMCMC Lenkoski

CE WL

BDMCMC Lenkoski

WL

0.93 (0.01) 0.17 (0.02) 0.70 (0.01) 0.59 (0.02) 0.50 (0.01) 0.49 (0.01) 0.45 (0.02)

0.24 (0.01) 0.16 (0.01) 0.34 (0.02) 0.36 (0.01) 0.34 (0.01) 0.33 (0.01) 0.31 (0.02)

2.5 (1.6) 11.3 (2.1) 4.4 (2.2) 11.5 (3.5) 11.4 (8.0) 10.3 (9.4) 11.8 (8.8)

4.9 (2.1) 14 (1.4) 9.7 (2.4) 12.8 (3.1) 15.3 (6.3) 14.3 (7.3) 15.6 (6.9)

15.8 (5) 13.6 (3.4) 12.5 (8.8) 16.3 (6.2) 14.1 (14.0) 13.5 (9.8) 13.3 (6.5)

0.98 (0.00) 0.18 (0.02) 0.95 (0.00) 0.90 (0.01) 0.65 (0.02) 0.67 (0.02) 0.56 (0.02)

0.26 (0.01) 0.25 (0.02) 0.34 (0.01) 0.47 (0.01) 0.35 (0.01) 0.37 (0.02) 0.33 (0.02)

1.0 (0.4) 9.3 (1.6) 1.5 (0.4) 4.1 (3.7) 7.0 (5.6) 6.4 (7.2) 7.9 (8.0)

2.2 (0.5) 11.4 (1.3) 5.2 (0.5) 5.6 (2.7) 10.7 (6.3) 9.9 (7.8) 11.6 (7.0)

15.6 (6.7) 11.4 (3.4) 13.0 (5.7) 14.0 (7.5) 13.8 (10.5) 12.4 (9.4) 13.0 (7.8)

0.55 (0.10) 0.09 (0.04) 0.33 (0.09) 0.49 (0.17) 0.21 (0.07) 0.18 (0.07) 0.19 (0.07)

0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00)

2.5(0.9) 68.8 (4.4) 19.0 (4.5) 28.6 (5.7) 73.2 (18.5) 72.8 (18.2) 72.4 (22.5)

75.3 (7.2) 166.7 (4.5) 159.1 (5.4) 117.5 (5.4) 250.6 (36.7) 246.0 (44.4) 243.1 (47.8)

50 (0.0) 49 (0.0) 49 (0.0) 97 (0.0) 49.2 (5.6) 47.8 (8.4) 49 (0.0)

0.72 (0.09) 0.35 (0.05) 0.54 (0.07) 0.78 (0.11) 0.34 (0.10) 0.32 (0.13) 0.33 (0.08)

0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00)

1.7 (0.6) 31.7 (4.5) 7.2 (1.9) 4.8 (1.8) 34.3 (11.2) 32.2 (10.6) 35.3 (13.6)

55.8 (5.4) 92.3 (3.6) 84.9 (4.0) 61.7 (4.4) 149.3 (28.8) 142.2 (27.3) 151.7 (26.9)

50 (0.0) 49 (0.0) 49 (0.0) 97 (0.0) 50.7 (7.0) 48.5 (5.9) 49 (0.0)

p=10 & n=30 circle star AR1 AR2 random cluster scale-free

0.95 (0.00) 0.15 (0.02) 0.90 (0.01) 0.56 (0.01) 0.57 (0.03) 0.61 (0.02) 0.53 (0.03)

p=10 & n=100 circle star AR1 AR2 random cluster scale-free

0.99 (0.00) 0.21 (0.02) 0.98 (0.00) 0.89 (0.01) 0.76 (0.01) 0.74 (0.02) 0.69 (0.02)

p=50 & n=100 circle star AR1 AR2 random cluster scale-free

0.99 (0.01) 0.17 (0.04) 0.86 (0.04) 0.86 (0.04) 0.51 (0.09) 0.55 (0.11) 0.49 (0.11)

p=50 & n=500 circle star AR1 AR2 random cluster scale-free

1.00 (0.01) 0.65 (0.05) 0.94 (0.02) 0.98 (0.01) 0.73 (0.09) 0.74 (0.09) 0.73 (0.10)

Table 2.2 Summary of performance measures in simulation example 2.5.1 for BDMCMC approach, Lenkoski (26), and WL (49). The table presents the F1 -score, which is defined in (4.13) and CE, which is defined in (2.14), for different models with 50 replications and standard deviations in parentheses. The F1 -score reaches its best score at 1 and its worst at 0. The CE is positive valued for which 0 is minimum and smaller is better. The best models for both F1 -score and CE are boldfaced.

                 BDMCMC         Lenkoski          WL
p = 10   α       1              0.114 (0.001)     8.8e-06 (4.6e-11)
         Time    97 (628)       40 (225)          380 (11361)
p = 50   α       1              0.089 (0.045)     0.0000 (0.0000)
         Time    5408 (1694)    1193 (1000)       9650 (1925)

Table 2.3 Comparison of our BDMCMC algorithm with the WL approach (49) and the Lenkoski approach (26). It presents the average computing time in minutes and the average probability of acceptance (α), with standard deviations in parentheses.

graph and the G-Wishart prior WG(3, Ip) for the precision matrix. To fit the glasso and mb methods, however, we must specify a regularization parameter λ that controls the sparsity of the graph. The choice of λ is critical, since different λ's may lead to different graphs. We consider the glasso method with three different regularization parameter selection approaches: the stability approach to regularization selection (stars) (30), the rotation information criterion (ric) (53), and the extended Bayesian information criterion (ebic) (14). Similarly, we consider the mb method with two regularization parameter selection approaches, stars and ric. We repeat all the experiments 50 times.

Table 2.4 compares all approaches, reporting the average F1-score with standard errors in parentheses. Our Bayesian approach performs well, as its F1-score typically outperforms all frequentist methods, except in the scenario with a high number of observations, where it roughly equals the performance of the mb method with the stars criterion. All the other approaches perform well in some cases and fail in others. For instance, when p = 50, the mb method with ric is the best for the AR(1) graph and the worst for the circle graph.

To assess the performance of the precision matrix estimation, we use the Kullback-Leibler divergence (23), given by

KL = (1/2) [ tr(K−1true K̂) − p − log( |K̂| / |Ktrue| ) ],    (2.16)

where Ktrue is the true precision matrix and K̂ is the estimate of the precision matrix. Table 2.5 compares all methods, reporting the average KL with standard errors in parentheses. Based on KL, the overall performance of our Bayesian approach is good, as its KL is the best in all scenarios except one.
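Equation (2.16) transcribes directly to numpy; the sketch below assumes both precision matrices are positive definite.

```python
import numpy as np

def kl_divergence(K_true, K_hat):
    """Kullback-Leibler divergence (2.16) between the zero-mean Gaussians
    with precision matrices K_true and K_hat."""
    p = K_true.shape[0]
    M = np.linalg.solve(K_true, K_hat)       # K_true^{-1} K_hat
    sign, logdet = np.linalg.slogdet(M)      # log(|K_hat| / |K_true|)
    return 0.5 * (np.trace(M) - p - logdet)
```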

                                glasso                                       mb
              BDMCMC        stars         ric           ebic          stars         ric
p=10 & n=30
circle        0.95 (0.00)   0.00 (0.00)   0.01 (0.01)   0.48 (0.00)   0.42 (0.01)   0.01 (0.01)
star          0.15 (0.02)   0.01 (0.00)   0.15 (0.02)   0.00 (0.00)   0.01 (0.02)   0.14 (0.02)
AR1           0.90 (0.01)   0.20 (0.13)   0.61 (0.01)   0.17 (0.07)   0.46 (0.01)   0.83 (0.01)
AR2           0.56 (0.01)   0.09 (0.02)   0.19 (0.02)   0.00 (0.00)   0.07 (0.02)   0.19 (0.02)
random        0.57 (0.03)   0.36 (0.06)   0.48 (0.02)   0.08 (0.04)   0.45 (0.03)   0.53 (0.03)
cluster       0.61 (0.02)   0.45 (0.05)   0.54 (0.02)   0.07 (0.04)   0.50 (0.02)   0.54 (0.02)
scale-free    0.53 (0.03)   0.30 (0.05)   0.4 (0.02)    0.06 (0.02)   0.36 (0.03)   0.46 (0.03)
p=10 & n=100
circle        0.99 (0.00)   0.00 (0.00)   0.50 (0.08)   0.45 (0.00)   0.89 (0.08)   0.81 (0.09)
star          0.21 (0.02)   0.08 (0.02)   0.29 (0.03)   0.01 (0.00)   0.07 (0.03)   0.29 (0.03)
AR1           0.98 (0.00)   0.90 (0.01)   0.57 (0.00)   0.56 (0.00)   0.94 (0.00)   0.85 (0.00)
AR2           0.89 (0.01)   0.34 (0.06)   0.63 (0.00)   0.08 (0.05)   0.41 (0.01)   0.64 (0.01)
random        0.76 (0.01)   0.61 (0.02)   0.57 (0.01)   0.45 (0.07)   0.68 (0.02)   0.61 (0.02)
cluster       0.74 (0.02)   0.66 (0.03)   0.59 (0.02)   0.53 (0.07)   0.68 (0.03)   0.61 (0.03)
scale-free    0.69 (0.02)   0.56 (0.02)   0.48 (0.008)  0.34 (0.07)   0.63 (0.02)   0.52 (0.02)
p=50 & n=100
circle        0.99 (0.01)   0.28 (0.05)   0.00 (0.00)   0.28 (0.01)   0.00 (0.00)   0.00 (0.00)
star          0.17 (0.04)   0.14 (0.06)   0.06 (0.05)   0.00 (0.00)   0.15 (0.04)   0.05 (0.041)
AR1           0.86 (0.04)   0.56 (0.04)   0.59 (0.03)   0.49 (0.05)   0.82 (0.02)   0.98 (0.02)
AR2           0.86 (0.04)   0.59 (0.02)   0.02 (0.02)   0.00 (0.00)   0.66 (0.02)   0.02 (0.02)
random        0.51 (0.09)   0.52 (0.10)   0.40 (0.16)   0.04 (0.13)   0.61 (0.21)   0.49 (0.21)
cluster       0.55 (0.11)   0.54 (0.06)   0.42 (0.18)   0.13 (0.24)   0.64 (0.22)   0.50 (0.22)
scale-free    0.49 (0.11)   0.48 (0.10)   0.32 (0.18)   0.02 (0.09)   0.60 (0.23)   0.40 (0.23)
p=50 & n=500
circle        1.00 (0.01)   0.27 (0.05)   0.00 (0.00)   0.25 (0.01)   0.00 (0.00)   0.00 (0.00)
star          0.65 (0.05)   0.29 (0.12)   0.60 (0.07)   0.01 (0.02)   0.31 (0.07)   0.60 (0.07)
AR1           0.94 (0.02)   0.57 (0.02)   0.54 (0.02)   0.44 (0.02)   0.97 (0.02)   0.98 (0.01)
AR2           0.98 (0.01)   0.69 (0.03)   0.64 (0.01)   0.66 (0.04)   0.89 (0.02)   0.69 (0.02)
random        0.73 (0.09)   0.62 (0.12)   0.46 (0.15)   0.56 (0.13)   0.82 (0.24)   0.61 (0.24)
cluster       0.74 (0.09)   0.65 (0.10)   0.51 (0.17)   0.58 (0.10)   0.82 (0.23)   0.64 (0.23)
scale-free    0.73 (0.10)   0.57 (0.14)   0.41 (0.15)   0.47 (0.15)   0.82 (0.24)   0.62 (0.24)

Table 2.4 Summary of performance measures in simulation example 2.5.1 for the BDMCMC approach, glasso (15) with 3 criteria, and the mb method (31) with 2 criteria. The table reports the F1-score, defined in (2.15), for different models with 50 replications and standard deviations in parentheses. The F1-score reaches its best score at 1 and its worst at 0.

                                         glasso
              BDMCMC          stars             ric              ebic
p=10 & n=30
circle        0.73 (0.12)     15.84 (0.03)      –                10.34 (1.33)
star          0.57 (0.08)     0.31 (0.00)       0.22 (0.00)      0.33 (0.01)
AR1           0.70 (0.10)     3.63 (0.07)       1.59 (0.06)      2.77 (2.34)
AR2           1.22 (0.07)     1.27 (0.00)       1.26 (0.00)      1.28 (0.00)
random        0.67 (0.08)     8.32 (305)        –                12.44 (1637)
cluster       0.61 (0.06)     4.90 (2.37)       3.74 (3.23)      5.72 (7.35)
scale-free    0.65 (0.07)     5.83 (12.35)      –                6.59 (26.62)
p=10 & n=100
circle        0.14 (0.00)     15.95 (0.01)      –                9.60 (0.56)
star          0.13 (0.00)     0.15 (0.00)       0.10 (0.00)      0.17 (0.00)
AR1           0.12 (0.00)     2.88 (0.16)       0.81 (0.01)      0.37 (0.00)
AR2           0.28 (0.01)     1.24 (0.01)       1.14 (0.00)      1.25 (0.02)
random        0.16 (0.00)     4.47 (1.09)       3.30 (0.76)      3.92 (2.55)
cluster       0.13 (0.00)     4.46 (12.62)      3.62 (8.17)      4.47 (30.31)
scale-free    0.16 (0.00)     4.14 (1.27)       3.01 (0.70)      3.68 (1.94)
p=50 & n=100
circle        0.67 (0.13)     117.12 (32.12)    –                115.88 (5.88)
star          1.75 (0.21)     1.05 (0.08)       1.27 (0.07)      1.49 (0.16)
AR1           1.17 (0.23)     8 (1.10)          8.92 (0.50)      6.20 (0.93)
AR2           1.97 (0.33)     6.56 (0.19)       7.29 (0.06)      7.27 (0.09)
random        2.01 (0.42)     20.83 (6.44)      –                30.07 (12.54)
cluster       1.94 (0.42)     19.82 (3.75)      –                26.47 (5.08)
scale-free    1.96 (0.45)     20.91 (6.71)      –                28.98 (8.40)
p=50 & n=500
circle        0.11 (0.01)     111.75 (30.56)    111.32 (3.26)    –
star          0.34 (0.04)     0.78 (0.04)       0.96 (0.06)      0.63 (0.03)
AR1           0.15 (0.02)     4.87 (0.45)       1.75 (0.08)      3.51 (0.23)
AR2           0.19 (0.03)     5.50 (0.2)        4.01 (0.10)      6.42 (0.13)
random        0.26 (0.06)     18.89 (5.58)      17.80 (8.79)     17.14 (6.59)
cluster       0.24 (0.05)     21.09 (21.51)     21.34 (28.62)    –
scale-free    0.25 (0.07)     19.86 (7.69)      19.61 (16.65)    –

Table 2.5 Summary of performance measures in simulation example 2.5.1 for the BDMCMC approach and glasso (15) with 3 criteria. The table reports the KL measure, defined in (2.16), for different models with 50 replications and standard deviations in parentheses. The KL is positive-valued, for which 0 is the minimum and smaller is better.

2.5.2 Application to human gene expression data

We apply our proposed method to analyze a large-scale human gene expression data set, originally described by Bhadra and Mallick (5), Chen et al. (8), and Stranger et al. (46). The data were collected by Stranger et al. (46) using Illumina's Sentrix Human-6 Expression BeadChips to measure gene expression in B-lymphocyte cells from Utah (CEU) individuals of Northern and Western European ancestry. They consider 60 unrelated individuals whose genotypes are available from the Sanger Institute website (ftp://ftp.sanger.ac.uk/pub/genevar). The genotype is coded as 0, 1, and 2 for rare homozygous, heterozygous, and common homozygous alleles. The focus here is on the 3125 Single Nucleotide Polymorphisms (SNPs) found in the 5' UTR (untranslated region) of mRNA (messenger RNA) with a minor allele frequency ≥ 0.1. There were four replicates for each individual. Since the UTR has been subject to investigation previously, it is thought to play an important role in the regulation of gene expression. The raw data were background-corrected, quantile-normalized across the replicates of each individual, and then median-normalized across all individuals. We chose the 100 most variable probes among the 47,293 total available probes corresponding to different Illumina TargetIDs. Each selected probe corresponds to a different transcript. Thus, we have n = 60 and p = 100. The data are available in the R package BDgraph.

Bhadra and Mallick (5) analyzed the data by adjusting for the effect of SNPs using an expression quantitative trait loci (eQTL) mapping study. They found 54 significant interactions among the 100 traits considered. Previous studies have shown that these data are an interesting case study for prediction.

We place a uniform distribution as an uninformative prior on the graph and the G-Wishart WG(3, I100) on the precision matrix. We run our BDMCMC algorithm for 60,000 iterations with a 30,000-sweep burn-in.
The graph with the highest posterior probability has 281 edges and includes almost all the significant interactions discovered by Bhadra and Mallick (5). Figure 2.4 shows the selected graph with the 86 edges whose posterior inclusion probabilities in (2.13) are greater than 0.6. Edges in the graph show the interactions among the genes. Figure 2.5 shows an image of all the posterior inclusion probabilities for visualization.

2.5.3 Extension to time course data

Here, to demonstrate how well our proposed methodology can be extended to other types of graphical models, we focus on graphical models involving time series data (10, 1). We

Bayesian Structure Learning in Graphical Models

28

GI_40354211−S Hs.185140−S ● ●

hmm9615−S ● GI_21389558−S ●

GI_37542230−S GI_24308084−S ● hmm10289−S GI_38569448−S ● ● ● Hs.171273−S GI_42656398−S GI_4507820−S ● ● ● GI_27764881−S GI_22027487−S ● GI_4504700−S ● ● GI_19743804−A GI_16554578−S ● GI_37546026−S ● ● GI_37537697−S ● GI_42662536−S GI_42660576−S GI_34222299−S ● ●GI_5454143−S ● GI_41350202−S ● GI_14602456−I GI_21464138−A ● ● ● GI_31652245−I ● GI_27477086−S GI_24797066−S ● GI_20143949−A ● ● GI_4505888−A hmm10298−S ● ● GI_37537705−I ● GI_18641371−S GI_37546969−S GI_18379361−A ● GI_11993934−S ● ● ● GI_41190543−S GI_27894333−A ● ● GI_23510363−A GI_11095446−S ● GI_41197088−S ● ● hmm3587−S GI_30409981−S ● GI_27894329−S ● Hs.449609−S ● GI_23065546−A GI_18641372−S ● GI_18426974−S GI_28416938−S ● ● GI_20070269−S ● ● GI_27482629−S ● GI_31377723−S ● Hs.449605−S GI_4504410−S GI_13325059−S GI_4507426−S ● ● ● ● ● GI_7662241−S GI_24497529−I ● ● GI_4502630−S Hs.512137−S ● GI_28557780−S Hs.512124−S ● ● ● GI_41190507−S GI_28610153−S GI_31077201−S GI_30795192−A ● ● GI_8923472−S ● GI_19224662−S ● ● ● Hs.449572−S Hs.449602−S GI_31341400−S ● ● GI_37546229−S GI_17981706−S hmm3577−S ● ● ● GI_4505948−S ● GI_14211892−S ● GI_42476191−S ● Hs.136376−S ●GI_7657043−S ● GI_16159362−S GI_4504436−S ● ● ● Hs.449584−S GI_33356559−S GI_7019408−S ● ● ● GI_20302136−S ● GI_7661757−S GI_13027804−S ● ● GI_34915989−S GI_33356162−S ● ● hmm3574−S ● GI_9961355−S ● GI_27754767−A GI_20373176−S ● ● GI_21614524−S GI_27754767−I ● ● Hs.406489−S ●

Fig. 2.4 The inferred graph for the human gene expression data set. It reports the selected graph with 86 significant edges for which their posterior inclusion probabilities (4.12) are more then 0.6. Posterior Edge Inclusion Probabilities

1.0

20 0.8

40

0.6

0.4

60

0.2 80

0.0

20

40

60

80

Fig. 2.5 Image visualization of the posterior pairwise edge inclusion probabilities for all possible edges in the graph.

show how graphs can be useful in modeling real-world problems such as gene expression time course data.


Suppose we have a T time point longitudinal microarray study across p genes. We assume a stable dynamic graph structure for the time course data as follows: for t = 1, ..., T,

$$x_t \sim \mathcal{N}_p\big(f(t), K^{-1}\big), \qquad (2.17)$$

in which $f(t) = (f_1(t), \ldots, f_p(t))'$ with $f_i(t) = \beta_i' h(t) = \sum_{r=1}^{m} \beta_{ir} h_r(t)$, $\beta_i = (\beta_{i1}, \ldots, \beta_{im})'$, $h(t) = (h_1(t), \ldots, h_m(t))'$, and m is the number of basis elements. Here h(t) is a cubic spline basis, which is continuous with continuous first and second derivatives (21, chapter 5). The aim of this model is to find a parsimonious description of both the time dynamics and the gene interactions. For this model, we place a uniform distribution as an uninformative prior on the graph and a G-Wishart $W_G(3, I_p)$ on the precision matrix. As a prior distribution for $\beta_i$ we choose $\mathcal{N}_m(\mu_{0i}, B_{0i})$, for i = 1, ..., p. Thus, based on our likelihood and priors, the conditional distribution of $\beta_i$ is

$$\beta_i \mid x, K, G \sim \mathcal{N}_m(\mu_i, B_i), \qquad (2.18)$$

in which

$$B_i = \Big( B_{0i}^{-1} + K_{ii} \sum_{t=1}^{T} h(t) h(t)' \Big)^{-1},$$

$$\mu_i = B_i \Big( B_{0i}^{-1} \mu_{0i} + \sum_{t=1}^{T} h(t) \big( x_t' K_{V,i} - K_{i, V \setminus i} \, f_{-i}(t) \big) \Big).$$
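The conditional update in (2.18) can be sketched directly in NumPy (a sketch under our own names and shapes, assuming a precomputed basis matrix H whose rows are h(t)'):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_update_beta(i, X, H, K, Beta, mu0, B0):
    """One Gibbs draw of the spline coefficients beta_i, following (2.18).

    X    : T x p data matrix with rows x_t
    H    : T x m basis matrix with rows h(t)'
    K    : p x p precision matrix
    Beta : p x m current coefficients, so f_j(t) = Beta[j] @ h(t)
    mu0  : prior mean (length m);  B0 : prior covariance (m x m)
    """
    F = H @ Beta.T                                   # T x p matrix of f_j(t)
    B0inv = np.linalg.inv(B0)
    Bi = np.linalg.inv(B0inv + K[i, i] * (H.T @ H))  # posterior covariance
    # residual term  x_t' K_{V,i} - K_{i,V\i} f_{-i}(t)  for every t
    r = X @ K[:, i] - (F @ K[i, :] - K[i, i] * F[:, i])
    mui = Bi @ (B0inv @ mu0 + H.T @ r)               # posterior mean
    return rng.multivariate_normal(mui, Bi)

# toy dimensions: T = 6 time points, p = 3 genes, m = 4 basis functions
T, p, m = 6, 3, 4
H = rng.standard_normal((T, m))
X = rng.standard_normal((T, p))
beta0 = gibbs_update_beta(0, X, H, np.eye(p), np.zeros((p, m)),
                          np.zeros(m), np.eye(m))
print(beta0.shape)  # (4,)
```

In practice the basis matrix H would come from an actual cubic spline basis rather than random numbers; the draw itself is the same.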

Thus, to account for the time effect in our model, we require one more Gibbs sampling step in the BDMCMC algorithm to update βi, for i = 1, ..., p. To evaluate the efficiency of the method, we focus on the mammary gland gene expression time course data from Stein et al. (44). The data come from a large time course Affymetrix microarray experiment across different developmental stages, performed using mammary tissue from female mice. There are 12,488 probe sets representing ∼8600 genes. In total, the probe sets are measured across 54 arrays, with three mice used at each of 18 time points. The time points fall into the four main stages: virgin (6, 10, and 12 weeks); pregnancy (1, 2, 3, 8.5, 12.5, 14.5, and 17.5); lactation (1, 3, and 7); and involution (1, 2, 3, 4, and 20). Using cluster analysis, we identify 30 genes which provide the best partition among the developmental stages. These genes play a role in the transitions across the main developmental events. The mammary data are available in the R package smida; for more details about the data see Wit and McClure (51, chapter one). Abegaz and Wit (1) analyzed these data based on a sparse time series chain graphical model. Using our proposed methodology, we infer the interactions between these crucial genes. Placing a uniform prior on the graph and the G-Wishart WG(3, I30) on the precision matrix, we run our BDMCMC algorithm for 60,000 iterations using 30,000 as burn-in. Figure 2.6 shows the selected graph based on the output of our BDMCMC algorithm. The graph shows the edges with a posterior inclusion probability greater than 0.6. As we can see in Figure 2.6, the genes with the highest number of edges are LCN2, HSD17B, CRP1, and RABEP1, each of which has 7 edges. Schmidt-Ott et al. (41) suggested that the gene LCN2 (lipocalin 2) plays an important role in the innate immune response to bacterial infection and also functions as a growth factor. For the gene HSD17B (17-β hydroxysteroid dehydrogenase), past studies suggest that this gene family provides each cell with the necessary mechanisms to control the level of intracellular androgens and/or estrogens (24). The gene CRP1 was identified by Abegaz and Wit (1) as a likely hub.

[Fig. 2.6 The inferred graph for the mammary gland gene expression data set: the selected graph with 56 significant edges whose posterior inclusion probabilities (4.12) are greater than 0.6.]

[Fig. 2.7 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph.]

2.6 Discussion

We introduce a Bayesian approach for graph structure learning based on Gaussian graphical models using a trans-dimensional MCMC methodology. The proposed methodology is based on the birth-death process. In Theorem 2.1 we derived the conditions under which the balance conditions of the birth-death MCMC method hold. Based on those conditions, we proposed a convenient BDMCMC algorithm whose stationary distribution is our joint posterior distribution. We showed that a scalable Bayesian method exists which, even for large graphs, is able to distinguish important edges from irrelevant ones and detect the true model with high accuracy. The resulting graphical model is reasonably robust to modeling assumptions and the priors used. As we have shown in our simulation studies (2.5.1), in Gaussian graphical models any trans-dimensional MCMC algorithm based on a discrete time Markov process (such as the reversible jump algorithms of 49 and 26) can suffer from high rejection rates, especially for high-dimensional graphs. In our BDMCMC algorithm, by contrast, moves between graphs are always accepted. In general, although our trans-dimensional MCMC algorithm incurs significant additional computing cost for the birth and death rates, it has clear benefits over reversible jump style moves when graph structure learning in a non-hierarchical setting is of primary interest.
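The always-accepted jump mechanism can be illustrated with a generic continuous-time birth-death sweep (a schematic with hypothetical uniform rates; it does not compute the actual G-Wishart birth and death rates of Section 2.4, and for inference each visited graph must be weighted by its exponential holding time):

```python
import numpy as np

rng = np.random.default_rng(1)

def bdmcmc_skeleton(init_edges, rates_fn, n_jumps=100):
    """Generic birth-death sweep: every proposed edge flip is accepted;
    each visited graph carries an exponential holding-time weight."""
    g = set(init_edges)
    samples, weights = [], []
    for _ in range(n_jumps):
        rates = rates_fn(g)                    # {candidate edge flip: rate > 0}
        total = sum(rates.values())
        samples.append(frozenset(g))
        weights.append(rng.exponential(1.0 / total))   # holding time in state g
        edges = list(rates)
        probs = np.array([rates[e] for e in edges]) / total
        e = edges[rng.choice(len(edges), p=probs)]
        g ^= {e}                               # birth or death of edge e
    return samples, weights

# toy run on a 3-node graph with uniform (made-up) rates for every flip
all_edges = [(0, 1), (0, 2), (1, 2)]
samples, weights = bdmcmc_skeleton([], lambda g: {e: 1.0 for e in all_edges},
                                   n_jumps=200)
print(len(samples), min(weights) > 0)  # 200 True
```

With the true rates of Section 2.4, graphs of high posterior probability accumulate long holding times, which is exactly how the chain estimates posterior graph probabilities without rejections.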


In Gaussian graphical models, Bayesian structure learning faces several computational and methodological issues as the dimension grows: (1) convergence, (2) computation of the prior normalizing constant, and (3) sampling from the posterior distribution of the precision matrix. Our Bayesian approach efficiently addresses these problems. Regarding convergence, Cappé et al. (6) demonstrate the strong similarity of reversible jump and continuous time methodologies by showing that, on appropriate rescaling of time, the reversible jump chain converges to a limiting continuous time birth-death process. In Section 2.5.1 we show the fast convergence of our BDMCMC algorithm. For the second problem, in Subsection 2.4.2, using the ideas of Wang and Li (49) and Lenkoski (26), we show that the exchange algorithm circumvents the intractable normalizing constant. For the third problem, we used the exact sampler algorithm proposed by Lenkoski (26). Our proposed method provides a flexible framework to handle the graph structure and can be extended to different types of priors for the graph and precision matrix. In Subsection 2.5.3 we illustrate how our proposed model can be integrated into other types of graphical models, such as multivariate time series graphical models. Although we have focused on normally distributed data, our proposed method can in general be extended to other types of graphical models, such as log-linear models (see e.g. 13 and 27), non-Gaussian data via copula transformations (see e.g. 12), or copula regression models (see e.g. 35). This will be considered in future work.

Appendix 1: Proof of Theorem 2.1

Before we derive the detailed balance conditions for our BDMCMC algorithm, we introduce some notation. Assume the process is at state (G, K), in which G = (V, E) with precision matrix K ∈ P_G. The process behavior is defined by the birth rates β_e(K), the death rates δ_e(K), and the birth and death transition kernels $T^{G}_{\beta_e}(K; \cdot)$ and $T^{G}_{\delta_e}(K; \cdot)$. For each e ∈ Ē, $T^{G}_{\beta_e}(K; \cdot)$ denotes the probability that the process jumps from state (G, K) to a point in the new state $\cup_{K^* \in P_{G^{+e}}} (G^{+e}, K^*)$. Hence, if $F \subset P_{G^{+e}}$ we have

$$T^{G}_{\beta_e}(K; F) = \frac{\beta_e(K)}{\beta(K)} \int_{k_e : K \cup k_e \in F} b_e(k_e; K) \, dk_e. \qquad (2.19)$$

Likewise, for each e ∈ E, $T^{G}_{\delta_e}(K; \cdot)$ denotes the probability that the process jumps from state (G, K) to a point in the new state $\cup_{K^* \in P_{G^{-e}}} (G^{-e}, K^*)$. Therefore, if $F \subset P_{G^{-e}}$ we have

$$T^{G}_{\delta_e}(K; F) = \sum_{\eta \in E : K \setminus k_\eta \in F} \frac{\delta_\eta(K)}{\delta(K)} = \frac{\delta_e(K)}{\delta(K)} \, I(K^{-e} \in F).$$

Detailed balance conditions. In our birth-death process, P(K, G | x) satisfies the detailed balance conditions if

$$\int_F \delta(K) \, dP(K, G \mid x) = \sum_{e \in E} \int_{P_{G^{-e}}} \beta(K^{-e}) \, T^{G^{-e}}_{\beta_e}(K^{-e}; F) \, dP(K^{-e}, G^{-e} \mid x), \qquad (2.20)$$

and

$$\int_F \beta(K) \, dP(K, G \mid x) = \sum_{e \in \bar{E}} \int_{P_{G^{+e}}} \delta(K^{+e}) \, T^{G^{+e}}_{\delta_e}(K^{+e}; F) \, dP(K^{+e}, G^{+e} \mid x), \qquad (2.21)$$

where $F \subset P_G$. The first expression says that edges entering the set F due to deaths must be matched by edges leaving that set due to births, and vice versa for the second part. To prove the first part (2.20), we have

$$\begin{aligned}
\mathrm{LHS} &= \int_F \delta(K) \, dP(G, K \mid x) \\
&= \int_{P_G} I(K \in F) \, \delta(K) \, dP(G, K \mid x) \\
&= \int_{P_G} I(K \in F) \sum_{e \in E} \delta_e(K) \, dP(G, K \mid x) \\
&= \sum_{e \in E} \int_{P_G} I(K \in F) \, \delta_e(K) \, dP(G, K \mid x) \\
&= \sum_{e \in E} \int I(K \in F) \, \delta_e(K) \, P(G, K \mid x) \prod_{i=1}^{p} dk_{ii} \prod_{(i,j) \in E} dk_{ij}.
\end{aligned}$$


For the RHS, by using (2.19) we have

$$\begin{aligned}
\mathrm{RHS} &= \sum_{e \in E} \int_{P_{G^{-e}}} \beta(K^{-e}) \, T^{G^{-e}}_{\beta_e}(K^{-e}; F) \, dP(K^{-e}, G^{-e} \mid x) \\
&= \sum_{e \in E} \int_{P_{G^{-e}}} \beta_e(K) \int_{k_e : K^{-e} \cup k_e \in F} b_e(k_e; K) \, dk_e \, dP(K^{-e}, G^{-e} \mid x) \\
&= \sum_{e \in E} \int_{P_{G^{-e}}} \int_{k_e} I(K \in F) \, \beta_e(K) \, b_e(k_e; K) \, dk_e \, dP(K^{-e}, G^{-e} \mid x) \\
&= \sum_{e \in E} \int I(K \in F) \, \beta_e(K) \, b_e(k_e; K) \, P(K^{-e}, G^{-e} \mid x) \prod_{i=1}^{p} dk_{ii} \prod_{(i,j) \in E} dk_{ij}.
\end{aligned}$$

By setting

$$\delta_e(K) \, P(G, K \mid x) = \beta_e(K) \, b_e(k_e; K) \, P(K^{-e}, G^{-e} \mid x),$$

we have LHS = RHS. Now, in the above equation,

$$P(G, K \mid x) = P(G, K \setminus (k_{ij}, k_{jj}) \mid x) \, P\big((k_{ij}, k_{jj}) \mid K \setminus (k_{ij}, k_{jj}), G, x\big),$$

and

$$P(G^{-e}, K^{-e} \mid x) = P(G^{-e}, K^{-e} \setminus k_{jj} \mid x) \, P\big(k_{jj} \mid K^{-e} \setminus k_{jj}, G^{-e}, x\big).$$

We simply choose the proposal density for the new element $k_e = k_{ij}$ as

$$b_e(k_e; K) = \frac{P\big((k_{ij}, k_{jj}) \mid K \setminus (k_{ij}, k_{jj}), G, x\big)}{P\big(k_{jj} \mid K^{-e} \setminus k_{jj}, G^{-e}, x\big)}.$$

Therefore, we obtain the expression in Theorem 2.1. The proof of the second part (2.21) is analogous.

Appendix 2: Proposition

Let A be a 2 × 2 random matrix following a Wishart distribution W(b, D) with density

$$P(A) = \frac{1}{I(b, D)} |A|^{(b-2)/2} \exp\Big\{ -\frac{1}{2} \mathrm{tr}(DA) \Big\},$$

where

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}, \qquad D = \begin{pmatrix} d_{11} & d_{12} \\ d_{12} & d_{22} \end{pmatrix}.$$

Then

(i) $a_{11} \sim W(b+1, D_{11.2})$, where $D_{11.2} = d_{11} - d_{12}^2 d_{22}^{-1}$;

(ii) $$P(a_{12}, a_{22} \mid a_{11}) = \frac{P(A)}{P(a_{11})} = \frac{1}{J(b, D, a_{11})} |A|^{(b-2)/2} \exp\Big\{ -\frac{1}{2} \mathrm{tr}(DA) \Big\},$$

where

$$J(b, D, a_{11}) = \Big( \frac{2\pi}{d_{22}} \Big)^{1/2} I(b, d_{22}) \, a_{11}^{(b-1)/2} \exp\Big\{ -\frac{1}{2} D_{11.2} \, a_{11} \Big\}.$$

Proof. For part (i), see Muirhead (33, Theorem 3.2.10). The result for part (ii) follows immediately from part (i).
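Part (i) can be checked numerically. In the parameterization above, W(b, D) is the standard bivariate Wishart with b + 1 degrees of freedom and scale matrix D⁻¹, so E[a11] = (b + 1)/D11.2. The Monte Carlo sketch below (our own construction, sampling the Wishart as a sum of Gaussian outer products) confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)

b = 3
D = np.array([[2.0, 0.5],
              [0.5, 1.0]])
D11_2 = D[0, 0] - D[0, 1] ** 2 / D[1, 1]   # Schur complement d11 - d12^2/d22

# W(b, D) above corresponds to the standard Wishart W_2(b + 1, D^{-1});
# draw it as a sum of b + 1 outer products of N_2(0, D^{-1}) vectors
reps, df = 20000, b + 1
L = np.linalg.cholesky(np.linalg.inv(D))
xs = rng.standard_normal((reps, df, 2)) @ L.T
A = np.einsum('rni,rnj->rij', xs, xs)

# part (i): a11 ~ W(b + 1, D11.2), one-dimensional with mean (b + 1)/D11.2
print(round(A[:, 0, 0].mean(), 2), round(df / D11_2, 2))
```

The two printed values agree to roughly two decimal places, as the sampling error of the mean is about 0.01 at this number of replications.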


References

[1] Abegaz, F. and Wit, E. (2013). Sparse time series chain graphical models for reconstructing genetic networks. Biostatistics, 14(3):586–599.
[2] Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47.
[3] Atay-Kayis, A. and Massam, H. (2005). A monte carlo method for computing the marginal likelihood in nondecomposable gaussian graphical models. Biometrika, 92(2):317–335.
[4] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412–424.
[5] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional bayesian variable and covariance selection with an application to eqtl analysis. Biometrics, 69(2):447–457.
[6] Cappé, O., Robert, C., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time markov chain monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):679–700.
[7] Carvalho, C. M. and Scott, J. G. (2009). Objective bayesian model selection in gaussian graphical models. Biometrika, 96(3):497–512.
[8] Chen, L., Tong, T., and Zhao, H. (2008). Considering dependence among genes and markers for false discovery control in eqtl mapping. Bioinformatics, 24(18):2015–2022.
[9] Cheng, Y., Lenkoski, A., et al. (2012). Hierarchical gaussian graphical models: Beyond reversible jump. Electronic Journal of Statistics, 6:2309–2331.
[10] Dahlhaus, R. and Eichler, M. (2003). Causality and graphical models in time series analysis. Oxford Statistical Science Series, pages 115–137.
[11] Dempster, A. (1972). Covariance selection. Biometrics, 28(1):157–175.
[12] Dobra, A., Lenkoski, A., et al. (2011a). Copula gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics, 5(2A):969–993.


[13] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011b). Bayesian inference for general gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418–1433.
[14] Foygel, R. and Drton, M. (2010). Extended bayesian information criteria for gaussian graphical models. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 604–612.
[15] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
[16] Geyer, C. J. and Møller, J. (1994). Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics, pages 359–373.
[17] Giudici, P. and Castelo, R. (2003). Improving markov chain monte carlo model search for data mining. Machine learning, 50(1-2):127–158.
[18] Giudici, P. and Green, P. (1999). Decomposable graphical gaussian model determination. Biometrika, 86(4):785–801.
[19] Green, P. (1995). Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82(4):711–732.
[20] Green, P. J. (2003). Trans-dimensional markov chain monte carlo. Oxford Statistical Science Series, pages 179–198.
[21] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer.
[22] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4):388–400.
[23] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.
[24] Labrie, F., Luu-The, V., Lin, S.-X., Claude, L., Simard, J., Breton, R., and Bélanger, A. (1997). The key role of 17β-hydroxysteroid dehydrogenases in sex steroid biology. Steroids, 62(1):148–158.


[25] Lauritzen, S. (1996). Graphical models, volume 17. Oxford University Press, USA.
[26] Lenkoski, A. (2013). A direct sampler for g-wishart variates. Stat, 2(1):119–128.
[27] Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in gaussian graphical models with the g-wishart prior. Journal of Computational and Graphical Statistics, 20(1):140–157.
[28] Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs. The Annals of Statistics, 35(3):1278–1323.
[29] Liang, F. (2010). A double metropolis–hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9):1007–1022.
[30] Liu, H., Roeder, K., and Wasserman, L. (2010). Stability approach to regularization selection (stars) for high dimensional graphical models. In Advances in Neural Information Processing Systems, pages 1432–1440.
[31] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
[32] Mohammadi, A. and Wit, E. (2015). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package version 2.17.
[33] Muirhead, R. (1982). Aspects of multivariate statistical theory, volume 42. Wiley Online Library.
[34] Murray, I., Ghahramani, Z., and MacKay, D. (2012). Mcmc for doubly-intractable distributions. arXiv preprint arXiv:1206.6848.
[35] Pitt, M., Chan, D., and Kohn, R. (2006). Efficient bayesian inference for gaussian copula regression models. Biometrika, 93(3):537–554.
[36] Powers, D. M. (2011). Evaluation: from precision, recall and f-measure to roc, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1):37–63.
[37] Preston, C. J. (1976). Special birth-and-death processes. Bulletin of the International Statistical Institute, 46:371–391.


[38] Ravikumar, P., Wainwright, M. J., Lafferty, J. D., et al. (2010). High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319.
[39] Ripley, B. D. (1977). Modelling spatial patterns. Journal of the Royal Statistical Society. Series B (Methodological), pages 172–212.
[40] Roverato, A. (2002). Hyper inverse wishart distribution for non-decomposable graphs and its application to bayesian inference for gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411.
[41] Schmidt-Ott, K. M., Mori, K., Li, J. Y., Kalandadze, A., Cohen, D. J., Devarajan, P., and Barasch, J. (2007). Dual action of neutrophil gelatinase–associated lipocalin. Journal of the American Society of Nephrology, 18(2):407–413.
[42] Scott, J. G. and Berger, J. O. (2006). An exploration of aspects of bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7):2144–2162.
[43] Scutari, M. (2013). On the prior and posterior distributions used in graphical modelling. Bayesian Analysis, 8(1):1–28.
[44] Stein, T., Morris, J. S., Davies, C. R., Weber-Hall, S. J., Duffy, M.-A., Heath, V. J., Bell, A. K., Ferrier, R. K., Sandilands, G. P., and Gusterson, B. A. (2004). Involution of the mouse mammary gland is associated with an immune cascade and an acute-phase response, involving lbp, cd14 and stat3. Breast Cancer Research, 6(2):R75–91.
[45] Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components-an alternative to reversible jump methods. Annals of Statistics, 28(1):40–74.
[46] Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., et al. (2007). Population genomics of human gene expression. Nature genetics, 39(10):1217–1224.
[47] Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886.
[48] Wang, H. (2014). Scaling it up: Stochastic search structure learning in graphical models.


[49] Wang, H. and Li, S. (2012). Efficient gaussian graphical model determination under g-wishart prior distributions. Electronic Journal of Statistics, 6:168–198.
[50] Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3):689–707.
[51] Wit, E. and McClure, J. (2004). Statistics for Microarrays: Design, Analysis and Inference. John Wiley & Sons.
[52] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.
[53] Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in r. The Journal of Machine Learning Research, 13(1):1059–1062.

Chapter 3: Bayesian Modelling of Dupuytren Disease Using Gaussian Copula Graphical Models

3.1 Abstract

Dupuytren disease is a fibroproliferative disorder with unknown etiology that often progresses and can eventually cause permanent contractures of the affected fingers. Most research on the severity and phenotype of this disease consists of observational studies without concrete statistical analyses, and a multivariate analysis of the disease that takes potential risk factors into account is lacking. In this paper, we provide a novel Bayesian framework to discover potential risk factors and to investigate which fingers are jointly affected. Gaussian copula graphical modelling is one way to discover the underlying conditional independence structure of variables in mixed data. Our Bayesian approach is based on Gaussian copula graphical models: we embed a graph selection procedure inside a semiparametric Gaussian copula. We carry out the posterior inference using an efficient sampling scheme, a trans-dimensional MCMC approach based on a birth-death process. We implemented the method in the R package BDgraph.
Key words: Dupuytren disease; Risk factors; Bayesian inference; Gaussian copula graphical models; Bayesian model selection; Latent variable models; Birth-death process; Markov chain Monte Carlo.

3.2 Introduction

Dupuytren disease is an inherited disorder that presents worldwide and is most prevalent in people of northern European ancestry (3). This incurable fibroproliferative disorder alters the palmar hand and may cause progressive and permanent flexion contracture of the fingers. In the first stage of the disease, skin pitting and subcutaneous nodules develop in the palm (Figure 3.1, left). At a later stage, cords appear that connect the nodules and may contract the fingers into a flexed position (Figure 3.1, right). A contracture can arise in isolation in a single ray or in multiple rays, affecting them in decreasing order. The disease mostly appears on the ulnar side of the hand, i.e. it affects the little and ring fingers most frequently (Figure 3.2). The only available treatment is surgical intervention. The main questions are: (1) Can we determine which variables affect the disease, and how? (2) Do we need to intervene on multiple fingers when a surgical intervention takes place? The first question is an epidemiological one, while the second is a clinical one. Empirical research has described the patterns of occurrence of Dupuytren disease in multiple fingers. (19) stated that the combination of affected ring and little fingers occurs most often, followed by the combination of an affected third, fourth, and fifth finger. (31) found that Dupuytren disease rarely affects the radial side in isolation, and that radial involvement is often associated with an affected ulnar side. (20) noticed that patients who required surgery because of a severely affected thumb were on average 8 years older and had suffered significantly longer from the disease than patients with a mildly affected radial side. Moreover, these patients suffered from ulnar disease that repeatedly required surgery, suggesting an intractable form of the disease.
More recently, (13) suggested, using a multivariate ordinal logit model, that the middle finger is substantially correlated with the other fingers on the ulnar side, and that the thumb and index finger are correlated. Taking age and sex into account, they tested hypotheses of independence between groups of fingers. However, a multivariate analysis of the disease that takes potential risk factors into account is still lacking. The risk of Dupuytren disease has been variably attributed to both phenotypic and genotypic factors (29). Essential risk factors include genetic predisposition and ethnicity, as well as sex and age. Family and twin studies suggest that Dupuytren disease has a genetic component. However, it is still unclear whether Dupuytren disease is a complex oligogenic or a simple monogenic Mendelian disorder. Several life-style risk factors (some considered controversial) include

[Fig. 3.1 (a) With a flexed position; (b) without a flexed position. Left: the hand of a Dupuytren disease patient whose finger has been affected by the disease, showing palmar nodules and small cords without signs of contracture. Right: the hand of a Dupuytren disease patient whose fingers have not yet been affected.]

smoking, excessive alcohol consumption, manual work, and hand trauma; several diseases, such as diabetes mellitus and epilepsy, are also thought to play a role in the cause of Dupuytren disease. However, the role of these risk factors and diseases has not been fully elucidated, and the results of different studies are occasionally conflicting (12). In this paper we analyse data collected in the north of the Netherlands from patients who have Dupuytren disease. Both hands of the patients were examined for signs of Dupuytren disease: tethering of the skin, nodules, cords, and finger contractures in patients with cords. Severity of the disease is measured by the total angles of each of the 10 fingers. For the potential risk factors, we inquired about smoking habits, alcohol consumption, whether participants had performed manual labor during a significant part of their life, and whether they had sustained hand injury in the past, including surgery. In addition, we inquired about the presence of Ledderhose disease, diabetes, epilepsy, Peyronie disease, knuckle pads, and liver disease, and about familial occurrence of Dupuytren disease, defined as a first-degree relative with Dupuytren disease. Our dataset contains 279 patients, 79 of whom have the disease in at least one of their fingers. There are therefore many zeros in the data, as shown in Figure 3.2. Besides the disease measurements, there are 13 potential risk factors. This mixed dataset contains binary (disease factors), discrete (alcohol and hand injury), and continuous variables (total angles for the fingers). The primary aim of this paper is to model the relationships between the risk factors


[Fig. 3.2 (a) Boxplot of the total angles of the fingers for all 10 hand fingers. (b) Occurrence of rays affected with Dupuytren disease for all 10 hand fingers.]

and disease indicators for Dupuytren disease based on this mixed dataset. We propose an efficient Bayesian statistical methodology based on Gaussian copula graphical models that can be applied to binary, ordinal, and continuous variables simultaneously. We embed the graphical model inside a semiparametric framework, using the extended rank likelihood (10). We carry out posterior inference for the graph structure and the precision matrix using an efficient sampling scheme, a trans-dimensional MCMC approach based on a continuous-time birth-death process (22). Graphical models provide an effective way to describe statistical patterns in data. In this context undirected Gaussian graphical models are commonly used, since inference in such models is often tractable. In undirected Gaussian graphical models, the graph structure is characterized by the precision matrix (the inverse of the covariance matrix): the non-zero entries of the precision matrix correspond to the edges of the graph. In the real world, data are often non-Gaussian. Non-Gaussian continuous variables can be transformed to Gaussian latent variables. For discrete data, however, the situation is more convoluted: there is no one-to-one transformation into latent Gaussian variables. A common approach is to apply a Markov chain Monte Carlo (MCMC) method to simulate both the latent Gaussian variables and the posterior distributions (10). A related approach is the Gaussian copula graphical model developed by (7), whose sampling algorithm is based on a reversible jump MCMC approach. Here, in our proposed method


we implement the birth-death MCMC approach (22), which has several computational advantages over the reversible jump MCMC approach; see (22, Section 4). The paper is organized as follows. In Section 3.3 we introduce a comprehensive Bayesian framework based on Gaussian copula graphical models; in addition, we show the performance of our methodology and compare it with state-of-the-art alternatives. In Section 3.4 we analyse the Dupuytren disease dataset based on our proposed Bayesian statistical methodology. There we first discover the potential phenotypic risk factors for Dupuytren disease; moreover, we consider the severity of Dupuytren disease between pairs of the 10 hand fingers. The results help surgeons decide whether they should operate on one finger or on multiple fingers simultaneously. In the last section, we discuss the connections to existing methods and possible future directions.

3.3 Methodology

3.3.1 Gaussian graphical models

In graphical models, conditional dependence relationships among random variables are represented by a graph G. A graph G = (V, E) specifies a set of vertices V = {1, 2, ..., p}, where each vertex corresponds to a random variable, and a set of existing edges E ⊂ V × V (15). Ē denotes the set of non-existing edges. We focus here on undirected graphical models, in which (i, j) ∈ E is equivalent to (j, i) ∈ E, also known as Markov random fields. The absence of an edge between two vertices specifies the pairwise conditional independence of those two variables given the remaining variables, while an edge between two variables indicates their conditional dependence. For a p-dimensional variable there exist in total 2^{p(p−1)/2} possible conditional independence graphs. Even with a relatively small number of variables, the size of the graph space is enormous. The graph space can be explored by stochastic search algorithms (22, 8, 11). These algorithms explore the graph space by adding or deleting one edge at each step, and are known as neighborhood search algorithms. A graphical model in which the variables follow a multivariate normal distribution is called a Gaussian graphical model, also known as a covariance selection model (6). Zero entries in the precision matrix correspond to the absence of edges in the graph and to conditional independence between pairs of random variables given all other variables. We define a zero

Bayesian modeling of Dupuytren disease


mean Gaussian graphical model with respect to the graph G as

M_G = { N_p(0, Σ) | K = Σ^{−1} ∈ P_G },

where P_G denotes the space of p × p positive definite matrices with entries (i, j) equal to zero whenever (i, j) ∉ E. Let z = (z_1, ..., z_n) be an independent and identically distributed sample of size n from model M_G, where z_i is a p-dimensional vector of variables. Then the likelihood is

P(z | K, G) ∝ |K|^{n/2} exp{ −(1/2) tr(KS) },   (3.1)

where S = z′z.
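To make the likelihood (3.1) concrete, the following Python sketch (purely illustrative, not part of our implementation; the toy sample and 2-dimensional setting are arbitrary choices) computes S = z′z and the log-likelihood up to an additive constant:

```python
import math

def scatter_matrix(z):
    """S = z'z: sum of outer products of the n observation vectors."""
    p = len(z[0])
    return [[sum(row[i] * row[j] for row in z) for j in range(p)] for i in range(p)]

def log_likelihood(K, z):
    """log P(z | K, G) up to a constant: (n/2) log|K| - tr(KS)/2 (2-dim toy case)."""
    n = len(z)
    S = scatter_matrix(z)
    det_K = K[0][0] * K[1][1] - K[0][1] * K[1][0]  # 2x2 determinant
    tr_KS = sum(K[i][j] * S[j][i] for i in range(2) for j in range(2))
    return 0.5 * n * math.log(det_K) - 0.5 * tr_KS

z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]   # identity precision: no edges, variables independent
ll = log_likelihood(K, z)      # here S = [[2, 1], [1, 2]], so ll = -2
```

With the identity precision matrix the log-determinant term vanishes and the log-likelihood reduces to −tr(S)/2.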

3.3.2 Gaussian copula graphical models

A copula is a multivariate cumulative distribution function whose marginals are uniform on the interval [0, 1]. Copulas provide a flexible tool for describing dependence among random variables, in particular for non-Gaussian multivariate data. By Sklar’s theorem (30), for any p-dimensional distribution function H there exists a copula C such that H is completely specified by its marginal distributions and the copula C through

H(y_1, ..., y_p) = C( F_1(y_1), ..., F_p(y_p) ),

where the F_j are the univariate marginal distributions of H. If all F_j are continuous, then C is unique; otherwise it is uniquely determined on Ran(F_1) × · · · × Ran(F_p), the Cartesian product of the ranges of the F_j. Conversely, a copula can be extracted from any p-dimensional distribution function H with marginal distributions F_j by

C(u_1, ..., u_p) = H( F_1^{−1}(u_1), ..., F_p^{−1}(u_p) ),

where F_j^{−1}(s) = inf{ t | F_j(t) ≥ s } is the pseudo-inverse of F_j. The decomposition of a joint distribution into marginal distributions and a copula suggests that the copula captures the essential features of dependence between the random variables. Moreover, this measure of dependence is invariant to monotone transformations of the variables. Thus, copulas allow one to model the marginal distributions and the dependence structure of a multivariate random variable separately. In copula modelling, Genest et al. (9) developed a popular semiparametric estimation approach,


known as rank likelihood estimation, in which the association among variables is represented by a parametric copula while the marginals are treated as nuisance parameters. The marginals are estimated non-parametrically using the scaled empirical distribution function F̂_j(y) = n/(n+1) F_{nj}(y), where F_{nj}(y) = (1/n) Σ_{i=1}^{n} I{y_{ij} ≤ y}. As a result, estimation and inference are robust to misspecification of the marginal distributions. These semiparametric estimators are well-behaved for continuous data but fail for discrete data, for which the distribution of the ranks depends on the univariate marginal distributions, making them inappropriate for the analysis of mixed continuous and discrete data (10). To remedy this, Hoff (10) proposed the extended rank likelihood, a type of marginal likelihood that does not depend on the marginal distributions of the observed variables. Under the extended rank likelihood approach the ranks are free of the nuisance parameters (the marginal distributions) of the discrete data. This makes the extended rank likelihood approach well suited to the determination of graphical models (or multivariate association) and avoids the difficult problem of modelling the marginal distributions (7). For ordered discrete and continuous variables, a Gaussian copula has been used to describe the dependence pattern between heterogeneous variables via the extended rank likelihood in Gaussian copula graphical modelling (7). Let Y be a collection of continuous, binary, ordinal or count variables, with F_j the marginal distribution of Y_j and F_j^{−1} its pseudo-inverse. To construct a joint distribution of Y, we introduce independent and identically distributed multivariate normal latent variables

Z1 , ..., Zn ∼ N (0, Γ),

and define the observed data as Y_{ij} = F_j^{−1}(Φ(Z_{ij})), where Γ is the correlation matrix of the Gaussian copula. The joint distribution of Y is given by

P( Y_1 ≤ y_1, ..., Y_p ≤ y_p ) = C( F_1(y_1), ..., F_p(y_p) | Γ ),

where C(·) is the Gaussian copula,

C(u_1, ..., u_p | Γ) = Φ_p( Φ^{−1}(u_1), ..., Φ^{−1}(u_p) | Γ ),

with Φ_p(·) the cumulative distribution function of the multivariate normal distribution and Φ(·) the cumulative distribution function of the univariate standard normal distribution. Hence the joint cumulative distribution function is

P( Y_1 ≤ y_1, ..., Y_p ≤ y_p ) = Φ_p( Φ^{−1}(F_1(y_1)), ..., Φ^{−1}(F_p(y_p)) | Γ ).   (3.2)
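To illustrate the latent construction Y_ij = F_j^{−1}(Φ(Z_ij)), the following self-contained Python sketch (purely illustrative; the Poisson marginal, the correlation value, and all helper names are arbitrary choices, not part of our model) draws pairs with one continuous and one count marginal coupled through a bivariate Gaussian copula:

```python
import math
import random

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F_inv_poisson(u, lam):
    """Pseudo-inverse F^{-1}(u) = inf{t : F(t) >= u} for a Poisson(lam) marginal."""
    t, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u and t < 500:   # cap guards against floating-point edge cases
        t += 1
        pmf *= lam / t
        cdf += pmf
    return t

def sample_copula_pair(rho, lam, rng):
    """Draw (Y1, Y2): Y1 continuous (standard normal marginal), Y2 a count,
    coupled by a bivariate Gaussian copula with latent correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    y1 = z1                       # F1^{-1}(Phi(z1)) is the identity for N(0,1)
    y2 = F_inv_poisson(Phi(z2), lam)
    return y1, y2

rng = random.Random(1)
pairs = [sample_copula_pair(0.8, 3.0, rng) for _ in range(2000)]
```

The Poisson marginal of Y2 is preserved exactly, while the positive latent correlation induces a positive association between the mixed-type observations.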

In semiparametric copula estimation, since the marginals are treated as nuisance parameters, the joint distribution in (3.2) is parametrized only by the correlation matrix Γ of the Gaussian copula. Our aim is to infer the underlying graph structure G of the observed variables Y implied by the continuous latent variables Z. Since the Z are unobservable, we follow the idea of (10) and relate them to the observed data as follows. Given the observed data Y from a sample of n observations, the latent samples z are constrained to belong to the set

A(y) = { z ∈ R^{n×p} : l_j^r(z) < z_j^{(r)} < u_j^r(z), r = 1, ..., n; j = 1, ..., p },

where

l_j^r(z) = max{ z_j^{(s)} : y_j^{(s)} < y_j^{(r)} },
u_j^r(z) = min{ z_j^{(s)} : y_j^{(r)} < y_j^{(s)} }.   (3.3)
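The truncation bounds in (3.3) depend on the data only through the ranks of the observed column. A minimal Python sketch (illustrative only; function and variable names are our own) computes the interval for one latent value:

```python
def latent_bounds(y_col, z_col, r):
    """Truncation interval (l, u) for the latent value z_j^(r), following (3.3):
    l = max{ z^(s) : y^(s) < y^(r) },  u = min{ z^(s) : y^(r) < y^(s) }.
    Ties in y are excluded from both bounds (strict inequalities)."""
    lower = [z for z, y in zip(z_col, y_col) if y < y_col[r]]
    upper = [z for z, y in zip(z_col, y_col) if y_col[r] < y]
    l = max(lower) if lower else float("-inf")
    u = min(upper) if upper else float("inf")
    return l, u

# Ordinal column with ties: observation 3 shares the category of observation 0
y = [1, 0, 2, 1, 0]
z = [0.3, -1.2, 1.5, 0.1, -0.8]
bounds = latent_bounds(y, z, 0)   # -> (-0.8, 1.5)
```

Observations in the same category as y^(r) constrain neither bound, which is exactly why the extended rank likelihood handles ties in discrete data.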

Further, (10) suggests that inference on the latent space can be performed by substituting the observed data y with the event A(y). For a given graph G and precision matrix K = Γ^{−1}, the likelihood is defined as

P(y | K, G, F_1, ..., F_p) = P(y, z ∈ A(y) | K, G, F_1, ..., F_p)
                           = P(z ∈ A(y) | K, G) × P(y | z ∈ A(y), K, G, F_1, ..., F_p).

The only part of the observed-data likelihood relevant for inference on K and G is P(z ∈ A(y) | K, G). Thus, the extended rank likelihood, as (10) refers to it, is given by

P(z ∈ A(y) | K, G) = ∫_{A(y)} P(z | K, G) dz,

where the expression inside the integral, for the Gaussian copula based distribution given by (3.2), takes a form similar to (3.1).


Therefore, we can infer (K, G) from the posterior distribution

P(K, G | z ∈ A(y)) ∝ P(z ∈ A(y) | K, G) P(K | G) P(G),

which is discussed in detail in the next sections. Moreover, we evaluate the results induced by the latent variables using posterior predictive analysis on the scale of the original mixed variables.

3.3.3 Bayesian Gaussian copula graphical models

Prior specification

In this section we discuss the specification of prior distributions for the graph G and the precision matrix K. For the prior distribution of the graph, we propose the discrete uniform distribution over the graph space, P(G) ∝ 1, as a non-informative prior. Other choices of priors for the graph structure model the joint state of the edges (28), encourage sparse graphs (11), or place a truncated Poisson distribution on the graph size (22). We consider the G-Wishart distribution (27) as the prior for the precision matrix. The G-Wishart distribution is conjugate for normally distributed data and places no probability mass on zero entries of the precision matrix. A matrix K ∈ P_G has the G-Wishart distribution W_G(b, D) if

P(K | G) = (1 / I_G(b, D)) |K|^{(b−2)/2} exp{ −(1/2) tr(DK) },

where b > 2 is the degrees of freedom, D is a symmetric positive definite matrix, and I_G(b, D) is the normalizing constant,

I_G(b, D) = ∫_{P_G} |K|^{(b−2)/2} exp{ −(1/2) tr(DK) } dK.

If the graph G is complete, the G-Wishart distribution reduces to the usual Wishart distribution, whose normalizing constant has an explicit form (24). If the graph is decomposable, I_G(b, D) can also be calculated explicitly (27). For non-decomposable graphs, we can approximate I_G(b, D) by a Monte Carlo approach (2) or a Laplace approximation (17). The G-Wishart prior is conjugate to the likelihood (3.1); hence, the posterior distribution of K is

P(K | Z ∈ A(y), G) = (1 / I_G(b*, D*)) |K|^{(b*−2)/2} exp{ −(1/2) tr(D*K) },

where b* = b + n and D* = D + S, that is, W_G(b*, D*). For other choices of priors for the precision matrix see (36, 34, 33, 37).

Posterior inference

Consider the joint posterior distribution of K ∈ P_G and the graph G,

P(K, G | Z ∈ A(y)) ∝ P(Z ∈ A(y) | K) P(K | G) P(G).   (3.4)

Sampling from this joint posterior distribution can be done with the computationally efficient birth-death MCMC sampler proposed by Mohammadi and Wit (22) for Gaussian graphical models. Here we extend their algorithm to the more general case of Gaussian copula graphical models. Our algorithm is based on a continuous-time birth-death Markov process that explores the graph space by adding or removing an edge in a birth or death event. Births and deaths occur in continuous time, with rates determined by the stationary distribution of the process. The process is constructed in such a way that its stationary distribution equals the target joint posterior distribution of the graph and the precision matrix (3.4). The birth and death events are independent Poisson processes, so the time between two successive events has an exponential distribution, and the probability that the next event is a particular birth or death is proportional to its rate. Mohammadi and Wit (22, Section 3) prove that with the following birth and death rates the birth-death MCMC sampling algorithm converges to the target joint posterior distribution of the graph and the precision matrix:

β_e(K) = P(G^{+e}, K^{+e} \ (k_{ij}, k_{jj}) | Z ∈ A(y)) / P(G, K \ k_{jj} | Z ∈ A(y)),   for each e ∈ Ē,   (3.5)

δ_e(K) = P(G^{−e}, K^{−e} \ k_{jj} | Z ∈ A(y)) / P(G, K \ (k_{ij}, k_{jj}) | Z ∈ A(y)),   for each e ∈ E,   (3.6)

in which G^{+e} = (V, E ∪ {e}) with K^{+e} ∈ P_{G^{+e}}, and similarly G^{−e} = (V, E \ {e}) with K^{−e} ∈ P_{G^{−e}}. The extended birth-death MCMC algorithm for Gaussian copula graphical


models is summarized in Algorithm 3.1.

Algorithm 3.1. Given a graph G = (V, E) with a precision matrix K, iterate the following steps:

1. Sample the latent data. For each r ∈ V and j ∈ {1, 2, ..., n}, update the latent value z_r^{(j)} from its full conditional distribution

   Z_r^{(j)} | K, Z_{V \ {r}} = z^{(j)}_{V \ {r}} ∼ N( −Σ_{r′ ≠ r} K_{rr′} z_{r′}^{(j)} / K_{rr}, 1/K_{rr} ),

   truncated to the interval ( l_r^j, u_r^j ) in (3.3).
2. Sample the graph based on the birth and death process.
   2.1. Calculate the birth rates by equation (3.5) and β(K) = Σ_{e ∈ Ē} β_e(K).
   2.2. Calculate the death rates by equation (3.6) and δ(K) = Σ_{e ∈ E} δ_e(K).
   2.3. Calculate the waiting time W(K) = 1/(β(K) + δ(K)).
   2.4. Simulate the type of jump (birth or death), with probability proportional to the rates.
3. Sample the new precision matrix, according to the type of jump, based on Algorithm 3.2.

In Algorithm 3.1, the first step is to sample the latent variables given the observed data. Based on this sample, we calculate the birth and death rates and the waiting time, and from the rates we simulate the type of jump. Details of how to calculate the birth and death rates efficiently are discussed below. Finally, in step 3, according to the new state of the jump, we sample a new precision matrix using a direct sampling scheme from the G-Wishart distribution, described in Algorithm 3.2. To calculate the posterior probability of a graph we compute the Rao-Blackwellized sample mean (4, Subsection 2.5): the Rao-Blackwellized estimate of the posterior probability of a graph is the proportion of the total waiting time spent in that graph (see Figure 3.3, right), with weights equal to the lengths of the waiting times in each state (e.g. {W_1, W_2, W_3, ...} in Figure 3.3).

Computing the birth and death rates

Calculating the birth and death rates (3.5) and (3.6) is the bottleneck of our BDMCMC algorithm. Here, we explain how to calculate the death rates efficiently; the birth rates are calculated in a similar manner.


Fig. 3.3 Visualization of Algorithm 3.1. (Bottom left) Continuous-time BDMCMC algorithm, where {W_1, W_2, ...} denote waiting times and {t_1, t_2, ...} denote jumping times. (Bottom right) Estimated posterior probabilities of the graphs, which are proportional to the sums of their waiting times.

Following (22), after some simplification we have, for each e = (i, j) ∈ E,

δ_e(K) = (P(G^{−e}) / P(G)) · (I_G(b, D) / I_{G^{−e}}(b, D)) · ( D*_{jj} / (2π (k_{ii} − k¹_{ii})) )^{1/2} · H(K, D*),   (3.7)

where

H(K, D*) = exp{ −(1/2) [ tr(D*_{e,e} (K⁰ − K¹)) − ( D*_{ii} − (D*_{ij})² / D*_{jj} ) (k_{ii} − k¹_{ii}) ] },

in which

K⁰ = [ k_{ii}, 0 ; 0, K_{j, V\j} (K_{V\j, V\j})^{−1} K_{V\j, j} ],   K¹ = K_{e, V\e} (K_{V\e, V\e})^{−1} K_{V\e, e},

and k¹_{ii} denotes the (i, i) element of K¹. The computational bottleneck in (3.7) is the ratio of normalizing constants.


Dealing with the calculation of normalizing constants

Calculating the ratio of normalizing constants has been a major issue in the recent literature (32, 35, 22). To compute the normalizing constant of a G-Wishart distribution, (27) proposed an importance sampling algorithm, while (2) developed a Monte Carlo method. These methods can be computationally expensive and numerically unstable (11, 35). (35), (5) and (22) developed an alternative approach, which borrows ideas from the exchange algorithm (25) and the double Metropolis-Hastings algorithm (18) to compute the ratio of such normalizing constants. When the dimension of the problem is high, however, the curse of dimensionality may be a serious difficulty for the double MH sampler (18). More recently, (32) derived an explicit representation of the ratio of normalizing constants.

Theorem 3.1 (Uhler et al. (32)). Let G = (V, E) be an undirected graph and let G^{−e} = (V, E \ {e}) denote the graph G with the edge e removed. Then

I_G(b, I_p) / I_{G^{−e}}(b, I_p) = 2 √π Γ((b + d + 1)/2) / Γ((b + d)/2),

where d denotes the number of triangles formed by the edge e and two other edges in G, and I_p denotes the p-dimensional identity matrix.

Proof. Immediate from Uhler et al. (32, Theorem 3.7).

Therefore, for the case D = I_p, we have a simplified expression for the death rates,

δ_e(K) = (P(G^{−e}) / P(G)) · (Γ((b + d + 1)/2) / Γ((b + d)/2)) · ( 2 D*_{jj} / (k_{ii} − k¹_{ii}) )^{1/2} · H(K, D*).
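The ratio in Theorem 3.1 is cheap to evaluate on the log scale. A small Python sketch (illustrative only; the function name is our own) using the log-gamma function:

```python
import math

def gwishart_ratio(b, d):
    """Ratio I_G(b, I_p) / I_{G^{-e}}(b, I_p) from Theorem 3.1, where d is the
    number of triangles the edge e forms in G; computed on the log scale for
    numerical stability."""
    log_ratio = math.log(2.0) + 0.5 * math.log(math.pi) \
        + math.lgamma((b + d + 1) / 2.0) - math.lgamma((b + d) / 2.0)
    return math.exp(log_ratio)

# For b = 3, d = 0: 2*sqrt(pi)*Gamma(2)/Gamma(1.5) = 2*sqrt(pi)/(sqrt(pi)/2) = 4
ratio = gwishart_ratio(3, 0)
```

Because only the triangle count d of the single edge involved is needed, the ratio requires no Monte Carlo approximation at all when D = I_p.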

Sampling from the posterior distribution of the precision matrix

Several methods for sampling from a G-Wishart distribution have been proposed; for a review of existing methods see (35) and (16). Here we use the exact sampler developed by (16), summarized in Algorithm 3.2.

Algorithm 3.2. Direct sampler from the precision matrix (16). Given a graph G = (V, E) with precision matrix K:

1. Set Σ = K^{−1}.
2. Repeat for j = 1, ..., p, until convergence:
   2.1. Let N_j ⊂ V be the set of vertices connected to j in G. Form Σ_{N_j} and K^{−1}_{N_j, j} and solve
        β̂*_j = Σ^{−1}_{N_j} K^{−1}_{N_j, j}.
   2.2. Form β̂_j ∈ R^{p−1} by placing the elements of β̂*_j in the locations corresponding to N_j and zeroes in the locations not connected to j in G.
   2.3. Replace Σ_{j, −j} and Σ_{−j, j} with Σ_{−j, −j} β̂_j.
3. Return K = Σ^{−1}.

Simulation study

We perform a comprehensive simulation study over different graph structures to evaluate the performance of our method and compare it to the alternative approach proposed by Dobra and Lenkoski (7), referred to as DL. We generate mixed data from a latent Gaussian copula model with 5 different types of variables: “Gaussian”, “non-Gaussian”, “ordinal”, “count”, and “binary”. We performed all computations with our extended R package BDgraph (21, 23). Corresponding to different sparsity patterns, we consider 4 kinds of synthetic graphical models:

1. Random graph: a graph in which the edge set E is generated from independent Bernoulli distributions with probability 2/(p − 1); the corresponding precision matrix is generated from K ∼ W_G(3, I_p).
2. Cluster graph: a graph in which the number of clusters is max{2, p/20}; each cluster has the same structure as a random graph, and the corresponding precision matrix is generated from K ∼ W_G(3, I_p).
3. Scale-free graph: a graph with a power-law degree distribution generated by the Barabási-Albert algorithm (1); the corresponding precision matrix is generated from K ∼ W_G(3, I_p).
4. Hub graph: a graph in which every node is connected to one central node; the corresponding precision matrix is generated from K ∼ W_G(3, I_p).

For each graphical model, we consider four scenarios: (1) dimension p = 10 and sample size n = 30, (2) p = 10 and n = 100, (3) p = 30 and n = 100, and (4) p = 30 and n = 500. For each mixed dataset, we fit our method and the DL approach using a uniform prior for the graph and the G-Wishart prior W_G(3, I_p) for the precision matrix. We run the two algorithms from the same starting points for 100,000 iterations, with the first 50,000 as burn-in. Computations for this example were performed in parallel on 235 batch nodes with 12 cores and 24 GB of memory each, running Linux.
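Returning to Algorithm 3.2, its iterative completion step can be sketched in pure Python (an illustrative reimplementation, not BDgraph code; `mat_inv` is a small Gauss-Jordan helper of our own, and the 3-node chain graph is an arbitrary example):

```python
def mat_inv(A):
    """Inverse by Gauss-Jordan elimination with partial pivoting (small matrices)."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        f = M[col][col]
        M[col] = [x / f for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                g = M[r][col]
                M[r] = [x - g * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def gwishart_complete(Sigma, edges, sweeps=50):
    """Step 2 of Algorithm 3.2: iteratively overwrite column j of Sigma so that
    K = Sigma^{-1} has (numerically) exact zeros at the non-edges of G.
    Edges are given as pairs (i, j) with i < j."""
    p = len(Sigma)
    nbrs = [[i for i in range(p) if (min(i, j), max(i, j)) in edges] for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            N = nbrs[j]
            if not N:                         # isolated vertex: zero out its column
                for i in range(p):
                    if i != j:
                        Sigma[i][j] = Sigma[j][i] = 0.0
                continue
            # Step 2.1: beta*_j solves Sigma[N, N] beta = Sigma[N, j]
            Ainv = mat_inv([[Sigma[a][b] for b in N] for a in N])
            beta_star = [sum(Ainv[a][b] * Sigma[N[b]][j] for b in range(len(N)))
                         for a in range(len(N))]
            beta = [0.0] * p                  # Step 2.2: pad zeros off N_j
            for idx, a in enumerate(N):
                beta[a] = beta_star[idx]
            for i in range(p):                # Step 2.3: Sigma[j,-j] <- Sigma[-j,-j] beta
                if i != j:
                    Sigma[i][j] = Sigma[j][i] = sum(
                        Sigma[i][b] * beta[b] for b in range(p) if b != j)
    return mat_inv(Sigma)

# Chain graph 0 - 1 - 2: after completion, K[0][2] is zero (0 and 2 not adjacent)
Sigma = [[1.0, 0.5, 0.4], [0.5, 1.0, 0.5], [0.4, 0.5, 1.0]]
K = gwishart_complete(Sigma, {(0, 1), (1, 2)})
```

For this chain graph the fixed point is reached after a single sweep: the completed Σ satisfies Σ_{02} = Σ_{01}Σ_{12}/Σ_{11}, which is exactly the condition for K_{02} = 0.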


To assess the recovery of the graph structure, we compute the F_1-score (26) of the MAP graph, defined as

F_1-score = 2TP / (2TP + FP + FN),   (3.8)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The F_1-score lies between 0 and 1, where 1 stands for perfect identification and 0 for the worst possible identification. We also use the mean squared error (MSE), defined as

MSE = Σ_e ( p̂_e − I(e ∈ G_true) )²,   (3.9)

where p̂_e is the posterior inclusion probability of edge e and I(e ∈ G_true) is an indicator function such that I(e ∈ G_true) = 1 if e ∈ G_true and zero otherwise. For our BDMCMC algorithm we calculate the posterior inclusion probability of each possible edge e = (i, j) by Rao-Blackwellization (4, Subsection 2.5),

p̂_e = Σ_{t=1}^{N} I(e ∈ G^{(t)}) W(K^{(t)}) / Σ_{t=1}^{N} W(K^{(t)}),   (3.10)

where N is the number of iterations and W(K^{(t)}) is the waiting time in graph G^{(t)} with precision matrix K^{(t)}; see (22). Table 3.1 compares our method with DL (7); we repeat the experiments 50 times and report the average F_1-score and MSE with their standard errors in parentheses. Our method performs well overall, as its F_1-score and MSE are better in most cases, mainly because of its faster convergence rate. As expected, the DL approach converges more slowly than our method. From a theoretical point of view, both algorithms converge to the true posterior distribution if run for a sufficient amount of time; the results in the table therefore mainly indicate how quickly the algorithms converge.
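The two performance measures (3.8) and (3.9) are straightforward to compute from the posterior edge inclusion probabilities. An illustrative Python sketch (the probabilities and true graph below are made-up toy values, not from the simulation study):

```python
def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), as in (3.8)."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

def mse(p_hat, true_edges):
    """MSE = sum_e (p_hat_e - I(e in G_true))^2, as in (3.9)."""
    return sum((p - (1.0 if e in true_edges else 0.0)) ** 2
               for e, p in p_hat.items())

# Hypothetical posterior edge inclusion probabilities over a 4-node graph
p_hat = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
         (1, 2): 0.8, (1, 3): 0.4, (2, 3): 0.7}
true_edges = {(0, 1), (1, 2), (2, 3)}

selected = {e for e, p in p_hat.items() if p > 0.5}   # threshold at 0.5
tp = len(selected & true_edges)
fp = len(selected - true_edges)
fn = len(true_edges - selected)
score = f1_score(tp, fp, fn)          # perfect recovery here: score = 1
error = mse(p_hat, true_edges)        # nonzero: probabilities are not 0/1
```

Note that the F_1-score judges only the thresholded graph, while the MSE also penalizes posterior probabilities that are far from 0 or 1, so the two measures can rank methods differently.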

3.4 Analysis of Dupuytren disease dataset

The dataset we analyze here was collected from patients with Dupuytren disease in the north of the Netherlands. Both hands of patients who were willing to participate and signed an informed consent form were examined for signs of Dupuytren disease and knuckle pads. Signs of Dupuytren disease include tethering of the skin, nodules, cords, and finger contractures in patients with cords. Participants who had at least one of these features


                          F1-score                    MSE
p    n    graph        BDMCMC        DL            BDMCMC        DL
10   30   Random       0.37 (0.17)   0.33 (0.11)    7.0 (2.0)    9.2 (1.1)
          Cluster      0.35 (0.16)   0.30 (0.11)    6.9 (1.8)    9.6 (1.4)
          Scale-free   0.31 (0.13)   0.34 (0.08)    7.6 (1.2)    9.7 (1.0)
          Hub          0.26 (0.11)   0.31 (0.10)    8.3 (0.8)   10.2 (0.8)
10   100  Random       0.33 (0.17)   0.32 (0.11)    7.7 (1.7)    9.9 (1.0)
          Cluster      0.30 (0.17)   0.28 (0.09)    6.8 (1.5)    9.5 (1.3)
          Scale-free   0.33 (0.16)   0.32 (0.12)    7.3 (1.4)    9.6 (1.0)
          Hub          0.26 (0.12)   0.31 (0.09)    8.3 (0.9)   10.0 (1.0)
30   100  Random       0.54 (0.06)   0.44 (0.04)   52.3 (9.9)   59.1 (8.7)
          Cluster      0.56 (0.05)   0.47 (0.04)   48.0 (6.5)   54.4 (8.1)
          Scale-free   0.53 (0.17)   0.30 (0.05)   27.7 (14.6)  25.8 (1.7)
          Hub          0.39 (0.08)   0.31 (0.04)   34.8 (6.8)   30.5 (4.5)
30   500  Random       0.79 (0.04)   0.63 (0.07)   25.8 (6.5)   41.1 (14.3)
          Cluster      0.79 (0.05)   0.66 (0.05)   26.3 (5.2)   35.1 (7.9)
          Scale-free   0.81 (0.07)   0.59 (0.06)    9.4 (3.2)   11.7 (3.0)
          Hub          0.73 (0.07)   0.53 (0.08)   12.7 (3.4)   13.5 (3.2)

Table 3.1 Summary of performance measures in the simulation study for our method (BDMCMC) and DL (7). The table presents the F1-score, defined in (3.8), and the MSE, defined in (3.9), with standard deviations in parentheses. The F1-score reaches its best value at 1 and its worst at 0; the MSE is a positive value for which 0 is optimal and smaller is better.

were labeled as having Dupuytren disease. Severity of the disease is measured by the total angle of each of the 10 fingers, where the total angle is the sum of the angles of the metacarpophalangeal joint and the two interphalangeal joints (for the thumb, only the two interphalangeal joints). As potential risk factors, information is available about smoking habits, alcohol consumption, whether participants performed manual labor during a significant part of their life, and whether they had sustained hand injury in the past, including surgery. In addition, disease history information is available about the presence of Ledderhose disease, diabetes, epilepsy, Peyronie disease, knuckle pads, or liver disease, and about familial occurrence of Dupuytren disease (defined as a first-degree relative with Dupuytren disease). The data consist of 279 patients who have Dupuytren disease (n = 279); among these patients, 79 have an irreversible flexion contracture in at least one finger. The severity of the disease in all 10 fingers of each patient is measured by the total angle


of each finger. To study the potential phenotype risk factors of this disease, we consider the 14 factors mentioned above. (13) analyzed Dupuytren disease with a multivariate ordinal logit model, taking age and sex into account, and tested hypotheses of independence between groups of fingers. However, most studies on the phenotype of this disease have been observational, without comprehensive statistical analyses. Phenotype risk factors previously described include alcohol consumption, smoking, manual labor, hand trauma, diabetes mellitus, and epilepsy (29, 14). In Subsection 3.4.1, we infer the Dupuytren disease network with the 14 potential risk factors using our Bayesian approach. In Subsection 3.4.2, we consider only the 10 fingers, to infer the interactions between the fingers.

3.4.1 Inference for Dupuytren disease with risk factors

We consider the severity of disease in all 10 fingers of the patients and 14 potential phenotype risk factors of the disease, so p = 24. The factors are: age (Age), sex (Sex), smoking (Smoking), amount of alcohol (Alcohol), familial occurrence (Relative), number of hand injuries (HandInjury), manual labour (Labour), Ledderhose disease (Ledderhose), diabetes (Diabetes), epilepsy (Epilepsy), liver disease (LiverDisease), Peyronie disease (Peyronie), and knuckle pads (Knucklepads). For each finger we measure the angles of the metacarpophalangeal joint and the two interphalangeal joints (for the thumb we only measure the two interphalangeal joints) and then sum these angles. The total angle could vary from 0 to 270 degrees; in this dataset the minimum is 0 and the maximum 157 degrees. The age of participants (in years) ranges from 40 to 89 years, with an average of 66 years. Smoking is binned into 3 ordered categories (never, stopped, and smoking). Alcohol consumption is binned into 8 ordered categories (ranging from no alcohol to ≥ 20 consumptions per week). The other variables are binary. We apply our Bayesian framework to infer the conditional (in)dependencies among the variables, in order to identify potential risk factors of Dupuytren disease and discover how they affect the disease. We place a uniform distribution as an uninformative prior on the graph and the G-Wishart W_G(3, I_24) prior on the precision matrix. We run our BDMCMC algorithm for 2,000K iterations with 1,000K sweeps as burn-in. The graph with the highest posterior probability is a graph with 42 edges. Figure 3.4 shows the selected graph with 26 edges, those for which the posterior inclusion probability (3.10) is greater than 0.5. Edges in the graph show the interactions among the variables. Figure 3.5 visualizes all posterior inclusion probabilities.


Fig. 3.4 The inferred graph for the Dupuytren disease dataset based on 14 risk factors and the total degrees of flexion in all 10 fingers. It reports the selected graph with 24 significant edges, those whose posterior inclusion probabilities (3.10) exceed 0.5.

The results (Figures 3.4 and 3.5) show that the factors “Age”, “Ledderhose”, “HandInjury”, and “Alcohol” have a significant effect on the severity of Dupuytren disease. Figure 3.4 also shows that the factor “Age” is a hub in this graph: it plays a significant role, affecting the severity of the disease both directly and indirectly through the influence of other risk factors such as “Ledderhose”.

3.4.2 Severity of Dupuytren disease between pairs of fingers

Here, we consider the severity of Dupuytren disease between pairs of the 10 hand fingers. Interaction between fingers is important because it helps surgeons decide whether they should operate on one finger or on multiple fingers simultaneously. The main idea is that if



Fig. 3.5 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph, for 10 fingers with 14 risk factors.

fingers are almost independent in terms of the severity of Dupuytren disease, there is no reason to operate on them simultaneously. To apply our Bayesian framework to these 10 variables (fingers), we place a uniform distribution as an uninformative prior on the graph and the G-Wishart W_G(3, I_10) prior on the precision matrix. We run our BDMCMC algorithm for 2,000K iterations with 1,000K sweeps as burn-in. The graph with the highest posterior probability is a graph with 12 edges. Figure 3.6 shows the selected graph with 8 edges, those for which the posterior inclusion probability (3.10) is greater than 0.5. Edges in the graph show the interactions among the variables under consideration. Figure 3.7 shows all posterior pairwise edge inclusion probabilities.

Fig. 3.6 The inferred graph for the Dupuytren disease dataset based on the total degrees of flexion in all 10 fingers. It reports the selected graph with 9 significant edges, those whose posterior inclusion probabilities (3.10) exceed 0.5.

Fig. 3.7 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph, for the 10 fingers.

The results (Figures 3.6 and 3.7) show significant co-occurrence of Dupuytren disease in the ring and middle fingers of both hands. Therefore we can infer that the middle finger is substantially associated with the ulnar side of the hand. Surprisingly, our results also show a significant relationship between the middle fingers of both hands. This supports the hypothesis that the disease has genetic or other biological factors that affect fingers simultaneously. Moreover, the joint interactions between the fingers of the two hands are almost symmetric.

3.4.3 Fit of model to Dupuytren data

Posterior predictive checks can be used to assess whether the proposed Bayesian approach fits the Dupuytren dataset: if the model fits, simulated data generated under the model should resemble the observed data. Therefore, based on the graph estimated by our BDMCMC algorithm in Subsection 3.4.1, we first draw simulated values from the posterior predictive distribution of replicated data, and we then compare these samples to our observed data. Any systematic differences between the simulations and the data indicate potential failings of the model. In this regard, and based on the results in Subsection 3.4.1 (Figures 3.4 and 3.5), for both the simulated and the observed data we obtain the conditional distributions of the potential risk factors and fingers. We show the results for finger 4 of the right hand and the risk factor age in Figure 3.8, for finger 5 of the right hand and the risk factor relative in Figure 3.9, and for finger 2 of the right hand and the risk factor Ledderhose in Figure 3.10. Figure 3.8 plots the empirical and predictive distributions of finger 4 of the right hand conditional on the variable “age” in four categories: (40, 50), (50, 60), (60, 70), and (70, 90). For finger 4 of the right hand, following the Tubiana classification, we use 5 categories (category 1: total angle of 0 degrees; 2: between 1 and 45 degrees; 3: between 46 and 90; 4: between 90 and 135; 5: more than 135). Figure 3.9 plots the empirical and predictive distributions of finger 5 of the right hand conditional on “Relative”, and Figure 3.10 those of finger 2 of the right hand conditional on “Ledderhose”. Figures 3.9 and 3.10 show that the fit is good, since the predicted conditional distributions are, in general, close to the empirical distributions.

3.5 Conclusion

In this paper we have proposed a Bayesian methodology for discovering the effects of potential risk factors of Dupuytren disease and the underlying relationships between fingers simultaneously. The results of the case study clearly demonstrate that age, alcohol consumption, familial occurrence, and Ledderhose disease all affect Dupuytren disease directly; other risk factors affect it only indirectly. Another important result is that the severity of Dupuytren disease is correlated across fingers, in particular between the middle and ring fingers. This implies that


R> plotcoda( output.ggm )

The results in Figure 4.4 indicate that our BDMCMC algorithm, roughly speaking, converges after 20,000 iterations and that a burn-in of 25,000 is sufficient.

4.6.3 Comparison and goodness-of-fit

The function compare provides several measures for checking the performance of our algorithms and for comparing them with alternative approaches, with respect to the true graph structure. To check the performance of both our algorithms (Algorithms 2 and 3), we also run Algorithm 3 under the same conditions, as below:

4.6 User interface by toy example



Fig. 4.3 Visualization summary of the simulation data based on the output of the bdgraph function. The figure in the top-left is the inferred graph with the highest posterior probability. The figure in the top-right shows the estimated posterior probabilities of all visited graphs. The figure in the bottom-left shows the estimated posterior probabilities of all visited graphs based on the size of the graph. The figure in the bottom-right is the trace of our algorithm based on the size of the graphs.

R> output.gcgm <- bdgraph( data.sim, method = "gcgm", iter = 50000 )
R> plotroc( data.sim, output.ggm, output.gcgm )

As expected, the result indicates that the “GGM” approach performs slightly better than the “GCGM”, since the data are generated from a Gaussian graphical model.


BDgraph: An R Package for Structure Learning in Graphical Models


Fig. 4.4 Plot for monitoring the convergence of BDMCMC algorithm. It shows the trace of the cumulative posterior probability of the all possible links of the graph.


Fig. 4.5 ROC plot to compare the performance of the GGM and GCGM approaches.

Here, we also compare our approach to the Meinshausen-Buhlmann approach mb (20) by using the huge package (34). We consider the following code

R> install.packages( "huge" )
R> library( huge )
R> huge <- huge( data.sim $ data, method = "mb" )
R> output.huge <- huge.select( huge )
R> compare( data.sim, output.ggm, output.gcgm, output.huge, vis = TRUE )

                     True graph BDgraph(ggm) BDgraph(gcgm) huge(mb)
true positive                 7            5             4        5
true negative                21           16            17       14
false positive                0            5             4        7
false negative                0            2             3        2
true positive rate            1         0.71          0.57     0.71
false positive rate           0         0.24          0.19     0.33
accuracy                      1         0.75          0.75     0.68
balanced F-score              1         0.59          0.53     0.53
positive predictive           1         0.50          0.50     0.41

This result indicates that the huge package based on the mb approach discovers 5 true links; in addition, it finds 7 extra links which are not in the true graph. See Figure 4.6 for a visualization. For more comparisons see Mohammadi and Wit (24, section 4).
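The rows of the table above are all functions of the four confusion counts. The following Python sketch reproduces them (the function name is ours; compare's exact implementation lives in the BDgraph R code):

```python
def structure_metrics(tp, tn, fp, fn):
    """Performance measures for a recovered graph versus the true graph,
    matching the rows reported by BDgraph's compare function."""
    tpr = tp / (tp + fn)                    # true positive rate (recall)
    fpr = fp / (fp + tn)                    # false positive rate
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    ppv = tp / (tp + fp)                    # positive predictive value
    f1 = 2 * ppv * tpr / (ppv + tpr)        # balanced F-score
    return tpr, fpr, acc, f1, ppv

# Counts for the BDgraph(ggm) column of the comparison table above.
tpr, fpr, acc, f1, ppv = structure_metrics(tp=5, tn=16, fp=5, fn=2)
assert round(tpr, 2) == 0.71 and round(f1, 2) == 0.59
```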

4.7 Application to real data sets

In this section, we analyze two data sets from biology and sociology, by using the functions that are available in the BDgraph package. In Section 4.7.1, we analyze a Labor force survey data set, which are mixed data. In Section 4.7.2, we analyze Human gene expression data, which are high-dimensional data that do not follow the Gaussianity assumption. Both data sets are available in our package.

4.7.1 Application to Labor force survey data

(13) analyzes the multivariate dependencies among income, education and family background, by using data concerning 1002 males in the U.S. labor force. The data is available in our package. Users can call the data via

R> data( surveyData )
R> dim( surveyData )
[1] 1002    7



Fig. 4.6 Comparing the performance of the BDgraph and huge packages. The graph in the top-left is the true graph. The graph in the top-right is the selected graph based on Gaussian graphical models. The graph in the bottom-left is the selected graph based on Gaussian copula graphical models. The graph in the bottom-right is the selected graph based on the mb approach in the huge package.

R> head( surveyData, 5 )

     income degree children pincome pdegree pchildren age
[1,]     NA      1        3       3       1         5  59
[2,]     11      0        3      NA       0         7  59
[3,]      8      1        1      NA       0         9  25
[4,]     25      3        2      NA       0         5  55
[5,]    100      3        2       4       3         2  56

Missing data are indicated by NA and in general the rate of missing data is around 0.09 with higher rates for variables income and pincome. In this data set, we have seven observed variables as follows


income: An ordinal variable indicating the respondent’s income in 1000s of dollars.
degree: An ordinal variable with five categories indicating the respondent’s highest educational degree.
children: A count variable indicating the number of children of the respondent.
pincome: An ordinal variable with five categories indicating the financial status of the respondent’s parents.
pdegree: An ordinal variable with five categories indicating the highest educational degree of the respondent’s parents.
pchildren: A count variable indicating the number of children of the respondent’s parents.
age: A count variable indicating the respondent’s age in years.

Since these variables are measured on various scales, the marginal distributions are heterogeneous, which makes the study of their joint distribution very challenging. However, we can apply our Bayesian framework based on the Gaussian copula graphical models to this data set. Thus, we run the function bdgraph with option method = "gcgm". For the prior distributions of the graph and precision matrix, as defaults of the function bdgraph, we place a uniform distribution as an uninformative prior on the graph and a G-Wishart WG(3, I7) on the precision matrix. We run our function for 50,000 iterations with 25,000 as burn-in.

R> output <- bdgraph( surveyData, method = "gcgm", iter = 50000 )
R> summary( output )

$selected_graph
          income degree children pincome pdegree pchildren age
income         .      1        1       .       .         .   1
degree         1      .        .       .       1         1   .
children       1      .        .       .       1         1   1
pincome        .      .        .       .       1         1   .
pdegree        .      1        1       1       .         1   1
pchildren      .      1        1       1       1         .   .
age            1      .        1       .       1         .   .

$phat
          income degree children pincome pdegree pchildren  age
income         .      1     1.00    0.27    0.12      0.05 1.00
degree         .      .     0.53    0.03    1.00      0.84 0.11
children       .      .        .    0.14    0.88      1.00 1.00
pincome        .      .        .       .    1.00      0.53 0.07
pdegree        .      .        .       .       .      0.99 1.00
pchildren      .      .        .       .       .         . 0.03
age            .      .        .       .       .         .    .

$Khat
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
[1,]  3.546 -1.707 -0.949 -0.104 -0.041  0.019 -1.219
[2,] -1.707  4.127  0.321  0.009 -1.853  0.413 -0.056
[3,] -0.949  0.321  8.875  0.086  0.737 -1.074 -4.392
[4,] -0.104  0.009  0.086  5.686 -2.050  0.355  0.036
[5,] -0.041 -1.853  0.737 -2.050  6.510  1.029  1.028
[6,]  0.019  0.413 -1.074  0.355  1.029  7.826  0.022
[7,] -1.219 -0.056 -4.392  0.036  1.028  0.022  9.250

The results of the function summary are the adjacency matrix of the graph with the highest posterior probability (selected_graph), the estimated posterior probabilities of all possible links (phat) and the estimated precision matrix (Khat). Figure 4.7 shows the selected graph, i.e. the graph with the highest posterior probability. Links in the graph are indicated by the signs “+” and “-”, which represent positively and negatively correlated relationships between the associated variables, respectively. The results indicate that education, fertility and age determine income, since there are highly positively correlated relationships between income and those three variables, with posterior probability equal to one for all. They also show that the respondent’s education and fertility are negatively correlated, with posterior probability equal to 0.67. The respondent’s education is certainly related to his parents’ education, since there is a positively correlated relationship with posterior probability equal to one. Moreover, the results indicate that the relationships between income, education and fertility hold across generations. For this data set, (13) estimated the conditional independence between variables by inspecting whether the 95% credible intervals for the associated regression parameters do not contain zero. Our result is the same as Hoff’s result except for one link: our results indicate a strong relationship between parents’ education (pdegree) and fertility (children), which is not selected by Hoff.
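The “+”/“-” signs can be read directly off the estimated precision matrix: the partial correlation between variables i and j is −K_ij / √(K_ii K_jj), so a negative off-diagonal entry corresponds to a “+” link. A Python sketch using two Khat entries reported above (illustrative only; the function name is ours):

```python
import math

def partial_correlation(K, i, j):
    """Conditional (partial) correlation implied by a precision matrix K:
    rho_ij = -K_ij / sqrt(K_ii * K_jj)."""
    return -K[i][j] / math.sqrt(K[i][i] * K[j][j])

# Khat entries for income (row 1) and degree (row 2) from the output above.
K = [[3.546, -1.707],
     [-1.707, 4.127]]
rho = partial_correlation(K, 0, 1)
assert rho > 0  # a "+" link between income and degree, as in Figure 4.7
```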



Fig. 4.7 Graph with the highest posterior probability for the labor force survey data based on the output from bdgraph. The sign “+” represents a positively correlated relationship between the associated variables and “-” represents a negatively correlated relationship.

4.7.2 Application to Human gene expression

Here, by using the functions available in the BDgraph package, we study structure learning of sparse graphs applied to the large-scale human gene expression data which were originally described by (31). They collected the data to measure gene expression in B-lymphocyte cells from Utah individuals of Northern and Western European ancestry. They consider 60 individuals whose genotypes are available online at ftp://ftp.sanger.ac.uk/pub/genevar. Here the focus is on the 3125 Single Nucleotide Polymorphisms (SNPs) that have been found in the 5’ UTR (untranslated region) of mRNA (messenger RNA) with a minor allele frequency ≥ 0.1. Since the 5’ UTR of mRNA has previously been subject to investigation, it is expected to have an important role in the regulation of gene expression. The raw data were background corrected, then quantile normalized across replicates of a single individual, and then median normalized across all individuals. Following (4), among the 47,293 total available probes, we consider the 100 most variable probes that correspond to different Illumina TargetID transcripts. The data for these 100 probes are available in our package. To see the data, users can run the code

R> data( geneExpression )
R> dim( geneExpression )
[1]  60 100


The data consist of only 60 observations (n = 60) across 100 genes (p = 100). This data set is an interesting case study for graph structure learning, as it has been used in (4, 24, 11). In this data set, all the variables are continuous but they do not follow the Gaussianity assumption, as can be seen in Figure 4.8. Thus, we apply the Gaussian copula graphical models; we therefore run the function bdgraph with option method = "gcgm". For the prior distributions of the graph and precision matrix, as defaults of the function bdgraph, we place a uniform distribution as an uninformative prior on the graph and the G-Wishart WG(3, I100) on the precision matrix.

[Figure 4.8 here: histograms of the probes GI_18426974-S, GI_41197088-S, GI_17981706-S, GI_41190507-S, GI_33356162-S and Hs.449605-S.]

Fig. 4.8 Univariate histograms of the first 6 genes in the human gene data set.

We run our function for 100,000 iterations with 50,000 as burn-in as follows

R> output <- bdgraph( geneExpression, method = "gcgm", iter = 100000 )
R> select( output )

This function takes around 13 hours. The function select gives as a default the graph with the highest posterior probability, which has 991 links. We use the following code to visualize the ones that we believe have posterior probabilities larger than 0.995.

R> select( output, cut = 0.995, vis = TRUE )


By using option vis = TRUE, this function plots the selected graph. Figure 4.9 shows the selected graph with 266 links, for which the posterior inclusion probabilities (4.12) are greater than 0.995.


Fig. 4.9 The inferred graph for the human gene expression data using Gaussian copula graphical models. This graph consists of links with posterior probabilities (4.12) larger than 0.995.

The function phat reports the estimated posterior probabilities of all possible links in the graph. For our data the output of this function is a 100 × 100 matrix; Figure 4.10 visualizes that matrix. Most of the links in our selected graph (the graph with the highest posterior probability) conform to the results of previous studies. For instance, (4) found 54 significant interactions between genes, most of which are covered by our method. In addition, our approach indicates additional gene interactions with high posterior probabilities that were missed by previous work; thus our method may complement the analysis of human gene interaction networks.
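Conceptually, selecting a graph at a given cut amounts to thresholding this matrix of posterior link probabilities. A Python sketch (the function name select_links is ours, not part of the package; phat below is a toy 3 × 3 upper-triangular matrix):

```python
def select_links(phat, cut):
    """Return the links whose estimated posterior inclusion probability
    exceeds `cut`, from an upper-triangular matrix of link probabilities."""
    p = len(phat)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if phat[i][j] > cut]

phat = [[0.0, 0.999, 0.20],
        [0.0, 0.0,   0.997],
        [0.0, 0.0,   0.0]]
assert select_links(phat, cut=0.995) == [(0, 1), (1, 2)]
```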



Fig. 4.10 Image visualization of the estimated posterior probabilities for all possible links in the graph for the human gene expression data.

4.8 Conclusion

The BDgraph package aims to help researchers in two ways. Firstly, the package provides a Bayesian framework which can potentially be extended, customized and adapted to address different requirements in graphical models. Secondly, it is currently the only R package that provides a simple and complete range of tools for conducting Bayesian inference for graphical modelling based on conditional independence graph estimation. We plan to maintain and develop the package in the future. Future versions of the package will contain more options for the prior distributions of the graph and precision matrix. One possible extension of our package is to deal with outliers by using robust Bayesian graphical modelling with Dirichlet t-distributions (8, 22). An implementation of this method would be desirable in real applications.


References

[1] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. (1999). LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition.
[2] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412–424.
[3] Bates, D. and Maechler, M. (2014). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.1-2.
[4] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457.
[5] Cappé, O., Robert, C., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):679–700.
[6] Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems:1695.
[7] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418–1433.
[8] Finegold, M. and Drton, M. (2014). Robust Bayesian graphical modeling using Dirichlet t-distributions. Bayesian Analysis, 9(3):521–550.
[9] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
[10] Friedman, J., Hastie, T., and Tibshirani, R. (2014). glasso: Graphical lasso estimation of Gaussian graphical models. R package version 1.8.
[11] Gu, Q., Cao, Y., Ning, Y., and Liu, H. (2015). Local and global inference for high dimensional Gaussian copula graphical models. arXiv preprint arXiv:1502.02347.
[12] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.


[13] Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, pages 265–283.
[14] Højsgaard, S. (2012). Graphical independence networks with the gRain package for R. Journal of Statistical Software, 46(i10).
[15] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4):388–400.
[16] Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26.
[17] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.
[18] Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. (1979). Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software (TOMS), 5(3):308–323.
[19] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1):119–128.
[20] Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
[21] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modeling of Dupuytren disease using Gaussian copula graphical models. arXiv preprint arXiv:1501.04849v2.
[22] Mohammadi, A. and Wit, E. (2014). Contributed discussion on article by Finegold and Drton. Bayesian Analysis, 9(3):577–579.
[23] Mohammadi, A. and Wit, E. (2015a). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package version 2.17.
[24] Mohammadi, A. and Wit, E. C. (2015b). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138.
[25] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley Online Library.


[26] Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 392–399. AUAI Press.
[27] Nelsen, R. B. (2007). An Introduction to Copulas. Springer Science & Business Media.
[28] R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[29] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411.
[30] Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):22.
[31] Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., et al. (2007). Population genomics of human gene expression. Nature Genetics, 39(10):1217–1224.
[32] Uhler, C., Lenkoski, A., and Richards, D. (2014). Exact formulas for the normalizing constants of Wishart distributions for graphical models. arXiv preprint arXiv:1406.4901.
[33] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6:168–198.
[34] Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2014). huge: High-dimensional Undirected Graph Estimation. R package version 1.2.6.

Chapter 5: Bayesian Modelling of the Exponential Random Graph Models with Non-observed Networks

5.1 Abstract

Across the sciences, one of the main objectives is network modelling: discovering complex relationships among many variables. Among the most promising statistical models for network modelling are the Exponential Random Graph Models (ERGMs). These models provide an insightful probabilistic framework to represent a variety of structural tendencies that define complicated dependence patterns hardly modelled by other probabilistic models. However, they are restricted to settings in which the network is given, i.e. to observed network data. In the present paper, we develop a novel Bayesian statistical framework which combines the class of ERGMs with graphical models capable of modelling non-observed networks. Our proposed method greatly extends the scope of the ERGMs to more applied research areas. We discuss possible extensions of the method.
Key words: Bayesian inference, Exponential random graph models, Graphical models, Covariance selection, Birth-death process, Markov chain Monte Carlo, G-Wishart.

5.2 Introduction

Network modeling pervades all of science, since one of the main objectives of science is to discover complex relationships among large numbers of variables. For the prevention of epidemics, social science relies on a keen understanding of interactions among individuals (20, 7). Similarly, in biology, curing complex diseases requires an understanding of how “genes talk to each other” (1). One way to describe these kinds of complex relationships is by means of an abstract network. For a general overview of statistical models and methods for network modeling, see (10). Exponential random graph models (34, 22) are a promising and flexible family of statistical models for modelling network topology. These models have been used mainly in the social science literature, since they make it possible to statistically account for the complexity inherent in many network data (27, 26). In ERGMs, the basic assumption is that the topological structure of an observed network can be explained by a vector of network statistics that capture the generative structures in the network (27). However, up till now, these models have been restricted to observed network data. In many applications the data are not observed networks, for example in biology (e.g. microarray gene expression data and cell signaling data) and in neuroscience (e.g. fMRI data). Is it possible to extend the ERGMs to those types of data? A possible solution is to combine the class of ERGMs with graphical models. Graphical models (12) provide an appealing and insightful way to obtain uncertainty estimates when inferring network structure. The close relationship between the topology of the underlying graphs and their probabilistic properties is a main aspect of graphical models, and it provides tools to interpret complex models. In this regard, Bayesian approaches provide particularly straightforward tools, and much recent progress has been made in graphical models (9, 5, 14, 32, 16).
More recently, (16) have developed a search algorithm based on a birth-death MCMC approach that works with high-dimensional data. However, graphical models are powerful approaches only for estimating the underlying graph structure; they are not designed to model the network graphs. In this paper, we develop a new Bayesian statistical framework for ERGMs which is capable of network modeling with non-observed networks. The proposed method greatly extends the scope of the ERGMs to more applied research areas, not limited to social science. In our method, to apply the ERGMs to non-observed network data, we combine the class of ERGMs with graphical models capable of modelling non-observed networks. In particular, in our Bayesian framework, we design a computationally efficient search algorithm that explores the whole graph space to distinguish not only important edges


but also key features, and to detect the underlying graph structure. This search algorithm is based on the birth-death Markov chain Monte Carlo algorithm proposed by (16) for Gaussian graphical models. The paper is organized as follows. In section 5.3 we briefly explain the exponential random graph models and graphical models. In section 5.4 we introduce our proposed Bayesian framework for exponential random graph models with non-observed networks.

5.3 Exponential families and graphical models

In this section we briefly review two probabilistic families of graphical models which we will use in our proposed methodology; for more details see e.g. (29).

5.3.1 Exponential random graph models

Exponential random graph models (ERGMs), or p∗ models (34), are a family of statistical models that provide a flexible way to model the complex dependence structure of networks. They are the most popular models for social networks and are also used in physics and biology (25). The aim is to model data as observed networks consisting of nodes and edges, which in the social network context represent actors and relationships between these actors, respectively. There has been comparatively little research on using a Bayesian approach to infer the parameters of ERGMs, besides recent articles by (26, 11, 2, 3). See (21) and (22) for an overview of ERGMs. In an ERGM, the random matrix G = {g_ij} is defined over the graph space on a set of p nodes, with each variable in G representing the presence or absence of a particular edge (g_ij = 1 if there is a link from i to j, and g_ij = 0 otherwise). Edges connecting a node to itself are not allowed, so g_ii = 0. For a graph (which consists of a set of edges over the set of nodes), the ERGM is then given by

P(G | θ) = (1/Z(θ)) exp{θ^t S(G)},   (5.1)

where θ ∈ Θ represents a vector of unknown parameters, and Z(θ) is a normalizing constant,

Z(θ) = Σ_{G ∈ G_p} exp{θ^t S(G)},

and S(G) is a vector of network statistics of interest that gives the ERGM much of its explanatory power. The vector S(G) can contain statistics that capture the generative structures of connectivity in the network. It can include, for instance,
• the number of edges (Σ_{i,j} g_ij) to capture network density;
• the number of triangles (Σ_{i,j,l} g_ij g_il g_jl) to capture transitivity;
• the number of 2-stars (Σ_{i,j,l} g_il g_jl), where a k-star (k ≥ 2) is a node with k neighbors, i.e. a node of degree k,
and a wide variety of other endogenous structures (27). Estimating ERGM parameters is a challenging problem, since the analytic form of the normalizing constant Z(θ) is unknown due to the combinatorial complexity of summing over all 2^{p(p−1)/2} possible graphs in G_p (2, 3, 8). Although ERGMs are difficult to handle in practice, they are quite popular in the literature, since they are conceived to capture the complex dependence structure of the graph and allow a reasonable interpretation of the observed data. At the basis of this class of models, the dependence hypothesis is that edges self-organize into specific structures called configurations (e.g. triangles, k-stars). To adapt flexibly to different network contexts, there is a broad range of possible network configurations (34, 22). A positive value of θ_i ∈ θ results in a tendency for the specific configuration corresponding to S_i(G) to be observed in the data more often than would be expected by chance. Note that in ERGMs the data are observed networks, which is a strong limitation. In most applications, data are measured on variables, such as gene expression data and cell signalling data. The question we intend to answer is whether it is possible to extend the idea of ERGMs to those types of data. We extend the ERGMs to non-observed networks by combining them with graphical models.
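The network statistics named above are straightforward to compute for a given adjacency matrix. The following Python sketch (unordered counts; the function names are ours) counts edges, 2-stars and triangles and evaluates the unnormalized ERGM weight exp{θ^t S(G)}:

```python
import math
from itertools import combinations

def ergm_stats(adj):
    """Edges, 2-stars and triangles of an undirected adjacency matrix
    with zero diagonal (unordered counts)."""
    p = len(adj)
    edges = sum(adj[i][j] for i, j in combinations(range(p), 2))
    two_stars = sum(adj[i][l] * adj[j][l]
                    for l in range(p)
                    for i, j in combinations(range(p), 2)
                    if i != l and j != l)
    triangles = sum(adj[i][j] * adj[i][l] * adj[j][l]
                    for i, j, l in combinations(range(p), 3))
    return edges, two_stars, triangles

def ergm_weight(adj, theta):
    """Unnormalized ERGM probability exp{theta' S(G)}; the normalizing
    constant Z(theta) would require summing over all 2^(p(p-1)/2) graphs."""
    return math.exp(sum(t * s for t, s in zip(theta, ergm_stats(adj))))

# A triangle on 3 nodes has 3 edges, 3 two-stars and 1 triangle.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
assert ergm_stats(tri) == (3, 3, 1)
```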

5.3.2 Graphical models

Graphical models (12) use a graph to represent conditional dependence relationships among random variables, as non-observed networks. When the observed data come from noisy measurements of the variables, graphical models present an appealing and insightful way to describe graph-based dependencies between the random variables. A graph G = (V, E) denotes a set of vertices V = {1, 2, ..., p} – where each node corresponds to a random variable – and a set of existing edges E; Ē denotes the set of non-existing edges. We are interested in undirected graphical models, in which (i, j) ∈ E is equivalent to (j, i) ∈ E, also known as Markov random fields (24). In this class of models, nodes in the graph G correspond to the random variables. The absence of an edge between two nodes means that the two corresponding variables are conditionally independent given the remaining variables, while an edge between the two nodes indicates the conditional dependence of the variables. Graphical models that follow the multivariate Gaussian distribution are called Gaussian graphical models (GGMs), also known as covariance selection models (4). In GGMs, zero entries in the precision matrix correspond to the absence of edges in the graph, that is, to conditional independence between pairs of random variables given the rest. With respect to the graph G, a zero-mean Gaussian graphical model is

M_G = { N_p(0, Σ) | K = Σ^{−1} ∈ P_G },

where P_G denotes the space of p × p positive definite matrices with entries equal to zero for the non-existing edges in graph G. Let X = (X^1, ..., X^n) be an independent and identically distributed sample of size n from model M_G. Then the likelihood is

P(X | K, G) ∝ |K|^{n/2} exp{−(1/2) tr(KU)},   (5.2)

where U = X′X. For a p-dimensional graph, the size of the graph space is in total 2^{p(p−1)/2}, which grows super-exponentially with the number of nodes in the graph. Thus, Bayesian inference over the whole graph space is severely limited by the nature of the graph space. In this regard, there are efficient stochastic search algorithms that can explore the graph space (16, 5, 9). These types of search algorithms explore the graph space by adding or removing one edge in each step, known as neighborhood search algorithms. These algorithms can potentially work with graphs of more than 100 nodes (9, 16, 17).
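The likelihood (5.2) is simple to evaluate for small p. A Python sketch for the bivariate case (illustrative only; the data and precision matrix below are arbitrary):

```python
import math

def ggm_loglik_2d(K, X):
    """Log-likelihood of a zero-mean bivariate GGM up to a constant:
    (n/2) log|K| - (1/2) tr(KU) with U = X'X, as in (5.2)."""
    n = len(X)
    # U = X'X for p = 2
    U = [[sum(x[i] * x[j] for x in X) for j in range(2)] for i in range(2)]
    det_K = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    trace_KU = sum(K[i][j] * U[j][i] for i in range(2) for j in range(2))
    return 0.5 * n * math.log(det_K) - 0.5 * trace_KU

X = [(1.0, 0.2), (-0.5, 0.1), (0.3, -0.7)]
# Positive definite precision matrix; a zero off-diagonal entry would mean
# the two variables are conditionally independent.
K = [[2.0, 0.5], [0.5, 2.0]]
assert math.isfinite(ggm_loglik_2d(K, X))
```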

5.4 Bayesian hierarchical model for ERGMs with non-observed networks

In this section, we propose a hierarchical Bayesian methodology to discover the underlying network structure and the features which are important. We can display the hierarchical model schematically as

θ −→ G −→ K −→ X = (X1, ..., Xn).

Thus, we consider the joint posterior distribution of the parameters as below:

P(θ, G, K | X) ∝ P(X | θ, G, K) P(K | G) P(G | θ) P(θ).   (5.3)

In our methodology, we assume that the observed data follow a multivariate Gaussian distribution.

5.4.1 Model for prior specification on graph

Here, by using the idea of exponential random graphs, we place a prior on the graph as follows:

P(G | θ) = (1/Z(θ)) exp{θ^t S(G)},   (5.4)

where S(G) is a vector of statistics of the graph (e.g., the number of edges, triangles, etc.) and θ ∈ Θ denotes the parameter vector of the model.

5.4.2 Prior specification on precision matrix

For the prior distribution of the precision matrix, we use the G-Wishart (23) distribution. In Gaussian graphical models, the G-Wishart prior distribution is highly attractive since it is conjugate to normally distributed data and places no probability mass on the zero entries of the precision matrix. The G-Wishart distribution WG(b, D) for a random matrix K is defined as

P(K | G) = (1/I_G(b, D)) |K|^{(b−2)/2} exp{−(1/2) tr(DK)},   K ∈ P_G,

where b > 2 is the degrees of freedom, D is a symmetric positive definite matrix, and I_G(b, D) is the normalizing constant,

I_G(b, D) = ∫_{P_G} |K|^{(b−2)/2} exp{−(1/2) tr(DK)} dK.

For a complete graph G, the G-Wishart distribution reduces to the Wishart distribution, hence its normalizing constant has an explicit form (18). Also, for decomposable graphs, I_G(b, D) has an explicit form (23). However, for non-decomposable graphs, I_G(b, D) has an intractable form (28). Since the G-Wishart prior is conjugate to the likelihood (5.2), the posterior distribution of K is

P(K | X, G) = (1/I_G(b*, D*)) |K|^{(b*−2)/2} exp{−(1/2) tr(D*K)},

where b∗ = b + n and D∗ = D + U , that is, WG (b∗ , D∗ ). To consider other possible prior for precision matrix see e.g. (33, 31, 30), and (35). They place no probability mass on zero entries and stable priors on the non-zero entries of the precision matrix.

5.4.3 MCMC sampling scheme

The high dimensionality of the graph space calls for an MCMC sampler that can potentially explore the whole graph space. We therefore introduce an MCMC algorithm that simulates from the joint posterior distribution (5.3). The proposed algorithm consists of three steps:

Step 1: Sample θ, based on the exchange algorithm (19).
Step 2: Sample from the graph space, based on the birth-death MCMC sampling algorithm proposed by (16).
Step 3: Sample the precision matrix, based on the exact sampling algorithm from the G-Wishart distribution proposed by (13).

For step 1, we illustrate below how to sample from the conditional distribution of θ with the exchange algorithm (19). For step 2, we use the computationally efficient birth-death MCMC sampler proposed by (16) for Gaussian graphical models. This algorithm explores the graph space by adding or deleting an edge in a birth or death event, where the events follow a continuous-time birth-death Markov process. In a graph G = (V, E) with precision matrix K, each edge e = (i, j) ∈ E dies independently of the others as a Poisson process with death rate δe(K). Since the events are independent, the overall death rate is δ(K) = Σ_{e∈E} δe(K). Similarly, each non-edge e = (i, j) ∉ E appears independently as a Poisson process with birth rate βe(K), and the overall birth rate is β(K) = Σ_{e∉E} βe(K). The births and deaths of edges occur in continuous time, with rates chosen so that the stationary distribution of the process equals the target posterior distribution. Since the birth-death processes are independent Poisson processes, the time between two successive events is exponentially distributed with mean 1/(β(K) + δ(K)). This time can be viewed as the time the process spends in any particular state of the graph space. Therefore, the probabilities of birth and death events are proportional to their rates,


P(birth of edge e) = βe(K) / (β(K) + δ(K)),    for each e ∉ E,    (5.5)

P(death of edge e) = δe(K) / (β(K) + δ(K)),    for each e ∈ E.    (5.6)

Mohammadi and Wit (16, Section 3) prove that the birth-death MCMC sampling algorithm converges to the target posterior distribution when the birth and death rates are chosen as

βe(K) = P(G^{+e}, K^{+e} | X, θ) / P(G, K | X, θ),    for each e ∉ E,    (5.7)

δe(K) = P(G^{−e}, K^{−e} | X, θ) / P(G, K | X, θ),    for each e ∈ E,    (5.8)

in which G^{+e} = (V, E ∪ {e}) and K^{+e} ∈ P_{G^{+e}}, and similarly G^{−e} = (V, E \ {e}) and K^{−e} ∈ P_{G^{−e}}. Based on the above, we summarize the proposed sampling algorithm below.

Algorithm 5.1. Given the current state (θ, G, K):

1. Update θ, using a Metropolis step:
   1.1. Draw θ* from a symmetric proposal distribution h(θ* | θ).
   1.2. Draw G* ∼ p(G | θ*).
   1.3. Accept the exchange move with probability
        α(θ → θ*) = min{ 1, [q_{θ*}(G) P(θ*) q_θ(G*)] / [q_θ(G) P(θ) q_{θ*}(G*)] }.
2. Update G, conditional on θ, based on the birth-death process:
   2.1. Calculate the birth rates by equation (5.7) and β(K) = Σ_{e∉E} βe(K).
   2.2. Calculate the death rates by equation (5.8) and δ(K) = Σ_{e∈E} δe(K).
   2.3. Calculate the waiting time by W(K) = 1/(β(K) + δ(K)).
   2.4. Simulate the type of jump, with probabilities (5.5) and (5.6).
3. Update K, conditional on the new G, by sampling a new precision matrix.

In Algorithm 5.1, the first step is to sample θ using the exchange algorithm, which we explain below. In step 2, we calculate the birth and death rates and the waiting time, and based on the rates we determine the type of jump; details of how to calculate the birth and death rates efficiently are discussed below. Finally, in step 3, according to the new state after the jump, we sample a new precision matrix using the direct sampling scheme from the G-Wishart distribution proposed by (13).

Step 1: Updating parameter θ

In step 1 of Algorithm 5.1, we are interested in sampling from the conditional distribution of θ,

P(θ | G) = (1 / Z(θ)) exp{θ^t S(G)} p(θ).    (5.9)
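To make the exchange-type update concrete, the following toy sketch (our own illustration with hypothetical numbers, not the thesis implementation) runs Metropolis updates of a scalar θ for an ERGM on 3 nodes. With only 3 potential edges the 8 possible graphs can be enumerated, so the required draw of the auxiliary graph is exact; note how the intractable Z(θ) cancels from the acceptance ratio.

```python
import math
import random

random.seed(1)

# Toy ERGM on 3 nodes with S(G) = number of edges; with only 3 potential
# edges there are 8 graphs, so exact sampling from P(G | theta) is easy.
GRAPHS = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def log_q(theta, g):
    """log of the unnormalized probability exp{theta * S(G)}."""
    return theta * sum(g)

def sample_graph(theta):
    """Exact draw from P(G | theta) by enumeration of all 8 graphs."""
    weights = [math.exp(log_q(theta, g)) for g in GRAPHS]
    return random.choices(GRAPHS, weights=weights)[0]

def exchange_step(theta, g_obs, step=0.5):
    """One exchange update of theta, with a Gaussian prior log p(theta) = -theta^2/2."""
    theta_star = theta + random.uniform(-step, step)  # symmetric proposal
    g_aux = sample_graph(theta_star)                  # auxiliary exact draw
    log_prior = lambda t: -0.5 * t * t
    # Z(theta) and Z(theta_star) cancel in this ratio.
    log_alpha = (log_q(theta_star, g_obs) + log_prior(theta_star) + log_q(theta, g_aux)
                 - log_q(theta, g_obs) - log_prior(theta) - log_q(theta_star, g_aux))
    return theta_star if math.log(random.random()) < min(0.0, log_alpha) else theta

theta, g_obs = 0.0, (1, 1, 0)   # observed graph with 2 of the 3 possible edges
for _ in range(2000):
    theta = exchange_step(theta, g_obs)
print(-10 < theta < 10)          # the chain stays in a sensible range
```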

Sampling from this conditional distribution is difficult, since it requires the evaluation of the intractable normalizing constant Z(θ) = Σ_{G∈G_p} exp{θ^t S(G)}. Murray et al. (19) introduced the exchange algorithm, based on exact sampling, which is designed for MCMC algorithms whose target posterior distributions contain an additional intractable normalizing constant. To circumvent this constant, Caimo and Friel (2) explain how to use the idea behind the exchange algorithm. Suppose that G is the current state of our algorithm and we would like to sample θ from (5.9). First we sample θ* from a symmetric proposal distribution h(θ* | θ). Then we sample G*, which is the difficult step of the algorithm, since it requires a draw from (5.4). Note that the exchange algorithm requires exact sampling of G*. Following (2), we approximate the exact simulation by sampling G* from P(G | θ*) using an MCMC run that is "long enough" to obtain a point that can be treated as if it were simulated exactly from P(G | θ*). They suggested that 500 iterations is a long enough run; (6) shows that as few as 50 or 100 iterations are usually sufficient.

Computing the birth and death rates

Calculating the birth and death rates (5.7) and (5.8) is the bottleneck of our BDMCMC algorithm. Here we explain how to calculate the death rates efficiently; the birth rates follow in a similar manner. For more details see (15, 16). Following (16), after some simplification, for each e = (i, j) ∈ E we have

δe(K) = [P(G^{−e} | θ) / P(G | θ)] [I_G(b, D) / I_{G^{−e}}(b, D)] ( D*_{jj} / (2π (k_{ii} − k¹_{11})) )^{1/2} H(K, D*),

where

P(G^{−e} | θ) / P(G | θ) = exp{ θ^t [S(G^{−e}) − S(G)] }

and

H(K, D*) = exp{ −(1/2) [ tr(D*_{e,e}(K⁰ − K¹)) − (D*_{ii} − (D*_{ij})² / D*_{jj})(k_{ii} − k¹_{11}) ] },

in which K⁰ is a 2 × 2 diagonal matrix with k⁰_{11} = k_{ii} and k⁰_{22} = K_{j,V∖j} (K_{V∖j,V∖j})^{−1} K_{V∖j,j}, and K¹ = K_{e,V∖e} (K_{V∖e,V∖e})^{−1} K_{V∖e,e}. Evaluating the ratio of prior normalizing constants is the computational bottleneck for computing the birth/death rates.

Explicit form for the ratio of prior normalizing constants

Calculating the ratio of prior normalizing constants has been a major computational bottleneck in the Bayesian literature (28, 32, 16, 15). Recently, (28) provided an explicit representation of this intractable normalizing constant, and (15) implemented this result in a sampling algorithm for graphical models. By the theorem derived by (28, Theorem 3.7), for the special case in which D is an identity matrix, we have

I_G(b, I_p) / I_{G^{−e}}(b, I_p) = 2√π Γ((b + d + 1)/2) / Γ((b + d)/2),

where d denotes the number of triangles formed by the edge e and two other edges in G, and I_p denotes the p-dimensional identity matrix. Therefore, for the case D = I_p, we have a simplified formula for the death rates, given by

δe(K) = exp{ θ^t [S(G^{−e}) − S(G)] } [Γ((b + d + 1)/2) / Γ((b + d)/2)] ( 2 D*_{jj} / (k_{ii} − k¹_{11}) )^{1/2} H(K, D*).

5.5 Discussion

We have proposed a Bayesian methodology for exponential random graph models with non-observed networks, which opens a large toolbox for network modelling of non-observed data. By combining exponential random graph models and graphical models, we have developed a hierarchical Bayesian framework. The framework proposed here need not be limited to data satisfying the Gaussianity assumption; one possible extension of our work is to Gaussian copula graphical models (15). In our Bayesian framework, we focus on undirected graphical models; an extension to directed graphical models is an ongoing research subject.

References


[1] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457.

[2] Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Networks, 33(1):41–55.

[3] Caimo, A. and Friel, N. (2013). Bayesian model selection for exponential random graph models. Social Networks, 35(1):11–24.

[4] Dempster, A. (1972). Covariance selection. Biometrics, 28(1):157–175.

[5] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418–1433.

[6] Everitt, R. G. (2012). Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4):940–960.

[7] Goodreau, S. M. (2007). Advances in exponential random graph (p*) models applied to a large social network. Social Networks, 29(2):231–248.

[8] He, R. and Zheng, T. (2015). GLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks. Social Network Analysis and Mining, 5(1):1–19.

[9] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4):388–400.

[10] Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Science & Business Media.

[11] Koskinen, J. H., Robins, G. L., and Pattison, P. E. (2010). Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology, 7(3):366–384.

[12] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.

[13] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1):119–128.


[14] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.

[15] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modeling of Dupuytren disease using copula Gaussian graphical models. arXiv preprint arXiv:1501.04849v2.

[16] Mohammadi, A. and Wit, E. C. (2015a). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138.

[17] Mohammadi, A. and Wit, E. C. (2015b). BDgraph: Bayesian structure learning of graphs in R. arXiv preprint arXiv:1501.05108v2.

[18] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley Online Library.

[19] Murray, I., Ghahramani, Z., and MacKay, D. (2012). MCMC for doubly-intractable distributions. arXiv preprint arXiv:1206.6848.

[20] Newman, M. E., Watts, D. J., and Strogatz, S. H. (2002). Random graph models of social networks. Proceedings of the National Academy of Sciences, 99(suppl 1):2566–2572.

[21] Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007a). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191.

[22] Robins, G., Snijders, T., Wang, P., Handcock, M., and Pattison, P. (2007b). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2):192–215.

[23] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411.

[24] Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press.

[25] Saul, Z. M. and Filkov, V. (2007). Exploring biological network structure using exponential random graph models. Bioinformatics, 23(19):2604–2611.


[26] Snijders, T. A. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3(2):1–40.

[27] Snijders, T. A., Pattison, P. E., Robins, G. L., and Handcock, M. S. (2006). New specifications for exponential random graph models. Sociological Methodology, 36(1):99–153.

[28] Uhler, C., Lenkoski, A., and Richards, D. (2014). Exact formulas for the normalizing constants of Wishart distributions for graphical models. arXiv preprint arXiv:1406.4901.

[29] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

[30] Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886.

[31] Wang, H. (2014). Scaling it up: Stochastic search structure learning in graphical models.

[32] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6:168–198.

[33] Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3):689–707.

[34] Wasserman, S. and Pattison, P. (1996). Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61(3):401–425.

[35] Wong, F., Carter, C. K., and Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika, 90(4):809–830.

Chapter 6: Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service¹

6.1 Abstract

This paper proposes a Bayesian framework for an M/G/1 queuing system with optional second service. A semi-parametric model based on a finite mixture of Gamma distributions is considered to approximate both the general service and re-service time densities in this queuing system. A Bayesian procedure based on birth-death MCMC methodology is proposed to estimate the system parameters, predictive densities and some performance measures related to this queuing system, such as the stationary system size and waiting time. The approach is illustrated with several numerical examples based on various simulation studies.

Key words: Gamma mixtures, Bayesian inference, MCMC, Birth-death MCMC, Predictive distribution, M/G/1 queue, Optional service.

¹ Published as: Mohammadi A., M. Salehi-Rad, and E. Wit (2013). Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service, Computational Statistics 28(2), 683-700.


6.2 Introduction

Bayesian analysis of queuing systems is a relatively recent research area; some useful references are (1, 2, 17, 3, 4, 5, 20). In the queuing systems considered there, customers depart the system after receiving their service. But in many applied queuing systems, some customers need to be re-serviced after receiving their main service. For example, in a production line, some items might fail and require repair. In these kinds of problems, we must re-service some items. The primary aim of this paper is to propose a Bayesian inference scheme for an M/G/1 queuing system in which some customers need re-servicing, with probability p. In this queuing system we have a single service unit, to which customers arrive according to a Poisson process and demand service with a general service-time distribution, and some customers need re-service. From a classical queuing theory perspective, this queuing system has been studied by (23, 24), who considered three alternatives for re-servicing and obtained the mean busy period, the probability of the idle period and the probability generating function (pgf) of the steady-state system size.

The main contribution of this paper is to introduce a semi-parametric model for the general densities of service and re-service times based on a mixture of Gamma distributions, providing an alternative Bayesian approach for approximating the general distributions in queuing systems based on former work. Secondly, we introduce a Bayesian algorithm based on the birth-death MCMC approach of (26) in order to fit this model to data.

The use of finite mixture distributions is very common, and the Bayesian approach provides an important tool in semi-parametric density estimation; see for instance (9, 22). Recently, MCMC methods for fully Bayesian mixture models of unknown dimension have been proposed; see (10). (13) introduced the reversible jump technique (RJ-MCMC) and (21) used this methodology to analyze Normal mixtures. This type of algorithm was used by (17) for mixtures of Exponential distributions, (29) for mixtures of Gamma distributions and (3) for mixtures of Erlang distributions. More recently, in this context, (26) rekindled interest in the use of continuous-time birth-death methodology (BD-MCMC) for variable-dimension problems. This methodology was used by (4) for mixtures of Erlang distributions and (20) for mixtures of Pareto distributions. Moreover, (6) investigated the similarity between the reversible jump and birth-death methodologies.

The paper is structured as follows. In Section 6.3, we describe the M/G/1 queuing system with optional second service, where we consider a mixture of Gamma distributions to approximate the general densities of service and re-service times; we then use some results obtained by (23, 24), which allow us to estimate the mean number of customers in the system, the mean busy period and the probability of the idle period for this queuing system. In Section 6.4, we explain our Bayesian approach by defining prior distributions, obtaining conditional distributions and proposing a birth-death MCMC algorithm to obtain a sample from the posterior distributions of the parameters of the predictive service and re-service time distributions. In Section 6.5 we explain how to approximate the general densities of service and re-service times using the data generated from the birth-death MCMC algorithm. In Section 6.6, we demonstrate how to estimate the system parameters and some performance measures of our queuing system from the BD-MCMC output. In Section 6.7, we illustrate our methodology with several simulation studies. We conclude the paper with a discussion of various extensions in Section 6.8.

6.3 The M/G/1 queuing system with optional second service

Throughout, we consider an M/G/1 queue in which some customers need re-service with probability p, with a First Come First Served discipline and independence between inter-arrival and service times. In this queuing system, failed items are stockpiled in a failed queue (FQ) and re-serviced only after all customers in the main queue (MQ) are serviced. After completion of re-service of all items in FQ, the server returns to MQ if there are any customers waiting in MQ; otherwise, the server is idle. So, in this queuing system we have two queues and one server.

The variable T is the inter-arrival time, with an exponential distribution. For service times, we suppose that service (S) and re-service (S̃) times are independent and have general distributions, denoted by B1(.) and B2(.), with means µ1, µ2 and variances δ1, δ2, respectively. For these general distributions, we need a model flexible enough to deal with typical features of service and re-service time distributions (skewness, multimodality, lots of mass near zero) which still permits the usual computations in queuing applications. Thus, we propose a semi-parametric model based on a mixture of Gamma distributions within a Bayesian framework. If S is a service time, we assume

B1(s | θ1) = Σ_{i=1}^{k1} π_{1i} G(s | α_{1i}, β_{1i}),    s > 0,

where θ1 = (k1, π1, α1, β1), k1 is the unknown number of mixture components, π1 = (π_{11}, π_{12}, ..., π_{1k1}) are the weights and G(s | α_{1i}, β_{1i}) is the Gamma density function, for i = 1, ..., k1, that is,

G(s | α_{1i}, β_{1i}) = ( β_{1i}^{α_{1i}} / Γ(α_{1i}) ) s^{α_{1i} − 1} e^{−β_{1i} s},    s > 0.

Likewise, if S̃ is a re-service time, we have

B2(s̃ | θ2) = Σ_{i=1}^{k2} π_{2i} G(s̃ | α_{2i}, β_{2i}),    s̃ > 0,

where θ2 = (k2, π2, α2, β2), and k2 and π2 = (π_{21}, π_{22}, ..., π_{2k2}) have the same interpretation as for the service-time density. Thus, we have a queuing system with two queues, a main queue and a failed queue, and one server. The parameters of this queuing system are (λ, θ1, θ2, p), in which λ is the rate of the inter-arrival times and p is the probability of an item needing re-service.

6.3.1 Some performance measures in this queuing system

We assume that the queuing system is in equilibrium. This assumption is equivalent to assuming that the traffic intensity, ρ, is less than one (18). For this queuing system ρ = ρ1 + pρ2, in which ρ1 = λµ1 is the traffic intensity in MQ and ρ2 = λµ2 is the traffic intensity in FQ. As a result,

ρ = λ(µ1 + pµ2).    (6.1)

Under this steady-state condition, the other performance measures can be obtained.

The expectation of the busy period and probability of the idle period

To obtain the expectation of the busy period for this queuing system, we have

E[busy period] = E[busy period in MQ] + p E[busy period in FQ],

where the first expression equals µ1/(1 − λµ1) and the second pµ2/(1 − λµ1). Therefore, after some computation,

E[busy period] = (µ1 + p²µ2) / (1 − λµ1).    (6.2)

Furthermore, the probability of the idle period is

Pr(idle period) = (1 − λµ1) / (1 + p²λµ2).    (6.3)
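As a quick numerical check of (6.1)-(6.3), with illustrative parameter values of our own choosing:

```python
# Illustrative values: arrival rate, mean service/re-service times, re-service prob.
lam, mu1, mu2, p = 0.8, 0.6, 0.5, 0.2

rho = lam * (mu1 + p * mu2)                    # (6.1) traffic intensity
assert rho < 1, "system must be stable"

e_busy = (mu1 + p**2 * mu2) / (1 - lam * mu1)       # (6.2) expected busy period
pr_idle = (1 - lam * mu1) / (1 + p**2 * lam * mu2)  # (6.3) idle-period probability

print(round(rho, 3), round(e_busy, 3), round(pr_idle, 3))   # 0.56 1.192 0.512
```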

For more details see (23, 24).

The expectation of the system size

For our queuing system, suppose that Xn is the number of customers remaining in MQ at the completion of the nth customer's service time, and Yn is the number of customers remaining in FQ at the completion of the nth customer's service time, in the steady state. Using the joint probability generating function of (Xn, Yn), (19) obtained the following expressions for the mean system size.

Theorem 6.1. (19) The mean numbers of customers in MQ and FQ are

i)
E(Xn) = ρ1 + [ p(λ²δ2 + ρ2²) + λ²δ1 + ρ1² ] / (2(1 − ρ1))
        + pρ2² [ 2 + (λ²δ1 + ρ1(2 − ρ1)) / (1 − ρ1) ] / ( 2[ pρ2 + (1 − ρ1)Ψ(1 − p + pB2*(λ)) ] ),    (6.4)

ii)
E(Yn) = p[ 2(1 − ρ1)² + λ²δ1 + ρ1(1 − ρ1) ] / ( 2[ pρ2 + (1 − ρ1)Ψ(1 − p + pB2*(λ)) ] ),    (6.5)

respectively, where Ψ(u) satisfies

Ψ(u) = u B1*( λ(1 − Ψ(u)) ),

and B1*(.) and B2*(.) are the Laplace-Stieltjes transforms (LST) of the service and re-service time densities, respectively.

Given the mixture of Gamma distributions for the service and re-service times, the LSTs of the service and re-service times are

Bj*(t) = Σ_{i=1}^{kj} π_{ji} ( β_{ji} / (t + β_{ji}) )^{α_{ji}},    j = 1, 2,

and the variances of the service and re-service times are

δj = Σ_{i=1}^{kj} π_{ji}² ( α_{ji} / β_{ji}² ),    j = 1, 2.

6.4 Bayesian inference

In this section we propose a Bayesian approach to infer the system parameters (λ, θ1, θ2, p). We observe nt inter-arrival times t = {ti}_{i=1}^{nt}, ns1 service times s1 = {s1i}_{i=1}^{ns1}, ns2 re-service times s2 = {s2i}_{i=1}^{ns2} and np indicators u = {ui}_{i=1}^{np}, in which ui = 1 if customer i needs re-service and ui = 0 otherwise. We assume independence between the arrivals, service times, re-service times and the probability of re-service.

For the arrival rate, λ, we assume a Gamma prior distribution, λ ∼ G(ξ, ψ). Conditional on the arrival data, the posterior distribution of λ is also Gamma distributed, G(ξ + nt, ψ + Σ_{i=1}^{nt} ti). For the parameter p, we assume a Beta prior distribution, p ∼ Beta(a, b). The posterior distribution of p given u is Beta(a + Σ_{i=1}^{np} ui, b + np − Σ_{i=1}^{np} ui). In the following section, we propose a Bayesian framework for the mixture of Gamma distributions to approximate the general distributions of the service and re-service times, B1(.) and B2(.).
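The two conjugate updates above can be sketched directly (an illustration with made-up hyperparameters and data, not thesis code):

```python
# Conjugate posterior updates for the arrival rate lambda and the
# re-service probability p.
xi, psi = 1.0, 1.0          # Gamma prior G(xi, psi) for lambda
a, b = 1.0, 1.0             # Beta prior Beta(a, b) for p

t = [1.2, 0.8, 2.1, 0.5, 1.4]       # observed inter-arrival times
u = [1, 0, 0, 1, 0]                 # re-service indicators

# lambda | t ~ G(xi + n_t, psi + sum t_i)
xi_post, psi_post = xi + len(t), psi + sum(t)
# p | u ~ Beta(a + sum u_i, b + n_p - sum u_i)
a_post, b_post = a + sum(u), b + len(u) - sum(u)

print(xi_post, round(psi_post, 6))   # 6.0 7.0
print(a_post, b_post)                # 3.0 4.0
print(round(xi_post / psi_post, 3))  # posterior mean of lambda: 0.857
```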

6.4.1 Bayesian inference for the finite Gamma mixture

To set up a Bayesian framework for the general distributions, based on a mixture of Gamma distributions, we assume that

B(s | θ) = Σ_{i=1}^{k} πi G(s | αi, βi),    s > 0,

where θ = (k, π, α, β). First, as is usually done in mixture models (e.g. (9)), we use a data augmentation algorithm, introducing for each datum Sj a component indicator variable Zj such that

P(Zj = i | k, π) = πi,    i = 1, ..., k.

Then, the conditional service time density of Sj, given Zj, is Sj | Zj = i ∼ G(sj | αi, βi), for j = 1, ..., ns. Following (21), we assume that the joint prior distribution on the mixture Gamma parameters, θ = (k, π, α, β), can be factorized as

f(k, π, α, β, z) ∝ f(k) f(π | k) f(z | π, k) f(α | k) f(β | k).


To determine the prior distributions for the parameters of the mixture distribution, we first assume a truncated Poisson distribution for the mixture size, k:

P(K = k) ∝ γ^k / k!,    k = 1, ..., kmax.    (6.6)

We define prior distributions for the remaining parameters, given k, as

π | k ∼ D(φ1, ..., φk),    (6.7)

αi | k ∼ G(ν, υ),    i = 1, ..., k,    (6.8)

βi | k ∼ G(η, τ),    i = 1, ..., k,    (6.9)

where D(φ1, ..., φk) denotes a Dirichlet distribution with parameters φi > 0. Typically, we might set φi = 1 for all i = 1, ..., k, giving a uniform U(0, 1) prior for the weights. Given k and the data, s, it is straightforward to show that the required posterior conditional distributions for the MCMC algorithm are

P(Zj = i | s, k, π, α, β) ∝ πi G(sj | αi, βi),    i = 1, ..., k,

π | s, z, k ∼ D(φ1 + n1, ..., φk + nk),

βi | s, z, k ∼ G(η + ni αi, τ + Σ_{j: zj=i} sj),    i = 1, ..., k,

f(αi | s, z, k, β) ∝ ( βi^{αi} / Γ(αi) )^{ni} ( Π_{j: zj=i} sj )^{αi} αi^{ν−1} e^{−υαi},    (6.10)

where ni = #{zj = i} for i = 1, ..., k.

This mixture model is invariant to permutation of the labels i = 1, ..., k. For identifiability, it is important to adopt a unique labeling. Unless stated otherwise, we require that the πi are increasing; thus the prior distributions of the parameters are k! times the product of the individual densities, restricted to the set π1 < ... < πk. For more details see (27) and (25).

In order to sample from the posterior distributions, there are two main approaches in the context of mixture modeling with an unknown number of components: the reversible jump MCMC (RJ-MCMC) methodology of (13), and the birth-death MCMC (BD-MCMC) methodology of (26). (26) introduced continuous-time birth-death MCMC processes for variable-dimension problems, and (6) showed that the essential mechanism in this approach is the same as in the RJ-MCMC algorithm. Here we apply BD-MCMC, which is simpler to implement and which we have found to give better results in practice. We briefly outline this algorithm in the following section; for more details, including the construction of the BD-MCMC methodology, see (26) and (10).

6.4.2 BD-MCMC algorithm

In this subsection, we obtain a sample from the posterior distributions of the parameters θ = (k, π, α, β) by a BD-MCMC algorithm. This algorithm is based on a birth-death process and was introduced by (26) in the context of Normal mixtures. In this approach, the model parameters are interpreted as observations from a marked point process, and the mixture size, k, changes as births and deaths of mixture components occur in continuous time. The rates at which this happens determine the stationary distribution of the process.

In the birth-death process, births of mixture components occur at a constant rate, which we might set equal to the parameter γ from the prior distribution of k in (6.6). A birth increases the number of mixture components by one. Whenever a new component is born, its weight π* is generated from a Beta distribution with parameters (1, k) and the remaining parameters are sampled from their posterior distributions. To include the new component, the weights are scaled down proportionally so that all the weights, including the new one, sum to 1, i.e. πi := πi / (1 + π*). The death rate of each mixture component is a likelihood ratio of the model with and without this component, given by

∆j = Π_{r=1}^{ns} [ ( B(sr) − πj G(sr | αj, βj) ) / ( (1 − πj) B(sr) ) ],    j = 1, ..., k.    (6.11)

The total death rate of the process at any time, ∆ = Σj ∆j, is the sum of the individual death rates. A death decreases the number of mixture components by one. The birth and death processes are independent Poisson processes; thus, the time to the next birth/death event is exponentially distributed with mean 1/(∆ + γ), and a birth or death occurs with probability proportional to γ and ∆, respectively. With this explanation, we define a BD-MCMC algorithm based on (26) as follows. Step one of the algorithm is the birth-death process described above. Following (26), we have chosen t0 = 1 and a birth rate equal to the parameter γ. As expected, we have found in practice that larger values of the birth rate produce better mixing but require more computation time.
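For concreteness, the death rates (6.11) can be computed as follows (our own sketch with made-up parameter values; the product over observations is accumulated in log space to avoid underflow for large ns, and it assumes k > 1 so that the reduced mixture stays positive):

```python
import math

def gamma_pdf(s, alpha, beta):
    """Gamma density G(s | alpha, beta) with rate parameter beta."""
    return beta**alpha / math.gamma(alpha) * s**(alpha - 1) * math.exp(-beta * s)

def mixture_pdf(s, pi, alpha, beta):
    """B(s) = sum_i pi_i G(s | alpha_i, beta_i)."""
    return sum(p * gamma_pdf(s, a, b) for p, a, b in zip(pi, alpha, beta))

def death_rates(data, pi, alpha, beta):
    """Delta_j from (6.11): likelihood ratio of the model without vs. with
    component j, accumulated in log space for numerical stability."""
    rates = []
    for j in range(len(pi)):
        log_delta = 0.0
        for s in data:
            full = mixture_pdf(s, pi, alpha, beta)
            without = full - pi[j] * gamma_pdf(s, alpha[j], beta[j])
            log_delta += math.log(without) - math.log((1 - pi[j]) * full)
        rates.append(math.exp(log_delta))
    return rates

# Hypothetical 2-component mixture and a few observations.
pi, alpha, beta = [0.6, 0.4], [2.0, 9.0], [4.0, 2.0]
data = [0.3, 0.5, 0.4, 4.2, 5.0]
rates = death_rates(data, pi, alpha, beta)
print([r > 0 for r in rates])   # components needed by the data die slowly
```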


Algorithm 3.1. Starting with initial values k(0), π(0), α(0) and β(0), iterate the following steps:

1. Run the birth-death process for a fixed time t0 with birth rate γ:
   1.1. Compute the death rate for each component, ∆j, and the total death rate ∆ = Σj ∆j.
   1.2. Simulate the time to the next jump from an exponential distribution with mean 1/(∆ + γ).
   1.3. If the run time is less than t0, continue; otherwise proceed with step 2.
   1.4. Simulate the type of jump, birth or death, with probabilities
        Pr(birth) = γ / (γ + ∆),    Pr(death) = ∆ / (γ + ∆).
   1.5. Adjust the mixture components.
2. (MCMC steps conditional on k.) Update the latent variables by sampling from z(i+1) ∼ z | s, k(i+1), π(i), α(i), β(i).
3. Update the weights by sampling from π(i+1) ∼ π | s, k(i+1), z(i+1).
4. For r = 1, ..., k(i+1):
   4.1. Update βr by sampling from βr(i+1) ∼ βr | s, k(i+1), z(i+1).
   4.2. Update αr using a Metropolis-Hastings step.
5. Set i = i + 1 and go to step 1.

Steps 2, 3 and 4.1 are standard Gibbs sampling moves, whereby the model parameters are updated conditional on the mixture size, k. The only complicated step is 4.2, where we introduce a Metropolis-Hastings step (16) to sample from the posterior distribution of αr. Motivated by the shape of the target distribution (6.10), we propose a candidate from a Gamma distribution, G(ν, υ). With this proposal distribution, the acceptance probability for a candidate point α* becomes

Pr(αr, α*) = min{ 1, ( Γ(αr) / Γ(α*) )^{nr} ( βr^{nr} Π_{j: zj=r} sj )^{α* − αr} }.

Remark 3.1. Due to the overall similarity of the shapes of the proposal and target distributions, the acceptance rate is much better than in previous work (29, 3): in the simulation studies of Section 6.7, we obtain acceptance probabilities of around 20%, compared to 1% elsewhere.

Algorithm 3.1 produces a sample from the posterior distributions. Thus, we can run this BD-MCMC algorithm for the parameters of the service and re-service time densities, θ1 = (k1, π1, α1, β1) and θ2 = (k2, π2, α2, β2), respectively.

6.4.3

Model identification

The parameter k1 and k2 are model parameters, identifying models of a particular complexity. In this subsection, we discuss briefly how to perform “ideal Bayesian model identification”and a comparison with other approaches. There are many model selection criteria such as AIC, BIC, DIC, DIC+, MDL, Bayesian p-values and posterior predictive checks (for more details see (8)). But most of them are either unsuitable for mixture models or complex (7). This diversity of approaches, especially for variable-dimension parameters, reflects the different flavours of the model determination question that statisticians face. In reality there are a number of reasons why this simple idealized view fails to reflect practical applications. We briefly describe some fundamental issues that face the practitioner wishing to perform model choice for a real Bayesian problem. We omit further details as they are covered in (14) and (15). First at all, prior model probabilities may be fictional: the ideal Bayesian has real prior probabilities reflecting scientific judgment or belief across the model space. In practice, however, such priors may not be commonly available. Secondly, Bayesian models have no chance of passing the test of a sensitivity analysis: in ordinary parametric problems we commonly find that inferences are rather insensitive to moderately large variations in prior assumptions, except when there are very few data. In fact, the opposite case, of high sensitivity, poses a greater challenge to the non-Bayesian as perhaps the data carry less information than hoped. Moreover, there may be improper parameter prior problems: in ordinary parametric problems it is commonly true that it is safe to use improper priors, specifically when posterior distributions are well-defined as limits of a sequence of approximating proper priors. However, when comparing models, improper parameter priors make Bayes factors indeterminate. 
For this reason, we use the marginal posterior probabilities of k1 and k2 for model inference. In fact, rather than selecting the “best” model, these probabilities allow the Bayesian to use model averaging strategies or more qualitative model comparisons.

6.5

Predictive densities

By using the BD-MCMC algorithm we can produce samples from the posterior distributions of the service and re-service time distributions. Thus, given the BD-MCMC output of size N after a burn-in period, for θ1 and θ2, and suitable regularity conditions [see, e.g., (28, page 65)], these quantities of interest can be consistently estimated by sample path averages. We first estimate the mixture sizes of the service and re-service distributions, k1 and k2. The estimates of the marginal posterior distributions of k1 and k2 are

\Pr(K_r = k \mid \text{data}) = \lim_{N \to \infty} \frac{1}{N} \, \#\{ n : k_r^{(n)} = k \} \approx \frac{1}{N} \, \#\{ n : k_r^{(n)} = k \}, \qquad r = 1, 2. \tag{6.12}

This probability provides a tool for determining the number of phases of the service and re-service distributions. We can estimate the predictive density of the service and re-service time distributions using

\hat{B}_r(t \mid s_r) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{k_r^{(j)}} \pi_{ri}^{(j)} \, G\big(t \mid \alpha_{ri}^{(j)}, \beta_{ri}^{(j)}\big), \qquad r = 1, 2. \tag{6.13}

Also, the predictive density of the service and re-service time distributions can be estimated by the plug-in formula

\hat{B}_r(t \mid \hat{\theta}_r) = \sum_{i=1}^{\hat{k}_r} \hat{\pi}_{ri} \, G\big(t \mid \hat{\alpha}_{ri}, \hat{\beta}_{ri}\big), \qquad t > 0,

where \hat{k}_r has maximum posterior probability in (6.12). Note that in the case where the posterior distribution of k_r, α_r and β_r is fairly spread out or even multi-modal, these plug-in estimates would give a poor approximation of the predictive densities.
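To make the two estimators above concrete, the following sketch evaluates (6.12) and (6.13) from a hypothetical chain. The draws are synthetic stand-ins for the BD-MCMC output (the sampler itself is not reproduced here), and G(a, b) denotes a Gamma density with shape a and rate b, so the corresponding scipy call uses scale 1/b.

```python
import numpy as np
from scipy.stats import gamma
from collections import Counter

rng = np.random.default_rng(0)

# Hypothetical BD-MCMC output: sweep j keeps (pi, alpha, beta) for a
# mixture with k_j components; G(a, b) is a Gamma density with rate b.
draws = []
for _ in range(500):
    k = int(rng.integers(2, 4))          # mixture size changes over sweeps
    pi = rng.dirichlet(np.ones(k))
    alpha = rng.uniform(2.0, 12.0, size=k)
    beta = rng.uniform(0.5, 2.0, size=k)
    draws.append((pi, alpha, beta))

# (6.12): marginal posterior of the mixture size k as occupancy fractions
counts = Counter(len(pi) for pi, _, _ in draws)
post_k = {k: c / len(draws) for k, c in counts.items()}

# (6.13): predictive density averaged over all sweeps, on a grid of t
t = np.linspace(0.0, 60.0, 1201)
dens = np.zeros_like(t)
for pi, alpha, beta in draws:
    for p, a, b in zip(pi, alpha, beta):
        dens += p * gamma.pdf(t, a=a, scale=1.0 / b)
dens /= len(draws)

integral = dens.sum() * (t[1] - t[0])    # should be close to 1
print(post_k, round(integral, 3))
```

Averaging the mixture density over all sweeps, as in (6.13), is a Rao-Blackwellised estimate and is typically much smoother than a histogram of posterior predictive samples.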

6.6

Estimation of some performance measures via the BD-MCMC output

Given a sample realization of the MCMC output and samples from f(λ|t) and f(p|x) of equal size, we can estimate performance measures. For example, given the sample data, we would like to assess whether or not the system is stable. The system is stable if and only if the traffic intensity, ρ, is less than one. Thus, the estimate of the probability of having a stable system is

\Pr(\rho < 1 \mid t, x, s_1, s_2) \approx \frac{1}{N} \, \#\{ n : \rho^{(n)} < 1 \}, \tag{6.14}

where, according to (6.1),

\rho^{(n)} = \lambda^{(n)} \left( \sum_{i=1}^{k_1^{(n)}} \pi_{1i}^{(n)} \frac{\alpha_{1i}^{(n)}}{\beta_{1i}^{(n)}} + p^{(n)} \sum_{i=1}^{k_2^{(n)}} \pi_{2i}^{(n)} \frac{\alpha_{2i}^{(n)}}{\beta_{2i}^{(n)}} \right),

in which \{k_1^{(n)}, \pi_1^{(n)}, \alpha_1^{(n)}, \beta_1^{(n)}\}_{n=1}^{N} and \{k_2^{(n)}, \pi_2^{(n)}, \alpha_2^{(n)}, \beta_2^{(n)}\}_{n=1}^{N} are the samples of size N obtained from the BD-MCMC algorithm, and \{\lambda^{(n)}\}_{n=1}^{N} and \{p^{(n)}\}_{n=1}^{N} are the samples of size N generated from the posterior distributions of λ and p, respectively. A consistent estimator of the traffic intensity is

E(\rho \mid t, x, s_1, s_2) \approx E(\lambda \mid t) \, \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{i=1}^{k_1^{(n)}} \pi_{1i}^{(n)} \frac{\alpha_{1i}^{(n)}}{\beta_{1i}^{(n)}} + p^{(n)} \sum_{i=1}^{k_2^{(n)}} \pi_{2i}^{(n)} \frac{\alpha_{2i}^{(n)}}{\beta_{2i}^{(n)}} \right), \tag{6.15}

where E(\lambda \mid t) = (\xi + n_t)/(\psi + \sum_{i=1}^{n_t} t_i).

Moreover, by using the MCMC estimates of the system parameters, i.e. (λ, θ1, θ2, p), we can estimate ρ1, ρ2, ρ, δ1, δ2, the mean system sizes (6.4) and (6.5), the mean busy period (6.2) and the probability of an idle period of the system (6.3), as shown in the simulation study below.
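A minimal sketch of how (6.14) and (6.15) can be evaluated from posterior draws. All quantities below are synthetic placeholders for the BD-MCMC output and for draws from the posteriors of λ and p; only the per-draw computation of ρ follows formula (6.1).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000

# Synthetic posterior draws standing in for the real output:
# lam ~ posterior of the arrival rate, p ~ posterior of the re-service prob.
lam = rng.gamma(shape=50, scale=0.26 / 50, size=N)
p = rng.beta(30, 70, size=N)

def mean_service(pi, alpha, beta):
    # mean of a Gamma mixture: sum_i pi_i * alpha_i / beta_i (beta is a rate)
    return float(np.sum(pi * alpha / beta))

rho = np.empty(N)
for n in range(N):
    k1 = int(rng.integers(1, 4))
    k2 = int(rng.integers(1, 4))
    m1 = mean_service(rng.dirichlet(np.ones(k1)),
                      rng.uniform(2.0, 12.0, k1), rng.uniform(2.0, 6.0, k1))
    m2 = mean_service(rng.dirichlet(np.ones(k2)),
                      rng.uniform(2.0, 12.0, k2), rng.uniform(2.0, 6.0, k2))
    rho[n] = lam[n] * (m1 + p[n] * m2)   # formula (6.1), one draw at a time

p_stable = float(np.mean(rho < 1))       # Monte Carlo version of (6.14)
rho_hat = float(rho.mean())              # Monte Carlo analogue of (6.15)
print(p_stable, rho_hat)
```

The same loop extends directly to the other performance measures: each is a deterministic function of (λ, θ1, θ2, p), so evaluating it per draw and averaging gives the posterior estimate and its SD.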

6.7

Simulations

This section illustrates the accuracy of the Bayesian methodology in two simulation examples of the M/G/1 queuing system with optional second service. In the first example we assume that the true densities of the service and re-service times are both mixtures of Gamma distributions: a mixture of two Gamma distributions for the service times and a mixture of three Gamma distributions for the re-service times. In order to test how our methodology deals with model misspecification, the second example considers a more complicated model, in which the true density of the service times is a mixture of two truncated Normal distributions and the true density of the re-service times is a Log-Normal distribution. The R codes are available at http://www.math.rug.nl/stat under the research link and will soon be available as an R package.

6.7.1

Simulation study: mixture of Gammas

Without loss of generality, we assume that the inter-arrival rate, λ, is known and equal to 0.26 and that the probability of re-service, p, is also known and equal to 0.3. We consider a sample of 1,000 service data from a mixture of two Gamma distributions,

B1(s) = 0.6 G(12, 1) + 0.4 G(3, 2).

Also, for the re-service times, 1,000 data are simulated from a mixture of three Gamma distributions,

B2(s̃) = 0.6 G(100, 100/3) + 0.3 G(200, 50) + 0.1 G(300, 60).

We assumed a Poisson prior distribution for k1 with parameter γ = 2, truncated at 100, and for the remaining parameters we assume φ1 = · · · = φk = 1, ν = 2, υ = 1/√ν, and η = (s̄1/σ²_{s1}, s̄1²/σ²_{s1}) in (6.6), (6.8), and (6.9), respectively. For the re-service data we take γ = 3 and the other parameters the same as their service equivalents. For a service and re-service data set, we carried out the Bayesian approach described in Section 6.4.2. We ran 200,000 iterations of the BD-MCMC algorithm with 100,000 iterations as burn-in. From the diagnostics it is clear that these numbers exceed what is needed for reliable results. Methods for choosing a burn-in time and the number of iterations to use after burn-in are discussed in (12).
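The two data sets above can be generated as follows; the seed is arbitrary, and G(a, b) is parametrised with rate b, so sampling uses scale 1/b.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Service times from B1(s) = 0.6 G(12, 1) + 0.4 G(3, 2); the component
# means are a / b, so the mixture mean is 0.6*12 + 0.4*1.5 = 7.8.
z1 = rng.choice([0, 1], size=n, p=[0.6, 0.4])
shapes1, rates1 = np.array([12.0, 3.0]), np.array([1.0, 2.0])
service = rng.gamma(shapes1[z1], 1.0 / rates1[z1])

# Re-service times from B2 = 0.6 G(100, 100/3) + 0.3 G(200, 50) + 0.1 G(300, 60)
z2 = rng.choice([0, 1, 2], size=n, p=[0.6, 0.3, 0.1])
shapes2 = np.array([100.0, 200.0, 300.0])
rates2 = np.array([100.0 / 3.0, 50.0, 60.0])
reservice = rng.gamma(shapes2[z2], 1.0 / rates2[z2])

# Theoretical mixture means for a quick sanity check
mean1 = 0.6 * 12 / 1 + 0.4 * 3 / 2       # 7.8
mean2 = 0.6 * 3 + 0.3 * 4 + 0.1 * 5      # 3.5
print(service.mean(), mean1)
print(reservice.mean(), mean2)
```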

Fig. 6.1 Predictive densities (solid line) and the true densities (dotted line) for (left) service time data and (right) re-service time data.

Figure 6.1 provides the histograms of the generated data sets with the estimates of the predictive densities from formula (6.13) for the service and re-service times. This figure shows that the predictive densities for the service and re-service times compare quite well with the true densities.

Fig. 6.2 (Left) A trace of k1 for 200, 000 iterations after 100, 000 burn-in iterations and (right) for the k2 .

Figure 6.2 (left) illustrates the mixing properties of the algorithm in terms of the evolution of the mixture size k1 and (right) is for k2 . An essential element of the performance of our BD-MCMC algorithm is its ability to move between different values of k1 and k2 . The chains appear to be mixing quite well, visiting many states, for both service and re-service times.

Fig. 6.3 (Left) The estimation of the posterior distribution of k1, and (right) for k2.

Figure 6.3 on the left shows the estimate of the posterior distribution of k1, obtained from formula (6.12), and on the right that of k2. A useful check on stationarity is given by the plot of the cumulative occupancy fractions for different values of k1 and k2 against the number of iterations. These are represented in Figure 6.4 for the service and re-service data sets, where it can be seen that the burn-in is more than adequate to achieve stability in the occupancy fractions.


Fig. 6.4 The cumulative occupancy fractions for (left) service data, k1 , (right) the re-service data, k2 , for a complete run including burn-in.

In the first and second rows of Table 6.1, we respectively tabulate the true values and the estimated values of the mean busy period, the probability of an idle period, the expected number of customers in MQ, the expected number of customers in FQ, the probability of having a stable system and the traffic intensity, obtained from formulas (6.2), (6.3), (6.4), (6.5), (6.14) and (6.15), respectively. The third row of the table shows the standard deviation (SD) of these estimates.

             E(busy period)  P(idle period)  E(Xn)   E(Yn)   P(ρ<1|data)  E(ρ|data)
True value        9.053           0.298      1.446   0.003        –          0.965
Estimates         9.036           0.299      1.423   0.003      0.967        0.967
SD                0.821           0.017      0.176   0.000      0.023        0.022

Table 6.1 True values and estimates of the mean busy period, probability of an idle period, mean numbers of customers in the system, and the probability that the system is stable. The third row gives the standard deviation (SD) of these estimates.

Also, the real value of the traffic intensity, ρ, from formula (6.1) is equal to 0.952. A 95% credible interval is roughly given as the estimate ±1.96 × SD. By this criterion, all the true values lie inside their 95% credible intervals.
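The interval check can be sketched as follows, here for the E(busy period) column of Table 6.1; the helper function name is ours, not part of the methodology.

```python
# 95% credible interval as estimate +/- 1.96 * SD, illustrated with the
# E(busy period) column of Table 6.1 (true 9.053, estimate 9.036, SD 0.821).
def credible_interval(estimate, sd, z=1.96):
    return estimate - z * sd, estimate + z * sd

lo, hi = credible_interval(9.036, 0.821)
print(round(lo, 3), round(hi, 3))   # 7.427 10.645
print(lo < 9.053 < hi)              # True: the true value is covered
```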

6.7.2

Simulation study: Mixture of truncated Normal and Log-Normal

In this section we would like to assess the effect of model misspecification on our estimation procedure. As in the previous example, we assume that the inter-arrival rate and the probability of re-service are known, λ = 0.28 and p = 0.45. We consider a sample of 1,000 service data from a mixture of two Normal distributions truncated to (0, ∞),

B1(s) = 0.4 TN_{(0,∞)}(1.4, 2.3) + 0.6 TN_{(0,∞)}(0.2, 0.3).

Also, 1,000 re-service data are simulated from a single Log-Normal distribution,

B2(s̃) = LN(1, 0.5).

With the same assumptions as in the previous example, we ran 200,000 iterations of the BD-MCMC algorithm with 100,000 iterations as burn-in.
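Sampling from this misspecified truth can be sketched as below. Interpreting the second parameter of TN and LN as a standard deviation is an assumption of this sketch, not a statement from the text.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(3)
n = 1000

def sample_tn(mu, sigma, size):
    # Normal(mu, sigma) truncated to (0, inf); scipy's truncnorm takes the
    # truncation bounds on the standardised scale, a = (0 - mu) / sigma.
    a = (0.0 - mu) / sigma
    return truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                         size=size, random_state=rng)

# Service: B1 = 0.4 TN(1.4, 2.3) + 0.6 TN(0.2, 0.3); treating the second
# parameter as a standard deviation is an assumption.
z = rng.choice([0, 1], size=n, p=[0.4, 0.6])
service = np.where(z == 0, sample_tn(1.4, 2.3, n), sample_tn(0.2, 0.3, n))

# Re-service: B2 = LN(1, 0.5), log-times Normal with mean 1 and sd 0.5.
reservice = rng.lognormal(mean=1.0, sigma=0.5, size=n)

print(service.min() > 0.0, reservice.min() > 0.0)   # both supports positive
```

Both true densities live on the positive half-line, so the Gamma-mixture model class, while wrong, has the right support, which is what makes the approximation in Figure 6.5 possible.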

Fig. 6.5 Predictive densities (solid line) and the true densities (dotted line) for (left) service time data and (right) re-service time data.

Figure 6.5 shows the histograms of the generated data sets with the true densities for the service and re-service times and the predictive densities from formula (6.13). This figure shows the predictive densities for the service and re-service times in comparison with the true densities. Despite the fact that the true model is not part of our model class, the model class is sufficiently rich to approximate quite general densities. Figure 6.6 (left) illustrates the mixing properties of the algorithm in terms of the evolution of the mixture size k1, and (right) that of k2. An essential element of the performance of our BD-MCMC algorithm is its ability to move between different values of k1 and k2. The chains appear to be mixing quite well, visiting many states, for both the service and re-service times. Figure 6.7 on the left shows the estimate of the posterior distribution of k1, obtained from (6.12), and on the right that of k2. For checking stationarity, Figure 6.8 shows the cumulative occupancy fractions for different values of k1 and k2 against the number of iterations. These are represented for the


Fig. 6.6 (Left) A trace of k1 for 200, 000 iterations after 100, 000 burn-in iterations and (right) for the k2 .

Fig. 6.7 (Left) The estimation of the posterior distribution of k1, and (right) for k2.

Fig. 6.8 The cumulative occupancy fractions for (left) service data, k1 , (right) the re-service data, k2 , for a complete run including burn-in.


service and re-service data sets, where it can be seen that the burn-in is more than adequate to achieve stability in the occupancy fractions. In the first and second rows of Table 6.2, we respectively tabulate the true values and the estimated values of the mean busy period, the probability of an idle period, the expected number of customers in MQ, the expected number of customers in FQ, the probability of having a stable system and the traffic intensity.

             E(busy period)  P(idle period)  E(Xn)   E(Yn)   P(ρ<1|data)  E(ρ|data)
True value        5.797           0.381      1.661   0.227        –          0.983
Estimates         5.845           0.379      1.586   0.230      0.973        0.972
SD                0.079           0.003      0.060   0.002      0.008        0.008

Table 6.2 True values and estimates of the mean busy period, probability of an idle period, mean numbers of customers in the system, and the probability that the system is stable. The third row gives the standard deviation (SD) of these estimates.

Also, the real value of the traffic intensity from formula (6.1) is ρ = 0.983. Considering the values of the SD, all the true values lie inside their 95% credible intervals.

6.8

Discussion and future directions

This paper has developed a Bayesian approach to inference and prediction for an M/G/1 queuing system with optional second service. It has developed a density estimation method based on a mixture of Gamma distributions in order to approximate the general service and re-service time distributions. A BD-MCMC algorithm has been proposed to make inference on the service and re-service parameters. This algorithm is based on births and deaths of mixture components, making use of the birth-death technique proposed by (26). Some important measures of this queuing system, such as the mean system size, the mean busy period and the probability of an idle period, have been estimated. This methodology has been illustrated with a simulation study. For our Bayesian approach, we used a mixture of Gamma distributions, since all distributions on the positive real line can be approximated by such mixtures, and features such as skewness can be well modeled by asymmetric distributions. There is previous work on Gamma mixture and Erlang mixture distributions (29, 3, 4, 5), but our Bayesian framework is different, its computation is easier and the results are more accurate. For instance, in the Metropolis-Hastings step of the BD-MCMC algorithm (step 4.2), the acceptance probability in this article is around 20%, which is more reasonable in comparison with previous work (e.g. see (3, 29)).

6.8 Discussion and future directions

129

Our proposal is not limited to this queuing application, and comparisons with standard mixtures based on truncated Normal distributions and skewed truncated Normal distributions, see (11), are part of future work. Also, density estimation problems seem suitable for our approach. Work is currently in progress on these models. An alternative to the BD-MCMC methodology is the RJ-MCMC introduced by (13). This type of algorithm has been used by (29) for mixtures of Gamma distributions and by (3, 4) for mixtures of Erlang distributions. In practice, we have found that both schemes perform similarly. In the BD-MCMC algorithm, as we have indicated, larger values of the birth rate produce better mixing, but also increase the computational cost. We have experienced some problems of non-convergence of the algorithm if the birth rate is selected too high. Thus, it would be useful to explore methods for selecting this parameter in order to optimize the algorithm.

Acknowledgments The authors are grateful to the anonymous reviewers for their detailed and insightful comments.


References

[1] Armero, C. and Bayarri, M. (1994a). Bayesian prediction in M/M/1 queues. Queueing Systems, 15(1-4):401–417.
[2] Armero, C. and Bayarri, M. (1994b). Prior assessments for prediction in queues. The Statistician, pages 139–153.
[3] Ausín, M. C., Wiper, M. P., and Lillo, R. E. (2004). Bayesian estimation for the M/G/1 queue using a phase-type approximation. Journal of Statistical Planning and Inference, 118(1):83–101.
[4] Ausín, M. C., Lillo, R. E., and Wiper, M. P. (2007). Bayesian control of the number of servers in a GI/M/c queueing system. Journal of Statistical Planning and Inference, 137(10):3043–3057.
[5] Ausín, M. C., Wiper, M. P., and Lillo, R. E. (2008). Bayesian prediction of the transient behaviour and busy period in short- and long-tailed GI/G/1 queueing systems. Computational Statistics & Data Analysis, 52(3):1615–1635.
[6] Cappé, O., Robert, C. P., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):679–700.
[7] Celeux, G., Forbes, F., Robert, C. P., and Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1(4):651–673.
[8] Claeskens, G. and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press, Cambridge.
[9] Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 363–375.
[10] Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. Springer.
[11] Frühwirth-Schnatter, S. and Pyne, S. (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics, 11(2):317–336.
[12] Gilks, W., Richardson, S., and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, New York.
[13] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.
[14] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. Oxford Statistical Science Series, pages 179–198.
[15] Hastie, D. I. and Green, P. J. (2012). Model choice using reversible jump Markov chain Monte Carlo. Statistica Neerlandica, 66(3):309–338.
[16] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
[17] Insua, D. R., Wiper, M., and Ruggeri, F. (1998). Bayesian analysis of M/Er/1 and M/H_k/1 queues. Queueing Systems, 30(3-4):289–308.
[18] Medhi, J. (1994). Stochastic Processes. New Age International.
[19] Mohammadi, A. and Salehi-Rad, M. R. (2012). Bayesian inference and prediction in an M/G/1 with optional second service. Communications in Statistics - Simulation and Computation, 41(3):419–435.
[20] Ramirez, P., Lillo, R. E., and Wiper, M. P. (2008). Bayesian analysis of a queueing system with a long-tailed arrival process. Communications in Statistics - Simulation and Computation, 37(4):697–712.
[21] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.
[22] Robert, C. P. (1996). Mixtures of distributions: inference and estimation. Springer.
[23] Salehi-Rad, M. and Mengersen, K. (2000). Reservicing some customers in M/G/1 queues under two disciplines. In From Adam Smith to Michael Porter: Evolution of Competitiveness Theory, page 267.
[24] Salehi-Rad, M., Mengersen, K., and Shahkar, G. (2004). Reservicing some customers in M/G/1 queues under three disciplines. International Journal of Mathematics and Mathematical Sciences, 2004(32):1715–1723.
[25] Sperrin, M., Jaki, T., and Wit, E. (2010). Probabilistic relabelling strategies for the label switching problem in Bayesian mixture models. Statistics and Computing, 20(3):357–366.
[26] Stephens, M. (2000a). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics, pages 40–74.
[27] Stephens, M. (2000b). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809.
[28] Tierney, L. (1996). Introduction to general state-space Markov chain theory. In Markov Chain Monte Carlo in Practice, pages 59–74. Springer.
[29] Wiper, M., Insua, D. R., and Ruggeri, F. (2001). Mixtures of Gamma distributions with applications. Journal of Computational and Graphical Statistics, 10(3).

Summary

In this thesis, we address several problems related to modeling complex systems. The difficulty of modeling complex systems lies partly in their topology and in how they form rather complex networks. From this perspective, our interest in networks (graphs) is part of a broader current of research on complex systems. Graphical models provide powerful tools to model and make statistical inference regarding complex relationships among variables. In this context, Gaussian graphical models are commonly used, since inference in such models is often tractable.

In Chapter 2, we introduce a novel Bayesian framework for Gaussian graphical model determination. We carry out the posterior inference by using an efficient sampling scheme, a trans-dimensional MCMC approach based on a birth-death process. In particular, we construct an efficient search algorithm which explores the graph space to detect the underlying graph with high accuracy. We cover the theory and computational details of the proposed method. We then apply the method to large-scale real applications from mammary gene expression studies to show its empirical usefulness.

The method that we propose in Chapter 2 is limited to data that follow the Gaussianity assumption. In Chapter 3, we propose a Bayesian approach for graphical model determination based on a Gaussian copula approach that can deal with continuous, discrete, or mixed data. We embed a graph selection procedure inside a semi-parametric Gaussian copula. We apply our approach to discovering potential risk factors related to Dupuytren disease.

In Chapter 4, we introduce an R package, BDgraph, which efficiently implements the Bayesian approaches proposed in Chapters 2 and 3. The core of the BDgraph package is implemented in C++ to maximize computational speed.

The most promising class of statistical models that can be used for network modelling is that of Exponential Random Graph Models (ERGMs).
However, they are restricted to models that regard the network as given. In Chapter 5, we develop a novel Bayesian statistical framework which combines the class of ERGMs with graphical models capable of modelling non-observed networks. Our proposed method greatly extends the scope of the ERGMs to more applied research areas. In Chapter 6, we introduce a Bayesian framework for an M/G/1 queuing system with an optional second service. We estimate system parameters, predictive densities and some performance measures related to this queuing system, such as the stationary system size and the waiting time.

Samenvatting

In dit proefschrift bekijken we een aantal problemen met betrekking tot het modelleren van complexe systemen. De moeilijkheid in het modelleren van complexe systemen ligt deels in hun topologie en deels in hoe ze tamelijk complexe netwerken vormen. Onze interesse in netwerken (grafen) is onderdeel van een bredere beweging in de richting van onderzoek naar complexe systemen.

Na de introductie, in hoofdstuk 2, introduceren we een nieuw Bayesiaans kader voor Gaussische grafische modelbepaling, namelijk een trans-dimensionale Markov chain Monte Carlo (MCMC) aanpak op basis van een continue-tijd geboorte-dood-proces. We behandelen de theorie en computationele details van de voorgestelde methode. Vervolgens passen we de methode toe op borstklier-genexpressiestudies om haar empirische nut te tonen.

De werkwijze die we in hoofdstuk 2 voorstellen is beperkt tot gegevens die de Gaussiaanse aanname volgen. In hoofdstuk 3 introduceren we een Bayesiaanse aanpak gebaseerd op de Gaussische copula-benadering die tegelijkertijd met binaire, ordinale of continue variabelen kan werken. We verwerken een grafische selectieprocedure binnen een semi-parametrische Gaussische copula. Wij passen posterieure inferentie toe door gebruik te maken van een efficiënt bemonsteringsschema: een trans-dimensionale MCMC-aanpak op basis van een geboorte-dood-proces. Wij implementeren onze aanpak met als doel het ontdekken van potentiële risicofactoren bij de ziekte van Dupuytren.

In hoofdstuk 4 introduceren we een op R gebaseerd softwarepakket, BDgraph, dat de Bayesiaanse benaderingen die in de hoofdstukken 2 en 3 zijn geïntroduceerd efficiënt uitvoert. De kern van het BDgraph-pakket is efficiënt geïmplementeerd in C++ om de rekensnelheid te bevorderen.

De meest veelbelovende statistische modellen die kunnen worden gebruikt voor netwerkmodellering zijn de Exponential Random Graph Models (ERGMs). Ze zijn echter beperkt tot modellen die het netwerk als gegeven beschouwen.
In hoofdstuk 5 ontwikkelen we een nieuw Bayesiaans statistisch kader, dat de klasse van ERGMs combineert met grafische modellen die in staat zijn niet-waargenomen netwerken te modelleren. Onze voorgestelde methode vergroot het bereik van de ERGMs naar meer toegepaste onderzoeksgebieden. In hoofdstuk 6 stellen we een Bayesiaans kader voor in een M/G/1 wachtrijsysteem met een optionele tweede dienstverlening (service). Wij schatten parameters van het systeem, voorspellende dichtheden en een aantal prestatie-indicatoren met betrekking tot dit wachtrijsysteem, zoals de stationaire systeemgrootte en de wachttijd.

Acknowledgements

First of all, I would like to acknowledge Prof. Ernst Wit, who gave me the opportunity to pursue my PhD project in his group. His continuous trust, support and guidance have been invaluable to the success of my PhD project. I hope we will have more fruitful collaborations in the future. I have truly benefited from his ideas and his enthusiasm. I wish to thank Prof. Nial Friel, Prof. Nicolai Petkov and Prof. Goeran Kauermann for serving on the reading committee. Chapter 3 of this thesis is richer as a result of collaborations with Prof. Edwin van den Heuvel and Dr Fentaw Abegaz, and Chapter 6 with Dr Mohammadreza Salehi-Rad. I would like to thank them for their helpful collaborations. I owe my loving thanks to my Pariya for her support during my PhD study. Without her encouragement and understanding, it would have been impossible for me to finish my work. I owe my gratitude to my family for their unconditional support. I am also grateful to my family-in-law, especially my mother-in-law and my father-in-law, for their encouragement and their support during my PhD study. I would like to thank my friends, Javier, Nynke, Ivan, Antonio, Vladimir and Lotsi, who made my stay in Groningen so full of good memories. I am very lucky to have met Dick and Anneke, whose kindness and continuous support I will never forget. My special thanks to Dr. Frank Brokken, from whom I learned a lot in his great C++ course; we had very informative talks about C++ programming. I acknowledge Sourabh for his help in building the BDgraph package and Erik for translating the summary and abstract into Dutch.

Abdolreza Mohammadi
Groningen, April 2015
