RECOMB 2015 CCB
Dirichlet Process Mixture Model with Bayesian LASSO for Consistent Clustering of Survival Times with Molecular Data
Ashar Ahmad & Dr. Holger Fröhlich
Institute for Computer Science, University of Bonn
Bonn-Aachen International Center for IT (B-IT), Bonn, Germany
Outline of the talk
● Motivation
● Methods
● Simulation and Application
● Future Work and Possible Extensions
Section A: Motivation
● Cancer Subtype Identification
● Prognostic Signature, biomarker discovery
● Personalized Medicine
● Molecular differences, however, sometimes DO NOT result in significant Clinical Data differences (survival times etc.)
● Need to integrate Molecular and Clinical Data for Subtype Identification
Previous Methodology
● Cancer subtype discovery is usually achieved with unsupervised clustering over the molecular data.
● The clusters are then associated post hoc with the clinical data.
● Statistical tests or supervised feature selection identify discriminating biomarkers/signatures.
Verhaak, Roel GW, et al. (2010).
Our proposed Method
● Integration of Molecular Data and Clinical Data for consistent and coherent clustering between the two data types
● Model-based feature selection for the relevant discriminating Molecular features
● Model-based selection of the number of subtypes
Section B: Model Description
● Hierarchical Multivariate Gaussian Mixture Model for Molecular data (Görür and Rasmussen 2010)
● A Bayesian Least Absolute Shrinkage and Selection Operator (BLASSO) for the Accelerated Failure Time (AFT) model
● A Dirichlet Process prior over the cluster assignments
● Censored Survival Times treated as hidden variables
Completely Bayesian Model
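Treating censored survival times as hidden variables can be pictured with a small sketch. It assumes a log-normal AFT model, in which the log survival time of a patient is Gaussian around a linear predictor; the slides do not state the error distribution, so this choice, the function name `impute_censored_log_time` and all numeric values are illustrative assumptions. For a right-censored patient only a lower bound on the survival time is known, so inside a Gibbs sweep the hidden time can be redrawn from the Gaussian truncated at that bound.

```python
import random
from statistics import NormalDist


def impute_censored_log_time(mean, sd, log_censor_time, rng=random):
    """Draw a log survival time from N(mean, sd^2) truncated to [log_censor_time, inf).

    Assumption: log-normal AFT errors. Uses inverse-CDF sampling.
    """
    dist = NormalDist(mean, sd)
    lower = dist.cdf(log_censor_time)          # probability mass below the bound
    u = lower + (1.0 - lower) * rng.random()   # uniform on [lower, 1)
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf inside its open domain
    # The max() guards against the clamp above when the bound lies far in the tail.
    return max(dist.inv_cdf(u), log_censor_time)


# Toy usage with hypothetical numbers: predictor mean 1.0, sd 0.5, censored at log-time 1.2.
rng = random.Random(0)
draw = impute_censored_log_time(1.0, 0.5, 1.2, rng)
print(draw >= 1.2)
```

Every imputed value respects the censoring bound, so the regression part of the model can be updated as if all survival times had been observed.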
A Simple 2D example
Model Inference
● Gibbs Sampling for parameters.
● Rejection Sampling for some hyper-parameters.
● As the overall model (Molecular + Clinical) likelihood is non-conjugate with respect to the parameters, we use Auxiliary Variables to update the cluster membership variable (Neal, Radford M. 2000).
● Full posterior distributions are obtained.
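The auxiliary-variable update of the cluster membership can be sketched roughly as below, in the spirit of Algorithm 8 of Neal (2000). This is a toy under stated assumptions, not the talk's implementation: the data are 1D, the likelihood is a Gaussian with known standard deviation, and the names `update_assignment`, `draw_prior` and `lik` are hypothetical. An existing cluster is weighted by its size (excluding the current sample) times the likelihood; each of the `m` auxiliary clusters, whose parameters are drawn from the prior, is weighted by `alpha / m` times the likelihood.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update_assignment(i, data, labels, params, alpha, m, draw_prior, lik, rng):
    """Resample labels[i] in place; params maps cluster label -> parameter."""
    counts = {}
    for j, c in enumerate(labels):
        if j != i:
            counts[c] = counts.get(c, 0) + 1
    aux = [draw_prior(rng) for _ in range(m)]   # m auxiliary parameters from the prior
    if labels[i] not in counts:
        # Sample i was alone in its cluster: its parameter becomes the first auxiliary one.
        aux[0] = params.pop(labels[i])
    existing = list(counts)
    weights = [counts[c] * lik(data[i], params[c]) for c in existing]
    weights += [(alpha / m) * lik(data[i], theta) for theta in aux]
    pick = rng.choices(range(len(weights)), weights=weights)[0]
    if pick < len(existing):
        labels[i] = existing[pick]
    else:
        new_label = max(params, default=-1) + 1  # open a new cluster
        params[new_label] = aux[pick - len(existing)]
        labels[i] = new_label


# Toy usage (hypothetical values); updates of the cluster parameters are omitted.
rng = random.Random(1)
data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
labels = [0] * len(data)
params = {0: 0.0}
for _ in range(20):
    for i in range(len(data)):
        update_assignment(i, data, labels, params, 1.0, 3,
                          lambda r: r.gauss(0.0, 3.0),
                          lambda x, mu: normal_pdf(x, mu, 1.0), rng)
print(sorted(set(labels)) == sorted(params))
```

Because only likelihood evaluations and prior draws are needed, the same update works when the prior is not conjugate to the likelihood, which is the situation described above. A full sampler would alternate this step with updates of the cluster parameters and hyper-parameters.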
Initialization
● To avoid computationally expensive DP updates and possible local optima in the log-likelihood landscape, we use K-means and corresponding penalized linear regression fits for the initialization.
● Some hyper-parameters are initialized using Empirical Bayes estimates from the data.
● All the other parameters and hyper-parameters are initialized from their corresponding prior distributions.
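The K-means part of this initialization might look like the following minimal sketch. It is a plain Lloyd's-algorithm implementation in standard-library Python, written only for illustration; the function name `kmeans`, the iteration cap and the toy points are assumptions, and the per-cluster penalized regression fits mentioned above are not shown.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm; returns (labels, centres)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]  # start from k data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break                                        # assignments are stable
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                                  # keep the old centre if empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Toy usage with hypothetical 2D points.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]]
labels, centres = kmeans(points, k=2)
print(labels)
```

The resulting labels would serve as starting cluster assignments for the sampler, and a penalized linear regression fitted within each starting cluster would give starting values for the regression coefficients.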
Section C: Simulations/Practical Application
● A number of simulations were run in order to understand the limitations of the model.
● A simple scenario was simulated with:
● 200 Samples, 50 Dimensions and 2 clusters
● 4 randomly selected Dimensions were chosen to generate the survival times.
● The relevant dimensions shared a covariance structure.
Simulation
● The non-relevant features were drawn from independent Gaussian distributions.
● 5% of Survival times were censored
● Overlap of Molecular Data clusters: 5%
● Noise in the Survival Time: 5%
● Samples collected after 100 burn-in iterations, over 300 Gibbs iterations with thinning = 10
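A data set with the stated dimensions could be generated along the following lines. This is only a sketch under assumptions: the slides do not give the cluster centres, regression coefficients, covariance values or the exact meaning of the 5% overlap and 5% noise settings, so the numbers below (and the name `simulate`) are hypothetical placeholders. What it does reproduce is the overall design: 200 samples, 50 dimensions, 2 clusters, 4 randomly chosen relevant dimensions that share covariance through a common latent factor, independent Gaussian noise features, and 5% censored survival times.

```python
import random


def simulate(n=200, d=50, n_relevant=4, censor_frac=0.05, noise_sd=0.05, seed=0):
    rng = random.Random(seed)
    relevant = sorted(rng.sample(range(d), n_relevant))
    centres = (-1.5, 1.5)        # hypothetical cluster centres on relevant dimensions
    intercepts = (2.0, 4.0)      # hypothetical cluster-specific AFT intercepts
    betas = [[rng.uniform(-1.0, 1.0) for _ in relevant] for _ in range(2)]
    X, labels, log_t = [], [], []
    for i in range(n):
        c = i % 2                                      # two equally sized clusters
        shared = rng.gauss(0.0, 1.0)                   # latent factor -> shared covariance
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent noise features
        for j in relevant:
            row[j] = centres[c] + 0.7 * shared + 0.7 * rng.gauss(0.0, 1.0)
        t = intercepts[c] + sum(b * row[j] for b, j in zip(betas[c], relevant))
        t += rng.gauss(0.0, noise_sd)                  # noise on the log survival time
        X.append(row)
        labels.append(c)
        log_t.append(t)
    censored = set(rng.sample(range(n), round(censor_frac * n)))
    # A censored patient is observed only up to a time before the true event.
    observed = [t - abs(rng.gauss(0.0, 0.5)) if i in censored else t
                for i, t in enumerate(log_t)]
    event = [0 if i in censored else 1 for i in range(n)]
    return {"X": X, "labels": labels, "relevant": relevant,
            "true_log_time": log_t, "observed_log_time": observed, "event": event}


sim = simulate()
print(len(sim["X"]), len(sim["X"][0]), sum(sim["event"]))
```

Keeping the true labels and the indices of the relevant dimensions in the output makes it possible to score both the recovered clustering and the selected features against the ground truth.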
Simulation Results over 10 repeats
Application on TCGA Glioblastoma Data
● Expression Matrix obtained from Verhaak et al. 2010.
● 196 Patients, 12 censored
● Global Test used to rank KEGG pathways related to Survival (in order of adjusted p-values)
● Top 30 KEGG pathways selected
● Mean expression of all the Genes in a Pathway is calculated as a Pathway score for every patient
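The pathway selection and scoring steps can be illustrated with a small sketch. The data structures, the function names `top_pathways` and `pathway_scores`, and the toy gene, pathway and p-value entries are all hypothetical; the adjusted p-values themselves are assumed to come from a separate Global Test run, which is not reproduced here.

```python
def top_pathways(adjusted_pvalues, n=30):
    """Return the n pathway names with the smallest adjusted p-values."""
    return sorted(adjusted_pvalues, key=adjusted_pvalues.get)[:n]


def pathway_scores(expression, pathways):
    """Mean expression of a pathway's genes, per patient.

    expression: {patient: {gene: value}}, pathways: {pathway: [genes]}.
    Genes missing from a patient's profile are skipped.
    """
    scores = {}
    for patient, profile in expression.items():
        scores[patient] = {}
        for name, genes in pathways.items():
            values = [profile[g] for g in genes if g in profile]
            if values:
                scores[patient][name] = sum(values) / len(values)
    return scores


# Toy usage with made-up identifiers and numbers.
expression = {"P1": {"G1": 1.0, "G2": 3.0, "G3": 5.0},
              "P2": {"G1": 2.0, "G2": 2.0, "G3": 4.0}}
pathways = {"PW_A": ["G1", "G2"], "PW_B": ["G3", "G9"]}
pvals = {"PW_A": 0.04, "PW_B": 0.01}
selected = top_pathways(pvals, n=1)
scores = pathway_scores(expression, {p: pathways[p] for p in selected})
print(selected, scores)
```

With 30 selected pathways, each patient is described by a 30-dimensional score vector, which then plays the role of the molecular data in the model.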
Results on TCGA Data
Section D: Future Work and Extensions
● More simulations are needed to understand the performance of the model.
● Inclusion of more than one type of Molecular data (Gene Expression, Methylation, Copy number etc.)
● Inclusion of more than one type of Clinical data (Karnofsky Index etc.)
References
● Verhaak, Roel GW, et al. "Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1." Cancer Cell 17.1 (2010): 98-110.
● Görür, Dilan, and Carl Edward Rasmussen. "Dirichlet process Gaussian mixture models: Choice of the base distribution." Journal of Computer Science and Technology 25.4 (2010): 653-664.
● Savage, Richard S., et al. "Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data." arXiv preprint arXiv:1304.3577 (2013).
● Neal, Radford M. "Markov chain sampling methods for Dirichlet process mixture models." Journal of Computational and Graphical Statistics 9.2 (2000): 249-265.
● Park, Trevor, and George Casella. "The Bayesian lasso." Journal of the American Statistical Association 103.482 (2008): 681-686.
● Sha, Naijun, Mahlet G. Tadesse, and Marina Vannucci. "Bayesian variable selection for the analysis of microarray data with censored outcomes." Bioinformatics 22.18 (2006): 2262-2268.
● Goeman, Jelle J., et al. "A global test for groups of genes: testing association with a clinical outcome." Bioinformatics 20.1 (2004): 93-99.