RECOMB 2015 CCB

Dirichlet Process Mixture Model with Bayesian LASSO for Consistent Clustering of Survival Times with Molecular Data

Ashar Ahmad & Dr. Holger Fröhlich
Institute for Computer Science, University of Bonn
Bonn-Aachen International Center for IT (B-IT), Bonn, Germany

Outline of the talk

● Motivation
● Methods
● Simulation and Application
● Future Work and Possible Extensions

Section A: Motivation

● Cancer Subtype Identification
● Prognostic Signature, biomarker discovery
● Personalized Medicine
● Molecular differences, however, sometimes DO NOT result in significant clinical data differences (survival times etc.).
● Need to integrate molecular and clinical data for subtype identification.

Previous Methodology

● Cancer subtype discovery is usually achieved with unsupervised clustering over the molecular data.
● The clusters are then associated post hoc with the clinical data.
● Statistical tests or supervised feature selection identify discriminating biomarkers/signatures.

Verhaak, Roel GW, et al. (2010).

Our proposed Method

● Integration of molecular data and clinical data for consistent and coherent clustering between the two data types.
● Model-based feature selection for the relevant discriminating molecular features.
● Model-based selection of the number of subtypes.

Section B: Model Description

● Hierarchical multivariate Gaussian mixture model for the molecular data (Görür, Dilan, et al. 2010).
● A Bayesian Least Absolute Shrinkage and Selection Operator (BLASSO) for the Accelerated Failure Time (AFT) model of the survival times.
● A Dirichlet Process prior over the cluster assignments.
● Censored survival times treated as hidden variables.
● Completely Bayesian model.
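The slides list the model components without giving their distributions. The block below is only a sketch of how these components could fit together. The notation, the log-normal form of the AFT model and the exact prior hierarchy are assumptions made for illustration, not taken from the talk; the shrinkage prior follows the form in Park and Casella (2008).

```latex
% Assumed notation (not from the slides): x_i = molecular features of patient i,
% t_i = survival time, c_i = cluster label, \alpha = DP concentration parameter.
\begin{align*}
c_1,\dots,c_N &\sim \mathrm{CRP}(\alpha)
  && \text{(Dirichlet process prior on assignments)}\\
x_i \mid c_i = k &\sim \mathcal{N}\!\left(\mu_k,\ S_k^{-1}\right)
  && \text{(Gaussian mixture for molecular data)}\\
\log t_i \mid c_i = k,\ x_i &\sim \mathcal{N}\!\left(\beta_{0k} + x_i^{\top}\beta_k,\ \sigma_k^2\right)
  && \text{(assumed log-normal AFT model)}\\
\beta_{kj} \mid \sigma_k^2,\ \tau_{kj}^2 &\sim \mathcal{N}\!\left(0,\ \sigma_k^2\,\tau_{kj}^2\right)
  && \text{(Bayesian LASSO scale mixture)}\\
\tau_{kj}^2 \mid \lambda_k &\sim \mathrm{Exp}\!\left(\lambda_k^2 / 2\right)
\end{align*}
% For a censored patient, \log t_i is treated as a latent variable constrained
% to exceed the logarithm of the observed censoring time.
```

Under this reading, a coefficient \beta_{kj} shrunk towards zero marks feature j as not relevant for survival in cluster k, which is how the feature selection would become part of the model.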

A Simple 2D example

Model Inference

● Gibbs sampling for the parameters.
● Rejection sampling for some hyper-parameters.
● As the overall model (molecular + clinical) likelihood is non-conjugate with respect to the parameters, we use auxiliary variables to update the cluster membership variable (Neal, Radford M. 2000).
● Full posterior distributions are obtained.
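The auxiliary-variable update can be illustrated on a toy problem. The sketch below follows the style of Algorithm 8 in Neal (2000) for a one-dimensional Gaussian likelihood with unit variance and a Gaussian base distribution. The function name, the toy likelihood and all default values are assumptions made for this example; the model in the talk would use the joint molecular and survival likelihood instead.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update_assignment(i, x, labels, means, alpha=1.0, m=3, prior_sd=5.0, rng=random):
    """Resample labels[i] once using m auxiliary components.

    x: list of 1-D observations; labels: cluster id per observation;
    means: dict mapping cluster id -> cluster mean (toy likelihood N(mean, 1)).
    Both labels and means are modified in place.
    """
    old = labels[i]
    labels[i] = None
    counts = {}
    for c in labels:
        if c is not None:
            counts[c] = counts.get(c, 0) + 1

    # If observation i was alone in its cluster, that cluster's parameter
    # is reused as the first auxiliary component.
    aux = []
    if old not in counts:
        aux.append(means.pop(old))
    while len(aux) < m:
        aux.append(rng.gauss(0.0, prior_sd))  # draw from the base distribution

    # Existing clusters are weighted by size, auxiliary ones by alpha / m.
    candidates = [(c, means[c], n) for c, n in counts.items()]
    weights = [n * normal_pdf(x[i], mu, 1.0) for _, mu, n in candidates]
    weights += [(alpha / m) * normal_pdf(x[i], mu, 1.0) for mu in aux]

    pick = rng.choices(range(len(weights)), weights=weights)[0]
    if pick < len(candidates):
        labels[i] = candidates[pick][0]
    else:
        new_id = max(means, default=-1) + 1
        means[new_id] = aux[pick - len(candidates)]
        labels[i] = new_id
    return labels[i]
```

A full sampler would alternate this step over all observations with updates of the cluster parameters given the current assignments; that second step is not shown here.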

Initialization

● To avoid computationally expensive DP updates and possible local optima in the log-likelihood landscape, we use K-means and corresponding penalized linear regression fits for the initialization.
● Some hyper-parameters are initialized using Empirical Bayes estimates from the data.
● All other parameters and hyper-parameters are initialized from their corresponding prior distributions.
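The K-means part of the initialization can be sketched in a few lines. The version below is a minimal illustration using plain Lloyd iterations; the deterministic farthest-first seeding and the function names are assumptions of this sketch and are not described in the talk. The per-cluster penalized regression fits that would follow are not shown.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=20):
    """Return (labels, centers) after a fixed number of Lloyd iterations."""
    # Farthest-first seeding: start from the first point, then repeatedly
    # add the point farthest from all centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

The resulting labels would serve as starting cluster assignments for the sampler, so that the early Gibbs iterations do not have to discover the coarse cluster structure from scratch.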

Section C: Simulations/Practical Application

● A number of simulations were run in order to understand the limitations of the model.
● A simple scenario was simulated with:
● 200 samples, 50 dimensions and 2 clusters.
● 4 randomly selected dimensions were used to generate the survival times.
● The relevant dimensions shared a covariance structure.

Simulation

● The non-relevant features were drawn from independent Gaussian distributions.
● 5% of the survival times were censored.
● Overlap of molecular data clusters: 5%
● Noise in the survival time: 5%
● Samples were collected after 100 burn-in iterations, over 300 Gibbs iterations with thinning = 10.
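A data set with the stated dimensions could be generated roughly as follows. This is a sketch under several assumptions: the cluster means, the regression coefficients, the shared-factor construction of the covariance, the log-linear survival model and the reading of "5% noise" as a standard deviation of 0.05 on the log scale are all hypothetical choices, and the 5% cluster overlap is not calibrated here.

```python
import math
import random


def simulate(n=200, d=50, n_relevant=4, censor_frac=0.05, noise_sd=0.05, seed=1):
    """Return (X, times, event, labels, relevant) for a two-cluster toy data set."""
    rng = random.Random(seed)
    relevant = sorted(rng.sample(range(d), n_relevant))
    # Hypothetical cluster-specific means and regression coefficients.
    shift = {0: -2.0, 1: 2.0}
    beta = {0: [0.5] * n_relevant, 1: [-0.5] * n_relevant}

    X, log_t, labels = [], [], []
    for i in range(n):
        k = i % 2
        # Non-relevant features: independent standard Gaussians.
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # A shared factor makes the relevant dimensions correlated.
        common = rng.gauss(0.0, 1.0)
        for j in relevant:
            row[j] = shift[k] + 0.7 * common + 0.7 * rng.gauss(0.0, 1.0)
        signal = sum(b * row[j] for b, j in zip(beta[k], relevant))
        X.append(row)
        log_t.append(signal + rng.gauss(0.0, noise_sd))
        labels.append(k)

    # Censor a random subset: the recorded time is only a lower bound.
    censored = set(rng.sample(range(n), int(round(censor_frac * n))))
    times, event = [], []
    for i in range(n):
        t = math.exp(log_t[i])
        if i in censored:
            t *= rng.uniform(0.1, 1.0)
        times.append(t)
        event.append(0 if i in censored else 1)
    return X, times, event, labels, relevant
```

With the default arguments this yields 200 samples in 50 dimensions, two equally sized clusters and 10 censored survival times.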

Simulation Results over 10 repeats

Application on TCGA Glioblastoma Data

● Expression matrix obtained from Verhaak et al. 2010.
● 196 patients, 12 censored.
● Global Test used to rank KEGG pathways related to survival (in order of adjusted p-values).
● Top 30 KEGG pathways selected.
● The mean expression of all genes in a pathway is calculated as the pathway score for every patient.
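The pathway score described above is a per-patient mean over the genes of each pathway. A minimal sketch, assuming the expression data is held as nested dictionaries; the function name and data layout are hypothetical, and skipping genes that were not measured is an assumption of this example.

```python
def pathway_scores(expression, pathways):
    """Mean expression per pathway and patient.

    expression: dict patient -> dict gene -> expression value.
    pathways: dict pathway name -> list of gene symbols.
    Returns dict patient -> dict pathway name -> mean over measured genes.
    """
    scores = {}
    for patient, profile in expression.items():
        scores[patient] = {}
        for name, genes in pathways.items():
            # Genes missing from the profile are skipped (an assumption).
            values = [profile[g] for g in genes if g in profile]
            if values:
                scores[patient][name] = sum(values) / len(values)
    return scores
```

Applied to the top 30 pathways, this would reduce each patient's expression profile to a 30-dimensional feature vector for the clustering model.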

Results on TCGA Data

Section D: Future Work and Extensions

● More simulations are needed to understand the performance of the model.
● Inclusion of more than one type of molecular data (gene expression, methylation, copy number etc.).
● Inclusion of more than one type of clinical data (Karnofsky index etc.).

References

● Verhaak, Roel GW, et al. "Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1." Cancer Cell 17.1 (2010): 98-110.
● Görür, Dilan, and Carl Edward Rasmussen. "Dirichlet process Gaussian mixture models: Choice of the base distribution." Journal of Computer Science and Technology 25.4 (2010): 653-664.
● Savage, Richard S., et al. "Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data." arXiv preprint arXiv:1304.3577 (2013).
● Neal, Radford M. "Markov chain sampling methods for Dirichlet process mixture models." Journal of Computational and Graphical Statistics 9.2 (2000): 249-265.
● Park, Trevor, and George Casella. "The Bayesian lasso." Journal of the American Statistical Association 103.482 (2008): 681-686.
● Sha, Naijun, Mahlet G. Tadesse, and Marina Vannucci. "Bayesian variable selection for the analysis of microarray data with censored outcomes." Bioinformatics 22.18 (2006): 2262-2268.
● Goeman, Jelle J., et al. "A global test for groups of genes: testing association with a clinical outcome." Bioinformatics 20.1 (2004): 93-99.
