Technologies such as expression microarrays and high-throughput sequencing assays have accelerated research on gene transcription in biological cells. Furthermore, many links between gene expression levels and the phenotypic characteristics of cells have been discovered. Our current understanding of transcriptomics as an intermediate regulatory layer between genomics and proteomics raises hope that we will soon be able to decipher many more cellular mechanisms through the exploration of gene transcription.
However, although large amounts of expression data are measured, only limited information can be extracted. One general problem is the large set of genomic features under consideration. Expression levels are often analyzed individually because of limited computational resources and unknown statistical dependencies among the features. Testing many features individually raises multiple-testing issues, and fitting models to so many features can lead to overfitting, a problem commonly referred to as the “curse of dimensionality.”
Another problem can arise from ignoring measurement uncertainty. In particular, approaches that consider statistical significance can underestimate the uncertainty for weakly expressed genes and consequently require subjective manual measures (e.g., domain-specific gene filters) to produce consistent results.
In this thesis, we lay out a theoretical foundation for a Bayesian interpretation of gene expression data based on subtle assumptions. Expression measurements are related to latent information (e.g., the transcriptome composition), which we formulate as a probability distribution that represents the uncertainty over the composition of the original sample.
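As a rough illustration of treating the transcriptome composition as a distribution rather than a point estimate, the sketch below uses a standard textbook model, not necessarily the one developed in the thesis: read counts are assumed to be multinomial, and a symmetric Dirichlet prior then gives a Dirichlet posterior over the composition. The read counts, the prior value, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def composition_posterior(counts, prior=1.0):
    """Dirichlet posterior parameters for multinomial read counts.

    Assumption: a symmetric Dirichlet(prior) prior over the composition.
    """
    counts = np.asarray(counts, dtype=float)
    return counts + prior


def relative_uncertainty(alpha):
    """Coefficient of variation (std / mean) of each Dirichlet component."""
    a0 = alpha.sum()
    mean = alpha / a0
    var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))
    return np.sqrt(var) / mean


# Hypothetical read counts for four genes, from weakly to strongly expressed.
counts = np.array([5, 50, 500, 9445])
alpha = composition_posterior(counts)
cv = relative_uncertainty(alpha)

# Posterior draws of the composition; each draw lies on the simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=2000)

for c, v in zip(counts, cv):
    print(f"reads={c:5d}  relative uncertainty={v:.3f}")
```

Under these assumptions, the relative uncertainty shrinks as the read count grows, so the weakly expressed genes carry the widest posterior without any manual filter.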
Instead of analyzing univariate gene expression levels, we use the multivariate transcriptome composition space. To make this computationally feasible, we develop a scalable dimensional reduction that aims to produce the best approximation usable with the available computational resources.
To enable the deconvolution of gene expression, we describe subtissue-specific probability distributions of expression profiles. We demonstrate the suitability of our approach with two deconvolution applications: first, we infer the composition of immune cells, and second, we reconstruct tumor-specific expression patterns from bulk RNA-seq data of prostate tumor tissue samples.
1 Introduction
1.1 State of the Art and Motivation
1.2 Scope of this Thesis
2 Notation and Abbreviations
2.1 Notations
2.2 Abbreviations
3 Methods
3.1 The Convolution Assumption
3.2 Principal Component Analysis
3.3 Expression Patterns
3.4 Bayes’ Theorem
3.5 Inference Algorithms
3.5.1 Inference Through Sampling
3.5.2 Variational Inference
4 Prior and Conditional Probabilities
4.1 Mixture Coefficients
4.2 Distribution of Tumor Cell Content
4.2.1 Optimal Tumor Cell Content Drawing
4.3 Transcriptome Composition Distribution
4.3.1 Sequencing Read Distribution
4.3.1.1 Empirical Plausibility Investigation
4.3.2 Dirichlet and Normality
4.3.3 Theta ◦ log Transformation
4.3.4 Variance Stabilization
4.4 Cell and Tissue-Type-Specific Expression Pattern Distributions
4.4.1 Method of Moments and Factor Analysis
4.4.1.1 Tumor Free Cells
4.4.1.2 Tumor Cells
4.4.2 Characteristic Function
4.4.3 Gaussian Mixture Model
4.5 Prior Covariance Matrix Distribution
4.6 Bayesian Survival Analysis
4.7 Demarcation from Existing Methods
4.7.1 Negative Binomial Distribution
4.7.2 Steady State Assumption
4.7.3 Partial Correlation
4.7.4 Interaction Networks
5 Feasibility via Dimensional Reduction
5.1 DR for Deconvolution of Expression Patterns
5.1.1 Systematically Differential Expression
5.1.2 Internal Distortion
5.1.3 Choosing a DR
5.1.4 Testing the DR
5.2 Transformed Density Functions
5.3 Probability Distribution of Mixtures in DR Space
5.3.1 Likelihood Gradient
5.3.2 The Theorem
5.3.3 Implementation
5.4 DR for Inference of Cell Composition
5.4.1 Problem Formalization
5.4.2 Naive PCA
5.4.3 Whitening
5.4.3.1 Covariance Inflation
5.4.4 DR Through Optimization
5.4.4.1 Starting Point
5.4.4.2 The Optimization Process
5.4.5 Results
5.5 Interpretation of DR
5.6 Comparison to Other DRs
5.6.1 Weighted Correlation Network Analysis
5.6.2 t-Distributed Stochastic Neighbor Embedding
5.6.3 Diffusion Map
5.6.4 Non-negative Matrix Factorization
5.7 Conclusion
6 Data for Example Application
6.1 Immune Cell Data
6.1.1 Provided List of Publicly Available Data
6.1.2 Obtaining the Publicly Available RNA-seq Data
6.1.3 Obtaining the Publicly Available Expression Microarray Data
6.1.4 Data Sanitization
6.1.4.1 A Tagging Tool
6.1.4.2 Tagging Results
6.1.4.3 Automatic Sanitization
6.1.5 Data Unification
6.1.5.1 Feature Mapping
6.1.5.2 Feature Selection
6.2 Examples of Mixtures with Gold Standard
6.2.1 Expression Microarray Data
6.2.2 Normalized Expression
6.2.3 Composition of the Gold Standard
6.3 Tumor Expression Data
6.3.1 Tumor Content
6.4 Benchmark Reference Study
6.4.1 Methodology
6.4.2 Reproduction
6.4.3 Reference Hazard Model
7 Bayesian Models in Example Applications
7.1 Inference of Cell Composition
7.1.1 The Expression Pattern Distributions (EPDs)
7.1.2 The Complete Model
7.1.3 Start Values
7.1.4 Resource Limits
7.2 Deconvolution of Expression Patterns
7.2.1 The Distribution of Expression Pattern Distribution
7.2.2 The Complete Model
7.2.3 Single Sample Deconvolution
7.2.4 A Simplification
7.2.5 Start Values
8 Results of Example Applications
8.1 Inference of Cell Composition
8.1.1 Single Composition Output
8.1.2 ELBO Convergence in Variational Inference
8.1.3 Difficulty-Divergence
8.1.3.1 Implementing an Alternative Stick-Breaking
8.1.3.2 Using More General Inference Methods
8.1.3.3 Using Better Data
8.1.3.4 Restriction of Variance of Cell-Type-Specific EPDs
8.1.3.5 Doing Fewer Iterations
8.1.4 Difficulty-Bias
8.1.5 Comparison to Gold Standard
8.1.6 Comparison to Competitors
8.1.6.1 Submission-Aginome-XMU
8.1.6.2 Submission-Biogem
8.1.6.3 Submission-DA505
8.1.6.4 Submission-AboensisIV
8.1.6.5 Submission-mittenTDC19
8.1.6.6 Submission-CancerDecon
8.1.6.7 Submission-CCB
8.1.6.8 Submission-D3Team
8.1.6.9 Submission-ICTD
8.1.6.10 Submission-Patrick
8.1.6.11 Conclusion for the Competitor Review
8.1.7 Implementation
8.1.8 Conclusion
8.2 Deconvolution of Expression Patterns
8.2.1 Difficulty-Multimodality
8.2.1.1 Order of Kernels
8.2.1.2 Posterior EPD Complexity
8.2.1.3 Tumor Cell Content Estimate
8.2.2 Difficulty-Time
8.2.3 The Inference Process
8.2.3.1 ELBO Convergence in Variational Inference
8.2.4 Posterior of Tumor Cell Content
8.2.5 Posterior of Tissue Specific Expression
8.2.6 Posterior Hazard Model
8.2.7 Gene Marker Study with Deconvoluted Tumor Expression
8.2.8 Hazard Model Comparison Overview
8.2.9 Implementation
9 Discussion
9.1 Limitations
9.1.1 Simplifying Assumptions
9.1.2 Computation Resources
9.1.3 Limited Data and Suboptimal Format
9.1.4 It Is Just Consistency
9.1.5 ADVI Uncertainty Estimation
9.2 Outlook
9.3 Conclusion
A Appendix
A.1 Optimal α
A.2 Digamma Function and Logarithm
A.3 Common Normalization
A.3.1 CPM Normalization
A.3.2 TPM Normalization
A.3.3 VST Normalization
A.3.4 PCA After Different Normalizations
A.4 Mixture Prior Per Tissue Source
A.5 Data
A.6 Cell Type Characterization without Whitening
B Proofs
Bibliography
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:75763 |
Date | 23 August 2021 |
Creators | Otto, Dominik |
Contributors | Universität Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |