In biology and bioinformatics, many kinds of data share a property that challenges numerous cutting-edge research studies: heterogeneity at the individual level with respect to more than one factor. Examples of such heterogeneity include, but are not limited to, (1) unequal susceptibility of different patients and (2) large diversity in gene length, GC content, etc., along with the resulting gene characteristics. For many biological data analysis studies, the critical first step is usually to infer the null probability distribution of the observed data with the heterogeneities in multiple (confounding) factors taken into account, so that the impact of other factor(s) of interest can be investigated further. These heterogeneities heavily influence the conclusions that can be drawn from statistical analyses of the data.
However, modeling such heterogeneities has been challenging, not only because explicitly modeling all factors with heterogeneous effects on the data is infeasible, but also because many factors are not independent of one another. Existing methods either partially or fully neglect the heterogeneity issue, or handle each factor's heterogeneity in isolation. Evidence has shown the insufficiency of such strategies and the errors they may produce in downstream analyses.
The emergence of large-scale data sets provides the opportunity to learn the heterogeneity directly and comprehensively from the data, without explicitly modeling the underlying mechanisms or imposing strong assumptions. The data, which are often stored or organized as multidimensional contingency tensors, lead to a natural perspective for modeling heterogeneity, with each impact factor of interest being one dimension. The heterogeneity in each factor's impact on the variable of interest can be captured by the marginal property of the data tensor with respect to the corresponding dimension. For instance, in a single-cell sequencing dataset, which can be organized as a matrix with each row representing a gene and each column representing a cell, the heterogeneity caused by both the gene and cell factors can be modeled.
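As a small illustration of the "marginal property" mentioned above, the sketch below computes, for each dimension of a count tensor, the sum over all other dimensions. The function name `marginal_sums` and the toy gene-by-cell matrix are hypothetical and are not taken from the dissertation; for a 2D matrix the result reduces to the row sums and the column sums.

```python
import numpy as np


def marginal_sums(tensor):
    """Return one marginal vector per dimension of a data tensor.

    The marginal for dimension k is obtained by summing over every
    other dimension (hypothetical helper, for illustration only).
    """
    tensor = np.asarray(tensor)
    marginals = []
    for axis in range(tensor.ndim):
        # Sum over all axes except the one whose marginal we want.
        other_axes = tuple(k for k in range(tensor.ndim) if k != axis)
        marginals.append(tensor.sum(axis=other_axes))
    return marginals


# Toy gene-by-cell count matrix: 2 genes (rows) x 3 cells (columns).
toy_counts = np.array([[1, 0, 2],
                       [0, 3, 1]])
gene_totals, cell_totals = marginal_sums(toy_counts)
print("per-gene totals:", gene_totals)
print("per-cell totals:", cell_totals)
```

The same function applies unchanged to tensors with three or more dimensions, returning one marginal vector per factor.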
In this dissertation, we develop a novel model, Conditional Multifactorial Contingency (CMC), that models the intertwined heterogeneities in all dimensions of the data tensor and infers the probability distribution of each entry of the data tensor jointly conditioned on these heterogeneities. In the proposed CMC model, the problem is formulated as a maximum entropy problem over the contingency tensor's probability distribution subject to the marginal constraints, under the assumption that the individuals within each dimension are independent. The marginal constraints are applied to the expected values rather than to the observed trial outcomes, which plays a key role in avoiding the enumeration of innumerable combinations of trial outcomes and leads to an elegant closed-form expression for each entry's probability distribution. The model is first developed for 2D binary data matrices, then extended to multidimensional data tensors and integer data tensors. Furthermore, CMC is extended to handle data with missing values.
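To make the formulation above concrete, here is a minimal sketch for the 2D binary case. It relies on a standard result rather than on the dissertation's own derivation: when entries are independent and the constraints fix the expected row and column sums, the maximum entropy solution has the logistic form p[i, j] = 1 / (1 + exp(-(a[i] + b[j]))). The function name `fit_expected_marginals`, the plain gradient-style update, and the toy matrix are all assumptions made for illustration; the algorithm actually used in CMC may differ.

```python
import numpy as np


def fit_expected_marginals(X, n_iter=20000, lr=1.0, tol=1e-9):
    """Fit independent Bernoulli probabilities whose expected row and
    column sums match the observed sums of a binary matrix X.

    Illustrative sketch only: assumes the logistic maximum entropy form
    and uses simple scaled gradient steps on the row/column parameters.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    row_target = X.sum(axis=1)
    col_target = X.sum(axis=0)
    a = np.zeros(n_rows)  # row (e.g. gene) parameters
    b = np.zeros(n_cols)  # column (e.g. cell or TF) parameters
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))
        row_gap = row_target - P.sum(axis=1)
        col_gap = col_target - P.sum(axis=0)
        if max(np.abs(row_gap).max(), np.abs(col_gap).max()) < tol:
            break
        # Move each parameter toward closing its marginal gap.
        a += lr * row_gap / n_cols
        b += lr * col_gap / n_rows
    return 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))


# Toy binary matrix with no all-zero or all-one row or column.
X_demo = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 1, 1, 0]])
P_demo = fit_expected_marginals(X_demo)
print(np.round(P_demo, 3))
```

Each fitted entry can be read as the probability of observing a 1 at that position given only the row and column heterogeneity. Rows or columns that are entirely 0 or entirely 1 push the parameters toward infinity in this simple sketch, so they would need separate handling.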
Using CMC, we conducted four case studies on real-world bioinformatics research problems: (1) driving transcription factor (TF) identification; (2) scRNA-seq data normalization; (3) cancer-associated gene identification; and (4) cell similarity quantification. For each case study, we proposed a complete analysis framework and a specific adaptation of CMC. For driving-TF identification, in contrast to traditional methods, we considered the variation in each gene's binding affinity in addition to the typically considered variation in each TF's binding affinity. The driving TFs were identified by comparing the observed binding state with the estimated binding probability conditioned on the TF and gene binding affinities. For scRNA-seq data normalization, besides the gene factor and the cell factor, we identified one more factor that affects the read counts, cDNA length, and applied CMC to analyze the three factors jointly. For cancer-associated gene identification, the CMC model is applied to systematically model the patient, gene, and mutation type factors in the mutation count data. As for the last application, to the best of our knowledge, our solution is the first cell-to-cell-type similarity quantification method to be proposed, made possible by CMC's ability to systematically model and remove the impact of the cell and gene factors.
We studied the theoretical properties of the proposed model and validated the effectiveness and efficiency of our method through experiments. The uniqueness of the probability solution and the convergence of the algorithm were proved. In identifying true driving TFs, CMC significantly improved on the best previously reported success rate, as demonstrated on data with ground truth. In addition, in an exploratory study without ground truth, besides the previously known TFs Olig1 (ranked 2nd), Olig2 (ranked 3rd), and Sox10 (ranked 4th), we identified Ppp1r14b (ranked 1st) and Zfp36l1 (ranked 6th), which function in oligodendrocyte lineage development; this finding was validated via biological knock-out experiments and has led to genuine biological discoveries. In scRNA-seq data normalization, experimental results show that, by taking the cell, gene, and cDNA-length factors into account, the normalized data achieve lower variances for housekeeping genes than peer methods. Moreover, data normalized by the CMC model lead to better accuracy in downstream differentially expressed gene (DEG) detection than data normalized by peer methods. In cancer-associated gene identification, the CMC model eliminates most of the likely artefactual findings that result from considering the hidden factors separately. In cell similarity quantification, the CMC-based model enables the identification of cell types by establishing between-species cell similarity quantification, regardless of contamination in the scRNA-seq data.
Doctor of Philosophy
Biological data are complicated and typically influenced by numerous factors, including characteristics of biological subjects, physical or chemical properties of molecules, artifacts created by experimental operations, and so on. The information of real interest in a biology/bioinformatics study can be buried under all sorts of irrelevant factors and their impacts on the data.
Consider a simple example in which a study is conducted to determine whether an association exists between a specific gene and a cancer. Even if this gene shows clearly different mutation frequencies in two groups of people, patients and healthy individuals, we cannot safely confirm the association from this observation alone. Such differential mutation levels can also result from the diversity among these people in how easily this gene is mutated in a given person (which relates to many characteristics of that person besides having cancer or not). We call this diversity "heterogeneity", and it can be seen everywhere: in people, in genes, in cells, in cell types, and so on. One needs to handle such heterogeneities carefully in order to draw firm statistical, and hence scientific, conclusions.
However, handling the heterogeneities is far from trivial. On the one hand, it is generally impossible to fully understand the mechanisms behind this diversity, let alone to formulate them explicitly and rigorously. On the other hand, it is not rare for multiple factors to intertwine with one another, in which case all these factors must be considered systematically in order to model the data precisely. Existing methods either partially or fully neglect the heterogeneity issue, or handle each factor's heterogeneity in isolation. Evidence has shown the insufficiency of such strategies and the errors they may produce in downstream analyses.
As the exact mechanisms behind heterogeneities are usually not available, we aim to learn and infer the heterogeneities' effects on the data from the data themselves. A large class of biological data can be stored or organized as multidimensional contingency tensors, with each impact factor of interest being one dimension. The heterogeneity in each factor's impact on the variable of interest can be captured by the marginal property of the data tensor with respect to the corresponding dimension, for example, the row sums and the column sums in a 2D tensor.
In this dissertation, under the assumption that the individuals of each dimension are independent, we propose a novel model, Conditional Multifactorial Contingency (CMC), that models the intertwined heterogeneities in all dimensions of the data tensor and infers the probability distribution of each entry of the data tensor jointly conditioned on these heterogeneities. The final and most comprehensive version of CMC works on multidimensional binary or integer data tensors, even when some values in the tensor are missing. CMC starts from simple and elegant statistical principles and is derived through rigorous theoretical proofs, yet it results in a powerful tool that is widely applicable to real-world biology/bioinformatics studies.
Using CMC, we conducted four case studies on real-world bioinformatics research problems: (1) driving transcription factor (TF) identification; (2) scRNA-seq data normalization; (3) cancer-associated gene identification; and (4) cell similarity quantification. For each case study, we proposed a complete analysis framework and a specific adaptation of CMC. In each of them, our CMC-based method outperformed existing methods and provided inspiring clues for biological discoveries, which have been validated by biological experiments.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/113211 |
Date | 17 January 2023 |
Creators | Cheng, Zuolin |
Contributors | Electrical Engineering, Yu, Guoqiang, Baumann, William T., Mili, Lamine M., Zhang, Liqing, Wang, Yue J. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |