1 |
Unsupervised Signal Deconvolution for Multiscale Characterization of Tissue Heterogeneity. Wang, Niya, 29 June 2015.
Characterizing complex tissues requires precise identification of distinctive cell types, cell-specific signatures, and subpopulation proportions. Tissue heterogeneity, arising from the presence of multiple cell types, is a major confounding factor in studying individual subpopulations and repopulation dynamics, and it cannot be resolved directly by most global molecular and genomic profiling methods. While signal deconvolution has widespread applications to many real-world problems, existing methods suffer from significant limitations, chiefly unrealistic assumptions and heuristics that lead to inaccurate or incorrect results. In this study, we formulate the signal deconvolution task as a blind source separation problem and develop novel unsupervised deconvolution methods within the Convex Analysis of Mixtures (CAM) framework for characterizing multiscale tissue heterogeneity. We also exploratorily test the application of the Significant Intercellular Genomic Heterogeneity (SIGH) method.
Unlike existing deconvolution methods, CAM can identify tissue-specific markers, a critical task, directly from mixed signals and without relying on any prior knowledge. Fundamental to the success of our approach is the geometric exploitation of tissue-specific markers and signal non-negativity. Using a well-grounded mathematical framework, we have proved new theorems showing that the scatter simplex of mixed signals is a rotated and compressed version of the scatter simplex of pure signals, and that the markers resident at the vertices of the scatter simplex are the tissue-specific markers. The algorithm works by geometrically locating the vertices of the scatter simplex of measured signals and their resident markers. The minimum description length (MDL) criterion is applied to determine the number of tissue populations in the sample. Based on the CAM principle, we integrated nonnegative independent component analysis (nICA) and convex matrix factorization (CMF) methods into the CAM-nICA/CMF algorithms and applied them to multiple gene expression, methylation, and protein datasets, achieving very promising results validated against the ground truth or by gene enrichment analysis. We also integrated CAM with compartment modeling (CM) to develop the multi-tissue compartment modeling (MTCM) algorithm, tested on real DCE-MRI data derived from mouse models with consistent and plausible results. Finally, we developed an open-source R-Java software package that implements the various CAM-based algorithms, including an R package approved by Bioconductor specifically for tumor-stroma deconvolution.
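To make the simplex geometry concrete, the following minimal Python sketch (all variable names and the toy data are hypothetical, not from the thesis) simulates mixed expression profiles, L1-normalizes each gene across samples, and recovers the planted subtype-specific markers as the vertices of the scatter simplex:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)

# Toy mixing model (all names and data hypothetical): X = S @ A.T with
# pure-source signals S (genes x K) and mixing proportions A (samples x K).
n_genes, n_samples, K = 300, 8, 3
S = rng.gamma(2.0, 1.0, size=(n_genes, K))
marker_idx = np.arange(5 * K).reshape(K, 5)     # plant 5 markers/subtype
for k in range(K):
    S[marker_idx[k]] = 0.02                     # silent in other subtypes
    S[marker_idx[k], k] = rng.uniform(8.0, 12.0, size=5)
A = rng.dirichlet(np.ones(K), size=n_samples)
X = S @ A.T                                     # observed mixed signals

# CAM geometry: after L1-normalizing each gene across samples, every gene
# lies inside a (K-1)-simplex whose vertices are occupied by the markers.
Xn = X / X.sum(axis=1, keepdims=True)
Xc = Xn - Xn.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:K - 1].T                           # project onto simplex plane

hull = ConvexHull(P)                            # hull vertices = markers
print("hull-vertex genes:", sorted(hull.vertices.tolist()))
print("planted markers  :", marker_idx.ravel().tolist())
```

In this near-noise-free toy setting every hull vertex comes from the planted marker set; the actual CAM machinery additionally handles noise, feature clustering, and MDL-based selection of the number of populations.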
While intercellular heterogeneity often manifests as multiple clones with distinct sequences, systematic efforts to characterize intercellular genomic heterogeneity must effectively distinguish genuine clonal sequences from probabilistic fake derivatives. Building on preliminary studies originally targeting immune T cells, we tested and applied the SIGH algorithm to characterize intercellular heterogeneity directly from mixed sequencing reads. SIGH works by exploiting statistical differences in both the sequencing error rates at different nucleobases and the read counts of fake sequences relative to genuine clones of variable abundance. / Ph. D.
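As a hedged illustration of this statistical idea (not SIGH's actual implementation, whose error model is richer), the sketch below treats a one-mismatch derivative of an abundant clone as arising from a binomial sequencing-error process and flags a candidate as genuine only when its read count is implausible under that null:

```python
from scipy.stats import binom

# Toy error-model test (illustrative parameters, not SIGH's): a fake
# sequence one substitution away from a genuine parent clone with
# parent_reads copies appears ~Binomial(parent_reads, p_err) times,
# where p_err is the per-base error rate at that nucleobase.
def looks_genuine(candidate_reads, parent_reads, p_err=1e-4, alpha=1e-6):
    # P(count >= candidate_reads) under the sequencing-error null
    p_value = binom.sf(candidate_reads - 1, parent_reads, p_err)
    return p_value < alpha

# Next to a 100,000-read parent, ~10 error copies are expected at
# p_err = 1e-4, so 60 reads looks genuine while 12 reads is likely fake.
print(looks_genuine(60, 100_000))   # True
print(looks_genuine(12, 100_000))   # False
```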
|
2 |
Machine Learning Approaches for Modeling and Correction of Confounding Effects in Complex Biological Data. Wu, Chiung Ting, 09 June 2021.
With the huge volume of biological data generated by new technologies and the booming of new machine learning based analytical tools, we expect to advance life science and human health at an unprecedented pace. Unfortunately, there is a significant gap between the complex raw biological data of real life and the data required by mathematical and statistical tools. This gap stems from two fundamental and universal problems in biological data, both related to confounding effects. The first is the intrinsic complexity of the data: an observed sample may be a mixture of multiple underlying sources, of which we are interested in only one or a few. The second comes from the acquisition process: different samples may be gathered at different times and/or from different locations, so each sample carries a specific distortion that must be carefully addressed. These confounding effects obscure the signals of interest in the acquired data. Specifically, this dissertation addresses the two major challenges in removing confounding effects: alignment and deconvolution.
Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention time (RT) of the same compound across different samples, which must subsequently be corrected (aligned) during data processing. Classic alignment methods, such as those in the popular XCMS package, often assume a single time-warping function for each sample, neglecting the potentially varying RT drift of compounds with different masses within a sample. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. To utilize this information, we develop an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and places constraint edges on the warping functions of neighboring samples. We applied ncGTW to two large-scale metabolomics LC-MS datasets, identifying many misaligned features and successfully realigning them; these features would otherwise be discarded or left uncorrected by existing methods.
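For intuition, the toy dynamic-time-warping routine below (written for this summary; ncGTW's actual formulation instead fits compound-specific paths and couples the paths of neighboring samples via constraint edges) aligns a single pair of chromatographic profiles with a simulated RT drift:

```python
import numpy as np

# Plain dynamic time warping for ONE pair of profiles; a simplified
# stand-in for the single-warping-function baseline that ncGTW improves on.
def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    i, j, path = n, m, []                       # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else \
               (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1], D[n, m]

# Two chromatographic peaks with a simulated 10-scan retention-time drift
t = np.arange(200, dtype=float)
reference = np.exp(-0.5 * ((t - 100.0) / 8.0) ** 2)
drifted = np.exp(-0.5 * ((t - 110.0) / 8.0) ** 2)
path, dist = dtw(reference, drifted)
print(f"alignment cost: {dist:.4f} over {len(path)} matched scan pairs")
```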
When the desired signal is buried in a mixture, deconvolution is needed to recover the pure sources, and many biological questions can be better addressed when the data represent individual sources rather than mixtures. Though some promising supervised deconvolution methods exist, unsupervised deconvolution is still needed when no a priori information is available. Among current unsupervised methods, Convex Analysis of Mixtures (CAM) is the most theoretically solid and best-performing one. However, it has some major limitations. Most importantly, the overall time complexity can be very high, especially when analyzing a large dataset or one with many sources. Also, because of several stochastic and heuristic steps, the deconvolution result is not accurate enough. To address these problems, we redesigned the modules of CAM. In the feature clustering step, we propose radius-fixed clustering, which not only bounds the spatial extent of each cluster but also identifies outliers simultaneously, thereby avoiding the disadvantages of K-means clustering such as instability and the need to pre-specify the number of clusters. Moreover, when identifying the convex hull, we replace Quickhull with linear programming, which decreases the computation time significantly. To avoid the heuristic and merely approximate step of optimal simplex identification, we propose a greedy search strategy instead. The experimental results demonstrate a vast improvement in computation time, and the accuracy of the deconvolution is also shown to be higher than that of the original CAM. / Doctor of Philosophy / Due to the complexity of biological data, there are two major pre-processing steps: alignment and deconvolution. The alignment step corrects time- and location-related data acquisition distortion by aligning the detected signals to a reference signal. Though many alignment methods have been proposed for biological data, most fail to consider the relationships among samples carefully; this structural information can aid alignment when the data are noisy and/or irregular. To utilize this information, we develop a new method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), inspired by graph theory. This new alignment method not only utilizes the structural information but also provides a reference-free solution. We show that the performance of our new method is better than that of other methods on both simulations and real datasets.
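A hedged sketch of the linear-programming idea mentioned above (illustrative only, not the thesis implementation): deciding whether a point belongs to a convex hull reduces to one small feasibility LP, checking whether the point is a convex combination of the other points, so no full Quickhull run is required.

```python
import numpy as np
from scipy.optimize import linprog

# x lies in the convex hull of the rows of V iff the feasibility LP
# {V.T @ lam = x, sum(lam) = 1, lam >= 0} has a solution; each hull
# membership test is therefore one small LP.
def inside_hull(x, V):
    n = V.shape[0]
    A_eq = np.vstack([V.T, np.ones((1, n))])    # stack Vx = x, sum = 1
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n, method="highs")
    return res.success                          # feasible -> inside

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(inside_hull(np.array([0.2, 0.2]), pts))   # True:  interior point
print(inside_hull(np.array([1.0, 1.0]), pts))   # False: outside the hull
```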
When the signal comes from a mixture, deconvolution is needed to recover the pure sources, and many biological questions can be better addressed when the data represent single sources instead of mixtures. A classic unsupervised deconvolution method is Convex Analysis of Mixtures (CAM). However, this method has some limitations. For example, the time complexity of some steps is very high, so for a large dataset or one with many sources, the computation time would be extremely long. Also, because of several stochastic and heuristic steps, the deconvolution result may not be accurate enough. We improved CAM, and the experimental results show that both the speed and the accuracy of the deconvolution are significantly improved.
|
3 |
Computational Dissection of Composite Molecular Signatures and Transcriptional Modules. Gong, Ting, 22 January 2010.
This dissertation aims to develop a latent variable modeling framework with which to analyze gene expression profiling data for computational dissection of molecular signatures and transcriptional modules.
The first part of the dissertation focuses on extracting pure gene expression signals from tissue or cell mixtures. The main goal of gene expression profiling is to identify the pure signatures of different cell types (such as cancer cells, stromal cells, and inflammatory cells) and to estimate the concentration of each cell type. To accomplish this, a new blind source separation method, nonnegative partially independent component analysis (nPICA), is developed for tissue heterogeneity correction (THC). The THC problem is formulated as a constrained optimization problem and solved with a learning algorithm based on geometric and statistical principles.
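nPICA itself is not reproduced here; as a plainly labeled stand-in under the same nonnegative mixing assumptions, the sketch below uses ordinary nonnegative matrix factorization to illustrate the THC setup, with all names and toy data hypothetical:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)

# Stand-in for the THC blind source separation setup: mixed profiles
# X (genes x samples) factor into pure signatures W and per-sample
# cell-type concentrations H, both constrained to be nonnegative.
n_genes, K, n_samples = 200, 3, 12
W_true = rng.gamma(2.0, 1.0, size=(n_genes, K))        # pure signatures
H_true = rng.dirichlet(np.ones(K), size=n_samples).T   # concentrations
X = W_true @ H_true                                    # observed mixture

model = NMF(n_components=K, init="nndsvda", max_iter=500)
W_hat = model.fit_transform(X)      # estimated pure expression signals
H_hat = model.components_           # estimated cell-type concentrations
print(W_hat.shape, H_hat.shape)     # (200, 3) (3, 12)
```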
The second part of the dissertation seeks to identify gene modules from gene expression data in order to uncover important biological processes in different types of cells. A new gene clustering approach, nonnegative independent component analysis (nICA), is developed for gene module identification. The nICA approach is complemented by an information-theoretic procedure for input sample selection and a novel stability analysis approach for proper dimension estimation. Experimental results showed that the gene modules identified by nICA appear significantly enriched in functional annotations in terms of gene ontology (GO) categories.
The third part of the dissertation moves from the gene module level down to the DNA sequence level to identify gene regulatory programs by integrating gene expression data and protein-DNA binding data. A sparse hidden component model is first developed for this problem, taking into account a well-known biological principle, namely that a gene is most likely regulated by only a few regulators. This is followed by the development of a novel computational approach, motif-guided sparse decomposition (mSD), to integrate the binding information with the gene expression data.
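The sketch below is a minimal caricature of the motif-guided sparsity idea, not the mSD algorithm: the transcription factor activities are taken as known here, whereas mSD treats them as hidden components, and all names and data are hypothetical. Binding evidence restricts which regulators may have nonzero weight, and the sparsity penalty keeps only the few true ones.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# One gene's expression across samples, regressed on TF activities; the
# motif mask encodes binding evidence in this gene's promoter region.
n_samples, n_tfs = 30, 6
tf_activity = rng.normal(size=(n_samples, n_tfs))   # assumed known here
motif_mask = np.array([1, 1, 0, 0, 1, 0], bool)     # binding evidence
true_w = np.array([2.0, 0.0, 0.0, 0.0, -1.5, 0.0])  # sparse ground truth
expr = tf_activity @ true_w + 0.1 * rng.normal(size=n_samples)

candidates = np.where(motif_mask)[0]                # motif-supported TFs
model = Lasso(alpha=0.05).fit(tf_activity[:, candidates], expr)
weights = np.zeros(n_tfs)
weights[candidates] = model.coef_
print(np.round(weights, 2))   # recovers approx. [2, 0, 0, 0, -1.5, 0]
```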
These computational approaches are developed primarily for analyzing high-throughput gene expression profiling data. Nevertheless, the proposed methods should be extensible to other types of high-throughput data for biomedical research. / Ph. D.
|
4 |
Mathematical Modeling and Deconvolution for Molecular Characterization of Tissue Heterogeneity. Chen, Lulu, 22 January 2020.
Tissue heterogeneity, arising from intermingled cellular or tissue subtypes, significantly obscures the analysis of molecular expression data derived from complex tissues. Existing computational methods for deconvolving mixed-subtype signals rely almost exclusively on supervising information, requiring subtype-specific markers, the number of subtypes, or the subtype compositions of individual samples. We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes directly from mixture expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM), that can automatically detect tissue- or cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples, and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, and proteomics data. With enhanced data preprocessing and prior knowledge incorporation, the debCAM software tool will allow biologists to perform a deep and unbiased characterization of tissue remodeling in many biomedical contexts.
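One of the outputs listed above, per-sample subtype proportions, reduces to a small nonnegative least-squares fit once subtype-specific profiles are in hand; the sketch below (simulated data, not debCAM's actual code path) illustrates that step:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)

# Simulated subtype profiles S (genes x K) and true proportions A_true
# (samples x K); observed mixtures X = S @ A_true.T plus noise.
n_genes, K, n_samples = 100, 3, 5
S = rng.gamma(2.0, 1.0, size=(n_genes, K))
A_true = rng.dirichlet(np.ones(K), size=n_samples)
X = S @ A_true.T + 0.05 * rng.normal(size=(n_genes, n_samples))

# Nonnegative least squares per sample, then renormalize to proportions.
A_hat = np.array([nnls(S, X[:, j])[0] for j in range(n_samples)])
A_hat /= A_hat.sum(axis=1, keepdims=True)
print(np.abs(A_hat - A_true).max())   # small recovery error
```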
Purified expression profiles from physical experiments provide both ground truth and a priori information that can be used to validate unsupervised deconvolution results or to improve supervision for various deconvolution methods. Detecting tissue- or cell-specific expressed markers from purified expression profiles plays a critical role in molecularly characterizing and determining tissue or cell subtypes. Unfortunately, classic differential analysis assumes a convenient test statistic and associated null distribution that are inconsistent with the definition of markers, resulting in a high false positive rate or low detection power. We describe a statistically principled marker detection method, the One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG) test, which estimates a mixture null distribution model by applying novel permutation schemes. Validated on realistic synthetic data sets for both type 1 error and detection power, the OVESEG test applied to benchmark gene expression data sets detects many known and de novo subtype-specific expressed markers. Subsequent supervised deconvolution, using markers detected by the OVESEG test, showed superior performance compared with popular peer methods.
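To show why "one versus everyone" differs from ordinary differential analysis, here is a toy score (not OVESEG's permutation-calibrated statistic, and all data hypothetical): a gene is marker-like for subtype k only when its mean there beats the largest mean among all other subtypes.

```python
import numpy as np

# groups: list of 1-D expression arrays, one per subtype, for one gene.
def one_vs_everyone(groups):
    means = np.array([g.mean() for g in groups])
    se = np.sqrt(sum(g.var(ddof=1) / len(g) for g in groups))
    k = int(means.argmax())
    runner_up = np.delete(means, k).max()   # best of all OTHER subtypes
    return k, (means[k] - runner_up) / se

rng = np.random.default_rng(3)
marker = [rng.normal(5, 1, 10), rng.normal(1, 1, 10), rng.normal(1.2, 1, 10)]
nonmark = [rng.normal(3, 1, 10), rng.normal(3, 1, 10), rng.normal(3, 1, 10)]
print(one_vs_everyone(marker))    # large score -> candidate marker
print(one_vs_everyone(nonmark))   # near zero  -> not subtype-exclusive
```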
While the current debCAM approach can dissect mixed signals from multiple samples into the 'averaged' expression profiles of subtypes, many subsequent molecular analyses of complex tissues require sample-specific deconvolution, in which each sample is a mixture of 'individualized' subtype expression profiles. The between-sample variation embedded in sample-specific subtype signals provides critical information for detecting subtype-specific molecular networks and uncovering hidden crosstalk. However, sample-specific deconvolution is an underdetermined and challenging problem because there are more variables than observations. We propose and develop debCAM2.0 to estimate sample-specific subtype signals by nuclear norm regularization, with the hyperparameter value determined by a cross-validation scheme based on random entry exclusion. We also derive an efficient optimization approach based on ADMM to enable the application of debCAM2.0 to large-scale biological data analyses. Experimental results on realistic simulation data sets show that debCAM2.0 can successfully recover subtype-specific correlation networks that are otherwise unobtainable with existing deconvolution methods. / Doctor of Philosophy / Tissue samples are essentially mixtures of tissue or cellular subtypes, where the proportions of individual subtypes vary across tissue samples. Data deconvolution aims to dissect tissue heterogeneity into biologically important subtypes, their proportions, and their marker genes. The physical solution to mitigating tissue heterogeneity is to isolate pure tissue components prior to molecular profiling, but these experimental methods are time-consuming, expensive, and may alter expression values during isolation. The existing literature focuses primarily on supervised deconvolution methods, which require a priori information; this approach has an inherent problem in that it relies on the quality and accuracy of that information. In this dissertation, we propose and develop a fully unsupervised deconvolution method, deconvolution by Convex Analysis of Mixtures (debCAM), that can estimate the mixing proportions and 'averaged' expression profiles of the individual subtypes present in heterogeneous tissue samples. Furthermore, we also propose and develop debCAM2.0, which can estimate the 'individualized' expression profiles of participating subtypes in complex tissue samples.
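The key ADMM subproblem for a nuclear-norm penalty is singular value soft-thresholding; the self-contained sketch below (illustrative, not the debCAM2.0 code) shows that proximal step pushing a noisy matrix back toward low rank:

```python
import numpy as np

# Proximal operator of tau * ||.||_* (nuclear norm): soft-threshold the
# singular values. An ADMM solver applies this once per iteration to the
# stacked sample-specific signal matrix.
def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)        # shrink singular values toward 0
    return (U * s) @ Vt

rng = np.random.default_rng(4)
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))   # rank 3
noisy = low_rank + 0.5 * rng.normal(size=(50, 40))               # full rank
denoised = svt(noisy, tau=5.0)
print(np.linalg.matrix_rank(denoised, tol=1e-8))   # far below min(50, 40)
```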
Subtype-specific expressed markers, or marker genes (MGs), serve as critical a priori information for supervised deconvolution. MGs are exclusively and consistently expressed in a particular tissue or cell subtype, and detecting such unique MGs across many subtypes is a challenging task. We propose and develop a statistically principled method, the One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG) test, for robust detection of MGs from the purified profiles of many subtypes.
|
5 |
Estudo \"in vivo\" da influência de heterogeneidade de tecidos na distribuição de dose em terapias com raios x de alta energia / In vivo study of the influence of tissue inhomogeneity on dose distribution in high energy X ray therapyAldred, Martha Aurelia 21 April 1987 (has links)
The effect of tissue heterogeneity on the doses delivered in radiotherapy has been the subject of many studies. Most of them report "in vitro" measurements in homogeneous phantoms into which materials simulating bone or air cavities are introduced. In the present work, an "in vivo" study was chosen instead, using thermoluminescent dosimetry as the measuring instrument, which is well suited for well-known reasons. Eight patients with tumors in the pelvic region were selected, treated with X rays from the 4 MeV linear accelerator at the Instituto de Radioterapia Oswaldo Cruz. The ratio between the doses measured at the entry and the exit of the beam, when compared with the value predicted for a homogeneous phantom, indicates the influence of heterogeneity on the dose distribution. Analysis of the measurements on the eight patients showed that the dose delivered to the tumor center can be 8% lower than planned when this heterogeneity is not taken into account.
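The comparison logic reduces to simple arithmetic; the numbers below are purely illustrative stand-ins, not the thesis measurements:

```python
# Hypothetical readings, not the thesis data: thermoluminescent
# dosimeters give an entry and an exit dose for each patient, and the
# in vivo entry/exit ratio is compared with the ratio predicted for a
# homogeneous water phantom.
entry_dose, exit_dose = 1.00, 0.42      # in vivo TLD readings (a.u.)
phantom_ratio = 2.20                    # predicted entry/exit, homogeneous
in_vivo_ratio = entry_dose / exit_dose  # ~2.38: extra attenuation in vivo
deviation = in_vivo_ratio / phantom_ratio - 1.0
# Extra attenuation (e.g., pelvic bone in the field) implies the dose
# actually delivered at the tumor center falls below the planned value.
print(f"ratio deviates from homogeneous prediction by {deviation:+.1%}")
```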
|