Global ETD Search

1	Pattern analysis of microarray data. / 基因芯片數據中的模式分析 / CUHK electronic theses & dissertations collection / Ji yin xin pian shu ju zhong de mo shi fen xi January 2009 DNA microarray technology is the most notable high-throughput technology to have emerged for functional genomics in recent years. Patterns in microarray data provide clues to gene functions, cell types, and interactions among genes or gene products. As the scale of microarray data keeps growing, there is an urgent need for methods and tools to analyze these large amounts of complex data. / Interesting patterns in microarray data can be patterns that appear with significant frequency or patterns that exhibit special trends. First, an algorithm to find biclusters with coherent values is proposed. In these biclusters the subset of genes (or samples) shows some similarity, such as low Euclidean distance or high Pearson correlation coefficient. We propose the Average Correlation Value (ACV) to measure the homogeneity of a bicluster. ACV outperforms the alternatives in that it is applicable to more types of biclusters. Our algorithm applies the dominant set approach to create sets of sorting vectors for the rows of the data matrix. In this way, the co-expressed rows of the data matrix can be gathered. By alternately sorting and transposing the data matrix, the blocks of co-expressed subsets are gathered. A weighted correlation coefficient is used to measure similarity at the gene level and the sample level. The weights are updated each time using the sorting vector of the previous iteration. Genes/samples that are assigned higher weights contribute more to the similarity measure when they are used as features for the other dimension. Unlike two-way clustering or divide-and-conquer algorithms, our approach does not break the structure of the whole data and can find multiple overlapping biclusters. 
The method also has a low computation cost compared with exhaustive enumeration and distribution parameter identification methods. / Next, algorithms to find biclusters with coherent evolutions, more specifically order-preserving patterns, are proposed. In an Order Preserving Cluster (OP-Cluster), genes induce the same relative order on samples, while the exact magnitudes of the data are disregarded. By converting each gene expression vector into an ordered label sequence, we transform the problem into finding frequent orders appearing in the sequence set. Two heuristic algorithms, Growing Prefix and Suffix (GPS) and Growing Frequent Position (GFP), are presented. The results show that both methods have good scale-up properties. They output larger OP-Clusters more efficiently and have lower space and computation costs compared with the existing methods. / We propose the idea of Discovering Distinct Patterns (DDP) in gene expression data. The distinct patterns correspond to genes with significantly different patterns. DDP is useful for scaling down the analysis when there is little prior knowledge. A DDP algorithm is proposed that iteratively picks out pairs of genes with the largest dissimilarities. Experiments are carried out on both synthetic data sets and real microarray data. The results show the effectiveness and efficiency of the approach in finding functionally significant genes. The usefulness of genes with distinct patterns for constructing simplified gene regulatory networks is further discussed. / Teng, Li. / Adviser: Laiwan Chan. / Source: Dissertation Abstracts International, Volume: 71-01, Section: B, page: 0446. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves 118-128). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. 
Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. DNA microarrays Gene expression--Data processing
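Illustration for record 1: the abstract describes converting each gene expression vector into an ordered label sequence and then looking for orders shared by many genes. The sketch below shows that conversion together with a brute-force count of shared sub-orders. It is not the GPS or GFP heuristic from the thesis, whose details the abstract does not give; the function names, the tie-breaking rule (ties keep the lower sample index first), and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def to_label_sequence(expression):
    """Return sample indices ordered by increasing expression value.

    Ties keep the lower sample index first (an assumption of this sketch).
    """
    return tuple(sorted(range(len(expression)), key=lambda i: expression[i]))


def frequent_orders(genes, length, min_support):
    """Brute-force count of ordered sample subsequences of a given length.

    Returns the subsequences that occur in at least `min_support` genes.
    The cost grows combinatorially with the number of samples, so this is
    only suitable for small illustrative inputs.
    """
    counts = Counter()
    for expression in genes:
        sequence = to_label_sequence(expression)
        # combinations() keeps the elements in sequence order, so every
        # tuple it yields is a subsequence of the gene's sample ordering.
        for sub in combinations(sequence, length):
            counts[sub] += 1
    return {sub: n for sub, n in counts.items() if n >= min_support}


# Hypothetical toy data: three genes measured on four samples.
toy_genes = [
    [1.0, 3.0, 2.0, 4.0],
    [10.0, 30.0, 20.0, 15.0],
    [0.1, 0.5, 0.2, 0.9],
]
shared = frequent_orders(toy_genes, length=3, min_support=3)
```

In the toy data all three genes rank sample 0 below sample 2 below sample 1, so that order is the only length-3 pattern with full support, even though the magnitudes differ by orders of magnitude.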
2	The development and application of informatics-based systems for the analysis of the human transcriptome. Kelso, Janet January 2003 Despite the fact that the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows it is becoming increasingly difficult to integrate expression information that comes from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile – the location and timing of transcript expression – provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed. / In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses the detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. 
In the final chapter the biological questions that can be approached, and the discoveries that can be made using these systems, are illustrated with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome. Gene expression, Data processing Genetics, Data processing Genomes, Data processing.
3	HARP: a practical projected clustering algorithm for mining gene expression data Yip, Yuk-Lap, Kevin., 葉旭立. January 2003 Computer Science and Information Systems / Master / Master of Philosophy Gene expression - Data processing. Computer algorithms. Cluster analysis. Data mining.
5	Low dimensional structure in single cell data Kunes, Russell Allen Zhang January 2024 This thesis presents the development of three methods, each of which concerns the estimation of interpretable low dimensional representations of high dimensional data. The first two chapters consider methods for fitting low dimensional nonlinear representations. In Chapter 1, we discuss the deterministic input, noisy "and" gate (DINA) model and in Chapter 2, binary variational autoencoders. We present an example application to single cell assay for transposase accessible chromatin sequencing data (single cell ATACseq), where the DINA model uncovers meaningful discrete representations of cell state. In scientific applications, practitioners have substantial prior knowledge of the latent components driving variation in the data. The third chapter develops a supervised matrix factorization method, Spectra, that leverages annotations from experts and previous biological experiments to uncover latent representations of single cell RNAseq data. / Variational inference for the DINA model: The deterministic input, noisy "and" gate (DINA) model allows for matrix decomposition in which latent factors interact via an "and" relationship. We develop a variational inference approach for estimating the parameters of the DINA model. Previous approaches based on variational inference enumerate the space of latent binary parameters (requiring an exponential number of parameters) and cannot fit an unknown number of latent components. Here, we report that a practical mean field variational inference approach relying on a nonparametric cumulative shrinkage process prior and stochastic coordinate ascent updates achieves results competitive with existing methods while simultaneously determining the number of latent components. This approach allows scaling exploratory Q-matrix estimation to datasets of practical size with minimal hyperparameter tuning. 
/ Gradient estimation for binary latent variable models: In order to fit binary variational autoencoders, the gradient of the objective function must be estimated. More generally, gradient estimation is often necessary for fitting generative models with discrete latent variables, for example in reinforcement learning and variational autoencoder (VAE) training. The DisARM estimator (Yin et al. 2020; Dong, Mnih, and Tucker 2020) achieves state-of-the-art gradient variance for Bernoulli latent variable models in many contexts. However, DisARM and other estimators have potentially exploding variance near the boundary of the parameter space, where solutions tend to lie. To ameliorate this issue, we propose a new gradient estimator, bitflip-1, that has lower variance at the boundaries of the parameter space. As bitflip-1 has complementary properties to existing estimators, we introduce an aggregated estimator, unbiased gradient variance clipping (UGC), that uses either a bitflip-1 or a DisARM gradient update for each coordinate. We prove theoretically that UGC has uniformly lower variance than DisARM. Empirically, we observe that UGC achieves the optimal value of the optimization objectives in toy experiments, discrete VAE training, and a best subset selection problem. / The Spectra model for supervised matrix decomposition: Factor analysis decomposes single-cell gene expression data into a minimal set of gene programs that correspond to processes executed by cells in a sample. However, matrix factorization methods are prone to technical artifacts and poor factor interpretability. We address these concerns with Spectra, an algorithm that combines user-provided gene programs with the detection of novel programs that together best explain expression covariation. Spectra incorporates existing gene sets and cell type labels as prior biological information. 
It explicitly models cell type and represents input gene sets as a gene-gene knowledge graph, using a penalty function to guide factorization towards the input graph. We show that Spectra outperforms existing approaches in challenging tumor immune contexts: it finds factors that change under immune checkpoint therapy, disentangles the highly correlated features of CD8+ T-cell tumor reactivity and exhaustion, finds a program that explains continuous macrophage state changes under therapy, and identifies cell-type-specific immune metabolic programs. Statistics Biostatistics RNA--Data processing Gene expression--Data processing Latent variables
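Illustration for record 5: the abstract says that in the DINA model latent factors interact via an "and" relationship. The sketch below computes response probabilities under the standard DINA formulation from the cognitive-diagnosis literature, in which a feature is "on" with probability 1 - slip when an observation has every latent attribute the feature requires, and with probability guess otherwise. The thesis may parametrize the model differently; the function name, argument names, and toy values are assumptions, and no inference (variational or otherwise) is performed here.

```python
def dina_probabilities(alpha, q_matrix, slip, guess):
    """Return P(X[i][j] = 1) under the standard DINA model.

    alpha    : list of binary attribute vectors, one per observation
    q_matrix : list of binary requirement vectors, one per feature
    slip     : per-feature probability of a 0 despite all requirements met
    guess    : per-feature probability of a 1 with a requirement missing
    """
    probabilities = []
    for attributes in alpha:
        row = []
        for required, s, g in zip(q_matrix, slip, guess):
            # The "and" gate: every required attribute must be present.
            gate_open = all(a >= q for a, q in zip(attributes, required))
            row.append(1.0 - s if gate_open else g)
        probabilities.append(row)
    return probabilities


# Hypothetical toy setting: two observations, two features, two attributes.
toy_alpha = [[1, 1], [1, 0]]
toy_q = [[1, 1], [1, 0]]
toy_probs = dina_probabilities(toy_alpha, toy_q, slip=[0.1, 0.2], guess=[0.05, 0.3])
```

The second observation lacks the second attribute, so the first feature (which requires both) falls back to its guess probability, while the second feature (which requires only the first attribute) is unaffected.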
6	In silico prediction of cis-regulatory elements of genes involved in hypoxic-ischaemic insult Fu, Wai, 符慧 January 2006 Paediatrics and Adolescent Medicine / Master / Master of Philosophy Cerebral anoxia. Cerebral ischemia. Brain - Wounds and injuries. Genetic regulation - Data processing. Gene expression - Data processing. Transcription factors.
7	Confounding effects in gene expression and their impact on downstream analysis Lachmann, Alexander January 2016 The reconstruction of gene regulatory networks is one of the milestones of computational systems biology. We introduce a new implementation of ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) to reverse engineer transcriptional regulatory networks, with improved mutual information estimators and a significant improvement in performance. In the context of data-driven network inference, we identify two major confounding biases and introduce solutions to remove some of them. First, we identify prevalent spatial biases in gene expression studies derived from plate-based designs. We investigate the gene expression profiles of a million samples from the LINCS dataset and find that the vast majority (96%) of the tested plates are affected by significant spatial bias. We show that our proposed method to correct these biases significantly improves the similarity between biological replicates assayed on different plates. Lastly, we discuss the effect of copy number variation (CNV) on gene expression and its confounding effect on the correlation landscape of genes in the context of cancer samples. We propose a method that removes the variance in gene expression explained by CNV and show that transcription factor (TF) target predictions can be significantly improved. Genetic regulation Genetic regulation--Data processing Gene expression Gene expression--Data processing Bioinformatics Gene regulatory networks Genetics
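Illustration for record 7: the abstract mentions removing the variance in gene expression explained by CNV but does not say how. One simple way to do this, assumed here purely for illustration, is to fit an ordinary least squares line of expression on copy number for each gene and keep the residuals. The function name and the toy values are hypothetical, and the thesis's actual method may differ.

```python
def remove_cnv_effect(expression, copy_number):
    """Return the residuals of one gene's expression after regressing out CNV.

    Fits expression = intercept + slope * copy_number by ordinary least
    squares across samples and returns expression minus the fitted values.
    """
    n = len(expression)
    mean_x = sum(copy_number) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in copy_number)
    if sxx == 0:
        # Copy number is constant, so it explains no variance: only centre.
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(copy_number, expression))
    slope = sxy / sxx
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(copy_number, expression)]


# Hypothetical toy data for a single gene across five samples.
toy_copy_number = [1.0, 2.0, 2.0, 3.0, 4.0]
toy_expression = [2.1, 3.9, 4.2, 6.1, 7.8]
toy_residuals = remove_cnv_effect(toy_expression, toy_copy_number)
```

By construction the residuals are uncorrelated with copy number, so correlations computed between the residuals of different genes no longer reflect shared copy number changes.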
8	Algorithms for the analysis of gene expression data Venet, David 07 December 2004 High-throughput gene expression data have been generated on a large scale by biologists. The thesis describes a set of tools for the analysis of such data. It is especially geared towards microarray data. / Doctorat en sciences appliquées / info:eu-repo/semantics/nonPublished Sciences de l'ingénieur Biologie Gene expression -- Data processing Genetic algorithms Bioinformatics Expression génique -- Informatique Algorithmes génétiques Bio-informatique gene expression data bioinformatics
9	Identification and assessment of gene signatures in human breast cancer / Identification et évaluation de signatures géniques dans le cancer du sein humain Haibe-Kains, Benjamin 02 April 2009 This thesis addresses the use of machine learning techniques to develop clinical diagnostic tools for breast cancer using molecular data. These tools are designed to assist physicians in their evaluation of the clinical outcome of breast cancer (referred to as prognosis). / The traditional approach to evaluating breast cancer prognosis is based on the assessment of clinico-pathologic factors known to be associated with breast cancer survival. These factors are used to make recommendations about whether further treatment is required after the removal of a tumor by surgery. Treatment such as chemotherapy depends on the estimation of patients' risk of relapse. Although current approaches do provide good prognostic assessment of breast cancer survival, clinicians are aware that there is still room for improvement in the accuracy of their prognostic estimations. / In the late nineties, new high-throughput technologies such as gene expression profiling through microarray technology emerged. Microarrays allowed scientists to analyze for the first time the expression of the whole human genome ("transcriptome"). It was hoped that the analysis of genome-wide molecular data would bring new insights into the critical, underlying biological mechanisms involved in breast cancer progression, as well as significantly improve prognostic prediction. However, the analysis of microarray data is a difficult task due to their intrinsic characteristics: (i) thousands of gene expressions are measured for only a few samples; (ii) the measurements are usually "noisy"; and (iii) they are highly correlated due to gene co-expressions. 
Since traditional statistical methods were not adapted to these settings, machine learning methods were adopted as good candidates to overcome these difficulties. However, applying machine learning methods to microarray analysis involves numerous steps, and the results are prone to overfitting. Several authors have highlighted the major pitfalls of this process in the early publications, shedding new light on the promising but overoptimistic results. / Since 2002, large comparative studies have been conducted in order to identify the key characteristics of successful methods for class discovery and classification. Yet methods able to identify robust molecular signatures that can predict breast cancer prognosis have been lacking. To fill this important gap, this thesis presents an original methodology dealing specifically with the analysis of microarray and survival data in order to build prognostic models and provide an honest estimation of their performance. The approach used for signature extraction consists of a set of original methods for feature transformation, feature selection and prediction model building. A novel statistical framework is presented for performance assessment and comparison of risk prediction models. / In terms of applications, we show that these methods, used in combination with a priori biological knowledge of breast cancer and numerous public microarray datasets, have resulted in some important discoveries. In particular, the research presented here develops (i) a robust model for the identification of breast molecular subtypes and (ii) a new prognostic model that takes into account the molecular heterogeneity of breast cancers observed previously, in order to improve traditional clinical guidelines and state-of-the-art gene signatures. / This thesis concerns the development of machine learning techniques for building new clinical tools based on molecular data. 
We focused our research on breast cancer, one of the most frequently diagnosed cancers. These tools are developed to help physicians assess the clinical outcome of cancer patients (i.e., the prognosis). / Traditional approaches to assessing a cancer patient's prognosis are based on clinico-pathologic criteria known to be predictive of survival. This assessment allows physicians to decide whether treatment is necessary after removal of the tumor. Although traditional assessment tools are of considerable help, clinicians are aware of the need to improve them. / In the 1990s, new high-throughput technologies, such as gene expression profiling with DNA microarrays, were developed to allow scientists to analyze the expression of the entire genome of cancer cells. This new type of molecular data carries the hope of improving traditional prognostic tools and deepening our knowledge of the genesis of breast cancer. However, these data are extremely difficult to analyze because of (i) their high dimensionality (several tens of thousands of genes for only a few hundred experiments); (ii) the substantial noise in the measurements; and (iii) the collinearity between measurements due to gene co-expression. / Since 2002, large-scale comparative studies have identified effective methods for cluster analysis and classification of microarray data, while neglecting the survival analysis that is relevant to prognosis in breast cancer. To fill this gap, this thesis presents an original methodology suited to the analysis of microarray and survival data in order to build accurate and robust prognostic models. 
/ In terms of applications, we show that this methodology, used in combination with a priori biological knowledge and numerous public datasets, has led to important discoveries. In particular, the research presented in this thesis has resulted in the development of a robust model for identifying the molecular subtypes of breast cancer and of several gene signatures that significantly improve on the state of the art in prognosis. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished Informatique générale Sciences exactes et naturelles Breast -- Cancer -- Data processing DNA microarrays Gene expression -- Data processing Sein -- Cancer -- Informatique Puces à ADN Expression génique -- Informatique apprentissage automatique machine learning
