• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • No language data
  • Tagged with
  • 4
  • 4
  • 4
  • 3
  • 3
  • 3
  • 2
  • 2
  • 2
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Spatiotemporal Gene Networks from ISH Images

Puniyani, Kriti 01 September 2013 (has links)
As large-scale techniques for studying and measuring gene expressions have been developed, automatically inferring gene interaction networks from expression data has emerged as a popular technique to advance our understanding of cellular systems. Accurate prediction of gene interactions, especially in multicellular organisms such as Drosophila or humans, requires temporal and spatial analysis of gene expressions, which is not easily obtainable from microarray data. New image based techniques using in-sit hybridization(ISH) have recently been developed to allowlarge-scale spatial-temporal profiling of whole body mRNA expression. However, analysis of such data for discovering new gene interactions still remains an open challenge. This thesis studies the question of predicting gene interaction networks from ISH data in three parts. First, we present SPEX2, a computer vision pipeline to extract informative features from ISH data. Next, we present an algorithm, GINI, for learning spatial gene interaction networks from embryonic ISH images at a single time step. GINI combines multi-instance kernels with recent work in learning sparse undirected graphical models to predict interactions between genes. Finally, we propose NP-MuScL (nonparanormal multi source learning) to estimate a gene interaction network that is consistent with multiple sources of data, having the same underlying relationships between the nodes. NP-MuScL casts the network estimation problem as estimating the structure of a sparse undirected graphical model. We use the semiparametric Gaussian copula to model the distribution of the different data sources, with the different copulas sharing the same covariance matrix, and show how to estimate such a model in the high dimensional scenario. We apply our algorithms on more than 100,000 Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project. Each of the 6 time steps in Drosophila embryonic development is treated as a separate data source. With spatial gene interactions predicted via GINI, and temporal predictions combined via NP-MuScL, we are finally able to predict spatiotemporal gene networks from these images.
2

Distributed Bootstrap for Massive Data

Yang Yu (12466911) 27 April 2022 (has links)
<p>Modern massive data, with enormous sample size and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework.</p> <p>  </p> <p>First, we propose two generic distributed bootstrap methods, \texttt{k-grad} and \texttt{n+k-1-grad}, which apply multiplier bootstrap at the master node on the gradients communicated across nodes. Based on them, we develop a communication-efficient method of producing an $\ell_\infty$-norm confidence region using distributed data with dimensionality not exceeding the local sample size. Our theory establishes the communication efficiency by providing a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency and showing that $\tau_{\min}$ only increases logarithmically with the number of workers and the dimensionality. Our simulation studies validate our theory.</p> <p>  </p> <p>Then, we extend \texttt{k-grad} and \texttt{n+k-1-grad} to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.</p>
3

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems

Kolar, Mladen 01 July 2013 (has links)
Extracting knowledge and providing insights into complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high dimensional functional data in search of better understanding of cognition mechanisms, in the exploration of large-scale gene regulatory networks in hope of developing drugs for lethal diseases, and in prediction of volatility in stock market in hope of beating the market. Statistical analysis in these high-dimensional data sets is possible only if an estimation procedure exploits hidden structures underlying data. This thesis develops flexible estimation procedures with provable theoretical guarantees for uncovering unknown hidden structures underlying data generating process. Of particular interest are procedures that can be used on high dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Learning in high-dimensions is difficult due to the curse of dimensionality, however, the special problem structure makes inference possible. Due to its importance for scientific discovery, we put emphasis on consistent structure recovery throughout the thesis. Particular focus is given to two important problems, semi-parametric estimation of networks and feature selection in multi-task learning.
4

Graph Matching Based on a Few Seeds: Theoretical Algorithms and Graph Neural Network Approaches

Liren Yu (17329693) 03 November 2023 (has links)
<p dir="ltr">Since graphs are natural representations for encoding relational data, the problem of graph matching is an emerging task and has attracted increasing attention, which could potentially impact various domains such as social network de-anonymization and computer vision. Our main interest is designing polynomial-time algorithms for seeded graph matching problems where a subset of pre-matched vertex-pairs (seeds) is revealed. </p><p dir="ltr">However, the existing work does not fully investigate the pivotal role of seeds and falls short of making the most use of the seeds. Notably, the majority of existing hand-crafted algorithms only focus on using ``witnesses'' in the 1-hop neighborhood. Although some advanced algorithms are proposed to use multi-hop witnesses, their theoretical analysis applies only to \ER random graphs and requires seeds to be all correct, which often do not hold in real applications. Furthermore, a parallel line of research, Graph Neural Network (GNN) approaches, typically employs a semi-supervised approach, which requires a large number of seeds and lacks the capacity to distill knowledge transferable to unseen graphs.</p><p dir="ltr">In my dissertation, I have taken two approaches to address these limitations. In the first approach, we study to design hand-crafted algorithms that can properly use multi-hop witnesses to match graphs. We first study graph matching using multi-hop neighborhoods when partially-correct seeds are provided. Specifically, consider two correlated graphs whose edges are sampled independently from a parent \ER graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $D$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results. We then study the role of multi-hop neighborhoods in matching power-law graphs. Assume that two edge-correlated graphs are independently edge-sampled from a common parent graph with a power-law degree distribution. A set of correctly matched vertex-pairs is chosen at random and revealed as initial seeds. Our goal is to use the seeds to recover the remaining latent vertex correspondence between the two graphs. Departing from the existing approaches that focus on the use of high-degree seeds in $1$-hop neighborhoods, we develop an efficient algorithm that exploits the low-degree seeds in suitably-defined $D$-hop neighborhoods. Our result achieves an exponential reduction in the seed size requirement compared to the best previously known results.</p><p dir="ltr">In the second approach, we study GNNs for seeded graph matching. We propose a new supervised approach that can learn from a training set how to match unseen graphs with only a few seeds. Our SeedGNN architecture incorporates several novel designs, inspired by our theoretical studies of seeded graph matching: 1) it can learn to compute and use witness-like information from different hops, in a way that can be generalized to graphs of different sizes; 2) it can use easily-matched node-pairs as new seeds to improve the matching in subsequent layers. We evaluate SeedGNN on synthetic and real-world graphs and demonstrate significant performance improvements over both non-learning and learning algorithms in the existing literature. Furthermore, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs of different sizes and categories.</p>

Page generated in 0.1686 seconds