A probabilistic framework and algorithms for modeling and analyzing multi-instance data
Behmardi, Behrouz (28 November 2012)
Multi-instance data, in which each object (e.g., a document) is a collection of instances
(e.g., words), are widespread in machine learning, signal processing, computer vision,
bioinformatics, music, and the social sciences. Existing probabilistic models, e.g., latent
Dirichlet allocation (LDA), probabilistic latent semantic indexing (pLSI), and discrete
component analysis (DCA), have been developed for modeling and analyzing multi-instance
data. Such models introduce a generative process for multi-instance data which
includes a low-dimensional latent structure. While such models offer great freedom
in capturing the natural structure in the data, their inference may present challenges.
For example, sensitivity to the choice of hyper-parameters in such models requires
careful inference (e.g., through cross-validation), which results in large computational
complexity. Inference for fully Bayesian models, which contain no hyper-parameters,
often involves slowly converging sampling methods. In this work, we develop approaches
for addressing such challenges and further enhancing the utility of such models.
This dissertation demonstrates a unified convex framework for probabilistic modeling
of multi-instance data. The three main aspects of the proposed framework are as follows.
First, joint regularization is incorporated into multiple density estimation to simultaneously
learn the structure of the distribution space and infer each distribution. Second,
a novel confidence-constraints framework is used to facilitate a tuning-free approach to
controlling the amount of regularization required for joint multiple density estimation,
with theoretical guarantees on correct structure recovery. Third, we formulate the problem
in a convex framework and propose efficient optimization algorithms to solve it.
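Convex low-rank formulations of this kind are typically driven by a nuclear-norm penalty, whose proximal step is singular value thresholding. The following numpy toy is a minimal sketch of that step, not the dissertation's code; the matrix sizes and threshold are made up for illustration:

```python
import numpy as np

def svd_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

# Toy example: a rank-1 matrix plus small noise; thresholding zeroes the
# noise-level singular values and recovers the low-rank structure.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(50, 1)), rng.normal(size=(1, 20))
M = u @ v + 0.01 * rng.normal(size=(50, 20))
M_low = svd_threshold(M, tau=1.0)
print(np.linalg.matrix_rank(M_low, tol=1e-6))  # → 1
```

Iterating this shrinkage step inside a proximal-gradient loop is the standard way such nuclear-norm-regularized estimation problems are solved.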
This work addresses the unique challenges associated with both discrete and continuous
domains. In the discrete domain, we propose confidence-constrained rank minimization
(CRM) to recover the exact number of topics in topic models, with theoretical
guarantees on the recovery probability and the mean squared error of the estimate. We provide
a computationally efficient optimization algorithm for the problem to further the
applicability of the proposed framework to large real-world datasets. In the continuous
domain, we propose to use the maximum entropy (MaxEnt) framework for multi-instance
datasets. In this approach, bags of instances are represented as distributions using the
principle of MaxEnt. We learn basis functions which span the space of distributions for
jointly regularized density estimation. The basis functions are analogous to topics in a
topic model.
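Representing a bag by a MaxEnt distribution amounts to fitting an exponential-family model whose feature expectations match the bag's empirical moments. A minimal numpy sketch under that reading (the discrete domain, features, and target moments below are fabricated for the example, not from the dissertation):

```python
import numpy as np

def maxent_fit(feats, emp_mean, lr=0.5, steps=500):
    """Fit lambda so that p(x) ∝ exp(lambda · phi(x)) over a discrete domain
    matches the bag's empirical feature means (moment matching)."""
    lam = np.zeros(feats.shape[1])
    for _ in range(steps):
        logits = feats @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = emp_mean - feats.T @ p   # gradient of the concave log-likelihood
        lam += lr * grad
    return lam, p

# Toy discrete domain of 5 points with 2 features each (illustrative).
feats = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
emp_mean = np.array([0.8, 0.5])       # empirical feature means of one "bag"
lam, p = maxent_fit(feats, emp_mean)
print(np.round(feats.T @ p, 2))       # ≈ the target moments [0.8, 0.5]
```

The fitted distribution is the bag's representation; the learned basis functions in the dissertation play the role that the hand-picked features play in this sketch.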
We validate the efficiency of the proposed framework in the discrete and continuous
domains through an extensive set of experiments on synthetic datasets as well as on real-world
image and text datasets, and compare the results with state-of-the-art algorithms. / Graduation date: 2013
Learning General Features From Images and Audio With Stacked Denoising Autoencoders
Nifong, Nathaniel H. (23 January 2014)
One of the most impressive qualities of the brain is its neuroplasticity. The neocortex has roughly the same structure throughout its whole surface, yet it is involved in a variety of different tasks from vision to motor control, and regions which once performed one task can learn to perform another. Machine learning algorithms which aim to be plausible models of the neocortex should also display this plasticity. One such candidate is the stacked denoising autoencoder (SDA). SDAs have shown promising results in the field of machine perception, where they have been used to learn abstract features from unlabeled data. In this thesis I develop a flexible distributed implementation of an SDA and train it on images and audio spectrograms to experimentally determine properties comparable to neuroplasticity. Specifically, I compare the visual-auditory generalization of a multi-level denoising autoencoder trained with greedy layer-wise pre-training (GLWPT) to that of one trained without it. I test the hypothesis that multi-modal networks will perform better than uni-modal networks due to the greater generality of the features they may learn. Furthermore, I also test the hypothesis that the improvement gained from multi-modal training is greater when GLWPT is applied than when it is not. My findings indicate that neither hypothesis was confirmed, but that GLWPT still helps multi-modal networks adapt to their second sensory modality.
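Greedy layer-wise pre-training as described here can be sketched compactly: each denoising autoencoder layer corrupts its input, learns to reconstruct the clean version, and its hidden codes become the next layer's input. A minimal numpy sketch with tied weights (all sizes, noise levels, and hyper-parameters are illustrative, not those of the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden, noise=0.3, lr=0.1, epochs=200):
    """One denoising autoencoder layer: mask-corrupt the input, reconstruct it.
    Decoder weights are tied to the encoder's (W and W.T) for brevity."""
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        mask = rng.random(X.shape) > noise      # masking corruption
        Xc = X * mask
        H = sigmoid(Xc @ W + b)                 # encode corrupted input
        R = sigmoid(H @ W.T + c)                # reconstruct the clean input
        dR = (R - X) * R * (1 - R)              # MSE grad through sigmoid
        dH = (dR @ W) * H * (1 - H)
        W -= lr * (Xc.T @ dH + dR.T @ H) / len(X)  # encoder + decoder grads
        c -= lr * dR.mean(0)
        b -= lr * dH.mean(0)
    return W, b

# Greedy layer-wise pre-training: train layer 1, encode, train layer 2 on codes.
X = rng.random((100, 20))
W1, b1 = train_dae(X, 10)
H1 = sigmoid(X @ W1 + b1)
W2, b2 = train_dae(H1, 5)
print(H1.shape, sigmoid(H1 @ W2 + b2).shape)
```

After pre-training, the stacked encoders are typically fine-tuned end-to-end on the downstream task.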
A Survey of Systems for Predicting Stock Market Movements, Combining Market Indicators and Machine Learning Classifiers
Caley, Jeffrey Allan (14 March 2013)
In this work, we propose and investigate a series of methods to predict stock market movements. These methods use stock market technical and macroeconomic indicators as inputs to different machine learning classifiers. The objective is to survey existing domain knowledge and combine multiple techniques into one method to predict daily market movements for stocks. Approaches using nearest-neighbor classification, support vector machine classification, K-means classification, principal component analysis, and genetic algorithms for feature reduction and for redefining the classification rule were explored. Ten stocks (nine companies and one index) were used to evaluate each iteration of the trading method. The classification rate, modified Sharpe ratio, and profit gained over the test period are used to evaluate each strategy. The findings showed that nearest-neighbor classification with genetic-algorithm feature reduction produced the best results, achieving higher profits than buy-and-hold for a majority of the companies.
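The winning combination, nearest-neighbor classification with genetic-algorithm feature reduction, can be sketched as evolving feature bitmasks scored by validation accuracy. A toy numpy version (an elitist, mutation-only GA on synthetic data; the data, population size, and rates are all made up, and the real thesis also used crossover and market indicators):

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_accuracy(mask, Xtr, ytr, Xva, yva, k=3):
    """Accuracy of k-nearest-neighbor voting using only the masked features."""
    if mask.sum() == 0:
        return 0.0
    A, B = Xtr[:, mask.astype(bool)], Xva[:, mask.astype(bool)]
    d = ((B[:, None, :] - A[None, :, :]) ** 2).sum(-1)   # pairwise distances
    votes = ytr[np.argsort(d, axis=1)[:, :k]]
    pred = (votes.mean(1) > 0.5).astype(int)             # majority vote
    return (pred == yva).mean()

# Toy data: only the first two of ten features carry signal.
X = rng.random((200, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]

# Minimal genetic algorithm over feature bitmasks.
pop = rng.integers(0, 2, (12, 10))
for _ in range(15):
    fit = np.array([knn_accuracy(m, Xtr, ytr, Xva, yva) for m in pop])
    elite = pop[fit.argmax()].copy()
    parents = pop[rng.choice(12, 12, p=fit / fit.sum())]   # fitness-proportionate
    pop = np.where(rng.random(parents.shape) < 0.1, 1 - parents, parents)
    pop[0] = elite                                          # keep the best mask
best = pop[0]
print(knn_accuracy(best, Xtr, ytr, Xva, yva))
```

Elitism makes the best fitness non-decreasing across generations, so the final mask is at least as good on the validation set as the best random initial mask.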
The role of model implementation in neuroscientific applications of machine learning
Abe, Taiga (January 2024)
In modern neuroscience, large-scale machine learning models are becoming increasingly critical components of data analysis. Despite the accelerating adoption of these large-scale machine learning tools, there are fundamental challenges to their use in scientific applications that remain largely unaddressed. In this thesis, I focus on one such challenge: variability in the predictions of large-scale machine learning models arising from seemingly trivial differences in their implementation.
Existing research has shown that the performance of large-scale machine learning models (more so than that of traditional models like linear regression) is meaningfully entangled with design choices such as the hardware components, operating system, software dependencies, and random seed on which the model depends. Within the bounds of current practice, there are few ways of controlling this kind of implementation variability across the broad community of neuroscience researchers (making data analysis less reproducible), and little understanding of how data analyses might be designed to mitigate these issues (making data analysis unreliable). This dissertation presents two broad research directions that address these shortcomings.
First, I will describe a novel, cloud-based platform for sharing data analysis tools reproducibly and at scale. This platform, called NeuroCAAS, enables developers of novel data analyses to precisely specify an implementation of their entire data analysis, which can then be used automatically by any other user on custom-built cloud resources. I show that this approach efficiently supports a wide variety of existing data analysis tools, as well as novel tools which would not be feasible to build and share outside of a platform like NeuroCAAS.
Second, I conduct two large-scale studies on the behavior of deep ensembles. Deep ensembles are a class of machine learning models that use implementation variability to improve the quality of model predictions; in particular, by aggregating the predictions of deep networks over stochastic initialization and training. Deep ensembles simultaneously provide a way to control the impact of implementation variability (by aggregating predictions across random seeds) and to understand what kind of predictive diversity is generated by this particular form of implementation variability. I present a number of surprising results that contradict widely held intuitions about the performance of deep ensembles as well as the mechanisms behind their success, and show that in many aspects, the behavior of deep ensembles is similar to that of an appropriately chosen single neural network.
As a whole, this dissertation presents novel methods and insights focused on the role of implementation variability in large-scale machine learning models, and more generally on the challenges of working with such large models in neuroscience data analysis. I conclude by discussing other ongoing efforts to improve the reproducibility and accessibility of large-scale machine learning in neuroscience, as well as long-term goals to speed the adoption and reliability of such methods in a scientific context.
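The aggregation step that defines a deep ensemble is simple: average the members' softmax outputs across seeds, then take the argmax. A minimal sketch (the per-seed probability arrays below are fabricated to illustrate the mechanics, not results from the dissertation):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Deep-ensemble aggregation: average member softmax outputs, then argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

# Illustrative: three "seeds" give 2-class probabilities for 2 examples.
# The seeds disagree on the second example; averaging resolves it.
p_seed0 = np.array([[0.9, 0.1], [0.4, 0.6]])
p_seed1 = np.array([[0.8, 0.2], [0.7, 0.3]])
p_seed2 = np.array([[0.7, 0.3], [0.6, 0.4]])
print(ensemble_predict([p_seed0, p_seed1, p_seed2]))  # → [0 0]
```

Averaging across random seeds is exactly the control knob for implementation variability described above: each member embodies one draw of initialization and training noise.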
Computational modeling for identification of low-frequency single nucleotide variants
Hao, Yangyang (16 November 2015)
Indiana University-Purdue University Indianapolis (IUPUI)
Reliable detection of low-frequency single nucleotide variants (SNVs) carries great significance in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants due to the small percentage of tumor DNA in blood. Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signal of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of error, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To face this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar uses a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize systematic sequencing artifacts and to derive a position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm uses sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency.
We further explored the influence of statistical modeling on position-specific error modeling and showed the zero-inflated negative binomial to be the best-performing statistical distribution. When we replicated the analyses on an Illumina MiSeq benchmark dataset, our method seamlessly adapted to a technology with different biochemistry. RareVar enables sensitive detection of low-frequency SNVs across sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, early cancer detection, prognostic assessment, metastatic monitoring, and identification of relapses or acquired resistance.
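A zero-inflated negative binomial error model and the position-specific detection limit it induces can be sketched as follows: the limit is the smallest alt-allele count whose tail probability under the error distribution falls below a chosen false-positive level. A stdlib-only sketch (all parameter values are made up for illustration and are not RareVar's):

```python
import math

def zinb_pmf(k, pi, r, p):
    """Zero-inflated negative binomial: extra point mass pi at zero,
    NB(r, p) (failures before r successes) for the rest."""
    nb = math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                  + r * math.log(p) + k * math.log(1 - p))
    return (pi if k == 0 else 0.0) + (1 - pi) * nb

def detection_limit(pi, r, p, alpha=1e-3, kmax=1000):
    """Smallest alt-allele count whose tail probability under the
    position-specific error model drops below alpha."""
    tail = 1.0
    for k in range(kmax):
        tail -= zinb_pmf(k, pi, r, p)
        if tail < alpha:
            return k + 1
    return kmax

# Illustrative error model at one genomic position:
limit = detection_limit(pi=0.3, r=2.0, p=0.9)
print(limit)  # counts at or above this are unlikely to be sequencing error
```

Fitting (pi, r, p) per genomic context from the benchmark data, then thresholding observed counts against the resulting limit, is the general shape of such position-specific error modeling.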