1 |
Genomic Data Augmentation with Variational Autoencoder
Thyrum, Emily 12 1900
To treat cancer effectively, medical practitioners must predict pathological stages accurately, and machine learning methods can be employed to make such predictions. However, biomedical datasets, including genomic datasets, often contain disproportionately more samples from people of European ancestry than from people of other ethnic or racial groups, which can cause machine learning methods to perform better on European samples than on samples from under-represented groups. Data augmentation is a potential solution: artificially increasing the number of samples from under-represented racial groups can in turn improve pathological stage predictions for future patients from those groups. Genomic data augmentation has been explored previously, for example using a Generative Adversarial Network, but to the best of our knowledge, the use of the variational autoencoder (VAE) for genomic data augmentation remains largely unexplored. Here we use a geometry-based variational autoencoder that models the latent space as a Riemannian manifold, so that samples can be generated without a prior distribution, and show that the VAE can indeed be used to reliably augment genomic data. Using TCGA prostate cancer genotype data, we show that our VAE-generated data can improve pathological stage predictions on a test set of European samples. Because only our European samples were labeled with pathological stage, we could not validate the generated African samples in this way, but we still attempt to show that such samples may be realistic. / Computer and Information Science
|
2 |
Data Augmentation with Seq2Seq Models
Granstedt, Jason Louis 06 July 2017
Paraphrase sparsity is an issue that complicates the training of question answering systems: syntactically diverse but semantically equivalent sentences can have significant disparities in predicted output probabilities. We propose a method for generating an augmented paraphrase corpus for a visual question answering (VQA) system to make it more robust to paraphrases. This corpus is generated by concatenating two sequence-to-sequence models. To generate diverse paraphrases, we sample the neural network using diverse beam search. We evaluate the results on the standard VQA validation set.
Our approach results in a significantly expanded training dataset and vocabulary size, but has slightly worse performance when tested on the validation split. Although not as fruitful as we had hoped, our work highlights additional avenues for investigation: selecting better model parameters and developing a more sophisticated paraphrase filtering algorithm. The primary contributions of this work are the demonstration that decent paraphrases can be generated from sequence-to-sequence models and a pipeline for building an augmented dataset. / Master of Science / For a machine, processing language is hard. The possible combinations of words in a language far exceed a computer’s ability to memorize them directly. Thus, generalizing language into a form that a computer can reason with is necessary for a machine to understand raw human input. Various advancements in machine learning have been particularly impressive in this regard. However, they require a corpus, or a body of information, in order to learn. Collecting this corpus is typically expensive and time consuming, and the corpus does not necessarily contain all of the information that a system would need: the machine would not know how to handle a word that it has never seen before, for example.
This thesis examines the possibility of using a large, general corpus to expand the vocabulary size of a specialized corpus in order to improve performance on a specific task. To do so, we use Seq2Seq models, a recent development in neural networks that has seen great success in translation tasks. The Seq2Seq model is trained on the general corpus to learn the language and then applied to the specialized corpus to generate paraphrases in a format similar to that of the specialized corpus. With this approach, we significantly expanded the volume and vocabulary size of the specialized corpus, demonstrated that decent paraphrases can be generated from Seq2Seq models, and developed a pipeline for augmenting other specialized datasets.
|
3 |
Rise and Pitfalls of Synthetic Data for Abusive Language Detection
Casula, Camilla 28 October 2024
Synthetic data has been proposed as a way to mitigate a number of issues with existing models and datasets for online abusive language detection, such as negative psychological impact on annotators, privacy issues, dataset obsolescence, and representation bias. However, previous work on the topic has mostly focused on the downstream task performance of models, paying little attention to other aspects. In this thesis, we carry out a series of experiments and analyses on synthetic data for abusive language detection that go beyond performance, with the goal of assessing both the potential and the pitfalls of synthetic data from a qualitative point of view. More specifically, we study synthetic data for abusive language detection in English, focusing on four aspects: robustness, examining the ability of models trained on synthetic data to generalize to out-of-distribution scenarios; fairness, exploring the representation of identity groups; privacy, exploring the use of entirely synthetic datasets to avoid sharing user-generated data; and quality, through a manual annotation and analysis of how realistic and representative of real data synthetic data can be with regard to abusive language.
|
4 |
Outcome-Driven Clustering of Microarray Data
Hsu, Jessie 17 September 2012
The rapid technological development of high-throughput genomics has given rise to complex high-dimensional microarray datasets. One strategy for reducing the dimensionality of microarray experiments is to carry out a cluster analysis to find groups of genes with similar expression patterns. Though cluster analysis has been studied extensively, the clinical context in which the analysis is performed is usually considered separately if at all. However, allowing clinical outcomes to inform the clustering of microarray data has the potential to identify gene clusters that are more useful for describing the clinical course of disease. The aim of this dissertation is to utilize outcome information to drive the clustering of gene expression data. In Chapter 1, we propose a joint clustering model that assumes a relationship between gene clusters and a continuous patient outcome. Gene expression is modeled using cluster specific random effects such that genes in the same cluster are correlated. A linear combination of these random effects is then used to describe the continuous clinical outcome. We implement a Markov chain Monte Carlo algorithm to iteratively sample the unknown parameters and determine the cluster pattern. Chapter 2 extends this model to binary and failure time outcomes. Our strategy is to augment the data with a latent continuous representation of the outcome and specify that the risk of the event depends on the latent variable. Once the latent variable is sampled, we relate it to gene expression via cluster specific random effects and apply the methods developed in Chapter 1. The setting of clustering longitudinal microarrays using binary and survival outcomes is considered in Chapter 3. We propose a model that incorporates a random intercept and slope to describe the gene expression time trajectory. 
As before, a continuous latent variable that is linearly related to the random effects is introduced into the model and a Markov chain Monte Carlo algorithm is used for sampling. These methods are applied to microarray data from trauma patients in the Inflammation and Host Response to Injury research project. The resulting partitions are visualized using heat maps that depict the frequency with which genes cluster together.
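The latent-variable augmentation described for Chapter 2 can be illustrated for the binary case with a minimal sketch. It assumes a probit-style link, which the abstract does not state: each binary outcome is treated as the sign of a latent variable drawn from a unit-variance normal centred on a linear predictor. The function name and the example predictor values are hypothetical and are not taken from the dissertation.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def sample_latent_outcome(y, mean, rng):
    """Draw z ~ N(mean, 1) truncated to z > 0 if y == 1, else to z <= 0.

    Assumption: probit-style augmentation (the binary outcome is the sign
    of the latent variable); the source abstract does not name the link.
    """
    # P(z <= 0) when z ~ N(mean, 1)
    p_nonpositive = _STD_NORMAL.cdf(-mean)
    if y == 1:
        u = rng.uniform(p_nonpositive, 1.0)
    else:
        u = rng.uniform(0.0, p_nonpositive)
    # Keep u strictly inside (0, 1) so the inverse cdf is defined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + _STD_NORMAL.inv_cdf(u)


rng = random.Random(0)
outcomes = [1, 0, 1, 1, 0]
linear_predictors = [0.3, -0.8, 1.2, -0.1, 0.4]  # hypothetical values
latents = [
    sample_latent_outcome(y, m, rng)
    for y, m in zip(outcomes, linear_predictors)
]
```

Within an MCMC sampler, a draw like this would alternate with updates of the parameters that determine the linear predictor; once the latent values are sampled they can be handled like a continuous outcome.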
|
5 |
Bayesian Joint Modeling of Binomial and Rank Response Data
Barney, Bradley 2011 August 1900
We present techniques for joint modeling of binomial and rank response data using the Bayesian paradigm for inference. The motivating application consists of results from a series of assessments on several primate species. Among 20 assessments representing 6 paradigms, 6 assessments are considered to produce a rank response and the remaining 14 are considered to have a binomial response. In order to model each of the 20 assessments simultaneously, we use the popular technique of data augmentation so that the observed responses are based on latent variables. The modeling uses Bayesian techniques for modeling the latent variables using random effects models. Competing models are specified in a consistent fashion which easily allows comparisons across assessments and across models. Non-local priors are readily admitted to enable more effective testing of random effects should Bayes factors be used for model comparison. The model is also extended to allow assessment-specific conditional error variances for the latent variables. Due to potential difficulties in calculating Bayes factors, discrepancy measures based on pivotal quantities are adapted to test for the presence of random effects and for the need to allow assessment-specific conditional error variances. In order to facilitate implementation, we describe in detail the joint prior distribution and a Markov chain Monte Carlo (MCMC) algorithm for posterior sampling. Results from the primate intelligence data are presented to illustrate the methodology. The results indicate substantial paradigm-specific differences between species. These differences are supported by the discrepancy measures as well as model posterior summaries. Furthermore, the results suggest that meaningful and parsimonious inferences can be made using the proposed techniques and that the discrepancy measures can effectively differentiate between necessary and unnecessary random effects. 
The contributions should be particularly useful when binomial and rank data are to be jointly analyzed in a parsimonious fashion.
|
6 |
Heavy-Tailed Innovations in the R Package stochvol
Kastner, Gregor January 2015
We document how sampling from a conditional Student's t distribution is implemented in stochvol. Moreover, a simple example using EUR/CHF exchange rates illustrates how to use the augmented sampler. We conclude with results and implications. (author's abstract)
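As a rough illustration of why Student's t errors fit an augmented sampler, the sketch below uses the textbook scale-mixture representation: a t variate with ν degrees of freedom equals a standard normal multiplied by the square root of an inverse-gamma(ν/2, ν/2) latent scale. That stochvol uses exactly this construction is an assumption here; the abstract does not spell it out. The code is written in Python for illustration and does not reproduce the R package's interface.

```python
import math
import random


def sample_student_t(nu, rng):
    """Draw one Student's t variate with nu degrees of freedom.

    Scale-mixture representation: tau ~ InvGamma(nu/2, nu/2) and
    t = sqrt(tau) * eps with eps ~ N(0, 1).
    """
    # random.gammavariate takes (shape, scale); shape nu/2 with scale 2/nu
    # is a Gamma(nu/2, rate nu/2) draw, whose reciprocal is inverse-gamma.
    precision = rng.gammavariate(nu / 2.0, 2.0 / nu)
    tau = 1.0 / precision
    return math.sqrt(tau) * rng.gauss(0.0, 1.0)


rng = random.Random(1)
draws = [sample_student_t(10.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
# For nu > 2 the theoretical variance is nu / (nu - 2), i.e. 1.25 for nu = 10.
```

Conditional on the latent scale, the error term is Gaussian again, which is what lets a sampler built for normal innovations be reused with one extra sampling step per observation.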
|
7 |
Design Space Exploration of MobileNet for Suitable Hardware Deployment
DEBJYOTI SINHA (8764737) 28 April 2020
<p>Designing self-regulating machines that can see and comprehend the real-world objects around them is a main goal of the AI domain. Recently, there have been marked advancements in deep learning that have produced state-of-the-art DNNs for various CV applications. It is challenging to deploy these DNNs on resource-constrained micro-controller units because they are often quite memory intensive. Design Space Exploration (DSE) is a technique that makes a CNN/DNN memory efficient and more flexible to deploy on resource-constrained hardware. MobileNet is a small DNN architecture designed for embedded and mobile vision, but researchers have still faced many challenges in deploying this model on resource-limited real-time processors.</p><p>This thesis proposes three new DNN architectures, developed using the Design Space Exploration technique. The state-of-the-art MobileNet baseline architecture is used as the foundation for these DNN architectures, which are enhanced versions of the baseline. DSE techniques such as data augmentation, architecture tuning, and architecture modification have been applied to improve the baseline architecture. First, the Thin MobileNet architecture is proposed, which uses more intricate block modules than the baseline MobileNet. It is a compact, efficient, and flexible architecture with good model accuracy. To obtain more compact models, the KilobyteNet and Ultra-thin MobileNet DNN architectures are proposed. Techniques such as channel depth alteration and hyperparameter tuning are introduced along with some of the techniques used for designing the Thin MobileNet. All the models are trained and validated from scratch on the CIFAR-10 dataset. The experimental results (training and testing) can be visualized using the live accuracy and logloss graphs provided by the Liveloss package. Of the three, the Ultra-thin MobileNet model offers the best balance between model accuracy and model size, and hence it is deployed on the NXP i.MX RT1060 embedded hardware unit for an image classification application.</p>
|
8 |
COLOR HALFTONING AND ACOUSTIC ANOMALY DETECTION FOR PRINTING SYSTEMS
Chin-ning Chen (9128687) 12 October 2021
<p>In the first chapter, we give an overview of printing systems and the focus of this dissertation.</p><p><br></p><p>In the second chapter, we present a tone-dependent fast error diffusion algorithm for color images, in which the quantizer is based on a simulated linearized printer space and the filter weight function depends on the ratio of the luminance of the current pixel to the maximum luminance value. The pixels are processed according to a serpentine scan instead of the classic raster scan. We compare the results of our algorithm to those achieved using the fixed Floyd-Steinberg weights and processing the image according to a raster scan ordering. In the third chapter, we first design a defect generator to generate synthetic abnormal printer sounds, and then develop or explore three features for sound-based anomaly detection. In the fourth chapter, we explore six classifiers as our anomaly detection models, and explore or develop six augmentation methods to see whether an augmented dataset can improve model performance. In the fifth chapter, we describe the data arrangement and the evaluation methods. Finally, we show the evaluation results based on different inputs, different features, and different classifiers.</p>
<p><br></p><p>In the last chapter, we summarize the contributions of this dissertation.</p>
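For readers unfamiliar with the error diffusion baseline named in the second chapter, the following is a minimal grayscale sketch using the fixed Floyd-Steinberg weights (7/16, 3/16, 5/16, 1/16), with an optional serpentine scan. It is not the dissertation's tone-dependent color algorithm: the single-channel input in [0, 1], the 0.5 threshold, and the function name are assumptions made for illustration.

```python
def floyd_steinberg(image, threshold=0.5, serpentine=True):
    """Binarize a grayscale image (rows of floats in [0, 1]) by error diffusion.

    Uses the fixed Floyd-Steinberg weights. With serpentine=True, odd rows
    are scanned right to left and the weights are mirrored accordingly;
    with serpentine=False, every row is scanned left to right (raster scan).
    """
    height, width = len(image), len(image[0])
    work = [list(row) for row in image]          # copy that accumulates error
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        left_to_right = (not serpentine) or (y % 2 == 0)
        step = 1 if left_to_right else -1
        xs = range(width) if left_to_right else range(width - 1, -1, -1)
        for x in xs:
            old = work[y][x]
            new = 1 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            # (dx, dy, weight): ahead, below-behind, below, below-ahead
            for dx, dy, w in ((step, 0, 7 / 16), (-step, 1, 3 / 16),
                              (0, 1, 5 / 16), (step, 1, 1 / 16)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny < height:
                    work[ny][nx] += err * w      # error leaving the image is dropped
    return out


mid_gray = [[0.5] * 16 for _ in range(16)]
halftone = floyd_steinberg(mid_gray)
coverage = sum(map(sum, halftone)) / (16 * 16)   # fraction of "on" pixels
```

A tone-dependent variant would replace the fixed weight tuple with weights chosen per pixel from the input tone, which is the direction the abstract describes.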
|
9 |
Design Space Exploration of MobileNet for Suitable Hardware Deployment
Sinha, Debjyoti 05 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Designing self-regulating machines that can see and comprehend the real-world objects around them is a main goal of the AI domain. Recently, there have been marked advancements in deep learning that have produced state-of-the-art DNNs for various CV applications. It is challenging to deploy these DNNs on resource-constrained micro-controller units because they are often quite memory intensive. Design Space Exploration (DSE) is a technique that makes a CNN/DNN memory efficient and more flexible to deploy on resource-constrained hardware. MobileNet is a small DNN architecture designed for embedded and mobile vision, but researchers have still faced many challenges in deploying this model on resource-limited real-time processors.
This thesis proposes three new DNN architectures, developed using the Design Space Exploration technique. The state-of-the-art MobileNet baseline architecture is used as the foundation for these DNN architectures, which are enhanced versions of the baseline. DSE techniques such as data augmentation, architecture tuning, and architecture modification have been applied to improve the baseline architecture. First, the Thin MobileNet architecture is proposed, which uses more intricate block modules than the baseline MobileNet. It is a compact, efficient, and flexible architecture with good model accuracy. To obtain more compact models, the KilobyteNet and Ultra-thin MobileNet DNN architectures are proposed. Techniques such as channel depth alteration and hyperparameter tuning are introduced along with some of the techniques used for designing the Thin MobileNet. All the models are trained and validated from scratch on the CIFAR-10 dataset. The experimental results (training and testing) can be visualized using the live accuracy and logloss graphs provided by the Liveloss package. Of the three, the Ultra-thin MobileNet model offers the best balance between model accuracy and model size, and hence it is deployed on the NXP i.MX RT1060 embedded hardware unit for an image classification application.
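As background on why the MobileNet baseline is already small, the sketch below compares the weight count of a standard convolution with that of the depthwise separable convolution that MobileNet is built from. The layer sizes are hypothetical, biases are ignored, and the thesis's own block modifications are not described in the abstract, so nothing here represents the proposed architectures.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (per-channel) conv plus a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)          # 18432 weights
separable = depthwise_separable_params(3, 32, 64)   # 288 + 2048 = 2336 weights
```

Because the pointwise term scales with the product of input and output channels, altering channel depth, as the abstract mentions, changes the parameter count directly.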
|
10 |
Data Centric Defenses for Privacy Attacks
Abhyankar, Nikhil Suhas 14 August 2023
Recent research shows that machine learning algorithms are highly susceptible to attacks that try to extract sensitive information about the data used in model training. These attacks, called privacy attacks, exploit the model training process. Contemporary defense techniques make alterations to the training algorithm. Such defenses are computationally expensive, cause a noticeable privacy-utility tradeoff, and require control over the training process. This thesis presents a data-centric approach that uses data augmentations to mitigate privacy attacks.
We present privacy-focused data augmentations that change the sensitive data submitted to the model trainer. Compared to traditional defenses, our method gives the individual data owner more control over protecting their private data. The defense is model-agnostic and does not require the data owner to have any control over the model training. Privacy-preserving augmentations are implemented for two attacks, namely membership inference and model inversion, using two distinct techniques. The proposed augmentations offer a better privacy-utility tradeoff on CIFAR-10 for membership inference, and against model inversion attacks they reduce the reconstruction rate to ≤ 1% while reducing the classification accuracy by only 2%. This is the first attempt to defend against model inversion and membership inference attacks using decentralized privacy protection. / Master of Science / Privacy attacks are threats that aim to extract sensitive information about the data used to train machine learning models. As machine learning is used extensively in many applications, models have access to private information such as financial records, medical history, etc., depending on the application. It has been observed that machine learning models can leak the information they contain. As models tend to 'memorize' training data to some extent, even removing the data from the training set cannot prevent privacy leakage. As a result, the research community has focused its attention on developing defense techniques to prevent this information leakage.
However, the existing defenses rely heavily on altering the way a machine learning model is trained. This is termed a model-centric approach, wherein the model owner is responsible for changing the model algorithm to preserve data privacy.
Doing so upholds data privacy but degrades model performance. Our work introduces the first data-centric defense, which gives the data owner the tools to protect their data. We demonstrate the effectiveness of the proposed defense in providing protection while ensuring that model performance is maintained to a great extent.
|