1

Genomic Data Augmentation with Variational Autoencoder

Thyrum, Emily 12 1900 (has links)
In order to treat cancer effectively, medical practitioners must predict pathological stages accurately, and machine learning methods can be employed to make such predictions. However, biomedical datasets, including genomic datasets, often have disproportionately more samples from people of European ancestry than from other ethnic or racial groups, which can cause machine learning methods to perform better on the European samples than on those from under-represented groups. Data augmentation can be employed as a potential solution to artificially increase the number of samples from under-represented racial groups, and can in turn improve pathological stage predictions for future patients from such groups. Genomic data augmentation has been explored previously, for example using a generative adversarial network, but to the best of our knowledge the use of the variational autoencoder for genomic data augmentation remains largely unexplored. Here we utilize a geometry-based variational autoencoder that models the latent space as a Riemannian manifold, so that samples can be generated without the use of a prior distribution, and show that the variational autoencoder can indeed be used to reliably augment genomic data. Using TCGA prostate cancer genotype data, we show that our VAE-generated data can improve pathological stage predictions on a test set of European samples. Because only the European samples were labeled with pathological stage, we could not validate the generated African samples in the same way, but we nevertheless attempt to show that such samples may be realistic. / Computer and Information Science
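The generation step the abstract describes rests on the standard VAE sampling machinery: encode a sample to a latent mean and variance, draw a latent vector via the reparameterization trick, and decode it back to data space. The sketch below illustrates only that generic step, with a hypothetical toy decoder and made-up dimensions; it is not the geometry-based, Riemannian-latent-space VAE the thesis actually uses.

```python
import math
import random

random.seed(0)

LATENT_DIM = 4  # toy latent size (assumption; not taken from the thesis)
DATA_DIM = 8    # toy genotype-vector length (assumption)

# Hypothetical encoder output for one minority-group sample:
# a mean and log-variance per latent dimension.
mu = [0.5, -0.2, 0.1, 0.0]
log_var = [-1.0, -1.2, -0.8, -1.5]

# Toy linear "decoder" weights, a stand-in for a trained network.
W = [[random.gauss(0.0, 0.3) for _ in range(LATENT_DIM)] for _ in range(DATA_DIM)]

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def decode(z):
    """Map a latent draw back to data space (sigmoid of a linear map)."""
    return [1.0 / (1.0 + math.exp(-sum(w * zj for w, zj in zip(row, z))))
            for row in W]

# Generate a few synthetic samples in the neighbourhood of the encoded sample.
synthetic = [decode(sample_latent(mu, log_var)) for _ in range(3)]
for s in synthetic:
    print([round(v, 3) for v in s])
```

Each decoded vector is a new synthetic data point; in the thesis the analogous draws come from the learned Riemannian latent geometry rather than a Gaussian around one encoding.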
2

Outcome-Driven Clustering of Microarray Data

Hsu, Jessie 17 September 2012 (has links)
The rapid technological development of high-throughput genomics has given rise to complex high-dimensional microarray datasets. One strategy for reducing the dimensionality of microarray experiments is to carry out a cluster analysis to find groups of genes with similar expression patterns. Though cluster analysis has been studied extensively, the clinical context in which the analysis is performed is usually considered separately if at all. However, allowing clinical outcomes to inform the clustering of microarray data has the potential to identify gene clusters that are more useful for describing the clinical course of disease. The aim of this dissertation is to utilize outcome information to drive the clustering of gene expression data. In Chapter 1, we propose a joint clustering model that assumes a relationship between gene clusters and a continuous patient outcome. Gene expression is modeled using cluster specific random effects such that genes in the same cluster are correlated. A linear combination of these random effects is then used to describe the continuous clinical outcome. We implement a Markov chain Monte Carlo algorithm to iteratively sample the unknown parameters and determine the cluster pattern. Chapter 2 extends this model to binary and failure time outcomes. Our strategy is to augment the data with a latent continuous representation of the outcome and specify that the risk of the event depends on the latent variable. Once the latent variable is sampled, we relate it to gene expression via cluster specific random effects and apply the methods developed in Chapter 1. The setting of clustering longitudinal microarrays using binary and survival outcomes is considered in Chapter 3. We propose a model that incorporates a random intercept and slope to describe the gene expression time trajectory. 
As before, a continuous latent variable that is linearly related to the random effects is introduced into the model and a Markov chain Monte Carlo algorithm is used for sampling. These methods are applied to microarray data from trauma patients in the Inflammation and Host Response to Injury research project. The resulting partitions are visualized using heat maps that depict the frequency with which genes cluster together.
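The latent-variable augmentation described for binary outcomes in Chapter 2 follows the same pattern as the classic probit data-augmentation sampler: draw each latent continuous variable from a normal truncated to the side implied by its binary outcome, then draw the regression parameter from its conjugate posterior. A minimal intercept-only sketch, far simpler than the gene-cluster model of the dissertation:

```python
import math
import random

random.seed(1)

# Binary outcomes: intercept-only probit model,
# y_i = 1  iff  z_i > 0,  with latent z_i ~ N(beta, 1).
y = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
n = len(y)

def truncated_normal(mean, lower=None, upper=None):
    """Rejection draw from N(mean, 1) truncated to one side (toy-scale only)."""
    while True:
        z = random.gauss(mean, 1.0)
        if lower is not None and z > lower:
            return z
        if upper is not None and z < upper:
            return z

beta = 0.0  # current draw of the intercept
draws = []
for _ in range(2000):  # Gibbs sampler
    # Augmentation step: draw each latent z_i consistent with its binary y_i.
    z = [truncated_normal(beta, lower=0.0) if yi == 1
         else truncated_normal(beta, upper=0.0) for yi in y]
    # Parameter step: with a flat prior, beta | z ~ N(mean(z), 1/n).
    zbar = sum(z) / n
    beta = random.gauss(zbar, 1.0 / math.sqrt(n))
    draws.append(beta)

post_mean = sum(draws[500:]) / len(draws[500:])
print(round(post_mean, 2))
```

The dissertation's version replaces the intercept with a linear combination of cluster-specific random effects, but the augment-then-sample alternation is the same.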
3

Heavy-Tailed Innovations in the R Package stochvol

Kastner, Gregor January 2015 (has links) (PDF)
We document how sampling from a conditional Student's t distribution is implemented in stochvol. Moreover, a simple example using EUR/CHF exchange rates illustrates how to use the augmented sampler. We conclude with results and implications. (author's abstract)
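Sampling conditional Student's t innovations is typically implemented through the scale-mixture-of-normals representation: a t_nu draw equals a normal draw whose variance is an inverse-gamma mixing variable. The sketch below shows that generic representation (it is not stochvol's actual sampler code):

```python
import random

random.seed(2)

nu = 5.0  # degrees of freedom (> 2 so the variance exists)

def student_t_draw(nu):
    """t_nu via the scale mixture: lam ~ InvGamma(nu/2, nu/2), eps ~ N(0, lam).

    If G ~ Gamma(nu/2, scale=1), then (nu/2)/G ~ InvGamma(nu/2, nu/2).
    """
    lam = (nu / 2.0) / random.gammavariate(nu / 2.0, 1.0)
    return lam ** 0.5 * random.gauss(0.0, 1.0)

draws = [student_t_draw(nu) for _ in range(100000)]
var = sum(d * d for d in draws) / len(draws)
print(round(var, 2))  # should be near nu / (nu - 2) = 5/3
```

Conditioning on the mixing variables lam reduces the model to a Gaussian one, which is what makes the augmented sampler convenient inside an MCMC scheme.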
4

Design Space Exploration of MobileNet for Suitable Hardware Deployment

DEBJYOTI SINHA (8764737) 28 April 2020 (has links)
Designing self-regulating machines that can see and comprehend the real-world objects around them is a central aim of AI. Recently, there have been marked advances in deep learning toward state-of-the-art DNNs for various computer vision applications. Deploying these DNNs on resource-constrained microcontroller units is challenging because they are often quite memory intensive. Design space exploration (DSE) is a technique that makes a CNN/DNN more memory efficient and more flexible to deploy on resource-constrained hardware. MobileNet is a small DNN architecture designed for embedded and mobile vision, but researchers still face many challenges in deploying it on resource-limited real-time processors.
This thesis proposes three new DNN architectures developed using the design space exploration technique. The state-of-the-art MobileNet baseline architecture serves as their foundation; they are enhanced versions of it. DSE techniques such as data augmentation, architecture tuning, and architecture modification were applied to improve the baseline architecture. First, the Thin MobileNet architecture is proposed, which uses more intricate block modules than the baseline MobileNet; it is a compact, efficient, and flexible architecture with good model accuracy. To obtain still more compact models, the KilobyteNet and Ultra-thin MobileNet architectures are proposed, introducing techniques such as channel-depth alteration and hyperparameter tuning along with some of the techniques used for the Thin MobileNet. All models are trained and validated from scratch on the CIFAR-10 dataset, and the experimental (training and testing) results can be visualized with the live accuracy and log-loss graphs provided by the Liveloss package. The Ultra-thin MobileNet model offers the best balance of accuracy and model size of the three, and it is therefore deployed on the NXP i.MX RT1060 embedded hardware unit for an image classification application.
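The memory savings that make MobileNet deployable at all come from replacing standard convolutions with depthwise separable ones; the parameter arithmetic is easy to check directly (toy layer sizes chosen for illustration):

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel, then 1 x 1 pointwise to c_out."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 3*3*64*128 = 73728
sep = depthwise_separable_params(k, c_in, c_out)  # 3*3*64 + 64*128 = 8768
print(std, sep, round(std / sep, 1))
```

For this layer the separable form needs roughly an eighth of the weights, which is the kind of headroom the thesis then spends on further thinning (channel-depth alteration) for microcontroller deployment.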
5

COLOR HALFTONING AND ACOUSTIC ANOMALY DETECTION FOR PRINTING SYSTEMS

Chin-ning Chen (9128687) 12 October 2021 (has links)
In the first chapter, we give an overview of printing systems and the focus of this dissertation.
In the second chapter, we present a tone-dependent fast error diffusion algorithm for color images, in which the quantizer is based on a simulated linearized printer space and the filter weight function depends on the ratio of the luminance of the current pixel to the maximum luminance value. The pixels are processed in a serpentine scan rather than the classic raster scan. We compare the results of our algorithm to those obtained with the fixed Floyd-Steinberg weights and raster-scan processing. In the third chapter, we first design a defect generator to synthesize abnormal printer sounds, and then develop or explore three features for sound-based anomaly detection. In the fourth chapter, we explore six classifiers as our anomaly detection models, and explore or develop six augmentation methods to see whether an augmented dataset can improve model performance. In the fifth chapter, we describe the data arrangement and the evaluation methods, and then present evaluation results for different inputs, features, and classifiers.
In the last chapter, we summarize the contributions of this dissertation.
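The fixed-weight Floyd-Steinberg baseline with a serpentine scan, which the dissertation's tone-dependent algorithm is compared against, can be sketched in a few lines (this is the generic textbook version, not the proposed tone-dependent quantizer):

```python
def halftone_serpentine(img):
    """Floyd-Steinberg error diffusion with a serpentine scan.

    img: 2D list of grayscale values in [0, 255]; returns a 0/255 bitmap.
    Classic fixed weights (7/16, 3/16, 5/16, 1/16), mirrored on
    right-to-left rows via the scan direction `step`.
    """
    h, w = len(img), len(img[0])
    buf = [[float(v) for v in row] for row in img]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        rng = range(w) if y % 2 == 0 else range(w - 1, -1, -1)
        step = 1 if y % 2 == 0 else -1  # scan direction flips each row
        for x in rng:
            old = buf[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            err = old - new
            # Diffuse the quantization error onto unprocessed neighbours.
            for dx, dy, wgt in ((step, 0, 7 / 16), (-step, 1, 3 / 16),
                                (0, 1, 5 / 16), (step, 1, 1 / 16)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    buf[ny][nx] += err * wgt
    return out

gray = [[128] * 8 for _ in range(8)]  # flat mid-gray test patch
ht = halftone_serpentine(gray)
print(sum(v for row in ht for v in row) / (64 * 255))  # fraction of white dots
```

On a flat mid-gray patch, roughly half the output pixels turn white, preserving average tone; the dissertation's variant additionally makes the weights depend on local luminance.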
7

Data Augmentation GUI Tool for Machine Learning Models

Sharma, Sweta 30 October 2023 (has links)
The industrial production of semiconductor assemblies is subject to high quality requirements, so several tests of component quality are needed. In the long run, manual quality assurance (QA) is often associated with higher costs; with a machine learning based technique, some of these tests can be carried out automatically. Deep neural networks (NNs) have proven very effective across a diverse range of computer vision applications. Convolutional neural networks (CNNs), a subset of NNs, are an especially effective tool for image classification. Deep NNs have the disadvantage of requiring a large quantity of training data to reach excellent performance; when the dataset is too small, a phenomenon known as overfitting can occur. Massive amounts of data cannot be supplied in certain contexts, such as semiconductor production, especially given the relatively small number of rejected components in this field. To prevent overfitting, a variety of image augmentation methods can be used to artificially create training images, although many of these methods are not applicable in every domain. For this thesis, Infineon Technologies AG provided images of a semiconductor component acquired with an ultrasonic microscope. The dataset contains a sufficient number of good components, defined as components that passed quality control, and a minority of rejected components, which contain a defect and did not pass quality control. The success, efficiency, and quality of such a project depend on a number of factors, and selecting the appropriate tools is among the most important, since it enables significant savings of time and resources while producing the best results.
We demonstrate a data augmentation graphical user interface (GUI) tool of the kind widely used in image processing. With this approach, the dataset size is increased while maintaining the accuracy-time trade-off and improving the robustness of deep learning models. The purpose of this work is to develop a user-friendly tool that incorporates traditional, advanced, and smart data augmentation, image processing, and machine learning (ML) approaches; the techniques used include zooming, rotation, flipping, cropping, GANs, fusion, histogram matching, autoencoders, image restoration, and compression. The focus is on designing and implementing a MATLAB GUI for data augmentation and ML models. The thesis was carried out for Infineon Technologies AG to address a challenge that all semiconductor industries experience. The key objective is not only to create an easy-to-use GUI, but also to ensure that its users need no advanced technical experience to operate it. The GUI can run as a standalone application and can be deployed anywhere for data augmentation and classification, streamlining the workflow and making the quality assurance job easy even for those unfamiliar with data augmentation, machine learning, or MATLAB. In addition, the work investigates the benefits of data augmentation and image processing, and how these might improve the accuracy of AI models.
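The classical augmentations listed (zooming, rotation, flipping, cropping) are simple array transforms. Since the thesis tool is a MATLAB GUI, the following is only a generic pure-Python equivalent of three of those operations, not the tool's code:

```python
def hflip(img):
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*img[::-1])]

def crop(img, top, left, h, w):
    """Crop an h x w window; a real pipeline would resize it back up."""
    return [row[left:left + w] for row in img[top:top + h]]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(hflip(img))             # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(rotate90(img))          # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
print(crop(img, 0, 1, 2, 2))  # [[2, 3], [5, 6]]
```

Each transform yields a label-preserving variant of the original image, which is what lets a small defect dataset be expanded without collecting new scans.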
8

A statistical framework for estimating output-specific efficiencies

Gstach, Dieter January 2003 (has links) (PDF)
This paper presents a statistical framework for estimating output-specific efficiencies for the 2-output case based upon a DEA frontier estimate. The key to the approach is the concept of a target output-mix. Being usually unobserved, the target output-mixes of firms are modelled as missing data. Using this concept, the relevant data-generating process can be formulated. The resulting likelihood function is analytically intractable, so a data-augmented Bayesian approach is proposed for estimation and adapted to the present purpose. Some implementation issues are discussed, leading to an empirical Bayes setup with data-informed priors. A proof of scale invariance is provided. (author's abstract) / Series: Department of Economics Working Paper Series
9

Data Augmentation and Dynamic Linear Models

Frühwirth-Schnatter, Sylvia January 1992 (has links) (PDF)
We define a subclass of dynamic linear models with unknown hyperparameters called d-inverse-gamma models. We then approximate the marginal p.d.f.s of the hyperparameter and the state vector by the data augmentation algorithm of Tanner/Wong. We prove that the regularity conditions for convergence hold. A sampling based scheme for practical implementation is discussed. Finally, we illustrate how to obtain an iterative importance sampling estimate of the model likelihood. (author's abstract) / Series: Forschungsberichte / Institut für Statistik
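The Tanner/Wong data augmentation algorithm alternates an imputation step (draw the latent data given the current parameters) with a posterior step (draw the parameters given the completed data). A toy illustration on right-censored normal data, much simpler than the d-inverse-gamma dynamic linear models of the paper, with hypothetical numbers:

```python
import math
import random

random.seed(3)

# Right-censored N(mu, 1) data: some values are only known to exceed c.
c = 1.0
observed = [0.2, -0.5, 0.8, 0.4, -0.1]  # fully observed values (made up)
n_censored = 4                           # values known only to exceed c
n = len(observed) + n_censored

def draw_above(mu, c):
    """Rejection draw from N(mu, 1) truncated to (c, inf) -- toy-scale only."""
    while True:
        z = random.gauss(mu, 1.0)
        if z > c:
            return z

mu = 0.0
draws = []
for _ in range(3000):
    # Imputation step: fill in the censored values given the current mu.
    imputed = [draw_above(mu, c) for _ in range(n_censored)]
    full = observed + imputed
    # Posterior step: with a flat prior, mu | completed data ~ N(mean, 1/n).
    mu = random.gauss(sum(full) / n, 1.0 / math.sqrt(n))
    draws.append(mu)

post_mean = sum(draws[1000:]) / len(draws[1000:])
print(round(post_mean, 2))
```

The chain's posterior mean for mu sits above the naive average of the uncensored values, because the augmentation step restores the information carried by the censored observations.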
10

Statistical methods for species richness estimation using count data from multiple sampling units

Argyle, Angus Gordon 23 April 2012 (has links)
The planet is experiencing a dramatic loss of species. The majority of species are unknown to science, and it is usually infeasible to conduct a census of a region to acquire a complete inventory of all life forms. Therefore, it is important to estimate and conduct statistical inference on the total number of species in a region based on samples obtained from field observations. Such estimates may suggest the number of species new to science and at potential risk of extinction. In this thesis, we develop novel methodology to conduct statistical inference, based on abundance-based data collected from multiple sampling locations, on the number of species within a taxonomic group residing in a region. The primary contribution of this work is the formulation of novel statistical methodology for analysis in this setting, where abundances of species are recorded at multiple sampling units across a region. This particular area has received relatively little attention in the literature. In the first chapter, the problem of estimating the number of species is formulated in a broad context, one that occurs in several seemingly unrelated fields of study. Estimators are commonly developed from statistical sampling models. Depending on the organisms or objects under study, different sampling techniques are used, and consequently, a variety of statistical models have been developed for this problem. A review of existing estimation methods, categorized by the associated sampling model, is presented in the second chapter. The third chapter develops a new negative binomial mixture model. The negative binomial model is employed to account for the common tendency of individuals of a particular species to occur in clusters. An exponential mixing distribution permits inference on the number of species that exist in the region, but were in fact absent from the sampling units. 
Adopting a classical approach for statistical inference, we develop the maximum likelihood estimator and a corresponding profile log-likelihood interval estimate of species richness. In addition, a Gaussian-based confidence interval based on large-sample theory is presented. The fourth chapter further extends the hierarchical model developed in Chapter 3 into a Bayesian framework. The motivation for the Bayesian paradigm is explained, and a hierarchical model based on random effects and discrete latent variables is presented. Computing the posterior distribution in this case is not straightforward. A data augmentation technique that indirectly places priors on species richness is employed to compute the model using a Metropolis-Hastings algorithm. The fifth chapter examines the performance of our new methodology. Simulation studies are used to examine the mean-squared error of our proposed estimators, and comparisons to several commonly used non-parametric estimators are made. Several conclusions emerge, and settings where our approaches can yield superior performance are clarified. In the sixth chapter, we present a case study. The methodology is applied to a real data set of oribatid mites (a taxonomic order of micro-arthropods) collected from multiple sites in a tropical rainforest in Panama. We adjust our statistical sampling models to account for the varying masses of material sampled from the sites. The resulting estimates of species richness for the oribatid mites are useful and contribute to a wider investigation, currently underway, examining the species richness of all arthropods in the rainforest. Our approaches are the only existing methods that can make full use of abundance-based data from multiple sampling units located in a single region. The seventh and final chapter concludes the thesis with a discussion of key considerations related to implementation and modeling assumptions, and describes potential avenues for further investigation. / Graduate
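Among the commonly used non-parametric comparators for abundance data is the Chao1 estimator, which corrects observed richness upward using singleton and doubleton counts. A minimal sketch with hypothetical counts pooled over sampling units (a benchmark of the kind compared against, not the thesis's negative binomial mixture method):

```python
from collections import Counter

def chao1(counts):
    """Bias-corrected Chao1 estimate of species richness.

    counts: total abundance per observed species, pooled over sampling units.
    Estimate = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    where f1/f2 are the numbers of singleton/doubleton species.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # species seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # species seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))

# Hypothetical per-unit abundance records, pooled by summing over units.
units = [
    {"sp1": 3, "sp2": 1, "sp3": 0, "sp4": 1},
    {"sp1": 2, "sp2": 0, "sp3": 1, "sp5": 1},
]
pooled = Counter()
for u in units:
    pooled.update(u)

est = chao1(list(pooled.values()))
print(round(est, 2))  # observed richness 5, estimate 11.0
```

Note that pooling discards the per-unit structure; the point of the thesis's models is precisely to exploit that structure rather than collapse it.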
