1

<b>Using ICU Admission as a Predictor for Maternal Mortality: Identifying Essential Features for Accurate Classification</b>

Dairian Haulani Ly Balai (18415224) 20 April 2024 (has links)
<p dir="ltr">Maternal mortality (MM) is a pressing global health issue that results in thousands of mothers dying annually from pregnancy-related complications. Despite spending trillions of dollars on the healthcare industry, the U.S. continues to experience one of the highest rates of maternal death (MD) compared to other developed countries. This ongoing public health crisis highlights the urgent need for innovative strategies to detect and mitigate adverse maternal outcomes. This study introduces a novel approach, utilizing admission to the ICU as a proxy for MM. By analyzing 14 years of natality birth data, this study aims to explore the complex web of factors that elevate the chances of MD. The primary goal of this study is to identify features that are most influential in predicting ICU admission cases. These factors hold the potential to be applied to MM, as they can serve as early warning signs that complications may arise, allowing healthcare professionals to step in and intervene before adverse maternal outcomes occur. Two supervised machine learning models were employed in this study, specifically Logistic Regression (LR) and eXtreme Gradient Boosting (XGBoost). The models were executed twice for each dataset: once incorporating all available features and again utilizing only the most significant features. Following model training, XGBoost’s feature selection technique was employed to identify the top 10 influential features that are most important to the classification process. Our analysis revealed a diverse range of factors that are important for the prediction of ICU admission cases. In this study, we identified maternal transfusion, labor and delivery characteristics, delivery methods, gestational age, maternal attributes, and newborn conditions as the most influential factors to categorize maternal ICU admission cases. 
In terms of model performance, XGBoost consistently outperformed LR across various datasets, demonstrating higher accuracy, precision, and F1 scores. For recall, however, LR maintained higher scores, surpassing those of XGBoost. Moreover, the models consistently achieved higher scores when trained with all available features compared to those trained solely with the top features. Although the models demonstrated satisfactory performance in some evaluation metrics, there were notable deficiencies in recall and precision, which suggests further model refinement is needed to effectively predict these cases.</p>
2

A STUDY ON THE IMPACT OF PREPROCESSING STEPS ON MACHINE LEARNING MODEL FAIRNESS

Sathvika Kotha (18370548) 17 April 2024 (has links)
<p dir="ltr">The success of machine learning techniques in widespread applications has taught us that with respect to accuracy, the more data, the better the model. However, for fairness, data quality is perhaps more important than quantity. Existing studies have considered the impact of data preprocessing on the accuracy of ML model tasks. However, the impact of preprocessing on the fairness of the downstream model has neither been studied nor well understood. Throughout this thesis, we conduct a systematic study of how data quality issues and data preprocessing steps impact model fairness. Our study evaluates several preprocessing techniques for several machine learning models trained over datasets with different characteristics and evaluated using several fairness metrics. It examines different data preparation techniques, such as changing categories into numbers, filling in missing information, and smoothing out unusual data points. The study measures fairness using standards that check if the model treats all groups equally, predicts outcomes fairly, and gives similar chances to everyone. By testing these methods on various types of data, the thesis identifies which combinations of techniques can make the models both accurate and fair.The empirical analysis demonstrated that preprocessing steps like one-hot encoding, imputation of missing values, and outlier treatment significantly influence fairness metrics. Specifically, models preprocessed with median imputation and robust scaling exhibited the most balanced performance across fairness and accuracy metrics, suggesting a potential best practice guideline for equitable ML model preparation. Thus, this work sheds light on the importance of data preparation in ML and emphasizes the need for careful handling of data to support fair and ethical use of ML in society.</p>
3

ADVANCES IN MACHINE LEARNING METHODOLOGIES FOR BUSINESS ANALYTICS, VIDEO SUPER-RESOLUTION, AND DOCUMENT CLASSIFICATION

Tianqi Wang (18431280) 26 April 2024 (has links)
<p dir="ltr">This dissertation encompasses three studies in distinct yet impactful domains: B2B marketing, real-time video super-resolution (VSR), and smart office document routing systems. In the B2B marketing sphere, the study addresses the extended buying cycle by developing an algorithm for customer data aggregation and employing a CatBoost model to predict potential purchases with 91% accuracy. This approach enables the identification of high-potential<br>customers for targeted marketing campaigns, crucial for optimizing marketing efforts.<br>Transitioning to multimedia enhancement, the dissertation presents a lightweight recurrent network for real-time VSR. Developed for applications requiring high-quality video with low latency, such as video conferencing and media playback, this model integrates an optical flow estimation network for motion compensation and leverages a hidden space for the propagation of long-term information. The model demonstrates high efficiency in VSR. A<br>comparative analysis of motion estimation techniques underscores the importance of minimizing information loss.<br>The evolution towards smart office environments underscores the importance of an efficient document routing system, conceptualized as an online class-incremental image classification challenge. This research introduces a one-versus-rest parametric classifier, complemented by two updating algorithms based on passive-aggressiveness, and adaptive thresholding methods to manage low-confidence predictions. Tested on 710 labeled real document<br>images, the method reports a cumulative accuracy rate of approximately 97%, showcasing the effectiveness of the chosen aggressiveness parameter through various experiments.</p>
4

Extending Synthetic Data and Data Masking Procedures using Information Theory

Tyler J Lewis (15361780) 26 April 2023 (has links)
<p>The two primary methodologies discussed in this thesis are the nonparametric entropy-based synthetic timeseries (NEST) and Directed Infusion of Data (DIOD) algorithms. </p> <p><br></p> <p>The former presents a novel synthetic data algorithm that is shown to outperform similar state-of-the-art methods, including generative networks, in terms of utility and data consistency. The majority of the data used are open-source and are cited where appropriate.</p> <p><br></p> <p>DIOD presents a novel data masking paradigm that preserves the utility, privacy, and efficiency required by the current industrial paradigm, and presents a cheaper alternative to many state-of-the-art methods. Data used include simulation data (source code cited), equations-based data, and open-source images (cited as needed). </p>
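NEST itself is not reproduced here, but the general idea of nonparametric synthetic timeseries generation can be illustrated with a simple block-bootstrap resampler: new series are assembled from contiguous blocks of the observed one, preserving local structure without any parametric model. The block length and toy data below are assumptions for illustration only.

```python
# Hedged sketch: generic nonparametric block-bootstrap synthesis of a
# timeseries -- illustrative of the idea, NOT the NEST algorithm.
import numpy as np

def block_bootstrap(series, block_len, length, rng):
    """Concatenate randomly chosen contiguous blocks of the source series."""
    out = []
    while len(out) < length:
        start = rng.integers(0, len(series) - block_len + 1)
        out.extend(series[start:start + block_len])
    return np.array(out[:length])

rng = np.random.default_rng(1)
observed = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * rng.normal(size=200)
synthetic = block_bootstrap(observed, block_len=20, length=200, rng=rng)
print(synthetic.shape)
```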
5

Assessing Viability of Open-Source Battery Cycling Data for Use in Data-Driven Battery Degradation Models

Ritesh Gautam (17582694) 08 December 2023 (has links)
<p dir="ltr">Lithium-ion batteries are being used increasingly more often to provide power for systems that range all the way from common cell-phones and laptops to advanced electric automotive and aircraft vehicles. However, as is the case for all battery types, lithium-ion batteries are prone to naturally occurring degradation phenomenon that limit their effective use in these systems to a finite amount of time. This degradation is caused by a plethora of variables and conditions including things like environmental conditions, physical stress/strain on the body of the battery cell, and charge/discharge parameters and cycling. Accurately and reliably being able to predict this degradation behavior in battery systems is crucial for any party looking to implement and use battery powered systems. However, due to the complicated non-linear multivariable processes that affect battery degradation, this can be difficult to achieve. Compared to traditional methods of battery degradation prediction and modeling like equivalent circuit models and physics-based electrochemical models, data-driven machine learning tools have been shown to be able to handle predicting and classifying the complex nature of battery degradation without requiring any prior knowledge of the physical systems they are describing.</p><p dir="ltr">One of the most critical steps in developing these data-driven neural network algorithms is data procurement and preprocessing. Without large amounts of high-quality data, no matter how advanced and accurate the architecture is designed, the neural network prediction tool will not be as effective as one trained on high quality, vast quantities of data. 
This work aims to gather battery degradation data from a wide variety of sources and studies, examine how the data was produced, test the effectiveness of the data in the Interfacial Multiphysics Laboratory’s autoencoder based neural network tool CD-Net, and analyze the results to determine factors that make battery degradation datasets perform better for use in machine learning/deep learning tools. This work also aims to relate this work to other data-driven models by comparing the CD-Net model’s performance with the publicly available BEEP’s (Battery Evaluation and Early Prediction) ElasticNet model. The reported accuracy and prediction models from the CD-Net and ElasticNet tools demonstrate that larger datasets with actively selected training/testing designations and less errors in the data produce much higher quality neural networks that are much more reliable in estimating the state-of-health of lithium-ion battery systems. The results also demonstrate that data-driven models are much less effective when trained using data from multiple different cell chemistries, form factors, and cycling conditions compared to more congruent datasets when attempting to create a generalized prediction model applicable to multiple forms of battery cells and applications.</p>
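The ElasticNet side of the comparison can be sketched generically as below. This is not the CD-Net or BEEP code: the features (cycle count, temperature), the toy capacity-fade model, and all constants are invented for illustration; only the pattern (regularized linear regression of state of health on cycling features) follows the text.

```python
# Hedged sketch: ElasticNet regression on a toy state-of-health dataset --
# illustrative of the data-driven approach, not the thesis's models or data.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
cycles = rng.uniform(0, 1000, size=(200, 1))
temp = rng.uniform(20, 45, size=(200, 1))
X = np.hstack([cycles, temp])
# Toy state-of-health: linear fade with cycle count plus a mild temperature
# effect and noise (an invented degradation model, not a physical one).
soh = (1.0 - 2e-4 * cycles[:, 0] - 1e-3 * (temp[:, 0] - 25)
       + 0.01 * rng.normal(size=200))

model = ElasticNet(alpha=1e-4, max_iter=50000).fit(X, soh)
pred = model.predict(X)
print(pred.shape)
```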
6

INVESTIGATING OFFENDER TYPOLOGIES AND VICTIM VULNERABILITIES IN ONLINE CHILD GROOMING

Siva sahitya Simhadri (17522730) 02 December 2023 (has links)
<p dir="ltr">One of the issues on social media that is expanding the fastest is children being exposed to predators online [ 1 ]. Due to the ease with which a larger segment of the younger population may now access the Internet, online grooming activity on social media has grown to be a significant social concern. Child grooming, in which adults and minors exchange sexually explicit text and media via social media platforms, is a typical component of online child exploitation. An estimated 500,000 predators operate online every day. According to estimates, Internet chat rooms and instant messaging are where 89% of sexual approaches against children take place. The child may face a variety of unpleasant consequences following a grooming event, including shame, anger, anxiety, tension, despair, and substance abuse which make it more difficult for them to report the exploitation. A substantial amount of research in this domain has focused on identifying certain vulnerabilities of the victims of grooming. These vulnerabilities include specific age groups, gender, psychological factors, no family support, and lack of good social relations which make young people more vulnerable to grooming. So far no technical work has been done to apply statistical analysis on these vulnerability profiles and observe how these patterns change between different victim types and offender types. This work presents a detailed analysis of the effect of Offender type (contact and fantasy) and victim type (Law Enforcement Officers, Real Victims and Decoys (Perverted Justice)) on representation of different vulnerabilities in grooming conversations. Comparison of different victim groups would provide insights into creating the right training material for LEOs and decoys and help in the training process for online sting operations. Moreover, comparison of different offender types would help create targeted prevention strategies to tackle online child grooming and help the victims.</p>
7

Node Centric Community Detection and Evolutional Prediction in Dynamic Networks

Oluwafolake A Ayano (13161288) 27 July 2022 (has links)
<p>Advances in technology have led to the availability of data from different platforms such as the web and social media. Much of this data can be represented in the form of a network consisting of a set of nodes connected by edges. The nodes represent the items in the network while the edges represent the interactions between the nodes. Community detection methods have been used extensively in analyzing these networks. However, community detection in evolving networks has been a significant challenge because of the frequent changes to the networks and the need for real-time analysis. Using static community detection methods to analyze dynamic networks is not appropriate because static methods do not retain a network’s history and cannot provide real-time information about the communities in the network.</p> <p>Existing incremental methods treat changes to the network as a sequence of edge additions and/or removals; however, in many real-world networks, changes occur when a node is added with all its edges connecting simultaneously. </p> <p>For efficient processing of such large networks in a timely manner, there is a need for an adaptive analytical method that can process large networks without recomputing the entire network after its evolution and that treats all the edges involved with a node equally. </p> <p>We propose a node-centric community detection method that incrementally updates the community structure in the network, using the already known structure to avoid recomputing the entire network from scratch and consequently achieve a high-quality community structure. The results from our experiments suggest that our approach is efficient for incremental community detection in node-centric evolving networks. </p>
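The node-centric idea — a node arrives with all its edges at once and is placed without recomputing the whole network — can be sketched with a simple neighbor-majority rule. The data structures, function name, and tie-breaking are illustrative assumptions, not the thesis's method.

```python
# Hedged sketch: incremental, node-centric community assignment by
# neighbor-majority vote. A toy illustration, not the proposed algorithm.
from collections import Counter

def add_node(node, neighbors, community_of):
    """Place a new node in the community most of its neighbors belong to,
    or start a fresh community if it has no assigned neighbors."""
    votes = Counter(community_of[n] for n in neighbors if n in community_of)
    if votes:
        community_of[node] = votes.most_common(1)[0][0]
    else:
        community_of[node] = max(community_of.values(), default=-1) + 1
    return community_of[node]

community_of = {"a": 0, "b": 0, "c": 1}
add_node("d", ["a", "b"], community_of)   # joins the majority community
add_node("e", [], community_of)           # isolated node -> new community
print(community_of)
```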
8

Towards Privacy and Communication Efficiency in Distributed Representation Learning

Sheikh S Azam (12836108) 10 June 2022 (has links)
<p>Over the past decade, distributed representation learning has emerged as a popular alternative to conventional centralized machine learning training. The increasing interest in distributed representation learning, specifically federated learning, can be attributed to its fundamental property that promotes data privacy and communication savings. While conventional ML encourages aggregating data at a central location (e.g., data centers), distributed representation learning advocates keeping data at the source and instead transmitting model parameters across the network. However, since the advent of deep learning, model sizes have become increasingly large, often comprising millions to billions of parameters, which leads to the problem of communication latency in the learning process. In this thesis, we propose to tackle the problem of communication latency in two different ways: (i) learning private representation of data to enable its sharing, and (ii) reducing the communication latency by minimizing the corresponding long-range communication requirements.</p> <p><br></p> <p>To tackle the former goal, we first start by studying the problem of learning representations that are private yet informative, i.e., providing information about intended ''ally'' targets while hiding sensitive ''adversary'' attributes. We propose Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes, unlike existing PRL solutions. We then address the practical constraints of the distributed datasets by developing Distributed EIGAN (D-EIGAN), the first distributed PRL method that learns a private representation at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. 
Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state-of-the-art by a significant accuracy margin (47% improvement), and D-EIGAN's performance is consistently on par with EIGAN under different network settings.</p> <p><br></p> <p>We next tackle the latter objective - reducing the communication latency - and propose two-timescale hybrid federated learning (TT-HF), a semi-decentralized learning architecture that combines the conventional device-to-server communication paradigm for federated learning with device-to-device (D2D) communications for model training. In TT-HF, during each global aggregation interval, devices (i) perform multiple stochastic gradient descent iterations on their individual datasets, and (ii) aperiodically engage in a consensus procedure over their model parameters through cooperative, distributed D2D communications within local clusters. With a new general definition of gradient diversity, we formally study the convergence behavior of TT-HF, resulting in new convergence bounds for distributed ML. We leverage our convergence bounds to develop an adaptive control algorithm that tunes the step size, D2D communication rounds, and global aggregation period of TT-HF over time to target a sublinear convergence rate of O(1/t) while minimizing network resource utilization. Our subsequent experiments demonstrate that TT-HF significantly outperforms the current state of the art in federated learning in terms of model accuracy and/or network energy consumption in different scenarios where local device datasets exhibit statistical heterogeneity. Finally, our numerical evaluations demonstrate robustness against outages caused by fading channels, as well as favorable performance with non-convex loss functions.</p>
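The semi-decentralized pattern behind TT-HF — local gradient steps, aperiodic D2D consensus within clusters, then a global device-to-server average — can be sketched in a few lines. This is not TT-HF's algorithm or tuning: the toy quadratic loss, scalar parameters, cluster layout, and all constants are illustrative assumptions.

```python
# Hedged sketch of semi-decentralized federated learning on a toy problem:
# local SGD + periodic in-cluster consensus + global aggregation.
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=5)          # each device's local optimum
w = rng.normal(size=5)                # one scalar parameter per device
clusters = [[0, 1, 2], [3, 4]]        # D2D communication clusters

for step in range(50):
    w -= 0.1 * (w - targets)          # local SGD on loss 0.5 * (w - target)^2
    if step % 5 == 0:                 # aperiodic D2D consensus within clusters
        for c in clusters:
            w[c] = w[c].mean()
w_global = w.mean()                   # device-to-server global aggregation
print(w_global)
```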
9

Myson Burch Thesis

Myson C Burch (16637289) 08 August 2023 (has links)
<p>With the completion of the Human Genome Project and many additional efforts since, there is an abundance of genetic data that can be leveraged to revolutionize healthcare. Now, there are significant efforts to develop state-of-the-art techniques that reveal insights about connections between genetics and complex diseases such as diabetes, heart disease, or common psychiatric conditions that depend on multiple genes interacting with environmental factors. These methods help pave the way towards diagnosis, cure, and ultimately prediction and prevention of complex disorders. As a part of this effort, we address high dimensional genomics-related questions through mathematical modeling, statistical methodologies, combinatorics and scalable algorithms. More specifically, we develop innovative techniques at the intersection of technology and life sciences using biobank scale data from genome-wide association studies (GWAS) and machine learning as an effort to better understand human health and disease. <br> <br> The underlying principle behind GWAS is a test for association between genotyped variants for each individual and the trait of interest. GWAS have been extensively used to estimate the signed effects of trait-associated alleles, mapping genes to disorders; over the past decade, about 10,000 strong associations between genetic variants and one (or more) complex traits have been reported. One of the key challenges in GWAS is population stratification, which can lead to spurious genotype-trait associations. Our work proposes a simple clustering-based approach that corrects for stratification better than existing methods. This method takes into account the linkage disequilibrium (LD) while computing the distance between the individuals in a sample. 
Our approach, called CluStrat, performs Agglomerative Hierarchical Clustering (AHC) using a regularized Mahalanobis distance-based GRM, which captures the population-level covariance (LD) matrix for the available genotype data.<br> <br> Linear mixed models (LMMs) have been a popular and powerful method when conducting GWAS in the presence of population structure. However, LMMs are computationally expensive relative to simpler techniques. We implement matrix sketching in LMMs (MaSk-LMM) to mitigate the more expensive computations. Matrix sketching is an approximation technique where random projections are applied to compress the original dataset into one that is significantly smaller and still preserves some of the properties of the original dataset up to some guaranteed approximation ratio. This technique naturally applies to problems in genetics where we can treat large biobanks as a matrix with the rows representing samples and columns representing SNPs. These matrices will be very large due to the large number of individuals and markers in biobanks and can benefit from matrix sketching. Our approach tackles the bottleneck of LMMs directly by using sketching on the samples of the genotype matrix as well as sketching on the markers during the computation of the relatedness or kinship matrix (GRM). <br> <br> Predictive analytics have been used to improve healthcare by reinforcing decision-making, enhancing patient outcomes, and providing relief for the healthcare system. These methods help pave the way towards diagnosis, cure, and ultimately prediction and prevention of complex disorders. The prevalence of these complex diseases varies greatly around the world. Understanding the basis of this prevalence difference can help disentangle the interaction among different factors causing complex disorders and identify groups of people who may be at a greater risk of developing certain disorders. 
This could become the basis of the implementation of early intervention strategies for populations at higher risk with significant benefits for public health.<br> <br> This dissertation broadens our understanding of empirical population genetics. It proposes a data-driven perspective to a variety of problems in genetics such as confounding factors in genetic structure. This dissertation highlights current computational barriers in open problems in genetics and provides robust, scalable and efficient methods to ease the analysis of genotype data.</p>
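The sketching idea described above — compress the marker dimension with a random projection before forming the kinship matrix — can be illustrated as below. This is not the MaSk-LMM implementation: the matrix sizes, the 0/1/2 genotype coding, and the plain Gaussian sketch are assumptions for illustration.

```python
# Hedged sketch: Gaussian sketching on the markers of a toy genotype matrix
# to approximate the kinship (GRM) cheaply. Illustrative only, not MaSk-LMM.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snps, k = 100, 500, 50
G = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)  # 0/1/2 calls
G -= G.mean(axis=0)                                 # center each SNP column

# Sketch matrix with N(0, 1/k) entries so E[(G S)(G S)^T] = G G^T.
S = rng.normal(size=(n_snps, k)) / np.sqrt(k)
GRM_full = G @ G.T / n_snps                         # exact n x n kinship
GRM_sketch = (G @ S) @ (G @ S).T / n_snps           # from the compressed data
print(GRM_sketch.shape)
```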
10

Decomposition and Stability of Multiparameter Persistence Modules

Cheng Xin (16750956) 04 August 2023 (has links)
<p>The only datasets used in my thesis work are from TUDatasets, <a href="https://chrsmrrs.github.io/datasets/">TUDataset | TUD Benchmark datasets (chrsmrrs.github.io)</a>, a collection of public benchmark datasets for graph classification and regression.</p><p><br></p>
