Global ETD Search

1	Using ICU Admission as a Predictor for Maternal Mortality: Identifying Essential Features for Accurate Classification Dairian Haulani Ly Balai (18415224) 20 April 2024
Maternal mortality (MM) is a pressing global health issue that results in thousands of mothers dying annually from pregnancy-related complications. Despite spending trillions of dollars on the healthcare industry, the U.S. continues to experience one of the highest rates of maternal death (MD) compared to other developed countries. This ongoing public health crisis highlights the urgent need for innovative strategies to detect and mitigate adverse maternal outcomes. This study introduces a novel approach, utilizing admission to the ICU as a proxy for MM. By analyzing 14 years of natality birth data, this study aims to explore the complex web of factors that elevate the chances of MD. The primary goal of this study is to identify features that are most influential in predicting ICU admission cases. These factors hold the potential to be applied to MM, as they can serve as early warning signs that complications may arise, allowing healthcare professionals to step in and intervene before adverse maternal outcomes occur. Two supervised machine learning models were employed in this study, specifically Logistic Regression (LR) and eXtreme Gradient Boosting (XGBoost). The models were executed twice for each dataset: once incorporating all available features and again utilizing only the most significant features. Following model training, XGBoost’s feature selection technique was employed to identify the top 10 influential features that are most important to the classification process. Our analysis revealed a diverse range of factors that are important for the prediction of ICU admission cases. In this study, we identified maternal transfusion, labor and delivery characteristics, delivery methods, gestational age, maternal attributes, and newborn conditions as the most influential factors to categorize maternal ICU admission cases. In terms of model performance, XGBoost consistently outperformed LR across various datasets, demonstrating higher accuracy, precision, and F1 scores. For recall, however, LR maintained higher scores, surpassing those of XGBoost. Moreover, the models consistently achieved higher scores when trained with all available features compared to those trained solely with the top features. Although the models demonstrated satisfactory performance in some evaluation metrics, there were notable deficiencies in recall and precision, which suggests further model refinement is needed to effectively predict these cases.
Data engineering and data science Machine Learning Feature Selection Classification Maternal Mortality
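The abstract above compares models on accuracy, precision, recall, and F1. As a minimal sketch of how these four metrics relate, the following computes them from true and predicted labels. The function name `classification_metrics` and the label lists are hypothetical illustrations, not data or code from the thesis.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = ICU admission, 0 = no ICU admission.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

In this toy case accuracy is 0.7 while recall is only 0.5, which illustrates how a model can look acceptable on one metric and still miss half of the positive cases.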
2	Genetic Programming Based Multicategory Pattern Classification Kishore, Krishna J 03 1900
Nature has created complex biological structures that exhibit intelligent behaviour through an evolutionary process. Thus, intelligence and evolution are intimately connected. This has inspired evolutionary computation (EC) that simulates the evolutionary process to develop powerful techniques such as genetic algorithms (GAs), genetic programming (GP), evolutionary strategies (ES) and evolutionary programming (EP) to solve real-world problems in learning, control, optimization and classification. GP discovers the relationship among data and expresses it as a LISP S-expression, i.e., a computer program. Thus the goal of program discovery as a solution for a problem is addressed by GP in the framework of evolutionary computation. In this thesis, we address for the first time the problem of applying GP to multicategory pattern classification. In supervised pattern classification, an input vector of m dimensions is mapped onto one of the n classes. It has a number of application areas, such as remote sensing and medical diagnosis. A supervised classifier is developed by using a training set that contains representative samples of various classes present in the application. Supervised classification has been done earlier with the maximum likelihood classifier (MLC), neural networks and fuzzy logic. The major considerations in applying GP to pattern classification are listed below: (i) GP-based techniques are data distribution-free, i.e., no a priori knowledge is needed about the statistical distribution of the data, and no assumption such as a normal distribution of the data needs to be made, as in MLC. (ii) GP can directly operate on the data in its original form. (iii) GP can detect the underlying but unknown relationship that exists among data and express it as a mathematical LISP S-expression. The generated LISP S-expressions can be directly used in the application environment.
(iv) GP can either discover the most important discriminating features of a class during evolution or it requires minor post-processing of the LISP S-expression to discover the discriminant features. In a neural network, the knowledge learned by the neural network about the data distributions is embedded in the interconnection weights, and it requires a considerable amount of post-processing of the weights to understand the decision of the neural network. In 2-category pattern classification, a single GP expression is evolved as a discriminant function. The output of the GP expression can be +1 for samples of one class and -1 for samples of the other class. When the GP paradigm is applied to an n-class problem, the following questions arise: Q1. As a typical GP expression returns a value (+1 or -1) for a 2-class problem, how does one apply GP for the n-class pattern classification problem? Q2. What should be the fitness function during evolution of the GP expressions? Q3. How does the choice of a function set affect the performance of GP-based classification? Q4. How should training sets be created for evaluating fitness during the evolution of GP classifier expressions? Q5. How does one improve learning of the underlying data distributions in a GP framework? Q6. How should conflict resolution be handled before assigning a class to the input feature vector? Q7. How does GP compare with other classifiers for an n-class pattern classification problem? The research described here seeks to answer these questions. We show that GP can be applied to an n-category pattern classification problem by considering it as n 2-class problems. The suitability of this approach is demonstrated by considering a real-world problem based on remotely sensed satellite images and Fisher's Iris data set. In a 2-class problem, simple thresholding is sufficient for a discriminant function to divide the feature space into two regions.
This means that one genetic programming classifier expression (GPCE) is sufficient to say whether or not the given input feature vector belongs to that class; i.e., the GP expression returns a value (+1 or -1). As the n-class problem is formulated as n 2-class problems, n GPCEs are evolved. Hence, n GPCE-specific training sets are needed to evolve these n GPCEs. For the sake of illustration, consider a 5-class pattern classification problem. Let n_j be the number of samples that belong to class j, and N_j be the number of samples that do not belong to class j (j = 1, ..., 5). Thus, N_1 = n_2 + n_3 + n_4 + n_5, N_2 = n_1 + n_3 + n_4 + n_5, N_3 = n_1 + n_2 + n_4 + n_5, N_4 = n_1 + n_2 + n_3 + n_5, and N_5 = n_1 + n_2 + n_3 + n_4. Thus, when the five-class problem is formulated as five 2-class problems, we need five GPCEs as discriminant functions to resolve between n_1 and N_1, n_2 and N_2, n_3 and N_3, n_4 and N_4, and lastly n_5 and N_5. Each of these five 2-class problems is handled as a separate 2-class problem with simple thresholding. Thus, GPCE#1 resolves between samples of class#1 and the remaining n - 1 classes. A training set is needed to evaluate the fitness of a GPCE during its evolution. If we directly create the training set, it leads to skewness (as n_1 < N_1). To overcome the skewness, an interleaved data format is proposed for the training set of a GPCE. For example, in the training set of GPCE#1, samples of class#1 are placed alternately between samples of the remaining n - 1 classes. Thus, the interleaved data format is an artifact to create a balanced training set. Conventionally, all the samples of a training set are fed to evaluate the fitness of every member of the population in each generation. We call this "global" learning, as GP tries to learn the entire training set at every stage of the evolution. We have introduced incremental learning to simplify the task of learning for the GP paradigm. A subset of the training set is fed and the size of the subset is gradually increased over time to cover the entire training data.
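The interleaved data format and the growing training subsets described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the exact placement of samples (one target-class sample before each other-class sample, with target samples repeated cyclically), and the equal-sized growth steps are all hypothetical choices.

```python
from itertools import cycle


def interleaved_training_set(samples_by_class, target):
    """Build a balanced training list for the GPCE of class `target`.

    Every sample of another class (desired output -1) is paired with a
    sample of the target class (desired output +1). Target samples are
    repeated cyclically, so the two desired outputs occur equally often.
    Assumes the target class has at least one sample.
    """
    positives = cycle(samples_by_class[target])
    training = []
    for label, samples in samples_by_class.items():
        if label == target:
            continue
        for x in samples:
            training.append((next(positives), +1))
            training.append((x, -1))
    return training


def incremental_subsets(training, steps):
    """Yield growing prefixes of the training set (assumed equal steps)."""
    n = len(training)
    for k in range(1, steps + 1):
        size = (n * k + steps - 1) // steps  # ceiling of n * k / steps
        yield training[:size]


# Hypothetical toy data: class label -> list of feature vectors (strings here).
data = {1: ["a1", "a2"], 2: ["b1", "b2", "b3"], 3: ["c1"]}
train = interleaved_training_set(data, target=1)
print(train)
print([len(s) for s in incremental_subsets(train, steps=4)])
```

Because the prefixes grow in the order the other classes were interleaved, classes placed early are seen in more generations than classes placed late.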
The basic motivation for incremental learning is to improve learning during evolution, as it is easier to learn a smaller task and then to progress from a smaller task to a bigger task. Experimental results are presented to show that the interleaved data format and incremental learning improve the performance of the GP classifier. We also show that the GPCEs evolved with an arithmetic function set are able to track variation in the input better than GPCEs evolved with function sets containing logical and nonlinear elements. Hence, we have used an arithmetic function set, incremental learning, and the interleaved data format to evolve GPCEs in our simulations. As each GPCE is trained to recognize samples belonging to its own class and reject samples belonging to other classes, a strength of association measure is associated with each GPCE to indicate the degree to which it can recognize samples belonging to its own class. The strength of association measures are used for assigning a class to an input feature vector. To reduce misclassification of samples, we also show how heuristic rules can be generated in the GP framework, unlike in either MLC or the neural network classifier. We have also studied the scalability and generalizing ability of the GP classifier by varying the number of classes. We also analyse the performance of the GP classifier by considering the well-known Iris data set. We compare the performance of classification rules generated from the GP classifier with those generated from the neural network classifier, the C4.5 method and the fuzzy classifier for the Iris data set. We show that the performance of GP is comparable to other classifiers for the Iris data set. We notice that the classification rules can be generated with very little post-processing and they are very similar to the rules generated from the neural network and C4.5 for the Iris data set.
Incremental learning influences the number of generations available for GP to learn the data distribution of classes whose d is -1 in the interleaved data format. This is because the samples belonging to the true class (desired output d is +1) are alternately placed between samples belonging to other classes, i.e., they are repeated to balance the training set in the interleaved data format. For example, in the evolution of the GPCE for class#1, the fitness function can be fed initially with samples of class#2 and subsequently with the samples of class#3, class#4 and class#5. So in the evaluation of the fitness function, the samples of class#5 will not be present when the samples of class#2 are present in the initial stages. However, in the later stages of evolution, when samples of class#5 are fed, the fitness function will utilize the samples of both class#2 and class#5. As learning in evolutionary computation is guided by the evaluation of the fitness function, GPCE#1 gets fewer generations to learn how to reject data of class#5 as compared to the data of class#2. This is because the termination criterion (i.e., the maximum number of generations) is defined a priori. It is clear that there are (n-1)! ways of ordering the samples of classes whose d is -1 in the interleaved data format. Hence a heuristic is presented to determine a possible order to feed data of different classes for the GPCEs evolved with incremental learning and the interleaved data format. The heuristic computes an overlap index for each class based on its spatial spread and distribution of data in the region of overlap with respect to other classes in each feature. The heuristic determines the order in which classes whose desired output d is -1 should be placed in each GPCE-specific training set for the interleaved data format.
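The ordering step described above can be sketched as follows, assuming the overlap indices have already been computed. The abstract does not give the formula for the overlap index, so the values below are hypothetical placeholders, and `feeding_order` is an illustrative name. The sketch only shows the ordering decision: classes with a higher overlap index are fed earlier, so they are present for more generations.

```python
from math import factorial


def feeding_order(overlap_index, target):
    """Order the non-target classes by decreasing overlap index.

    `overlap_index` maps class label -> overlap index with respect to the
    target class (assumed to be precomputed; the formula is not given here).
    """
    others = [c for c in overlap_index if c != target]
    return sorted(others, key=lambda c: overlap_index[c], reverse=True)


# Hypothetical overlap indices for a 5-class problem, seen from class 1.
overlap = {1: 0.10, 2: 0.40, 3: 0.05, 4: 0.25, 5: 0.30}

n = len(overlap)
print("possible orderings:", factorial(n - 1))  # (n-1)! candidate orders
print("chosen order:", feeding_order(overlap, target=1))
```

Sorting replaces a search over all (n-1)! orderings with a single pass, which is the practical point of using a heuristic here.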
This ordering ensures that GP gets more generations to learn about the data distribution of a class with a higher overlap index than a class with a lower overlap index. The ability of the GP classifier to learn the data distributions depends upon the number of classes and the spatial spread of data. As the number of classes increases, the GP classifier finds it difficult to resolve between classes. So there is a need to partition the feature space and identify subspaces with a reduced number of classes. The basic objective is to divide the feature space into subspaces and hence the data set that contains representative samples of n classes into subdata sets corresponding to the subspaces of the feature space, so that some of the subdata sets/spaces can have data belonging to only p classes (p < n). The GP classifier is then evolved independently for the subdata sets/spaces of the feature space. This results in localized learning, as the GP classifier has to learn the data distribution in only a subspace of the feature space rather than in the entire feature space. By integrating the GP classifier with feature space partitioning (FSP), we improve classification accuracy due to localized learning. Although serial computers have increased steadily in their performance, the quest for parallel implementation of a given task has continued to be of interest in any computationally intensive task, since parallel implementation leads to faster execution than a serial implementation. As fitness evaluation, selection strategy and population structures are used to evolve a solution in GP, there is scope for a parallel implementation of the GP classifier. We have studied distributed GP and massively parallel GP for our approach to GP-based multicategory pattern classification. We present experimental results for distributed GP with Message Passing Interface on IBM SP2 to highlight the speedup that can be achieved over the serial implementation of GP.
We also show how data parallelism can be used to further speed up fitness evaluation and hence the execution of the GP paradigm for multicategory pattern classification. We conclude that GP can be applied to n-category pattern classification and its potential lies in its simplicity and scope for parallel implementation. The GP classifier developed in this thesis can be looked upon as an addition to the earlier statistical, neural and fuzzy approaches to multicategory pattern classification.
Computer and Information Science Computer Programming Genetic Algorithms Data Engineering Evolutionary Computation Genetic Programming Pattern Classification Pattern Perception Iris data set
3	ADVANCES IN MACHINE LEARNING METHODOLOGIES FOR BUSINESS ANALYTICS, VIDEO SUPER-RESOLUTION, AND DOCUMENT CLASSIFICATION Tianqi Wang (18431280) 26 April 2024
This dissertation encompasses three studies in distinct yet impactful domains: B2B marketing, real-time video super-resolution (VSR), and smart office document routing systems. In the B2B marketing sphere, the study addresses the extended buying cycle by developing an algorithm for customer data aggregation and employing a CatBoost model to predict potential purchases with 91% accuracy. This approach enables the identification of high-potential customers for targeted marketing campaigns, crucial for optimizing marketing efforts.
Transitioning to multimedia enhancement, the dissertation presents a lightweight recurrent network for real-time VSR. Developed for applications requiring high-quality video with low latency, such as video conferencing and media playback, this model integrates an optical flow estimation network for motion compensation and leverages a hidden space for the propagation of long-term information. The model demonstrates high efficiency in VSR. A comparative analysis of motion estimation techniques underscores the importance of minimizing information loss.
The evolution towards smart office environments underscores the importance of an efficient document routing system, conceptualized as an online class-incremental image classification challenge. This research introduces a one-versus-rest parametric classifier, complemented by two updating algorithms based on passive-aggressiveness, and adaptive thresholding methods to manage low-confidence predictions. Tested on 710 labeled real document images, the method reports a cumulative accuracy rate of approximately 97%, showcasing the effectiveness of the chosen aggressiveness parameter through various experiments.
Signal processing Image processing Video processing Data engineering and data science Deep learning video super-resolution image classification Conversion prediction
4	A STUDY ON THE IMPACT OF PREPROCESSING STEPS ON MACHINE LEARNING MODEL FAIRNESS Sathvika Kotha (18370548) 17 April 2024
The success of machine learning techniques in widespread applications has taught us that with respect to accuracy, the more data, the better the model. However, for fairness, data quality is perhaps more important than quantity. Existing studies have considered the impact of data preprocessing on the accuracy of ML model tasks. However, the impact of preprocessing on the fairness of the downstream model has neither been studied nor well understood. Throughout this thesis, we conduct a systematic study of how data quality issues and data preprocessing steps impact model fairness. Our study evaluates several preprocessing techniques for several machine learning models trained over datasets with different characteristics and evaluated using several fairness metrics. It examines different data preparation techniques, such as changing categories into numbers, filling in missing information, and smoothing out unusual data points. The study measures fairness using standards that check if the model treats all groups equally, predicts outcomes fairly, and gives similar chances to everyone. By testing these methods on various types of data, the thesis identifies which combinations of techniques can make the models both accurate and fair. The empirical analysis demonstrated that preprocessing steps like one-hot encoding, imputation of missing values, and outlier treatment significantly influence fairness metrics. Specifically, models preprocessed with median imputation and robust scaling exhibited the most balanced performance across fairness and accuracy metrics, suggesting a potential best practice guideline for equitable ML model preparation. Thus, this work sheds light on the importance of data preparation in ML and emphasizes the need for careful handling of data to support fair and ethical use of ML in society.
Data engineering and data science Data quality data preprocessing workflow ML pipeline ML Fairness
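The abstract above singles out median imputation and robust scaling. A minimal sketch of both steps on a single numeric column follows, using only the Python standard library. The helper names and the sample column are hypothetical, and robust scaling is taken here in one common form (center on the median, divide by the interquartile range), which is an assumption about the variant the thesis used.

```python
from statistics import median, quantiles


def median_impute(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def robust_scale(column):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(column)
    if iqr == 0:
        # Degenerate column: fall back to centering only.
        return [v - center for v in column]
    return [(v - center) / iqr for v in column]


# Hypothetical feature column with one missing value and one outlier.
raw = [1.0, 2.0, None, 4.0, 100.0]
imputed = median_impute(raw)
print(imputed)
print(robust_scale(imputed))
```

Because both steps rely on the median and quartiles rather than the mean and standard deviation, the outlier of 100.0 does not distort the scale of the remaining values.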
5	Clustering Customers from Home Appliance Data Porcu, Simone January 2024
In the realm of customer-centric strategies, the study focuses on the critical aspect of customer segmentation in the context of innovative home appliances of Electrolux, the company where this master thesis was performed. This thesis leverages Machine Learning models to analyze washing machine data from the Europe, Middle East, and Africa (EMEA) region, aiming to cluster customers and unveil patterns in appliance usage. The importance of tailored marketing strategies is underscored, prompting an investigation into existing solutions for customer segmentation in this specific engineering domain. The study addresses challenges such as developing a robust methodology for clustering and ensuring accurate information extraction. Results demonstrate the efficacy of Machine Learning in customer segmentation, enabling the company to enhance its understanding of customers, implement targeted campaigns, and offer personalized experiences. The successful resolution of this problem opens avenues for broader conclusions, such as gaining insights from worldwide data sets, transcending the previous limitation to the EMEA region. Furthermore, incorporating various timestamps, including periods before, during, and after the COVID-19 pandemic, enables a more comprehensive understanding of the issue. This approach enhances the applicability and robustness of our findings, offering a nuanced and holistic perspective on the challenges faced in different global contexts and over varying temporal dimensions.
Machine Learning Data Engineering Data processing customers clustering Computer Sciences Computer Engineering
6	Analytic Extensions to the Data Model for Management Analytics and Decision Support in the Big Data Environment Akpakpan, Nsikak Etim 01 January 2018
From 2006 to 2016, an estimated average of 50% of big data analytics and decision support projects failed to deliver acceptable and actionable outputs to business users. The resulting management inefficiency came with high cost, and wasted investments estimated at $2.7 trillion in 2016 for companies in the United States. The purpose of this quantitative descriptive study was to examine the data model of a typical data analytics project in a big data environment for opportunities to improve the information created for management problem-solving. The research questions focused on finding artifacts within enterprise data to model key business scenarios for management action. The foundations of the study were information and decision sciences theories, especially information entropy and high-dimensional utility theories. Design-based research in a nonexperimental format was used to examine the data model for the functional forms that mapped the available data to the conceptual formulation of the management problem by combining ontology learning, data engineering, and analytic formulation methodologies. Semantic, symbolic, and dimensional extensions emerged as key functional forms of analytic extension of the data model. The data-modeling approach was applied to a 15-terabyte secondary data set from a multinational medical product distribution company with a profit growth problem. The extended data model simplified the composition of acceptable analytic insights, the derivation of business solutions, and the design of programs to address the ill-defined management problem. The implication for positive social change was the potential for overall improvement in management efficiency and increasing participation in advocacy and sponsorship of social initiatives.
Analytic extension Analytic formulation Big data Data engineering Data model Ontology learning and discovery Databases and Information Systems Library and Information Science
7	Data Engineering and Failure Prediction for Hard Drive S.M.A.R.T. Data Ramanayaka Mudiyanselage, Asanga 08 September 2020
No description available.
Computer Science Machine Learning Data Engineering Python Data Analysis Big Data Predictive Analytics Feature Selection Resampling Techniques Hard Drive Failure Prediction SMART Attributes Scikit-Learn PySpark
8	Extending Synthetic Data and Data Masking Procedures using Information Theory Tyler J Lewis (15361780) 26 April 2023
The two primary methodologies discussed in this thesis are the nonparametric entropy-based synthetic timeseries (NEST) and directed infusion of data (DIOD) algorithms.
The former presents a novel synthetic data algorithm that is shown to outperform similar state-of-the-art methods, including generative networks, in terms of utility and data consistency. The majority of the data used are open-source and are cited where appropriate.
DIOD presents a novel data masking paradigm that preserves the utility, privacy, and efficiency required by the current industrial paradigm, and presents a cheaper alternative to many state-of-the-art methods. Data used include simulation data (source code cited), equations-based data, and open-source images (cited as needed).
Data engineering and data science Machine Learning Neural Network Data Science Information Theory Synthetic Data Data Masking Information Security
9	Assessing Viability of Open-Source Battery Cycling Data for Use in Data-Driven Battery Degradation Models Ritesh Gautam (17582694) 08 December 2023
Lithium-ion batteries are being used increasingly often to provide power for systems that range all the way from common cell phones and laptops to advanced electric automotive and aircraft vehicles. However, as is the case for all battery types, lithium-ion batteries are prone to naturally occurring degradation phenomena that limit their effective use in these systems to a finite amount of time. This degradation is caused by a plethora of variables and conditions, including environmental conditions, physical stress/strain on the body of the battery cell, and charge/discharge parameters and cycling. Accurately and reliably being able to predict this degradation behavior in battery systems is crucial for any party looking to implement and use battery powered systems. However, due to the complicated non-linear multivariable processes that affect battery degradation, this can be difficult to achieve. Compared to traditional methods of battery degradation prediction and modeling like equivalent circuit models and physics-based electrochemical models, data-driven machine learning tools have been shown to be able to handle predicting and classifying the complex nature of battery degradation without requiring any prior knowledge of the physical systems they are describing.
One of the most critical steps in developing these data-driven neural network algorithms is data procurement and preprocessing. Without large amounts of high-quality data, no matter how advanced and accurate the architecture is designed to be, the neural network prediction tool will not be as effective as one trained on large quantities of high-quality data. This work aims to gather battery degradation data from a wide variety of sources and studies, examine how the data was produced, test the effectiveness of the data in the Interfacial Multiphysics Laboratory’s autoencoder based neural network tool CD-Net, and analyze the results to determine factors that make battery degradation datasets perform better for use in machine learning/deep learning tools. This work also aims to relate this work to other data-driven models by comparing the CD-Net model’s performance with the publicly available BEEP’s (Battery Evaluation and Early Prediction) ElasticNet model. The reported accuracy and prediction models from the CD-Net and ElasticNet tools demonstrate that larger datasets with actively selected training/testing designations and fewer errors in the data produce much higher quality neural networks that are much more reliable in estimating the state-of-health of lithium-ion battery systems. The results also demonstrate that data-driven models are much less effective when trained using data from multiple different cell chemistries, form factors, and cycling conditions compared to more congruent datasets when attempting to create a generalized prediction model applicable to multiple forms of battery cells and applications.
Aerospace materials Data engineering and data science Neural networks Lithium-ion Batteries Machine Learning Models Battery degradation data preprocessing efforts
10	INVESTIGATING OFFENDER TYPOLOGIES AND VICTIM VULNERABILITIES IN ONLINE CHILD GROOMING Siva sahitya Simhadri (17522730) 02 December 2023
One of the issues on social media that is expanding the fastest is children being exposed to predators online [1]. Due to the ease with which a larger segment of the younger population may now access the Internet, online grooming activity on social media has grown to be a significant social concern. Child grooming, in which adults and minors exchange sexually explicit text and media via social media platforms, is a typical component of online child exploitation. An estimated 500,000 predators operate online every day. According to estimates, Internet chat rooms and instant messaging are where 89% of sexual approaches against children take place. The child may face a variety of unpleasant consequences following a grooming event, including shame, anger, anxiety, tension, despair, and substance abuse, which make it more difficult for them to report the exploitation. A substantial amount of research in this domain has focused on identifying certain vulnerabilities of the victims of grooming. These vulnerabilities include specific age groups, gender, psychological factors, no family support, and lack of good social relations, which make young people more vulnerable to grooming. So far no technical work has been done to apply statistical analysis on these vulnerability profiles and observe how these patterns change between different victim types and offender types. This work presents a detailed analysis of the effect of offender type (contact and fantasy) and victim type (Law Enforcement Officers (LEOs), Real Victims and Decoys (Perverted Justice)) on the representation of different vulnerabilities in grooming conversations. Comparison of different victim groups would provide insights into creating the right training material for LEOs and decoys and help in the training process for online sting operations. Moreover, comparison of different offender types would help create targeted prevention strategies to tackle online child grooming and help the victims.
Data engineering and data science Statistics not elsewhere classified Online grooming Child sexual abuse -- Investigation Chat rooms ANOVA statistics analysis Post-hoc analysis vulnerabilities
