291

Data-based Explanations of Random Forest using Machine Unlearning

Tanmay Laxman Surve (17537112) 03 December 2023 (has links)
<p dir="ltr">Tree-based machine learning models, such as decision trees and random forests, are one of the most widely used machine learning models primarily because of their predictive power in supervised learning tasks and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory behavior. Given their overwhelming success for most tasks, it is of interest to identify root causes of the unexpected and discriminatory behavior of tree-based models. However, there has not been much work on understanding and debugging tree-based classifiers in the context of fairness. We introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to determine training data subsets responsible for model unfairness. Given a tree-based model learned on a training dataset, FairDebugger identifies the top-k training data subsets responsible for model unfairness, or bias, by measuring the change in model parameters when parts of the underlying training data are removed. We describe the architecture of FairDebugger and walk through real-world use cases to demonstrate how FairDebugger detects these patterns and their explanations.</p>
292

Understanding the Knowledge, Skills, and Abilities (KSAs) of Data Professionals in United States Academic Libraries

Khan, Hammad Rauf 12 1900 (has links)
This study applies the knowledge, skills, and abilities (KSA) framework for eScience professionals to data service positions in academic libraries. Understanding the KSAs needed to provide data services is of crucial concern, and this study examines the KSAs of data professionals working in United States academic libraries. An exploratory sequential mixed-methods design was adopted to discover the KSAs. The study was divided into two phases: Phase 1, a qualitative content analysis of 260 job advertisements for data professionals, and Phase 2, a self-administered online survey distributed to data professionals working in academic library research data services (RDS). The KSAs discovered in the content analysis of the 260 job ads and the survey results from 167 data professionals were analyzed separately, and then a Spearman rank-order correlation was conducted to triangulate the data and compare the results. The results provide evidence on which KSAs hiring managers seek in job advertisements and which KSAs data professionals find important for working in RDS. The Spearman rank-order correlation found strong agreement between the job advertisement KSAs and data professionals' perceptions of the KSAs.
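As a rough illustration of the triangulation step, the following sketch computes a Spearman rank-order correlation between KSA frequencies from job advertisements and mean importance ratings from a survey; the KSA counts and ratings shown are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: Spearman rank-order correlation between two KSA rankings.
from scipy.stats import spearmanr

job_ad_freq = [120, 95, 80, 60, 45, 30]        # mentions per KSA in job ads
survey_rating = [4.6, 4.4, 4.1, 3.9, 3.2, 3.0]  # mean importance from survey

rho, p = spearmanr(job_ad_freq, survey_rating)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")  # high rho = strong agreement
```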
293

Human Mobility and Infectious Disease Dynamics / How modern mobility data enhances epidemic control

Schlosser, Frank 02 August 2023 (has links)
The Covid-19 pandemic demonstrated how strongly the spread of infectious diseases is driven by the dynamics of human mobility. At the same time, the ongoing explosion of available mobility data in the 21st century opens up a much finer view of human mobility. In this thesis, we investigate several ways in which modern mobility data sources, together with modeling, enable a deeper understanding of the interplay of human mobility and infectious disease spread. We use large-scale mobility data captured from mobile phones to show that country-wide mobility patterns underwent complex structural changes during the Covid-19 pandemic in Germany. We find a spatially heterogeneous reduction in mobility during lockdown phases. Most prominently, we observe that a distinct reduction in long-distance travel during the pandemic leads to a more local, clustered network and a moderation of the "small-world" effect. We demonstrate that these structural changes have a considerable effect on epidemic spreading processes by "flattening" the epidemic curve and delaying the spread to geographically distant regions. Further, we show that high-resolution mobility data can be used for early outbreak detection: we develop a novel method to determine outbreak origins from geolocated movement data of individuals affected by the outbreak. We also present several practical applications that have been developed based on the above research. Finally, we examine how representative mobility datasets are of the actual travel behavior of a population. We identify several types of biases, demonstrate their traces in empirical datasets, and develop a mathematical framework to mitigate them; we use this framework to show that biases can severely impact the outcomes of dynamic processes such as epidemic simulations, where biased data misestimates the severity and speed of disease transmission. We hope that the studies in this thesis will prove to be helpful building blocks in assembling an emerging, unified understanding of human mobility and infectious disease dynamics.
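One structural effect described above, the weakening of the "small-world" property when long-distance travel disappears, can be illustrated on a toy network. The sketch below uses a Newman-Watts small-world graph as a stand-in for a real mobility network and compares average shortest path lengths before and after deleting long-range links; the graph, distance threshold, and sizes are illustrative assumptions, not the thesis's data.

```python
# Minimal sketch: removing long-range links lengthens network paths,
# mimicking the loss of the small-world effect under travel restrictions.
import networkx as nx

# Ring of locations plus random long-range shortcuts ("pre-pandemic" travel).
G = nx.newman_watts_strogatz_graph(n=200, k=6, p=0.1, seed=1)

def ring_distance(u, v, n=200):
    """Distance between two locations along the ring (a crude trip length)."""
    return min(abs(u - v), n - abs(u - v))

# "Lockdown" network: drop edges spanning a large ring distance (long trips).
H = G.copy()
H.remove_edges_from([(u, v) for u, v in G.edges if ring_distance(u, v) > 20])

print("avg path length, with long-distance travel   :",
      round(nx.average_shortest_path_length(G), 2))
print("avg path length, long-distance travel removed:",
      round(nx.average_shortest_path_length(H), 2))
```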
294

Performance comparison of data mining algorithms for imbalanced and high-dimensional data

Rubio Adeva, Daniel January 2023 (has links)
Artificial intelligence techniques, such as artificial neural networks, random forests, and support vector machines, have been used to address a variety of problems in numerous industries. In many cases, however, models have to deal with issues such as imbalanced data or high dimensionality. This thesis implements and compares the performance of support vector machines, random forests, and neural networks for new bank account fraud detection, a use case characterized by imbalanced, high-dimensional data. The neural network achieved both the best AUC-ROC (0.889) and the best average precision (0.192). However, the differences between the models' performances were not statistically significant, so the initial hypothesis of equal model performance could not be rejected.
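A minimal sketch of the evaluation pipeline described above: the three model families are fit on a synthetic imbalanced binary task and compared with AUC-ROC and average precision, the two metrics reported in the thesis. The synthetic data and hyperparameters are placeholders for the actual fraud dataset and tuned models.

```python
# Minimal sketch: compare SVM, random forest, and neural network on an
# imbalanced binary classification task using AUC-ROC and average precision.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.99],
                           random_state=0)  # ~1% positives: heavily imbalanced
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "NeuralNet": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500)),
}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name:>12}: AUC-ROC={roc_auc_score(y_te, proba):.3f}  "
          f"AP={average_precision_score(y_te, proba):.3f}")
```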
295

Legislative Language for Success

Gundala, Sanjana 01 June 2022 (has links) (PDF)
Legislative committee meetings are an integral part of the lawmaking process for local and state bills, and the testimony presented during these meetings is a large factor in the outcome of a proposed bill. This research uses Natural Language Processing and Machine Learning techniques to analyze testimonies from California legislative committee meetings from 2015-2016 in order to identify what aspects of a testimony make it successful. A testimony is considered successful if its alignment matches the bill outcome (the alignment is "For" and the bill passes, or the alignment is "Against" and the bill fails). Finding what makes a testimony successful was accomplished through data filtration, feature extraction, implementation of classification models, and feature analysis. Several features were extracted and tested to find those with the greatest impact on the bill outcome; the features chosen provided information on the sentence complexity and the types of words used (adjectives, verbs, nouns) in each testimony. Additionally, all testimonies were analyzed to find common phrases used within successful testimonies. Two types of classification models were implemented: ones that used the manually extracted features as input and ones that performed their own feature extraction. The results from the classification models and feature analysis show that certain aspects of a testimony, such as sentence complexity and the use of specific phrases, significantly impact the bill outcome. The most successful models, a Support Vector Machine and Multinomial Naive Bayes, achieved accuracies of 91.79% and 91.22%, respectively.
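The classification setup can be sketched as follows: testimony text is mapped to a successful/unsuccessful label using the two best-performing model types named above. The example testimonies, labels, and TF-IDF features are illustrative; the actual study also fed hand-engineered features such as sentence complexity and part-of-speech counts into the models.

```python
# Minimal sketch: classify testimonies as successful (1) or not (0)
# with the two best-performing model types from the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

testimonies = [
    "I urge the committee to support this bill for our communities.",
    "This measure places an unfair burden on small business owners.",
]
successful = [1, 0]  # 1: testimony alignment matched the bill outcome

for clf in (LinearSVC(), MultinomialNB()):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(testimonies, successful)
    print(type(clf).__name__,
          model.predict(["Please vote yes on this important bill"]))
```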
296

Object Dependent Properties of Multicomponent Acrylic Systems

Kidd, Ian V. 29 August 2014 (has links)
No description available.
297

Degradation Pathway Models of Poly(ethylene-terephthalate) Under Accelerated Weathering Exposures

Gok, Abdulkerim 27 January 2016 (has links)
No description available.
298

LB-CNN & HD-OC, DEEP LEARNING ADAPTABLE BINARIZATION TOOLS FOR LARGE SCALE IMAGE CLASSIFICATION

Timothy G Reese (13163115) 28 July 2022 (has links)
The computer vision task of classifying natural images is a primary driving force behind modern AI algorithms. Deep Convolutional Neural Networks (CNNs) demonstrate state-of-the-art performance in large-scale multi-class image classification tasks. However, due to their many layers and millions of parameters, these models are considered black-box algorithms, and their decisions are further obscured by a cumbersome multi-class decision process. There exists another approach in the literature, called class binarization, which determines the multi-class prediction outcome through a sequence of binary decisions. The focus of this dissertation is the integration of the class-binarization approach to multi-class classification with deep learning models, such as CNNs, for addressing large-scale image classification problems. Three works are presented to address this integration.

In the first work, Error Correcting Output Codes (ECOCs) are integrated into CNNs by inserting a latent-binarization layer prior to the CNN's final classification layer. This approach encapsulates both the encoding and decoding steps of ECOC into a single CNN architecture. EM and Gibbs sampling algorithms are combined with back-propagation to train CNN models with Latent Binarization (LB-CNN). The training process of LB-CNN guides the model to discover hidden relationships similar to the semantic relationships known a priori between the categories. The proposed models and algorithms are applied to several image recognition tasks, producing excellent results.

In the second work, Hierarchically Decodeable Output Codes (HD-OCs) are proposed to compactly describe a hierarchical probabilistic binary decision process model over the features of a CNN. HD-OCs enforce more homogeneous assignments of the categories to the dichotomy labels. A novel concept called average decision depth is presented to quantify the average number of binary questions needed to classify an input. An HD-OC is trained using a hierarchical log-likelihood loss that is empirically shown to orient the output of the latent feature space to resemble the hierarchical structure described by the HD-OC. Experiments conducted at several different scales of category labels demonstrate strong performance and powerful insights into the decision process of the model.

In the final work, the literature of enumerative combinatorics and partially ordered sets is used to establish a unifying framework of class-binarization methods under the Multivariate Bernoulli family of models. The unifying framework theoretically establishes simple relationships for transitioning between the different binarization approaches. Such relationships provide useful investigative tools for the discovery of statistical dependencies between large groups of categories, and they are additionally useful for incorporating taxonomic information as well as enforcing structural model constraints. The unifying framework lays the groundwork for future theoretical and methodological work in addressing the fundamental issues of large-scale multi-class classification.
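The class-binarization idea underlying these works can be sketched with an off-the-shelf error-correcting output code classifier: each class receives a binary codeword, one binary learner is trained per code bit, and the multi-class label is decoded from the joint binary decisions. scikit-learn's OutputCodeClassifier stands in here for the dissertation's LB-CNN, which instead learns the binarization layer jointly with the network; the base learner and code size below are illustrative choices.

```python
# Minimal sketch: multi-class prediction via a sequence of binary decisions
# (error-correcting output codes), the core idea behind class binarization.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

X, y = load_digits(return_X_y=True)  # small 10-class image task
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# code_size=1.5 -> 15 binary learners encode the 10 classes as codewords;
# prediction decodes the nearest codeword from the 15 binary outputs.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=1.5, random_state=0)
print("ECOC accuracy:", ecoc.fit(X_tr, y_tr).score(X_te, y_te))
```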
299

Expeditious Causal Inference for Big Observational Data

Yumin Zhang (13163253) 28 July 2022 (has links)
This dissertation addresses two significant challenges in the causal inference workflow for Big Observational Data. The first is designing Big Observational Data with high-dimensional and heterogeneous covariates. The second is performing uncertainty quantification for estimates of causal estimands obtained from the application of black-box machine learning algorithms to the designed Big Observational Data. The methodologies developed by addressing these challenges are applied to the design and analysis of Big Observational Data from a large public university in the United States.

Distributed Design

A fundamental issue in causal inference for Big Observational Data is confounding due to covariate imbalances between treatment groups. This can be addressed by designing the study prior to analysis. The design ensures that subjects in the different treatment groups that have comparable covariates are subclassified or matched together, and analyzing such a designed study helps to reduce biases arising from the confounding of covariates with treatment. Existing design methods, developed for traditional observational studies with a single designer, can yield unsatisfactory designs with sub-optimal covariate balance for Big Observational Data due to their inability to accommodate the massive dimensionality, heterogeneity, and volume of Big Data. We propose a new framework for the distributed design of Big Observational Data amongst collaborative designers. Our framework first assigns subsets of the high-dimensional and heterogeneous covariates to multiple designers. The designers then summarize their covariates into lower-dimensional quantities, share their summaries with the others, and design the study in parallel based on their assigned covariates and the summaries they receive. The final design is selected by comparing balance measures for all covariates across the candidates and identifying the best amongst them. We perform simulation studies and analyze datasets from the 2016 Atlantic Causal Inference Conference Data Challenge to demonstrate the flexibility and power of our framework for constructing designs with good covariate balance from Big Observational Data.

Designed Bootstrap

The combination of modern machine learning algorithms with the nonparametric bootstrap can enable effective predictions and inferences on Big Observational Data. An increasingly prominent and critical objective in such analyses is to draw causal inferences from the Big Observational Data. A fundamental step in addressing this objective is to design the observational study prior to the application of machine learning algorithms. However, applying the traditional nonparametric bootstrap to Big Observational Data requires excessive computational effort, because every bootstrap sample would need to be re-designed under the traditional approach, which can be prohibitive in practice. We propose a design-based bootstrap for deriving causal inferences with reduced bias from the application of machine learning algorithms on Big Observational Data. Our bootstrap procedure operates by resampling from the original designed observational study, eliminating the additional, costly design steps on each bootstrap sample that are performed under the standard nonparametric bootstrap. We demonstrate the computational efficiency of this procedure compared to the traditional nonparametric bootstrap, and its equivalence in terms of confidence interval coverage rates for the average treatment effects, by means of simulation studies and a real-life case study.

Case Study

We apply the distributed design and designed bootstrap methodologies in a case study involving institutional data from a large public university in the United States. The institutional data contains comprehensive information about the undergraduate students in the university, ranging from their academic records to on-campus activities. We study the causal effects of undergraduate students' attempted course load on their academic performance based on a selection of covariates from these data. Ultimately, our real-life case study demonstrates how our methodologies enable researchers to use straightforward design procedures to obtain valid causal inferences, with reduced computational effort, from the application of machine learning algorithms on Big Observational Data.
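The designed-bootstrap idea can be sketched in a few lines: resample matched pairs from the already-designed study, rather than re-designing each bootstrap sample, and form a percentile confidence interval for the average treatment effect. The matched-pair outcomes below are synthetic placeholders, not the institutional data.

```python
# Minimal sketch: bootstrap over a *designed* study by resampling matched
# pairs, avoiding a costly re-design on every bootstrap replicate.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 500
# y_treat[i], y_ctrl[i]: outcomes of the i-th matched treated/control pair
y_treat = rng.normal(1.0, 2.0, n_pairs)
y_ctrl = rng.normal(0.0, 2.0, n_pairs)
pair_effects = y_treat - y_ctrl  # within-pair treatment effects

boot_ates = []
for _ in range(2000):
    idx = rng.integers(0, n_pairs, n_pairs)  # resample pairs, not raw units
    boot_ates.append(pair_effects[idx].mean())

lo, hi = np.percentile(boot_ates, [2.5, 97.5])
print(f"ATE = {pair_effects.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```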
300

Reducing software complexity by hidden structure analysis : Methods to improve modularity and decrease ambiguity of a software system

Bjuhr, Oscar, Segeljakt, Klas January 2016 (has links)
Software systems can be represented as directed graphs in which components are nodes and dependencies between components are edges. Improvements in system complexity and reductions in interference between development teams can be achieved by applying hidden structure analysis. However, since systems can contain thousands of dependencies, a concrete method is needed for selecting which dependencies are most beneficial to remove. In this thesis, two solutions to this problem are introduced: dominator analysis and cluster analysis. Dominator analysis examines the cost/gain ratio of detaching individual components from a cyclic group. Cluster analysis finds the most beneficial subgroups to split within a cyclic group. The aim of both methods is to reduce the size of cyclic groups, which are sets of co-dependent components; as a result, the system architecture becomes less prone to errors that propagate when components are modified. Both techniques derive from graph theory and data science but have not previously been applied to the area of hidden structures. A subsystem at Ericsson is used as a testing environment, and specific dependencies in its structure that might impede the development process have been discovered. The outcome of the thesis is four to-be scenarios of the system, displaying the effect of removing these dependencies. The to-be scenarios show that the architecture can be significantly improved by removing a few direct dependencies.
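As a sketch of the first step in this kind of analysis, the snippet below represents a system as a directed dependency graph, extracts its cyclic groups (strongly connected components with more than one member), and checks how removing a single dependency shrinks the largest group, which is the "gain" side of the dominator analysis. The component names and dependencies are hypothetical.

```python
# Minimal sketch: find cyclic groups in a dependency graph and measure the
# effect of removing one dependency on the largest cyclic group.
import networkx as nx

deps = [("ui", "core"), ("core", "db"), ("db", "core"),
        ("core", "log"), ("log", "ui"), ("ui", "log")]
G = nx.DiGraph(deps)

cyclic_groups = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
print("cyclic groups:", cyclic_groups)  # one group of 4 co-dependent parts

# Gain of removing one dependency: does the cyclic group shrink?
H = G.copy()
H.remove_edge("db", "core")
print("after removing db->core:",
      [c for c in nx.strongly_connected_components(H) if len(c) > 1])
```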
