  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Active provenance for data intensive research

Spinuso, Alessandro January 2018 (has links)
The role of provenance information in data-intensive research is a significant topic of discussion among technical experts and scientists. Typical use cases addressing traceability, versioning and reproducibility of research findings are extended with more interactive scenarios in support, for instance, of computational steering and results management. In this thesis, we investigate the impact that lineage records can have on the early phases of the analysis, performed for instance through near-real-time systems and Virtual Research Environments (VREs) tailored to the requirements of a specific community. By positioning provenance at the centre of the computational research cycle, we highlight the importance of mechanisms on the data scientists' side that, by integrating with the abstractions offered by the processing technologies, such as scientific workflows and data-intensive tools, facilitate the experts' contribution to the lineage at runtime. Ultimately, by encouraging the tuning and use of provenance for rapid feedback, the thesis aims to improve the synergy between different user groups and to increase their productivity and understanding of their processes. We present a provenance model, called S-PROV, that uses and further extends PROV and ProvONE. The relationships and properties characterising the workflow's abstractions and their concrete executions are re-elaborated to include aspects related to delegation, distribution and steering of stateful streaming operators. The model is supported by the Active framework for tuneable and actionable lineage, which ensures user engagement by fostering rapid exploitation. Here, concepts such as provenance types, configuration and explicit state management allow users to capture complex provenance scenarios and to activate selective controls based on domain and user-defined metadata.
We outline how the traces are recorded in a new comprehensive system, called S-ProvFlow, which enables different classes of consumers to explore the provenance data with services and tools for monitoring, in-depth validation and comprehensive visual analytics. The work of this thesis is discussed in the context of an existing computational framework and the experience gained in implementing provenance-aware tools for seismology and climate VREs. It will continue to evolve through newly funded projects, thereby providing generic and user-centred solutions for data-intensive research.
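The idea of capturing lineage at runtime, on the data scientist's side and integrated with the processing abstraction, can be illustrated with a minimal sketch. This is not S-PROV's or S-ProvFlow's actual API; the names `traced`, `LINEAGE` and `normalise` are invented for illustration:

```python
import functools
import time
import uuid

# Invented global ledger standing in for a real provenance store.
LINEAGE = []

def traced(func):
    """Record a lineage entry for each invocation of a processing step.

    Each record links the inputs used, the parameters, the activity name
    and the generated output; real systems attach far richer metadata
    (provenance types, operator state, selective controls).
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "id": str(uuid.uuid4()),
            "activity": func.__name__,
            "used": [repr(a) for a in args],
            "parameters": dict(kwargs),
            "startedAt": time.time(),
        }
        result = func(*args, **kwargs)
        record["endedAt"] = time.time()
        record["generated"] = repr(result)
        LINEAGE.append(record)
        return result
    return wrapper

@traced
def normalise(xs, scale=1.0):
    """A toy streaming operator whose invocations are traced."""
    m = max(xs)
    return [scale * x / m for x in xs]
```

Wrapping operators this way means the expert contributes to the lineage simply by running the workflow, which is the kind of low-friction engagement the thesis argues for.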
22

Investigating the relationship between mobile network performance metrics and customer satisfaction

Labuschagne, Louwrens 16 March 2020 (has links)
Fixed and mobile communication service providers (CSPs) face fierce competition with one another. In a globally saturated market, the primary differentiator between CSPs has become customer satisfaction, typically measured by the Net Promoter Score (NPS) for a subscriber. The NPS is the answer to the question: "How likely is it that you will recommend this product/company to a friend or colleague?" Responses range from 0 (not at all likely) to 10 (extremely likely). In this thesis, we aim to identify which, if any, network performance metrics contribute to subscriber satisfaction. In particular, we investigate the relationship between NPS survey results and 11 network performance metrics of the respondents of a major mobile operator in South Africa. We identify the most influential performance metrics by fitting both linear and non-linear statistical models to the February 2018 survey dataset and test the models on the June 2018 dataset. We find that metrics such as Call Drop Rate, Call Setup Failure Rate, Call Duration and Server Setup Latency are consistently selected as significant features in models of NPS prediction. Nevertheless, we find that all the tested statistical and machine learning models, whether linear or non-linear, are poor predictors of NPS scores in a given month when only that month's network performance metrics are provided. This suggests that NPS is either driven primarily by other factors (such as customer service interactions at branches and contact centres) or determined by historical network performance over multiple months.
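For reference, the NPS aggregate behind such surveys follows a standard definition (promoters score 9-10, detractors 0-6, and the score is the percentage of promoters minus the percentage of detractors); this is the conventional formula, not something specific to this thesis's dataset:

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses.

    Promoters score 9-10, detractors 0-6 (7-8 are passives); the NPS is
    the percentage of promoters minus the percentage of detractors,
    giving a value between -100 and 100.
    """
    if not scores:
        raise ValueError("no responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)
```

For example, responses of 10, 9, 8 and 0 contain two promoters, one passive and one detractor, giving an NPS of 25.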
23

The Scope and Value of Healthcare Data Science Applications

Huerta, Jose Oscar 05 1900 (has links)
Health disparities are a recognized public health concern, and the need to address them remains worthy of new methods that assist in closing the gap. This research examined the effectiveness of data science in highlighting health disparities and conveyed the value of data science applications in related health care settings. The goal of this research was accomplished through a multi-phased, multi-method approach, presented in three individual essays. In essay one, a systematic literature review assessed the state of data science applications used to explore health disparities in the current academic literature and determined their applicability. The systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Essay two assessed the capacity of data science software to examine health disparities data effectively, using KDnuggets data pertaining to analytics, data science, and machine-learning software. The research in this essay demonstrated the potential utility of leading software to perform the kinds of data science operations that can improve care in healthcare networks by addressing health disparities. Essay three provided a case study showcasing the value data science brings to the healthcare space. This study used a geographic information system to create and analyze choropleth maps of the distribution of prostate cancer in Texas. SPSS software was used to assess the social determinants of health that may explain prostate cancer mortality.
24

THEORY AND APPLICATIONS OF DATA SCIENCE

Sheng Zhang (13900074) 07 October 2022 (has links)
This work is a collection of original research contributing to several active topics in contemporary data science, covering both theory and applications: discovering physical laws from data, data-driven epidemiological models, Gaussian random field surrogate models, and image texture classification. In Chapter 2, we introduce a novel method for discovering physical laws from data with uncertainty quantification. In Chapter 3, this method is enhanced to handle high noise and outliers. In Chapter 4, the method is applied to discover the law of turbine component damage in industry. In Chapter 5, we propose a new framework for building trustworthy data-driven epidemiological models and apply it to the COVID-19 outbreak in New York City. In Chapter 6, we construct the augmented Gaussian random field, a universal framework incorporating observable data and derivatives of any order; both the theoretical and the computational framework are established. In Chapter 7, we introduce the use of the 2-dimensional signature, an object inspired by rough path theory, as a feature for image texture classification.
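Discovering physical laws from data is commonly done by sparse regression over a library of candidate terms. The following much-simplified sketch (sequential thresholded least squares in the spirit of SINDy-style methods, without the uncertainty quantification the thesis adds) recovers the governing coefficients of a synthetic dynamical system:

```python
import numpy as np

def discover_law(x, dxdt, threshold=0.1, iters=10):
    """Sparse regression: fit dx/dt against a library of candidate terms
    [1, x, x^2, x^3], then repeatedly zero out coefficients below the
    threshold and refit on the surviving terms."""
    library = np.column_stack([np.ones_like(x), x, x**2, x**3])
    xi, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big], *_ = np.linalg.lstsq(library[:, big], dxdt, rcond=None)
    return xi  # coefficients of [1, x, x^2, x^3]

# Synthetic measurements of dx/dt = -2x + 0.5x^3 with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
dxdt = -2.0 * x + 0.5 * x**3 + rng.normal(0, 0.01, 200)
xi = discover_law(x, dxdt)
```

The thresholding step is what yields an interpretable law rather than a dense regression fit: only the x and x³ terms survive, matching the generating equation.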
25

DEA in Healthcare Management in Conjunction with Data Science

Vincent, Charles 27 April 2020 (has links)
No
26

Comparison of Computational Notebook Platforms for Interactive Visual Analytics: Case Study of Andromeda Implementations

Liu, Han 22 September 2022 (has links)
Existing notebook platforms differ in their support for visual analytics, and it is not clear which platform to choose when implementing visual analytics notebooks. In this work, we investigated the problem using Andromeda, an interactive dimension reduction algorithm, implemented on three notebook platforms: 1) the Python-based Jupyter Notebook, 2) the JavaScript-based Observable Notebook, and 3) a Jupyter Notebook embedding both Python (for data science) and JavaScript (for visual analytics). We compared the platforms in a case study using metrics such as programming difficulty, notebook organization, interactive performance, and UI design choices. Furthermore, we provide guidelines to help data scientists choose a notebook platform for implementing visual analytics notebooks in various situations, and, laying the groundwork for future developers, advice on architecting better notebook platforms. / Master of Science / Data scientists are interested in developing visual analytics notebooks. However, notebook platforms differ in their support for visual analytics components, such as visualizations and user interactions. To investigate which notebook platform to use for visual analytics, we built notebooks on three platforms: Jupyter Notebook (with Python), Observable Notebook (with JavaScript), and Jupyter Notebook (with Python and JavaScript). Based on the implementations and user interactions, we explain why significant differences exist, using metrics such as programming difficulty, notebook organization, interactive performance, and UI design choices. Our work will help future researchers choose suitable notebook platforms for implementing visual analytics notebooks.
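The hybrid (Python + JavaScript) approach typically works by serializing Python data as JSON inside an HTML/JS snippet that the notebook then renders. A stdlib-only sketch of that handoff follows; the function name `render_scatter` and the markup are illustrative, not the study's actual notebook code:

```python
import json

def render_scatter(points):
    """Embed Python data in a small HTML+JS snippet, as a hybrid
    Python/JavaScript notebook cell might do before handing the markup
    to the notebook's display machinery (e.g. IPython.display.HTML)."""
    payload = json.dumps({"points": points})
    return f"""
<div id="chart"></div>
<script>
  const data = {payload};  // data crosses the Python/JS boundary as JSON
  // ... D3 or canvas drawing code would consume data.points here ...
</script>
""".strip()

html = render_scatter([[1, 2], [3, 4]])
```

This serialization boundary is one source of the interactive-performance differences such comparisons measure: every round trip between the data science code and the visualization pays the cost of encoding and decoding the payload.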
27

Applied Machine Learning for Online Education

Serena Alexis Nicoll (12476796) 28 April 2022 (has links)
We consider the problem of developing innovative machine learning tools for online education and evaluate their ability to provide instructional resources. Predicting student behavior is a complex problem spanning a wide range of topics: we complement current research in student grade prediction and clickstream analysis by considering data from three areas of online learning: Social Learning Networks (SLN), Instructor Feedback, and Learning Management Systems (LMS). In each of these categories, we propose a novel method for modelling the data and an associated tool that may assist students and instructors. First, we develop a methodology for analyzing instructor-provided feedback and determining how it correlates with changes in student grades, using NLP- and NER-based feature extraction. We demonstrate that student grade improvement can be well approximated by a multivariate linear model, with average fits across course sections approaching 83%, and determine several contributors to student success. Additionally, we develop a series of link prediction methodologies that utilize spatial and time-evolving network architectures to pass network state between space and time periods. Through evaluation on six real-world datasets, we find that our method obtains substantial improvements over Bayesian models, linear classifiers, and an unsupervised baseline, with AUCs typically above 0.75 and reaching 0.99. Motivated by federated learning, we extend our model of student discussion forums to model an entire classroom as an SLN. We develop a methodology to represent student actions across different course materials in a shared, low-dimensional space that allows characteristics of actions of different types to be passed jointly to a downstream task.
Performance comparisons against several baselines in centralized, federated, and personalized learning demonstrate that our model offers more distinctive representations of students in a low-dimensional space, which in turn improves accuracy on a common downstream prediction task. Results from these three research thrusts indicate the ability of machine learning methods to accurately model student behavior across multiple data types, and suggest their potential to benefit students and instructors alike through future development of assistive tools.
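The AUC figures quoted for the link prediction models can be computed directly from predicted link scores and ground-truth labels. A small self-contained implementation uses the rank-statistic definition of AUC (the probability that a randomly chosen positive example outscores a randomly chosen negative one):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic:
    the fraction of (positive, negative) pairs in which the positive
    example receives the higher score, with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking of candidate links and 1.0 to a perfect ranking, which is why values "typically above 0.75 and reaching 0.99" indicate substantial predictive signal.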
28

Machine Learning to predict student performance based on well-being data : a technical and ethical discussion / Maskininlärning för att förutsäga elevers prestationer baserat på data om mående : en teknisk och etisk diskussion

McCarren, Lucy January 2023 (has links)
The data provided by educational platforms and digital tools offers new ways of analysing students' learning strategies. One such digital tool is the well-being platform created by EdAider, which consists of an interface where students can answer questions about their well-being, and a dashboard where teachers and schools can see insights into the well-being of individual students and groups of students. Both students and teachers can see the development of student well-being on a weekly basis. This thesis project investigates how Machine Learning (ML) can be used alongside Learning Analytics (LA) to understand and improve students' well-being. Real-world data generated by students at Swedish schools using EdAider's well-being platform is analysed to generate data insights. In addition, ML methods are implemented to build a model that predicts whether students are at risk of failing based on their well-being data, with the goal of informing data-driven improvements to students' education. This thesis has three primary goals: 1. Generate data insights to further understand patterns in the student well-being data. 2. Design a classification model using ML methods to predict student performance based on well-being data, and validate the model against actual performance data provided by the schools. 3. Carry out an ethical evaluation of the data analysis and grade prediction model. The results showed that males report higher well-being on average than females across most well-being factors, with the exception of relationships, where females report higher well-being than males. Students identifying as non-binary report considerably lower well-being than males and females across all 8 well-being factors; however, the amount of data for non-binary students was limited. Primary-school students report higher well-being than the older secondary-school students.
Students reported anxiety/depression as the most closely correlated pair of dimensions, followed by engagement/accomplishment and positive emotion/depression. Logistic regression and random forest models were used to build a performance prediction model, which aims to predict whether a student is at risk of performing poorly based on their reported well-being data. The model achieved an accuracy of 80-85 percent. Several feature-importance methods, including regularization, recursive feature selection, and impurity decrease for random forests, were investigated to examine which well-being factors have the greatest effect on performance. All methods consistently identified three features as important: "accomplishment", "depression", and "number of surveys answered". The benefits, risks and ethical value conflicts of the data analysis and prediction model were carefully considered and discussed using a Value Sensitive Design approach, and ethical practices for mitigating the risks are discussed.
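A classifier of the kind described, logistic regression with regularization over well-being features, can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the thesis's actual pipeline, and the feature semantics are invented:

```python
import numpy as np

def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression by gradient descent.
    Returns weights w and bias b; the magnitude of each w_j gives a rough
    feature-importance signal of the kind used to rank well-being factors."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted risk probability
        grad_w = X.T @ (p - y) / n + l2 * w       # logistic loss + L2 penalty
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy synthetic data with two well-being features (think "accomplishment"
# and "depression"); label 1 means at risk of failing.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 1] - X[:, 0] > 0).astype(float)   # high depression, low accomplishment
w, b = train_logreg(X, y)
```

On this toy data the learned weights point the expected way (negative for the protective feature, positive for the risk feature), which is exactly the kind of signal a regularization-based importance analysis reads off.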
29

Nonlinear parameter estimation of experimental cake filtration data

Buchwald, Thomas 20 January 2022 (has links)
This thesis presents nonlinear parameter estimation as an alternative method for the evaluation of cake filtration experiments. A dataset of 225 constant-pressure filtration experiments is used to highlight the advantages of this method over the widely used evaluation based on a linear transformation of the cake filtration equation. Goodness-of-fit is assessed by means of residual plots, which are introduced and discussed. The difference between the two methods' results for the specific cake resistance, the most important parameter in the dimensioning of filtration apparatuses, lies between 5 and 15%. Further evaluation possibilities opened up by nonlinear parameter estimation are presented, such as the evaluation of filtration experiments with nonconstant pressure, the determination of cake resistances for compressible systems, and the investigation of the initial blocking processes at the filter medium during the beginning stages of cake filtration.
Contents: 1 Introduction. 2 Cake Filtration Theory (2.1 Historical Development; 2.2 Derivation of the Cake Filtration Equation; 2.3 Fit Procedures for Cake Filtration Data; 2.4 Additional Methods for Finding the Time Offset). 3 Materials and Methods (3.1 Materials; 3.2 Filter Medium; 3.3 Laboratory Pressure Filters; 3.4 Example Dataset; 3.5 Preparation of Example Dataset; 3.6 Residual Plots and Chi-Squares; 3.7 Bootstrapped Statistics). 4 Proposed Fit Procedure (4.1 Nonlinear Regression; 4.2 Region of Best Fit). 5 Results and Discussion (5.1 Constant-Pressure Filtration; 5.2 Hermans & Bredée Models; 5.3 Residual Plots of Fit Results; 5.4 Nonconstant Filtration; 5.5 Compressibility Effects; 5.6 Optimal Parameter Definition; 5.7 The Role of the t/V-V-Diagram). 6 Conclusions. 7 Outlook (7.1 Constant-Flux Filtration; 7.2 Inline Resistance Measurements; 7.3 Parameter Estimation in Chemical Engineering). A Appendix (A.1 The Concentration Parameter; A.2 Obsolete Fit Methods; A.3 Residual Statistics; A.4 Bootstrapped Statistics Data; A.5 Fit Example in Microsoft Excel; A.6 Experimental Data and Metadata). B References
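The contrast between the two evaluation methods can be illustrated with the classical constant-pressure cake filtration relation t = a·V² + b·V, where a bundles the specific cake resistance and b the filter-medium resistance. The following sketch fits synthetic data both ways; the coefficient values are assumed for illustration, not taken from the thesis's dataset:

```python
import numpy as np
from scipy.optimize import curve_fit

def cake_model(V, a, b):
    """Constant-pressure cake filtration: filtration time t as a function
    of filtrate volume V, with a carrying the specific cake resistance
    and b the filter-medium resistance."""
    return a * V**2 + b * V

rng = np.random.default_rng(42)
a_true, b_true = 2.0, 0.5              # assumed, illustrative values
V = np.linspace(0.1, 2.0, 40)          # filtrate volume
t = cake_model(V, a_true, b_true) + rng.normal(0, 0.05, V.size)

# Nonlinear least squares directly on the measured t(V) data ...
(a_nl, b_nl), _ = curve_fit(cake_model, V, t)

# ... versus the common linearized evaluation: fit t/V = a*V + b with a
# straight line, which reweights the measurement errors by 1/V and can
# bias the resistance estimate, especially at small volumes.
a_lin, b_lin = np.polyfit(V, t / V, 1)
```

The difference between `a_nl` and `a_lin` on noisy data is the kind of 5-15% discrepancy in the specific cake resistance that the thesis quantifies, and residual plots of both fits make the error-reweighting of the linearized form visible.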
30

Fostering collaboration amongst business intelligence, business decision makers and statisticians for the optimal use of big data in marketing strategies

De Koker, Louise January 2019 (has links)
Philosophiae Doctor - PhD / The aim of this study was to propose a model of collaboration adaptable for the optimal use of big data in an organisational environment. There is a paucity of knowledge on such collaboration, and this research addressed the gap. More specifically, the research attempted to establish whether leadership, trust and knowledge sharing influence collaboration among the stakeholders identified at large organisations. The conceptual framework underlying this research was informed by collaboration theory and organisational theory. It was assumed that effective collaboration in the optimal use of big data is possibly associated with leadership, knowledge sharing and trust; these associations were formulated as hypotheses to be tested within the context of big data. The study used a mixed methods approach, combining a qualitative with a quantitative study. The qualitative study took the form of in-depth interviews with senior managers from different business units at a retail organisation in Cape Town. The quantitative study was an online survey of senior marketing personnel at JSE-listed companies from various industries in Cape Town. A triangulation methodology was adopted, with additional in-depth interviews of big data and analytics experts from both South Africa and abroad, to strengthen the research. The findings indicate the changing role of the statistician in the era of big data and the new discipline of data science, and confirm the importance of leadership, trust and knowledge sharing in ensuring effective collaboration. Of the three hypotheses tested, two were confirmed. Collaboration has been applied in many areas.
Unexpected findings of the research were the role the chief data officer plays in fostering collaboration among stakeholders in the optimal use of big data in marketing strategies, as well as the importance of organisational structure and culture in effective collaboration in the context of big data and data science in large organisations. The research has contributed to knowledge by extending the theory of collaboration to the domain of big data in the organisational context, with the proposal of an integrated model of collaboration in the context of big data. This model was grounded in the data collected from various sources, establishing the crucial new role of the chief data officer as part of the executive leadership and main facilitator of collaboration in the organisation. Collaboration among the specified stakeholders, led by the chief data officer, occurs both horizontally with peers and vertically with specialists at different levels within the organisation in the proposed model. The application of such a model of collaboration should facilitate the successful outcome of the collaborative efforts in data science in the form of financial benefits to the organisation through the optimal use of big data.
