121

Project-based Multi-tenant Container Registry For Hopsworks

Kashyap, Pradyumna Krishna January 2020 (has links)
There has been substantial growth in the usage of data in the past decade; cloud technologies and big data platforms have gained popularity as they help process such data at scale. Hopsworks is one such managed platform for scale-out data science. It is an open-source platform for the development and operation of Machine Learning models, available on-premise and as a managed platform in the cloud. As most of these platforms provide data science environments that collate the required libraries, Hopsworks provides users with Anaconda environments. Hopsworks provides multi-tenancy, ensuring a secure model for managing sensitive data on the shared platform. Most Hopsworks features are built around projects: each project includes an Anaconda environment that provides users with a number of libraries capable of processing data. Each project creation triggers the creation of a base Anaconda environment, and each added library updates this environment. For an on-premise application, as data science teams are diverse and work towards building repeatable and scalable models, it becomes increasingly important to manage these environments in a central local location. The purpose of this thesis is to provide secure storage for these Anaconda environments. As Hopsworks uses a Kubernetes cluster to serve models, these environments can be containerized and stored in a secure container registry on the Kubernetes cluster. The provided solution also aims to extend the multi-tenancy feature of Hopsworks to the hosted local storage. The implementation comprises two parts: the first is to host a compatible open-source container registry that stores the container images on a local Kubernetes cluster with fault tolerance, avoiding a single point of failure; the second is to leverage the multi-tenancy feature of Hopsworks by storing the images on the self-sufficient secure registry with project-level isolation.
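A minimal sketch of the project-level isolation idea described above, with hypothetical names throughout (the registry host, helper functions, and authorization rule are illustrative assumptions, not Hopsworks APIs): each project's environment images live under a project-scoped repository prefix, and access is granted only to members of that project.

```python
# Illustrative sketch of project-scoped multi-tenancy in a container registry.
# All names here (registry host, functions) are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class User:
    name: str
    projects: set  # projects the user is a member of

def image_repository(project: str, env_id: str) -> str:
    # Each project's Anaconda environment images live under a
    # project-scoped prefix, so tenants never share a repository.
    return f"registry.example.com/{project}/environments/{env_id}"

def authorize(user: User, repository: str) -> bool:
    # Project-level isolation: push/pull is allowed only when the
    # repository's project prefix matches one of the user's projects.
    project = repository.split("/")[1]
    return project in user.projects

alice = User("alice", {"demo_project"})
repo = image_repository("demo_project", "env-1.0")
assert authorize(alice, repo)
assert not authorize(alice, "registry.example.com/other_project/environments/env-2.0")
```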
122

Data Driven High Performance Data Access

Ramljak, Dusan January 2018 (has links)
Low-latency, high-throughput mechanisms to retrieve data become increasingly crucial as cyber and cyber-physical systems pour out growing amounts of data that often must be analyzed online. Generally, as data volume increases, the marginal utility of an "average" data item tends to decline, which requires greater effort in identifying the most valuable data items and making them available with minimal overhead. We believe that data-analytics-driven mechanisms have a big role to play in solving this needle-in-the-haystack problem. We rely on the claim that efficient pattern discovery and description, coupled with the observed predictability of complex patterns within many applications, offers significant potential to enable many I/O optimizations. Our research covers exploitation of the storage hierarchy for data-driven caching and tiering, reduction of the distance between data and computations, removal of redundancy in data, use of sparse representations of data, the impact of data access mechanisms on resilience, energy consumption, and storage usage, and the enablement of new classes of data-driven applications. For caching and prefetching, we offer a powerful model that separates the process of access prediction from the data retrieval mechanism. Predictions are made on a per-data-entity basis and use the notion of "context" and its aspects, such as "belief", to uncover and leverage future data needs. This approach allows truly opportunistic utilization of predictive information. We elaborate on which aspects of the context we use in areas other than caching and prefetching, and why each is appropriate in its situation. We present in more detail the methods we have developed: BeliefCache for data-driven caching and prefetching, and AVSC for pattern-mining-based compression of data. In BeliefCache, using a belief, an aspect of context representing an estimate of the probability that a storage element will be needed, we developed a modular framework to make unified, informed decisions about that element or a group of elements. For the workloads we examined, we were able to capture complex non-sequential access patterns better than a state-of-the-art framework for optimizing cloud storage gateways. Moreover, our framework is also able to adjust to variations in the workload faster, and it does not require a static workload to be effective, since its modular design allows for discovering and adapting to changes in the workload. In AVSC, using an aspect of context to gauge the similarity of events, we perform compression by keeping relevant events intact and approximating other events. We do this in two stages: we first generate a summarization of the data, then approximately match the remaining events with the existing patterns where possible, or add new patterns to the summary otherwise. We show gains over plain lossless compression for a specified amount of accuracy for purposes of identifying the state of the system, and a clear tradeoff between compressibility and fidelity. In the other research areas mentioned, we present challenges and opportunities in the hope that they will spur researchers to further examine those issues in the space of rapidly emerging data-intensive applications. We also discuss how our research in other domains could be applied in our attempts to provide high-performance data access. / Computer and Information Science
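As a concrete illustration of the caching idea, the sketch below separates access prediction from retrieval and keeps a per-item "belief", an estimated probability that an element will be needed. It is a simplified toy, not the BeliefCache implementation: the transition-count update and decay rule are invented for illustration.

```python
# Toy belief-driven prefetcher in the spirit of the abstract; the update and
# decay rules here are assumptions, not the thesis's actual algorithm.

from collections import defaultdict

class BeliefPrefetcher:
    def __init__(self, threshold=0.6, decay=0.9):
        self.threshold = threshold                        # prefetch when belief exceeds this
        self.decay = decay                                # beliefs fade when unsupported
        self.belief = defaultdict(float)                  # item -> estimated P(needed soon)
        self.follows = defaultdict(lambda: defaultdict(int))  # observed transitions
        self.last = None

    def observe(self, item):
        # Learn (possibly non-sequential) access patterns as transition counts.
        if self.last is not None:
            self.follows[self.last][item] += 1
        self.last = item
        # Re-estimate beliefs for items that historically follow `item`.
        total = sum(self.follows[item].values()) or 1
        for nxt, cnt in self.follows[item].items():
            self.belief[nxt] = cnt / total
        # Decay everything else so decisions adapt to workload drift.
        for k in self.belief:
            if k not in self.follows[item]:
                self.belief[k] *= self.decay

    def prefetch_candidates(self):
        # The decision is decoupled from retrieval: any backend can act on it.
        return [k for k, b in self.belief.items() if b >= self.threshold]

p = BeliefPrefetcher()
for block in ["a", "b", "a", "b", "a", "b"]:
    p.observe(block)
print(p.prefetch_candidates())  # -> ['b', 'a']: both items of the alternating trace exceed the threshold
```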
123

Restoring Consistency in Ontological Multidimensional Data Models via Weighted Repairs

Haque, Enamul January 2020 (has links)
This can be considered multidisciplinary research, where ideas from Operations Research, Data Science, and Logic come together to solve an inconsistency-handling problem in a special type of ontology. / High data quality is a prerequisite for accurate data analysis. However, data inconsistencies often arise in real data, leading to untrusted decision making downstream in the data analysis pipeline. In this research, we study the problem of inconsistency detection and repair for the Ontological Multi-dimensional Data Model (OMD). We propose a framework of data quality assessment and repair for the OMD. We formally define weight-based repair-by-deletion semantics and present an automatic weight-generation mechanism that considers multiple input criteria. Our methods are rooted in multi-criteria decision making, which considers the correlation, contrast, and conflict that may exist among multiple criteria and is often needed in the data cleaning domain. After weight generation, we present a dynamic-programming-based Min-Sum algorithm to identify a minimal-weight solution. We then apply evolutionary optimization techniques and demonstrate improved performance on medical datasets, making the approach realizable in practice. / Thesis / Master of Computer Science (MCS) / Accurate data analysis requires high-quality data as input. In this research, we study inconsistency in an ontology known as the Ontological Multi-dimensional Data (OMD) Model and propose algorithms to repair inconsistencies based on automatically generated relative weights. We propose two techniques to restore consistency: one provides optimal results but takes longer, while the other produces sub-optimal results fast enough for practical purposes, as shown by experiments on datasets.
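The following sketch illustrates the weight-based repair-by-deletion objective on a toy instance: choose the minimum-total-weight set of tuples to delete so that every detected conflict loses at least one member. The weights and conflicts are made up, and exhaustive search stands in for the thesis's dynamic-programming Min-Sum algorithm.

```python
# Naive stand-in for weighted repair-by-deletion: find the cheapest deletion
# set that breaks every conflict. Weights and conflicts are assumed values.

from itertools import combinations

tuples = {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.2}   # tuple -> weight (assumed)
conflicts = [{"t1", "t2"}, {"t2", "t3"}, {"t3", "t4"}]  # each set is jointly inconsistent

def min_weight_repair(weights, conflicts):
    names = list(weights)
    best, best_cost = None, float("inf")
    for r in range(len(names) + 1):
        for deleted in combinations(names, r):
            d = set(deleted)
            if all(c & d for c in conflicts):           # every conflict loses a member
                cost = sum(weights[t] for t in d)
                if cost < best_cost:
                    best, best_cost = d, cost
    return best, best_cost

print(min_weight_repair(tuples, conflicts))  # deletes t2 and t4 (total weight 0.6)
```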
124

Interpreting Shift Encoders as State Space models for Stationary Time Series

Donkoh, Patrick 01 May 2024 (has links) (PDF)
Time series analysis is a statistical technique used to analyze sequential data points collected or recorded over time. While traditional models such as autoregressive and moving-average models have performed adequately for time series analysis, the advent of artificial neural networks has provided models with improved performance. In this research, we provide a custom neural network, a shift encoder, that can capture the intricate temporal patterns of time series data. We then compare the sparse matrix of the shift encoder to the parameters of the autoregressive model and observe the similarities. We further explore how the state matrix in a state-space model can be replaced with the sparse matrix of the shift encoder.
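The state-space connection can be made concrete with the companion-matrix form of an AR(p) model, where the state matrix has exactly the sparse "shift" structure the abstract alludes to; the sketch below uses assumed AR(2) coefficients.

```python
# An AR(p) model rewritten as a linear state-space system x_{t+1} = A x_t,
# where A is a companion ("shift") matrix; the shift encoder's sparse weight
# matrix plays the role of A. Coefficients below are assumptions.

import numpy as np

phi = np.array([0.6, 0.3])          # assumed AR(2) coefficients
p = len(phi)

# Companion form: the first row holds the AR coefficients, and the
# sub-diagonal shifts past observations down one slot - a "shift" structure.
A = np.zeros((p, p))
A[0, :] = phi
A[1:, :-1] = np.eye(p - 1)

x = np.array([1.0, 0.5])            # state: [y_t, y_{t-1}]
for _ in range(3):
    x = A @ x                       # one-step-ahead forecast, then shift
    print(x[0])                     # forecasted y: 0.75, 0.75, 0.675
```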
125

Establishing “The Fossil Record”: A Database of Vertebrate Paleontological Sites Across the State of Tennessee

Mclaurine, Sarah 01 May 2024 (has links) (PDF)
Fossil localities across the state of Tennessee, and the data related to those sites, were compiled from Tennessee Division of Geology Bulletin 84, titled "Tennessee's Prehistoric Vertebrates," and stored in a Microsoft Access geodatabase housed by the Department of Collections at the East Tennessee State University Museum of Natural History at the Gray Fossil Site. The database includes forms to enter new site localities, view information about those already entered, view and add data to a master faunal list for the state, view sites' repository information, and store documents that are keyword-searchable from the main menu. This database was compiled to give researchers a straightforward, easy-to-use means of analyzing known information about paleontological sites across the state, with the potential to be expanded worldwide. Conservation of data is crucial, as data can be lost over time unless preservation efforts are made.
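For illustration only, the sketch below mocks up the entities the abstract names (localities, a master faunal list, keyword-searchable documents) in SQLite; the real system is an Access geodatabase, and all field names and values here are guesses.

```python
# Hypothetical schema sketch of the described database; the thesis uses a
# Microsoft Access geodatabase, and SQLite stands in here for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE locality (
    locality_id INTEGER PRIMARY KEY,
    name TEXT, county TEXT, latitude REAL, longitude REAL,
    source TEXT                        -- e.g. 'TN Div. of Geology Bulletin 84'
);
CREATE TABLE faunal_list (             -- statewide master faunal list
    taxon_id INTEGER PRIMARY KEY,
    taxon TEXT,
    locality_id INTEGER REFERENCES locality(locality_id)
);
CREATE TABLE document (                -- keyword-searchable supporting documents
    doc_id INTEGER PRIMARY KEY,
    locality_id INTEGER REFERENCES locality(locality_id),
    title TEXT, keywords TEXT
);
""")
# Illustrative row only; coordinates and source label are placeholders.
con.execute("INSERT INTO locality VALUES (1, 'Gray Fossil Site', 'Washington', 36.4, -82.5, 'Bulletin 84')")
print(con.execute("SELECT name FROM locality WHERE county = 'Washington'").fetchone())
```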
126

Using ICU Admission as a Predictor for Maternal Mortality: Identifying Essential Features for Accurate Classification

Dairian Haulani Ly Balai (18415224) 20 April 2024 (has links)
Maternal mortality (MM) is a pressing global health issue that results in thousands of mothers dying annually from pregnancy-related complications. Despite spending trillions of dollars on healthcare, the U.S. continues to experience one of the highest rates of maternal death (MD) among developed countries. This ongoing public health crisis highlights the urgent need for innovative strategies to detect and mitigate adverse maternal outcomes. This study introduces a novel approach, utilizing admission to the ICU as a proxy for MM. By analyzing 14 years of natality birth data, it explores the complex web of factors that elevate the chances of MD. The primary goal is to identify the features that are most influential in predicting ICU admission. These factors hold the potential to be applied to MM, as they can serve as early warning signs that complications may arise, allowing healthcare professionals to intervene before adverse maternal outcomes occur. Two supervised machine learning models were employed, Logistic Regression (LR) and eXtreme Gradient Boosting (XGBoost). The models were executed twice for each dataset: once incorporating all available features and again using only the most significant features. Following model training, XGBoost's feature selection technique was employed to identify the 10 features most influential in the classification process. Our analysis revealed a diverse range of factors important for predicting ICU admission: maternal transfusion, labor and delivery characteristics, delivery methods, gestational age, maternal attributes, and newborn conditions. In terms of model performance, XGBoost consistently outperformed LR across the datasets, demonstrating higher accuracy, precision, and F1 scores; for recall, however, LR maintained higher scores. Moreover, the models consistently achieved higher scores when trained with all available features than when trained solely with the top features. Although the models demonstrated satisfactory performance on some evaluation metrics, there were notable deficiencies in recall and precision, suggesting that further model refinement is needed to predict these cases effectively.
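The sketch below mirrors the modeling pattern described, not the study's data or settings: train XGBoost and logistic regression on tabular features for a binary ICU-admission label, then rank the ten most influential features by XGBoost's importance scores. The synthetic data and hyperparameters are placeholders.

```python
# Sketch of the described pipeline on synthetic stand-in data: LR vs. XGBoost,
# plus XGBoost feature importances for selecting the top 10 features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))     # stand-in for natality features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=5000) > 1.5).astype(int)  # "ICU admission" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
xgb.fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

top10 = np.argsort(xgb.feature_importances_)[::-1][:10]
print("top features:", top10)       # indices of the most influential features
print("xgb acc:", xgb.score(X_te, y_te), "lr acc:", lr.score(X_te, y_te))
```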
127

Enhancing NFL Game Insights: Leveraging XGBoost For Advanced Football Data Analytics To Quantify Multifaceted Aspects Of Gameplay

Schoborg, Christopher P 01 January 2024 (has links) (PDF)
XGBoost, renowned for its efficacy across various statistical domains, offers enhanced precision and efficiency. Its versatility extends to both regression and classification tasks, rendering it a valuable asset in predictive modeling. In this dissertation, I harness XGBoost to forecast and rank performances within the National Football League (NFL). Specifically, my research focuses on predicting the next play in NFL games from pre-snap data; optimizing the draft ranking process by integrating NFL Combine data with collegiate statistics; creating a player rating system that can be compared across all positions; and evaluating strategic decisions for NFL teams once they cross the 50-yard line, including the feasibility of attempting a first-down conversion versus opting for a field goal attempt.
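One of these questions, whether to attempt a conversion or a field goal past midfield, reduces to an expected-value comparison; the toy sketch below uses assumed probabilities and point values where the dissertation's XGBoost models would supply data-driven estimates.

```python
# Toy expected-points framing of the go-for-it vs. field-goal decision.
# All probabilities and point values are assumptions, not model outputs.

def expected_points(p_success: float, pts_success: float, pts_fail: float) -> float:
    return p_success * pts_success + (1 - p_success) * pts_fail

# Hypothetical 4th-and-3 at the opponent 35-yard line:
go = expected_points(p_success=0.55, pts_success=2.8, pts_fail=-1.5)    # convert vs. turnover on downs
kick = expected_points(p_success=0.45, pts_success=3.0, pts_fail=-2.0)  # long attempt vs. short field for opponent

print(f"go for it: {go:.2f}, field goal: {kick:.2f}")
# In the dissertation, XGBoost would estimate these quantities from
# pre-snap and historical data rather than fixed constants.
```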
128

High-variance multivariate time series forecasting using machine learning

Katardjiev, Nikola January 2018 (has links)
There are several tools and models in machine learning that can be used to forecast a time series; however, it is not always clear which model is appropriate, as different models are suited to different types of data, and domain-specific transformations and considerations are usually required. This research examines the issue by modeling four types of machine- and deep-learning algorithms - a support vector machine, a random forest, a feed-forward neural network, and an LSTM neural network - on a high-variance, multivariate time series to forecast trend changes one time step into the future, accounting for lag. The models were trained on clinical trial data of patients in an alcohol addiction treatment plan provided by an Uppsala-based company. The results showed moderate performance differences, with a concern that the models were performing a random walk or naive forecast. Further analysis proved that at least one model, the feed-forward neural network, was not doing so and was able to make meaningful forecasts one time step into the future. In addition, the research examined the effect of optimization processes by comparing grid search, random search, and Bayesian optimization. In all cases, grid search found the lowest minima, though its slow runtimes were consistently beaten by Bayesian optimization, which achieved only slightly lower performance than grid search.
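The random-walk concern can be tested by benchmarking a model against a naive persistence forecast; a minimal sketch with placeholder data and a stand-in model (a random forest, one of the four examined) is shown below.

```python
# Sketch of the "random walk" check: compare a model's one-step-ahead error
# against a naive persistence forecast (predict y_t for y_{t+1}).
# The series and model settings are placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=500)) + np.sin(np.arange(500) / 10)  # high-variance series

lag = 5
X = np.column_stack([y[i:len(y) - lag + i] for i in range(lag)])   # lagged features
target = y[lag:]
split = 400

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], target[:split])

pred = model.predict(X[split:])
naive = X[split:, -1]               # persistence: repeat the last observed value
mae_model = np.mean(np.abs(pred - target[split:]))
mae_naive = np.mean(np.abs(naive - target[split:]))
print(f"model MAE {mae_model:.3f} vs naive MAE {mae_naive:.3f}")
# A model that only matches the naive baseline has learned little
# beyond a random walk.
```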
129

Kompendium der Online-Forschung (DGOF)

Deutsche Gesellschaft für Online-Forschung e. V. (DGOF) 24 November 2021 (has links)
Here the DGOF publishes digital compendia on current topics in online research, with contributions from experts in the field.
130

Machine Learning Modeling of Polymer Coating Formulations: Benchmark of Feature Representation Schemes

Evbarunegbe, Nelson I 14 November 2023 (has links) (PDF)
Polymer coatings offer a wide range of benefits across various industries, playing a crucial role in product protection and extension of shelf life. However, formulating them can be a non-trivial task given the multitude of variables and factors involved in the production process, rendering it a complex, high-dimensional problem. To tackle this problem, machine learning (ML) has emerged as a promising tool, showing considerable potential for enhancing various polymer- and chemistry-based applications, particularly those dealing with high-dimensional complexities. Our research aims to develop a physics-guided ML approach to facilitate the formulation of polymer coatings. As a first step, this project focuses on finding the machine-readable feature representation techniques most suitable for encoding formulation ingredients. Using two polymer-informatics datasets - one encompassing a large set of 700,000 common homopolymers, including epoxies and polyurethanes as coating base materials, and the other a relatively small set of 1,000 epoxy-diluent formulations - we benchmarked four featurization schemes for representing polymer coating molecules: the molecular access system, the extended-connectivity fingerprint, a molecular-graph-based chemical graph network, and graph convolutional network (MG-GCN) embeddings. These representation schemes were used with ensemble models to predict molecular properties, including topological surface area and viscosity. The results show that the combination of MG-GCN and ensemble models such as extreme gradient boosting and random forest achieved the best overall performance, with coefficient of determination (r2) values of 0.74 for topological surface area and 0.84 for viscosity, which compare favorably with existing techniques. These results lay the foundation for using ML with physical modeling to expedite the development of polymer coating formulations.
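A sketch of one benchmarked combination, an extended-connectivity (Morgan) fingerprint representation fed to a random forest, is shown below; the SMILES strings and the RDKit-computed TPSA target are placeholders for the thesis's datasets and measured properties.

```python
# Sketch of an ECFP + ensemble pipeline; molecules and the TPSA target are
# placeholders, not the thesis's polymer datasets or measured properties.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "CC(C)CO", "O=C(N)c1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Extended-connectivity fingerprint: radius-2 circular substructures, 1024 bits.
X = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)) for m in mols])
y = np.array([Descriptors.TPSA(m) for m in mols])   # topological polar surface area

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X[:2]), y[:2])                  # sanity check on training data
```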
