About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
81

Predicting Delays In Delivery Process Using Machine Learning-Based Approach

Shehryar Shahid (9745388) 16 December 2020 (has links)
There has been great interest in applying data science, machine learning, and AI-related technologies in recent years. Industries are adopting these technologies rapidly, which has enabled them to gather valuable data about their businesses. One industry that can leverage this data to improve the output and quality of its business is the logistics and transport industry. This creates an excellent opportunity for companies that rely heavily on air transportation to gain valuable insights and improve their business operations. This thesis aims to leverage this data to develop techniques for modeling complex business processes and to design a machine learning-based predictive analytical approach for predicting process violations.

The thesis focuses on delays in shipment delivery, modeling a prediction technique to identify shipments at risk of being delayed. The approach is based on real airfreight shipping data that follows the International Air Transport Association industry standard for airfreight transportation. By leveraging the shipment process structure, this research presents a new approach that handles the complex event-driven structure of airfreight data, which otherwise makes it difficult to model for predictive analytics.

By applying different data mining and machine learning techniques, prediction techniques were developed to predict delays in delivering airfreight shipments, based on random forest and gradient boosting algorithms. To compare and select the best model, the prediction results were interpreted in the form of six confusion matrix-based performance metrics. The results showed that all the predictors had a high specificity of over 90%, but sensitivity was low, under 44%. Accuracy was over 75%, and the geometric mean was between 58% and 64%.

These performance metrics provide evidence that the approach can be implemented to develop prediction techniques for complex business processes. Additionally, an early prediction method was designed to test the predictors' performance when complete process information is not available. This method delivered compelling evidence that early prediction can be achieved without compromising predictor performance.
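The sketch below is a hedged illustration, not the thesis's actual pipeline or airfreight data: a synthetic, imbalanced "delayed shipment" dataset stands in, and random forest and gradient boosting classifiers are scored with four of the six confusion-matrix-based metrics the abstract reports (sensitivity, specificity, accuracy, geometric mean).

```python
# Illustrative sketch only: the thesis's airfreight features and labels are not
# reproduced here, so a synthetic binary "delayed shipment" dataset stands in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)  # imbalanced: roughly 20% delayed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [("random forest", RandomForestClassifier(random_state=0)),
                    ("gradient boosting", GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    sensitivity = tp / (tp + fn)          # recall on delayed shipments
    specificity = tn / (tn + fp)          # recall on on-time shipments
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    g_mean = np.sqrt(sensitivity * specificity)
    print(f"{name}: sens={sensitivity:.2f} spec={specificity:.2f} "
          f"acc={accuracy:.2f} g-mean={g_mean:.2f}")
```

On imbalanced delay data like this, the geometric mean balances high specificity against lower sensitivity, which is consistent with it falling between the two in the reported results.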
82

Aplicación de Data Science Specialist

Ccora Camarena, Yuli, Jeri De La Cruz, Nélida, Enriquez Yance, Rosario Grace 14 January 2020 (has links)
This research analyzes a problem at the company Travico Perú S.A.C, which has reported a decline in sales across the various services it offers. The work applies a data science methodology to identify the variables that influenced sales of all services from 2016 to 2018. The dataset was obtained from the platforms the company works with and from internal control reports, yielding 12 variables with 6,429 records. An unsupervised, partition-based machine learning technique, K-means clustering, was then used to segment and group the selected variables. Finally, several charts of the company's sales results were produced and compared against the clustering results. / Research paper
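As a rough illustration of the partition-based clustering step (the company's 12 variables and 6,429 records are not available here, so synthetic sales-like data stands in), K-means can be applied to scaled features and candidate cluster counts compared by inertia and silhouette score:

```python
# Illustrative sketch only: random sales-like features stand in for Travico's data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
sales = rng.gamma(shape=2.0, scale=500.0, size=(6429, 12))  # 6,429 rows x 12 variables

X = StandardScaler().fit_transform(sales)   # K-means is distance-based, so scale first

for k in range(2, 7):                        # compare candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.0f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")

best = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)  # pick k after inspection
print("cluster sizes:", np.bincount(best.labels_))
```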
83

Improving Recommendation Systems Using Image Data

Åslin, Filip January 2022 (has links)
Recommendation systems typically use historical interactions between users and items to predict what other items may be of interest to a user. The recommendations are based on patterns in how users interact similarly with items. This thesis investigates whether it is possible to improve the quality of the recommendations by including more information about the items in the model that predicts the recommendations. More specifically, the use of deep learning to extract information from item images is investigated. To do this, two types of collaborative filtering models, based on historic interactions, are implemented. These models are then compared to different collaborative filtering models that make use of either user and item attributes or images of the items. Three pre-trained image classification models are used to extract useful item features from the item images. The models are trained and evaluated using a dataset of historic transactions and item images from the online sports shop Stadium, provided by the thesis supervisor. The results show no noticeable improvement in performance for the models using the images compared to the models without images. The model using the user and item attributes performs best, indicating that the collaborative filtering models can be improved by giving them more information than just the historic interactions. Possible ways to further investigate using the image feature vectors in collaborative filtering models, as well as using them to create better item attributes, are discussed and suggested for future work.
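The abstract does not name the three pre-trained image classification models, so the following sketch assumes a torchvision ResNet-50 purely as an example of how per-item image feature vectors could be extracted for use alongside a collaborative filtering model; the file name and downstream model are hypothetical.

```python
# Illustrative sketch: the thesis's specific pre-trained models and the Stadium
# dataset are not available; a torchvision ResNet-50 stands in as one possible
# feature extractor producing per-item image embeddings.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()          # drop the classifier head, keep 2048-d features
backbone.eval()

preprocess = weights.transforms()          # resize/normalize as the backbone expects

@torch.no_grad()
def item_embedding(image_path: str) -> torch.Tensor:
    """Return a 2048-d feature vector for one item image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return backbone(img).squeeze(0)

# These vectors can then be concatenated with learned item factors (or used in
# place of item attributes) in a hybrid collaborative filtering model.
# emb = item_embedding("item_12345.jpg")   # hypothetical file name
```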
84

Topological Hierarchies and Decomposition: From Clustering to Persistence

Brown, Kyle A. 27 May 2022 (has links)
No description available.
85

GENERATIVE, PREDICTIVE, AND REACTIVE MODELS FOR DATA SCARCE PROBLEMS IN CHEMICAL ENGINEERING

Nicolae Christophe Iovanac (11167785) 22 July 2021 (has links)
Data scarcity is intrinsic to many problems in chemical engineering due to physical constraints or cost. This challenge is acute in chemical and materials design applications, where a lack of data is the norm when trying to develop something new for an emerging application. Addressing novel chemical design under these scarcity constraints takes one of two routes: the traditional forward approach, where properties are predicted based on chemical structure, and the recent inverse approach, where structures are predicted based on required properties. Statistical methods such as machine learning (ML) could greatly accelerate chemical design under both frameworks; however, in contrast to the modeling of continuous data types, molecular prediction has many unique obstacles (e.g., spatial and causal relationships, featurization difficulties) that require further ML methods development. Despite these challenges, this work demonstrates how transfer learning and active learning strategies can be used to create successful chemical ML models in data scarce situations.

Transfer learning is a domain of machine learning in which information learned in solving one task is transferred to help in another, more difficult task. Consider a forward design problem involving the search for a molecule with a particular property target and limited existing data, a situation not typically amenable to ML. In these situations, there are often correlated properties that are computationally accessible. Because all chemical properties are fundamentally tied to the underlying chemical topology, and because related properties arise from related moieties, the information contained in the correlated property can be leveraged during model training to help improve the prediction of the data scarce property. Transfer learning is thus a favorable strategy for facilitating high throughput characterization of low-data design spaces.

Generative chemical models invert the structure-function paradigm and instead directly suggest new chemical structures that should display the desired application properties. This inversion process is fraught with difficulties but can be improved by training these models with strategically selected chemical information. Structural information contained within this chemical property data is thus transferred to support the generation of new, feasible compounds. Moreover, the transfer learning approach helps ensure that the proposed structures exhibit the specified property targets. Recent extensions also utilize thermodynamic reaction data to help promote the synthesizability of suggested compounds. These transfer learning strategies are well-suited for explorative scenarios where the property values being sought are well outside the range of available training data.

There are situations where property data is so limited that obtaining additional training data is unavoidable. By improving both the predictive and generative qualities of chemical ML models, a fully closed-loop computational search can be conducted using active learning. New molecules in underrepresented property spaces may be iteratively generated by the network, characterized by the network, and used for retraining the network. This allows the model to gradually learn the unknown chemistries required to explore the target regions of chemical space by actively suggesting the new training data it needs. By utilizing active learning, the create-test-refine pathway can be addressed purely in silico. This approach is particularly suitable for multi-target chemical design, where the high dimensionality of the desired property targets exacerbates data scarcity concerns.

The techniques presented herein can be used to improve both predictive and generative performance of chemical ML models. Transfer learning is demonstrated as a powerful technique for improving the predictive performance of chemical models in situations where a correlated property can be leveraged alongside scarce experimental or computational properties. Inverse design may also be facilitated through transfer learning, where property values can be connected with stable structural features to generate new compounds with targeted properties beyond those observed in the training data. Thus, when the necessary chemical structures are not known, generative networks can directly propose them based on function-structure relationships learned from domain data, and this domain data can even be generated and characterized by the model itself for closed-loop chemical searches in an active learning framework. With recent extensions, these models are compelling techniques for looking at chemical reactions and other data types beyond the individual molecule. Furthermore, the approaches are not limited by choice of model architecture or chemical representation and are expected to be helpful in a variety of data scarce chemical applications.
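A minimal sketch of the transfer-learning pattern described above, assuming synthetic data and a generic feed-forward network rather than the chemical representations used in the thesis: a shared backbone is pretrained on an abundant correlated property, then frozen while a small head is fine-tuned on the scarce target property.

```python
# Hedged sketch of the transfer-learning idea, not the thesis's actual networks.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic stand-ins: an abundant "correlated" property and a scarce target property.
X_src = torch.randn(5000, 64); y_src = X_src[:, :8].sum(1, keepdim=True)
X_tgt = torch.randn(80, 64);   y_tgt = 1.3 * X_tgt[:, :8].sum(1, keepdim=True) + 0.5

backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32), nn.ReLU())

def fit(model, X, y, epochs=200, lr=1e-3):
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pretrain backbone + source head on the abundant correlated property.
src_model = nn.Sequential(backbone, nn.Linear(32, 1))
print("source MSE:", fit(src_model, X_src, y_src))

# 2) Transfer: freeze the backbone, fine-tune only a new head on the scarce property.
for p in backbone.parameters():
    p.requires_grad = False
tgt_model = nn.Sequential(backbone, nn.Linear(32, 1))
print("target MSE:", fit(tgt_model, X_tgt, y_tgt))
```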
86

Characterizing the learning, sociology, and identity effects of participating in The Data Mine

Aparajita Jaiswal (12418072) 14 April 2022 (has links)
The discipline of data science has gained substantial attention recently. This is mainly attributed to technological advancement that led to an exponential increase in computing power and has made the generation and recording of enormous amounts of data possible on an everyday basis. It has become crucial for industries to wrangle, curate, and analyze data using data science techniques to make informed decisions. Making informed decisions is complex; therefore, a trained data science workforce is required to analyze data on a real-time basis. The increasing demand for data science professionals has caused higher education institutions to develop courses and train students, starting at the undergraduate level, in data science concepts and tools.

Despite the efforts of institutions and national agencies such as the National Academies of Sciences, Engineering, and Medicine, there have been significant challenges in attracting and retaining students in the discipline of data science. Novice learners in data science are required to possess the skills of a programmer and a statistician, research skills, and non-technical skills such as communication and critical thinking. Undergraduate students do not possess all the required skills, which creates a cognitive load for novice learners (Koby & Orit, 2020). Research suggests that improving teaching and mentoring methodologies can improve retention for students from all demographic groups (Seymour, 2002). Previous studies (e.g., Hoffmann et al., 2002; Flynn, 2015; Lenning & Ebbers, 1999) have revealed that learning communities are effective in improving student retention, especially at the undergraduate level, as they help students develop a sense of belonging, socialize, and form their own identities. Learning communities have been identified as high impact practices (Kuh, 2008) that help develop identities and a sense of belonging; however, to the best of our knowledge, few studies focus on the development of the psychosocial and cognitive skills of students enrolled in a data science learning community.

To meet the demand for the future workforce and help undergraduate students develop data science skills, The Data Mine (TDM) at Purdue University has undertaken an initiative in the discipline of data science. The Data Mine is an interdisciplinary living-learning community that allows students from various disciplines to enroll and learn data science skills under the guidance of competent faculty and corporate mentors. The residential nature of the learning community allows undergraduate students to live, learn, and socialize with peers of similar interests and develop a sense of belonging. Constant interaction with knowledgeable faculty and mentors on real-world projects allows novice learners to master data science skills and develop an identity. The study aims to characterize the identity formation, socialization, and learning of the undergraduate students enrolled in The Data Mine and answers the following research questions:

Quantitative RQ 1: What are the perceptions of students regarding their identity formation, socialization opportunities, self-belief, and academic/intellectual development in The Data Mine?

Qualitative guiding RQ 2: How does students' participation in activities and interaction with peers, faculty, and staff at The Data Mine contribute to becoming an experienced member of the learning community?

- Sub-RQ 2(a): What are the perceived benefits and challenges of participating in The Data Mine?
- Sub-RQ 2(b): How do students describe their levels of socialization and sense of belonging within The Data Mine?
- Sub-RQ 2(c): How do students' participation and interaction in The Data Mine help them form their identity?

To approach these research questions, we conducted a sequential explanatory mixed methods study to understand the growth journey of students in terms of socialization, sense of belonging, and identity formation. The data were collected in two phases: a quantitative survey study followed by qualitative semi-structured interviews. The quantitative data were analyzed using descriptive and inferential statistics, and the qualitative data were analyzed using thematic analysis, followed by narrative analysis. The results of the quantitative and qualitative analyses demonstrated that learning in The Data Mine happened through interaction and socialization of the students with faculty, staff, and peers. Students found multiple opportunities to learn and develop data science skills, such as working on real-world projects or working in groups. This continuous interaction with peers, faculty, and staff at The Data Mine helped them learn and develop identities. The study revealed that students did develop a data science identity, and that the corporate partner TAs developed a leader identity along with the data science identity. In summary, all students grew and served as mentors, guides, and role models for new incoming students.
87

Intraday Algorithmic Trading using Momentum and Long Short-Term Memory network strategies

Whitinger, Andrew R., II 01 May 2022 (has links)
Intraday stock trading is infamously difficult and risky. Momentum and reversal strategies and long short-term memory (LSTM) neural networks have been shown to be effective for selecting stocks to buy and sell over time periods of multiple days. To explore whether these strategies can be effective for intraday trading, their implementations were simulated using intraday price data for stocks in the S&P 500 index, collected at 1-second intervals between February 11, 2021 and March 9, 2021 inclusive. The study tested 160 variations of momentum and reversal strategies for profitability in long, short, and market-neutral portfolios, totaling 480 portfolios. Long and short portfolios for each strategy were also compared to the market to observe excess returns. Eight reversal portfolios yielded statistically significant profits, and 16 yielded significant excess returns. Tests of these strategies on another set of 16 days failed to yield statistically significant returns, though average returns remained profitable. Four LSTM network configurations were tested on the same original set of days, with no strategy yielding statistically significant returns. Close examination of the stocks chosen by the LSTM networks suggests that the networks expect stocks to exhibit a momentum effect. Further studies may explore whether an intraday reversal effect can be observed over time during different market conditions and whether different configurations of LSTM networks can generate significant returns.
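As a hedged illustration of the cross-sectional momentum and reversal construction (none of the thesis's 160 variations, parameters, or the S&P 500 tick data are reproduced here), the sketch below forms 10-minute momentum and reversal portfolios from synthetic 1-second prices:

```python
# Illustrative sketch only: synthetic prices, arbitrary formation/holding windows.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_stocks, n_ticks = 50, 23_400                       # one trading day at 1-second bars
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 2e-4, (n_ticks, n_stocks)), axis=0)),
    columns=[f"S{i:03d}" for i in range(n_stocks)],
)

lookback, hold = 600, 600                            # 10-minute formation and holding windows
formation = prices.iloc[lookback] / prices.iloc[0] - 1           # return over formation window
future = prices.iloc[lookback + hold] / prices.iloc[lookback] - 1

k = 5
winners = formation.nlargest(k).index                # momentum longs buy recent winners
losers = formation.nsmallest(k).index                # reversal longs buy recent losers
print("momentum long return:", future[winners].mean())
print("reversal long return:", future[losers].mean())
print("market-neutral (reversal):", future[losers].mean() - future[winners].mean())
```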
88

Detection of 3D Genome Folding at Multiple Scales

Akgol-Oksuz, Betul 13 April 2022 (has links)
Understanding 3D genome structure is crucial to learn how chromatin folds and how genes are regulated through the spatial organization of regulatory elements. Various technologies have been developed to investigate genome architecture. These include ligation-based 3C methodologies such as Hi-C and Micro-C, ligation-based pull-down methods such as proximity ligation-assisted ChIP-seq (PLAC-seq) and paired-end tag sequencing (ChIA-PET), and ligation-free methods such as Split-Pool Recognition of Interactions by Tag Extension (SPRITE) and Genome Architecture Mapping (GAM). Although these technologies have provided great insight into chromatin organization, a systematic evaluation of them has been lacking. Among these technologies, Hi-C has been one of the most widely used methods to map genome-wide chromatin interactions for over a decade. To understand how the choice of experimental parameters determines the ability to detect and quantify the features of chromosome folding, we first systematically evaluated two critical parameters in the Hi-C protocol: cross-linking and digestion of chromatin. We found that different protocols capture distinct 3D genome features with different efficiencies depending on the cell type (Chapter 2). The updated Hi-C protocol with new parameters, which we call Hi-C 3.0, was subsequently evaluated and found to provide the best loop detection of all previous Hi-C protocols, as well as better compartment quantification than Micro-C (Chapter 3). Finally, to understand how the aforementioned technologies (Hi-C, Micro-C, PLAC-seq, ChIA-PET, SPRITE, GAM) that measure 3D organization can together provide a comprehensive understanding of genome structure, we performed a comparison of these technologies. We found that each method captures different aspects of chromatin folding (Chapter 4). Collectively, these studies suggest that improving 3D methodologies and integrative analyses of these methods will reveal unprecedented details of genome structure and function.
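For readers unfamiliar with how folding features are quantified from such data, the sketch below shows one standard, generic calculation, A/B compartment assignment from the leading eigenvector of an observed/expected correlation matrix, on a synthetic contact map; it is not the analysis pipeline used in these chapters.

```python
# Hedged sketch of a generic A/B compartment calculation on a synthetic Hi-C-like
# contact matrix, not the thesis's analysis pipeline.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                              # genomic bins along one chromosome
comp = np.sign(np.sin(np.arange(n) / 15))            # planted alternating A/B pattern
expected = 1.0 / (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) + 1)
observed = expected * (1 + 0.3 * np.outer(comp, comp)) * rng.lognormal(0, 0.1, (n, n))
observed = (observed + observed.T) / 2               # contact maps are symmetric

oe = observed / expected                             # distance-normalized (observed/expected)
corr = np.corrcoef(oe)                               # correlation between bin contact profiles
vals, vecs = np.linalg.eigh(corr)
ev1 = vecs[:, np.argmax(vals)]                       # leading eigenvector = compartment track

# The sign of the eigenvector partitions bins into A-like and B-like compartments.
agreement = np.mean(np.sign(ev1) == comp)
print("agreement with planted compartments:", max(agreement, 1 - agreement))
```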
89

Model-based assessments of freshwater ecosystems and species under climate change

Kärcher, Oskar 14 October 2019 (has links)
Climate change, global warming, and anthropogenic disturbances are threatening freshwater ecosystems globally. Protecting and preserving freshwater environments, their biodiversity, and all of their services for human well-being requires comprehensive knowledge of the impacts that climate change and anthropogenic disturbances have on freshwaters and freshwater species. The in-depth knowledge needed for conservation strategies can be established through versatile assessments. Quantitative assessments and the investigation of prevailing environmental relationships within ecosystems constitute the basis for sustaining freshwater systems. However, it is a great challenge to quantify the multifaceted effects of climate change and to broaden the understanding of complex environmental relationships. This thesis aims to extend the understanding of climate change impacts on freshwater ecosystems and environmental relationships, and thereby to provide useful guidelines for the protection and preservation of freshwaters. To this end, various statistical approaches based on comprehensive data sets are applied at different scales, ranging from local to global assessments.

In particular, five research studies are presented to address the effects of environmental change, investigating (1) water quality-nutrient and temperature relationships in European lakes, (2) drivers of freshwater fish species distributions across varying scales in the Danube River delta, (3) globally derived thermal response curves and thermal properties of native European freshwater species, (4) differences between thermal properties derived from native and global range data, and (5) thermal performances of freshwater fish species for different life stages and different global future dispersal scenarios.

The main results of this thesis concern various aspects of conservation implications and planning. (i) The first study outlines drivers influencing water quality by studying multi-dimensional relationships and compares different modelling techniques in order to identify models suitable for detecting complex driver interactions. (ii) The second study addresses scale effects on the performance of species distribution models, which are commonly used for assessments of climate change impacts, and identifies key predictors driving distributions for the varying scales and studied species. (iii) The third study parameterizes thermal responses of species from different taxonomic groups and assesses potential resilience in terms of warming tolerance and additional thermal properties, as well as the influence of future rising temperatures on current distributions. (iv) The fourth study quantifies the differences in thermal response curves and thermal properties for freshwater fishes derived from global and continental data in order to clarify the need for using global range data in studies making suggestions for conservation planning. (v) The last study estimates the impact of changing climatic conditions on the distribution ranges of two fish species for different time periods by including biotic information about thermal performances for various life stages.

Overall, this thesis contributes to the broad field of studying the consequences and impacts of climate change on freshwater ecosystems. By applying statistical methods tailored to the underlying investigations, useful implications for conservation planning are derived.
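As a rough sketch of what parameterizing a thermal response curve (as in the third study) can look like, the example below fits an assumed Gaussian response to synthetic occurrence-versus-temperature data; the species data, curve form, and the 12 °C habitat mean are illustrative only.

```python
# Hedged sketch of fitting a simple Gaussian thermal response curve to synthetic
# occurrence data; not the thesis's actual species data or response-curve form.
import numpy as np
from scipy.optimize import curve_fit

def thermal_response(t, peak, t_opt, breadth):
    """Occurrence/abundance response peaking at the thermal optimum t_opt."""
    return peak * np.exp(-((t - t_opt) ** 2) / (2 * breadth ** 2))

rng = np.random.default_rng(0)
temps = np.linspace(2, 28, 60)                                   # water temperature (deg C)
obs = thermal_response(temps, 0.8, 16.0, 4.0) + rng.normal(0, 0.05, temps.size)

params, _ = curve_fit(thermal_response, temps, obs, p0=[0.5, 15.0, 5.0])
peak, t_opt, breadth = params
print(f"thermal optimum ~ {t_opt:.1f} C, breadth ~ {breadth:.1f} C")

# A crude "warming tolerance" proxy: optimum minus current mean habitat temperature.
print("warming tolerance proxy:", t_opt - 12.0)   # 12 C is an assumed habitat mean
```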
90

Comparing machine learning models and physics-based models in groundwater science

Boerman, Thomas Christiaan 25 January 2022 (has links)
The use of machine learning techniques in tackling hydrological problems has significantly increased over the last decade. Machine learning tools can provide alternatives or surrogates to complex and comprehensive methodologies such as physics-based numerical models. Machine learning algorithms have been used in hydrology for estimating streamflow, runoff, water table fluctuations and calculating the impacts of climate change on nutrient loading among many other applications. In recent years we have also seen arguments for and advances in combining physics-based models and machine learning algorithms for mutual benefit. This thesis contributes to these advances by addressing two different groundwater problems by developing a machine learning approach and comparing this previously developed physics-based models: i) estimating groundwater and surface water depletion caused by groundwater pumping using artificial neural networks and ii) estimating a global steady-state map of water table depth using random forests. The first chapter of this thesis outlines the purpose of this thesis and how this thesis is a contribution to the overall scientific knowledge on the topic. The results of this research contribute to three of the twenty-three major unsolved problems in hydrology, as has been summarized by a collective of hundreds of hydrologists. In the second chapter, we tested the potential of artificial neural networks (ANNs), a deeplearning tool, as an alternative method for estimating source water of groundwater abstraction compared to conventional methods (analytical solutions and numerical models). Surrogate ANN models of three previously calibrated numerical groundwater models were developed using hydrologically meaningful input parameters (e.g., well-stream distance and hydraulic diffusivity) selected by predictor parameter optimization, combining hydrological expertise and statistical methodologies (ANCOVA). The output parameters were three transient sources of groundwater abstraction (shallow and deep storage release, and local surface-water depletion). We found that the optimized ANNs have a predictive skill of up to 0.84 (R2, 2σ = ± 0.03) when predicting water sources compared to physics-based numerical (MODFLOW) models. Optimal ANN skill was obtained when using between five and seven predictor parameters, with hydraulic diffusivity and mean aquifer thickness being the most important predictor parameters. Even though initial results are promising and computationally frugal, we found that the deep learning models were not yet sufficient or outperforming numerical model simulations. The third chapter used random forests in mapping steady-state water table depth on a global scale (0.1°-spatial resolution) and to integrate the results to improve our understanding on scale and perceptual modeling of global water table depth. In this study we used a spatially biased ~1.5-million-point database of water table depth observations with a variety of iv globally distributed above- and below-ground predictor variables with causal relationships to steady-state water table depth. We mapped water table depth globally as well as at regional to continental scales to interrogate performance, feature importance and hydrologic process across scales and regions with varying hydrogeological landscapes and climates. The global water table depth map has a correlation (cross validation error) of R2 = 0.72 while our highest continental correlation map (Australia) has a correlation of R2 = 0.86. 
The results of this study surprisingly show that above-ground variables such as surface elevation, slope, drainage density and precipitation are among the most important predictor parameters while subsurface parameters such as permeability and porosity are notably less important. This is contrary to conventional thought among hydrogeologists, who would assume that subsurface parameters are very important. Machine learning results overall underestimate water table depth similar to existing global physics-based groundwater models which also have comparable differences between existing physics-based groundwater models themselves. The feature importance derived from our random forest models was used to develop alternative perceptual models that highlight different water table depth controls between areas with low relief and high relief. Finally, we considered the representativeness of the prediction domain and the predictor database and found that 90% of the prediction domain has a dissimilarity index lower than 0.75. We conclude that we see good extrapolation potential for our random forest models to regions with unknown water table depth, except for some high elevation regions. Finally in chapter four, the most important findings of chapters two and three are considered as contributions to the unresolved questions in hydrology. Overall, this thesis has contributed to advancing hydrological sciences through: i) mapping of global steady-state water table depth using machine learning; ii) advancing hybrid modeling by using synthetic data derived from physics-based models to train an artificial neural network for estimating storage depletion; and (iii) it contributing to answering three unsolved problems in hydrology involving themes of parameter scaling across temporal and spatial scales, extracting hydrological insight from data, the use of innovative modeling techniques to estimate hydrological fluxes/states and extrapolation of models to no-data regions. / Graduate
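A hedged, down-scaled sketch of the chapter-three workflow, with synthetic predictors standing in for the global database: fit a random forest regressor, report cross-validated R², and rank feature importances.

```python
# Illustrative sketch only: synthetic terrain/climate predictors stand in for the
# thesis's ~1.5-million-point global dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({
    "elevation": rng.uniform(0, 3000, n),
    "slope": rng.uniform(0, 30, n),
    "precipitation": rng.uniform(100, 2500, n),
    "drainage_density": rng.uniform(0, 5, n),
    "permeability": rng.lognormal(-13, 2, n),
    "porosity": rng.uniform(0.05, 0.4, n),
})
# Synthetic water table depth dominated by above-ground variables, echoing the study's finding.
y = (0.01 * X["elevation"] + 0.5 * X["slope"]
     - 0.005 * X["precipitation"] + rng.normal(0, 2, n))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
print("cross-validated R2:", cross_val_score(rf, X, y, cv=5, scoring="r2").mean())

rf.fit(X, y)
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:>17s}: {imp:.3f}")
```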
