231

Particulate Matter Matters

Meyer, Holger J., Gruner, Hannes, Waizenegger, Tim, Woltmann, Lucas, Hartmann, Claudio, Lehner, Wolfgang, Esmailoghli, Mahdi, Redyuk, Sergey, Martinez, Ricardo, Abedjan, Ziawasch, Ziehn, Ariane, Rabl, Tilmann, Markl, Volker, Schmitz, Christian, Serai, Dhiren Devinder, Gava, Tatiane Escobar 15 June 2023 (has links)
For the second time, the Data Science Challenge took place as part of the 18th symposium “Database Systems for Business, Technology and Web” (BTW) of the Gesellschaft für Informatik (GI). The challenge was organized by the University of Rostock and sponsored by IBM and SAP. This year, the focus of the challenge was the integration, analysis, and visualization of data on particulate matter pollution. After a preselection round, the accepted participants had one month to adapt their approach to a well-founded problem, the real challenge. The final presentations took place at BTW 2019 in front of the prize jury and the attending audience. In this article, we give a brief overview of the schedule and organization of the Data Science Challenge. In addition, the participants present the problem to be solved and their solutions.
232

New Spatio-temporal Hawkes Process Models For Social Good

Wen-Hao Chiang (12476658) 28 April 2022 (has links)
As more and more datasets with self-exciting properties become available, the demand for robust models that capture contagion across events is also getting stronger. Hawkes processes stand out given their ability to capture a wide range of contagion and self-excitation patterns, including the transmission of infectious disease, earthquake aftershock distributions, near-repeat crime patterns, and overdose clusters. The Hawkes process is flexible in modeling these various applications through parametric and non-parametric kernels that model event dependencies in space, time, and on networks.

In this thesis, we develop new frameworks that integrate Hawkes process models with multi-armed bandit algorithms, high-dimensional marks, and high-dimensional auxiliary data to solve problems in search and rescue, forecasting infectious disease, and early detection of overdose spikes.

In Chapter 3, we develop a method with applications to the crisis of increasing overdose mortality over the last decade. We first encode the molecular substructures found in a drug overdose toxicology report. We then cluster these overdose encodings into different overdose categories and model these categories with spatio-temporal multivariate Hawkes processes. Our results demonstrate that the proposed methodology can improve estimation of the magnitude of an overdose spike based on the substances found in an initial overdose.

In Chapter 4, we build a framework for multi-armed bandit problems arising in event detection where the underlying process is self-exciting. We derive the expected number of events for Hawkes processes given a parametric model for the intensity and then analyze the regret bound of a Hawkes process UCB-normal algorithm. By introducing Hawkes process modeling into the upper confidence bound construction, our models can detect more events of interest under the multi-armed bandit problem setting. We apply the Hawkes bandit model to spatio-temporal data on crime events and earthquake aftershocks. We show that the model can quickly learn to detect hotspot regions, when events are unobserved, while striking a balance between exploitation and exploration.

In Chapter 5, we present a new spatio-temporal framework for integrating Hawkes processes with multi-armed bandit algorithms. Compared to the methods proposed in Chapter 4, the upper confidence bound is constructed through Bayesian estimation of a spatial Hawkes process to balance the trade-off between exploiting and exploring geographic regions. The model is validated on simulated datasets and on real-world datasets such as flooding events and improvised explosive device (IED) attack records. The experimental results show that our model outperforms baseline spatial MAB algorithms on reward and ranking metrics.

In Chapter 6, we demonstrate that the Hawkes process is a powerful tool for modeling infectious disease transmission. We develop models using Hawkes processes with spatio-temporal covariates to forecast COVID-19 transmission at the county level. In the proposed framework, we show how to estimate the dynamic reproduction number of the virus within an EM algorithm through a regression on Google mobility indices. We also include demographic covariates as spatial information to enhance accuracy. The approach is tested on both short-term and long-term forecasting tasks. The results show that the Hawkes process outperforms several benchmark models published in a public forecast repository. The model also provides insights into the covariates and mobility patterns that impact COVID-19 transmission in the U.S.

Finally, in Chapter 7, we discuss implications of the research and future research directions.
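Since every chapter above builds on the Hawkes conditional intensity, lambda(t) = mu + alpha * sum over past events t_i < t of exp(-beta * (t - t_i)), a minimal illustration may help. The sketch below simulates a univariate Hawkes process with an exponential kernel using Ogata's thinning algorithm. It is a generic textbook construction, not the author's code, and all parameter values are illustrative.

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, T, seed=None):
    """Simulate a univariate Hawkes process on [0, T] via Ogata's thinning.

    Conditional intensity (exponential kernel):
        lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))
    Stationarity requires alpha / beta < 1.
    """
    rng = np.random.default_rng(seed)
    events = []
    t = 0.0

    def intensity(s):
        past = np.asarray(events)
        return mu + alpha * np.exp(-beta * (s - past)).sum()  # empty sum is 0

    while t < T:
        lam_bar = intensity(t)  # valid upper bound: intensity decays until the next event
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        if rng.uniform() <= intensity(t) / lam_bar:  # accept with prob lambda(t)/lam_bar
            events.append(t)
    return np.array(events)

times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=100.0, seed=42)
print(len(times), "events; stationary rate mu/(1-alpha/beta) =",
      round(0.5 / (1 - 0.8 / 1.2), 2))
```

The accept/reject step is what makes each event raise the probability of further events, the self-excitation that the thesis exploits for overdose spikes, aftershocks, and disease transmission.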
233

Predicting Carbon Dioxide Levels and Occupancy with Machine Learning and Environmental Data

Datunaishvili, Giorgi, Khederchah, Christian, Li, Henrik, Salazar, Kevin January 2022 (has links)
Buildings account for the majority of the world’s energy usage through heating, ventilation, and cooling, and these systems are rarely regulated efficiently or effectively: lights and heating are often left on in empty spaces, leading to waste. This project’s goal is to mitigate this wastefulness by implementing self-powered environmental sensors whose readings can be used to predict carbon dioxide levels and occupancy; these values can then be used to regulate spaces accordingly. The approach chosen was machine learning, which was used to generate prediction models. Several methods were tried, including Gaussian process regression and tree-based algorithms, and Gaussian process regression turned out to be the most effective for this case. Using the accumulated sensor data, a model was built to predict carbon dioxide values from humidity, temperature, and pressure, achieving an accuracy above 90%. The model for predicting occupancy levels had significantly lower accuracy. The reason the carbon dioxide model succeeded while the occupancy model did not lies in the small size of the training dataset: carbon dioxide values showed greater variance between data points, while the occupancy dataset contained mostly ones and zeros. As a consequence, the occupancy model requires a longer training period to reach high accuracy and precision, whereas the carbon dioxide model converges with fewer data points because its data has higher variance.
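As a hedged sketch of the kind of model the abstract describes, the snippet below fits a Gaussian process regression from humidity, temperature, and pressure to CO2 using scikit-learn. The data is a synthetic stand-in; the thesis's real sensor logs, kernel choice, and preprocessing are not given in the abstract.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for sensor data: humidity (%), temperature (C),
# pressure (hPa) -> CO2 (ppm). The real relationship is learned from logs.
n = 400
X = np.column_stack([
    rng.uniform(20, 70, n),     # humidity
    rng.uniform(18, 26, n),     # temperature
    rng.uniform(990, 1025, n),  # pressure
])
y = 400 + 4.0 * X[:, 0] + 12.0 * (X[:, 1] - 20) + rng.normal(0, 15, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Anisotropic RBF kernel (one length scale per feature) plus a learned noise level.
kernel = RBF(length_scale=[10.0, 2.0, 5.0]) + WhiteKernel(noise_level=100.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_tr, y_tr)

print("R^2 on held-out data:", round(gpr.score(X_te, y_te), 3))
mean, std = gpr.predict(X_te[:3], return_std=True)  # predictive uncertainty
print(mean.round(1), std.round(1))
```

The predictive standard deviation is one reason GPR suits sensor settings like this: a controller can ignore predictions whose uncertainty is too high.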
234

IDE-based learning analytics for assessing introductory programming skill

Beck, Phyllis J. 08 August 2023 (has links) (PDF)
Providing personalized feedback on students’ current level of strategic knowledge within the natural programming environment, through IDE-based learning analytics, would transform learning outcomes for introductory programming students. However, sufficient insight into the programming process was previously inaccessible because of the lack of complex, scalable data collection methods and of a wide enough variety of metrics for understanding programming metacognition and the full programming process. This research developed a custom-built web-based IDE and event compression system to investigate two of the five components of a five-dimensional model of cognition for programming skill estimation: (1) Design Cohesion and (2) Development Path over Time. The IDE captured the programming process data of 25 participants, each of whom took part in two programming sessions that required both a design phase and a code phase. For Design Cohesion, the alignment between flowchart design and source code implementation was investigated and manually classified. The classification process produced three Design Cohesion metrics: Design Cohesion Level, Granularity Level, and Granularity Score. The relationship between programming skill and Design Cohesion was explored using the newly developed metrics and a case-study approach. For Development Path over Time, the compressed programming events were used to create a Timeline of Events for each participant, which was manually examined for distinct clusters of programming patterns and behavior, such as execution behavior and debugging patterns. Custom visualizations were developed to display the timelines, which were then used to compare the programming behaviors of participants with different skill levels. The results of the investigation into Design Cohesion and Development Path over Time contribute to the fundamental understanding of the differences between beginner, intermediate, and advanced programmers and of the contexts in which specific programming difficulties arise. This work produced insight into students’ programming processes that can be used to advance the model of cognition for programming skill estimation and to provide personalized feedback that supports the development of programming skills and expertise. Additionally, this research produced tools and metrics that can be used in future studies examining programming metacognition.
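The abstract does not spell out how the event compression system works; the sketch below is one plausible, purely illustrative reading, collapsing bursts of consecutive keystroke events into single timeline entries while passing other IDE events through. The Event type, the gap threshold, and the burst representation are all assumptions, not the dissertation's actual rules.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str   # e.g. "keystroke", "run", "save" -- hypothetical event kinds
    t: float    # timestamp in seconds

def compress(events, gap=2.0):
    """Collapse consecutive keystrokes less than `gap` seconds apart into one
    'typing' burst; pass all other events through unchanged."""
    timeline, burst = [], []
    for ev in events:
        if ev.kind == "keystroke" and (not burst or ev.t - burst[-1].t < gap):
            burst.append(ev)
            continue
        if burst:  # flush the open burst as a single timeline entry
            timeline.append(("typing", burst[0].t, burst[-1].t, len(burst)))
            burst = []
        if ev.kind == "keystroke":   # keystroke after a long gap starts a new burst
            burst.append(ev)
        else:
            timeline.append((ev.kind, ev.t, ev.t, 1))
    if burst:
        timeline.append(("typing", burst[0].t, burst[-1].t, len(burst)))
    return timeline

raw = [Event("keystroke", t) for t in (0.0, 0.4, 1.1)] + [Event("run", 5.0)]
print(compress(raw))  # [('typing', 0.0, 1.1, 3), ('run', 5.0, 5.0, 1)]
```

Compression like this is what makes a per-participant Timeline of Events readable: thousands of raw keystrokes become a handful of behavioral episodes that can be inspected for execution and debugging patterns.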
235

GENERATIVE MODELS WITH MARGINAL CONSTRAINTS

Bingjing Tang (16380291) 16 June 2023 (has links)
Generative models form powerful tools for learning data distributions and simulating new samples. Recent years have seen significant advances in the flexibility and applicability of such models, with Bayesian approaches like nonparametric Bayesian models and deep neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) finding use in a wide range of domains. However, the black-box nature of these models means that they are often hard to interpret, and they often come with modeling implications that are inconsistent with side knowledge derived from domain expertise. This thesis studies situations where the modeler has side knowledge represented as probability distributions on functionals of the objects being modeled, and we study methods to incorporate this particular kind of side knowledge into flexible generative models. The dissertation covers three main parts.

The first part focuses on incorporating a special case of the aforementioned side knowledge into flexible nonparametric Bayesian models. Practitioners often have additional distributional information about a subset of the coordinates of the observations being modeled, and the flexibility of nonparametric Bayesian models usually implies incompatibility with this side information. Such inconsistency makes it necessary to develop methods that build this side knowledge into flexible nonparametric Bayesian models. We design a specialized generative process to build in this side knowledge and propose a novel sigmoid Gaussian process conditional model. We also develop a corresponding posterior sampling method based on data augmentation to overcome a doubly intractable problem. We illustrate the efficacy of our proposed constrained nonparametric Bayesian model in a variety of real-world scenarios, including modeling environmental and earthquake data.

The second part of the dissertation discusses neural network approaches to satisfying the general side knowledge described above; here, the generative models considered broaden into black-box models. We formulate the side knowledge incorporation problem as a constrained divergence minimization problem and propose two scalable neural network approaches as its solution. We demonstrate their practicality using various synthetic and real examples.

The third part of the dissertation concentrates on a specific generative model of the individual pixels of fMRI data constructed from a latent group image. There is usually two-fold side knowledge about the latent group image: spatial structure and partial activation zones. The former can be captured by modeling the prior for the group image with Markov random fields; the latter, which is often obtained from previous related studies, is left for future research. We propose a novel Bayesian model with Markov random fields and aim to estimate the maximum a posteriori (MAP) group image. We also derive a variational Bayes algorithm to overcome local optima in the optimization.
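As a loose illustration of the second part's idea, constraining a marginal of a black-box generative model, the sketch below trains a toy generator with a penalty that pushes the first output coordinate toward a side-knowledge mean and variance. Moment matching here stands in for the divergences actually minimized in the dissertation, which the abstract does not specify; every name and number is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy generator: 4-D noise -> 2-D samples.
gen = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(gen.parameters(), lr=1e-2)

# Observed data whose first coordinate disagrees slightly with the side
# knowledge below; the constraint term should win on that marginal.
data = torch.randn(1024, 2) * torch.tensor([1.3, 0.5]) + torch.tensor([2.5, -1.0])

def moment_gap(a, b_mean, b_var):
    # Crude two-moment discrepancy standing in for a proper divergence.
    return (a.mean() - b_mean) ** 2 + (a.var() - b_var) ** 2

side_mean, side_var = 2.0, 1.0  # side knowledge about coordinate 0
lam = 10.0                      # weight of the marginal constraint

for step in range(1500):
    x = gen(torch.randn(256, 4))
    fit = sum(moment_gap(x[:, j], data[:, j].mean(), data[:, j].var())
              for j in range(2))                  # crude fit to both data marginals
    constraint = moment_gap(x[:, 0], side_mean, side_var)
    loss = fit + lam * constraint
    opt.zero_grad(); loss.backward(); opt.step()

print(x[:, 0].mean().item(), x[:, 0].var().item())  # pulled toward side knowledge
```

The penalized objective mirrors the constrained-divergence formulation: a data-fit term plus a term that forces a chosen functional of the generated distribution toward its known distribution.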
236

Aplicación de ciencia de datos para incrementar la efectividad del número de operaciones de la base de clientes tácticos de Mibanco - Agencia Zárate

Bravo España, Ana María, Chacón Chávez, Verónica Magaly, Flores Chumpitaz, María Isabel, Mamani Gutiérrez, Miguel Hilarión, Toranzo Pellanne, María Pía 15 July 2021 (has links)
This research work seeks to analyze new strategies for increasing the effectiveness of the number of operations in the tactical client base of Mibanco’s Zárate agency, located in the district of San Juan de Lurigancho, based on commercial client segmentation. The data science research methodology consists of 10 stages, from data understanding to feedback. A predictive analytical model is applied: Mibanco’s historical information is analyzed, and from it the problem of the low effectiveness of the Zárate agency’s tactical base is identified. Next, possible solutions based on the data science model and the hypothesis are presented. An exploratory data analysis (EDA) is performed to understand and prepare the data through visualizations, and the tools to be used in the project are described. A data architecture is established based on Mibanco’s current functionality and structure. A supervised learning technique is then employed: a classification model based on the decision tree algorithm. Additionally, the results of the data science model are shown, centered on finding the formula for success in identifying the ideal profiles of pre-approved clients in the customer base. Finally, strategies for implementing the data science model at Mibanco were established. / Research work
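As an illustrative sketch of the supervised classification step described above, the snippet below fits a shallow decision tree with scikit-learn. The features (monto_preaprobado, antiguedad_meses, segmento) and target (tomo_credito) are hypothetical stand-ins, since the abstract does not disclose Mibanco's actual variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Hypothetical stand-in for the pre-approved client base.
df = pd.DataFrame({
    "monto_preaprobado": [1500, 800, 3000, 500, 2200, 950, 4000, 700] * 25,
    "antiguedad_meses":  [24, 6, 48, 3, 36, 12, 60, 8] * 25,
    "segmento":          [2, 1, 3, 1, 2, 1, 3, 1] * 25,
    "tomo_credito":      [1, 0, 1, 0, 1, 0, 1, 0] * 25,  # target: took the loan
})
X, y = df.drop(columns="tomo_credito"), df["tomo_credito"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A shallow tree yields interpretable rules the agency can act on.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

The depth limit is deliberate: in a commercial setting the point of a decision tree is that each leaf reads as a plain-language client profile.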
237

[pt] ENSAIOS EM PREDIÇÃO DO TEMPO DE PERMANÊNCIA EM UNIDADES DE TERAPIA INTENSIVA / [en] ESSAYS ON LENGTH OF STAY PREDICTION IN INTENSIVE CARE UNITS

IGOR TONA PERES 28 June 2021 (has links)
The length of stay (LoS) is one of the most used metrics for evaluating resource use in Intensive Care Units (ICUs). This thesis proposes a structured data-driven methodology to address three main demands of ICU managers. First, we propose a model to predict an individual patient’s ICU length of stay, which can be used to plan the number of beds and staff required. Second, we develop a model to predict the risk of prolonged stay, which helps identify prolonged-stay patients early so that quality improvement actions can be targeted at them. Finally, we build a case-mix-adjusted efficiency measure (SLOSR) capable of performing unbiased benchmarking analyses between ICUs. To achieve these objectives, we divided the thesis into the following specific goals: (i) to perform a literature review and meta-analysis of the factors that predict a patient’s LoS in the ICU; (ii) to propose a data-driven methodology to predict numeric ICU LoS and the risk of prolonged stay; and (iii) to apply this methodology to a large set of ICUs from mixed-type hospitals. The literature review identified the main risk factors that should be considered in future prediction models. Regarding the predictive models, we applied and validated the proposed methodology on a dataset of 109 ICUs from 38 different Brazilian hospitals, containing a total of 99,492 independent admissions from January 01 to December 31, 2019. The predictive models for numeric ICU LoS and for the risk of prolonged stay built with our data-driven methodology produced accurate results compared to the literature. The proposed models have the potential to improve resource planning and to identify prolonged-stay patients early in order to drive quality improvement actions. Moreover, we used the proposed prediction model to build an unbiased measure for ICU benchmarking, which was also validated on our dataset. Therefore, this thesis provides a structured data-driven guide to generating ICU LoS predictions adjusted to the specific environment being analyzed.
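A standardized LoS ratio in the spirit of the SLOSR measure can be sketched as observed over model-expected length of stay, aggregated per unit; the snippet below shows the idea on a hypothetical admissions table. This is an interpretation of the abstract, not the thesis's exact definition.

```python
import pandas as pd

# Hypothetical admissions table: one row per ICU stay, with the observed LoS
# and the LoS expected by a case-mix prediction model (days).
adm = pd.DataFrame({
    "icu":          ["A", "A", "A", "B", "B", "C", "C", "C"],
    "observed_los": [2.0, 5.0, 11.0, 3.0, 4.0, 1.5, 2.0, 2.5],
    "expected_los": [2.5, 4.0, 9.0, 4.5, 5.0, 2.0, 2.5, 3.0],
})

# Ratio of total observed to total expected LoS per unit. A value below 1
# suggests the ICU discharges faster than its case mix predicts; above 1,
# the opposite. Dividing by the model's expectation is what adjusts for case mix.
slosr = (adm.groupby("icu")["observed_los"].sum()
         / adm.groupby("icu")["expected_los"].sum()).rename("SLOSR")
print(slosr.sort_values())
```

Because the denominator comes from the prediction model, two ICUs with very different patient severities can be compared on the same scale, which is the point of a case-mix-adjusted benchmark.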
238

[pt] MINERANDO O PROCESSO DE UM COQUEAMENTO RETARDADO ATRAVÉS DE AGRUPAMENTO DE ESTADOS / [en] MINING THE PROCESS OF A DELAYED COKER USING CLUSTERED STATES

RAFAEL AUGUSTO GASETA FRANCA 25 November 2021 (has links)
Procedures and processes are essential to guarantee the quality of any operation. However, processes carried out in the real world are not always in accordance with the process as designed, and a more refined analysis of bottlenecks and inconsistencies is only possible from the record of process events (the log). Process mining is a field that brings together a set of methods to reconstruct, monitor, and improve a process from its event log. Nevertheless, when applying existing solutions to the log of a delayed coker unit, the results were unsatisfactory. The core of the problem is how the log is structured: it lacks a case identification, which is essential for process mining. To deal with this issue, we apply agglomerative hierarchical clustering to the log, separating the valves into groups that perform a function in the operation. We developed a tool (PLANTSTATE) to assess the quality of these groups in the context of the plant and to adjust them according to the needs of the domain. By identifying the moments at which these groups activate in the log, we arrive at a structure of sequence and parallelism between the groups. Finally, we propose a model capable of representing the relationships between the groups, resulting in a process that represents the operations of a delayed coker unit.
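A rough sketch of the clustering step: given per-valve open/closed signals extracted from the log, agglomerative clustering under a correlation distance groups valves that act together. The synthetic signals below stand in for the (non-public) coker unit log, and the linkage and distance choices are assumptions, not the thesis's documented settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Each row is one valve's open/closed signal sampled over time. Valves in the
# same functional group follow a shared base pattern, with a little noise.
base_a = rng.integers(0, 2, 200)  # group A pattern
base_b = rng.integers(0, 2, 200)  # group B pattern
valves = np.vstack([
    base_a,
    base_a,
    base_a ^ (rng.random(200) < 0.05),  # group A with 5% flipped samples
    base_b,
    base_b ^ (rng.random(200) < 0.05),  # group B with 5% flipped samples
])

# Correlation distance is small for valves that open and close together,
# which is what makes them a functional group in the operation.
dist = pdist(valves.astype(float), metric="correlation")
Z = linkage(dist, method="average")  # agglomerative, average linkage
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # e.g. [1 1 1 2 2]: two functional valve groups
```

Once each valve carries a group label, group activations in the log can play the role of the missing case identification, which is what unlocks standard process mining downstream.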
239

Implementation of Hierarchical and K-Means Clustering Techniques on the Trend and Seasonality Components of Temperature Profile Data

Ogedegbe, Emmanuel 01 December 2023 (has links) (PDF)
In this study, time series decomposition techniques are used in conjunction with two well-known clustering algorithms, K-means clustering and hierarchical clustering, and applied to climate data; their implementation and a comparison between them are then examined. The main objective is to identify similar climate trends and to group geographical areas with similar environmental conditions. Climate data from specific places are collected and analyzed as part of the project. Each time series is split into trend, seasonality, and residual components. To categorize growing regions according to their climatic tendencies, the decomposed time series are then subjected to K-means clustering and to hierarchical clustering with dynamic time warping. The resulting clusters are evaluated to understand how the climates of different regions compare to one another, and how regions cluster based on the general trend of the temperature profile over the full growing season as opposed to the seasonality component of the various locations.
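A minimal sketch of the pipeline on synthetic data: decompose each site's daily temperature series, keep only the trend component, and cluster sites with K-means. The study also uses hierarchical clustering with dynamic time warping, which is omitted here for brevity; site names, the season length, and the cluster count are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
days = pd.date_range("2022-04-01", periods=180, freq="D")  # a growing season

# Hypothetical daily temperatures for six sites: two warming, two flat, and
# two cooling trends on top of a shared weekly cycle plus noise.
slopes = [0.05, 0.06, 0.0, 0.0, -0.05, -0.04]
series = {
    f"site_{i}": 15 + s * np.arange(180)
    + 3 * np.sin(2 * np.pi * np.arange(180) / 7) + rng.normal(0, 1, 180)
    for i, s in enumerate(slopes)
}

# Decompose each series and keep only its trend component, as in the thesis.
trends = np.vstack([
    seasonal_decompose(pd.Series(v, index=days), period=7)
    .trend.dropna().to_numpy()
    for v in series.values()
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(trends)
print(dict(zip(series, labels)))  # sites grouped by the shape of their trend
```

Clustering the trend rather than the raw series is the key design choice: it groups regions by how the season evolves overall, insulated from the shared short-period seasonality.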
240

Node Classification on Relational Graphs Using Deep-RGCNs

Chandra, Nagasai 01 March 2021 (has links) (PDF)
Knowledge graphs are fascinating concepts in machine learning, as they can hold usefully structured information in the form of entities and their relations. Despite the valuable applications of such graphs, most knowledge bases remain incomplete. This missing information harms downstream applications such as information retrieval and opens a window for research on statistical relational learning tasks such as node classification and link prediction. This work proposes a deep learning framework based on existing relational graph convolutional (R-GCN) layers to learn, for node property classification tasks, from the highly multi-relational data characteristic of realistic knowledge graphs. We propose a deep and improved variant, Deep-RGCN, with dense and residual skip connections between layers; such skip connections are known to be very successful in popular deep CNN architectures such as ResNet and DenseNet. In our experiments, we investigate and compare the performance of Deep-RGCN against different baselines on the multi-relational graph benchmark datasets AIFB and MUTAG, and show how the deep architecture boosts performance in the task of node property classification. We also study the training behavior of Deep-RGCNs (with N layers) and discuss the gradient vanishing and over-smoothing problems common to deeper GCN architectures.
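A minimal sketch of the residual idea using PyTorch Geometric's RGCNConv is shown below; the thesis's exact Deep-RGCN wiring (dense connections, layer counts, regularization) is not reproduced, and the graph is random, so this only illustrates the skip-connection mechanics.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class DeepRGCN(torch.nn.Module):
    """R-GCN stack with identity skip connections between hidden layers,
    a sketch of the Deep-RGCN idea (not the thesis's exact architecture)."""

    def __init__(self, num_nodes, hidden, num_classes, num_relations, depth=6):
        super().__init__()
        self.emb = torch.nn.Embedding(num_nodes, hidden)  # featureless nodes
        self.convs = torch.nn.ModuleList(
            RGCNConv(hidden, hidden, num_relations) for _ in range(depth))
        self.out = torch.nn.Linear(hidden, num_classes)

    def forward(self, edge_index, edge_type):
        x = self.emb.weight
        for conv in self.convs:
            # Residual skip: keeps gradients flowing and counters
            # over-smoothing as depth grows.
            x = x + F.relu(conv(x, edge_index, edge_type))
        return self.out(x)

# Tiny smoke test on a random multi-relational graph.
n, r = 50, 4
edge_index = torch.randint(0, n, (2, 200))
edge_type = torch.randint(0, r, (200,))
model = DeepRGCN(num_nodes=n, hidden=16, num_classes=4, num_relations=r)
logits = model(edge_index, edge_type)
print(logits.shape)  # torch.Size([50, 4])
```

The `x + F.relu(conv(...))` line is the whole trick: without it, stacking many relational convolutions tends to wash node representations toward one another, the over-smoothing the abstract discusses.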
