101

Pneumonia Detection using Convolutional Neural Network

Pillutla Venkata Sathya, Rohit 02 June 2023 (has links)
No description available.
102

Three case studies of using hybrid model machine learning techniques in Educational Data Mining to improve the classification accuracies

Poudyal, Sujan 09 August 2022 (has links) (PDF)
A multitude of data is being produced by the growth of instructional technology, e-learning resources, and online courses. Educators can analyze this data to extract useful information that benefits both instructors and students. Educational Data Mining (EDM) extracts hidden information from data within the educational domain. In data mining, a hybrid method is a combination of multiple machine learning techniques. This dissertation explores the novel use of hybrid machine learning techniques in EDM through three educational case studies. First, given the importance of students' attention, on-task and off-task data were collected to analyze students' attention behavior. Two feature selection techniques, Principal Component Analysis and Linear Discriminant Analysis, were combined to improve the accuracy of classifying students' attention patterns. The relationship between attention and learning was also studied by calculating Pearson's correlation coefficient and its p-value. The focus then shifted to academic performance, which is central to ensuring a quality education. Two different 2D Convolutional Neural Network (CNN) models were concatenated into a single model to predict students' academic performance in terms of pass and fail. Lastly, the importance of using machine learning to maintain academic integrity in online learning was considered. Traditional machine learning algorithms were first used to predict cheaters in an online examination; a 1D CNN architecture was then used to extract features from the cheater dataset, and the previously used machine learning model was applied to the extracted features. This hybrid model outperformed both the traditional machine learning model and the CNN model used alone in terms of classification accuracy. The three studies reflect the application of hybrid machine learning in EDM. Classification accuracy matters in EDM because educational decisions are made based on model results, so hybrid methods were employed to increase it. This dissertation thus shows that hybrid models can be used in EDM to improve classification accuracies.
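
As a rough illustration of the PCA-plus-LDA hybrid described above, the following sketch chains the two reductions in front of a classifier. The synthetic data and the logistic-regression classifier are stand-ins, not the dissertation's actual setup.

```python
# Hedged sketch: a PCA -> LDA hybrid feature-reduction pipeline feeding a
# classifier, in the spirit of the attention-classification study above.
# The synthetic data and classifier choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for on-task/off-task attention features (binary labels).
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

hybrid = Pipeline([
    ("pca", PCA(n_components=10)),                        # unsupervised reduction
    ("lda", LinearDiscriminantAnalysis(n_components=1)),  # supervised projection
    ("clf", LogisticRegression()),
])
print("CV accuracy:", cross_val_score(hybrid, X, y, cv=5).mean())
```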
103

RISK INTERPRETATION OF DIFFERENTIAL PRIVACY

Jiajun Liang (13190613) 31 July 2023 (has links)
How to set privacy parameters is a crucial problem for the consistent application of differential privacy (DP) in practice. The current privacy parameters do not provide direct guidance for this problem. Moreover, different databases may leak different amounts of information, allowing attackers to enhance their attacks with what is available. This dissertation provides an additional interpretation of the current DP notions by introducing a framework that directly considers the worst-case average failure probability of attackers under different levels of knowledge.

To achieve this, we introduce a novel measure of attacker knowledge and establish a dual relationship between (type I error, type II error) and (prior, average failure probability). Leveraging this framework, we propose an interpretable paradigm for consistently setting privacy parameters on different databases with varying levels of leaked information.

Furthermore, we characterize the minimax limit of private parameter estimation, driven by $1/(n(1-2p))^2+1/n$, where $p$ represents the worst-case probability risk and $n$ is the number of data points. This characterization is more interpretable than the current lower bound $\min\{1/(n\epsilon^2),1/(n\delta^2)\}+1/n$ for $(\epsilon,\delta)$-DP. Additionally, we identify the phase transition of private parameter estimation based on this limit and provide suggestions for protocol designs that achieve optimal private estimation.

Lastly, we consider a federated learning setting where the data are stored in a distributed manner and privacy-preserving interactions are required. We extend the proposed interpretation to federated learning, considering two scenarios: protecting against privacy breaches targeting local nodes and against breaches targeting the center. Specifically, we consider a non-convex sparse federated parameter estimation problem and apply it to generalized linear models. We tackle two challenges in this setting. First, we address the issue of initialization arising from privacy requirements that limit the number of queries to the database. Second, we overcome heterogeneity in the distributions among local nodes to identify low-dimensional structures.
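
For orientation, the standard hypothesis-testing view of DP already exhibits the kind of duality the abstract describes. The following is an illustrative textbook sketch, not the dissertation's exact framework; the binary attacker and prior $p$ are assumptions made for the illustration.

```latex
% Illustrative sketch (standard hypothesis-testing view of DP, not the
% dissertation's framework). An attacker tests H0: database D against
% H1: neighboring database D', with type I error alpha, type II error beta.
% Pure epsilon-DP constrains any such test:
\[
  e^{\epsilon}\alpha + \beta \ge 1,
  \qquad
  \alpha + e^{\epsilon}\beta \ge 1 .
\]
% With prior p on H1, the attacker's average failure probability is
\[
  R(p) = (1-p)\,\alpha + p\,\beta,
\]
% and under the uniform prior the constraints above force
\[
  R(\tfrac{1}{2}) \ge \frac{1}{1+e^{\epsilon}},
\]
% so small epsilon keeps even an optimal attacker close to random guessing.
```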
104

Latent Data-Structures for Complex State Representation: A Steppingstone to Generating Synthetic 5G RAN data using Deep Learning

Häggström, Jakob January 2023 (has links)
The aim of this thesis is to investigate the feasibility of applying generative deep learning models to data related to 5G Radio Access Networks (5G RAN). Simulated data is used to develop the generative models, and this project serves as a proof of concept for further applications on real data. A Long Short-Term Memory network based Variational Autoencoder (VAE), a Regularised Autoencoder (RAE) with a Gaussian mixture prior, and a Gradient Penalty Wasserstein Generative Adversarial Network (GP-WGAN) are fitted using the collected dataset. Their performance is evaluated on their ability to generate samples that resemble the distribution and characteristics of the training data, and also in terms of usability. The results indicate that it is indeed feasible to generate synthetic data given the current dataset: the RAE and VAE seem to outperform the GP-WGAN in all tests, although there is no clear best performer between the RAE and VAE. Whether the current models work on real 5G RAN data is not known and is left for future work. Another topic of interest would be improving the current models with conditional generation or other types of architectures.
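
A minimal sketch of an LSTM-based VAE of the kind the thesis fits is given below. The layer sizes, toy tensor shapes, and unit loss weighting are assumptions, not the thesis's configuration.

```python
# Hedged sketch: an LSTM encoder/decoder VAE for sequence data.
import torch
import torch.nn as nn

class LSTMVAE(nn.Module):
    def __init__(self, n_feat=8, hidden=64, latent=16):
        super().__init__()
        self.enc = nn.LSTM(n_feat, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.z2h = nn.Linear(latent, hidden)
        self.dec = nn.LSTM(n_feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_feat)

    def forward(self, x):
        _, (h, _) = self.enc(x)                    # h: (1, batch, hidden)
        mu, logvar = self.mu(h[-1]), self.logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        h0 = self.z2h(z).unsqueeze(0)              # seed decoder state from z
        y, _ = self.dec(torch.zeros_like(x), (h0, torch.zeros_like(h0)))
        return self.out(y), mu, logvar

x = torch.randn(32, 20, 8)                         # (batch, time, features) toy data
model = LSTMVAE()
recon, mu, logvar = model(x)
rec = nn.functional.mse_loss(recon, x)             # reconstruction term
kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
print((rec + kld).item())                          # ELBO with unit weights
```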
105

Pragmatic Statistical Approaches for Power Analysis, Causal Inference, and Biomarker Detection

Fan Wu (16536675) 26 July 2023 (has links)
Mediation analyses play a critical role in social and personality psychology research. However, current approaches for assessing power and sample size in mediation models have limitations, particularly when dealing with complex mediation models and multiple mediator sequential models. These limitations stem from limited software options and the substantial computational time required. In this part, we address these challenges by extending the joint significance test and product of coefficients test to incorporate the fourth-pathed mediated effect and generalized kth-pathed mediated effect. Additionally, we propose a model-based bootstrap method and provide convenient R tools for estimating power in complex mediation models. Through our research, we demonstrate that power decreases as the number of mediators increases and as the influence of coefficients varies. We summarize our results and discuss the implications of power analysis in relation to mediator complexity and coefficient influence. We provide insights for researchers seeking to optimize study designs and enhance the reliability of their findings in complex mediation models.

Matching is a crucial step in causal inference, as it allows for more robust and reasonable analyses by creating better-matched pairs. However, in real-world scenarios, data are often collected and stored by different local institutions or separate departments, posing challenges for effective matching due to data fragmentation. Additionally, the harmonization of such data needs to prioritize privacy preservation. In this part, we propose a new hierarchical framework that addresses these challenges by implementing differential privacy on raw data to protect sensitive information while maintaining data utility. We also design a data access control system with three different access levels for designers based on their roles, ensuring secure and controlled access to the matched datasets. Simulation studies and analyses of datasets from the 2017 Atlantic Causal Inference Conference Data Challenge are conducted to showcase the flexibility and utility of our framework. Through this research, we contribute to the advancement of statistical methodologies in matching and privacy-preserving data analysis, offering a practical solution for data integration and privacy protection in causal inference studies.

Biomarker discovery is a complex and resource-intensive process, encompassing discovery, qualification, verification, and validation stages prior to clinical evaluation. Streamlining this process by efficiently identifying relevant biomarkers in the discovery phase holds immense value. In this part, we present a likelihood ratio-based approach to accurately identify truly relevant protein markers in discovery studies. Leveraging the observation of unimodal underlying distributions of expression profiles for irrelevant markers, our method demonstrates promising performance when evaluated on real experimental data. Additionally, to address non-normal scenarios, we introduce a kernel ratio-based approach, which we evaluate using non-normal simulation settings. Through extensive simulations, we observe the high effectiveness of the kernel method in discovering the set of truly relevant markers, resulting in precise biomarker identifications with elevated sensitivity and a low empirical false discovery rate.
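
As a hedged sketch of model-based power simulation for a simple one-mediator chain (not the paper's R tooling or its generalized kth-path setting), the following Monte Carlo loop estimates power for the joint significance test. Effect sizes, sample size, and replication count are illustrative assumptions.

```python
# Hedged sketch: Monte Carlo power estimate for x -> m -> y mediation via
# the joint significance test. The data-generating model has no direct
# x -> y effect, so regressing y on m alone suffices for the b-path here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def mediation_power(n=100, a=0.3, b=0.3, reps=2000, alpha=0.05):
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        m = a * x + rng.normal(size=n)        # a-path
        y = b * m + rng.normal(size=n)        # b-path
        pa = stats.linregress(x, m).pvalue
        pb = stats.linregress(m, y).pvalue
        hits += (pa < alpha) and (pb < alpha)  # both paths significant
    return hits / reps

print("estimated power:", mediation_power())
```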
106

Comparison of graph databases and relational databases performance

Asplund, Einar, Sandell, Johan January 2023 (has links)
There has been a paradigm shift in the way information is produced, processed, and consumed as a result of social media. When planning data storage, it is important to choose a database suited to the type of data, as unsuitable storage and analysis can have a noticeable impact on a system's energy consumption. Effective data analysis is also essential, because deficient analysis of a large dataset can lead to unsound decisions and inadequate planning. In recent years, an increasing number of organizations have provided services that can no longer be delivered efficiently using relational databases. An alternative is the graph database, a powerful solution for storing and searching relationship-dense data. The research question the thesis aims to answer is: how do state-of-the-art graph database and relational database technologies compare from a performance perspective in terms of time taken to query, CPU usage, memory usage, power usage, and temperature of the server? To answer this question, an experimental study using analysis of variance was performed. One relational database, MySQL, and two graph databases, ArangoDB and Neo4j, were compared using a benchmark, Novabench. The results of the analysis of variance, Kruskal-Wallis, and post-hoc tests show significant differences between the database technologies. The null hypothesis of no significant difference is therefore rejected in favor of the alternative: the technologies differ in time to query, CPU usage, memory usage, average energy usage, and temperature. In conclusion, the research question was answered: Neo4j was the fastest at executing queries, followed by MySQL, with ArangoDB last. The results also showed that MySQL demanded more memory than the other database technologies.
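
A minimal sketch of the Kruskal-Wallis comparison the thesis reports might look like the following; the latency samples are invented for illustration.

```python
# Hedged sketch: comparing per-query latency samples from three databases
# with the Kruskal-Wallis H test. All numbers are made up for illustration.
from scipy.stats import kruskal

neo4j  = [12.1, 11.8, 12.5, 11.9, 12.2]   # ms per query (illustrative)
mysql  = [14.0, 13.6, 14.3, 13.9, 14.1]
arango = [17.2, 16.8, 17.5, 17.0, 17.3]

h, p = kruskal(neo4j, mysql, arango)
print(f"H={h:.2f}, p={p:.4f}")  # p < 0.05 -> reject the 'no difference' null
```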
107

Modeling the Effects of Winter Storms on Power Infrastructure Systems in the Northern United States

Pino, Jordan Vick 30 September 2019 (has links)
No description available.
108

A FRAMEWORK TO AUTOMATE DATA SCIENCE TASKS THROUGH PERSONALIZED CHATBOTS

JEFRY SASTRE PEREZ 31 January 2022 (has links)
Several solutions have been created for automating specific data science scenarios and implementing personalized content in conversational interfaces. However, the overall picture of conversational interfaces that provide personalized suggestions to data scientists is still poorly explored. We identify the need to automate data science procedures up to different levels of automation. Our research focuses on helping data scientists automate these procedures using conversational interfaces. We propose a framework for creating a chatbot system to facilitate the automation of common data science scenarios, and we instantiate the framework in two of them: the first focuses on outlier detection, and the second on data cleaning. We conducted a study with 28 participants to demonstrate that data scientists can use the proposed framework. All participants completed the activities correctly, and 75 to 80 percent found the framework relatively easy to extend and use. Our analysis suggests that conversational interfaces can facilitate the automation of data science tasks.
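
As a toy illustration of the intent-to-task routing such a chatbot framework could use (the intent names, the IQR outlier rule, and the cleaning routine are assumptions, not the paper's implementation):

```python
# Hedged sketch: a minimal intent router mapping chat messages to the two
# instantiated scenarios, outlier detection and data cleaning.
import numpy as np

def detect_outliers(values):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences (assumption)
    return [v for v in values if v < lo or v > hi]

def drop_missing(rows):
    return [r for r in rows if all(v is not None for v in r)]

HANDLERS = {"detect outliers": detect_outliers, "clean data": drop_missing}

def chatbot(message, payload):
    handler = HANDLERS.get(message.strip().lower())
    return handler(payload) if handler else "Sorry, I don't know that task."

print(chatbot("Detect outliers", [1, 2, 2, 3, 2, 100]))  # -> [100]
```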
109

DevOps for Data Science System

Zhang, Zhongjian January 2020 (has links)
Commercialization potential is important to data science: whether the problems data science encounters in production can be solved determines the success or failure of its commercialization. Recent research shows that DevOps theory is an effective approach to solving the problems software engineering encounters in production, and from a product perspective, data science and software engineering both provide digital services to customers. It is therefore worth studying the feasibility of applying DevOps to data science. This paper describes an approach to developing a delivery pipeline for a data science system using DevOps practices. I applied four practices in the pipeline: version control, model serving, containerization, and continuous integration and delivery. However, DevOps is not a theory designed specifically for data science, so the currently available DevOps practices cannot cover all the problems data science faces in production. I expanded the set of DevOps practices with a data science practice, transfer learning, to handle that kind of problem. This paper describes an approach of parameter-based transfer, where parameters learned from one dataset are transferred to another dataset, and studies the effect of transfer learning on fitting a model to a new dataset. First, I trained a convolutional neural network on 10,000 images. Then I experimented with the trained model on another 10,000 images, retraining it in three ways: training from scratch, loading the trained weights, and freezing the convolutional layers. The results show that for image classification, when the dataset changes but remains similar to the old one, transfer learning is a useful practice for adjusting the model without retraining from scratch. Freezing the convolutional layers is a good choice if the new model only needs to reach a level of performance similar to the old one; loading weights is a better choice if the new model needs to outperform the original. In conclusion, there is no need to be limited by the set of existing DevOps practices when applying DevOps to data science.
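
A minimal Keras sketch of the three retraining regimes compared above; the tiny CNN, input shapes, and weight-file path are placeholders, not the thesis's architecture.

```python
# Hedged sketch: from-scratch, warm-start, and frozen-convolution retraining.
from tensorflow.keras import layers, models

def build_cnn():
    return models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])

# 1) From scratch: fresh random weights.
scratch = build_cnn()

# 2) Warm start: load weights trained on the first dataset, fine-tune everything.
warm = build_cnn()
# warm.load_weights("model_dataset1.h5")    # hypothetical path

# 3) Freeze convolutional layers: reuse learned features, retrain only the head.
frozen = build_cnn()
# frozen.load_weights("model_dataset1.h5")  # hypothetical path
for layer in frozen.layers:
    if isinstance(layer, layers.Conv2D):
        layer.trainable = False

for m in (scratch, warm, frozen):
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```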
110

Anomaly Detection and Root Cause Analysis for LTE Radio Base Stations

López, Sergio January 2018 (has links)
This project aims to detect possible anomalies in the resource consumption of radio base stations within the 4G LTE Radio architecture. This has been done by analyzing the statistical data that each node generates every 15 minutes, in the form of "performance maintenance counters". In this thesis, we introduce methods that allow resources to be monitored automatically after software updates, in order to detect any anomalies in the consumption patterns of the different resources compared to the reference period before the update. Additionally, we attempt to narrow down the origin of anomalies by pointing out parameters potentially linked to the issue.
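
As a hedged sketch of the kind of reference-versus-post-update comparison described above (the counter values, window length, and the Mann-Whitney U test are illustrative assumptions, not the thesis's method):

```python
# Hedged sketch: flag a post-update shift in one PM counter by comparing it
# to a pre-update reference window. Synthetic data; 5% threshold assumed.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
reference   = rng.normal(50, 5, size=96)   # one day of 15-minute samples
post_update = rng.normal(58, 5, size=96)   # same counter after the update

u, p = mannwhitneyu(reference, post_update)
if p < 0.05:
    print(f"anomalous consumption shift detected (p={p:.3g})")
```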
