About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
11

Node Centric Community Detection and Evolutional Prediction in Dynamic Networks

Oluwafolake A Ayano (13161288) 27 July 2022
Advances in technology have led to the availability of data from different platforms such as the web and social media. Much of this data can be represented as a network consisting of a set of nodes connected by edges, where the nodes represent the items in the network and the edges represent the interactions between them. Community detection methods have been used extensively in analyzing these networks. However, community detection in evolving networks remains a significant challenge because of the frequent changes to the networks and the need for real-time analysis. Static community detection methods are not appropriate for analyzing dynamic networks because they do not retain a network's history and cannot provide real-time information about the communities in the network.

Existing incremental methods treat changes to the network as a sequence of edge additions and/or removals; however, in many real-world networks, changes occur when a node is added with all its edges connecting simultaneously.

To process such large networks efficiently and in a timely manner, an adaptive analytical method is needed that can process large networks without recomputing the entire network after it evolves and that treats all the edges associated with a node equally.

We propose a node-centric community detection method that incrementally updates the community structure using the already known structure of the network, avoiding recomputation from scratch and consequently achieving a high-quality community structure. The results from our experiments suggest that our approach is efficient for incremental community detection in node-centric evolving networks.
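To make the node-centric update concrete, here is a minimal Python sketch, an assumption of ours rather than the thesis's actual algorithm: an arriving node, together with all of its edges, is assigned to the neighboring community holding most of its neighbors, so only the affected part of the structure is touched.

```python
# A minimal sketch (not the thesis's actual algorithm) of a node-centric
# incremental update: when a node arrives with all of its edges at once,
# it is assigned to the neighboring community with the strongest
# connectivity instead of re-clustering the whole network.
from collections import Counter

def add_node(communities, adjacency, new_node, new_edges):
    """Assign new_node to the community holding most of its neighbors.

    communities: dict mapping node -> community id
    adjacency:   dict mapping node -> set of neighbors
    new_edges:   iterable of existing nodes new_node connects to
    """
    adjacency[new_node] = set(new_edges)
    for v in new_edges:
        adjacency.setdefault(v, set()).add(new_node)

    votes = Counter(communities[v] for v in new_edges if v in communities)
    if votes:
        communities[new_node] = votes.most_common(1)[0][0]  # majority vote
    else:
        communities[new_node] = max(communities.values(), default=-1) + 1  # isolated: new community

communities = {1: 0, 2: 0, 3: 1}
adjacency = {1: {2}, 2: {1, 3}, 3: {2}}
add_node(communities, adjacency, 4, [1, 2])   # node 4 lands in community 0
print(communities[4])
```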
12

Processing reporting function views in a data warehouse environment

Lehner, Wolfgang, Hummer, W., Schlesinger, L. 02 June 2022
Reporting functions are a novel technique for formulating sequence-oriented queries in SQL. They extend the classical way of grouping and applying aggregation functions with a column-based ordering, partitioning, and windowing mechanism. The application area of reporting functions ranges from simple ranking queries (TOP(n) analyses) through cumulative queries (year-to-date analyses) to sliding-window queries. We discuss the problem of deriving reporting function queries from materialized reporting function views, which is one of the most important issues in efficiently processing queries in a data warehouse environment. Two different derivation algorithms, including their relational mappings, are introduced and compared in a test scenario.
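Reporting functions correspond closely to what most SQL dialects now call window functions. As a rough illustration (using pandas rather than SQL, with invented column names), the three query classes named above map onto ranking, cumulative, and rolling operations over ordered partitions:

```python
# Illustrative only: the three classes of reporting functions
# (ranking, cumulative, sliding window) expressed as pandas
# operations over ordered partitions. Column names are invented.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "north", "south", "south", "south"],
    "month":   [1, 2, 3, 1, 2, 3],
    "revenue": [100, 80, 120, 90, 110, 70],
}).sort_values(["region", "month"])

g = sales.groupby("region")["revenue"]
sales["rank_in_region"] = g.rank(ascending=False)          # TOP(n)-style ranking
sales["ytd"]            = g.cumsum()                        # year-to-date (cumulative)
sales["moving_avg"]     = g.rolling(2).mean().reset_index(  # sliding window of size 2
                              level=0, drop=True)
print(sales)
```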
13

Towards Privacy and Communication Efficiency in Distributed Representation Learning

Sheikh S Azam (12836108) 10 June 2022
Over the past decade, distributed representation learning has emerged as a popular alternative to conventional centralized machine learning training. The increasing interest in distributed representation learning, specifically federated learning, can be attributed to its fundamental properties of promoting data privacy and communication savings. While conventional ML encourages aggregating data at a central location (e.g., data centers), distributed representation learning advocates keeping data at the source and instead transmitting model parameters across the network. However, since the advent of deep learning, model sizes have become increasingly large, often comprising millions to billions of parameters, which leads to the problem of communication latency in the learning process. In this thesis, we propose to tackle the problem of communication latency in two ways: (i) learning private representations of data to enable their sharing, and (ii) reducing communication latency by minimizing the corresponding long-range communication requirements.

To tackle the former goal, we first study the problem of learning representations that are private yet informative, i.e., providing information about intended "ally" targets while hiding sensitive "adversary" attributes. We propose the Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes, unlike existing PRL solutions. We then address the practical constraints of distributed datasets by developing Distributed EIGAN (D-EIGAN), the first distributed PRL method that learns a private representation at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state of the art by a significant accuracy margin (47% improvement), and D-EIGAN's performance is consistently on par with EIGAN under different network settings.

We next tackle the latter objective, reducing communication latency, and propose two-timescale hybrid federated learning (TT-HF), a semi-decentralized learning architecture that combines the conventional device-to-server communication paradigm of federated learning with device-to-device (D2D) communications for model training. In TT-HF, during each global aggregation interval, devices (i) perform multiple stochastic gradient descent iterations on their individual datasets, and (ii) aperiodically engage in a consensus procedure on their model parameters through cooperative, distributed D2D communications within local clusters. With a new general definition of gradient diversity, we formally study the convergence behavior of TT-HF, resulting in new convergence bounds for distributed ML. We leverage these bounds to develop an adaptive control algorithm that tunes the step size, the number of D2D communication rounds, and the global aggregation period of TT-HF over time to target a sublinear convergence rate of O(1/t) while minimizing network resource utilization. Our subsequent experiments demonstrate that TT-HF significantly outperforms the current state of the art in federated learning in terms of model accuracy and/or network energy consumption in scenarios where local device datasets exhibit statistical heterogeneity. Finally, our numerical evaluations demonstrate robustness against outages caused by fading channels, as well as favorable performance with non-convex loss functions.
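The two-timescale structure of TT-HF can be sketched schematically. The following toy loop is our simplification, not the thesis's implementation; the quadratic per-device loss, cluster layout, and all constants are assumptions:

```python
# Schematic sketch of the semi-decentralized TT-HF loop (assumptions:
# a toy quadratic loss, fixed cluster membership, uniform consensus
# weights). Not the thesis's implementation.
import numpy as np

rng = np.random.default_rng(0)
D, DEVICES = 5, 6
clusters = [[0, 1, 2], [3, 4, 5]]                  # D2D neighborhoods
targets = rng.normal(size=(DEVICES, D))            # per-device optima (heterogeneous data)
w = np.zeros((DEVICES, D))                         # local models

def local_grad(i, w_i):
    return w_i - targets[i]                        # gradient of 0.5*||w - target||^2

for agg_round in range(20):                        # global aggregation intervals
    for step in range(5):                          # (i) local SGD iterations
        for i in range(DEVICES):
            w[i] -= 0.1 * local_grad(i, w[i])
        if step % 2 == 1:                          # (ii) aperiodic D2D consensus
            for cluster in clusters:
                w[cluster] = w[cluster].mean(axis=0)
    w[:] = w.mean(axis=0)                          # device-to-server aggregation

print(np.round(w[0] - targets.mean(axis=0), 3))    # near the global optimum
```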
14

Myson Burch Thesis

Myson C Burch (16637289) 08 August 2023
With the completion of the Human Genome Project and many additional efforts since, there is an abundance of genetic data that can be leveraged to revolutionize healthcare. Significant efforts are now underway to develop state-of-the-art techniques that reveal insights about connections between genetics and complex diseases such as diabetes, heart disease, or common psychiatric conditions that depend on multiple genes interacting with environmental factors. These methods help pave the way towards diagnosis, cure, and ultimately prediction and prevention of complex disorders. As part of this effort, we address high-dimensional genomics-related questions through mathematical modeling, statistical methodologies, combinatorics, and scalable algorithms. More specifically, we develop innovative techniques at the intersection of technology and the life sciences, using biobank-scale data from genome-wide association studies (GWAS) and machine learning, in an effort to better understand human health and disease.

The underlying principle behind GWAS is a test for association between genotyped variants and the trait of interest for each individual. GWAS have been extensively used to estimate the signed effects of trait-associated alleles and to map genes to disorders; over the past decade, about 10,000 strong associations between genetic variants and one (or more) complex traits have been reported. One of the key challenges in GWAS is population stratification, which can lead to spurious genotype-trait associations. Our work proposes a simple clustering-based approach that corrects for stratification better than existing methods. This method takes linkage disequilibrium (LD) into account while computing the distance between the individuals in a sample. Our approach, called CluStrat, performs agglomerative hierarchical clustering (AHC) using a regularized Mahalanobis distance-based genetic relatedness matrix (GRM), which captures the population-level covariance (LD) matrix for the available genotype data.

Linear mixed models (LMMs) have been a popular and powerful method for conducting GWAS in the presence of population structure, but they are computationally expensive relative to simpler techniques. We implement matrix sketching in LMMs (MaSk-LMM) to mitigate the more expensive computations. Matrix sketching is an approximation technique in which random projections are applied to compress the original dataset into one that is significantly smaller yet still preserves some of the properties of the original dataset up to a guaranteed approximation ratio. This technique naturally applies to problems in genetics, where a large biobank can be treated as a matrix whose rows represent samples and whose columns represent SNPs. These matrices are very large due to the number of individuals and markers in biobanks and can benefit from matrix sketching. Our approach tackles the bottleneck of LMMs directly by sketching the samples of the genotype matrix as well as sketching the markers during the computation of the relatedness or kinship matrix (GRM).

Predictive analytics have been used to improve healthcare by reinforcing decision-making, enhancing patient outcomes, and providing relief for the healthcare system. The prevalence of complex diseases varies greatly around the world. Understanding the basis of this prevalence difference can help disentangle the interactions among the different factors causing complex disorders and identify groups of people who may be at greater risk of developing certain disorders. This could become the basis for implementing early intervention strategies for populations at higher risk, with significant benefits for public health.

This dissertation broadens our understanding of empirical population genetics. It proposes a data-driven perspective on a variety of problems in genetics, such as confounding factors in genetic structure. It highlights current computational barriers in open problems in genetics and provides robust, scalable, and efficient methods to ease the analysis of genotype data.
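The sketching idea behind MaSk-LMM can be illustrated with a toy example. The Gaussian sketch and the dimensions below are our assumptions, not the method's exact construction:

```python
# Toy illustration of matrix sketching for a GRM (not MaSk-LMM's exact
# construction): a Gaussian random projection compresses the marker
# dimension before the relatedness matrix is formed.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_snps, sketch_dim = 200, 5000, 500

X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)  # genotypes in {0,1,2}
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)               # standardize per SNP

grm_exact = X @ X.T / n_snps                                     # full GRM: O(n^2 * p)

S = rng.normal(size=(n_snps, sketch_dim)) / np.sqrt(sketch_dim)  # sketching matrix
Xs = X @ S                                                       # compressed markers
grm_sketch = Xs @ Xs.T / n_snps                                  # approximate GRM

err = np.linalg.norm(grm_exact - grm_sketch) / np.linalg.norm(grm_exact)
print(f"relative error of sketched GRM: {err:.3f}")              # small for modest sketch_dim
```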
15

Decomposition and Stability of Multiparameter Persistence Modules

Cheng Xin (16750956) 04 August 2023
The only datasets used in my thesis work are from TUDataset (https://chrsmrrs.github.io/datasets/), a collection of public benchmark datasets for graph classification and regression.
16

An Open-Source Framework for Large-Scale ML Model Serving

Sigfridsson, Petter January 2022
The machine learning (ML) industry has taken great strides forward and today faces new challenges. Many more models are developed, used, and served within the industry, and the datasets that models are trained on are constantly changing. This demands that modern machine learning processes handle large numbers of models and extreme load, and support recurring updates, in a scalable manner. To handle these challenges, there is a concept called model serving. Model serving is a relatively new concept, and further effort is needed to address both its conceptual and technical challenges. Existing ML model serving solutions aim to be scalable when serving one model at a time, whereas the industry requires that the whole ML process, the number of served models, and recurring updates all scale. That is why this thesis presents an open-source framework for large-scale ML model serving that aims to meet the requirements of today's ML industry. The presented framework is shown to handle a large-scale ML model serving environment in a scalable way, with some limitations. Results show that the number of parallel requests the framework can handle can be optimized, which would make the solution more efficient in terms of resource utilization. One avenue for future improvement could be to integrate the developed framework as an application into the open-source machine learning platform STACKn.
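As a hedged illustration of the multi-model serving pattern the thesis addresses (not the framework itself), a registry-based server routes prediction requests by model name; the endpoint names and the stand-in model below are invented:

```python
# Toy sketch of the multi-model serving pattern (not the thesis framework):
# one process holds a registry of models and routes requests by name, so
# models can be added or replaced without redeploying the server.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
registry: dict[str, object] = {}                 # name -> loaded model

class PredictRequest(BaseModel):
    inputs: list[float]

@app.post("/models/{name}/load")
def load_model(name: str):
    registry[name] = lambda xs: sum(xs)          # stand-in for real model loading
    return {"loaded": name}

@app.post("/models/{name}/predict")
def predict(name: str, req: PredictRequest):
    model = registry.get(name)
    if model is None:
        raise HTTPException(status_code=404, detail=f"model {name!r} not loaded")
    return {"outputs": model(req.inputs)}

# Run with: uvicorn serve:app --workers 4   (scale out by adding workers/replicas)
```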
17

DIPBench: An Independent Benchmark for Data-Intensive Integration Processes

Lehner, Wolfgang, Böhm, Matthias, Habich, Dirk, Wloka, Uwe 12 August 2022
The integration of heterogeneous data sources is one of the main challenges within the area of data engineering. Due to the absence of an independent and universal benchmark for data-intensive integration processes, we propose a scalable benchmark, called DIPBench (Data-Intensive Integration Process Benchmark), for evaluating the performance of integration systems. The benchmark can be used for subscription systems, such as replication servers and distributed and federated DBMSs, as well as for message-oriented middleware platforms such as Enterprise Application Integration (EAI) servers and Extraction-Transformation-Loading (ETL) tools. In order to achieve the intended universal view of integration processes, the benchmark is designed in a conceptual, process-driven way and comprises 15 integration process types. We specify the source and target data schemas and provide a tool suite for initializing the external systems, executing the benchmark, and monitoring the integration system's performance. The core benchmark execution can be influenced by three scale factors. Finally, we discuss the metric used for evaluating the measured integration system's performance and illustrate our reference benchmark implementation for federated DBMSs.
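As a purely hypothetical sketch of how scale factors might drive a benchmark run (DIPBench's real tool suite, process types, and metric are not reproduced here):

```python
# Hypothetical sketch of a scale-factor-driven benchmark harness; the
# knob names, the stand-in workload, and the metric are all invented.
from dataclasses import dataclass
from itertools import product
import time

@dataclass
class ScaleFactors:            # three knobs shaping a benchmark run
    data_volume: int           # rows per source system
    process_mix: int           # how many integration process types run
    load_frequency: int        # invocations per process type

def run_process(kind: int, sf: ScaleFactors) -> float:
    start = time.perf_counter()
    _ = sum(range(sf.data_volume))      # stand-in for one integration process
    return time.perf_counter() - start

def run_benchmark(sf: ScaleFactors) -> float:
    timings = [run_process(k, sf)
               for k, _ in product(range(sf.process_mix), range(sf.load_frequency))]
    return sum(timings) / len(timings)  # stand-in for the benchmark's metric

for sf in (ScaleFactors(10_000, 3, 5), ScaleFactors(100_000, 15, 5)):
    print(sf, f"mean latency: {run_benchmark(sf):.4f}s")
```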
18

Clustering Uncertain Data with Possible Worlds

Lehner, Wolfgang, Volk, Peter Benjamin, Rosenthal, Frank, Hahmann, Martin, Habich, Dirk 16 August 2022
The topic of managing uncertain data has been explored in many ways, and different methodologies for data storage and query processing have been proposed. As the availability of management systems grows, research on analytics over uncertain data is gaining in importance. Similar to the challenges faced in data management, algorithms for mining uncertain data also suffer a high performance degradation compared to their counterparts for certain data. To overcome this degradation, the MCDB approach was developed for uncertain data management based on the possible-worlds scenario. As this methodology shows significant performance and scalability enhancements, we adopt it for mining uncertain data. In this paper, we introduce a clustering methodology for uncertain data and illustrate current issues with this approach within the field of clustering uncertain data.
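The possible-worlds idea can be sketched in a few lines. The following is a simplified illustration, assuming Gaussian uncertainty and k-means rather than the paper's exact procedure: sample concrete worlds, cluster each independently, and aggregate how often points co-cluster:

```python
# Minimal sketch of possible-worlds clustering (assumed Gaussian
# uncertainty and k-means; not the paper's exact procedure): sample
# several concrete "worlds" from the uncertain points, cluster each
# world independently, then aggregate co-clustering frequencies.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
means = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])  # uncertain points
stddev, n_worlds, k = 0.3, 50, 2
n = len(means)

co_cluster = np.zeros((n, n))                       # how often points share a cluster
for _ in range(n_worlds):
    world = means + rng.normal(scale=stddev, size=means.shape)      # one possible world
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(world)
    co_cluster += labels[:, None] == labels[None, :]

print(np.round(co_cluster / n_worlds, 2))           # consensus clustering across worlds
```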
19

Learning From Data Across Domains: Enhancing Human and Machine Understanding of Data From the Wild

Sean Michael Kulinski (17593182) 13 December 2023
<p dir="ltr">Data is collected everywhere in our world; however, it often is noisy and incomplete. Different sources of data may have different characteristics, quality levels, or come from dynamic and diverse environments. This poses challenges for both humans who want to gain insights from data and machines which are learning patterns from data. How can we leverage the diversity of data across domains to enhance our understanding and decision-making? In this thesis, we address this question by proposing novel methods and applications that use multiple domains as more holistic sources of information for both human and machine learning tasks. For example, to help human operators understand environmental dynamics, we show the detection and localization of distribution shifts to problematic features, as well as how interpretable distributional mappings can be used to explain the differences between shifted distributions. For robustifying machine learning, we propose a causal-inspired method to find latent factors that are robust to environmental changes and can be used for counterfactual generation or domain-independent training; we propose a domain generalization framework that allows for fast and scalable models that are robust to distribution shift; and we introduce a new dataset based on human matches in StarCraft II that exhibits complex and shifting multi-agent behaviors. We showcase our methods across various domains such as healthcare, natural language processing (NLP), computer vision (CV), etc. to demonstrate that learning from data across domains can lead to more faithful representations of data and its generating environments for both humans and machines.</p>
20

Data-based Explanations of Random Forest using Machine Unlearning

Tanmay Laxman Surve (17537112) 03 December 2023
<p dir="ltr">Tree-based machine learning models, such as decision trees and random forests, are one of the most widely used machine learning models primarily because of their predictive power in supervised learning tasks and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory behavior. Given their overwhelming success for most tasks, it is of interest to identify root causes of the unexpected and discriminatory behavior of tree-based models. However, there has not been much work on understanding and debugging tree-based classifiers in the context of fairness. We introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to determine training data subsets responsible for model unfairness. Given a tree-based model learned on a training dataset, FairDebugger identifies the top-k training data subsets responsible for model unfairness, or bias, by measuring the change in model parameters when parts of the underlying training data are removed. We describe the architecture of FairDebugger and walk through real-world use cases to demonstrate how FairDebugger detects these patterns and their explanations.</p>
