51

Provable Guarantees of Learning with Incomplete and Latent Data

Chuyang Ke (15337258) 21 April 2023 (has links)
Real-world datasets are rarely clean. This causes a discrepancy between the claimed performance of machine learning algorithms on paper and their actual performance on real-world problems. When dealing with missing or hidden information in a dataset, researchers have relied on heuristic imputation methods since the earliest days of machine learning. However, many imputation methods are known to lack theoretical guarantees in various machine learning tasks, including clustering, community detection, and sparsity recovery. On the other hand, theoretical machine learning papers often rest on simplistic assumptions that are rarely fulfilled by real-world datasets. My research focuses on developing statistically and computationally efficient learning algorithms with provable guarantees under novel, and arguably more realistic, incomplete and latent assumptions. We provide analyses of community detection in various network models, inference with latent variables in an arbitrary planted model, federated myopic community detection, and high-order tensor models. We analyze the interaction between the missing or latent structures and the inference/recoverability conditions, and propose algorithms to solve these problems efficiently.

Our main contributions in this thesis are as follows.

1. We analyze the information-theoretic limits for the recovery of node labels (community detection) in several network models. We carefully construct restricted ensembles for a subclass of network models and provide a series of novel results.
2. We analyze the necessary and sufficient conditions for exact inference of a latent model. We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain (a relaxation of this flavor is sketched after this list). Our analysis predicts the experimental correctness of the SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker conditions and the spectrum of a properly defined matrix.
3. We study the problem of recovering the community structure of a network under federated myopic learning. Under this paradigm, we have several clients, each of them having a myopic view, i.e., observing a small subgraph of the network. Each client sends a censored evidence graph to a central server. We provide an efficient algorithm which computes a consensus signed weighted graph from the clients' evidence and recovers the underlying network structure at the central server. We analyze the topological structure conditions of the network, as well as the signal and noise levels of the clients, that allow for recovery of the network structure. Our analysis shows that exact recovery is possible and can be achieved in polynomial time.
4. We study the problem of exact partitioning of high-order models. We consider two different high-order assumptions and show that exact partitioning of high-order planted models is achievable by solving a convex optimization problem: with a novel Carathéodory symmetric tensor cone in one case, and with a tensor nuclear norm constraint in the other.
5. We study the problem of inference in high-order structured prediction tasks. We apply a generative model approach to the problem of high-order inference and provide a two-stage convex optimization algorithm for exact label recovery. We also connect the performance of our algorithm to the hyperedge expansion property via a novel hypergraph Cheeger-type inequality.
6. We study the problem of partial recovery through semidefinite programming. We are interested in scenarios in which the SDP returns a solution that is partially correct without any rounding. We analyze the optimality condition of partial recovery and provide statistical and topological guarantees.
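As a minimal illustration of the kind of SDP relaxation referenced in contributions 2 and 6 (not the thesis's exact formulation), the sketch below solves the classical community recovery SDP on a synthetic two-block stochastic block model; the model parameters and the cvxpy formulation are assumptions for the example only.

```python
# Sketch: SDP relaxation for two-community recovery on a stochastic block model.
# Solves max <B, X> s.t. X PSD, diag(X) = 1, where B is the +/-1 signed adjacency;
# when exact recovery holds, the optimum is rank one, X = xx^T with x in {-1,+1}^n.
import numpy as np
import cvxpy as cp
import networkx as nx

n = 40  # illustrative size; SDP solvers do not scale to huge graphs
G = nx.stochastic_block_model([n // 2, n // 2], [[0.7, 0.1], [0.1, 0.7]], seed=0)
A = nx.to_numpy_array(G)
B = 2 * A - (np.ones((n, n)) - np.eye(n))  # +1 for edges, -1 for non-edges

X = cp.Variable((n, n), symmetric=True)
problem = cp.Problem(cp.Maximize(cp.trace(B @ X)),
                     [X >> 0, cp.diag(X) == 1])
problem.solve()

# Round via the top eigenvector's signs (exact when X is rank one).
_, eigvecs = np.linalg.eigh(X.value)
labels = np.sign(eigvecs[:, -1])
print(labels)
```

Above the exact recovery threshold, the SDP optimum is the rank-one matrix xx^T, and Karush-Kuhn-Tucker-style conditions of the kind studied in contribution 2 certify its optimality.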
52

Community Detection applied to Cross-Device Identity Graphs

Geffrier, Valentin January 2017 (has links)
The personalization of online advertising has become a necessity for marketing agencies. Tracking technologies such as third-party cookies give advertisers the ability to recognize internet users across different websites, to understand their behavior, and to assess their needs and tastes. The volume of created data and interactions leads to a large cross-device identity graph that links different identifiers, such as emails, to different devices used on different networks. Over time, strongly connected components appear in this graph that are too large to represent the identifiers or devices of only one person or household. The aim of this project is to partition these components according to the structure of the graph and the features associated with the edges, without separating identifiers used by the same person. The resulting size reduction of these components then isolates individuals and the identifiers associated with them. This thesis presents the design of a bipartite graph from the available data, the implementation of different community detection algorithms adapted to this specific case, and different validation methods designed to assess the quality of the partition. Different graph metrics are then used to compare the outputs of the algorithms, and we observe how adapting an algorithm to the bipartite case can lead to better results.
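As a rough illustration of the pipeline described above (not the thesis's implementation), the sketch below builds a small bipartite identifier-device graph, projects it onto the device side, and partitions the projection with networkx's Louvain method; all node names and edge weights are hypothetical.

```python
# Sketch: bipartite identifier-device graph, projected and partitioned.
# Louvain is a stand-in for the community detection algorithms the thesis compares.
import networkx as nx
from networkx.algorithms import bipartite, community

B = nx.Graph()
ids = ["id1", "id2", "id3"]                    # e.g., hashed email identifiers
devices = ["phone_a", "laptop_a", "tablet_b"]  # devices seen in the tracking data
B.add_nodes_from(ids, bipartite=0)
B.add_nodes_from(devices, bipartite=1)
B.add_weighted_edges_from([                    # weight ~ interaction count
    ("id1", "phone_a", 5), ("id1", "laptop_a", 3),
    ("id2", "laptop_a", 1), ("id3", "tablet_b", 4),
])

# Project onto devices: devices sharing identifiers become connected.
P = bipartite.weighted_projected_graph(B, devices)

# Each community in the projection approximates one person or household.
print(community.louvain_communities(P, weight="weight", seed=0))
```

The thesis's observation that bipartite-aware variants perform better suggests running detection on B directly (e.g., with a bipartite modularity) rather than on the projection P, which discards information.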
53

Joint Dynamic Online Social Network Analytics Using Network, Content and User Characteristics

Ruan, Yiye 18 May 2015 (has links)
No description available.
54

Efficient and Effective Local Algorithms for Analyzing Massive Graphs

Wu, Yubao 31 May 2016 (has links)
No description available.
55

Joint spectral embeddings of random dot product graphs

Draves, Benjamin 05 October 2022 (has links)
Multiplex networks describe a set of entities, with multiple relationships among them, as a collection of networks over a common vertex set. Multiplex networks naturally describe complex systems where units connect across different modalities, whereas single-network data only permits a single relationship type. Joint spectral embedding methods facilitate analysis of multiplex network data by simultaneously mapping vertices in each network to points in Euclidean space, called node embeddings, where statistical inference is then performed. This mapping is performed by spectrally decomposing a matrix that summarizes the multiplex network. Different methods decompose different matrices and hence yield different node embeddings. This dissertation analyzes a class of joint spectral embedding methods which provides a foundation to compare these different approaches to multiple network inference. We compare joint spectral embedding methods in three ways. First, we extend the Random Dot Product Graph model to multiplex network data and establish the statistical properties of node embeddings produced by each method under this model. This analysis facilitates a full bias-variance analysis of each method and uncovers connections between these methods and methods for dimensionality reduction. Second, we compare the accuracy of algorithms which utilize these different node embeddings in a variety of multiple network inference tasks, including community detection, vertex anomaly detection, and graph hypothesis testing. Finally, we perform a time and space complexity analysis of each method and present a case study in which we analyze interactions between New England sports fans on the social news aggregation and discussion website Reddit. These findings provide a theoretical and practical guide for comparing joint spectral embedding techniques and highlight the benefits and drawbacks of utilizing each method in practice.
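One well-known member of this class is the omnibus embedding, which averages the adjacency matrices pairwise into a single block matrix before decomposing it. The numpy sketch below, run on hypothetical random layers, is a minimal illustration of that one method, not the dissertation's general framework.

```python
# Sketch of the omnibus embedding (OMNI): stack pairwise-averaged adjacency
# matrices into an (mn x mn) block matrix, then embed via its top eigenpairs.
import numpy as np

def omnibus_embedding(adjacencies, d):
    """Jointly embed m undirected networks on n shared vertices into R^d."""
    m, n = len(adjacencies), adjacencies[0].shape[0]
    M = np.block([[(Ai + Aj) / 2 for Aj in adjacencies] for Ai in adjacencies])
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(eigvals))[::-1][:d]            # top-d eigenpairs
    Z = eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))    # scaled eigenvectors
    return Z.reshape(m, n, d)          # one n-by-d node embedding per layer

# Hypothetical two-layer multiplex on 6 vertices.
rng = np.random.default_rng(0)
def random_layer(n=6, p=0.5):
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    return A + A.T
print(omnibus_embedding([random_layer(), random_layer()], d=2).shape)  # (2, 6, 2)
```

Because every layer contributes its own copy of each vertex, the same vertex receives a (generally different) embedding in each layer, which is what enables the anomaly detection and hypothesis testing tasks mentioned above.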
56

Detecting malware in memory with memory object relationships

Thomas, DeMarcus M., Sr. 10 December 2021 (has links)
Malware is a growing concern that affects not only large businesses but individual consumers as well. As a result, there is a need for tools that can identify the malicious activities of malware authors. A useful technique for this is memory forensics: the study of volatile data and its structures in Random Access Memory (RAM), which can be used to pinpoint what actions have occurred on a computer system. This dissertation combines memory forensics, used to extract relationships between objects, with supervised machine learning as a novel method for identifying malicious processes in a system memory dump. In this work, the Object Association Extractor (OAE) was created to extract objects in a memory dump and label the relationships as a graph of nodes and edges. With OAE, we extracted processes from 13,882 memory images that contained malware from the repository VirusShare and from 91 memory images created with benign software from the package manager Chocolatey. The final dataset contained 267,824 processes. Two feature sets were created from the process dataset and used to train classifiers based on four classification algorithms. These classifiers were evaluated against the ZeroR method using accuracy and recall as the evaluation metrics. The experiments showed that classifiers built from both feature sets were able to beat the ZeroR baseline with the Decision Tree and Random Forest algorithms. The Random Forest classifier achieved the highest performance, reaching a recall score of almost 97%.
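A minimal sketch of the evaluation protocol described above, with scikit-learn's DummyClassifier standing in for ZeroR and random placeholder features in place of the OAE-derived ones (dataset sizes and hyperparameters are illustrative, not the dissertation's):

```python
# Sketch: Random Forest vs. a ZeroR-style majority-class baseline,
# scored with accuracy and recall as in the dissertation's evaluation.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 16))          # placeholder per-process feature vectors
y = rng.integers(0, 2, 1000)        # 1 = malicious process, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
zeror = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, clf in [("ZeroR", zeror), ("RandomForest", forest)]:
    pred = clf.predict(X_te)
    print(name, accuracy_score(y_te, pred), recall_score(y_te, pred))
```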
57

From Clusters to Graphs – Toward a Scalable Viewing of News Videos

Ruth, Nicolas, Liebl, Bernhard, Burghardt, Manuel 04 July 2024 (has links)
In this paper, we present a novel approach that combines density-based clustering and graph modeling to create a scalable viewing application for the exploration of similarity patterns in news videos. Unlike most existing video analysis tools, which focus on individual videos, our approach provides an overview of a larger collection of videos, which can be further examined based on their connections or communities. By utilizing scalable reading, specific subgraphs can be selected from the overview and their respective clusters can be explored in more detail at the video frame level.
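A rough sketch of the cluster-then-graph idea, assuming DBSCAN as the density-based clusterer and random placeholder frame embeddings (the paper's actual features, tools, and parameters are not specified here):

```python
# Sketch: cluster frame-level features, then link videos that share clusters.
from itertools import combinations
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
frames = rng.random((200, 64))            # placeholder frame embeddings
video_of = rng.integers(0, 10, 200)       # which of 10 videos each frame is from

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(frames)

G = nx.Graph()
G.add_nodes_from(range(10))               # one node per video
for c in set(labels) - {-1}:              # skip DBSCAN's noise label (-1)
    videos = sorted(set(video_of[labels == c]))
    for u, v in combinations(videos, 2):  # videos sharing a visual cluster
        w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# Communities of G give the collection-level overview; selected subgraphs
# can then be expanded down to the frame level.
print(G.number_of_edges())
```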
58

Community-Based Intrusion Detection

Weigert, Stefan 06 February 2017 (has links)
Today, virtually every company world-wide is connected to the Internet. This wide-spread connectivity has given rise to sophisticated, targeted, Internet-based attacks. For example, between 2012 and 2013 security researchers counted an average of about 74 targeted attacks per day. These attacks are motivated by economic, financial, or political interests and are commonly referred to as "Advanced Persistent Threat (APT)" attacks. Unfortunately, many of these attacks are successful and the adversaries manage to steal important data or disrupt vital services. The preferred victims are companies from vital industries, such as banks, defense contractors, or power plants. Given that these industries are well-protected, often employing a team of security specialists, the question is: how can these attacks be so successful? Researchers have identified several properties of APT attacks which make them so effective. First, they are adaptable: they can change the way they attack and the tools they use at any given moment in time. Second, they conceal their actions and communication, for example by using encryption. This renders many defense systems useless, as those assume complete access to the actual communication content. Third, their actions are stealthy, either by keeping communication to the bare minimum or by mimicking legitimate users. This lets them "fly below the radar" of defense systems which check for anomalous communication. And finally, with the goal of increasing their impact or monetisation prospects, their attacks are targeted against several companies from the same industry. Since months can pass between the first attack, its detection, and a comprehensive analysis, it is often too late to deploy appropriate counter-measures at business peers. Instead, it is much more likely that they have already been attacked successfully. This thesis tries to answer the question of whether the last property (industry-wide attacks) can be used to detect such attacks. It presents the design, implementation, and evaluation of a community-based intrusion detection system capable of protecting businesses at industry scale. The contributions of this thesis are as follows. First, it presents a novel algorithm for community detection which can detect an industry (e.g., the energy, financial, or defense industries) in Internet communication; a toy illustration of this idea follows below. Second, it demonstrates the design, implementation, and evaluation of a distributed graph mining engine that is able to scale with the throughput of the input data while maintaining an end-to-end latency for updates in the range of a few milliseconds. Third, it illustrates the usage of this engine to detect APT attacks against industries by analyzing IP flow information from an Internet service provider. Finally, it introduces an algorithm- and input-agnostic intrusion detection engine which supports not only intrusion detection on IP flows but any other intrusion detection algorithm and data source as well.
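To make the first contribution concrete, the sketch below runs an off-the-shelf community detection method (label propagation, a stand-in for the thesis's own algorithm) on a toy IP-flow graph; all addresses and flow counts are hypothetical.

```python
# Sketch: communities in an IP communication graph approximate industries;
# an attack observed at one member can then raise alerts for its peers.
import networkx as nx

# Hypothetical flow records: (source IP, destination IP, flow count).
flows = [
    ("10.0.0.1", "10.0.1.1", 120), ("10.0.0.2", "10.0.1.1", 90),
    ("10.0.0.1", "10.0.0.2", 40),  ("10.2.0.1", "10.2.0.2", 200),
    ("10.2.0.2", "10.2.0.3", 150),
]
G = nx.Graph()
for src, dst, count in flows:
    G.add_edge(src, dst, weight=count)

communities = nx.algorithms.community.label_propagation_communities(G)
print([sorted(c) for c in communities])  # each set: one candidate "industry"
```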
59

Estimation of distribution algorithms based on phylogenetic trees

Soares, Antonio Helson Mineiro 27 June 2014 (has links)
Evolutionary algorithms that use probabilistic models of the distribution of variable values (to guide the search for problem solutions) are called Estimation of Distribution Algorithms (EDAs). These algorithms have shown relevant performance in handling relatively complex problems. Their performance depends directly on the quality of the probabilistic models constructed, which in turn depends on the model-building methods. The best models are often constructed by computationally complex methods, resulting in EDAs that require high running times, although they are able to explore fewer points of the search space to find the solution of a problem. This work investigates probabilistic models obtained by phylogeny reconstruction algorithms, since some of these methods can efficiently produce models that represent the main relationships among species (or among variables) well. This work proposes several strategies for better use of phylogeny-based models in the development of EDAs, among them the employment of a set of phylogenies instead of a single phylogeny as the model of correlation among variables, the synthesis of the most relevant information from that set into a network structure, and the identification of groups of correlated variables from one or more networks by means of a community detection algorithm. Using these advances for model construction, a new search technique, called Composed Exhaustive Search, was developed to find solutions for combinatorial optimization problems of different levels of difficulty. In addition, an extension of the new algorithm to multi-objective problems was proposed, which was able to determine the Pareto-optimal front of the combinatorial problems investigated. Finally, the developed EDA achieves a trade-off between the number of evaluations and the running time, obtaining results similar to those of the best algorithms found for each of these performance criteria on the problems tested.
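For readers unfamiliar with EDAs, the sketch below shows the generic loop (sample from the model, select the fittest, re-estimate the model) with the simplest possible model, independent bit frequencies (UMDA), on the OneMax toy problem. The thesis's contribution is precisely to replace this naive model-building step with phylogeny-based models.

```python
# Sketch: Univariate Marginal Distribution Algorithm (UMDA), the simplest EDA.
import numpy as np

def umda(fitness, n_bits, pop=100, elite=50, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)                              # initial model
    for _ in range(iters):
        X = (rng.random((pop, n_bits)) < p).astype(int)   # sample population
        best = X[np.argsort(fitness(X))[-elite:]]         # select the fittest
        p = best.mean(axis=0).clip(0.05, 0.95)            # re-estimate the model
    return p

# OneMax: maximize the number of ones in the bit string.
p = umda(lambda X: X.sum(axis=1), n_bits=20)
print((p > 0.5).astype(int))   # converges to the all-ones optimum
```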
