51

Provable Guarantees of Learning with Incomplete and Latent Data

Chuyang Ke (15337258) 21 April 2023 (has links)
Real-world datasets are rarely clean. This causes a discrepancy between the claimed performance of machine learning algorithms on paper and their actual performance on real-world problems. When dealing with missing or hidden information in a dataset, researchers have relied on heuristic imputation methods since the earliest days of machine learning. However, many imputation methods lack theoretical guarantees in a variety of machine learning tasks, including clustering, community detection, and sparsity recovery. On the other hand, theoretical machine learning papers often rest on simplistic assumptions that are rarely fulfilled by real-world datasets. My research focuses on developing statistically and computationally efficient learning algorithms with provable guarantees under novel, and arguably more realistic, incomplete and latent assumptions. We provide analyses of community detection in various network models, inference with latent variables in an arbitrary planted model, federated myopic community detection, and high-order tensor models. We analyze the interaction between the missing or latent structures and the inference/recoverability conditions, and propose algorithms that solve these problems efficiently.

Our main contributions in this thesis are as follows.

1. We analyze the information-theoretic limits for the recovery of node labels in several network models, and the information-theoretic limits for community detection. We carefully construct restricted ensembles for a subclass of network models and provide a series of novel results.
2. We analyze the necessary and sufficient conditions for exact inference in a latent model. We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain (a common form of such a relaxation is sketched after this list). Our analysis predicts the experimental correctness of the SDP with high accuracy, confirming the suitability of our focus on the Karush-Kuhn-Tucker conditions and the spectrum of a properly defined matrix.
3. We study the problem of recovering the community structure of a network under federated myopic learning. Under this paradigm, several clients each have a myopic view, i.e., observe a small subgraph of the network, and send a censored evidence graph to a central server. We provide an efficient algorithm that computes a consensus signed weighted graph from the clients' evidence and recovers the underlying network structure at the central server. We analyze the topological conditions on the network, as well as the signal and noise levels of the clients, that allow recovery of the network structure. Our analysis shows that exact recovery is possible and achievable in polynomial time.
4. We study the problem of exact partitioning of high-order models. We consider two different high-order assumptions and show that exact partitioning of high-order planted models is achievable by solving a convex optimization problem, with a novel Carathéodory symmetric tensor cone in one case and a tensor nuclear norm constraint in the other.
5. We study the problem of inference in high-order structured prediction tasks. We take a generative-model approach to high-order inference and provide a two-stage convex optimization algorithm for exact label recovery. We also connect the performance of our algorithm to the hyperedge expansion property through a novel hypergraph Cheeger-type inequality.
6. We study the problem of partial recovery through semidefinite programming. We are interested in scenarios in which the SDP returns a solution that is partially correct without any rounding. We analyze the optimality conditions for partial recovery and provide statistical and topological guarantees.
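As context for contribution 2: the abstract does not spell out the thesis's semidefinite program, but a common SDP relaxation for exact recovery of two communities (an illustrative assumption, not necessarily the formulation analyzed in this thesis) takes the following form.

```latex
% Illustrative SDP relaxation for two-community exact recovery on a
% network with adjacency matrix A; the thesis's exact program may differ.
\begin{align*}
\max_{X \in \mathbb{R}^{n \times n}} \quad & \langle A, X \rangle \\
\text{s.t.} \quad & X \succeq 0, \qquad X_{ii} = 1, \quad i = 1, \dots, n.
\end{align*}
% Exact inference corresponds to the optimum X^* = z z^\top for the true
% label vector z \in \{-1,+1\}^n; this regime is typically certified via
% the Karush-Kuhn-Tucker conditions and the spectrum of a dual
% certificate matrix, the quantities named in the abstract.
```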
52

Community Detection applied to Cross-Device Identity Graphs

Geffrier, Valentin January 2017 (has links)
The personalization of online advertising has become a necessity for marketing agencies. Tracking technologies such as third-party cookies give advertisers the ability to recognize internet users across different websites, to understand their behavior, and to assess their needs and tastes. The volume of created data and interactions leads to a large cross-device identity graph that links different identifiers, such as emails, to different devices used on different networks. Over time, strongly connected components appear in this graph that are too large to represent the identifiers or devices of a single person or household. The aim of this project is to partition these components according to the structure of the graph and the features associated with the edges, without separating identifiers used by the same person. The resulting size reduction of these components then isolates individuals and the identifiers associated with them. This thesis presents the design of a bipartite graph from the available data, the implementation of different community detection algorithms adapted to this specific case, and different validation methods designed to assess the quality of the partition. Different graph metrics are then used to compare the outputs of the algorithms, and we observe how adapting an algorithm to the bipartite case can lead to better results.
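As a purely illustrative rendering of the pipeline above (not the thesis's code or data), the sketch below builds a small weighted bipartite identifier-device graph and partitions each connected component with an off-the-shelf community detection routine; the bipartite-adapted algorithms the thesis compares are not reproduced. It assumes networkx 2.8 or later for louvain_communities.

```python
# Illustrative sketch: bipartite identifier-device graph, partitioned
# component by component. Node names and weights are made up.
import networkx as nx
from networkx.algorithms.community import louvain_communities

B = nx.Graph()
identifiers = ["id1", "id2", "id3"]            # e.g., hashed emails
devices = ["devA", "devB", "devC", "devD"]     # e.g., browser/device IDs
B.add_nodes_from(identifiers, bipartite=0)
B.add_nodes_from(devices, bipartite=1)
# Edge weights stand in for features such as co-occurrence counts.
B.add_weighted_edges_from([
    ("id1", "devA", 5.0), ("id1", "devB", 3.0),
    ("id2", "devB", 1.0), ("id2", "devC", 4.0),
    ("id3", "devC", 2.0), ("id3", "devD", 6.0),
])

# Partition each component without separating strongly tied identifiers.
for component in nx.connected_components(B):
    subgraph = B.subgraph(component)
    parts = louvain_communities(subgraph, weight="weight", seed=0)
    print(parts)
```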
53

Joint Dynamic Online Social Network Analytics Using Network, Content and User Characteristics

Ruan, Yiye 18 May 2015 (has links)
No description available.
54

Efficient and Effective Local Algorithms for Analyzing Massive Graphs

Wu, Yubao 31 May 2016 (has links)
No description available.
55

Joint spectral embeddings of random dot product graphs

Draves, Benjamin 05 October 2022 (has links)
Multiplex networks describe a set of entities, with multiple relationships among them, as a collection of networks over a common vertex set. Multiplex networks naturally describe complex systems where units connect across different modalities, whereas single-network data permits only a single relationship type. Joint spectral embedding methods facilitate the analysis of multiplex network data by simultaneously mapping the vertices of each network to points in Euclidean space, termed node embeddings, on which statistical inference is then performed. This mapping is performed by spectrally decomposing a matrix that summarizes the multiplex network. Different methods decompose different matrices and hence yield different node embeddings. This dissertation analyzes a class of joint spectral embedding methods, providing a foundation for comparing these different approaches to multiple-network inference. We compare joint spectral embedding methods in three ways. First, we extend the Random Dot Product Graph model to multiplex network data and establish the statistical properties of the node embeddings produced by each method under this model. This analysis enables a full bias-variance analysis of each method and uncovers connections between these methods and methods for dimensionality reduction. Second, we compare the accuracy of algorithms that utilize these different node embeddings in a variety of multiple-network inference tasks, including community detection, vertex anomaly detection, and graph hypothesis testing. Finally, we perform a time and space complexity analysis of each method and present a case study in which we analyze interactions between New England sports fans on the social news aggregation and discussion website Reddit. These findings provide a theoretical and practical guide for comparing joint spectral embedding techniques and highlight the benefits and drawbacks of each method in practice.
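As an illustration of the recipe described above (summary matrix, spectral decomposition, node embeddings), the sketch below embeds a simulated two-community multiplex network by eigendecomposing the mean of the layer adjacency matrices. The choice of the mean as summary matrix is an assumption for illustration, not necessarily one of the methods analyzed in this dissertation.

```python
# Illustrative joint spectral embedding: average the layers, then take
# a truncated eigendecomposition to get d-dimensional node embeddings.
import numpy as np

def joint_embedding(adjacencies, d):
    A_bar = np.mean(adjacencies, axis=0)        # summary matrix (assumed choice)
    vals, vecs = np.linalg.eigh(A_bar)          # symmetric eigendecomposition
    top = np.argsort(np.abs(vals))[::-1][:d]    # top-d eigenvalues by magnitude
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

rng = np.random.default_rng(0)
n, layers, d = 100, 3, 2
z = np.repeat([0, 1], n // 2)                   # two planted communities
P = np.where(z[:, None] == z[None, :], 0.30, 0.05)
As = []
for _ in range(layers):                         # sample symmetric binary layers
    U = np.triu((rng.random((n, n)) < P).astype(float), 1)
    As.append(U + U.T)

X = joint_embedding(As, d)                      # one point per vertex in R^d
print(X.shape)                                  # (100, 2)
```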
56

Detecting malware in memory with memory object relationships

Thomas, DeMarcus M., Sr. 10 December 2021 (has links)
Malware is a growing concern that affects not only large businesses but also individual consumers. As a result, there is a need to develop tools that can identify the malicious activities of malware authors. A useful technique to achieve this is memory forensics, the study of volatile data and its structures in Random Access Memory (RAM), which can be used to pinpoint what actions have occurred on a computer system. This dissertation utilizes memory forensics to extract relationships between memory objects, and applies supervised machine learning as a novel method for identifying malicious processes in a system memory dump. In this work, the Object Association Extractor (OAE) was created to extract objects in a memory dump and label the relationships as a graph of nodes and edges. With OAE, we extracted processes from 13,882 memory images that contained malware from the repository VirusShare, and from 91 memory images created with benign software from the package management tool Chocolatey. The final dataset contained 267,824 processes. Two feature sets were created from the processes dataset and used to train classifiers based on four classification algorithms. These classifiers were evaluated against the ZeroR method using accuracy and recall as the evaluation metrics. The experiments showed that both feature sets used to build classifiers were able to beat the ZeroR method for the Decision Tree and Random Forest algorithms. The Random Forest classifier achieved the highest performance, reaching a recall score of almost 97%.
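The evaluation protocol described above (trained classifiers versus a ZeroR baseline, scored by recall) can be sketched as follows. The features are synthetic stand-ins, since the OAE-derived dataset is not reproduced here, and scikit-learn's majority-class DummyClassifier plays the role of ZeroR.

```python
# Illustrative evaluation: Random Forest vs. a ZeroR-style baseline on
# synthetic stand-in features (the thesis's OAE features are not shown).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                  # per-process relationship features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # 1 = "malicious" (synthetic label)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

zeror = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("ZeroR recall:        ", recall_score(y_te, zeror.predict(X_te)))
print("Random Forest recall:", recall_score(y_te, forest.predict(X_te)))
```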
57

Community-Based Intrusion Detection

Weigert, Stefan 06 February 2017 (has links) (PDF)
Today, virtually every company worldwide is connected to the Internet. This widespread connectivity has given rise to sophisticated, targeted, Internet-based attacks. For example, between 2012 and 2013, security researchers counted an average of about 74 targeted attacks per day. These attacks are motivated by economic, financial, or political interests and are commonly referred to as Advanced Persistent Threat (APT) attacks. Unfortunately, many of these attacks are successful, and the adversaries manage to steal important data or disrupt vital services. The preferred victims are companies from vital industries, such as banks, defense contractors, or power plants. Given that these industries are well protected, often employing a team of security specialists, the question is: how can these attacks be so successful? Researchers have identified several properties of APT attacks that make them so effective. First, they are adaptable: they can change the way they attack, and the tools they use for this purpose, at any given moment in time. Second, they conceal their actions and communication, for example by using encryption; this renders many defense systems useless, as those systems assume complete access to the actual communication content. Third, their actions are stealthy, either keeping communication to the bare minimum or mimicking legitimate users, which lets them fly below the radar of defense systems that check for anomalous communication. Finally, with the goal of increasing their impact or monetisation prospects, their attacks are targeted against several companies from the same industry. Since months can pass between the first attack, its detection, and a comprehensive analysis, it is often too late to deploy appropriate countermeasures at business peers. Instead, it is much more likely that they have already been attacked successfully. This thesis addresses the question of whether the last property (industry-wide attacks) can be used to detect such attacks. It presents the design, implementation, and evaluation of a community-based intrusion detection system capable of protecting businesses at industry scale. The contributions of this thesis are as follows. First, it presents a novel algorithm for community detection that can detect an industry (e.g., the energy, financial, or defense industries) in Internet communication. Second, it demonstrates the design, implementation, and evaluation of a distributed graph mining engine that is able to scale with the throughput of the input data while maintaining an end-to-end latency for updates in the range of a few milliseconds. Third, it illustrates the use of this engine to detect APT attacks against industries by analyzing IP flow information from an Internet service provider. Finally, it introduces a detection-algorithm- and input-agnostic intrusion detection engine that supports not only intrusion detection on IP flows but any other intrusion detection algorithm and data source as well.
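The industry-detection algorithm contributed by the thesis is novel and not specified in the abstract. As a generic illustration of the kind of input it operates on, the sketch below aggregates IP flow records into a weighted host graph and runs an off-the-shelf label propagation pass; the algorithm choice here is an assumption, not the thesis's method.

```python
# Generic illustration: aggregate IP flows into a weighted graph and
# group hosts with an off-the-shelf community detection routine.
import networkx as nx
from networkx.algorithms.community import asyn_lpa_communities

flows = [                                   # (src_ip, dst_ip, flow_count)
    ("10.0.0.1", "203.0.113.5", 120),
    ("10.0.0.2", "203.0.113.5", 95),
    ("10.0.0.1", "10.0.0.2", 40),
    ("198.51.100.7", "198.51.100.8", 80),
    ("198.51.100.8", "198.51.100.9", 60),
]
G = nx.Graph()
for src, dst, count in flows:
    G.add_edge(src, dst, weight=count)

communities = list(asyn_lpa_communities(G, weight="weight", seed=0))
# Hosts grouped together could then be screened for shared, anomalous
# communication patterns suggesting an industry-wide campaign.
print(communities)
```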
59

Estimation of distribution algorithms based on phylogenetic trees

Soares, Antonio Helson Mineiro 27 June 2014 (has links)
Evolutionary algorithms that use probabilistic models of the distribution of variable values (to guide the search for problem solutions) are called Estimation of Distribution Algorithms (EDAs). These algorithms have shown relevant results on relatively complex problems. Their performance depends directly on the quality of the probabilistic models constructed, which in turn depends on the model-building methods. The best models are often constructed by computationally complex methods, resulting in EDAs that require high running times, although they are able to explore fewer points of the search space to find the solution of a problem. This work investigates probabilistic models obtained by phylogeny reconstruction algorithms, since some of these methods can efficiently produce models that represent well the main relationships among species (or among variables). This work proposes several strategies for making better use of phylogeny-based models in the development of EDAs, among them the use of a set of phylogenies, instead of a single phylogeny, as the model of correlation among variables; the synthesis of the most relevant information from this set into a network structure; and the identification of groups of correlated variables from one or more networks by means of a community detection algorithm. Using these advances in model construction, a new search technique, Composed Exhaustive Search, was developed, which makes it possible to solve combinatorial optimization problems of different difficulty levels. In addition, an extension of the new algorithm to multi-objective problems was proposed and shown to be capable of determining the Pareto-optimal front of the combinatorial problems investigated. Finally, the developed EDA achieves a compromise in terms of number of evaluations and computation time, obtaining results similar to those of the best algorithms found for each of these performance criteria on the problems tested.
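For readers unfamiliar with EDAs, the skeleton below shows the sample-select-reestimate loop the abstract refers to, using the simplest univariate (UMDA-style) model on a toy objective. The thesis's phylogeny-based model building would replace the model-estimation step and is not reproduced here.

```python
# UMDA-style EDA skeleton on the OneMax toy problem (illustrative only;
# the thesis builds its probabilistic model from phylogenies instead).
import numpy as np

rng = np.random.default_rng(0)
n_vars, pop_size, n_elite, n_gens = 30, 100, 30, 50
p = np.full(n_vars, 0.5)          # model: independent Bernoulli marginals

for _ in range(n_gens):
    pop = (rng.random((pop_size, n_vars)) < p).astype(int)  # sample model
    scores = pop.sum(axis=1)                                # OneMax fitness
    elite = pop[np.argsort(scores)[::-1][:n_elite]]         # select best
    p = elite.mean(axis=0).clip(0.05, 0.95)                 # re-estimate model

print("best solution:", elite[0], "fitness:", elite[0].sum())
```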
60

Metaheuristics for the graph clustering problem

Nascimento, Mariá Cristina Vasconcelos 26 February 2010 (has links)
The graph clustering problem consists of finding clusters of nodes in a given graph, i.e., finding subgraphs with high connectivity. The problem also goes by other names, among them the graph partitioning problem and the community detection problem. There are many mathematical formulations to model this problem, each with its advantages and disadvantages. Most of these formulations have the disadvantage of requiring the number of clusters in the final partition to be defined in advance. However, this type of information is not present in data for clustering, i.e., in unlabeled data. This is one of the reasons for the popularization in recent decades of the measure known as modularity, which is maximized to find graph partitions. Besides not requiring a predefined number of clusters, this formulation stands out for the quality of the partitions it produces. In this thesis, Greedy Randomized Search Procedures metaheuristics are proposed for two existing graph clustering formulations: one for the maximization of partition modularity and the other for the maximization of intra-cluster similarity. The results obtained by these metaheuristics were better than those of other heuristics found in the literature. However, their computational cost was high, especially for the metaheuristic for the modularity maximization model. Over the years, studies have revealed that the formulation that maximizes partition modularity has some limitations. In order to provide a worthy alternative to the modularity maximization model, this thesis proposes new mathematical formulations for clustering weighted and unweighted graphs, aimed at finding partitions whose clusters exhibit high connectivity. Furthermore, the proposed formulations are able to provide partitions without a prior definition of the number of clusters. Tests with hundreds of weighted graphs confirmed the efficiency of the proposed models. Comparing the partitions from all the models studied in this thesis, the best results were observed for one of the newly proposed formulations, which found quite satisfactory partitions, superior to the existing ones, even to modularity maximization. The results showed a high correlation with the true classification of the simulated and real data, the latter mostly of biological origin.
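The modularity measure maximized by the first formulation above is the standard Newman-Girvan quantity. For a graph with adjacency matrix A, node degrees k_i, m edges, and a partition assigning node i to cluster c_i:

```latex
Q \;=\; \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Maximizing Q rewards partitions whose intra-cluster edge density exceeds what a degree-preserving random graph would predict, which is why the number of clusters need not be fixed in advance.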
