41

Efficient Graph Summarization of Large Networks

Hajiabadi, Mahdi 24 June 2022 (has links)
In this thesis, we study the notion of graph summarization, which is a fundamental task of finding a compact representation of the original graph, called the summary. Graph summarization can be used for reducing the footprint of the input graph, better visualization, anonymizing the identity of users, and query answering. We consider two frameworks of graph summarization in this thesis: the utility-based framework and the correction set-based framework. In the utility-based framework, the input graph is summarized as long as a utility threshold is not violated. In the correction set-based framework, a set of correction edges is produced along with the summary graph. In this thesis we propose two algorithms for the utility-based framework and one for the correction set-based framework. All three of these algorithms are for static graphs (i.e. graphs that do not change over time). Then, we propose two more utility-based algorithms for fully dynamic graphs (i.e. graphs with edge insertions and deletions). Algorithms for graph summarization can be lossless (summarizing the input graph without losing any information) or lossy (losing some information about the input graph in order to summarize it more). Some of our algorithms are lossless and some are lossy, but with controlled utility loss. Our first utility-driven graph summarization algorithm, G-SCIS, is based on a clique and independent set decomposition that produces optimal compression with zero loss of utility. The compression provided is significantly better than the state of the art in lossless graph summarization, while the runtime is two orders of magnitude lower. Our second algorithm is T-BUDS, a highly scalable, utility-driven algorithm for fully controlled lossy summarization. It achieves high scalability by combining memory reduction using a maximum spanning tree with a novel binary search procedure. T-BUDS outperforms the state of the art drastically in terms of the quality of summarization and is about two orders of magnitude better in terms of speed. In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance impediment as the utility threshold (and size of summary) decreases. Also, we show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, PageRank and shortest paths. We then propose LDME, a correction set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number of comparisons required to find good node merges, (2) an efficient way to compute the best quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. More interestingly, our algorithm provides performance tuning settings to allow the option of trading compression for running time. On high compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression. We also present two lossless summarization algorithms, Optimal and Scalable, for summarizing fully dynamic graphs. More concretely, we follow the framework of G-SCIS, which produces summaries that can be used as-is in several graph analytics tasks. 
Different from G-SCIS, which is a batch algorithm, Optimal and Scalable are fully dynamic and can respond rapidly to each change in the graph. Not only are Optimal and Scalable able to outperform G-SCIS and other batch algorithms by several orders of magnitude, but they also significantly outperform MoSSo, the state of the art in lossless dynamic graph summarization. While Optimal always produces the optimal summary, Scalable is able to trade the amount of node reduction for extra scalability. For reasonable values of the parameter $K$, Scalable is able to outperform Optimal by an order of magnitude in speed, while keeping the rate of node reduction close to that of Optimal. An interesting fact that we observed experimentally is that even if we were to run a batch algorithm, such as G-SCIS, once for every big batch of changes, it would still be much slower than Scalable. For instance, if 1 million changes occur in a graph, Scalable is two orders of magnitude faster than running G-SCIS just once at the end of the 1 million-edge sequence. / Graduate
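As an illustration of the correction set-based framework mentioned above, the following minimal Python sketch computes how many edges a single node merge saves under a simplified node-level cost model (one superedge for the merged pair plus correction edges for mismatched neighbours). The cost model and the function name are assumptions made for illustration; this is not the G-SCIS or LDME code.

```python
# Hypothetical sketch of the correction set-based idea: two nodes become one
# supernode, and edges the supernode representation gets wrong are recorded
# as correction edges.

def merge_saving(adj, u, v):
    """Edges saved by merging u and v into one supernode.

    adj: dict node -> set of neighbours (undirected, no self-loops).
    For a common neighbour w, one superedge {u, v}-w replaces the two plain
    edges u-w and v-w. A neighbour of only one of u, v is cheapest to keep
    as a single correction edge, so it saves nothing.
    """
    nu, nv = adj[u] - {v}, adj[v] - {u}
    common = nu & nv
    # cost before: len(nu) + len(nv); cost after: len(nu) + len(nv) - len(common)
    return len(common)

if __name__ == "__main__":
    g = {
        1: {2, 3, 4},
        2: {1, 3, 4},
        3: {1, 2},
        4: {1, 2},
    }
    # nodes 3 and 4 have identical neighbourhoods, so merging them saves 2 edges
    print(merge_saving(g, 3, 4))  # -> 2
```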
42

An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing / En undersökning av kodningstekniker för diskreta variabler inom maskininlärning: binär mot one-hot och feature hashing

Seger, Cedric January 2018 (has links)
Machine learning methods can be used for solving important binary classification tasks in domains such as display advertising and recommender systems. In many of these domains, categorical features are common and often of high cardinality. Using one-hot encoding in such circumstances leads to very high-dimensional vector representations, causing memory and computational concerns for machine learning models. This thesis investigated the viability of a binary encoding scheme in which categorical values were mapped to integers that were then encoded in a binary format. This binary scheme allowed for representing categorical features using log2(d)-dimensional vectors, where d is the dimension associated with a one-hot encoding. To evaluate the performance of the binary encoding, it was compared against one-hot and feature hashed representations with the use of linear logistic regression and neural network-based models. These models were trained and evaluated using data from two publicly available datasets: Criteo and Census. The results showed that a one-hot encoding with a linear logistic regression model gave the best performance according to the PR-AUC metric. This was, however, at the expense of using 118- and 65,953-dimensional vector representations for Census and Criteo respectively. A binary encoding led to a lower performance but used only 35 and 316 dimensions respectively. For Criteo, binary encoding suffered significantly in performance and feature hashing was perceived as a more viable alternative. It was also found that employing a neural network helped mitigate any loss in performance associated with using binary and feature hashed representations. / Maskininlärningsmetoder kan användas för att lösa viktiga binära klassificeringsuppgifter i domäner som displayannonsering och rekommendationssystem. I många av dessa domäner är kategoriska variabler vanliga och ofta av hög kardinalitet. Användning av one-hot-kodning under sådana omständigheter leder till väldigt högdimensionella vektorrepresentationer. Detta orsakar minnes- och beräkningsproblem för maskininlärningsmodeller. Denna uppsats undersökte användbarheten för ett binärt kodningsschema där kategoriska värden avbildades på heltalsvärden som sedan kodades i ett binärt format. Detta binära system tillät att representera kategoriska värden med hjälp av log2(d)-dimensionella vektorer, där d är dimensionen förknippad med en one-hot-kodning. För att utvärdera prestandan för den binära kodningen jämfördes den mot one-hot och en hashbaserad kodning. En linjär logistikregression och ett neuralt nätverk tränades med hjälp av data från två offentligt tillgängliga dataset: Criteo och Census, och den slutliga prestandan jämfördes. Resultaten visade att en one-hot-kodning med en linjär logistisk regressionsmodell gav den bästa prestandan enligt PR-AUC-måttet. Denna metod använde dock 118 och 65,953 dimensionella vektorrepresentationer för Census respektive Criteo. En binär kodning ledde till en lägre prestanda generellt, men använde endast 35 respektive 316 dimensioner. Den binära kodningen presterade väsentligt sämre specifikt för Criteo-datan, istället var den hashbaserade kodningen en mer attraktiv lösning. Försämringen i prestationen associerad med binär och hashbaserad kodning kunde mildras av att använda ett neuralt nätverk.
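A minimal sketch of the binary encoding scheme investigated here: each categorical value is mapped to an integer index and then to a fixed-width bit vector of roughly log2(d) dimensions. The helper name and the sorted-category indexing rule are illustrative assumptions, not the thesis implementation.

```python
# Illustrative binary encoding: category -> integer index -> bit vector.
import math

def binary_encode(values):
    """Map each categorical value to a ceil(log2(d))-bit vector (d = number of categories)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    width = max(1, math.ceil(math.log2(len(categories))))
    return [[(index[v] >> b) & 1 for b in reversed(range(width))] for v in values]

print(binary_encode(["red", "green", "blue", "green"]))
# 3 categories -> 2 bits per value, versus 3 dimensions for a one-hot encoding
```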
43

Scale and Concurrency of Massive File System Directories

Patil, Swapnil 01 May 2013 (has links)
File systems store data in files and organize these files in directories. Over decades, file systems have evolved to handle increasingly large files: they distribute files across a cluster of machines, they parallelize access to these files, they decouple data access from metadata access, and hence they provide scalable file access for high-performance applications. Sadly, most cluster-wide file systems lack any sophisticated support for large directories. In fact, most cluster file systems continue to use directories that were designed for humans, not for large-scale applications. The former use case typically involves hundreds of files and infrequent concurrent mutations in each directory, while the latter use case consists of tens of thousands of concurrent threads that simultaneously create large numbers of small files in a single directory at very high speeds. As a result, most cluster file systems exhibit a very poor file create rate in a directory, either due to limited scalability from using a single centralized directory server or due to reduced concurrency from using a system-wide synchronization mechanism. This dissertation proposes a directory architecture called GIGA+ that enables a directory in a cluster file system to store millions of files and sustain hundreds of thousands of concurrent file creations every second. GIGA+ makes two contributions: a concurrent indexing technique to scale out a growing directory on many servers and an efficient layered design to scale up performance. GIGA+ uses a hash-based, incremental partitioning algorithm that enables highly concurrent directory indexing through asynchrony and eventual consistency of the internal indexing state (while providing strong consistency guarantees to the application data). This dissertation analyzes several trade-offs made by the GIGA+ design between data migration overhead, load balancing effectiveness, directory scan performance, and entropy of the indexing state, and compares them with policies used in other systems. GIGA+ also demonstrates a modular implementation that separates directory distribution from directory representation. It layers a client-server middleware, which spreads work among many GIGA+ servers, on top of a backend storage system, which manages the on-disk directory representation. This dissertation studies how system behavior is tightly dependent on both the indexing scheme and the on-disk implementations, and evaluates how the system performs for different backend configurations, including local and shared-disk stores. The GIGA+ prototype delivers highly scalable directory performance (that exceeds the most demanding Petascale-era requirements), provides the traditional UNIX file system interface (that can run applications without any modifications) and offers new functionality layered on existing cluster file systems (that lack support for distributed directories).
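The hash-based, incremental partitioning idea can be sketched in a few lines: a directory partition that outgrows a threshold splits in two, and the split consumes one more bit of the filename hash, so partitions can spread over servers without global coordination. This is an illustrative toy under assumed parameters (a SHA-1 filename hash, a tiny split threshold), not the GIGA+ implementation.

```python
# Toy incremental hash partitioning of a directory; splits happen lazily on insert.
import hashlib

SPLIT_THRESHOLD = 4  # tiny value for demonstration only

def h(name, bits):
    """Low `bits` bits of a stable hash of the file name."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)

class Directory:
    def __init__(self):
        self.partitions = {0: set()}  # partition id -> file names
        self.depth = {0: 0}           # bits of the hash used by each partition

    def create(self, name):
        pid = self._lookup(name)
        self.partitions[pid].add(name)
        if len(self.partitions[pid]) > SPLIT_THRESHOLD:
            self._split(pid)

    def _lookup(self, name):
        # the (id, depth) pairs form a prefix-free cover of the hash space,
        # so exactly one partition matches the low bits of the name's hash
        for pid, bits in self.depth.items():
            if h(name, bits) == pid:
                return pid
        raise KeyError(name)

    def _split(self, pid):
        bits = self.depth[pid] + 1
        sibling = pid | (1 << (bits - 1))
        old = self.partitions[pid]
        self.partitions[pid], self.partitions[sibling] = set(), set()
        self.depth[pid], self.depth[sibling] = bits, bits
        for name in old:  # only the split partition is re-hashed
            self.partitions[h(name, bits)].add(name)

d = Directory()
for i in range(20):
    d.create(f"file-{i}")
print({pid: len(files) for pid, files in d.partitions.items()})
```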
44

Large scale optimization methods for metric and kernel learning

Jain, Prateek 06 November 2014 (has links)
A large number of machine learning algorithms are critically dependent on the underlying distance/metric/similarity function. Learning an appropriate distance function is therefore crucial to the success of many methods. The class of distance functions that can be learned accurately is characterized by the amount and type of supervision available to the particular application. In this thesis, we explore a variety of such distance learning problems using different amounts/types of supervision and provide efficient and scalable algorithms to learn appropriate distance functions for each of these problems. First, we propose a generic regularized framework for Mahalanobis metric learning and prove that for a wide variety of regularization functions, metric learning can be used for efficiently learning a kernel function incorporating the available side-information. Furthermore, we provide a method for fast nearest neighbor search using the learned distance/kernel function. We show that a variety of existing metric learning methods are special cases of our general framework. Hence, our framework also provides a kernelization scheme and a fast similarity search scheme for such methods. Second, we consider a variation of our standard metric learning framework where the side-information is incremental, streaming and cannot be stored. For this problem, we provide an efficient online metric learning algorithm that compares favorably to existing methods both theoretically and empirically. Next, we consider a contrasting scenario where the amount of supervision being provided is extremely small compared to the number of training points. For this problem, we consider two different modeling assumptions: 1) the data lie on a low-dimensional linear subspace, and 2) the data lie on a low-dimensional non-linear manifold. The first assumption, in particular, leads to the problem of matrix rank minimization over polyhedral sets, which is a problem of immense interest in numerous fields including optimization, machine learning, computer vision, and control theory. We propose a novel online learning based optimization method for the rank minimization problem and provide provable approximation guarantees for it. The second assumption leads to our geometry-aware metric/kernel learning formulation, where we jointly model the metric/kernel over the data along with the underlying manifold. We provide an efficient alternating minimization algorithm for this problem and demonstrate its wide applicability and effectiveness by applying it to various machine learning tasks such as semi-supervised classification, colored dimensionality reduction, and manifold alignment. Finally, we consider the task of learning distance functions under no supervision, which we cast as a problem of learning disparate clusterings of the data. To this end, we propose a discriminative approach and a generative model based approach, and we provide efficient algorithms with convergence guarantees for both approaches. / text
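As a brief illustration of the objects this framework optimizes, the sketch below evaluates a Mahalanobis squared distance (x - y)^T M (x - y) and the linear kernel x^T M y it induces; the random positive semidefinite M is a stand-in for a learned matrix, not the output of the thesis algorithms.

```python
# Mahalanobis distance and its induced kernel for an (assumed) learned matrix M.
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 2))
M = L.T @ L  # any M = L^T L is positive semidefinite

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print("squared distance:", mahalanobis_sq(x, y, M))
print("induced kernel value:", float(x @ M @ y))
```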
45

ID Photograph hashing : a global approach / Hachage de photographie d’identité : une approche globale

Smoaca, Andreea 12 December 2011 (has links)
Cette thèse traite de la question de l’authenticité des photographies d’identité, partie intégrante des documents nécessaires lors d’un contrôle d’accès. Alors que les moyens de reproduction sophistiqués sont accessibles au grand public, de nouvelles méthodes / techniques doivent empêcher toute falsification / reproduction non autorisée de la photographie d’identité. Cette thèse propose une méthode de hachage pour l’authentification de photographies d’identité, robuste à l’impression-lecture. Ce travail met ainsi l’accent sur les effets de la numérisation au niveau du hachage. L’algorithme mis au point procède à une réduction de dimension, basée sur l’analyse en composantes indépendantes (ICA). Dans la phase d’apprentissage, le sous-espace de projection est obtenu en appliquant l’ICA puis réduit selon une stratégie de sélection entropique originale. Dans l’étape d’extraction, les coefficients obtenus après projection de l’image d’identité sur le sous-espace sont quantifiés et binarisés pour obtenir la valeur de hachage. L’étude révèle les effets du bruit de balayage intervenant lors de la numérisation des photographies d’identité sur les valeurs de hachage et montre que la méthode proposée est robuste à l’attaque d’impression-lecture. L’approche suivie, en se focalisant sur le hachage robuste d’une classe restreinte d’images (d’identité), se distingue des approches classiques qui adressent une image quelconque. / This thesis addresses the question of the authenticity of identity photographs, part of the documents required for access control. Since sophisticated means of reproduction are publicly available, new methods / techniques should prevent tampering and unauthorized reproduction of the photograph. This thesis proposes a hashing method for the authentication of identity photographs that is robust to print-and-scan. The study also focuses on the effects of digitization at the hash level. The developed algorithm performs a dimension reduction based on independent component analysis (ICA). In the learning stage, the projection subspace is obtained by applying ICA and then reduced according to an original entropic selection strategy. In the extraction stage, the coefficients obtained after projecting the identity image onto the subspace are quantized and binarized to obtain the hash value. The study reveals the effects of scanning noise on the hash values of identity photographs and shows that the proposed method is robust to the print-and-scan attack. The approach, which focuses on robust hashing of a restricted class of images (identity photographs), differs from classical approaches that address arbitrary images.
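A rough sketch of the hashing pipeline described above, projection onto an ICA subspace followed by quantization and binarization, is given below. Synthetic data stands in for the ID-photograph training set, and the median-threshold binarization rule is an assumption, not the thesis' entropic selection strategy.

```python
# ICA-subspace projection + binarization on toy data (assumed pipeline, not the thesis code).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(42)
train = rng.normal(size=(200, 64 * 64))  # 200 flattened "photos" (synthetic stand-in)

ica = FastICA(n_components=32, random_state=0, max_iter=500)
ica.fit(train)

def perceptual_hash(image, ica):
    """Binarize the ICA coefficients of one flattened image into a 32-bit string."""
    coeffs = ica.transform(image.reshape(1, -1))[0]
    bits = (coeffs > np.median(coeffs)).astype(int)  # simple median threshold
    return "".join(map(str, bits))

photo = rng.normal(size=64 * 64)
print(perceptual_hash(photo, ica))
```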
46

Hashing algorithms : A comparison for blockchains in Internet of things

Dahlin, Karl January 2018 (has links)
In today’s society, blockchains and the Internet of Things are two widely discussed subjects, which has led to the idea of combining them by using a blockchain in the Internet of Things. The objective of this study has been to determine which hashing algorithm is best for a blockchain used in an Internet of Things network. This was done by first selecting a few hashing algorithms and setting up a scenario where a blockchain can be used in an Internet of Things network, and then specifying what to compare: speed, input length and output length. The study has been conducted with the aid of literature studies of the hashing algorithms and a program that implements the algorithms and tests their speed. The study has shown that, out of the selected hashing algorithms MD5, SHA-256 and SHA3-256, and under the conditions specified for the study, SHA3-256 is the best option if speed is not of the utmost importance in the scenario, since it comes from a newer standard and does not have a maximum input length. If speed is very important, in other words if SHA3-256 is too slow, then SHA-256 would be best for the Internet of Things network.
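The kind of speed comparison performed in the study can be reproduced in miniature with Python's hashlib; the message size and iteration count below are arbitrary choices, and the thesis used its own benchmark program, so absolute figures will differ.

```python
# Rough timing of the three hashing algorithms compared in the study.
import hashlib
import timeit

payload = b"x" * 1024  # a 1 KiB message, e.g. a block of IoT sensor readings

for name in ("md5", "sha256", "sha3_256"):
    seconds = timeit.timeit(lambda: hashlib.new(name, payload).hexdigest(), number=100_000)
    print(f"{name:9s} 100k hashes of 1 KiB: {seconds:.2f} s")
```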
47

Protection des données visuelles : analyse des fonctions de hachage perceptuel / Protection of Visual Data : perceptual Hashing Functions Analysis

Hadmi, Azhar 26 October 2012 (has links)
Avec une croissance considérable dans le domaine de la technologie multimédia et la forte disponibilité des logiciels de traitement d'image, il est désormais facile de manipuler les images numériques. En même temps, il est devenu possible de réaliser des contrefaçons très élaborées, en laissant très peu de traces. Cela présente un grave problème, en particulier si l'authenticité de l'image numérique est exigée. L'authentification des images devrait se baser sur leurs contenus visuels et non pas sur leurs contenus binaires. Par conséquent, pour authentifier une image, il faut tolérer quelques manipulations acceptables que pourrait subir une image, telles que la compression JPEG et l'ajout de bruit gaussien. En effet, ces manipulations préservent l'aspect visuel de l'image. En même temps, un système de hachage perceptuel doit être suffisamment fragile pour détecter les manipulations malveillantes qui modifient l'interprétation du contenu sémantique de l'image, comme l'ajout de nouveaux objets, la suppression ou la modification majeure d'objets existants. Dans cette thèse, nous nous intéressons aux fonctions de hachage perceptuel pour l'authentification et le contrôle d'intégrité des images numériques. Dans ce but, nous présentons tous les aspects relatifs aux fonctions de hachage perceptuel. Puis, nous exposons les contraintes qu'un système de hachage perceptuel doit satisfaire pour répondre aux exigences souhaitées au niveau de la robustesse des signatures perceptuelles. Enfin, nous présentons une méthode permettant d'améliorer la robustesse et la sécurité d'un système de hachage perceptuel. / The widespread use of multimedia technology has made it relatively easy to manipulate and tamper with visual data. In particular, digital image processing and image manipulation tools offer facilities to intentionally alter image content without leaving perceptual traces. This presents a serious problem, particularly if the authenticity of the digital image is required. Image authentication should be based on visual content and not on binary content. Therefore, to authenticate an image, some acceptable manipulations that an image could undergo, such as JPEG compression and Gaussian noise addition, must be tolerated. Indeed, these manipulations preserve the visual appearance of the image. At the same time, a perceptual hashing system should be sufficiently sensitive to detect malicious manipulations that modify the interpretation of the semantic content of the image, such as adding new objects, deleting existing objects, or making major modifications to them. In this thesis, we focus on perceptual hash functions for authentication and integrity verification of digital images. For this purpose, we present all aspects of perceptual hash functions. Then, we discuss the constraints that a perceptual hashing system must satisfy to meet the desired level of robustness of the perceptual signatures. Finally, we present a method to improve the robustness and security of a perceptual hashing system.
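The robustness requirement described above, that an acceptable manipulation such as JPEG compression should leave the hash nearly unchanged, can be checked with a toy experiment. The sketch below uses a simple average-hash and a synthetic gradient image as stand-ins, not the thesis' perceptual hash.

```python
# Measure how many hash bits flip after JPEG compression (illustrative only).
import io
import numpy as np
from PIL import Image

def average_hash(img, size=8):
    """64-bit average hash: downscale to 8x8 grayscale and threshold at the mean."""
    small = np.asarray(img.convert("L").resize((size, size)), dtype=float)
    return (small > small.mean()).flatten()

def jpeg_roundtrip(img, quality=50):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

gradient = np.tile(np.linspace(0, 255, 128, dtype=np.uint8), (128, 1))
img = Image.fromarray(gradient)
h1 = average_hash(img)
h2 = average_hash(jpeg_roundtrip(img))
print("Hamming distance after JPEG compression:", int((h1 != h2).sum()), "of", h1.size)
```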
48

Distributed Local Outlier Factor with Locality-Sensitive Hashing

Zheng, Lining 08 November 2019 (has links)
Outlier detection remains an active research area due to its essential role in a wide range of applications, including intrusion detection, fraud detection in finance, medical diagnosis, etc. Local Outlier Factor (LOF) has been one of the most influential outlier detection techniques over the past decades. LOF has distinctive advantages on skewed datasets with regions of various densities. However, the traditional centralized LOF faces new challenges in the era of big data and no longer satisfies the rigid time constraints required by many modern applications, due to its expensive computational overhead. A few researchers have explored distributed solutions for LOF, but existing methods are limited by their grid-based data partitioning strategy, which falls short when applied to high-dimensional data. In this thesis, we study efficient distributed solutions for LOF. A baseline MapReduce solution for LOF implemented with Apache Spark, named MR-LOF, is introduced. We demonstrate its disadvantages in communication cost and execution time through complexity analysis and experimental evaluation. Then an approximate LOF method is proposed, which relies on locality-sensitive hashing (LSH) for partitioning data and enables fully distributed local computation. We name it MR-LOF-LSH. To further improve the approximate LOF, we introduce a process called cross-partition updating. With cross-partition updating, the actual global k-nearest neighbors (k-NN) of the outlier candidates are found, and the related information of the neighbors is used to update the outlier scores of the candidates. The experimental results show that MR-LOF achieves a speedup of up to 29 times over the centralized LOF. MR-LOF-LSH further reduces the execution time by a factor of up to 9.9 compared to MR-LOF. The results also highlight that MR-LOF-LSH scales well as the cluster size increases. Moreover, with a sufficient candidate size, MR-LOF-LSH is able to detect in most scenarios over 90% of the top outliers with the highest LOF scores computed by the centralized LOF algorithm.
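A simplified, single-machine sketch of the LSH partitioning idea behind MR-LOF-LSH: points are bucketed by random-hyperplane signatures and LOF is computed locally within each bucket. The bucket-local scoring via scikit-learn and the parameter choices are assumptions, not the thesis' Spark implementation, and there is no cross-partition updating here.

```python
# LSH bucketing + bucket-local LOF scoring (illustrative approximation).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 16))
planes = rng.normal(size=(4, 16))               # 4 random hyperplanes -> up to 16 buckets

signatures = (X @ planes.T > 0).astype(int)
bucket_ids = signatures @ (1 << np.arange(4))   # pack the 4 sign bits into a bucket id

scores = np.empty(len(X))
for b in np.unique(bucket_ids):
    idx = np.where(bucket_ids == b)[0]
    k = min(20, len(idx) - 1)
    if k < 1:                                    # bucket too small to score
        scores[idx] = 1.0
        continue
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X[idx])
    scores[idx] = -lof.negative_outlier_factor_  # higher -> more outlying

print("top-5 outlier indices:", np.argsort(scores)[-5:])
```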
49

The Maximum Displacement for Linear Probing Hashing

Petersson, Niclas January 2009 (has links)
In this thesis we study the standard probabilistic model for hashing with linear probing. The main purpose is to determine the asymptotic distribution of the maximum displacement. Depending on the ratio between the number of items and the number of cells, there are several cases to consider. Paper I solves the problem for the special case of almost full hash tables, that is, hash tables where every cell but one is occupied. Paper II completes the analysis by solving the problem for all remaining cases, that is, for every case where the number of items divided by the number of cells lies in the interval [0,1]. The last two papers treat quite different topics. Paper III studies the area covered by the supremum process of Brownian motion. One of the main theorems in Paper I is expressed in terms of the Laplace transform of this area. Paper IV provides a new sufficient condition for a collection of independent random variables to be negatively associated when conditioned on their total sum. The condition applies to a collection of independent Borel-distributed random variables, which made it possible to prove a Poisson approximation that was essential for the completion of Paper II.
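The quantity studied in the thesis can be explored empirically with a small simulation: hash n items into m cells with linear probing and record the largest displacement. The parameters below are illustrative only.

```python
# Simulate the maximum displacement of linear probing at different load factors.
import random

def max_displacement(n, m, seed=0):
    """Insert n items into a table of m cells with linear probing; return the worst displacement."""
    random.seed(seed)
    table = [None] * m
    worst = 0
    for item in range(n):
        home = random.randrange(m)
        pos = home
        while table[pos] is not None:
            pos = (pos + 1) % m
        table[pos] = item
        worst = max(worst, (pos - home) % m)
    return worst

m = 10_000
for load in (0.5, 0.9, 0.99):
    print(f"load {load}: max displacement {max_displacement(int(load * m), m)}")
```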
50

3D Object Recognition by Geometric Hashing for Robotics Applications

Hozatli, Aykut 01 February 2009 (has links) (PDF)
The main aim of 3D object recognition is to recognize objects under translation and rotation. Geometric hashing is one of the methods that offer a rotation- and translation-invariant approach and provide efficient indexing of the structural features of objects. In this thesis, geometric hashing is used to store the geometric relationships between discriminative surface properties that are based on surface curvature. The surface is represented by the shape index and the splash, where the shape index defines particular shaped surfaces and the splash introduces topological information. The method is tested on 3D object databases and compared with other methods in the literature.
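A compact 2D sketch of the geometric hashing principle: model points are stored in a table keyed by their quantized coordinates in every ordered basis pair, which makes lookups invariant to rotation and translation. The thesis applies the idea to 3D surface features (shape index and splash), so this toy differs in both dimension and features.

```python
# Build a geometric hash table for a 2D point model (rotation/translation invariant keys).
from collections import defaultdict
from itertools import permutations
import numpy as np

def build_table(points, q=0.5):
    table = defaultdict(list)
    pts = np.asarray(points, dtype=float)
    for i, j in permutations(range(len(pts)), 2):
        origin, axis = pts[i], pts[j] - pts[i]
        norm = np.linalg.norm(axis)
        if norm == 0:
            continue
        ux = axis / norm
        uy = np.array([-ux[1], ux[0]])            # perpendicular unit vector
        for k, p in enumerate(pts):
            if k in (i, j):
                continue
            rel = p - origin
            # quantize the point's coordinates in this basis frame
            key = (round(float(rel @ ux) / q), round(float(rel @ uy) / q))
            table[key].append((i, j))             # remember which basis produced the entry
    return table

model = [(0, 0), (2, 0), (1, 1), (0, 2)]
table = build_table(model)
print(len(table), "hash bins for", len(model), "model points")
```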
