141 |
Novel storage architectures and pointer-free search trees for database systemsVasaitis, Vasileios January 2012 (has links)
Database systems research is an old and well-established field in computer science. Many of the key concepts appeared as early as the 60s, while the core of relational databases, which have dominated the database world for a while now, was solidified during the 80s. However, the underlying hardware has not displayed such stability in the same period, which means that a lot of assumptions that were made about the hardware by early database systems are not necessarily true for modern computer architectures. In particular, over the last few decades there have been two notable consistent trends in the evolution of computer hardware. The first is that the memory hierarchy of mainstream computer systems has been getting deeper, with its different levels moving away from each other, and new levels being added in between as a result, in particular cache memories. The second is that, when it comes to data transfers between any two adjacent levels of the memory hierarchy, access latencies have not been keeping up with transfer rates. The challenge is therefore to adapt database index structures so that they become immune to these two trends. The latter is addressed by gradually increasing the size of the data transfer unit; the former, by organizing the data so that it exhibits good locality for memory transfers across multiple memory boundaries. We have developed novel structures that facilitate both of these strategies. We started our investigation with the venerable B+-tree, which is the cornerstone order-preserving index of any database system, and we have developed a novel pointer-free tree structure for its pages that optimizes its cache performance and makes it immune to the page size. We then adapted our approach to the R-tree and the GiST, making it applicable to multi-dimensional data indexes as well as generalized indexes for any abstract data type. Finally, we have investigated our structure in the context of main memory alone, and have demonstrated its superiority over the established approaches in that setting too. While our research has its roots in data structures and algorithms theory, we have conducted it with a strong experimental focus, as the complex interactions within the memory hierarchy of a modern computer system can be quite challenging to model and theorize about effectively. Our findings are therefore backed by solid experimental results that verify our hypotheses and prove the superiority of our structures over competing approaches.
|
142 |
Distributed Frameworks Towards Building an Open Data ArchitectureVenumuddala, Ramu Reddy 05 1900 (has links)
Data is everywhere. The current Technological advancements in Digital, Social media and the ease at which the availability of different application services to interact with variety of systems are causing to generate tremendous volumes of data. Due to such varied services, Data format is now not restricted to only structure type like text but can generate unstructured content like social media data, videos and images etc. The generated Data is of no use unless been stored and analyzed to derive some Value. Traditional Database systems comes with limitations on the type of data format schema, access rates and storage sizes etc. Hadoop is an Apache open source distributed framework that support storing huge datasets of different formatted data reliably on its file system named Hadoop File System (HDFS) and to process the data stored on HDFS using MapReduce programming model. This thesis study is about building a Data Architecture using Hadoop and its related open source distributed frameworks to support a Data flow pipeline on a low commodity hardware. The Data flow components are, sourcing data, storage management on HDFS and data access layer. This study also discuss about a use case to utilize the architecture components. Sqoop, a framework to ingest the structured data from database onto Hadoop and Flume is used to ingest the semi-structured Twitter streaming json data on to HDFS for analysis. The data sourced using Sqoop and Flume have been analyzed using Hive for SQL like analytics and at a higher level of data access layer, Hadoop has been compared with an in memory computing system using Spark. Significant differences in query execution performances have been analyzed when working with Hadoop and Spark frameworks. This integration helps for ingesting huge Volumes of streaming json Variety data to derive better Value based analytics using Hive and Spark.
|
143 |
Řídké třídy grafů / Nowhere-dense classes of graphsTůma, Vojtěch January 2013 (has links)
In this thesis we study sparse classes of graphs and their properties usable for design of algorithms and data structures. Our specific focus is on the con- cepts of bounded expansion and tree-depth, developed in recent years mainly by J. Nešetřil and P. Ossona de Mendez. We first give a brief introduction to the theory as whole and survey tools and results from related areas of parametrised complexity and algorithmic model theory. The main part of the thesis, application of the theory, presents two new dynamic data structures. The first is for keeping a tree-depth decomposition of a graph, the second counts appearances of fixed subgraphs in a given graph. The time and space complexity of operations of both structures is guaranteed to be low when used for sparse graphs. 1
|
144 |
Enhancing the Verification-Driven Learning Model for Data Structures with VisualizationKondeti, Yashwanth Reddy 04 August 2011 (has links)
The thesis aims at teaching various data structures algorithms using the Visualization Learning tool. The main objective of the work is to provide a learning opportunity for novice computer science students to gain a broader exposure towards data structure programming. The visualization learning tool is based on the Verification-Driven Learning model developed for software engineering. The tool serves as a platform for demonstrating visualizations of various data structures algorithms. All the visualizations are designed to emphasize the important operational features of various data structures. The learning tool encourages students into learning data structures by designing Learning Cases. The Learning Cases have been carefully designed to systematically implant bugs in a properly functioning visualization. Students are assigned the task of analyzing the code and also identify the bugs through quizzing. This provides students with a challenging hands-on learning experience that complements students’ textbook knowledge. It also serves as a significant foundation for pursuing future courses in data structures.
|
145 |
Algorithmes et structures de données efficaces pour l’indexation de séquences d’ADN / Efficient algorithms and data structures for indexing DNA sequence dataSalikhov, Kamil 17 November 2017 (has links)
Les volumes des données générées par les technologies de séquençage haut débit augmentent exponentiellement ce dernier temps. Le stockage, le traitement et le transfertdeviennent des défis de plus en plus sérieux. Pour les affronter, les scientifiques doivent élaborer des approches et des algorithmes de plus en plus efficaces.Dans cette thèse, nous présentons des structures de données efficaces etdes algorithmes pour des problèmes de recherche approchée de chaînes de caractères, d'assemblagedu génome, de compression de séquences d’ADN et de classificationmétagénomique de lectures d’ADN.Le problème de recherche approchée a été bien étudié, avec un grandnombre de travaux publiés. Dans ledomaine de bioinformatique, le problème d’alignement de séquences peut être considéré comme unproblème de recherche approchée de chaînes de caractères. Dans notre travail, nousétudions une stratégie de recherche basée sur une structure d'indexation ditebidirectionnelle. D’abord, nous définissons un formalisme des schémas de recherche pour travailleravec les stratégies de recherche de ce type, ensuite nous fixons une mesure probabiliste del’efficacité de schémas de recherche et démontrons quelques propriétés combinatoires de schémasde recherche efficaces. Finalement, nous présentons des calculs expérimentaux quivalident la supériorité de nos stratégies. L’assemblage du génome est un des problèmes clefs en bioinformatique.Dans cette thèse, nous présentons une structure de données — filtre de Bloom en Cascade— qui améliore le filtre de Bloom standard et peut être utilisé pour larésolution de certains problèmes, y compris pour l’assemblage du génome. Nousdémontrons ensuite des résultats analytiques et expérimentaux sur les propriétés du filtre deBloom en Cascade. Nous présentons également comment le filtre de Bloom en Cascade peut être appliqué au problèmede compression de séquences d’ADN.Un autre problème que nous étudions dans cette thèse est la classificationmétagénomique de lectures d’ADN. Nous présentons une approche basée sur la transforméede Burrows-Wheeler pour la recherche efficace et rapide de k-mers (mots de longueur k).Cette étude est centrée sur les structures des données qui améliorent lavitesse et la consommation de mémoire par rapport à l'index classique de Burrows-Wheeler, dans le cadre de notre application / Amounts of data generated by Next Generation Sequencing technologies increase exponentially in recent years. Storing, processing and transferring this data become more and more challenging tasks. To be able to cope with them, data scientists should develop more and more efficient approaches and techniques.In this thesis we present efficient data structures and algorithmic methods for the problems of approximate string matching, genome assembly, read compression and taxonomy based metagenomic classification.Approximate string matching is an extensively studied problem with countless number of published papers, both theoretical and practical. In bioinformatics, read mapping problem can be regarded as approximate string matching. Here we study string matching strategies based on bidirectional indices. We define a framework, called search schemes, to work with search strategies of this type, then provide a probabilistic measure for the efficiency of search schemes, prove several combinatorial properties of efficient search schemes and provide experimental computations supporting the superiority of our strategies.Genome assembly is one of the basic problems of bioinformatics. Here we present Cascading Bloom filter data structure, that improves standard Bloom filter and can be applied to several problems like genome assembly. We provide theoretical and experimental results proving properties of Cascading Bloom filter. We also show how Cascading Bloom filter can be used for solving another important problem of read compression.Another problem studied in this thesis is metagenomic classification. We present a BWT-based approach that improves the BWT-index for quick and memory-efficient k-mer search. We mainly focus on data structures that improve speed and memory usage of classical BWT-index for our application
|
146 |
Data mining heuristic-¬based malware detection for android applicationsUnknown Date (has links)
The Google Android mobile phone platform is one of the dominant smartphone operating systems on the market. The open source Android platform allows developers to take full advantage of the mobile operation system, but also raises significant issues related to malicious applications (Apps). The popularity of Android platform draws attention of many developers which also attracts the attention of cybercriminals to develop different kinds of malware to be inserted into the Google Android Market or other third party markets as safe applications. In this thesis, we propose to combine permission, API (Application Program Interface) calls and function calls to build a Heuristic-Based framework for the detection of malicious Android Apps. In our design, the permission is extracted from each App’s profile information and the APIs are extracted from the packed App file by using packages and classes to represent API calls. By using permissions, API calls and function calls as features to characterize each of Apps, we can develop a classifier by data mining techniques to identify whether an App is potentially malicious or not. An inherent advantage of our method is that it does not need to involve any dynamic tracking of the system calls but only uses simple static analysis to find system functions from each App. In addition, Our Method can be generalized to all mobile applications due to the fact that APIs and function calls are always present for mobile Apps. Experiments on real-world Apps with more than 1200 malwares and 1200 benign samples validate the algorithm performance.
Research paper published based on the work reported in this thesis:
Naser Peiravian, Xingquan Zhu, Machine Learning for Android Malware Detection
Using Permission and API Calls, in Proc. of the 25th IEEE International Conference on
Tools with Artificial Intelligence (ICTAI) – Washington D.C, November 4-6, 2013. / Includes bibliography. / Thesis (M.S.)--Florida Atlantic University, 2013.
|
147 |
Comparing different and inverter graph data structure / Comparativo de diferentes estruturas de dados de And-Inverter GraphBittencourt, Marcelo Corrêa de January 2018 (has links)
Este documento apresenta uma análise de desempenho de quatro diferentes implementações de And-Inverter Graph (AIG). AIGs são estruturas de dados normalmente utilizadas em programas que são utilizados para design de circuitos digitais. Diferentes implementações da mesma estrutura de dados pode afetar o desempenho. Isto é demonstrado em trabalhos anteriores que avaliam o desempenho de diferentes pacotes BDD (Binary Decision Diagram), que é outra estrutura de dados largamente utilizada em síntese lógica. Foram implementadas quatro estruturas de dados diferentes utilizando grafos unidirecionais ou bidirecionais aos quais os nodos são referenciados utilizando ponteiros ou índices de inteiros não-negativos. Utilizando estas diferentes estruturas de dados de AIG, medimos como diferentes aspectos das implementações afetam o desempenho da execução de um algoritmo básico. / This document presents a performance analysis of four different And-Inverter Graph (AIG) implementations. AIG is a data structure commonly used in programs used for digital circuits design. Different implementations of the same data structure can affect performance. This is demonstrated by previous works that evaluate performance for different Binary Decision Diagram (BDD) packages, another data structure widely used in logic synthesis. We have implemented four distinct AIG data structures using a choice of unidirectional or bidirectional graphs in which the references to nodes are made using pointers or indexed using non-negative integers. Using these different AIG data structures, we measure how different implementation aspects affect performance in running basic algorithm.
|
148 |
Estruturas de dados topológicas aplicadas em simulações de escoamentos compressíveis utilizando volumes finitos e métodos de alta ordem / Topologic data structures applied on compressible flows simulations using finite volume and high-order methodsBarbosa, Fernanda Paula 18 December 2012 (has links)
A representação de malhas por meio de estrutura de dados e operadores topológicos e um dos focos principais da modelagem geométrica, onde permite uma implementação robusta e eficiente de mecanismos de refinamento adaptativo, alinhamento de células e acesso as relações de incidência e adjacência entre os elementos da malha, o que é de grande importância na maioria das aplicações em mecânica dos fluidos. No caso de malhas não estruturadas, a não uniformidade da decomposição celular e melhor representada por uma estrategia mais sofisticada, que são as estruturas de dados topológicas. As estruturas de dados topológicas indexam elementos de uma malha representando relações de incidência e adjacência entre elementos, garantindo acesso rápido às informações. Um dos aspectos mais comuns aos problemas tratados pela mecânica dos fluidos computacional é a complexidade da geometria do domínio onde ocorre o escoamento. O uso de estruturas de dados para manipular malhas computacionais e de grande importância pois realiza de modo eficiente as consultas às informações da malha e centraliza todas as operações sobre a malha em um único módulo, possibilitando sua extensão e adaptação em diversas situações. Este trabalho visou explorar o acoplamento de uma estrutura de dados topológica, a Mate Face, em um módulo simulador existente, de modo a gerenciar todos os acessos à malha e dispor operações e iteradores para pesquisas complexas nas vizinhanças de cada elemento na malha. O módulo simulador resolve as equações governantes da mecânica dos fluidos através da técnica de volumes finitos. Foi utilizada uma formulação que atribui os valores das propriedades aos centroides dos volumes de controle, utiliza métodos de alta ordem, os esquemas ENO e WENO, que tem a finalidade de capturar com eficiência descontinuidades presentes em problemas governados por equações diferenciais parciais hiperbólicas. As equações de Euler em duas dimensões representam os escoamentos de interesse no presente trabalho. O acoplamento da estrutura de dados Mate Face ao simulador foi realizada através da criação de uma biblioteca desenvolvida que atua como uma interface de comunicação entre os dois módulos, a estrutura de dados e o simulador, que foram implementados em diferentes linguagens de programação. Deste modo, todas as funcionalidades existentes na Mate Face tornaram-se acessíveis ao simulador na forma de procedimentos. Um estudo sobre malhas dinâmicas foi realizado envolvendo o método das molas para movimentação de malhas simulando-se operações de arfagem. A idéia foi verificar a aplicabilidade deste método para auxiliar simulações de escoamentos não estacionarios. Uma outra vertente do trabalho foi estender a estrutura Mate Face de forma a representar elementos não suportados a priori, de modo a flexibilizar o seu uso em simulações de escoamentos baseados no método de volumes finitos espectrais. O método dos volumes espectrais e utilizado para se obter alta resolução espacial do domínio computacional, que também atribui valores das propriedades aos centroides dos volumes de controle, porém, os volumes de controle são particionados em volumes menores de variadas topologias. Assim, uma extensão da Mate Face foi desenvolvida para representar a nova malha para a aplicação do método, representando-se cada particionamento localmente em cada volume espectral. Para todas as etapas deste trabalho, realizaram-se experimentos que validaram a utilizaação da estrutura de dados Mate Face junto a métodos numéricos. Desta forma, a estrutura pode auxiliar as ferramentas de simulações de escoamentos de fluidos no gerenciamento e acesso à malha computacional / The storage and access of grid files by data structures and topologic operators is one of the most important goals of geometric modeling research field, which allows an efficient and stable implementation of adaptive refinement mechanisms, cells alignment and access to incidence and adjacency properties from grid elements, representing great concernment in the majority of applications from fluid mechanics. In the case of non-structured grids, the cellular decomposition if non-uniform and is better suited by a more sophisticated strategy - the topologic data structs. The topologic data structs index grid elements representing incidence and adjacency properties from grid elements, ensuring quick access to information. One of most common aspects from problems solved by computational fluid mechanic is the complexity of the domain geometry where the fluid ows. The usage of data structures to manipulate computational grids is of great importance because it performs efficiently queries on grid information and centers all operations to the grid on a unique module, allowing its extension and flexible usage on many problems. This work aims at exploring the coupling of a topologic data structure, the Mate Face, on a solver module, by controlling all grid access providing operators and iterators that perform complex neighbor queries at each grid element. The solver module solves the governing equations from fluid mechanics though the finite volume technique with a formulation that sets the property values to the control volume centroids, using high order methods - the ENO and WENO schemes, which have the purpose of efficiently capture the discontinuities appearing in problems governed by hyperbolic conservation laws. The two dimensional Euler equations are considered to represent the flows of interest. The coupling of the Mate Face data structure to the solver module was achieved by a creation of a library that acts as an interface layer between both modules, the Mate Face and the solver, which had been implemented using different programming languages. Therefore, all Mate Face class methods are available to the solver module though the interface library in the form of procedures. A study of dynamic grids was made by using spring methods for the moving grid under pitch movement case. The goal was to analyze the applicability of such method to aid non stationary simulations. Another contribution of this work was to show how the Mate Face can be extended in order to deal with non-supported types of elements, allowing it to aid numeric simulations using the spectral finite volume method. The spectral nite volume method is used to obtain high spatial resolution, also by setting the property values to the control volume centroids, but here the control volumes are partitioned into smaller volumes of different types, from triangles to hexagons. Then, an extension of the Mate Face was developed in order to hold the new generated grid by the partitioning specfied by the spectral finite volume method. The extension of Mate Face represents all partitioned elements locally for each original control volume. For all implementations and proposals from this work, experiments were performed to validate the usage of the Mate Face along with numeric methods. Finally, the data structure can aid the fluid flow simulation tools by managing the grid file and providing efficient query operators
|
149 |
Summarizing static graphs and mining dynamic graphs. / CUHK electronic theses & dissertations collectionJanuary 2011 (has links)
Besides finding changing areas based on the number of node and edge evolutions, a more interesting problem is to analyze the impact of these evolutions to graphs and find the regions that exhibit significant changes when these evolutions happen. The more different the relationship between nodes in a certain region is, the more significant this region is. This problem is challenging since it is hard to define the range of changing regions that is closely related to actual evolutions. We formalize the problem by using a similarity measure based on neighborhood random walks, and design an efficient algorithm which is able to identify the significant changing regions without recomputing all similarities. Meaningful examples in experiments demonstrate the effectiveness of our algorithms. / Graph patterns are able to represent the complex structural relations among objects in many applications in various domains. Managing and mining graph data, on which we study in this thesis, are no doubt among the most important tasks. We focus on two challenging problems, namely, graph summarization and graph change detection. / In the area of summarizing a collection of graphs, we study the problem of summarizing frequent subgraphs, since it is not much necessary to summarize a collection of random graphs. The bottleneck for exploring and understanding frequent subgraphs is that they are numerous. A summary can be a solution to this issue, so the goal of frequent subgraph summarization is to minimize the restoration errors of the structure and the frequency information. The unique challenge in frequent subgraph summarization comes from the fact that a subgraph can have multiple embeddings in a summarization template graph. We handle this issue by introducing a partial order between edges to allow accurate structure and frequency estimation based on an independence probabilistic model. The proposed algorithm discovers k summarization templates in a top-down fashion to control the restoration error of frequencies within sigma. There is no restoration error of structures. Experiments on both real and synthetic graph datasets show that our framework can control the frequency restoration error within 10% by a compact summarization model. / The objective of graph change detection is to discover the changing areas on graphs when they evolves at a high speed. The most changing areas are those areas having the highest number of evolutions (additions/deletions) of nodes and edges, which is called burst areas. We study on finding the most burst areas in a stream of fast graph evolutions. We propose to use Haar wavelet tree to monitor the upper bound of the number of evolutions. Our approach monitors all potential changing areas of different sizes and computes incrementally the number of evolutions in those areas. The top-k burst areas are returned as soon as they are detected. Our solution is capable of handling a large amount of evolutions in a short time, which is consistent to the experimental results. / The objective of graph summarization is to obtain a concise representation of a single large graph or a collection of graphs, which is interpretable and suitable for analysis. A good summary can reveal the hidden relationships between nodes in a graph. The key issue of summarizing a single graph is how to construct a high-quality and representative summary, which is in the form of a super-graph. We propose an entropy-based unified model for measuring the homogeneity of the super-graph. The best summary in terms of homogeneity could be too large to explore. By using the unified model, we relax three summarization criteria to obtain an approximate homogeneous summary of appropriate size. We propose both agglomerative and divisive algorithms for approximate summarization, as well as pruning techniques and heuristics for both algorithms to save computation cost. Experimental results confirm that our approaches can efficiently generate high-quality summaries. / Liu, Zheng. / Advisers: Wai Lam; Jeffrey Xu Yu. / Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 133-141). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
150 |
Superseding neighbor search on uncertain data. / 在不確定的空間數據庫中尋找最高取代性的最近鄰 / Zai bu que ding de kong jian shu ju ku zhong xun zhao zui gao qu dai xing de zui jin linJanuary 2009 (has links)
Yuen, Sze Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves [44]-46). / Abstract also in Chinese. / Thesis Committee --- p.i / Abstract --- p.ii / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 2 --- Related Work --- p.6 / Chapter 2.1 --- Nearest Neighbor Search on Precise Data --- p.6 / Chapter 2.2 --- NN Search on Uncertain Data --- p.8 / Chapter 3 --- Problem Definitions and Basic Characteristics --- p.11 / Chapter 4 --- The Full-Graph Approach --- p.16 / Chapter 5 --- The Pipeline Approach --- p.19 / Chapter 5.1 --- The Algorithm --- p.20 / Chapter 5.2 --- Edge Phase --- p.24 / Chapter 5.3 --- Pruning Phase --- p.27 / Chapter 5.4 --- Validating Phase --- p.28 / Chapter 5.5 --- Discussion --- p.29 / Chapter 6 --- Extension --- p.31 / Chapter 7 --- Experiment --- p.34 / Chapter 7.1 --- Properties of the SNN-core --- p.34 / Chapter 7.2 --- Efficiency of Our Algorithms --- p.38 / Chapter 8 --- Conclusions and Future Work --- p.42 / Chapter A --- List of Publications --- p.43 / Bibliography --- p.44
|
Page generated in 0.1119 seconds