  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Correlated Sample Synopsis on Big Data

Wilson, David S. 12 December 2018 (has links)
No description available.
2

Geo-distributed multi-layer stream aggregation

Cannalire, Pietro January 2018 (has links)
Standard processing architectures satisfy many applications by employing existing stream processing frameworks that manage distributed data processing. In some specific cases, however, geographically distributed data sources require the processing itself to be distributed over a large area by employing a geographically distributed architecture. The issue addressed in this work is the reduction of data movement across the network, which continuously flows in a geo-distributed architecture from streaming sources to the processing location and among processing entities within the same distributed cluster. Reducing data movement can be critical for decreasing bandwidth costs, since accessing links placed in the middle of the network can be costly, and these costs grow as the amount of data exchanged increases. In this work we propose a different way to deploy geographically distributed architectures, relying on Apache Spark Structured Streaming and Apache Kafka. The features needed for an algorithm to run on a geo-distributed architecture are provided. The algorithms executed on this architecture apply windowing and data-synopsis techniques to produce summaries of the input data and to address the issues of the geographically distributed architecture. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis contributes a new model for building geographically distributed architectures. The experimental results show that, for the algorithms running on top of the geo-distributed architecture, computation time is reduced on average by 70% compared to the distributed setup, and the amount of data exchanged across the network is reduced on average by 99%.
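The Misra-Gries algorithm named in the abstract is a classic one-pass data synopsis for finding frequent items in a stream. A minimal standalone sketch (independent of the thesis's Spark/Kafka implementation) might look like:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: maintains at most k-1 candidate counters.

    Any item occurring more than len(stream)/k times is guaranteed
    to survive in the returned counters (with an undercounted tally).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # All slots taken: decrement every counter, drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 3))  # "a" (5 of 9 occurrences) is guaranteed to survive
```

Because the summary holds only k-1 counters regardless of stream length, shipping it instead of the raw stream is what enables the large reduction in network traffic the abstract reports.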
3

Sample Footprints für Data-Warehouse-Datenbanken

Rösch, Philipp, Lehner, Wolfgang 20 January 2023 (has links)
With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing.
In this scenario, Linked Bernoulli Synopses provide memory-efficient schema-level synopses, i.e., synopses that consist of random samples of each table in the schema with minimal overhead for retaining foreign-key integrity within the synopsis. This enables efficient support for the approximate answering of queries with arbitrary foreign-key joins. In this article, we focus on the application of Linked Bernoulli Synopses in data warehouse environments. On the one hand, we analyze the instantiation of memory-bounded synopses. Among others, we address the following questions: How can the given space be partitioned among the individual samples? What is the impact on the overhead? On the other hand, we consider further adaptations of Linked Bernoulli Synopses for use in data warehouse databases. We show how synopses can be kept up to date incrementally when the underlying data changes. Further, we suggest additional outlier-handling methods to reduce the estimation error of approximate answers to aggregation queries with foreign-key joins. With a variety of experiments, we show that Linked Bernoulli Synopses and the proposed techniques have great potential in the context of data warehouse databases.
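The core idea behind a schema-level synopsis with foreign-key integrity can be illustrated with a much-simplified sketch: Bernoulli-sample a fact table, then include every dimension row the sample references, so that foreign-key joins inside the synopsis never dangle. The table and column names below are hypothetical, and the real Linked Bernoulli Synopses scheme additionally correlates the per-row coin flips to keep the extra dimension rows to a minimum:

```python
import random

def schema_level_synopsis(fact_rows, dim_rows, p, seed=42):
    """Simplified schema-level synopsis (not the full LBS scheme):
    Bernoulli-sample the fact table with rate p, then pull in each
    dimension row referenced by a sampled fact row so that every
    foreign key resolves within the synopsis itself.
    """
    rng = random.Random(seed)
    fact_sample = [row for row in fact_rows if rng.random() < p]
    referenced = {row["dim_id"] for row in fact_sample}
    dim_sample = [row for row in dim_rows if row["id"] in referenced]
    return fact_sample, dim_sample

# Hypothetical star-schema toy data: 1000 sales referencing 3 regions.
facts = [{"sale": i, "dim_id": i % 3} for i in range(1000)]
dims = [{"id": d, "region": f"R{d}"} for d in range(3)]
fact_sample, dim_sample = schema_level_synopsis(facts, dims, p=0.1)

# Every foreign key in the sample resolves within the synopsis.
assert all(any(d["id"] == f["dim_id"] for d in dim_sample) for f in fact_sample)
```

Queries with foreign-key joins can then be answered approximately against the synopsis alone, which is what makes the memory-bounded partitioning question in the article relevant: each table's sample competes for the same fixed space budget.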
4

[pt] CONTRIBUIÇÕES AO PROBLEMA DE BUSCA POR PALAVRAS-CHAVE EM CONJUNTOS DE DADOS E TRAJETÓRIAS SEMÂNTICAS BASEADOS NO RESOURCE DESCRIPTION FRAMEWORK / [en] CONTRIBUTIONS TO THE PROBLEM OF KEYWORD SEARCH OVER DATASETS AND SEMANTIC TRAJECTORIES BASED ON THE RESOURCE DESCRIPTION FRAMEWORK

YENIER TORRES IZQUIERDO 18 May 2021 (has links)
Keyword search provides an easy-to-use interface for retrieving information. This thesis contributes to the problems of keyword search over schema-less datasets and semantic trajectories based on RDF. To address the keyword search over schema-less RDF datasets problem, this thesis introduces an algorithm to automatically translate a user-specified keyword-based query K into a SPARQL query Q so that the answers Q returns are also answers for K. The algorithm does not rely on an RDF schema, but it synthesizes SPARQL queries by exploring the similarity between the property domains and ranges, and the class instance sets observed in the RDF dataset.
It estimates set similarity based on set synopses, which can be efficiently precomputed in a single pass over the RDF dataset. The thesis includes two sets of experiments with an implementation of the algorithm. The first set of experiments shows that the implementation outperforms a baseline RDF keyword search tool that explores the RDF schema to synthesize SPARQL queries, while the second set indicates that the implementation performs better than the state-of-the-art TSA+BM25 and TSA+VDP keyword search systems over RDF datasets based on the virtual-documents approach. Finally, the thesis also computes the effectiveness of the proposed algorithm using a metric based on the concept of graph relevance. The second problem addressed in this thesis is keyword search over RDF semantic trajectories. Stop-and-move semantic trajectories are segmented trajectories whose stops and moves are semantically enriched with additional data. A query language for semantic trajectory datasets has to include selectors for stops or moves based on their enrichments, and sequence expressions that define how to match the results of the selectors against the sequence the semantic trajectory defines. The thesis first proposes a formal framework to define semantic trajectories and introduces stop-and-move sequence expressions, with well-defined syntax and semantics, which act as an expressive query language for semantic trajectories. Then, it describes a concrete semantic trajectory model in RDF, defines SPARQL stop-and-move sequence expressions, and discusses strategies to compile such expressions into SPARQL queries. Next, the thesis specifies user-friendly keyword search expressions over semantic trajectories based on the use of keywords to specify stop and move queries, and the adoption of terms with predefined semantics to compose sequence expressions. It then shows how to compile such keyword search expressions into SPARQL queries.
Finally, it provides a proof-of-concept experiment over a semantic trajectory dataset constructed with user-generated content from Flickr, combined with Wikipedia data.
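The single-pass set synopses used above for similarity estimation can be illustrated with MinHash, a standard synopsis for estimating Jaccard similarity between sets; note the abstract does not specify which synopsis the thesis actually uses, so this is an illustrative stand-in:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """Build a MinHash synopsis of a set in one pass: for each of
    num_hashes seeded hash functions, keep only the minimum value.
    """
    sig = [float("inf")] * num_hashes
    for item in items:
        for i in range(num_hashes):
            h = int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16)
            sig[i] = min(sig[i], h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minima estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature(range(100))
sig_b = minhash_signature(range(50, 150))
# Estimate of the true Jaccard similarity 50/150 ≈ 0.33
print(estimated_jaccard(sig_a, sig_b))
```

Comparing two fixed-size signatures is far cheaper than intersecting the underlying instance sets, which is why such synopses let the query translator score candidate property/class matches without rescanning the RDF dataset.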
