Global ETD Search

1	Plug-and-Play Web Services Jain, Arihant 01 December 2019 The goal of this research is to make it easier to design and create web services for relational databases. A web service is a software service for providing data over computer networks. Web services provide data endpoints for many web applications. We adopt a plug-and-play approach for web service creation whereby a designer constructs a “plug,” which is a simple specification of the output produced by the service. If the plug can be “played” on the database then the web service is generated. Our plug-and-play approach has three advantages. First, a plug is portable. You can take the plug to any data source and generate a web service. Second, a plug-and-play service is more reliable. The web service generation checks the database to determine if the service can be safely and correctly generated. Third, plug-and-play web services are easier to code for complex data since a service designer can write a simple plug, abstracting away the data’s real complexity. We describe how to build a system for plug-and-play web services, and experimentally evaluate the system. The software produced by this research will make life easier for web service designers. Web Services JSON Relational Databases Joins Computer Sciences
2	ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce Lakshminarayanan, Mahalakshmi January 2013 No description available. Computer Science Engineering
3	Performance Characterization and Improvements of SQL-On-Hadoop Systems Kulkarni, Kunal Vikas 28 December 2016 No description available. Computer Science Hadoop SQL Impala Hive Big Data Joins HDFS
4	General dynamic Yannakakis: Conjunctive queries with theta joins under updates Idris, Muhammad, Ugarte, Martín, Vansummeren, Stijn, Voigt, Hannes, Lehner, Wolfgang 17 July 2023 The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In prior work, we have proposed general dynamic Yannakakis (GDYN), a general framework for dynamically processing acyclic conjunctive queries with θ-joins in the presence of data updates. Whereas traditional approaches face a trade-off between materialization of subresults (to avoid inefficient recomputation) and recomputation of subresults (to avoid the potentially large space overhead of materialization), GDYN is able to avoid this trade-off. It intelligently maintains a succinct data structure that supports efficient maintenance under updates and from which the full query result can quickly be enumerated. In this paper, we consolidate and extend the development of GDYN. First, we give a full formal proof of GDYN’s correctness and complexity. Second, we present a novel algorithm for computing GDYN query plans. Finally, we instantiate GDYN to the case where all θ-joins are inequalities and present an extended experimental comparison against state-of-the-art engines. Our approach performs consistently better than the competitor systems with multiple orders of magnitude improvements in both time and memory consumption. info:eu-repo/classification/ddc/004 ddc:004
5	Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates Idris, Muhammad 05 March 2019 Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect patterns, trends, and anomalies. These kinds of solutions find applications in Financial Systems, Industrial Control Systems, Business Intelligence and on-line Machine Learning among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, where the main task then is to maintain query results under frequent updates efficiently. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis where the data is refreshed periodically and in batches, and stream processing solutions process streams of data from transient sources as flows of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation), however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems. In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation of queries that are based on the relational incremental view maintenance model and mostly focus on queries that feature equi-joins.
Streaming systems, in contrast, generally follow automaton-based models to evaluate queries under updates, and they generally process queries that mostly feature comparisons of temporal attributes (e.g. timestamp attributes) along with comparisons of non-temporal attributes over streams of bounded sizes. Temporal comparisons constitute inequality constraints while non-temporal comparisons can either be equality or inequality constraints. Hence these systems mostly process inequality joins. As a starting point for our research, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation just like in BI systems in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and the update processing cost. To this end, the existing approaches of dynamic evaluation in both kinds of systems present a trade-off between memory footprint and the update processing cost. More specifically, systems that avoid materialization of query (sub)results incur high update latency and systems that materialize (sub)results incur high memory footprint. We are interested in investigating the possibility of building a model that can address this trade-off. In particular, we overcome this trade-off by investigating the possibility of a practical dynamic evaluation algorithm for queries that appear in both kinds of systems and present a main-memory data representation that allows query (sub)results to be enumerated without materialization and can be maintained efficiently under updates.
We call this representation the Dynamic Constant Delay Linear Representation (DCLRs). We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded-delay (and with constant delay for a sub-class of queries), 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only), 3) they take space linear in the size of the database, 4) they can be maintained efficiently under updates. We first study the DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present the dynamic evaluation algorithm called the Dynamic Yannakakis (DYN) algorithm. Then, we present the generalization of the DYN algorithm to the class of acyclic queries featuring multi-way Theta-joins with projections and call it Generalized DYN (GDYN). We devise DCLRs with the above properties for acyclic conjunctive queries, and the working of DYN and GDYN over DCLRs is based on a particular variant of join trees, called Generalized Join Trees (GJTs), that guarantees the above-described properties of DCLRs. We define GJTs and present algorithms to test a conjunctive query featuring Theta-joins for acyclicity and to generate GJTs for such queries. We extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to testing a conjunctive query featuring multi-way Theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic. GDYN is hence a unified framework based on DCLRs that enables processing of queries that appear in streaming systems as well as in BI systems in a unified main-memory model and addresses the space-time trade-off. We instantiate GDYN to the particular case where all Theta-joins involve only equalities and inequalities and call this instantiation IEDYN.
We implement DYN and IEDYN as query compilers that generate executable programs in the Scala programming language and provide all the necessary data structures and their maintenance and enumeration methods in a continuous stream processing model. We evaluate DYN and IEDYN against state-of-the-art BI and streaming systems on both industrial and synthetically generated benchmarks. We show that DYN and IEDYN outperform the existing systems by over an order of magnitude in both memory footprint and update processing time. / Doctorat en Sciences de l'ingénieur et technologie / info:eu-repo/semantics/nonPublished Informatique générale Business Intelligence Databases Data Warehouse Query Processing Query Execution Real-time Analytics Stream Processing Complex Event Processing Information Flow Processing Joins Join Trees Main-Memory System Inequality Joins Theta Joins Analytical Processing Query Language Acyclic Joins Join Algorithms Acyclicity
6	Applications of Circulations and Removable Pairings to TSP and 2ECSS Fu, Yao 08 May 2014 In this thesis we focus on two NP-hard and intensively studied problems: The travelling salesman problem (TSP), which aims to find a minimum cost tour that visits every node exactly once in a complete weighted graph, and the 2-edge-connected spanning subgraph problem (2ECSS), which aims to find a minimum size 2-edge-connected spanning subgraph in a given graph. TSP and 2ECSS have many real world applications. However, both problems are NP-hard, which means it is unlikely that polynomial time algorithms exist to solve them, so methods that return close to optimal solutions are sought. In this thesis we mainly focus on k-approximation algorithms for the two problems, which efficiently return a solution within a factor of k of the optimal solution. For a special case of TSP called graph TSP, using ideas from Mömke and Svensson, we present a 25/18-approximation algorithm for a special class of graphs using circulations and T-joins, which improves on the previously best known bound of 7/5 for such graphs. Moreover, if the graph does not contain special nodes, our algorithm ensures the ratio of 4/3. For 2ECSS, given any k-edge-connected graph G=(V,E), |V|=n, |E|=m, we present an approximation algorithm that gives a 2-edge-connected spanning subgraph with the number of edges at most n+(m-n)/(k-1)-(k-2)/(k-1) with a novel use of circulations, which improves both the approximation ratio and the simplicity of the proof compared to a result by Huh in 2004. travelling salesman problem approximation algorithm circulations T-joins integrality gap
7	On Applying Methods for Graph-TSP to Metric TSP Desjardins, Nicholas January 2016 The Metric Travelling Salesman Problem, henceforth metric TSP, is a fundamental problem in combinatorial optimization which consists of finding a minimum cost Hamiltonian cycle (also called a TSP tour) in a weighted complete graph in which the costs are metric. Metric TSP is known to belong to a class of problems called NP-hard even in the special case of graph-TSP, where the metric costs are based on a given graph. Thus, it is highly unlikely that efficient methods exist for solving large instances of these problems exactly. In this thesis, we develop a new heuristic for metric TSP based on extending ideas successfully used by Mömke and Svensson for the special case of graph-TSP to the more general case of metric TSP. We demonstrate the efficiency and usefulness of our heuristic through empirical testing. Additionally, we turn our attention to graph-TSP. For this special case of metric TSP, there has been much recent progress with regard to improvements on the cost of the solutions. We find the exact value of the ratio between the cost of the optimal TSP tour and the cost of the optimal subtour linear programming relaxation for small instances of graph-TSP, which was previously unknown. We also provide a simplified algorithm for special graph-TSP instances based on the subtour linear programming relaxation. travelling salesman problem integrality gap T-joins approximation algorithm heuristic metric travelling salesman problem
8	Applications of Circulations and Removable Pairings to TSP and 2ECSS Fu, Yao January 2014 In this thesis we focus on two NP-hard and intensively studied problems: The travelling salesman problem (TSP), which aims to find a minimum cost tour that visits every node exactly once in a complete weighted graph, and the 2-edge-connected spanning subgraph problem (2ECSS), which aims to find a minimum size 2-edge-connected spanning subgraph in a given graph. TSP and 2ECSS have many real world applications. However, both problems are NP-hard, which means it is unlikely that polynomial time algorithms exist to solve them, so methods that return close to optimal solutions are sought. In this thesis we mainly focus on k-approximation algorithms for the two problems, which efficiently return a solution within a factor of k of the optimal solution. For a special case of TSP called graph TSP, using ideas from Mömke and Svensson, we present a 25/18-approximation algorithm for a special class of graphs using circulations and T-joins, which improves on the previously best known bound of 7/5 for such graphs. Moreover, if the graph does not contain special nodes, our algorithm ensures the ratio of 4/3. For 2ECSS, given any k-edge-connected graph G=(V,E), |V|=n, |E|=m, we present an approximation algorithm that gives a 2-edge-connected spanning subgraph with the number of edges at most n+(m-n)/(k-1)-(k-2)/(k-1) with a novel use of circulations, which improves both the approximation ratio and the simplicity of the proof compared to a result by Huh in 2004. travelling salesman problem approximation algorithm circulations T-joins integrality gap
9	Scalable algorithms for monitoring activity traces / Algorithmes pour le monitoring de traces d'activité à grande échelle Pilourdault, Julien 28 September 2017 Dans cette thèse, nous étudions des algorithmes pour le monitoring des traces d’activité à grande échelle. Le monitoring est une aptitude clé dans plusieurs domaines, permettant d’extraire de la valeur des données ou d’améliorer les performances d’un système. Nous explorons d’abord le monitoring de données temporelles. Nous présentons un nouveau type de jointure sur des intervalles, qui inclut des fonctions de score caractérisant le degré de satisfaction de prédicats temporels. Nous étudions ces jointures dans le contexte du batch processing (traitement par lots). Nous formalisons la Ranked Temporal Join (RTJ), une jointure qui combine des collections d’intervalles et retourne les k meilleurs résultats. Nous montrons comment exploiter les propriétés des prédicats temporels et de la sémantique de score associée afin de concevoir TKIJ, une méthode d’évaluation de requête distribuée basée sur Map-Reduce. Nos expériences sur des données synthétiques et réelles montrent que TKIJ est plus performant que les techniques de l’état de l’art et démontre de bonnes performances sur des requêtes RTJ n-aires sur des données temporelles. Nous proposons également une étude préliminaire afin d’étendre nos travaux sur TKIJ au domaine du stream processing (traitement de flots). Nous explorons ensuite le monitoring dans le crowdsourcing (production participative). Nous soutenons la nécessité d’intégrer la motivation des travailleurs dans le processus d’affectation des tâches. Nous proposons d’étudier une approche adaptative, qui évalue la motivation des travailleurs lors de l’exécution des tâches et l’exploite afin d’améliorer l’affectation de tâches qui est réalisée de manière itérative.
Nous explorons une première variante nommée Individual Task Assignment (Ita), dans laquelle les tâches sont affectées individuellement, un travailleur à la fois. Nous modélisons Ita et montrons que ce problème est NP-Difficile. Nous proposons trois méthodes d’affectation de tâches qui poursuivent différents objectifs. Nos expériences en ligne étudient l’impact de chaque méthode sur la performance globale dans l’exécution de tâches. Nous observons que différentes stratégies sont dominantes sur les différentes dimensions de performance. En particulier, la méthode affectant des tâches aléatoires et correspondant aux intérêts d’un travailleur donne le meilleur flux d’exécution de tâches. La méthode affectant des tâches correspondant au compromis d’un travailleur entre diversité et niveau de rémunération des tâches donne le meilleur niveau de qualité. Nos expériences confirment l’utilité d’une affectation de tâches adaptative et tenant compte de la motivation. Nous étudions une deuxième variante nommée Holistic Task Assignment (Hta), où les tâches sont affectées à tous les travailleurs disponibles, de manière holistique. Nous modélisons Hta et montrons que ce problème est NP-Difficile et MaxSNP-Difficile. Nous développons des algorithmes d’approximation pour Hta. Nous menons des expériences sur des données synthétiques pour évaluer l’efficacité de nos algorithmes. Nous conduisons également des expériences en ligne et comparons notre approche avec d’autres stratégies non adaptatives. Nous observons que notre approche présente le meilleur compromis sur les différentes dimensions de performance. / In this thesis, we study scalable algorithms for monitoring activity traces. In several domains, monitoring is a key ability to extract value from data and improve a system. This thesis aims to design algorithms for monitoring two kinds of activity traces. First, we investigate temporal data monitoring. 
We introduce a new kind of interval join that features scoring functions reflecting the degree of satisfaction of temporal predicates. We study these joins in the context of batch processing: we formalize Ranked Temporal Joins (RTJ), which combine collections of intervals and return the k best results. We show how to exploit the nature of temporal predicates and the properties of their associated scored semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data. We also propose a preliminary study to extend our work on TKIJ to stream processing. Second, we investigate monitoring in crowdsourcing. We advocate the need to incorporate motivation in task assignment. We propose to study an adaptive approach that captures workers’ motivation during task completion and uses it to revise task assignment accordingly across iterations. We study two variants of motivation-aware task assignment: Individual Task Assignment (Ita) and Holistic Task Assignment (Hta). First, we investigate Ita, where we assign tasks to workers individually, one worker at a time. We model Ita and show it is NP-Hard. We design three task assignment strategies that exploit various objectives. Our live experiments study the impact of each strategy on overall performance. We find that different strategies prevail for different performance dimensions. In particular, the strategy that assigns random and relevant tasks offers the best task throughput and the strategy that assigns tasks that best match a worker’s compromise between task diversity and task payment has the best outcome quality. Our experiments confirm the need for adaptive motivation-aware task assignment. Then, we study Hta, where we assign tasks to all available workers, holistically.
We model Hta and show it is both NP-Hard and MaxSNP-Hard. We develop efficient approximation algorithms with provable guarantees. We conduct offline experiments to verify the efficiency of our algorithms. We also conduct online experiments with real workers and compare our approach with various non-adaptive assignment strategies. We find that our approach offers the best compromise between performance dimensions thereby assessing the need for adaptability. Monitoring Données temporelles Traitement distribué Jointures Crowdsourcing Affectation de tâches Monitoring Temporal Data Distributed Processing Joins Crowdsourcing Task Assignment 004
10	Parallélisme et équilibrage de charges dans le traitement de la jointure sur des architectures distribuées / Parallelism and load balancing in the treatment of the join on distributed architectures Al Hajj Hassan, Mohamad 16 December 2009 L’émergence des applications de bases de données dans les domaines tels que le data warehousing, le data mining et l’aide à la décision qui font généralement appel à de très grands volumes de données rend la parallélisation des algorithmes des jointures nécessaire pour avoir un temps de réponse acceptable. Une accélération linéaire est l’objectif principal des algorithmes parallèles, cependant dans les applications réelles, elle est difficilement atteignable : ceci est dû généralement d’une part aux coûts de communications inhérents aux systèmes multi-processeurs et d’autre part au déséquilibre des charges des différents processeurs. En plus, dans un environnement hétérogène multi-utilisateur, la charge des différents processeurs peut varier de manière dynamique et imprévisible. Dans le cadre de cette thèse, nous nous intéressons au traitement de la jointure et de la multi-jointure sur les architectures distribuées hétérogènes, les grilles de calcul et les systèmes de fichiers distribués. Nous avons proposé une variété d’algorithmes, basés sur l’utilisation des histogrammes distribués, pour traiter de manière efficace le déséquilibre des données, tout en garantissant un équilibrage presque parfait de la charge des différents processeurs même dans un environnement hétérogène et multi-utilisateur. Ces algorithmes sont basés sur une approche dynamique de redistribution des données permettant de réduire les coûts de communication à un minimum tout en traitant de manière très efficace le problème de déséquilibre des valeurs de l’attribut de jointure. L’analyse de complexité de nos algorithmes et les résultats expérimentaux obtenus montrent que ces algorithmes possèdent une accélération presque linéaire.
/ The appeal of parallel processing becomes very strong in applications which require ever higher performance, particularly in applications such as data warehousing, decision support, On-Line Analytical Processing (OLAP) and, more generally, DBMS. A linear speed-up is the main objective of parallel algorithms. However, in real applications, this objective is hard to reach due to the high communication cost in parallel and distributed systems and to the possible skew in the load of different processors. In addition, on heterogeneous multi-user architectures, the load of each processor may vary highly in a dynamic and unpredictable way. In this thesis, we are interested in treating join and multi-join queries on distributed multi-user heterogeneous systems, grid systems and distributed file systems. We have proposed several algorithms based on distributed histograms. These algorithms are based on a dynamic data distribution and task allocation, which makes them insensitive to data skew and ensures perfect balancing properties during all stages of join computation, even in heterogeneous multi-user environments. The complexity analysis of our algorithms and the experimental results show that they have a near-linear speedup. Jointures parallèles Multi-jointure Déséquilibre des données Equilibrage dynamique de charges Parallel joins Multi-join Data skew Dynamic load balancing
