11

Temporal Join Processing with Hilbert Curve Space Mapping

Raigoza, Jaime Antonio 01 January 2013 (has links)
Managing data with a time dimension increases the overhead of storage and query processing in large database applications, especially for the join operation, a commonly used and expensive relational operator whose cost depends on the size of the input relations. Index-based approaches have been shown to improve join processing and, in turn, the performance of querying historical data. Temporal data consist of tuples associated with time-interval values whose valid life spans have different lengths. In temporal join processing, tuples with longer life spans tend to overlap more joining tuples and are therefore likely to be accessed more often. The efficient execution of temporal joins over index-clustered data is the central topic of this work. Intervals with an extended data range make join evaluation harder because temporal data are intrinsically multidimensional. Some temporal join processing methods duplicate tuples with long life spans to cluster similar data, which improves performance on the tuples that tend to be accessed most frequently.

The proposed Hilbert-Temporal Join (Hilbert-TJ) algorithm avoids the need for data duplication by mapping temporal data into Hilbert curve space, which is inherently clustered, thus allowing fast retrieval and storage. A balanced B+-tree index structure was implemented to manage and query the data. The query method identifies data pages containing matching tuples that intersect a multidimensional region. Because data pages consist of contiguously mapped points on the curve, the query process traverses along the curve, iteratively partitioning the data space to determine the next page that intersects the query region. The proposed Adaptive Replacement Cache-Temporal Data (ARC-TD) buffer replacement policy builds upon the Adaptive Replacement Cache (ARC) policy by favoring cache retention of data pages in proportion to the average life span of the tuples in the buffer. Giving preference to tuples with long life spans produced a higher cache hit ratio, while the caching priority remains balanced between recently and frequently accessed data.

An evaluation study compared the proposed Hilbert-TJ algorithm against nested-loop, sort-merge, and partition-based join algorithms that use a multiversion B+-tree (MVBT) index. The metrics compare the processing time (disk I/O time plus CPU time), cache hit ratio, and index storage size needed to perform the temporal join, under the Least Recently Used (LRU), Least Frequently Used (LFU), ARC, and new ARC-TD buffer replacement policies. Under the given conditions, reducing data redundancy and accounting for the longevity of frequently accessed temporal data achieved better performance. Additionally, the Hilbert-TJ algorithm supports both valid-time and transaction-time data.
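Since the curve mapping carries the main idea, a minimal sketch may help. It uses the classic xy2d Hilbert-index algorithm plus a hypothetical interval_to_hilbert helper; the dissertation's actual mapping and page layout are not reproduced here.

    def rot(n, x, y, rx, ry):
        """Rotate/flip a quadrant so the curve orientation is preserved."""
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        return x, y

    def xy2d(n, x, y):
        """Distance of point (x, y) along the Hilbert curve filling an
        n-by-n grid (n a power of two); the standard xy2d algorithm."""
        d, s = 0, n // 2
        while s > 0:
            rx = 1 if x & s else 0
            ry = 1 if y & s else 0
            d += s * s * ((3 * rx) ^ ry)
            x, y = rot(n, x, y, rx, ry)
            s //= 2
        return d

    def interval_to_hilbert(start, end, order=16):
        """Hypothetical helper: view a tuple's valid-time interval as the
        2D point (start, end) and return its curve position. Tuples with
        similar intervals map to nearby positions, so sorting by this
        value clusters them onto the same data pages without duplication."""
        n = 1 << order        # timestamps are assumed to fit in [0, n)
        return xy2d(n, start, end)

    # Overlapping intervals land close together on the curve:
    print(interval_to_hilbert(10, 95), interval_to_hilbert(11, 90))

A B+-tree keyed on this curve position can then serve the clustered page lookups that the query method described above traverses.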
12

Some operators on cordial graphs

洪國銘 Unknown Date (has links)
Thesis abstract: Graph theory, a popular branch of discrete mathematics, studies problems in terms of graphs. The graceful-graph problem has a long history and is widely discussed in scientific applications such as the reception of space signals and the siting of radar stations. We attempt to extend cordial graphs, a necessary condition for gracefulness, by using operators as a framework: taking graph classes as operands, the combination {(operands), operator} yields a new graph, and we want the new graph to be cordial. In this way, more complex and practical cordial graphs can be constructed; only a cordial graph can possibly be graceful. At the start of this research we collected papers on graceful graphs to learn about labeling methods, unsolved cases, and special graphs derived from graceful graphs, and we organized them into twelve major classes. For these classes, good results are known in simple cases, but once the graphs become slightly more complex or the conditions are relaxed, the outcome is unknown. Motivated by the importance and popularity of graceful graphs, we try to combine known graphs with graph operators to obtain complex graphs of new classes, and to show that the resulting graphs are cordial. In this thesis we study operators such as link, corona, join, bridge, and newcorona, and derive some results. / Abstract: Suppose G is a graph with vertex set V(G) and edge set E(G). Consider a labeling f: V(G) → {0, 1}, which induces an edge labeling f*: E(G) → {0, 1} defined by f*(uv) = |f(u) − f(v)| for each edge uv ∈ E(G). Let V_f(i) be the set of vertices v of G with f(v) = i, and E_f(i) the set of edges uv with f*(uv) = i; their cardinalities are denoted v_f(i) and e_f(i), respectively. A labeling f of a graph is cordial if |v_f(0) − v_f(1)| ≤ 1 and |e_f(0) − e_f(1)| ≤ 1. A graph G is cordial if it has a cordial labeling. In this paper, we study some operators, such as link, corona, join, bridge, and newcorona, and derive some results on cordial graphs.
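To make the definition concrete, here is a small Python check (a hypothetical helper written for this summary, not code from the thesis) that verifies whether a 0/1 vertex labeling of a graph is cordial:

    def is_cordial_labeling(vertices, edges, f):
        """Check the cordial condition: with the induced edge labeling
        f*(uv) = |f(u) - f(v)|, require |v_f(0) - v_f(1)| <= 1 and
        |e_f(0) - e_f(1)| <= 1."""
        v1 = sum(f[v] for v in vertices)              # vertices labeled 1
        v0 = len(vertices) - v1
        e1 = sum(abs(f[u] - f[v]) for u, v in edges)  # edges labeled 1
        e0 = len(edges) - e1
        return abs(v0 - v1) <= 1 and abs(e0 - e1) <= 1

    # The path u - v - w with labeling (0, 1, 1) is cordial:
    print(is_cordial_labeling(["u", "v", "w"],
                              [("u", "v"), ("v", "w")],
                              {"u": 0, "v": 1, "w": 1}))   # True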
13

The Trouble with Diversity: Fork-Join Networks with Heterogeneous Customer Population

Nguyen, Viên 10 1900 (has links)
Consider a feedforward network of single-server stations populated by multiple job types. Each job requires the completion of a number of tasks whose order of execution is determined by a set of deterministic precedence constraints. The precedence requirements allow some tasks to be done in parallel (in which case tasks "fork") and require that others be processed sequentially (where tasks "join"). Jobs of a given type share the same precedence constraints, interarrival time distributions, and service time distributions, but these characteristics may vary across different job types. We show that the heavy traffic limit of certain processes associated with heterogeneous fork-join networks can be expressed as a semimartingale reflected Brownian motion with polyhedral state space. The polyhedral region typically has many more faces than its dimension, and the description of the state space becomes quite complicated in this setting. One can interpret the proliferation of additional faces in heterogeneous fork-join networks as (i) articulations of the fork and join constraints, and (ii) results of the disordering effects that occur when jobs fork and join in their sojourns through the network.
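As a purely illustrative aside (the paper analyzes the stochastic heavy-traffic behavior of such networks, not code), the fork and join mechanics can be sketched with explicit precedence constraints:

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical job type: task "a" forks into "b" and "c", which join
    # at "d" (precedence: a before b and c; b and c before d).
    precedence = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

    def run_job(work):
        """Run each task once its predecessors finish; independent tasks
        (here "b" and "c") execute in parallel and join before "d"."""
        done, remaining = {}, dict(precedence)
        with ThreadPoolExecutor() as pool:
            while remaining:
                ready = [t for t, preds in remaining.items()
                         if all(p in done for p in preds)]
                futures = {t: pool.submit(work, t) for t in ready}
                for t, fut in futures.items():
                    done[t] = fut.result()    # join point: wait for forks
                    del remaining[t]
        return done

    print(run_job(lambda t: f"task {t} done"))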
14

Keyword Join: Realizing Keyword Search in P2P-based Database Systems

Yu, Bei, Liu, Ling, Ooi, Beng Chin, Tan, Kian Lee 01 1900 (has links)
In this paper, we present a P2P-based database sharing system that provides information sharing capabilities through keyword-based search techniques. Our system requires neither a global schema nor schema mappings between different databases, and our keyword-based search algorithms are robust in the presence of frequent changes in the content and membership of peers. To facilitate data integration, we introduce a keyword join operator to combine partial answers containing different keywords into complete answers. We also present an efficient algorithm that optimizes the keyword join operations for partial answer integration. Our experimental study on both real and synthetic datasets demonstrates the effectiveness of our algorithms and the efficiency of the proposed query processing strategies. / Singapore-MIT Alliance (SMA)
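The keyword join can be sketched as follows (PartialAnswer and keyword_join are illustrative names for this summary, not the paper's API): partial answers covering different keyword subsets are glued together on a shared join value, and any combination covering every query keyword becomes a complete answer.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PartialAnswer:
        keywords: frozenset    # query keywords this fragment covers
        rows: tuple            # the underlying tuples
        join_key: object       # value on which fragments can connect

    def keyword_join(left, right, query_keywords):
        """Combine partial answers pairwise on their join key; emit
        combinations that cover every keyword as complete answers."""
        complete, partial = [], []
        for l in left:
            for r in right:
                if l.join_key != r.join_key:
                    continue
                merged = PartialAnswer(l.keywords | r.keywords,
                                       l.rows + r.rows, l.join_key)
                (complete if merged.keywords >= query_keywords
                 else partial).append(merged)
        return complete, partial

    q = frozenset({"spark", "join"})
    left = [PartialAnswer(frozenset({"spark"}), (("peer1", "row7"),), 42)]
    right = [PartialAnswer(frozenset({"join"}), (("peer2", "row3"),), 42)]
    complete, partial = keyword_join(left, right, q)
    print(len(complete), len(partial))   # 1 0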
15

Scaling Big Data Cleansing

Khayyat, Zuhair 31 July 2017 (has links)
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big data. This presents a serious impediment since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge arbitrary errors. With large datasets, error detection becomes overly expensive and complicated, especially when considering user-defined functions. Furthermore, a distinctive algorithm is desired to optimize inequality joins in sophisticated error discovery rather than naïvely parallelizing them. Also, when repairing large errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing.

First, I present BigDansing, a general system to tackle efficiency, scalability, and ease-of-use issues in data cleansing for big data. It automatically parallelizes the user's code on top of general-purpose distributed platforms. Its programming interface allows users to express data quality rules independently from the requirements of parallel and distributed environments. Without sacrificing their quality, BigDansing also enables parallel execution of serial repair algorithms by exploiting the graph representation of discovered errors. The experimental results show that BigDansing outperforms existing baselines by up to more than two orders of magnitude.

Although BigDansing scales cleansing jobs, it still lacks the ability to handle sophisticated error discovery requiring inequality joins. Therefore, I developed IEJoin, an algorithm for fast inequality joins. It is based on sorted arrays and space-efficient bit-arrays to reduce the problem's search space. By comparing IEJoin against well-known optimizations, I show that it is more scalable and several orders of magnitude faster.

BigDansing depends on vertex-centric graph systems, i.e., Pregel, to efficiently store and process discovered errors. Although Pregel scales general-purpose graph computations, it is not able to handle skewed workloads efficiently. Therefore, I introduce Mizan, a Pregel system that balances the workload transparently during runtime to adapt to changes in computing needs. Mizan is general; it does not assume any a priori knowledge of the graph structure or the algorithm behavior. Through extensive evaluations, I show that Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.
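To illustrate the sorted-array idea behind IEJoin for a single inequality predicate (a simplified sketch: the actual algorithm pairs two sorted orders through a permutation array and marks visited positions in a space-efficient bit-array so that two inequality predicates can be evaluated at once):

    import bisect

    def inequality_join(R, S):
        """Join all pairs (r, s) with r < s using one sorted array and
        binary search, instead of a quadratic nested-loop scan."""
        s_sorted = sorted(S)
        out = []
        for r in R:
            i = bisect.bisect_right(s_sorted, r)   # first s strictly > r
            out.extend((r, s) for s in s_sorted[i:])
        return out

    print(inequality_join([1, 5], [2, 3, 7]))
    # -> [(1, 2), (1, 3), (1, 7), (5, 7)]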
16

Supporting Advanced Queries on Scientific Array Data

Ebenstein, Roee A. 18 December 2018 (has links)
No description available.
17

Multiple Continuous Query Processing with Relative Window Predicates "Juggler"

Silva, Asima 27 May 2004 (has links)
"Efficient querying over streaming data is a critical technology which requires the ability to handle numerous and possibly similar queries in real time dynamic environments such as the stock market and medical devices. Existing DBMS technology is not well suited for this domain since it was developed for static historical data. Queries over streams often contain relative window predicates such as in the query: ``Heart rate decreased to fifty-two beats per second within four seconds after the patient's temperature started rising." Relative window predicates are a specific type of join between streams that is based on the tuple's timestamp. In our operator, called Juggler, predicates are classified into three types: attribute, join, and window. Attribute predicates are stream values compared to a constant. Join predicates are stream values compared to another stream's values. Window predicates are join predicates where the streams' timestamp values are compared. Juggler's composite operator incorporates the processing of similar though not identical, query functionalities as one complex computation process. This execution strategy handles multi-way joins for multiple selection and join predicates. It adaptively orders the execution of predicates by their selectivity to efficiently process multiple continuous queries based on stream characteristics. In Juggler, all similar predicates are grouped into lists. These indices are represented by a collection of bits. Every tuple contains the bit structure representation of the predicate lists which encodes tuple predicate evaluation history. Every query also contains a similar bit structure to encode the predicate's relationship to the registered queries. The tuple's and query's bit structures are compared to assess if the tuple has satisfied a query. Juggler is designed and implemented in Java. Experiments were conducted to verify correctness and to assess the performance of Juggler's three features. Its adaptivity of reordering the evaluation of predicate types performed as well as the most selective predicate ordering. Its ability to exploit similar predicates in multiple queries showed reduction in number of comparisons. Its effectiveness when multiple queries are combined in a single Juggler operator indicated potential performance improvements after optimization of Juggler's data structures."
18

Distributed processing of spatial joins over multiple databases: multi-way spatial join

Cunha, Anderson Rogério 19 February 2014 (has links)
Spatial join is one of the most computationally expensive spatial operations. Its complexity increases significantly when multiple databases are involved (multi-way spatial join). Traditional multi-way spatial join processing strategies apply combinations of binary join algorithms in centralized computing environments. For complex queries, this approach demands substantial computational power, often making it unfeasible in centralized environments. This work proposes the Distributed Synchronous Traversal (DST) algorithm, whose goal is to enable distributed processing of multi-way spatial joins on a cluster of computers. DST is based on the Synchronous Traversal algorithm and processes the multi-way spatial join in a single synchronous descent over the levels of the input databases' R-trees; the final result is built incrementally, without creating temporary databases. To the best of our knowledge, there are no other proposals in the literature that address this problem in a distributed fashion on a peer-to-peer architecture. Many challenges had to be overcome, such as defining data structures that map the semantics of multi-way spatial join queries and coordinating the required distributed processing. DST proved satisfactorily parallelizable and scalable when processing real datasets in experiments on clusters of 1, 2, 4, and 8 servers. / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
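A single-process sketch of the synchronous-traversal idea may help (illustrative only; DST's actual contribution, the distributed peer-to-peer coordination of this descent, is omitted): all R-trees are descended together, and any combination of nodes whose bounding rectangles do not mutually intersect is pruned before its subtrees are ever visited.

    from itertools import combinations, product

    # MBRs are (xmin, ymin, xmax, ymax); a node is (mbr, children) and
    # a leaf is (mbr, None).
    def intersects(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or
                    a[3] < b[1] or b[3] < a[1])

    def synchronous_traversal(roots):
        """Descend all R-trees together, pruning any combination of
        nodes (one per tree) whose MBRs do not pairwise intersect."""
        stack, results = [tuple(roots)], []
        while stack:
            nodes = stack.pop()
            mbrs = [n[0] for n in nodes]
            if not all(intersects(p, q)
                       for p, q in combinations(mbrs, 2)):
                continue                      # prune early
            if all(n[1] is None for n in nodes):
                results.append(tuple(mbrs))   # all leaves: one result
            else:
                # Expand internal nodes; keep leaves as-is so trees of
                # uneven height can still be traversed synchronously.
                stack.extend(product(*[n[1] or [n] for n in nodes]))
        return results

    leaf = lambda box: (box, None)
    t1 = ((0, 0, 10, 10), [leaf((0, 0, 4, 4)), leaf((6, 6, 9, 9))])
    t2 = ((0, 0, 10, 10), [leaf((3, 3, 5, 5))])
    t3 = ((0, 0, 10, 10), [leaf((3, 2, 4, 8))])
    print(synchronous_traversal([t1, t2, t3]))
    # -> [((0, 0, 4, 4), (3, 3, 5, 5), (3, 2, 4, 8))]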
19

Seamless concurrent programming of objects, aspects and events

Van Ham, Jurgen Michael 09 March 2015 (has links)
The advanced concurrency abstractions provided by the join calculus overcome the drawbacks of low-level techniques such as locks and monitors. They raise the level of abstraction, freeing programmers who implement concurrent applications from the burden of concentrating on low-level details. However, with current approaches the coordination logic involved in complex coordination schemas is fragmented into several pieces, including join patterns, data emissions triggered in different places of the application, and the application logic that implicitly creates dependencies among channels, and hence indirectly among join patterns. We present JEScala, a language that captures coordination schemas in a more expressive and modular way by leveraging a seamless integration of an advanced event system with join abstractions. We implement joins-based state machines using JEScala and introduce a domain-specific language for finite state machines that makes faster alternative implementations possible. We validate our approach with case studies and provide a first performance assessment, comparing the performance of three different implementations of a finite state machine. Finally, we validate the idea of an event monitor, a component that synchronizes the handling of multiple events, by constructing a concurrent JEScala program from the parts of a sequential event-based program.
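For readers unfamiliar with join patterns, the toy model below (plain Python written for this summary, not JEScala's syntax) shows the behavior they capture: a handler fires only once every channel in its pattern holds a pending message.

    from collections import deque

    class Channel:
        def __init__(self):
            self.pending = deque()
            self.patterns = []

        def send(self, msg):
            self.pending.append(msg)
            for pattern in self.patterns:
                pattern.try_fire()

    class JoinPattern:
        """Fires its body once every joined channel has a message."""
        def __init__(self, channels, body):
            self.channels, self.body = channels, body
            for c in channels:
                c.patterns.append(self)

        def try_fire(self):
            if all(c.pending for c in self.channels):
                args = [c.pending.popleft() for c in self.channels]
                self.body(*args)

    # A join over two channels: fires only when both have a message.
    left, right = Channel(), Channel()
    JoinPattern([left, right], lambda a, b: print("joined:", a, b))
    left.send(1)    # nothing happens yet
    right.send(2)   # prints "joined: 1 2"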
20

Resource-efficient and fast Point-in-Time joins for Apache Spark: Optimization of time travel operations for the creation of machine learning training datasets

Pettersson, Axel January 2022 (has links)
A common scenario when training modern machine learning models is to use past data to make predictions about the future. When working with multiple structured and time-labeled datasets, it has become common practice to use a join operator called the Point-in-Time join, or PIT join, to construct training datasets. The PIT join matches each entry of the left dataset with the right-dataset entry whose recorded event time is closest to the left row's timestamp, among all right entries whose event time occurred before or at the left event time. This feature has long been available only in time-series data processing tools, but has recently received a new wave of attention due to the rising popularity of feature stores. To perform such an operation on large amounts of data, data engineers commonly turn to large-scale data processing tools such as Apache Spark. However, Spark has no native implementation of this join, and the community has not reached a clear consensus on how it should be performed. This, along with previous implementations of the PIT join, raises the question: "How to perform fast and resource-efficient Point-in-Time joins in Apache Spark?" To answer this question, three different algorithms for performing a PIT join in Spark were developed and compared in terms of resource consumption and execution time. These algorithms were benchmarked on generated datasets with varying physical partitioning and sort structures. Furthermore, the scalability of the algorithms was tested by running them on Apache Spark clusters of varying sizes. The benchmarks showed that the best results were achieved by performing the join with the Early Stop Sort-Merge Join, a modified version of Spark's native Sort-Merge Join. The best-performing datasets were those sorted by timestamp and primary key, ascending or descending, with a suitable number of physical partitions. Based on these findings, data engineers are provided with general guidelines for optimizing their data processing pipelines to perform faster and more resource-efficient PIT joins.
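The PIT matching rule, together with the early-stop idea of never rescanning the sorted right side, can be sketched on a single node as follows (illustrative Python, not the thesis's Spark implementation; both inputs are assumed pre-sorted by key and timestamp):

    def pit_join(left, right):
        """left, right: lists of (key, ts, value) sorted by (key, ts).
        Match each left row with the latest right row of the same key
        whose ts is <= the left row's ts, or None if there is none."""
        out, j, latest = [], 0, {}
        for key, ts, val in left:
            # Early stop: advance the right cursor only while it is at
            # or before (key, ts); consumed rows are never revisited.
            while (j < len(right) and
                   (right[j][0], right[j][1]) <= (key, ts)):
                latest[right[j][0]] = right[j]
                j += 1
            out.append(((key, ts, val), latest.get(key)))
        return out

    trades = [("A", 3, "t1"), ("A", 7, "t2"), ("B", 4, "t3")]
    quotes = [("A", 2, 100), ("A", 5, 101), ("B", 9, 55)]
    for pair in pit_join(trades, quotes):
        print(pair)
    # (('A', 3, 't1'), ('A', 2, 100))
    # (('A', 7, 't2'), ('A', 5, 101))
    # (('B', 4, 't3'), None)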
