Global ETD Search

1	Causal modeling and prediction over event streams Acharya, Saurav 01 January 2014 (has links) In recent years, there has been a growing need for causal analysis in many modern stream applications such as web page click monitoring, patient health care monitoring, stock market prediction, electric grid monitoring, and network intrusion detection systems. The detection and prediction of causal relationships help in monitoring, planning, decision making, and prevention of unwanted consequences. An event stream is a continuous unbounded sequence of event instances. The availability of a large amount of continuous data along with high data throughput poses new challenges related to causal modeling over event streams, such as (1) the need for incremental causal inference for the unbounded data, (2) the need for fast causal inference for the high throughput data, and (3) the need for real-time prediction of effects from the events seen so far in the continuous event streams. This dissertation research addresses these three problems by focusing on utilizing temporal precedence information which is readily available in event streams: (1) an incremental causal model to update the causal network incrementally with the arrival of a new batch of events instead of storing the complete set of events seen so far and building the causal network from scratch with those stored events, (2) a fast causal model to speed up the causal network inference time, and (3) a real-time top-k predictive query processing mechanism to find the most probable k effects with the highest scores by proposing a run-time causal inference mechanism which addresses cyclic causal relationships. In this dissertation, the motivation, related work, proposed approaches, and the results are presented in each of the three problems. Bayesian network Causal model Prediction Top-k query Computer Sciences
2	CPR: Complex Pattern Ranking for Evaluating Top-k Pattern Queries over Event Streams January 2011 (has links) abstract: Most existing approaches to complex event processing over streaming data rely on the assumption that the matches to the queries are rare and that the goal of the system is to identify these few matches within the incoming deluge of data. In many applications, such as stock market analysis and user credit card purchase pattern monitoring, however the matches to the user queries are in fact plentiful and the system has to efficiently sift through these many matches to locate only the few most preferable matches. In this work, we propose a complex pattern ranking (CPR) framework for specifying top-k pattern queries over streaming data, present new algorithms to support top-k pattern queries in data streaming environments, and verify the effectiveness and efficiency of the proposed algorithms. The developed algorithms identify top-k matching results satisfying both patterns as well as additional criteria. To support real-time processing of the data streams, instead of computing top-k results from scratch for each time window, we maintain top-k results dynamically as new events come and old ones expire. We also develop new top-k join execution strategies that are able to adapt to the changing situations (e.g., sorted and random access costs, join rates) without having to assume a priori presence of data statistics. Experiments show significant improvements over existing approaches. / Dissertation/Thesis / M.S. Computer Science 2011 Computer Science complex event processing pattern ranking top-k query
3	MTopS: Multi-Query Optimization for Continuous Top-K Query Workloads Shastri, Avani 05 May 2011 (has links) A continuous top-k query retrieves the k most preferred objects from a data stream according to a given preference function. These queries are important for a broad spectrum of applications from web-based advertising, network traffic monitoring, to financial analysis. Given the nature of such applications, a data stream may be subjected at any given time to multiple top-k queries with varying parameter settings requested simultaneously by different users. This workload of simultaneous top-k queries must be executed efficiently to assure real time responsiveness. However, existing methods in the literature focus on optimizing single top-k query processing, thus would handle each query independently. They are thus not suitable for handling large numbers of such simultaneous top-k queries due to their unsustainable resource demands. In this thesis, we present a comprehensive framework, called MTopS for Multiple Top-K Optimized Processing System. MTopS achieves resource sharing at the query level by analyzing parameter settings of all queries in the workload, including window-specific parameters and top-k parameters. We further optimize the shared processing by identifying the minimal object set from the data stream that is both necessary and sufficient for top-k monitoring of all queries in the workload. Within this framework, we design the MTopBand algorithm that maintains the up-to-date top-k result set in the size of O (k), where k is the required top-k result set, eliminating the need for any recomputation. To overcome the overhead caused by MTopBand to maintain replicas of the top-k result set across sliding windows, we optimize this algorithm further by integrating these views into one integrated structure, called MTopList. Our associated top-k maintenance algorithm, also called MTopList algorithm, is able to maintain this linear integrated structure, thus able to efficiently answer all queries in the workload. MTopList is shown to be memory optimal because it maintains only the distinct objects that are part of top-k results of at least one query. Our experimental study, using real data streams from domains of stock trades and moving object monitoring, demonstrates that both the efficiency and scalability in the query workload of our proposed technique is superior to the state-of-the-art solutions. meta query strategy MTopS predicted top-k results MTopList MTopBand multi top-k query
4	Efficient and Reliable In-Network Query Processing in Wireless Sensor Networks Malhotra, Baljeet Singh 11 1900 (has links) The Wireless Sensor Networks (WSNs) have emerged as a new paradigm for collecting and processing data from physical environments, such as wild life sanctuaries, large warehouses, and battlefields. Users can access sensor data by issuing queries over the network, e.g., to find what are the 10 highest temperature values in the network. Typically, a WSN operates by constructing a logical topology, such as a spanning tree, built on top of the physical topology of the network. The constructed logical topology is then used to disseminate queries in the network, and also to process and return the results of such queries back to the user. A major challenge in this context is prolonging the network's lifetime that mainly depends on the energy cost of data communication via wireless radios, which is known to be very expensive as compared to the cost of data processing within the network. In this research, we investigate some of the core problems that deal with the different aspects of in-network query processing in WSNs. In that context, we propose an efficient filtering based algorithm for the top-k query processing in WSNs. Through a systematic study of the top-k query processing in WSNs we propose several solutions in this thesis, which are applicable not only to the top-k queries, but also to in-network query processing problems in general. Specifically, we consider broadcasting and convergecasting, which are two basic operations that are required by many in-network query processing solutions. Scheduling broadcasting and convergecasting is another problem that is important for energy efficiency in WSNs. Failure of communication links, which are common in WSNs, is yet another important issue that needs to be addressed. In this research, we take a holistic approach to deal with the above problems while processing the top-k queries in WSNs. To this end, the thesis makes several contributions. In particular, our proposed solutions include new logical topologies, scheduling algorithms, and an overall sophisticated communication framework, which allows to process the top-k queries efficiently and with increased reliability. Extensive simulation studies reveal that our solutions are not only energy efficient, saving up to 50% of the energy cost as compared to the current state-of-the-art solutions, but they are also robust to link failures. Wireless Sensor Networks Data Query Processing Top-k Query Broadcast Convergecast Scheduling Link Failures Failure Recovery
5	Efficient and Reliable In-Network Query Processing in Wireless Sensor Networks Malhotra, Baljeet Singh Unknown Date No description available. Wireless Sensor Networks Data Query Processing Top-k Query Broadcast Convergecast Scheduling Link Failures Failure Recovery
6	Efficient and Effective Local Algorithms for Analyzing Massive Graphs Wu, Yubao 31 May 2016 (has links) No description available. Computer Science Bioinformatics community detection random walk top-k query densest subgraph dual networks
7	Préservation de la confidentialité des données externalisées dans le traitement des requêtes top-k / Privacy preserving top-k query processing over outsourced data Mahboubi, Sakina 21 November 2018 (has links) L’externalisation de données d’entreprise ou individuelles chez un fournisseur de cloud, par exemple avec l’approche Database-as-a-Service, est pratique et rentable. Mais elle introduit un problème majeur: comment préserver la confidentialité des données externalisées, tout en prenant en charge les requêtes expressives des utilisateurs. Une solution simple consiste à crypter les données avant leur externalisation. Ensuite, pour répondre à une requête, le client utilisateur peut récupérer les données cryptées du cloud, les décrypter et évaluer la requête sur des données en texte clair (non cryptées). Cette solution n’est pas pratique, car elle ne tire pas parti de la puissance de calcul fournie par le cloud pour évaluer les requêtes.Dans cette thèse, nous considérons un type important de requêtes, les requêtes top-k, et le problème du traitement des requêtes top-k sur des données cryptées dans le cloud, tout en préservant la vie privée. Une requête top-k permet à l’utilisateur de spécifier un nombre k de tuples les plus pertinents pour répondre à la requête. Le degré de pertinence des tuples par rapport à la requête est déterminé par une fonction de notation.Nous proposons d’abord un système complet, appelé BuckTop, qui est capable d’évaluer efficacement les requêtes top-k sur des données cryptées, sans avoir à les décrypter dans le cloud. BuckTop inclut un algorithme de traitement des requêtes top-k qui fonctionne sur les données cryptées, stockées dans un nœud du cloud, et retourne un ensemble qui contient les données cryptées correspondant aux résultats top-k. Il est aidé par un algorithme de filtrage efficace qui est exécuté dans le cloud sur les données chiffrées et supprime la plupart des faux positifs inclus dans l’ensemble renvoyé. Lorsque les données externalisées sont volumineuses, elles sont généralement partitionnées sur plusieurs nœuds dans un système distribué. Pour ce cas, nous proposons deux nouveaux systèmes, appelés SDB-TOPK et SD-TOPK, qui permettent d’évaluer les requêtes top-k sur des données distribuées cryptées sans avoir à les décrypter sur les nœuds où elles sont stockées. De plus, SDB-TOPK et SD-TOPK ont un puissant algorithme de filtrage qui filtre les faux positifs autant que possible dans les nœuds et renvoie un petit ensemble de données cryptées qui seront décryptées du côté utilisateur. Nous analysons la sécurité de notre système et proposons des stratégies efficaces pour la mettre en œuvre.Nous avons validé nos solutions par l’implémentation de BuckTop, SDB-TOPK et SD-TOPK, et les avons comparé à des approches de base par rapport à des données synthétiques et réelles. Les résultats montrent un excellent temps de réponse par rapport aux approches de base. Ils montrent également l’efficacité de notre algorithme de filtrage qui élimine presque tous les faux positifs. De plus, nos systèmes permettent d’obtenir une réduction significative des coûts de communication entre les nœuds du système distribué lors du calcul du résultat de la requête. / Outsourcing corporate or individual data at a cloud provider, e.g. using Database-as-a-Service, is practical and cost-effective. But it introduces a major problem: how to preserve the privacy of the outsourced data, while supporting powerful user queries. A simple solution is to encrypt the data before it is outsourced. Then, to answer a query, the user client can retrieve the encrypted data from the cloud, decrypt it, and evaluate the query over plaintext (non encrypted) data. This solution is not practical, as it does not take advantage of the computing power provided by the cloud for evaluating queries.In this thesis, we consider an important kind of queries, top-k queries,and address the problem of privacy-preserving top-k query processing over encrypted data in the cloud.A top-k query allows the user to specify a number k, and the system returns the k tuples which are most relevant to the query. The relevance degree of tuples to the query is determined by a scoring function.We first propose a complete system, called BuckTop, that is able to efficiently evaluate top-k queries over encrypted data, without having to decrypt it in the cloud. BuckTop includes a top-k query processing algorithm that works on the encrypted data, stored at one cloud node,and returns a set that is proved to contain the encrypted data corresponding to the top-k results. It also comes with an efficient filtering algorithm that is executed in the cloud on encypted data and removes most of the false positives included in the set returned.When the outsourced data is big, it is typically partitioned over multiple nodes in a distributed system. For this case, we propose two new systems, called SDB-TOPK and SD-TOPK, that can evaluate top-k queries over encrypted distributed data without having to decrypt at the nodes where they are stored. In addition, SDB-TOPK and SD-TOPK have a powerful filtering algorithm that filters the false positives as much as possible in the nodes, and returns a small set of encrypted data that will be decrypted in the user side. We analyze the security of our system, and propose efficient strategies to enforce it.We validated our solutions through implementation of BuckTop , SDB-TOPK and SD-TOPK, and compared them to baseline approaches over synthetic and real databases. The results show excellent response time compared to baseline approaches. They also show the efficiency of our filtering algorithm that eliminates almost all false positives. Furthermore, our systems yieldsignificant reduction in communication cost between the distributed system nodes when computing the query result. Préservation de la confidentialité Requêtes top-K Données sensibles Sécurité du cloud Systèmes distribués Privacy Preserving Top-K Query Processing Sensitive Data Cloud security Distributed Systems
8	針對複合式競賽挑選最佳球員組合的方法 / Selecting the best group of players for a composite competition 鄧雅文, Teng, Ya Wen Unknown Date (has links) 在資料庫的處理中，top-k查詢幫助使用者從龐大的資料中萃取出具有價值的物件，它將資料庫中的物件依照給分公式給分後，選擇出分數最高的前k個回傳給使用者。然而在多數的情況下，一個物件也許不只有一個分數，要如何在多個分數中仍然選擇出整體最高分的前k個物件，便成為一個新的問題。在本研究中，我們將這樣的物件用不確定資料來表示，而每個物件的不確定性則是其帶有機率的分數以表示此分數出現的可能性，並提出一個新的問題：Best-kGROUP查詢。在此我們將情況模擬為一個複合式競賽，其中有多個子項目，每個項目的參賽人數各異，且最多需要k個人參賽；我們希望能針對此複合式競賽挑選出最佳的k個球員組合。當我們定義一個較佳的組合為其在較多項目居首位的機率比另一組合高，而最佳的組合則是沒有比它更佳的組合。為了加快挑選的速度，我們利用動態規劃的方式與篩選的演算法，將不可能的組合先剔除；所剩的組合則是具有天際線特質的組合，在這些天際線組合中，我們可以輕易的找出最佳的組合。此外，在實驗中，對於在所有球員中挑選最佳的組合，Best-kGROUP查詢也有非常優異的表現。 / In a large database, top-k query is an important mechanism to retrieve the most valuable information for the users. It ranks data objects with a ranking function and reports the k objects with the highest scores. However, when an object has multiple scores, how to rank objects without information loss becomes challenging. In this paper, we model the object with multiple scores as an uncertain data object and the uncertainty of the object as a distribution of the scores, and consider a novel problem named Best-kGROUP query. Imagine the following scenario. Assume there is a composite competition consisting of several games each of which requires a distinct number of players. Suppose the largest number is k, and we want to select the best group of k players from all the players for the competition. A group x is considered better than another group y if x has higher aggregated probability to be the top ones in more games than y. In order to speed up the selection process, the groups worse than another group definitely should first be discarded. We identify these groups using a dynamic programming based approach and a filtering algorithm. The remaining groups with the property that none of them have higher aggregated probability to be the top ones for all games against the other groups are called skyline groups. From these skyline groups, we can easily compare them to select the best group for the composite competition. The experiments show that our approach outperforms the other approaches in selecting the best group to defeat the other groups in the composite competitions. Top-k查詢不確定資料 Best-kGROUP查詢 Top-k query uncertain data Best-kGROUP query skyline kGROUP
9	Query-Time Data Integration Eberius, Julian 16 December 2015 (has links) (PDF) Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent, but mutually diverse, set of alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we studied the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture, that enables ad-hoc data search and integration queries over large heterogeneous dataset collections. Datenintegration Ad-hoc Integration Top-k Anfrageverarbeitung Webdaten data integration ad-hoc integration top-k query processing web data ddc:004 rvk:ST 270
10	Query-Time Data Integration Eberius, Julian 10 December 2015 (has links) Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent, but mutually diverse, set of alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we studied the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture, that enables ad-hoc data search and integration queries over large heterogeneous dataset collections. info:eu-repo/classification/ddc/004 ddc:004

Search results