1

Probabilistic threshold range aggregate query processing over uncertain data

Yang, Shuxiang, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2009
Uncertainty is inherent in many novel and important applications such as market surveillance, information extraction, and sensor data analysis. In recent decades, uncertain data has attracted considerable research attention. Various factors cause the uncertainty, for instance randomness or incompleteness of data, limitations of equipment, and delay or loss in data transfer. A probabilistic threshold range aggregate (PRTA) query retrieves summarized information about the uncertain objects in the database satisfying a range query, with respect to a given probability threshold. This thesis addresses this important type of query, which no previous work has studied. We formulate the problem in both the discrete and the continuous uncertain data model and develop a novel index structure, the asU-tree (aggregate-based sampling-auxiliary U-tree), which not only supports exact query answering but also provides approximate results with an accuracy guarantee when efficiency is the primary concern. The asU-tree is fully dynamic. Query processing algorithms for both exact and approximate answers based on this new index structure are also proposed. An extensive experimental study shows that the asU-tree is efficient and effective over both real and synthetic datasets.
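To make the query type concrete, here is a minimal sketch of a PRTA COUNT query under the discrete uncertain data model, written as a naive linear scan; the asU-tree index exists precisely to avoid this object-by-object evaluation. The function names and toy data are illustrative, not from the thesis.

```python
def appearance_prob(instances, lo, hi):
    """Probability that a discrete uncertain object lies in [lo, hi]:
    the sum of the probabilities of its instances inside the range."""
    return sum(p for x, p in instances if lo <= x <= hi)

def prta_count(objects, lo, hi, theta):
    """Probabilistic threshold range aggregate (COUNT): the number of
    objects whose probability of falling in the range is >= theta."""
    return sum(1 for obj in objects if appearance_prob(obj, lo, hi) >= theta)

# Each uncertain object is a list of (location, probability) instances.
objects = [
    [(1.0, 0.5), (4.0, 0.5)],   # certainly inside [0, 5]
    [(2.0, 0.2), (9.0, 0.8)],   # inside [0, 5] with probability only 0.2
    [(3.0, 1.0)],
]
print(prta_count(objects, lo=0.0, hi=5.0, theta=0.5))  # -> 2
```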
2

In silico modeling for uncertain biochemical data

Gusenleitner, Daniel, January 2009
Analyzing and modeling data is a well-established research area, and a vast variety of methods have been developed over the last decades. Most of these methods assume fixed positions of data points; only recently has uncertainty in data attracted attention as a potentially useful source of information. To provide deeper insight into this subject, this thesis concerns itself with the following essential question: can information on the uncertainty of feature values be exploited to improve in silico modeling? To this end, a state-of-the-art random forest algorithm is developed in MATLAB. In addition, three techniques for handling uncertain numeric features are presented and incorporated into different modified versions of random forests. To test the hypothesis, six real-world data sets were provided by AstraZeneca. The data describe biochemical features of chemical compounds, including the results of an Ames test, a widely used technique to determine the mutagenicity of chemical substances. Each of the datasets contains a single uncertain numeric feature, represented as an expected value and an error estimate. The modified algorithms are then applied to the six data sets to obtain classifiers able to predict the outcome of an Ames test. The hypothesis is tested using a paired t-test, and the results reveal that information on uncertainty can indeed improve the performance of in silico models.
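The abstract does not spell out the three techniques for handling the uncertain feature. One plausible variant, sketched below in Python rather than the thesis's MATLAB, is to re-draw the uncertain feature from its error distribution for every tree, so the ensemble integrates over the measurement uncertainty. The column index, the Gaussian error model, and all names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def uncertainty_forest(X, x_err, y, n_trees=100):
    """Bagged trees in which the uncertain feature (assumed to be column 0,
    with per-row error estimates x_err) is re-drawn from N(value, error)
    for each tree, integrating over the stated uncertainty."""
    trees, n = [], len(y)
    for _ in range(n_trees):
        Xs = X.copy()
        Xs[:, 0] = rng.normal(X[:, 0], x_err)  # perturb the uncertain feature
        idx = rng.integers(0, n, n)            # bootstrap sample, as in bagging
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(Xs[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Majority vote over the ensemble (binary 0/1 labels assumed)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```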
3

Geometric Computing over Uncertain Data

Zhang, Wuzhou, January 2015
Entering the era of big data, we are faced with an unprecedented amount of geometric data. Many computational challenges arise in processing this new deluge of geometric data. A critical one is data uncertainty: the data is inherently noisy and inaccurate, and often incomplete. The past few decades have witnessed the influence of geometric algorithms in various fields including GIS, spatial databases, and computer vision. Yet most existing geometric algorithms are built on the assumption that the data is precise, and they are incapable of properly handling data in the presence of uncertainty. This thesis explores a few algorithmic challenges in what we call geometric computing over uncertain data.

We study the nearest-neighbor searching problem, which returns the nearest neighbor of a query point in a set of points, in a probabilistic framework. This thesis investigates two different nearest-neighbor formulations: expected nearest neighbor (ENN), where we consider the expected distance between each input point and a query point, and probabilistic nearest neighbor (PNN), where we estimate the probability of each input point being the nearest neighbor of a query point.

For the ENN problem, we consider a probabilistic framework in which the location of each input point and/or query point is specified as a probability density function, and the goal is to return the point that minimizes the expected distance. We present methods for computing an exact ENN or an ε-approximate ENN, for a given error parameter 0 < ε < 1, under different distance functions. These methods build an index of near-linear size and answer ENN queries in polylogarithmic or sublinear time, depending on the underlying function. As far as we know, these are the first nontrivial methods with provable performance guarantees for answering exact or ε-approximate ENN queries. Moreover, we extend our results to answer exact or ε-approximate k-ENN queries. Notably, when only the query points are uncertain, we obtain state-of-the-art results for top-k aggregate (group) nearest-neighbor queries in the L1 metric using the weighted SUM operator.

For the PNN problem, we consider a probabilistic framework in which the location of each input point is specified as a probability distribution function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability; (ii) estimating, within a specified additive error, the probability of a point being the nearest neighbor of a query point; and (iii) using this estimate to return the point that maximizes the probability of being the nearest neighbor, or all points whose probability of being the nearest neighbor exceeds a given threshold. We also present experimental results to demonstrate the effectiveness of our approach.

We study the convex-hull problem, which asks for the smallest convex set that contains a given point set, in a probabilistic setting. In our framework, the uncertainty of each input point is described by a probability distribution over a finite number of possible locations, including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for membership queries, a connection between Tukey depth and membership queries, as well as a new notion of β-hull that may be a useful representation of uncertain hulls.

We study contour trees of terrains, which encode the topological changes of the level set of the height value ℓ as we raise ℓ from -∞ to +∞, in a probabilistic setting. We consider a terrain defined by linearly interpolating each triangle of a triangulation. In our framework, the uncertainty lies in the height of each vertex in the triangulation, and we assume that it is described by a probability distribution. We first show that the probability of a vertex being a critical point, and the expected number of nodes (resp. edges) of the contour tree, can be computed exactly and efficiently. Then we present efficient sampling-based methods for estimating, with high probability, (i) the probability that two points lie on an edge of the contour tree, within additive error; and (ii) the expected distance of two points p, q and the probability that the distance of p, q is at least ℓ on the contour tree, within additive and/or relative error, where the distance of p, q on a contour tree is defined to be the difference between the maximum height and the minimum height on the unique path from p to q on the contour tree.
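As a baseline to contrast with the indexed methods above, a brute-force Monte Carlo ENN evaluation looks as follows; the thesis's contribution is precisely to replace this linear scan with near-linear-size indexes answering queries in polylogarithmic or sublinear time. The Gaussian location pdfs and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_nn(uncertain_points, q, n_samples=10_000):
    """Brute-force expected nearest neighbor: each uncertain point is a
    sampler over its locations; return the index of the point minimizing
    the Monte Carlo estimate of E[dist(P, q)]."""
    best, best_ed = None, float("inf")
    for i, sample in enumerate(uncertain_points):
        locs = sample(n_samples)                      # (n_samples, d) draws from the pdf
        ed = np.linalg.norm(locs - q, axis=1).mean()  # estimated expected distance
        if ed < best_ed:
            best, best_ed = i, ed
    return best, best_ed

# Two uncertain points with Gaussian location pdfs.
points = [
    lambda n: rng.normal([0, 0], 0.5, size=(n, 2)),
    lambda n: rng.normal([3, 3], 2.0, size=(n, 2)),
]
print(expected_nn(points, q=np.array([1.0, 1.0])))
```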
4

Geometric Facility Location Problems on Uncertain Data

Zhang, Jingru, 01 August 2017
Facility location, an important topic in computer science and operations research, is concerned with placing facilities to "serve" demand points (each representing a customer) so as to minimize the (service) cost. In the real world, data is often associated with uncertainty because of measurement inaccuracy, sampling discrepancy, outdated data sources, resource limitations, etc. Hence, problems on uncertain data have attracted much attention. In this dissertation, we mainly study a classical facility location problem, the k-center problem, and several of its variations, on uncertain points each of which has multiple locations that follow a probability density function (pdf). We develop efficient algorithms for solving these problems. Since these problems have a certain geometric flavor, computational geometry techniques are utilized in developing the algorithms. In particular, we first study the k-center problem on uncertain points on a line, which aims to find k centers on the line minimizing the maximum expected distance from all uncertain points to their expected closest centers. We develop efficient algorithms for both the continuous case, where the location of every uncertain point follows a continuous piecewise-uniform pdf, and the discrete case, where each uncertain point has multiple discrete locations each associated with a probability. The time complexities of our algorithms are nearly linear and match those for the same problem on deterministic points. Then, we consider the one-center problem (i.e., k = 1) on a tree, where each uncertain point has multiple locations in the tree and we want to compute a center in the tree that minimizes the maximum expected distance from it to all uncertain points. We solve the problem in linear time by proposing a new algorithmic scheme, called refined prune-and-search. Next, we consider the one-dimensional one-center problem of uncertain points with continuous pdfs, and the one-center problem in the plane under the rectilinear metric for uncertain points with discrete locations. We solve both problems in linear time, again using the refined prune-and-search technique. In addition, we study the k-center problem on uncertain points in a tree. We present an efficient algorithm for the problem by proposing a new tree decomposition and developing several data structures. The tree decomposition and these data structures may be interesting in their own right. Finally, we consider the line-constrained k-center problem on deterministic points in the plane, where the centers are required to lie on a given line. Several distance metrics, including L1, L2, and L∞, are considered. We also study the line-constrained k-median and k-means problems in the plane. These problems have been studied before; based on geometric observations, we design new algorithms that improve on the previous work. The algorithms and techniques developed in this dissertation may find other applications as well, in particular in solving other related problems on uncertain data.
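For intuition on the one-dimensional case: the expected distance from a center c to a discrete uncertain point, Σ_j p_j |x_j - c|, is convex in c, so the one-center objective (a maximum of convex functions) can be minimized by plain ternary search, as sketched below. This is only a numeric toy under those assumptions, not the dissertation's near-linear-time prune-and-search algorithm.

```python
def expected_dist(point, c):
    """Expected distance from center c to a discrete uncertain point,
    given as a list of (location, probability) pairs."""
    return sum(p * abs(x - c) for x, p in point)

def one_center_line(points, lo, hi, iters=100):
    """One-center for uncertain points on a line: minimize the maximum
    expected distance by ternary search over the convex objective."""
    f = lambda c: max(expected_dist(pt, c) for pt in points)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

points = [[(0.0, 0.5), (2.0, 0.5)], [(5.0, 1.0)]]
print(one_center_line(points, lo=0.0, hi=5.0))  # -> 3.0
```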
5

High-Performance Processing of Continuous Uncertain Data

Tran, Thanh Thi Lac, 01 May 2013
Uncertain data has arisen in a growing number of applications such as sensor networks, RFID systems, weather radar networks, and digital sky surveys. The fact that the raw data in these applications is often incomplete, imprecise, and even misleading has two implications: (i) the raw data is not suitable for direct querying, and (ii) feeding the uncertain data into existing systems produces results of unknown quality. This thesis presents a system for uncertain data processing with two key functionalities: (i) capturing and transforming raw noisy data into rich, queriable tuples that carry the attributes needed for query processing with quantified uncertainty, and (ii) performing query processing on such tuples, capturing the changes of uncertainty as data passes through various query operators. The proposed system considers data naturally captured by continuous distributions, which is prevalent in sensing and scientific applications. The first part of the thesis addresses data capture and transformation by proposing a probabilistic modeling and inference approach. Since this task is application-specific and requires domain knowledge, the approach is demonstrated for RFID data from mobile readers. More specifically, the proposed solution involves an inference and cleaning substrate that transforms raw RFID data streams into object-location tuple streams, where locations are inferred from raw noisy data and their uncertain values are captured by probability distributions. The second and main part of this thesis examines query processing for uncertain data modeled by continuous random variables. The proposed system includes new data models and algorithms for relational processing, with a focus on aggregation and conditioning operations. For operations of high complexity, optimizations including approximations with guaranteed error bounds are considered. Complex queries involving a mix of operations are then addressed by query planning, which, given a query, finds an efficient plan that meets user-defined accuracy requirements. Besides relational processing, this thesis also provides support for user-defined functions (UDFs) on uncertain data, which aims to compute the output distribution given uncertain input and a black-box UDF. The proposed solution employs a learning-based approach using Gaussian processes to compute approximate output with error bounds, plus a suite of optimizations for high performance in online settings such as data stream processing and interactive data analysis. The techniques proposed in this thesis are thoroughly evaluated using both synthetic data with controlled properties and various real-world datasets from the domains of severe weather monitoring, object tracking using RFID readers, and computational astrophysics. The experimental results show that these techniques yield high accuracy, meet stream speeds, and outperform existing techniques such as Monte Carlo sampling for many important workloads.
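To illustrate the UDF idea, the sketch below fits a Gaussian-process surrogate to a black-box function using scikit-learn and pushes an uncertain input through it by sampling; the thesis's actual solution adds error bounds and online optimizations not shown here. The toy UDF and all parameters are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

def black_box_udf(x):
    """Stand-in for an expensive user-defined function."""
    return np.sin(x) + 0.1 * x ** 2

# Fit a cheap GP surrogate of the UDF from a few probe evaluations.
X_train = np.linspace(-3, 3, 25).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, black_box_udf(X_train.ravel()))

# Propagate an uncertain (Gaussian) input through the surrogate by sampling,
# approximating the output distribution without re-running the UDF itself.
x_samples = rng.normal(0.5, 0.3, size=(1000, 1))
y_pred = gp.predict(x_samples)
print(f"output mean ~ {y_pred.mean():.3f}, output std ~ {y_pred.std():.3f}")
```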
6

Selecting the best group of players for a composite competition

Teng, Ya Wen, Unknown Date
In a large database, a top-k query is an important mechanism for retrieving the most valuable information for users. It ranks data objects with a ranking function and reports the k objects with the highest scores. However, when an object has multiple scores, how to rank objects without information loss becomes challenging. In this work, we model an object with multiple scores as an uncertain data object whose uncertainty is a distribution over its scores, and consider a novel problem named the Best-kGROUP query. Imagine the following scenario: a composite competition consists of several games, each of which requires a distinct number of players. Suppose the largest such number is k, and we want to select the best group of k players from all the players for the competition. A group x is considered better than another group y if x has a higher aggregated probability of being the top one in more games than y, and the best group is one for which no better group exists. To speed up the selection process, groups that are definitely worse than some other group should first be discarded. We identify these groups using a dynamic-programming-based approach and a filtering algorithm. The remaining groups, none of which has a higher aggregated probability of being the top one in all games than another, are called skyline groups. From these skyline groups, we can easily compare candidates and select the best group for the composite competition. The experiments show that our approach outperforms the alternatives in selecting the best group to defeat the other groups in composite competitions.
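A minimal sketch of the skyline step, assuming the per-game probabilities of each group being the top one have already been computed (the paper obtains them from score distributions via its dynamic-programming filter): keep only groups not dominated in every game, then choose the best among the few survivors. Names and data are illustrative.

```python
def dominates(x, y):
    """x dominates y if x's top-probability is at least y's in every game
    and strictly higher in at least one."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def skyline(groups):
    """Keep only the skyline groups: those not dominated by any other group."""
    return {g: v for g, v in groups.items()
            if not any(dominates(w, v) for h, w in groups.items() if h != g)}

# Probability of each candidate group being top in each of three games.
groups = {
    "A": (0.7, 0.2, 0.6),
    "B": (0.5, 0.1, 0.4),   # dominated by A, so filtered out
    "C": (0.3, 0.8, 0.5),
}
print(skyline(groups))       # -> {'A': ..., 'C': ...}
```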
7

A framework for processing correlated probabilistic data

van Schaik, Sebastiaan Johannes, January 2014
The amount of digitally born data has surged in recent years. In many scenarios, this data is inherently uncertain (or probabilistic), such as data originating from sensor networks, image and voice recognition, location detection, and automated web data extraction. Probabilistic data requires novel and different approaches to data mining and analysis which explicitly account for the uncertainty and the correlations therein. This thesis introduces ENFrame: a framework for processing and mining correlated probabilistic data. Using this framework, it is possible to express both traditional and novel algorithms for data analysis in a special user language, without having to explicitly address the uncertainty of the data on which the algorithms operate. The framework subsequently executes the algorithm on the probabilistic input and performs exact or approximate parallel probability computation. During the probability computation, correlations and provenance are succinctly encoded using probabilistic events. This thesis contains novel contributions in several directions. An expressive user language, a subset of Python, is introduced, which allows a programmer to implement algorithms for probabilistic data without requiring knowledge of the underlying probabilistic model. Furthermore, an event language is presented, which is used for the probabilistic interpretation of the user program. The event language can succinctly encode arbitrary correlations using events, which are the probabilistic counterparts of deterministic user-program variables. These highly interconnected events are stored in an event network, a probabilistic interpretation of the original user program. Multiple techniques for exact and approximate probability computation (with error guarantees) on such event networks are presented, as well as techniques for parallel computation. Adaptations of multiple existing data mining algorithms are shown to work in the framework and are subjected to an extensive experimental evaluation. Additionally, a use case is presented in which a probabilistic adaptation of a clustering algorithm is used to predict faults in energy distribution networks. Lastly, this thesis presents techniques for integrating a number of different probabilistic data formalisms for use in this framework and in other applications.
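The core idea behind the event language can be shown in a few lines: program conditions become boolean events over shared random variables, and shared variables induce correlations that a naive independence assumption would miss. The sketch below enumerates possible worlds for exact probabilities, which is exponential; ENFrame's succinct event networks and approximation schemes exist to avoid exactly this blow-up. All names are illustrative.

```python
from itertools import product

def prob(event, var_probs):
    """Exact probability of a boolean event over independent Bernoulli
    variables, by enumerating all possible worlds (a toy: exponential)."""
    names = list(var_probs)
    total = 0.0
    for world in product([False, True], repeat=len(names)):
        w = dict(zip(names, world))
        p = 1.0
        for name in names:
            p *= var_probs[name] if w[name] else 1 - var_probs[name]
        if event(w):
            total += p
    return total

# Two user-program conditions sharing variable 'a' are correlated:
var_probs = {"a": 0.6, "b": 0.3}
in_cluster_1 = lambda w: w["a"] and not w["b"]
in_cluster_2 = lambda w: w["a"] and w["b"]
# P(cluster 1 or cluster 2) collapses to P(a) = 0.6 because both share 'a';
# multiplying marginals as if independent would give the wrong answer.
print(prob(lambda w: in_cluster_1(w) or in_cluster_2(w), var_probs))  # -> 0.6
```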
8

Clustering Uncertain Data with Possible Worlds

Lehner, Wolfgang, Volk, Peter Benjamin, Rosenthal, Frank, Hahmann, Martin, Habich, Dirk, 16 August 2022
The topic of managing uncertain data has been explored in many ways, and different methodologies for data storage and query processing have been proposed. As the availability of management systems grows, research on analytics over uncertain data is gaining in importance. Similar to the challenges faced in data management, algorithms for mining uncertain data also suffer high performance degradation compared to their counterparts for certain data. To overcome this performance degradation, the MCDB approach was developed for uncertain data management based on the possible-worlds scenario. As this methodology shows significant performance and scalability enhancements, we adopt it for mining uncertain data. In this paper, we introduce a clustering methodology for uncertain data and illustrate current issues with this approach within the field of clustering uncertain data.
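A minimal sketch of the possible-worlds idea applied to clustering, assuming Gaussian tuple uncertainty and scikit-learn's k-means: instantiate sampled worlds, cluster each world, and aggregate into pairwise same-cluster probabilities. The aggregation choice here is ours for illustration; combining per-world clusterings is among the open issues the paper discusses.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

def possible_worlds_kmeans(means, stds, k=2, n_worlds=20):
    """MCDB-style clustering sketch: sample each uncertain tuple from its
    distribution to instantiate a possible world, cluster every world,
    and count how often each pair of tuples shares a cluster."""
    n = len(means)
    co_occurrence = np.zeros((n, n))
    for _ in range(n_worlds):
        world = rng.normal(means, stds)               # one possible world
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(world)
        co_occurrence += labels[:, None] == labels[None, :]
    return co_occurrence / n_worlds                   # same-cluster probability

means = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 5.0]])
stds = np.full_like(means, 0.4)
print(possible_worlds_kmeans(means, stds).round(2))
```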
