1 |
New Probabilistic Interest Measures for Association Rules. Hahsler, Michael; Hornik, Kurt. January 2006.
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we begin by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework, we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic. / Series: Research Report Series / Department of Statistics and Mathematics
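As a rough illustration of the measures discussed in this abstract, the sketch below computes confidence and lift from item and co-occurrence counts, together with hypergeometric-based quantities in the spirit of hyper-lift and hyper-confidence; the toy counts and the quantile level delta are illustrative assumptions rather than values from the paper.

```python
# Illustrative sketch (not the authors' code): confidence, lift, and
# hypergeometric-based variants in the spirit of hyper-lift / hyper-confidence.
from scipy.stats import hypergeom

def confidence(c_xy, c_x):
    """P(Y | X) estimated as count(X and Y) / count(X)."""
    return c_xy / c_x

def lift(c_xy, c_x, c_y, m):
    """Observed co-occurrence relative to what independence predicts."""
    return (c_xy / m) / ((c_x / m) * (c_y / m))

def hyper_lift(c_xy, c_x, c_y, m, delta=0.99):
    # Under independence, the co-occurrence count follows a hypergeometric
    # distribution: draw c_x transactions out of m, of which c_y contain Y.
    q = hypergeom(M=m, n=c_y, N=c_x).ppf(delta)
    return c_xy / max(q, 1.0)

def hyper_confidence(c_xy, c_x, c_y, m):
    # Probability of seeing fewer than the observed count under independence.
    return hypergeom(M=m, n=c_y, N=c_x).cdf(c_xy - 1)

# Toy example: 10,000 transactions, X in 500, Y in 800, X and Y together in 60.
m, c_x, c_y, c_xy = 10_000, 500, 800, 60
print(confidence(c_xy, c_x), lift(c_xy, c_x, c_y, m))
print(hyper_lift(c_xy, c_x, c_y, m), hyper_confidence(c_xy, c_x, c_y, m))
```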
|
2 |
Ultra Wideband Impulse Radio in Multiple Access Wireless Communications. Lai, Weei-Shehng. 25 July 2004.
Ultra-wideband impulse radio (UWB-IR) is an attractive technology for multi-user systems requiring high data rate transmission. In this thesis, we use an ultra wideband (UWB) signal modulated with the time-hopping spread spectrum technique in a wireless multiple access environment and study the influence of multiple access interference. Two aspects of this interference are examined. First, we analyze the multiple access interference seen by the conventional correlation receiver and discuss the effect of the time-hopping code on different multiple access structures. Second, since detection performance and system capacity are degraded when the conventional correlation receiver is used in multiple access channels, we apply probabilistic data association (PDA) multi-user detection to suppress the multiple access interference. We verify the system performance through computer simulations and compare the method to other multi-user detectors and to the conventional correlation receiver. The simulation results show that the PDA multi-user detector retains improved performance even when the system is fully loaded.
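For intuition, here is a minimal sketch of probabilistic data association multi-user detection on a simplified synchronous linear model (y = S b + n with BPSK symbols), rather than the time-hopping UWB system studied in the thesis; the signature length, number of users and noise level are arbitrary assumptions.

```python
# Simplified PDA multi-user detection sketch on y = S b + n (synchronous, BPSK).
import numpy as np

rng = np.random.default_rng(0)
K, N, sigma = 4, 16, 0.4                      # users, signature length, noise std
S = rng.standard_normal((N, K))
S /= np.linalg.norm(S, axis=0)                # unit-norm spreading signatures
b = rng.choice([-1.0, 1.0], size=K)           # transmitted BPSK symbols
y = S @ b + sigma * rng.standard_normal(N)    # received vector with MAI + noise

p = np.full(K, 0.5)                           # P(b_k = +1), initialised uniformly
for _ in range(5):                            # a few PDA iterations
    for k in range(K):
        others = [j for j in range(K) if j != k]
        mean_b = 2 * p[others] - 1            # soft symbol means of interferers
        var_b = 1 - mean_b**2                 # and their variances
        resid = y - S[:, others] @ mean_b     # cancel expected interference
        cov = sigma**2 * np.eye(N) + (S[:, others] * var_b) @ S[:, others].T
        llr = 2 * S[:, k] @ np.linalg.solve(cov, resid)
        p[k] = 1 / (1 + np.exp(-llr))         # posterior that b_k = +1

b_hat = np.where(p > 0.5, 1.0, -1.0)
print("symbol errors:", int(np.sum(b_hat != b)))
```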
|
3 |
Visual Tracking With Group Motion Approach. Arslan, Ali Erkin. 01 January 2003.
An algorithm for tracking single visual targets is developed in this study. Feature detection is the necessary and appropriate image processing technique for this algorithm. The main point of the approach is to use the data supplied by feature detection as observations from a group of targets having similar motion dynamics; a single visual target is therefore regarded as a group of multiple targets. Accurate data association and state estimation under clutter are required for this application, as in other multi-target tracking applications. The group tracking approach is used with the well-known probabilistic data association technique to cope with the data association and estimation problems. The applicability of this method to visual tracking in particular, and to other cases, is also discussed.
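A single measurement-update step of the standard probabilistic data association filter referred to above might be sketched as follows; the detection probability, gate probability, clutter density and the toy constant-velocity/position-measurement model are illustrative assumptions.

```python
# Minimal single-step PDAF update for a 2-D position measurement (sketch only).
import numpy as np

def pdaf_update(x, P, H, R, measurements, P_D=0.9, P_G=0.99, clutter_density=1e-4):
    z_pred = H @ x
    S = H @ P @ H.T + R                          # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    Si = np.linalg.inv(S)
    norm = 1.0 / np.sqrt(np.linalg.det(2 * np.pi * S))

    nus = [z - z_pred for z in measurements]     # innovations of gated measurements
    L = [P_D * norm * np.exp(-0.5 * nu @ Si @ nu) for nu in nus]
    b = clutter_density * (1 - P_D * P_G)        # weight of "no measurement is correct"
    denom = b + sum(L)
    beta0 = b / denom                            # prob. none is target-originated
    betas = [l / denom for l in L]               # association probabilities

    nu_comb = sum(be * nu for be, nu in zip(betas, nus))   # combined innovation
    x_new = x + K @ nu_comb
    P_c = P - K @ S @ K.T                                   # standard KF-updated covariance
    spread = sum(be * np.outer(nu, nu) for be, nu in zip(betas, nus)) - np.outer(nu_comb, nu_comb)
    P_new = beta0 * P + (1 - beta0) * P_c + K @ spread @ K.T
    return x_new, P_new

# Toy usage: constant-velocity state [px, py, vx, vy], two candidate measurements.
x = np.array([0.0, 0.0, 1.0, 1.0])
P = np.eye(4)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
R = 0.1 * np.eye(2)
zs = [np.array([0.2, -0.1]), np.array([1.5, 1.2])]
print(pdaf_update(x, P, H, R, zs)[0])
```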
|
4 |
Utility of Considering Multiple Alternative Rectifications in Data Cleaning. January 2013.
abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that it is in fact possible for the cleaning process to uniquely recover the clean version of each dirty data tuple. This is not possible in many cases, where the most a cleaning system can do is to generate a (hopefully small) set of clean candidates for each dirty tuple. When the cleaning system is required to output a deterministic database, it is forced to pick one clean candidate (say the "most likely" candidate) per tuple. Such an approach can lead to loss of information. For example, consider a situation where there are three equally likely clean candidates for a dirty tuple: choosing any single one of them discards two equally plausible repairs. An appealing alternative that avoids such information loss is to abandon the requirement that the output database be deterministic. In other words, even though the input (dirty) database is deterministic, I allow the reconstructed database to be probabilistic. Although such an approach does avoid the information loss, it also brings forth several challenges. For example, how many alternatives should be kept per tuple in the reconstructed database? Maintaining too many alternatives increases the size of the reconstructed database, and hence the query processing time. Second, while processing queries on the probabilistic database may well increase recall, how does it affect the precision of query processing? In this thesis, I investigate these questions. My investigation is done in the context of a data cleaning system called BayesWipe that has the capability of producing multiple clean candidates for each dirty tuple, along with the probability that each is the correct cleaned version. I represent these alternatives as tuples in a tuple-disjoint probabilistic database and use the Mystiq system to process queries on it. This probabilistic reconstruction (called BayesWipe-PDB) is compared to a deterministic reconstruction (called BayesWipe-DET), in which the most likely clean candidate for each tuple is chosen and the rest of the alternatives discarded. / Dissertation/Thesis / M.S. Computer Science 2013
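The idea of keeping several clean candidates per dirty tuple and querying them under tuple-disjoint semantics can be sketched as below; the schema, candidate values and probabilities are invented for illustration, and this is not the BayesWipe or Mystiq code.

```python
# Sketch of a tuple-disjoint probabilistic relation and a simple selection query.
from collections import defaultdict

# Each dirty tuple (keyed by an id) maps to alternative clean versions with
# probabilities that sum to at most 1 (alternatives are mutually exclusive).
reconstructed = {
    "t1": [({"make": "Honda", "model": "Civic"}, 0.6),
           ({"make": "Honda", "model": "Accord"}, 0.4)],
    "t2": [({"make": "Toyota", "model": "Corolla"}, 1.0)],
}

def select(prob_relation, predicate):
    """Per base tuple, the probability that some alternative satisfies the predicate."""
    answer = defaultdict(float)
    for tid, alternatives in prob_relation.items():
        for tup, p in alternatives:
            if predicate(tup):
                answer[tid] += p
    return dict(answer)

# Probability that each base tuple really describes a Honda.
print(select(reconstructed, lambda t: t["make"] == "Honda"))   # {'t1': 1.0}
```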
|
5 |
A Methodology for the Development of a Production Experience Database for Earthmoving Operations Using Automated Data Collection. Kannan, Govindan. 26 June 1999.
Automated data acquisition has revolutionized the reliability of product design in recent years. A noteworthy example is the improvement in aircraft design achieved through field data. This research proposes a similar improvement in the reliability of process design for earthmoving operations through automated field data acquisition. The segment of earthmoving operations addressed in this research is the truck-loader operation. The applicability of this research therefore extends to other industries involving truck operations, such as mining, agriculture and forest logging, and is closely related to wheel-based earthmoving operations such as scrapers.
The context of this research is defined by the data collection needed to increase the validity of results obtained from analysis tools such as simulation, performance measures, graphical representation of the variance in an activity's performance, and the relation between operating conditions and that variance. Automated cycle-time data collection is facilitated by instrumented trucks, and the collection of information on operating conditions is facilitated by an image database and paper forms. The cycle-time data and the information on operating conditions are linked together to form the experience database.
This research developed methods to extract, quantify and understand the variation in each component of the earthmoving cycle, namely the load, haul and return, and dump activities. For the load activity, the simultaneous variation in payload and load time is illustrated through the development of a PLT (PayLoad Time) map. Among the operating conditions, material type, load-area floor, space constraints and shift are investigated. A dynamic normalization process that determines the ratio of actual travel time to expected travel time is developed for the haul and return activities. The length of the haul road, the sequence of gear downshifts and the shift are investigated for their effect on travel time. The discussion of the dump activity is presented in qualitative form due to the lack of data.
Each component is integrated within the framework of the experience database. The implementation aspects with respect to developing and using the experience database are also described in detail. The practical relevance of this study is highlighted using an example. / Ph. D.
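A rough sketch of the normalization idea described above (the ratio of actual to expected travel time) and of the payload/load-time pairs behind a PLT map; the expected-time model and the sample cycle records are illustrative assumptions.

```python
# Sketch: normalized travel times and PLT-style (payload, load time) pairs.
records = [
    # (cycle id, payload t, load time s, haul distance m, actual haul time s)
    ("c1", 92.0, 110.0, 1200.0, 310.0),
    ("c2", 88.5, 95.0, 1200.0, 290.0),
    ("c3", 97.2, 130.0, 1500.0, 420.0),
]

def expected_haul_time(distance_m, avg_speed_mps=4.5):
    # Placeholder expected-time model; in practice this would come from truck
    # performance curves and the haul-road profile.
    return distance_m / avg_speed_mps

plt_map = [(payload, load_t) for _, payload, load_t, _, _ in records]
normalized = {cid: actual / expected_haul_time(dist)
              for cid, _, _, dist, actual in records}
print(plt_map)
print(normalized)   # ratios > 1 indicate slower-than-expected hauls
```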
|
6 |
B-Spline Based Multitarget Tracking. Sithiravel, Rajiv. January 2014.
Multitarget tracking in the presence of false alarms is a difficult problem. The objective of multitarget tracking is to estimate the number of targets and their states recursively from the available observations. At any given time, targets can appear, disappear, or spawn from already existing targets. Sensors detect these targets subject to a detection threshold, and the observations are normally contaminated by false alarms. Moreover, targets with a low signal-to-noise ratio (SNR) may not be detected at all.
Random Finite Set (RFS) filters can be used to solve such multitarget tracking problems efficiently. In particular, one of the most widely used RFS-based filters is the Probability Hypothesis Density (PHD) filter. The PHD filter approximates the posterior probability density function (PDF) by its first-order moment only, under the assumption that the target SNR is relatively high. The PHD filter handles target birth, death, spawning and missed detections through well-known implementations, including the Sequential Monte Carlo PHD (SMC-PHD) and Gaussian Mixture PHD (GM-PHD) methods. The SMC-PHD filter suffers from the well-known degeneracy problem, while the GM-PHD filter may not be suitable for nonlinear and non-Gaussian target tracking problems.
It is desirable to have a filter that can provide continuous estimates for any distribution, which is the motivation for the use of B-splines in this thesis. One of the main focuses of the thesis is the B-spline based PHD (SPHD) filter. Spline theory is well developed and has been used in academia and industry for more than five decades. B-splines can represent numerical, geometrical and statistical functions and models, including the PDF and the PHD. The SPHD filter can be applied to linear, nonlinear, Gaussian and non-Gaussian multitarget tracking applications. Continuity of the SPHD is maintained by selecting splines of order three or higher, which avoids degeneracy-related problems. Another important characteristic of the SPHD filter is that the SPHD can be controlled locally, which allows manipulation of the SPHD and gives it a natural ability to handle nonlinear problems. The SPHD filter can be further extended to support maneuvering multitarget tracking, where it can serve as an alternative to any available PHD filter implementation.
The PHD filter does not work well for very low observable (VLO) target tracking problems, where the target SNR is normally very low. For very low SNR scenarios the PDF must be approximated by higher-order moments, so the PHD implementations may not be suitable for the problem considered in this thesis. One of the best estimators for VLO target tracking is the Maximum Likelihood Probabilistic Data Association (ML-PDA) algorithm. The standard ML-PDA algorithm is widely used for single-target track initialization and geolocation problems with high false alarm rates. B-splines are also used in an ML-PDA implementation (SML-PDA). The SML-PDA algorithm can determine the global maximum of the ML-PDA log-likelihood ratio with high efficiency in terms of state estimates and low computational complexity. For fast passive track initialization and search-and-rescue operations, the SML-PDA algorithm can be used more efficiently than the standard ML-PDA algorithm. With an extension, the SML-PDA algorithm also supports multitarget tracking. / Thesis / Doctor of Philosophy (PhD)
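As a toy illustration of representing a density-like function with B-splines, the sketch below encodes a two-target intensity as a cubic B-spline, integrates it to obtain an expected target count, and reads off crude state estimates from its peaks; the intensity, grid and SciPy routines used are assumptions for illustration, not the SPHD filter itself.

```python
# Toy sketch: a 1-D PHD-like intensity represented by a cubic B-spline.
import numpy as np
from scipy.interpolate import make_interp_spline
from scipy.stats import norm

x = np.linspace(0.0, 10.0, 41)                       # sites over the state space
intensity = norm.pdf(x, loc=3.0, scale=0.5) + norm.pdf(x, loc=7.0, scale=0.7)

spline = make_interp_spline(x, intensity, k=3)       # cubic B-spline representation
expected_targets = spline.integrate(0.0, 10.0)       # total mass of the intensity
print(round(float(expected_targets), 2))             # close to 2 for this example

# Crude state estimates: local maxima of the spline on a fine grid.
fine = np.linspace(0.0, 10.0, 1001)
vals = spline(fine)
peaks = fine[1:-1][(vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:])]
print(peaks)                                         # near 3.0 and 7.0
```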
|
7 |
Managing large-scale scientific hypotheses as uncertain and probabilistic data / Gerência de hipóteses científicas de larga-escala como dados incertos e probabilísticos. Gonçalves, Bernardo Nunes. 28 January 2015.
Conselho Nacional de Desenvolvimento Científico e Tecnológico / Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro
In view of the paradigm shift that makes science ever more data-driven, in this thesis we propose a synthesis method for encoding and managing large-scale deterministic scientific hypotheses as uncertain and probabilistic data.
In the form of mathematical equations, hypotheses symmetrically relate aspects of the studied phenomena. For computing predictions, however, deterministic hypotheses can be abstracted as functions. We build upon Simon's notion of structural equations in order to efficiently extract the (so-called) causal ordering between variables, implicit in a hypothesis structure (set of mathematical equations).
We show how to process the hypothesis predictive structure effectively through original algorithms that encode it into a set of functional dependencies (fd's) and then perform causal reasoning in terms of acyclic pseudo-transitive reasoning over fd's. Such reasoning reveals important causal dependencies implicit in the hypothesis predictive data and guides our synthesis of a probabilistic database. As in the field of graphical models in AI, such a probabilistic database should be normalized so that the uncertainty arising from competing hypotheses is decomposed into factors and propagated properly onto the predictive data by recovering its joint probability distribution through a lossless join. This is motivated as a design-theoretic principle for data-driven hypothesis management and predictive analytics.
The method is applicable to both quantitative and qualitative deterministic hypotheses and is demonstrated in realistic use cases from computational science.
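A much-simplified sketch of the encoding step: given equations that are already solved for one output variable each (the matching that Simon's causal ordering actually computes is assumed given here), derive functional dependencies and a causal ordering by topological sort; the example equations are invented for illustration.

```python
# Sketch: structural equations -> functional dependencies -> causal ordering.
from graphlib import TopologicalSorter

# Each entry: output variable <- set of input variables (one per equation).
equations = {
    "flow":        {"pressure", "resistance"},
    "pressure":    {"pump_speed"},
    "temperature": {"flow", "ambient"},
}

# Encode as functional dependencies X -> y.
fds = [(sorted(inputs), out) for out, inputs in equations.items()]
print(fds)

# Causal ordering = topological order of the variable dependency graph.
graph = {out: inputs for out, inputs in equations.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)   # exogenous variables first, e.g. pump_speed before pressure before flow
```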
|
8 |
Optimal Active Learning: experimental factors and membership query learning. Yu-hui Yeh. Unknown Date.
The field of Machine Learning is concerned with the development of algorithms, models and techniques that solve challenging computational problems by learning from data representative of the problem (e.g. given a set of medical images previously classified by a human expert, build a model to predict unseen images as either benign or malignant). Many important real-world problems have been formulated as supervised learning problems. The assumption is that a data set is available containing the correct output (e.g. class label or target value) for each given data point. In many application domains, obtaining the correct outputs (labels) for data points is a costly and time-consuming task. This has provided the motivation for the development of Machine Learning techniques that attempt to minimize the number of labeled data points while maintaining good generalization performance on a given problem. Active Learning is one such class of techniques and is the focus of this thesis. Active Learning algorithms select or generate unlabeled data points to be labeled and use these points for learning. If successful, an Active Learning algorithm should be able to produce learning performance (e.g. test set error) comparable to an equivalent supervised learner using fewer labeled data points. Theoretical, algorithmic and experimental Active Learning research has been conducted and a number of successful applications have been demonstrated. However, the scope of many of the experimental studies on Active Learning has been relatively small and there are very few large-scale experimental evaluations of Active Learning techniques. A significant amount of performance variability exists across Active Learning experimental results in the literature. Furthermore, the implementation details and effects of experimental factors have not been closely examined in empirical Active Learning research, creating some doubt over the strength and generality of conclusions that can be drawn from such results. The Active Learning model/system used in this thesis is the Optimal Active Learning algorithm framework with Gaussian Processes for regression problems (however, most of the research questions are of general interest in many other Active Learning scenarios). Experimental and implementation details of the Active Learning system used are described in detail, using a number of regression problems and datasets of different types. It is shown that the experimental results of the system are subject to significant variability across problem datasets. The hypothesis that experimental factors can account for this variability is then investigated. The results show the impact of sampling and sizes of the datasets used when generating experimental results. Furthermore, preliminary experimental results expose performance variability across various real-world regression problems. The results suggest that these experimental factors can (to a large extent) account for the variability observed in experimental results. A novel resampling technique for Optimal Active Learning, called '3-Sets Cross-Validation', is proposed as a practical solution to reduce experimental performance variability. Further results confirm the usefulness of the technique. The thesis then proposes an extension to the Optimal Active Learning framework, to perform learning via membership queries, using a novel algorithm named MQOAL. The MQOAL algorithm employs the Metropolis-Hastings Markov chain Monte Carlo (MCMC) method to sample data points for query selection.
Experimental results show that MQOAL provides comparable performance to the pool-based OAL learner, using a very generic, simple MCMC technique, and is robust to experimental factors related to the MCMC implementation. The possibility of making queries in batches is also explored experimentally, with results showing that while some performance degradation does occur, it is minimal for learning in small batch sizes, which is likely to be valuable in some real-world problem domains.
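A generic sketch of membership query generation with Metropolis-Hastings: candidate inputs are sampled in proportion to a utility surface and the best sample is issued as the query; the utility function, proposal width and one-dimensional input space are illustrative assumptions, not the MQOAL criterion itself.

```python
# Sketch: Metropolis-Hastings sampling of candidate inputs for a membership query.
import numpy as np

rng = np.random.default_rng(1)

def utility(x):
    # Stand-in for an active-learning utility: high far from "observed" inputs.
    observed = np.array([-1.0, 0.5, 2.0])
    return float(np.min((x - observed) ** 2)) + 1e-6

def mh_sample(n_steps=2000, step=0.5, x0=0.0):
    samples, x, u = [], x0, utility(x0)
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        u_prop = utility(x_prop)
        if rng.random() < min(1.0, u_prop / u):   # accept with ratio of utilities
            x, u = x_prop, u_prop
        samples.append(x)
    return np.array(samples)

samples = mh_sample()
query_x = samples[np.argmax([utility(s) for s in samples])]
print("membership query at x =", round(float(query_x), 3))
```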
|
9 |
Navigation And Control Studies On Cruise Missiles. Ekutekin, Vedat. 01 January 2007.
A cruise missile is a guided missile that uses a lifting wing and a jet propulsion system to allow sustained flight. Cruise missiles are, in essence, unmanned aircraft, and they are generally designed to carry a large conventional or nuclear warhead many hundreds of miles with excellent accuracy. In this study, navigation and control studies on cruise missiles are performed. Due to the variety and complexity of the subsystems of cruise missiles, the main concern is limited to the navigation system. The navigation system determines the position, velocity, attitude and time solutions of the missile; therefore, an accurate self-contained navigation system directly influences the success of the missile. In this study, modern radar data association algorithms are implemented as new Terrain Aided Navigation (TAN) algorithms that can be used with low-cost Inertial Measurement Units (IMUs). First, a thorough survey of the literature on mid-course navigation of cruise missiles is carried out. Then, modern radar data association algorithms and their application to TAN are studied with simple simulations. In the case study, a six-degree-of-freedom (6 DOF) flight simulation tool is developed, which includes the aerodynamic and dynamic model of the cruise missile as well as the error model of the navigation system. Finally, the performance of the designed navigation system with the implemented TAN algorithms is examined in detail with the help of the simulations.
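A toy grid-based sketch of the terrain-aided navigation idea: an INS-based position prior is corrected by comparing a measured ground clearance against a stored terrain profile; the terrain model, noise levels and prior are illustrative assumptions, not the algorithms developed in the thesis.

```python
# Toy 1-D terrain-aided navigation fix via a grid-based Bayesian update.
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 10_000.0, 1001)                 # candidate along-track positions (m)
terrain = 300 + 80 * np.sin(grid / 400.0) + 40 * np.sin(grid / 90.0)  # stored map elevation

true_pos, baro_alt = 6_200.0, 1_500.0                   # truth and barometric altitude
true_terrain = 300 + 80 * np.sin(true_pos / 400.0) + 40 * np.sin(true_pos / 90.0)
measured_clearance = baro_alt - true_terrain + 5.0 * rng.standard_normal()  # radar altimeter

prior = np.exp(-0.5 * ((grid - 6_000.0) / 500.0) ** 2)  # INS-based prior (drifted ~200 m)
sigma_meas = 8.0                                        # combined altimeter/map error std
likelihood = np.exp(-0.5 * ((baro_alt - terrain - measured_clearance) / sigma_meas) ** 2)

posterior = prior * likelihood
posterior /= posterior.sum()
fix = float(np.sum(grid * posterior))                   # posterior-mean position fix
print("position fix:", round(fix, 1), "m (truth 6200 m)")
```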
|