Spelling suggestions: "subject:"nearest neighbor"" "subject:"nearest weighbor""
1 |
Classification via distance profile nearest neighborsMoraski, Ashley M. 04 May 2006 (has links)
Most classification rules can be expressed in terms of a distance (or dissimilarity) from the point to be classified to each of the candidate classes. For example, linear discriminant analysis classifies points into the class for which the (sample) Mahalanobis distance is smallest. However, dependence among these point-to-group distance measures is generally ignored. The primary goal of this project is to investigate the properties of a general non-parametric classification rule which takes this dependence structure into account. A review of classification procedures and applications is presented. The distance profile nearest-neighbor classification rule is defined. Properties of the rule are then explored via application to both real and simulated data and comparisons to other classification rules are discussed.
|
2 |
Mutual k Nearest Neighbor based ClassifierGupta, Nidhi January 2010 (has links)
No description available.
|
3 |
Evaluating nearest neighbor queries over uncertain databasesXie, Xike., 谢希科. January 2012 (has links)
Nearest Neighbor (NN in short) queries are important in emerging applications,
such as wireless networks, location-based services, and data stream applications,
where the data obtained are often imprecise. The imprecision or imperfection of
the data sources is modeled by uncertain data in recent research works. Handling
uncertainty is important because this issue affects the quality of query answers.
Although queries on uncertain data are useful, evaluating the queries on them can
be costly, in terms of I/O or computational efficiency. In this thesis, we study how
to efficiently evaluate NN queries on uncertain data.
Given a query point q and a set of uncertain objects O, the possible nearest neighbor query returns a set of candidates which have non-zero probabilities to be the
query answer. It is also interesting to ask \which region has the same set of possible nearest neighbors", and \which region has one specific object as its possible
nearest neighbor". To reveal the relationship between the query space and nearest
neighbor answers, we propose the UV-diagram, where the query space is split into
disjoint partitions, such that each partition is associated with a set of objects. If a
query point is located inside the partition, its possible nearest neighbors could be
directly retrieved. However, the number of such partitions is exponential and the
construction effort can be expensive. To tackle this problem, we propose an alternative concept, called UV-cell, and efficient algorithms for constructing it. The UV-cell has an irregular shape, which incurs difficulties in storage, maintenance,
and query evaluation. We design an index structure, called UV-index, which is
an approximated version of the UV-diagram. Extensive experiments show that
the UV-index could efficiently answer different variants of NN queries, such as
Probabilistic Nearest Neighbor Queries, Continuous Probabilistic Nearest Neighbor
Queries.
Another problem studied in this thesis is the trajectory nearest neighbor query.
Here the query point is restricted to a pre-known trajectory. In applications (e.g.
monitoring potential threats along a flight/vessel's trajectory), it is useful to derive
nearest neighbors for all points on the query trajectory. Simple solutions, such as
sampling or approximating the locations of uncertain objects as points, fails to
achieve a good query quality. To handle this problem, we design efficient algorithms
and optimization methods for this query. Experiments show that our solution can
efficiently and accurately answer this query. Our solution is also scalable to large
datasets and long trajectories. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
4 |
Non-Parametric Learning for Energy DisaggregationKhan, Mohammad Mahmudur Rahman 10 August 2018 (has links)
This thesis work presents a non-parametric learning method, the Extended Nearest Neighbor (ENN) algorithm, as a tool for data disaggregation in smart grids. The ENN algorithm makes the prediction according to the maximum gain of intra-class coherence. This algorithm not only considers the K nearest neighbors of the test sample but also considers whether these K data points consider the test sample as their nearest neighbor or not. So far, ENN has shown noticeable improvement in the classification accuracy for various real-life applications. To further enhance its prediction capability, in this thesis we propose to incorporate a metric learning algorithm, namely the Large Margin Nearest Neighbor (LMNN) algorithm, as a training stage in ENN. Our experiments on real-life energy data sets have shown significant performance improvement compared to several other traditional classification algorithms, including the classic KNN method and Support Vector Machines.
|
5 |
Geometric Computing over Uncertain DataZhang, Wuzhou January 2015 (has links)
<p>Entering the era of big data, human beings are faced with an unprecedented amount of geometric data today. Many computational challenges arise in processing the new deluge of geometric data. A critical one is data uncertainty: the data is inherently noisy and inaccuracy, and often lacks of completeness. The past few decades have witnessed the influence of geometric algorithms in various fields including GIS, spatial databases, and computer vision, etc. Yet most of the existing geometric algorithms are built on the assumption of the data being precise and are incapable of properly handling data in the presence of uncertainty. This thesis explores a few algorithmic challenges in what we call geometric computing over uncertain data.</p><p>We study the nearest-neighbor searching problem, which returns the nearest neighbor of a query point in a set of points, in a probabilistic framework. This thesis investigates two different nearest-neighbor formulations: expected nearest neighbor (ENN), where we consider the expected distance between each input point and a query point, and probabilistic nearest neighbor (PNN), where we estimate the probability of each input point being the nearest neighbor of a query point.</p><p>For the ENN problem, we consider a probabilistic framework in which the location of each input point and/or query point is specified as a probability density function and the goal is to return the point that minimizes the expected distance. We present methods for computing an exact ENN or an \\eps-approximate ENN, for a given error parameter 0 < \\eps < 1, under different distance functions. These methods build an index of near-linear size and answer ENN queries in polylogarithmic or sublinear time, depending on the underlying function. As far as we know, these are the first nontrivial methods for answering exact or \\eps-approximate ENN queries with provable performance guarantees. Moreover, we extend our results to answer exact or \\eps-approximate k-ENN queries. Notably, when only the query points are uncertain, we obtain state-of-the-art results for top-k aggregate (group) nearest-neighbor queries in the L1 metric using the weighted SUM operator.</p><p>For the PNN problem, we consider a probabilistic framework in which the location of each input point is specified as a probability distribution function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability; (ii) estimating, within a specified additive error, the probability of a point being the nearest neighbor of a query point; (iii) using it to return the point that maximizes the probability being the nearest neighbor, or all the points with probabilities greater than some threshold to be the nearest neighbor. We also present some experimental results to demonstrate the effectiveness of our approach.</p><p>We study the convex-hull problem, which asks for the smallest convex set that contains a given point set, in a probabilistic setting. In our framework, the uncertainty of each input point is described by a probability distribution over a finite number of possible locations including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of \\beta-hull that may be a useful representation of uncertain hulls.</p><p>We study contour trees of terrains, which encode the topological changes of the level set of the height value \\ell as we raise \\ell from -\\infty to +\\infty on the terrains, in a probabilistic setting. We consider a terrain that is defined by linearly interpolating each triangle of a triangulation. In our framework, the uncertainty lies in the height of each vertex in the triangulation, and we assume that it is described by a probability distribution. We first show that the probability of a vertex being a critical point, and the expected number of nodes (resp. edges) of the contour tree, can be computed exactly efficiently. Then we present efficient sampling-based methods for estimating, with high probability, (i) the probability that two points lie on an edge of the contour tree, within additive error; (ii) the expected distance of two points p, q and the probability that the distance of p, q is at least \\ell on the contour tree, within additive error and/or relative error, where the distance of p, q on a contour tree is defined to be the difference between the maximum height and the minimum height on the unique path from p to q on the contour tree.</p> / Dissertation
|
6 |
APPROXIMATE N-NEAREST NEIGHBOR CLUSTERING ON DISTRIBUTED DATABASES USING ITERATIVE REFINEMENTCALENDER, CHRISTOPHER R. 06 October 2004 (has links)
No description available.
|
7 |
Density Based Clustering using Mutual K-Nearest NeighborsDixit, Siddharth January 2015 (has links)
No description available.
|
8 |
Karst Database Implementation in Minnesota: Analysis of Sinkhole DistributionGao, Y., Alexander, E. C., Barnes, R. J. 01 May 2005 (has links)
This paper presents the overall sinkhole distributions and conducts hypothesis tests of sinkhole distributions and sinkhole formation using data stored in the Karst Feature Database (KFD) of Minnesota. Nearest neighbor analysis (NNA) was extended to include different orders of NNA, different scales of concentrated zones of sinkholes, and directions to the nearest sinkholes. The statistical results, along with the sinkhole density distribution, indicate that sinkholes tend to form in highly concentrated zones instead of scattered individuals. The pattern changes from clustered to random to regular as the scale of the analysis decreases from 10-100 km2 to 5-30 km 2 to 2-10 km2. Hypotheses that may explain this phenomenon are: (1) areas in the highly concentrated zones of sinkholes have similar geologic and topographical settings that favor sinkhole formation; (2) existing sinkholes change the hydraulic gradient in the surrounding area and increase the solution and erosional processes that eventually form more new sinkholes.
|
9 |
Advanced query processing on spatial networksYiu, Man-lung., 姚文龍. January 2006 (has links)
published_or_final_version / abstract / Computer Science / Doctoral / Doctor of Philosophy
|
10 |
On the Neutralome of Great Apes and Nearest Neighbor Search in Metric SpacesWoerner, August Eric, Woerner, August Eric January 2016 (has links)
Problems of population genetics are magnified by problems of big data. My dissertation spans the disciplines of computer science and population genetics, leveraging computational approaches to biological problems to address issues in genomics research. In this dissertation I develop more efficient metric search algorithms. I also show that vast majority of the genomes of great apes are impacted by the forces of natural selection. Finally, I introduce a heuristic to identify neutralomes—regions that are evolving with minimal selective pressures—and use these neutralomes for inferences on effective population size in great apes. We begin with a formal and far-reaching problem that impacts a broad array of disciplines including biology and computer science; the 𝑘-nearest neighbors problem in generalized metric spaces. The 𝑘-nearest neighbors (𝑘-NN) problem is deceptively simple. The problem is as follows: given a query q and dataset D of size 𝑛, find the 𝑘-closest points to q. This problem can be easily solved by algorithms that compute 𝑘th order statistics in O(𝑛) time and space. It follows that if D can be ordered, then it is perhaps possible to solve 𝑘-NN queries in sublinear time. While this is not possible for an arbitrary distance function on the points in D, I show that if the points are constrained by the triangle inequality (such as with metric spaces), then the dataset can be properly organized into a dispersion tree (Appendix A). Dispersion trees are a hierarchical data structure that is built around a large dispersed set of points. Dispersion trees have construction times that are sub-quadratic (O(𝑛¹·⁵ log 𝑛)) and use O(𝑛) space, and they use a provably optimal search strategy that minimizes the number of times the distance function is invoked. While all metric data structures have worst-case O(𝑛) search times, dispersion trees have average-case search times that are substantially faster than a large sampling of comparable data structures in the vast majority of spaces sampled. Exceptions to this include extremely high dimensional space (d>20) which devolve into near-linear scans of the dataset, and unstructured low-dimensional (d<6) Euclidean spaces. Dispersion trees have empirical search times that appear to scale as O(𝑛ᶜ) for 0<c<1. As solutions to the 𝑘-NN problem are in general too slow to be used effectively in the arena of big data in genomics, it is my hope that dispersion trees may help lift this barrier. With source-code that is freely available for academic use, dispersion trees may be useful for nearest neighbor classification problems in machine learning, fast read-mapping against a reference genome, and as a general computational tool for problems such clustering. Next, I turn to problems in population genomics. Genomic patterns of diversity are a complex function of the interplay between demographics, natural selection and mechanistic forces. A central tenet of population genetics is the neutral theory of molecular evolution which states the vast majority of changes at the molecular level are (relatively) selectively neutral; that is, they do not effect fitness. A corollary of the neutral theory is that the frequency of most alleles in populations are dictated by neutral processes and not selective processes. The forces of natural selection impact not just the site of selection, but linked neutral sites as well. I proposed an empirical assessment of the extents of linked selection in the human genome (Appendix B). Recombination decouples sites of selection from the genomic background, thus it serves to mitigate the effects of linked selection. I use two metrics on recombination, both the minimum genetic distance to genes and local rates of recombination, to parse the effects of linked selection into selection from genic and nongenic sources in the human genome. My empirical assessment shows profound linked selective effects from nongenic sources, with these effects being greater than that of genic sources on the autosomes, as well as generally greater effects on the X chromosome than on the autosomes. I quantify these trends using multiple linear regression, and then I model the effects of linked selection to conserved elements across the whole of the genome. Places predicted to be neutral by my model do not, unlike the vast majority of the genome, show these linked selective effects. This demonstrates that linkage to these regulatory elements, and not some other mechanistic force, accounts for our findings. Further, neutrally evolving regions are extremely rare (~1%) in the genome, and despite generally larger linked selective effects on the X chromosome, the size of this “neutralome” is proportionally larger on the X chromosome than on the autosomes. To account for this and to extend my findings to other great apes I improve on my procedure to find neutralomes, and apply this procedure to the genome of humans, Nigerian chimpanzees, bonobos, and western lowland gorillas (Appendix C). In doing so I show that like humans, these other apes are also enormously impacted by linked selection, with their neutralomes being substantially smaller than the neutralomes of humans. I then use my genomic predictions on neutrality to see how the landscape of linked selection changes across the X chromosome and the autosomes in regions close to, and far from, genes. While I had previously demonstrated the linked selective forces near genes are stronger on the X chromosome than on the autosomes in these taxa, I show that regions far from genes show the opposite; regions far from genes show more selection from noncoding targets on the autosomes than on the X chromosome. This finding is replicated across our great ape samples. Further, inferences on the relative effective population size of the X chromosome and the autosomes both near and far from genes can be biased as a result.
|
Page generated in 0.053 seconds