11 |
[en] LSHSIM: A LOCALITY SENSITIVE HASHING BASED METHOD FOR MULTIPLE-POINT GEOSTATISTICS / [pt] LSHSIM: UM MÉTODO DE GEOESTATÍSTICA MULTIPONTO BASEADO EM LOCALITY SENSITIVE HASHING
Moura, Pedro Nuno de Souza 14 November 2017 (has links)
[pt] Reservoir modeling is a highly relevant task, as it allows a given geological region of interest to be represented. Given the uncertainty involved in the process, one wishes to generate a large number of possible scenarios in order to determine the one that best represents that region. There is therefore a strong demand for generating each simulation quickly. Since its origin, several methodologies have been proposed for this purpose and, over the last two decades, Multiple-Point Geostatistics (MPS) has become the dominant one. This methodology is strongly based on the concept of a training image (TI) and on the use of its characteristics, which are called patterns. In the present work, a new MPS method is proposed that combines two key concepts: the technique known as Locality Sensitive Hashing (LSH), which accelerates the search for patterns similar to a given target, and the Run-Length Encoding (RLE) compression technique, used to speed up the computation of the Hamming similarity. Experiments with both categorical and continuous training images showed that LSHSIM is computationally efficient and produces good-quality realizations, while generating an uncertainty space of reasonable size. In particular, for categorical data, the results suggest that LSHSIM is faster than MS-CCSIM, one of the state-of-the-art methods.
/ [en] Reservoir modeling is a very important task that permits the representation of a geological region of interest. Given the uncertainty involved in the process, one wants to generate a considerable number of possible scenarios so as to find those which best represent this region. There is thus a strong demand for quickly generating each simulation. Since its inception, many methodologies have been proposed for this purpose and, in the last two decades, multiple-point geostatistics (MPS) has been the dominant one. This methodology is strongly based on the concept of a training image (TI) and the use of its characteristics, which are called patterns. In this work, we propose a new MPS method that combines the application of a technique called Locality Sensitive Hashing (LSH), which accelerates the search for patterns similar to a target one, with a Run-Length Encoding (RLE) compression technique that speeds up the calculation of the Hamming similarity. We have performed experiments with both categorical and continuous images which showed that LSHSIM is computationally efficient and produces good-quality realizations, while generating an uncertainty space of reasonable size. In particular, for categorical data, the results suggest that LSHSIM is faster than MS-CCSIM, one of the state-of-the-art methods.
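To make the two ingredients named in the abstract concrete, the sketch below buckets categorical training-image patterns with a Hamming-space (bit-sampling) LSH and scores the candidates retrieved from the target's bucket with a Hamming similarity computed directly on run-length-encoded patterns. This is only an illustrative sketch, not the thesis's LSHSIM algorithm: the pattern values, pattern length, number of sampled positions, and the choice of the bit-sampling family are assumptions made for the example.

```python
import random

def rle_encode(cells):
    """Run-length encode a sequence of categorical cells as [value, run_length] pairs."""
    runs = []
    for v in cells:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def hamming_similarity_rle(runs_a, runs_b):
    """Count matching positions by walking the two run lists in parallel,
    consuming min(remaining run lengths) per step instead of one cell at a time."""
    i = j = 0
    rem_a, rem_b = runs_a[0][1], runs_b[0][1]
    matches = 0
    while i < len(runs_a) and j < len(runs_b):
        step = min(rem_a, rem_b)
        if runs_a[i][0] == runs_b[j][0]:
            matches += step
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = runs_a[i][1] if i < len(runs_a) else 0
        if rem_b == 0:
            j += 1
            rem_b = runs_b[j][1] if j < len(runs_b) else 0
    return matches

def lsh_key(pattern, sampled_positions):
    """Bit-sampling LSH for Hamming space: the bucket key is the pattern
    restricted to a few randomly chosen positions."""
    return tuple(pattern[p] for p in sampled_positions)

random.seed(0)
# Toy training-image patterns (flattened categorical templates, values = facies codes).
patterns = [[0, 0, 1, 1, 1, 0, 2, 2], [0, 1, 1, 1, 0, 0, 2, 2], [2, 2, 2, 0, 0, 1, 1, 1]]
positions = random.sample(range(8), 3)
buckets = {}
for p in patterns:
    buckets.setdefault(lsh_key(p, positions), []).append(p)

# Probe with a target pattern: only patterns falling in the same bucket are scored.
target = [0, 0, 1, 1, 0, 0, 2, 2]
candidates = buckets.get(lsh_key(target, positions), [])  # may be empty for an unlucky sample
target_rle = rle_encode(target)
best = max(candidates, key=lambda c: hamming_similarity_rle(rle_encode(c), target_rle), default=None)
print(best)
```

Walking the two run lists in parallel lets long constant runs, which are common in categorical facies models, be compared in a single step instead of cell by cell.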
|
12 |
Mining Parallel Corpora from the Web
Kúdela, Jakub January 2016 (has links)
Title: Mining Parallel Corpora from the Web
Author: Bc. Jakub Kúdela
Author's e-mail address: jakub.kudela@gmail.com
Department: Department of Software Engineering
Thesis supervisor: Doc. RNDr. Irena Holubová, Ph.D.
Supervisor's e-mail address: holubova@ksi.mff.cuni.cz
Thesis consultant: RNDr. Ondřej Bojar, Ph.D.
Consultant's e-mail address: bojar@ufal.mff.cuni.cz
Abstract: Statistical machine translation (SMT) is one of the most popular approaches to machine translation today. It uses statistical models whose parameters are derived from the analysis of a parallel corpus required for the training. The existence of a parallel corpus is the most important prerequisite for building an effective SMT system. Various properties of the corpus, such as its volume and quality, highly affect the results of the translation. The web can be considered an ever-growing source of considerable amounts of parallel data to be mined and included in the training process, thus increasing the effectiveness of SMT systems. The first part of this thesis summarizes some of the popular methods for acquiring parallel corpora from the web. Most of these methods search for pairs of parallel web pages by looking for the similarity of their structures. However, we believe there still exists a non-negligible amount of parallel...
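As a rough illustration of the structural-similarity idea mentioned in the abstract (not the method developed in the thesis), the snippet below reduces two pages to their sequences of opening HTML tag names and compares those sequences; the example pages and the use of difflib's SequenceMatcher are assumptions made purely for the sketch.

```python
import re
from difflib import SequenceMatcher

def tag_sequence(html):
    """Reduce a page to the sequence of its opening HTML tag names, ignoring all text."""
    return re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)

def structural_similarity(html_a, html_b):
    """Similarity (0..1) of two pages' tag sequences; parallel translations
    of the same page tend to score close to 1.0."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()

page_en = "<html><body><h1>News</h1><p>Hello world.</p><p>More text.</p></body></html>"
page_cs = "<html><body><h1>Zprávy</h1><p>Ahoj světe.</p><p>Další text.</p></body></html>"
print(structural_similarity(page_en, page_cs))  # 1.0 here: identical tag structure
```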
|
13 |
Scaling Analytics via Approximate and Distributed Computing
Chakrabarti, Aniket 12 December 2017 (has links)
No description available.
|
14 |
Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data
Carraher, Lee A. 22 May 2018 (has links)
No description available.
|
15 |
A Parallel Algorithm for Query Adaptive, Locality Sensitive Hash Search
Carraher, Lee A. 17 September 2012 (has links)
No description available.
|
16 |
REGION-BASED GEOMETRIC ACTIVE CONTOUR FOR CLASSIFICATION USING HYPERSPECTRAL REMOTE SENSING IMAGES
Yan, Lin 20 October 2011 (has links)
No description available.
|
17 |
Tackling the current limitations of bacterial taxonomy with genome-based classification and identification on a crowdsourcing Web service
Tian, Long 25 October 2019 (has links)
Bacterial taxonomy is the science of classifying, naming, and identifying bacteria. The scope and practice of taxonomy have evolved through history with our understanding of life and our growing and changing needs in research, medicine, and industry. As in animal and plant taxonomy, the species is the fundamental unit of taxonomy, but the genetic and phenotypic diversity that exists within a single bacterial species is substantially higher compared to animal or plant species. Therefore, the current "type"-centered classification scheme that describes a species based on a single type strain is not sufficient to classify bacterial diversity, in particular in regard to human, animal, and plant pathogens, for which it is necessary to trace disease outbreaks back to their source. Here we discuss the current needs and limitations of classic bacterial taxonomy and introduce LINbase, a Web service that not only implements current species-based bacterial taxonomy but also addresses its limitations by providing a new framework for genome sequence-based classification and identification independently of the type-centric species. LINbase uses a sequence similarity-based framework to cluster bacteria into hierarchical taxa, which we call LINgroups, at multiple levels of relatedness and crowdsources users' expertise by encouraging them to circumscribe these groups as taxa from the genus level to the intraspecies level. Circumscribing a group of bacteria as a LINgroup, adding a phenotypic description, and giving the LINgroup a name using the LINbase Web interface allows users to instantly share new taxa and complements the lengthy and laborious process of publishing a named species. Furthermore, unknown isolates can be identified immediately as members of a newly described LINgroup with fast and precise algorithms based on their genome sequences, allowing species- and intraspecies-level identification. The employed algorithms are based on a combination of the alignment-based algorithm BLASTN and the alignment-free method Sourmash, which is based on k-mers and the MinHash algorithm. The potential of LINbase is shown using examples of plant pathogenic bacteria. / Doctor of Philosophy / Life is always easier when people talk to each other in the same language. Taxonomy is the language that biologists use to communicate about life by 1. classifying organisms into groups, 2. giving names to these groups, and 3. identifying individuals as members of these named groups. When most scientists and the general public think of taxonomy, they think of the hierarchical structure of "Life", "Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus" and "Species". However, the basic goal of taxonomy is to allow the identification of an organism as a member of a group that is predictive of its characteristics and to provide a name to communicate about that group with other scientists and the public. In the world of micro-organisms, taxonomy is extremely important since there are an estimated 10,000,000 to 1,000,000,000 different bacterial species. Moreover, microbiologists and pathologists need to consider differences among bacterial isolates even within the same species, a level that the current taxonomic system does not even cover. Therefore, we developed a Web service, LINbase, which uses genome sequences to classify individual microbial isolates.
The database at the backend of LINbase assigns Life Identification Numbers (LINs) that express how individual microbial isolates are related to each other above, at, and below the species level. The LINbase Web service is designed to be an interactive, web-based encyclopedia of micro-organisms where users can share everything they know about micro-organisms, be it individual isolates or groups of isolates, for professional and scientific purposes. To build LINbase, efficient computer programs were developed and implemented. To show how LINbase can be used, several groups of bacteria that cause plant diseases were classified and described.
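The identification algorithms described above combine BLASTN with Sourmash, which sketches genome k-mer sets with MinHash. The snippet below shows only a generic MinHash estimate of k-mer Jaccard similarity between two sequences; it is not LINbase's or Sourmash's implementation, and the k-mer length, number of hash functions, and toy sequences are illustrative assumptions.

```python
import hashlib

def kmers(seq, k=21):
    """Set of all overlapping k-mers of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, num_hashes=64):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value observed over the k-mer set."""
    sig = []
    for salt in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{salt}:{km}".encode()).hexdigest(), 16)
                       for km in kmer_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of hash functions whose minima agree estimates the
    Jaccard similarity of the underlying k-mer sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two toy sequences differing by a couple of bases (hypothetical, not real genomes).
genome_a = "ATGGCGTACGTTAGCATCGATCGATCGGCTAGCTAGGATCCGATCGTAGCTAGCATCG"
genome_b = "ATGGCGTACGTTAGCATCGATCGATCGGCTAGCTAGGATCCGATCGTAGCAAGCATCG"
sig_a = minhash_signature(kmers(genome_a))
sig_b = minhash_signature(kmers(genome_b))
print(f"Estimated k-mer Jaccard similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```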
|
18 |
Distributed high-dimensional similarity search with music information retrieval applications
Faghfouri, Aidin 29 August 2011 (links)
Today, advances in networking technologies and computer hardware have enabled more and more inexpensive PCs, various mobile devices, smart phones, PDAs, sensors and cameras to be linked to the Internet with better connectivity. In recent years, we have witnessed the emergence of several instances of distributed applications, providing infrastructures for social interactions over large-scale wide-area networks and facilitating the ways users share and publish data. User-generated data today range from simple text files to (semi-)structured documents and multimedia content. With the emergence of the Semantic Web, the number of features (associated with a piece of content) that are used to index those large amounts of heterogeneous data is growing dramatically. The feature sets associated with each content type can grow continuously as we discover new ways of describing content in formal terms.
As the number of dimensions in the feature data grows (as high as 100 to 1000), it becomes harder and harder to search for information in a dataset due to the curse of dimensionality, and it is not appropriate to use naive search methods, as their performance degrades to that of linear search. As an alternative, we can distribute the content and the query processing load to a set of peers in a distributed Peer-to-Peer (P2P) network and incorporate high-dimensional distributed search techniques to attack the problem.
Currently, a large percentage of Internet traffic consists of video and music files shared and exchanged over P2P networks. In most present services, searching for music is performed through keyword search and naive string-matching algorithms combined with collaborative filtering techniques, which mostly use tag-based approaches. In music information retrieval (MIR) systems, the main goal is to make recommendations similar to the music that the user listens to. In these systems, techniques based on acoustic feature extraction can be employed to achieve content-based music similarity search (i.e., searching through music based on what can be heard from the music track). Using these techniques, we can devise an automated measure of similarity that can replace the need for human experts (or users) who assign descriptive genre tags and metadata to each recording, and can solve the famous cold-start problem associated with collaborative filtering techniques.
In this work we explore the advantages of distributed structures by efficiently distributing the content features and query processing load over the peers in a P2P network. Using a family of Locality Sensitive Hash (LSH) functions based on p-stable distributions, we propose an efficient, scalable and load-balanced system capable of performing K-Nearest-Neighbor (KNN) and range queries. We also propose a new load-balanced indexing algorithm and evaluate it using our Java-based simulator.
Our results show that this P2P design ensures load balancing and guarantees a logarithmic number of hops for query processing. Our system is extensible to all types of multi-dimensional feature data and can also be employed as the main indexing scheme of a multipurpose recommendation system. / Graduate
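For reference, here is a minimal sketch of a single hash function from the p-stable (Gaussian, p = 2) LSH family the abstract refers to, following the standard construction h(v) = floor((a · v + b) / w). The P2P distribution, load balancing, and indexing algorithm of the thesis are not shown, and the dimensionality, bucket width, and toy feature vectors are assumptions made for the example.

```python
import math
import random

class PStableLSH:
    """One LSH hash function for Euclidean distance, using the 2-stable
    (Gaussian) construction h(v) = floor((a . v + b) / w)."""

    def __init__(self, dim, w=4.0, seed=None):
        rng = random.Random(seed)
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable random projection
        self.b = rng.uniform(0.0, w)                        # random offset in [0, w)
        self.w = w                                          # quantisation bucket width

    def hash(self, v):
        dot = sum(ai * vi for ai, vi in zip(self.a, v))
        return math.floor((dot + self.b) / self.w)

# Concatenating several such functions forms one table key; feature vectors that
# are close in Euclidean distance tend to land in the same bucket.
dim = 16
fns = [PStableLSH(dim, seed=s) for s in range(8)]
track_a = [random.random() for _ in range(dim)]           # a toy acoustic feature vector
track_b = [x + random.gauss(0.0, 0.01) for x in track_a]  # a slightly perturbed copy
key_a = tuple(f.hash(track_a) for f in fns)
key_b = tuple(f.hash(track_b) for f in fns)
print(key_a == key_b)  # likely True for nearby vectors, though collisions are probabilistic
```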
|
19 |
Simultaneous real-time object recognition and pose estimation for artificial systems operating in dynamic environments
Van Wyk, Frans Pieter January 2013 (has links)
Recent advances in technology have increased awareness of the necessity for automated systems in people's everyday lives. Artificial systems are more frequently being introduced into environments previously thought to be too perilous for humans to operate in. Some robots can be used to extract potentially hazardous materials from sites inaccessible to humans, while others are being developed to aid humans with laborious tasks.
A crucial aspect of all artificial systems is the manner in which they interact with their immediate surroundings. Developing such a deceptively simple aspect has proven to be significantly challenging, as it not only entails the methods through which the system perceives its environment, but also its ability to perform critical tasks. These undertakings often involve the coordination of numerous subsystems, each performing its own complex duty. To complicate matters further, it is nowadays becoming increasingly important for these artificial systems to be able to perform their tasks in real-time.
The task of object recognition is typically described as the process of retrieving the object in a database that is most similar to an unknown, or query, object. Pose estimation, on the other hand, involves estimating the position and orientation of an object in three-dimensional space, as seen from an observer's viewpoint. These two tasks are regarded as vital to many computer vision techniques and regularly serve as input to more complex perception algorithms.
An approach is presented which regards the object recognition and pose estimation procedures as mutually dependent. The core idea is that dissimilar objects might appear similar when observed from certain viewpoints. A feature-based conceptualisation, which makes use of a database, is implemented and used to perform simultaneous object recognition and pose estimation. The design incorporates data compression techniques, originally suggested by the image-processing community, to facilitate fast processing of large databases.
System performance is quantified primarily on object recognition, pose estimation and execution time characteristics. These aspects are investigated under ideal conditions by exploiting three-dimensional models of relevant objects. The performance of the system is also analysed for practical scenarios by acquiring input data from a structured light implementation, which resembles that obtained from many commercial range scanners.
Practical experiments indicate that the system was capable of performing simultaneous object recognition and pose estimation in approximately 230 ms once a novel object has been sensed. An average object recognition accuracy of approximately 73% was achieved. The pose estimation results were reasonable but prompted further research. The results are comparable to what has been achieved using other suggested approaches such as Viewpoint Feature Histograms and Spin Images. / Dissertation (MEng)--University of Pretoria, 2013. / gm2014 / Electrical, Electronic and Computer Engineering / unrestricted
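As a toy illustration of the recognition-as-retrieval formulation described above (not the features, compression scheme, or pipeline used in the dissertation), the sketch below stores one descriptor per object view, so the nearest stored descriptor yields an object label and a pose hypothesis at the same time; all descriptors, labels, and the Euclidean metric are invented for the example.

```python
import math

# Toy view database: each entry couples a feature descriptor computed for one object
# seen from one viewpoint with the object's label and that viewpoint's pose, so
# retrieving the nearest descriptor yields a recognition and a pose hypothesis at once.
database = [
    {"object": "mug",  "pose_deg": (0, 0, 30), "descriptor": [0.9, 0.1, 0.4, 0.2]},
    {"object": "mug",  "pose_deg": (0, 0, 90), "descriptor": [0.8, 0.2, 0.5, 0.1]},
    {"object": "bowl", "pose_deg": (0, 10, 0), "descriptor": [0.2, 0.9, 0.1, 0.7]},
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognise(query_descriptor):
    """Return the (object, pose) of the stored view whose descriptor is closest to the query."""
    best = min(database, key=lambda entry: euclidean(entry["descriptor"], query_descriptor))
    return best["object"], best["pose_deg"]

print(recognise([0.85, 0.15, 0.45, 0.15]))  # expected to report the mug with a pose guess
```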
|
20 |
Simultaneous real-time object recognition and pose estimation for artificial systems operating in dynamic environments
Van Wyk, Frans-Pieter January 2013 (has links)
Recent advances in technology have increased awareness of the necessity for automated systems in people's everyday lives. Artificial systems are more frequently being introduced into environments previously thought to be too perilous for humans to operate in. Some robots can be used to extract potentially hazardous materials from sites inaccessible to humans, while others are being developed to aid humans with laborious tasks.
A crucial aspect of all artificial systems is the manner in which they interact with their immediate surroundings. Developing such a deceptively simple aspect has proven to be significantly challenging, as it not only entails the methods through which the system perceives its environment, but also its ability to perform critical tasks. These undertakings often involve the coordination of numerous subsystems, each performing its own complex duty. To complicate matters further, it is nowadays becoming increasingly important for these artificial systems to be able to perform their tasks in real-time.
The task of object recognition is typically described as the process of retrieving the object in a database that is most similar to an unknown, or query, object. Pose estimation, on the other hand, involves estimating the position and orientation of an object in three-dimensional space, as seen from an observer's viewpoint. These two tasks are regarded as vital to many computer vision techniques and regularly serve as input to more complex perception algorithms.
An approach is presented which regards the object recognition and pose estimation procedures as mutually dependent. The core idea is that dissimilar objects might appear similar when observed from certain viewpoints. A feature-based conceptualisation, which makes use of a database, is implemented and used to perform simultaneous object recognition and pose estimation. The design incorporates data compression techniques, originally suggested by the image-processing community, to facilitate fast processing of large databases.
System performance is quantified primarily on object recognition, pose estimation and execution time characteristics. These aspects are investigated under ideal conditions by exploiting three-dimensional models of relevant objects. The performance of the system is also analysed for practical scenarios by acquiring input data from a structured light implementation, which resembles that obtained from many commercial range scanners.
Practical experiments indicate that the system was capable of performing simultaneous object recognition and pose estimation in approximately 230 ms once a novel object has been sensed. An average object recognition accuracy of approximately 73% was achieved. The pose estimation results were reasonable but prompted further research. The results are comparable to what has been achieved using other suggested approaches such as Viewpoint Feature Histograms and Spin Images. / Dissertation (MEng)--University of Pretoria, 2013. / gm2014 / Electrical, Electronic and Computer Engineering / unrestricted
|