11 |
Efficient Kernel Methods for Statistical Detection. Su, Wanhua. 20 March 2008.
This research is motivated by a drug discovery problem: the AIDS anti-viral database from the National Cancer Institute. The objective of the study is to develop effective statistical methods to model the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. The resulting structure-activity model can then be used to predict the activity of new compounds and thus helps identify active compounds that may serve as drug candidates. Since active compounds are generally rare in a compound library, we treat drug discovery as an instance of the so-called statistical detection problem. In a typical statistical detection problem, we have data {Xi, Yi}, where Xi is the predictor vector of the ith observation and Yi ∈ {0, 1} is its class label; the objective is to identify the class-1 observations, which are extremely rare. Besides drug discovery, other applications of statistical detection include direct marketing and fraud detection.
We propose a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator"; the original idea was inspired by the ancient game known today as Go. The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive-bandwidth kernel density estimator whose kernel functions are located at, and only at, the class-1 observations; the bandwidth of the kernel centered at a given class-1 observation is the average distance between that observation and its K nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step.
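The two-step construction lends itself to a compact sketch. The following is one reading of the description above, with Gaussian kernels assumed (the kernel choice is not stated here), K and alpha the two LAGO parameters, and all function names illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def lago_scores(X1, X0, X_test, K=5, alpha=1.0):
    """Illustrative LAGO-style score (a sketch, not the thesis's exact estimator).

    Step 1: adaptive-bandwidth kernel density estimate of class 1, where the
    bandwidth of the kernel at each class-1 point is the average distance to
    its K nearest class-0 neighbors (rescaled by alpha).
    Step 2: local adjustment by the class-0 density, which is approximately
    inversely proportional to that bandwidth, so dividing by it amounts to
    multiplying each kernel's contribution by its bandwidth.
    """
    r = np.sort(cdist(X1, X0), axis=1)[:, :K].mean(axis=1)  # one radius per class-1 point
    h = alpha * r                                           # kernel bandwidths
    d = cdist(X_test, X1)                                   # test-to-class-1 distances
    kern = np.exp(-0.5 * (d / h) ** 2)                      # assumed Gaussian kernels
    return (kern * h).sum(axis=1)                           # higher = more class-1-like
```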
Application to the NCI data demonstrates that LAGO is superior to methods such as K nearest neighbors and support vector machines.
One drawback of the existing LAGO is that it provides only a point estimate of a test point's probability of belonging to class 1, ignoring the uncertainty of the model. In the second part of this thesis, we present a Bayesian framework for LAGO, referred to as BLAGO, which enables quantification of this uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, alpha) pairs, where K and alpha are the two parameters used to construct the LAGO score, by integrating out beta0 and beta1 via the Laplace approximation; beta0 and beta1 are the coefficients of the logistic transformation that maps the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions with support on (0, 1) and captures the uncertainty of those predictions. By avoiding Markov chain Monte Carlo and relying on the Laplace approximation, BLAGO is computationally very efficient; because it also requires no cross-validation, it is even faster than LAGO.
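As a sketch of how the Laplace step can work (building on the lago_scores sketch above): the non-informative priors are proxied here by wide N(0, tau^2) priors so that the integral is proper, and the grid-weighting interface is an assumption, not the thesis's exact procedure:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(s, y, tau=10.0):
    """Log evidence of y ~ Bernoulli(sigmoid(b0 + b1*s)), with (b0, b1) given
    wide N(0, tau^2) priors and integrated out by a Laplace approximation."""
    X = np.column_stack([np.ones_like(s), s])

    def neg_log_post(b):
        eta = X @ b
        loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # Bernoulli log-likelihood
        return -(loglik - 0.5 * b @ b / tau**2)             # minus Gaussian log prior

    res = minimize(neg_log_post, np.zeros(2), method="BFGS")
    p = 1.0 / (1.0 + np.exp(-(X @ res.x)))
    H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(2) / tau**2  # -Hessian at the mode
    # Laplace approximation: log m(y) ~ log-post(mode) + log(2*pi) - 0.5*log|H|
    return -res.fun + np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(H)[1]

def blago_grid_posterior(grid, X1, X0, X_train, y_train):
    """Posterior weights over (K, alpha) pairs under a uniform grid prior,
    using the lago_scores sketch above to score the training points."""
    log_ev = np.array([laplace_log_evidence(
        lago_scores(X1, X0, X_train, K=K, alpha=a), y_train) for K, a in grid])
    w = np.exp(log_ev - log_ev.max())                       # stabilize before normalizing
    return w / w.sum()
```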
|
13 |
Improving WiFi positioning through the use of successive in-sequence signal strength samples. Hallström, Per; Dellrup, Per. January 2006.
As portable computers and wireless networks become ubiquitous, it is natural to consider the user's position as yet another aspect to take into account when providing services tailored to the needs of consumers. Location-aware systems could guide people through buildings, to a particular bookshelf in a library, or assist in a vast variety of other applications that benefit from knowing the user's position.

In indoor positioning systems, the most common method for determining location is to collect samples of the received signal strength from each base station audible at the client's position, then pass the signal strength data on to a positioning server that has previously been fed example signal strength data from a set of reference points whose positions are known. From this set of reference points, the positioning server can interpolate the client's current location by comparing the collected signal strength data with the signal strength data associated with each reference point.

Our work proposes the use of multiple successive received signal strength samples in order to capture periodic signal strength variations resulting from effects such as multi-path propagation, reflections, and other types of radio interference. We believe that capturing these variations makes it easier to identify a particular point, because the fluctuations at a given position should be fairly constant, since they result from, for example, reflections off the fixed surfaces of the building's interior.

To investigate our assumptions, we conducted measurements at a site at Växjö University, collecting signal strength samples at known points. With the data collected, we performed two experiments: one with a neural network and one using the k-nearest-neighbor method for position approximation. For each method, we ran the same set of tests with single signal strength samples and with multiple successive signal strength samples to evaluate their respective performance.

We concluded that the k-nearest-neighbor method does not seem to benefit from multiple successive signal strength samples, at least not in our setup, compared with single samples. The neural network, however, performed about 17% better when multiple successive signal strength samples were used.
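For illustration, the position-interpolation step can be sketched as follows. Euclidean matching in signal space and inverse-distance weighting are assumptions on our part, as the abstract does not fix these choices:

```python
import numpy as np

def knn_position(fingerprints, positions, query, k=3):
    """Estimate a client position from RSS fingerprints (illustrative setup).

    fingerprints: (n_refs, n_features) reference signal-strength vectors; with
    multiple successive samples, the samples are concatenated per base station.
    positions:    (n_refs, 2) known (x, y) coordinates of the reference points.
    query:        (n_features,) signal-strength vector measured by the client.
    """
    d = np.linalg.norm(fingerprints - query, axis=1)  # distance in signal space
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-9)                     # inverse-distance weights
    return (positions[nearest] * w[:, None]).sum(axis=0) / w.sum()
```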
|
14 |
A Hilbert Curve-Based Algorithm for Order-Sensitive Moving KNN Queries. Feng, Fei-Chung. 11 July 2012.
Because wireless communication, positioning, and mobile computing technologies are developing quickly, mobile services over large spatiotemporal databases are becoming practical and important. Mobile service users move within a spatial domain, e.g., a country, and often issue K Nearest Neighbor (kNN) queries to obtain data objects reachable through the spatial database. The challenge for mobile services is how to efficiently deliver the data objects of interest to the corresponding mobile users. One variant of the kNN query problem is the order-sensitive moving kNN (order-sensitive MkNN) query problem, in which the query point is dynamic and unpredictable, and the kNN answers must be returned in real time, sorted by distance in ascending order. How to compute the kNN answers efficiently, incrementally, and correctly is therefore an important issue. Nutanong et al. proposed the V*-kNN algorithm to process the order-sensitive MkNN query. V*-kNN uses their V*-diagram algorithm to generate the safe region, and the Incremental Rank Updates (IRU) algorithm to handle the events that occur when the query point crosses bisectors or the boundary of the safe region. However, V*-kNN uses the BF-kNN algorithm to retrieve NNs, which is non-incremental, so the search time grows as object density increases. Moreover, it considers neither the situation in which multiple objects share the same order nor the situation in which multiple events happen in a single step; these situations can make the kNN answers incorrect. In this thesis, we therefore propose the Hilbert curve-based kNN algorithm (HC-kNN) to process the order-sensitive MkNN query. HC-kNN can handle multiple events happening in a single step, and we propose a new data structure for the kNN answers. Next, we propose the Intersection of Perpendicular Bisectors (IPB) algorithm to handle order-update events of the kNN answers, including the situation in which multiple objects share the same order. Finally, based on the Hilbert curve index, we propose the ONHC-kNN algorithm to retrieve NNs incrementally and to generate the safe region; this safe region is unaffected by increasing object density and is larger than that of V*-kNN. Our simulation results show that HC-kNN outperforms V*-kNN.
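For concreteness, the heart of a Hilbert-curve index can be sketched as below: the classic bit-manipulation mapping from grid cells to curve positions, plus a scan along the curve for nearest-neighbor candidates. This is generic machinery, not the thesis's HC-kNN, IPB, or ONHC-kNN algorithms themselves:

```python
import bisect

def xy_to_hilbert(x, y, n):
    """Index of grid cell (x, y) along the Hilbert curve filling an n x n grid
    (n a power of two); the classic iterative bit-manipulation formulation."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/flip so sub-curves join up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def candidate_nns(sorted_keys, sorted_ids, query_key, width=8):
    """Approximate NN candidates: objects whose Hilbert keys lie closest to the
    query's key; exact distances are then checked on this small candidate set."""
    i = bisect.bisect_left(sorted_keys, query_key)
    return sorted_ids[max(0, i - width):i + width]
```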
|
15 |
The Incremental Benefits of the Nearest Neighbor Forecast of U.S. Energy Commodity Prices. Kudoyan, Olga. December 2010.
This thesis compares the simple autoregressive (AR) model against the k-Nearest Neighbor (k-NN) model for point forecasts of five energy commodity prices: natural gas, heating oil, gasoline, ethanol, and crude oil. The data are monthly and, for each commodity, two-thirds of the observations are used for in-sample fitting while the remaining one-third are used for an out-of-sample forecast. Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are used to compare the two forecasts. The results showed that one method is superior by one measure but inferior by the other; since the differences between the two models are minimal, the choice of model is left to the decision maker. The Diebold-Mariano (DM) test was performed to assess the relative accuracy of the models. For all five commodities, the test failed to reject the null hypothesis, indicating that both models are equally accurate.
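A sketch of the evaluation just described; the variable names are assumed, and for one-step-ahead forecasts the plain DM statistic below (no autocovariance correction) is the usual choice:

```python
import numpy as np
from scipy import stats

def compare_forecasts(actual, ar_fcst, knn_fcst):
    """MAE, RMSE, and the Diebold-Mariano test under absolute-error loss
    (one-step horizon, so no autocovariance terms are needed)."""
    e1, e2 = actual - ar_fcst, actual - knn_fcst
    mae = np.abs(e1).mean(), np.abs(e2).mean()
    rmse = np.sqrt((e1 ** 2).mean()), np.sqrt((e2 ** 2).mean())

    d = np.abs(e1) - np.abs(e2)                      # loss differential series
    dm = d.mean() / np.sqrt(d.var(ddof=0) / len(d))  # ~N(0,1) under H0: equal accuracy
    p_value = 2 * stats.norm.sf(abs(dm))             # two-sided p-value
    return mae, rmse, dm, p_value
```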
|
17 |
A scalable metric learning based voting method for expression recognition. Wan, Shaohua. 09 October 2013.
In this work, we propose a facial expression classification method using metric learning-based k-nearest neighbor voting. To classify a facial expression accurately from frontal face images, we first learn a distance metric from training data that characterizes the feature space pattern, then use this metric to retrieve the nearest neighbors from the training dataset, and finally output the classification decision accordingly. An expression is represented as a fusion of face shape and texture; this representation is obtained by registering a face image with a landmarking shape model and extracting Gabor features from local patches around the landmarks. Such a representation achieves robustness and effectiveness by using an ensemble of local patch feature detectors at a global shape level. A naive implementation of metric learning-based k-nearest neighbor voting incurs a time complexity proportional to the size of the training dataset, which precludes its use with enormous datasets. To scale to larger databases, an approach similar to that in [24] is used to achieve approximate yet efficient ML-based kNN voting based on Locality Sensitive Hashing (LSH): a query example is hashed directly to a bucket of a pre-computed hash table where candidate nearest neighbors can be found, so there is no need to search the entire database. Experimental results on the Cohn-Kanade database and the Moving Faces and People database show that both ML-based kNN voting and its LSH approximation outperform the state of the art, demonstrating the superiority and scalability of our method.
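A generic sketch of the LSH shortcut: it assumes feature vectors have already been transformed by the learned metric, and it uses sign random projections rather than whatever hash family the thesis and [24] actually employ:

```python
import numpy as np
from collections import defaultdict

class SignProjectionLSH:
    """Single sign-random-projection hash table for approximate kNN voting."""

    def __init__(self, dim, n_bits=16, seed=0):
        self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))
        self.table = defaultdict(list)

    def _key(self, x):
        return tuple(bool(b) for b in (self.planes @ x) > 0)  # one bit per hyperplane

    def index(self, X, labels):
        for x, y in zip(X, labels):
            self.table[self._key(x)].append((x, y))

    def query_vote(self, q, k=5):
        """Majority vote over the k closest candidates in q's bucket only,
        instead of scanning the entire training set."""
        bucket = self.table.get(self._key(q), [])
        if not bucket:
            return None
        nearest = sorted(bucket, key=lambda xy: np.linalg.norm(xy[0] - q))[:k]
        votes = [y for _, y in nearest]
        return max(set(votes), key=votes.count)
```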
|
18 |
Enabling Hydrological Interpretation of Monthly to Seasonal Precipitation Forecasts in the Core North American Monsoon Region. Maitaria, Kazungu. January 2009.
The aim of the research undertaken in this dissertation was to use medium-range to seasonal precipitation forecasts for hydrologic applications in catchments of the core North American Monsoon (NAM) region. To this end, it was necessary to develop a better understanding of the physical and statistical relationships between runoff processes and the temporal statistics of rainfall. To achieve this goal, statistically downscaled estimates of warm-season precipitation over the core region of the North American Monsoon Experiment (NAME) were developed. Currently, NAM precipitation is poorly predicted on local and regional scales by General Circulation Models (GCMs). The downscaling technique used here, the K-Nearest Neighbor (KNN) model, combines information from retrospective GCM forecasts with simultaneous historical observations to infer statistical relationships between the low-resolution GCM fields and locally observed precipitation records. The stochastic nature of monsoon rainfall presents significant challenges for downscaling efforts and therefore necessitates regionalization and an ensemble, or probabilistic, approach to quantitative precipitation forecasting. Regionalizing the precipitation climatology prior to downscaling with KNN was found to offer significant advantages in terms of improved skill scores.

Selected output variables from retrospective ensemble runs of the National Centers for Environmental Prediction's medium-range forecast (MRF) model were fed into the KNN downscaling model. The quality of the downscaled precipitation forecasts was evaluated with a standard suite of ensemble verification metrics. This study represents the first time the KNN model has been successfully applied within a warm-season convective climate regime and shown to produce skillful and reliable ensemble forecasts of daily precipitation out to a lead time of four to six days, depending on the forecast month.

Knowledge of the behavior of the regional hydrologic systems in the NAM region was transferred into a modeling framework aimed at improving intra-seasonal hydrologic predictions. To this end, a robust lumped-parameter computational model of intermediate conceptual complexity was calibrated and applied to generate streamflow in three unregulated test basins in the core NAM region. The modeled response to KNN-generated precipitation forcing at different accumulation times was investigated. Although the model had some difficulty in accurately simulating hydrologic fluxes on the basis of Hortonian runoff principles alone, the preliminary results from this study are encouraging. The primary and most novel finding is the improved predictability of the NAM system using state-of-the-art ensemble forecasting systems. Additionally, this research significantly enhanced the utility of the MRF ensemble forecasts, making them reliable for regional hydrologic applications. Finally, monthly streamflow simulations from an ensemble-based approach have been demonstrated; these ensemble forecasts provide quantitative estimates of the uncertainty associated with the model forecasts.
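As a sketch of the KNN downscaling idea (the interface and the 1/k analog weights are our assumptions, not the dissertation's exact configuration), an ensemble can be generated by resampling observations paired with the most similar historical GCM states:

```python
import numpy as np

def knn_downscale(gcm_archive, obs_precip, gcm_today, K=10, seed=None):
    """Analog-resampling sketch of KNN downscaling.

    gcm_archive: (n_days, n_features) retrospective low-resolution GCM fields.
    obs_precip:  (n_days,) precipitation observed locally on those days.
    gcm_today:   (n_features,) current GCM forecast to be downscaled.
    Returns a K-member ensemble of local precipitation values.
    """
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(gcm_archive - gcm_today, axis=1)
    nearest = np.argsort(dist)[:K]              # K most similar historical states
    w = 1.0 / np.arange(1, K + 1)               # common 1/k analog weights
    w /= w.sum()
    picks = rng.choice(nearest, size=K, p=w)    # resample analog days
    return obs_precip[picks]
```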
|
19 |
Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space. January 2011.
K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains, such as databases and data mining, information retrieval, machine learning, pattern recognition, and plagiarism detection. Locality-sensitive hashing (LSH) is so far the most practical approximate KNN search algorithm for high-dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so they can answer different KNN queries adaptively. However, both need a data access post-processing (DAPP) step after candidate collection in order to produce the final answer to the KNN query. In this thesis, the Multi-Probe LSH with DAPP and LSH-Forest with DAPP algorithms are improved by replacing the costly DAPP step with much faster histogram-based post-processing (HBPP). Two HBPP algorithms are presented, LSH-Forest with HBPP and Multi-Probe LSH with HBPP, and both achieve the three goals of KNN search in large-scale, high-dimensional data sets: high search quality, high time efficiency, and high space efficiency. No previous KNN algorithm achieves all three. More specifically, it is shown that the HBPP algorithms always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) at much lower time cost (one to several orders of magnitude speedup) and the same memory usage. It is also shown that, at almost the same time cost and memory usage, the HBPP algorithms always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP). Moreover, to achieve very high search quality, Multi-Probe LSH with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size, and dimensionality of the data set.
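The abstract does not spell out HBPP's internals. One plausible reading, sketched below purely as an assumption, is to rank candidates by a histogram of hash-collision counts across tables or probes, avoiding the raw-data fetches that make DAPP costly:

```python
from collections import Counter

def hbpp_rank(candidate_buckets, k):
    """Rank candidate ids by how many tables/probes returned them, on the
    premise that more collisions indicate closer points; no raw vectors are
    ever fetched, unlike data access post-processing (DAPP)."""
    histogram = Counter()
    for bucket in candidate_buckets:   # one list of candidate ids per probe
        histogram.update(bucket)
    return [cid for cid, _ in histogram.most_common(k)]
```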
|
20 |
Detecção de desvios vocais utilizando modelos auto regressivos e o algoritmo KNN [Detection of vocal deviations using autoregressive models and the KNN algorithm]. Torres, Winnie de Lima. 30 January 2018.
Previous issue date: 2018-01-30 / Alguns campos da ciência propõem-se a estudar distúrbios no trato vocal a partir de
análises sobre padrões de vibração da voz. Em geral, a importância dessas pesquisas está
na identificação, em uma fase mais específica, de doenças de maior ou menor gravidade,
a serem sanadas com terapia vocal ou que requerem maior atenção, gerando inclusive a
necessidade de procedimentos cirúrgicos para o seu controle. Embora, já exista na literatura
indicações de que o processamento digital de sinais permite diagnosticar, de um
modo não invasivo, patologias laríngeas, como doenças vocais que ocasionem edema, nódulo
e paralisia, não existe definição do método mais indicado e das características, ou
parâmetros, mais adequados para detectar a presença de desvios vocais. Sendo assim,
neste trabalho é proposto um algoritmo para detecção de desvios vocais por meio da análise
de sinais de voz. Para a realização deste trabalho, utilizou-se dados constantes no
banco de dados Disordered Voice Database, desenvolvido pelo Massachusetts Eye and
Ear Infirmary (MEEI), devido sua utilização em pesquisas na área acústica de voz. Foram
utilizados 166 sinais contidos nessa base de dados, com sinais de vozes saudáveis e
de vozes patológicas afetadas por edema, por nódulo e por paralisia nas pregas vocais. A
partir dos sinais de voz, foram gerados modelos Auto Regressivos (AR e ARMA) para
representação desses sinais e, utilizando os parâmetros dos modelos obtidos, foi utilizado
o algoritmo K-Nearest Neighbors (KNN) para a classificação dos sinais analisados. Com
o intuito de analisar a eficiência do algoritmo proposto neste estudo, os resultados obtidos
desse algoritmo foram comparados com um método de detecção considerando apenas
distância euclidiana entre os sinais. Os resultados encontrados apontam que o método
proposto neste trabalho apresenta um bom resultado, gerando uma taxa de acerto na classificação
acima de 71% (maior que os 31% a partir do uso da distância euclidiana). Além
disso, o método utilizado é de fácil implementação, podendo ser utilizado em hardwares
mais simples. Logo, essa pesquisa tem potencial para gerar um classificador barato
e acessível para a utilização em larga escala por profissionais de saúde, como uma alternativa
de pré análise não invasiva para detecção de patologias otorrinolaringológicas que
afetem a voz. / Some fields in Science propose to study vocal tract disorders from an analysis about
voice vibration patterns. Generally, the weight of those researches is given by the identification
– in a more specific level – of diseases in different stages of severity, which would
be redressed through voice therapy or means that require more attention, hence generating
the need of surgical procedures for its control. Although there are evidences in literature
that the Digital Signal Processing allows a non-invasive diagnosis of laryngeal pathologies,
such as vocal cord disorders, which provoke swelling, nodules, and paralyses, there
is no definition of any most indicated method, and characteristics or appropriated parameters
to detect voice deviations. Thus, the present paper proposes an algorithm to detect
vocal deviances through the Voice Signal Analysis. In order to complete this study, it
had been used data from the Disordered Voice Database, developed by the Massachusetts
Eye and Ear Infirmary (MEEI) due to their wide use in researches regarding the voice and
speech. A total of 166 signals from this database were used, including healthy voices and
pathologic voices affected by swelling, nodule, and vocal fold paralysis. From the voice
signals, autoregressive processes of order (AR and ARMA) were generated for a representation
of those signals, and – by using the models’ parameters obtained – it had been
used the KNN algorithm for a classification of the signals analyzed. Seeking an analysis
of the efficiency of the algorithm proposed in this study, the results obtained from this
algorithm were compared to a detection method, which only considers the Euclidian distance
between the signals. The results found point that the propositioned method in this
work presents a satisfactory result, generating a hit rate on the classification above 71%
(more than the 31% from the use of the Euclidian distance). Moreover, the method used is
easy to implement, so that it can be used along with simpler hardware. Consequently, this
research has the potential to generate a cheap and accessible sorter for wide-scale use by
health care professionals as a non-invasive pre-analysis to detect otorhinolaryngological
pathologies that affect the voice.
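A minimal sketch of the AR branch of this pipeline, with Yule-Walker coefficients as features for an off-the-shelf KNN classifier; the ARMA branch, the model order, and all names are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def ar_features(signal, order=10):
    """Fit an AR(p) model via the Yule-Walker equations and return the
    coefficient vector, used here as the feature representation of a voice
    recording (ARMA modeling, also used in the thesis, is omitted)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.array([x[: n - k] @ x[k:] for k in range(order + 1)]) / n  # autocovariances
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Illustrative use, assuming `signals` (a list of 1-D arrays) and `labels`:
# X = np.vstack([ar_features(s) for s in signals])
# clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
# predictions = clf.predict(X)
```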
|