21 |
Estudo da influência de diversas medidas de similaridade na previsão de séries temporais utilizando o algoritmo KNN-TSP / Study of the influence of similarity measures in Time Series Prediction with the kNN-TSP algorithmAikes Junior, Jorge 11 April 2012 (has links)
Made available in DSpace on 2017-07-10T17:11:50Z (GMT). No. of bitstreams: 1
JORGE AIKES JUNIOR.PDF: 2050278 bytes, checksum: f5bae18bbcb7465240488c45b2c813e7 (MD5)
Previous issue date: 2012-04-11 / Time series can be understood as any set of observations which are time ordered. Among the many possible tasks appliable to temporal data, one that has attracted increasing interest, due to its various applications, is the time series forecasting. The k-Nearest Neighbor - Time Series Prediction (kNN-TSP) algorithm is a non-parametric method for forecasting time series. One of its advantages, is its easiness application when compared to parametric methods. Even though its easier to define kNN-TSP s parameters, some issues remain opened. This research is focused on the study of one of these parameters: the similarity measure. This parameter was empirically evaluated using various similarity measures in a large set of time series, including artificial series with seasonal and chaotic characteristics, and several real world time series. It was also carried out a case study comparing the predictive accuracy of the kNN-TSP algorithm with the Moving Average (MA), univariate Seasonal Auto-Regressive Integrated Moving Average (SARIMA) and multivariate SARIMA methods in a time series of a Korean s hospital daily patients flow in the Emergency Department. This work also proposes an approach to the development of a hybrid similarity measure which combines characteristics from several measures. The research s result demonstrated that the Lp Norm s measures have an advantage over other measures evaluated, due to its lower computational cost and for providing, in general, greater accuracy in temporal data forecasting using the kNN-TSP algorithm. Although the literature in general adopts the Euclidean similarity measure to calculate de similarity between time series, the Manhattan s distance can be considered an interesting candidate for defining similarity, due to the absence of statistical significant difference and to its lower computational cost when compared to the Euclidian measure. The measure proposed in this work does not show significant results, but it is promising for further research. Regarding the case study, the kNN-TSP algorithm with only the similarity measure parameter optimized achieves a considerably lower error than the MA s best configuration, and a slightly greater error than the univariate e multivariate SARIMA s optimal settings presenting less than one percent of difference. / Séries temporais podem ser entendidas como qualquer conjunto de observações que se encontram ordenadas no tempo. Dentre as várias tarefas possíveis com dados temporais, uma que tem atraído crescente interesse, devido a suas várias aplicações, é a previsão de séries temporais. O algoritmo k-Nearest Neighbor - Time Series Prediction (kNN-TSP) é um método não-paramétrico de previsão de séries temporais que apresenta como uma de suas vantagens a facilidade de aplicação, quando comparado aos métodos paramétricos. Apesar da maior facilidade na determinação de seus parâmetros, algumas questões relacionadas continuam em aberto. Este trabalho está focado no estudo de um desses parâmetros: a medida de similaridade. Esse parâmetro foi avaliado empiricamente utilizando diversas medidas de similaridade em um grande conjunto de séries temporais que incluem séries artificiais, com características sazonais e caóticas, e várias séries reais. Foi realizado também um estudo de caso comparativo entre a precisão da previsão do algoritmo kNN-TSP e a dos métodos de Médias Móveis (MA), Auto-regressivos de Médias Móveis Integrados Sazonais (SARIMA) univariado e SARIMA multivariado, em uma série de fluxo diário de pacientes na Área de Emergência de um hospital coreano. Neste trabalho é ainda proposta uma abordagem para o desenvolvimento de uma medida de similaridade híbrida, que combine características de várias medidas. Os resultados obtidos neste trabalho demonstram que as medidas da Norma Lp apresentam vantagem sobre as demais medidas avaliadas, devido ao seu menor custo computacional e por apresentar, em geral, maior precisão na previsão de dados temporais utilizando o algoritmo kNN-TSP. Apesar de na literatura, em geral, a medida Euclidiana ser adotada como medida de similaridade, a medida Manhattan pode ser considerada candidata interessante para definir a similaridade entre séries temporais, devido a não apresentar diferença estatisticamente significativa com a medida Euclidiana e possuir menor custo computacional. A medida proposta neste trabalho, não apresenta resultados significantes, mas apresenta-se promissora para novas pesquisas. Com relação ao estudo de caso, o algoritmo kNN-TSP, com apenas o parâmetro de medida de similaridade otimizado, alcança um erro consideravelmente inferior a melhor configuração com MA, e pouco maior que as melhores configurações dos métodos SARIMA univariado e SARIMA multivariado, sendo essa diferença inferior a um por cento.
|
22 |
Integrated feature, neighbourhood, and model optimization for personalised modelling and knowledge discoveryLiang, Wen January 2009 (has links)
“Machine learning is the process of discovering and interpreting meaningful information, such as new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Larose, 2005). From my understanding, machine learning is a process of using different analysis techniques to observe previously unknown, potentially meaningful information, and discover strong patterns and relationships from a large dataset. Professor Kasabov (2007b) classified computational models into three categories (e.g. global, local, and personalised) which have been widespread and used in the areas of data analysis and decision support in general, and in the areas of medicine and bioinformatics in particular. Most recently, the concept of personalised modelling has been widely applied to various disciplines such as personalised medicine, personalised drug design for known diseases (e.g. cancer, diabetes, brain disease, etc.) as well as for other modelling problems in ecology, business, finance, crime prevention, and so on. The philosophy behind the personalised modelling approach is that every person is different from others, thus he/she will benefit from having a personalised model and treatment. However, personalised modelling is not without issues, such as defining the correct number of neighbours or defining an appropriate number of features. As a result, the principal goal of this research is to study and address these issues and to create a novel framework and system for personalised modelling. The framework would allow users to select and optimise the most important features and nearest neighbours for a new input sample in relation to a certain problem based on a weighted variable distance measure in order to obtain more precise prognostic accuracy and personalised knowledge, when compared with global modelling and local modelling approaches.
|
23 |
Identification of Driving Styles in BusesKarginova, Nadezda January 2010 (has links)
<p>It is important to detect faults in bus details at an early stage. Because the driving style affects the breakdown of different details in the bus, identification of the driving style is important to minimize the number of failures in buses.</p><p>The identification of the driving style of the driver was based on the input data which contained examples of the driving runs of each class. K-nearest neighbor and neural networks algorithms were used. Different models were tested.</p><p>It was shown that the results depend on the selected driving runs. A hypothesis was suggested that the examples from different driving runs have different parameters which affect the results of the classification.</p><p>The best results were achieved by using a subset of variables chosen with help of the forward feature selection procedure. The percent of correct classifications is about 89-90 % for the k-nearest neighbor algorithm and 88-93 % for the neural networks.</p><p>Feature selection allowed a significant improvement in the results of the k-nearest neighbor algorithm and in the results of the neural networks algorithm received for the case when the training and testing data sets were selected from the different driving runs. On the other hand, feature selection did not affect the results received with the neural networks for the case when the training and testing data sets were selected from the same driving runs.</p><p>Another way to improve the results is to use smoothing. Computing the average class among a number of consequent examples allowed achieving a decrease in the error.</p>
|
24 |
Fast Pose Estimation with Parameter Sensitive HashingShakhnarovich, Gregory, Viola, Paul, Darrell, Trevor 18 April 2003 (has links)
Example-based methods are effective for parameter estimation problems when the underlying system is simple or the dimensionality of the input is low. For complex and high-dimensional problems such as pose estimation, the number of required examples and the computational complexity rapidly becme prohibitively high. We introduce a new algorithm that learns a set of hashing functions that efficiently index examples relevant to a particular estimation task. Our algorithm extends a recently developed method for locality-sensitive hashing, which finds approximate neighbors in time sublinear in the number of examples. This method depends critically on the choice of hash functions; we show how to find the set of hash functions that are optimally relevant to a particular estimation problem. Experiments demonstrate that the resulting algorithm, which we call Parameter-Sensitive Hashing, can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.
|
25 |
Advanced query processing on spatial networksYiu, Man-lung. January 2006 (has links)
Thesis (Ph. D.)--University of Hong Kong, 2006. / Title proper from title frame. Also available in printed format.
|
26 |
Identification of Driving Styles in BusesKarginova, Nadezda January 2010 (has links)
It is important to detect faults in bus details at an early stage. Because the driving style affects the breakdown of different details in the bus, identification of the driving style is important to minimize the number of failures in buses. The identification of the driving style of the driver was based on the input data which contained examples of the driving runs of each class. K-nearest neighbor and neural networks algorithms were used. Different models were tested. It was shown that the results depend on the selected driving runs. A hypothesis was suggested that the examples from different driving runs have different parameters which affect the results of the classification. The best results were achieved by using a subset of variables chosen with help of the forward feature selection procedure. The percent of correct classifications is about 89-90 % for the k-nearest neighbor algorithm and 88-93 % for the neural networks. Feature selection allowed a significant improvement in the results of the k-nearest neighbor algorithm and in the results of the neural networks algorithm received for the case when the training and testing data sets were selected from the different driving runs. On the other hand, feature selection did not affect the results received with the neural networks for the case when the training and testing data sets were selected from the same driving runs. Another way to improve the results is to use smoothing. Computing the average class among a number of consequent examples allowed achieving a decrease in the error.
|
27 |
A Boolean Function Based Approach to Nearest Neighbor FindingHsiao, Yuan-shu 29 June 2005 (has links)
With the rapid advances in technologies, strategies for efficiently operating the spatial data are needed. The spatial data consists of points, lines, rectangles, regions, surface, and volumes. In this thesis, we focus on the region data. There are many important and efficient operations for the region data, such as neighbor finding, rotation, and mirroring. The nearest neighbor (NN) finding is frequently used in geographic information system (GIS). We can find the specific point (e.g., a park, a department store, etc.) that is the closest to our position in geographical information systems. In any representation for the region data, it is not instinctive and easy for nearest neighbor finding, since the coordinate information has been lost. Voros, Chen, and Chang have proposed the strategies for the nearest neighbor finding based on the quadtree in eight directions. Chen and Chang have proposed the nearest neighbor finding based on the Peano curves. These strategies for the nearest neighbor finding based on the quadtree and the Peano curve use a looping process, which is time-consuming. On the other hand, in recent years, many researchers have also focused on finding efficient strategies for the rotating and mirroring operations, which is useful when the animation is performed by computers. The boolean function-based encoding is a considerable amount of space-saving with respect to the other binary image representation. The CBLQ representation saves memory space as compared to the other binary image representations that have proposed the strategies of the set operations. However, the processes for obtaining the rotated or mirrored code based on these two representations are time-consuming, since the coordinate information of all pixels has been lost. Therefore, in this thesis, first, for the nearest neighbor finding based on the quadtrees and the Peano curve, we propose the strategy which uses the bitwise and arithmetic operations, and it is more efficient than the strategies based on the looping processes. Next, we propose efficient strategies for rotating and mirroring images based on the boolean function-based encoding and constant bit-length linear quadtrees (CBLQ) representations. From our simulation study, first, we show that our strategies based on the quadtree and the Peano curve require the least CPU-time and our strategy based on the Hilbert curve requires the least total time (the CPU-time + the I/O time) among the strategies for the nearest neighbor finding based on the quadtree and the three space-filling curves. Next, in most of cases, when the black density is no larger than 50%, the CPU-time based on the boolean function-based encoding is less than that based on CBLQ.
|
28 |
A New Method for Finding the Decision Boundary Region for Pattern Recognition ProblemsYoung, Chieh-Neng 26 July 2001 (has links)
It has been shown that focusing the training algorithms to the decision boundary vicinity data can improve the accuracy of several classification methods. However, previous approaches for fining decision boundary vicinity data are either computationally tedious or may perform poorly in handling problems with class overlapping. With the application of the nearest neighbor rule, this work proposes a new criterion to characterize the nearness of the training samples to the decision boundary. To demonstrate the effectiveness of the proposed approach, the proposed method is integrated with a nearest neighbor classifier design method and a neural work training approach. Experimental results show that the proposed method can reduce the size and classification error for both of the tested classifiers.
|
29 |
Design and Analysis of Nearest Neighbor Search StrategiesChen, Hue-Ling 10 July 2002 (has links)
With the proliferation of wireless communications and rapid advances in technologies, algorithms for efficiently answering queries about large number of spatial data are needed. Spatial data consists of spatial objects including data of higher dimension. Neighbor finding is one of the most important spatial operations in the field of spatial data structures. In recent years, many
researchers have focused on finding efficient solutions to the nearest neighbor problem (NN) which involves determining the point in a data set that is the nearest to a given query point. It
is frequently used in Geographical Information Systems (GIS). A block B is said to be the neighbor of another block A, if block B has the same property as block A has and covers an
equal-sized neighbor of block A. Jozef Voros has proposed a neighbor finding strategy on images represented by quadtrees, in which the four equal-sized neighbors (the east, west, north, and south directions) of block A can be found. However, based on Voros's strategy, the case that the nearest neighbor occurs in the diagonal directions (the northeast, northwest, southeast, and southwest directions) will be ignored. Moreover, there is no total ordering that preserve proximity when mapping a spatial data from a higher dimensional space to a 1D-space. One way of effecting such a mapping is to utilize
space-filling curves. Space-filling curves pass through every point in a space and give a one-one correspondence between the coordinate and the 1D-sequence number of the point. The Peano curve, proposed by Orenstein, which maps the 1D-coordinate of a point by simply interleaving the bits of the X and Y coordinates in the 2D-space, can be easily used in neighbor finding. But with the data ordered by the RBG curve or the Hilbert curve, the neighbor finding would be complex.
The RBG curve achieves savings in random accesses on the disk for range queries and the Hilbert curve achieves the best clustering for range queries. Therefore, in this thesis, we first show the missing case in the Voros's strategy and show the ways to find it. Next, we show that the Peano curve is the best mapping function used in the nearest neighbor finding. We also show the
transformation rules between the Peano curve and the other curves such that we can efficiently find the nearest neighbor, when the data is linearly ordered by the other curves. From our simulation, we show that our proposed two strategies can work correctly and faster than the conventional strategies in nearest neighbor finding. Finally, we present a revised version of NA-Trees, which can work for exact match queries and range queries from a large, dynamic index, where an exact match query means finding the specific data object in a spatial database and a range query means reporting all data objects which are located in a specific range. By large, we mean that most of the index must be stored in secondary memory. By dynamic, we mean that insertions and deletions are intermixed with queries, so that the index cannot be built beforehand.
|
30 |
Group nearest neighbor queries /Shen, Qiong Mao. January 2003 (has links)
Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2003. / Includes bibliographical references (leaves 40-43). Also available in electronic version. Access restricted to campus users.
|
Page generated in 0.067 seconds