Spelling suggestions: "subject:"nearest neighbor"" "subject:"nearest weighbor""
91 |
Exploring the Noise Resilience of Combined Sturges AlgorithmAgarwal, Akrita January 2015 (has links)
No description available.
|
92 |
Increasing the Precision of Forest Area Estimates through Improved Sampling for Nearest Neighbor Satellite Image ClassificationBlinn, Christine Elizabeth 25 August 2005 (has links)
The impacts of training data sample size and sampling method on the accuracy of forest/nonforest classifications of three mosaicked Landsat ETM+ images with the nearest neighbor decision rule were explored. Large training data pools of single pixels were used in simulations to create samples with three sampling methods (random, stratified random, and systematic) and eight sample sizes (25, 50, 75, 100, 200, 300, 400, and 500). Two forest area estimation techniques were used to estimate the proportion of forest in each image and to calculate forest area precision estimates. Training data editing was explored to remove problem pixels from the training data pools. All possible band combinations of the six non-thermal ETM+ bands were evaluated for every sample draw. Comparisons were made between classification accuracies to determine if all six bands were needed. The utility of separability indices, minimum and average Euclidian distances, and cross-validation accuracies for the selection of band combinations, prediction of classification accuracies, and assessment of sample quality were determined.
Larger training data sample sizes produced classifications with higher average accuracies and lower variability. All three sampling methods had similar performance. Training data editing improved the average classification accuracies by a minimum of 5.45%, 5.31%, and 3.47%, respectively, for the three images. Band combinations with fewer than all six bands almost always produced the maximum classification accuracy for a single sample draw. The number of bands and combination of bands, which maximized classification accuracy, was dependent on the characteristics of the individual training data sample draw, the image, sample size, and, to a lesser extent, the sampling method. All three band selection measures were unable to select band combinations that produced higher accuracies on average than all six bands. Cross-validation accuracies with sample size 500 had high correlations with classification accuracies, and provided an indication of sample quality.
Collection of a high quality training data sample is key to the performance of the nearest neighbor classifier. Larger samples are necessary to guarantee classifier performance and the utility of cross-validation accuracies. Further research is needed to identify the characteristics of "good" training data samples. / Ph. D.
|
93 |
Precision Aggregated Local ModelsEdwards, Adam Michael 28 January 2021 (has links)
Large scale Gaussian process (GP) regression is infeasible for larger data sets due to cubic scaling of flops and quadratic storage involved in working with covariance matrices. Remedies in recent literature focus on divide-and-conquer, e.g., partitioning into sub-problems and inducing functional (and thus computational) independence. Such approximations can speedy, accurate, and sometimes even more flexible than an ordinary GPs. However, a big downside is loss of continuity at partition boundaries. Modern methods like local approximate GPs (LAGPs) imply effectively infinite partitioning and are thus pathologically good and bad in this regard. Model averaging, an alternative to divide-and-conquer, can maintain absolute continuity but often over-smooth, diminishing accuracy. Here I propose putting LAGP-like methods into a local experts-like framework, blending partition-based speed with model-averaging continuity, as a flagship example of what I call precision aggregated local models (PALM). Using N_C LAGPs, each selecting n from N data pairs, I illustrate a scheme that is at most cubic in n, quadratic in N_C, and linear in N, drastically reducing computational and storage demands. Extensive empirical illustration shows how PALM is at least as accurate as LAGP, can be much faster in terms of speed, and furnishes continuous predictive surfaces. Finally, I propose sequential updating scheme which greedily refines a PALM predictor up to a computational budget, and several variations on the basic PALM that may provide predictive improvements. / Doctor of Philosophy / Occasionally, when describing the relationship between two variables, it may be helpful to use a so-called ``non-parametric" regression that is agnostic to the function that connects them. Gaussian Processes (GPs) are a popular method of non-parametric regression used for their relative flexibility and interpretability, but they have the unfortunate drawback of being computationally infeasible for large data sets. Past work into solving the scaling issues for GPs has focused on ``divide and conquer" style schemes that spread the data out across multiple smaller GP models. While these model make GP methods much more accessible to large data sets they do so either at the expense of local predictive accuracy of global surface continuity. Precision Aggregated Local Models (PALM) is a novel divide and conquer method for GP models that is scalable for large data while maintaining local accuracy and a smooth global model. I demonstrate that PALM can be built quickly, and performs well predictively compared to other state of the art methods. This document also provides a sequential algorithm for selecting the location of each local model, and variations on the basic PALM methodology.
|
94 |
Machine Learning Models in Fullerene/Metallofullerene Chromatography StudiesLiu, Xiaoyang 08 August 2019 (has links)
Machine learning methods are now extensively applied in various scientific research areas to make models. Unlike regular models, machine learning based models use a data-driven approach. Machine learning algorithms can learn knowledge that are hard to be recognized, from available data. The data-driven approaches enhance the role of algorithms and computers and then accelerate the computation using alternative views. In this thesis, we explore the possibility of applying machine learning models in the prediction of chromatographic retention behaviors. Chromatographic separation is a key technique for the discovery and analysis of fullerenes. In previous studies, differential equation models have achieved great success in predictions of chromatographic retentions. However, most of the differential equation models require experimental measurements or theoretical computations for many parameters, which are not easy to obtain. Fullerenes/metallofullerenes are rigid and spherical molecules with only carbon atoms, which makes the predictions of chromatographic retention behaviors as well as other properties much simpler than other flexible molecules that have more variations on conformations. In this thesis, I propose the polarizability of a fullerene molecule is able to be estimated directly from the structures. Structural motifs are used to simplify the model and the models with motifs provide satisfying predictions. The data set contains 31947 isomers and their polarizability data and is split into a training set with 90% data points and a complementary testing set. In addition, a second testing set of large fullerene isomers is also prepared and it is used to testing whether a model can be trained by small fullerenes and then gives ideal predictions on large fullerenes. / Machine learning models are capable to be applied in a wide range of areas, such as scientific research. In this thesis, machine learning models are applied to predict chromatography behaviors of fullerenes based on the molecular structures. Chromatography is a common technique for mixture separations, and the separation is because of the difference of interactions between molecules and a stationary phase. In real experiments, a mixture usually contains a large family of different compounds and it requires lots of work and resources to figure out the target compound. Therefore, models are extremely import for studies of chromatography. Traditional models are built based on physics rules, and involves several parameters. The physics parameters are measured by experiments or theoretically computed. However, both of them are time consuming and not easy to be conducted. For fullerenes, in my previous studies, it has been shown that the chromatography model can be simplified and only one parameter, polarizability, is required. A machine learning approach is introduced to enhance the model by predicting the molecular polarizabilities of fullerenes based on structures. The structure of a fullerene is represented by several local structures. Several types of machine learning models are built and tested on our data set and the result shows neural network gives the best predictions.
|
95 |
Machine Learning for Malware Detection in Network TrafficOmopintemi, A.H., Ghafir, Ibrahim, Eltanani, S., Kabir, Sohag, Lefoane, Moemedi 19 December 2023 (has links)
No / Developing advanced and efficient malware detection systems is
becoming significant in light of the growing threat landscape in cybersecurity. This work aims to tackle the enduring problem of identifying malware and protecting digital assets from cyber-attacks.
Conventional methods frequently prove ineffective in adjusting
to the ever-evolving field of harmful activity. As such, novel approaches that improve precision while simultaneously taking into
account the ever-changing landscape of modern cybersecurity problems are needed. To address this problem this research focuses on
the detection of malware in network traffic. This work proposes
a machine-learning-based approach for malware detection, with
particular attention to the Random Forest (RF), Support Vector Machine (SVM), and Adaboost algorithms. In this paper, the model’s
performance was evaluated using an assessment matrix. Included
the Accuracy (AC) for overall performance, Precision (PC) for positive predicted values, Recall Score (RS) for genuine positives, and
the F1 Score (SC) for a balanced viewpoint. A performance comparison has been performed and the results reveal that the built model
utilizing Adaboost has the best performance. The TPR for the three
classifiers performs over 97% and the FPR performs < 4% for each of
the classifiers. The created model in this paper has the potential to
help organizations or experts anticipate and handle malware. The
proposed model can be used to make forecasts and provide management solutions in the network’s everyday operational activities.
|
96 |
Machine Learning Algorithms to Predict Cost Account Codes in an ERP System : An Exploratory Case StudyWirdemo, Alexander January 2023 (has links)
This study aimed to investigate how Machine Learning (ML) algorithms can be used to predict the cost account code to be used when handling invoices in an Enterprise Resource Planning (ERP) system commonly found in the Swedish public sector. This implied testing which one of the tested algorithms that performs the best and what criteria that need to be met in order to perform the best. Previous studies on ML and its use in invoice classification have focused on either the accounts payable side or the accounts receivable side of the balance sheet. The studies have used a variety of methods, some not only involving common ML algorithms such as Random forest, Naïve Bayes, Decision tree, Support Vector Machine, Logistic regression, Neural network or k-nearest Neighbor but also other classifiers such as rule classifiers and naïve classifiers. The general conclusion from previous studies is that several algorithms can classify invoices with a satisfactory accuracy score and that Random forest, Naïve Bayes and Neural network have shown the most promising results. The study was performed as an exploratory case study. The case company was a small municipal community where the finance clerks handles received invoices through an ERP system. The accounting step of invoice handling involves selecting the proper cost account code before submitting the invoice for review and approval. The data used was invoice summaries holding the organization number, bankgiro, postgiro and account code used. The algorithms selected for the task were the supervised learning algorithms Random forest and Naïve Bayes and the instance-based algorithm k-Nearest Neighbor (k-NN). The findings indicated that ML could be used to predict which cost account code to be used by providing a pre-filled suggestion when the clerk opens the invoice. Among the algorithms tested, Random forest performed the best with 78% accuracy (Naïve Bayes and k-NN performed at 69% and 70% accuracy, respectively). One reason for this is Random forest’s ability to handle several input variables, generate an unbiased estimate of the generalization error, and its ability to give information about the relationship between the variables and classification. However, a high level of support is needed in order to get the algorithm to perform at its best, where 335 occurrences is a guiding number in this case. / Syftet med denna studie var att undersöka hur Machine Learning (ML) algoritmer kan användas för att förutsäga vilken kontokod som ska användas vid hantering av fakturor i ett affärssystem som är vanligt förekommande i svensk offentlig sektor. Detta innebar att undersöka vilken av de testade algoritmerna som presterar bäst och vilka kriterier som måste uppfyllas för att prestera bäst. Tidigare studier om ML och dess användning vid fakturaklassificering har fokuserat på antingen balansräkningens leverantörsreskontra (leverantörsskulder) eller kundreskontrasidan (kundfordringar) i balansräkningen. Studierna har använt olika metoder, några involverar inte bara vanliga ML-algoritmer som Random forest, Naive Bayes, beslutsträd, Support Vector Machine, Logistisk regression, Neuralt nätverk eller k-nearest Neighbour, utan även andra klassificerare som regelklassificerare och naiva klassificerare. Den generella slutsatsen från tidigare studier är att det finns flera algoritmer som kan klassificera fakturor med en tillfredsställande noggrannhet, och att Random forest, Naive Bayes och neurala nätverk har visat de mest lovande resultaten. Studien utfördes som en explorativ fallstudie. Fallföretaget var en mindre kommun där ekonomiassistenter hanterar inkommande fakturor genom ett affärssystem. Bokföringssteget för fakturahantering innebär att användaren väljer rätt kostnadskontokod innan fakturan skickas för granskning och godkännande. Uppgifterna som användes var fakturasammandrag med organisationsnummer, bankgiro, postgiro och kontokod. Algoritmerna som valdes för uppgiften var de övervakade inlärningsalgoritmerna Random forest och Naive Bayes och den instansbaserade algoritmen k-Nearest Neighbour. Resultaten tyder på att ML skulle kunna användas för att förutsäga vilken kostnadskod som ska användas genom att ge ett förifyllt förslag när expediten öppnar fakturan. Bland de testade algoritmerna presterade Random forest bäst med 78 % noggrannhet (Naïve Bayes och k-Nearest Neighbour presterade med 69 % respektive 70 % noggrannhet). En förklaring till detta är Random forests förmåga att hantera flera indatavariabler, generera en opartisk skattning av generaliseringsfelet och dess förmåga att ge information om sambandet mellan variablerna och klassificeringen. Det krävs dock en högt antal dataobservationer för att få algoritmen att prestera som bäst, där 335 förekomster är ett minimum i detta fall.
|
97 |
Improving dual-tree algorithmsCurtin, Ryan Ross 07 January 2016 (has links)
This large body of work is entirely centered around dual-tree algorithms, a
class of algorithm based on spatial indexing structures that often provide large amounts of acceleration for various problems. This work focuses on understanding dual-tree algorithms using a new, tree-independent abstraction, and using this abstraction to develop new algorithms. Stated more clearly, the thesis of this entire work is that we may improve and expand the class of dual-tree algorithms by focusing on and providing improvements for each of the three independent components of a dual-tree algorithm: the type of space tree, the type of pruning dual-tree traversal, and the problem-specific BaseCase() and Score() functions. This is demonstrated by expressing many existing dual-tree algorithms in the tree-independent framework, and focusing on improving each of these three pieces. The result is a formidable set of generic components that can be used to assemble dual-tree algorithms, including faster traversals, improved tree theory, and new algorithms to solve the problems of max-kernel search and k-means clustering.
|
98 |
Automatic Classification of Fish in Underwater Video; Pattern Matching - Affine Invariance and Beyondgundam, madhuri, Gundam, Madhuri 15 May 2015 (has links)
Underwater video is used by marine biologists to observe, identify, and quantify living marine resources. Video sequences are typically analyzed manually, which is a time consuming and laborious process. Automating this process will significantly save time and cost. This work proposes a technique for automatic fish classification in underwater video. The steps involved are background subtracting, fish region tracking and classification using features. The background processing is used to separate moving objects from their surrounding environment. Tracking associates multiple views of the same fish in consecutive frames. This step is especially important since recognizing and classifying one or a few of the views as a species of interest may allow labeling the sequence as that particular species. Shape features are extracted using Fourier descriptors from each object and are presented to nearest neighbor classifier for classification. Finally, the nearest neighbor classifier results are combined using a probabilistic-like framework to classify an entire sequence.
The majority of the existing pattern matching techniques focus on affine invariance, mainly because rotation, scale, translation and shear are common image transformations. However, in some situations, other transformations may be modeled as a small deformation on top of an affine transformation. The proposed algorithm complements the existing Fourier transform-based pattern matching methods in such a situation. First, the spatial domain pattern is decomposed into non-overlapping concentric circular rings with centers at the middle of the pattern. The Fourier transforms of the rings are computed, and are then mapped to polar domain. The algorithm assumes that the individual rings are rotated with respect to each other. The variable angles of rotation provide information about the directional features of the pattern. This angle of rotation is determined starting from the Fourier transform of the outermost ring and moving inwards to the innermost ring. Two different approaches, one using dynamic programming algorithm and second using a greedy algorithm, are used to determine the directional features of the pattern.
|
99 |
EVALUATING SPATIAL QUERIES OVER DECLUSTERED SPATIAL DATAEslam A Almorshdy (6832553) 02 August 2019 (has links)
<div>
<div>
<p>Due to the large volumes of spatial data, data is stored on clusters of machines
that inter-communicate to achieve a task. In such distributed environment; communicating intermediate results among computing nodes dominates execution time.
Communication overhead is even more dominant if processing is in memory. Moreover, the way spatial data is partitioned affects overall processing cost. Various partitioning strategies influence the size of the intermediate results. Spatial data poses
the following additional challenges: 1)Storage load balancing because of the skewed
distribution of spatial data over the underlying space, 2)Query load imbalance due to
skewed query workload and query hotspots over both time and space, and 3)Lack of
effective utilization of the computing resources. We introduce a new kNN query evaluation technique, termed BCDB, for evaluating nearest-neighbor queries (NN-queries,
for short). In contrast to clustered partitioning of spatial data, BCDB explores the
use of declustered partitioning of data to address data and query skew. BCDB uses
summaries of the underling data and a coarse-grained index to localize processing of
the NN-query on each local node as much as possible. The coarse-grained index is locally traversed using a new uncertain version of classical distance browsing resulting in minimal O( √k) elements to be communicated across all processing nodes.</p>
</div>
</div>
|
100 |
Artificial intelligence and Machine learning : a diabetic readmission studyForsman, Robin, Jönsson, Jimmy January 2019 (has links)
The maturing of Artificial intelligence provides great opportunities for healthcare, but also comes with new challenges. For Artificial intelligence to be adequate a comprehensive analysis of the data is necessary along with testing the data in multiple algorithms to determine which algorithm is appropriate to use. In this study collection of data has been gathered that consists of patients who have either been readmitted or not readmitted to hospital within 30-days after being admitted. The data has then been analyzed and compared in different algorithms to determine the most appropriate algorithm to use.
|
Page generated in 0.0389 seconds