131 |
Exploring the Noise Resilience of Combined Sturges Algorithm
Agarwal, Akrita January 2015
No description available.
|
132 |
Increasing the Precision of Forest Area Estimates through Improved Sampling for Nearest Neighbor Satellite Image Classification
Blinn, Christine Elizabeth 25 August 2005
The impacts of training data sample size and sampling method on the accuracy of forest/nonforest classifications of three mosaicked Landsat ETM+ images with the nearest neighbor decision rule were explored. Large training data pools of single pixels were used in simulations to create samples with three sampling methods (random, stratified random, and systematic) and eight sample sizes (25, 50, 75, 100, 200, 300, 400, and 500). Two forest area estimation techniques were used to estimate the proportion of forest in each image and to calculate forest area precision estimates. Training data editing was explored to remove problem pixels from the training data pools. All possible band combinations of the six non-thermal ETM+ bands were evaluated for every sample draw. Classification accuracies were compared to determine whether all six bands were needed. The utility of separability indices, minimum and average Euclidean distances, and cross-validation accuracies for the selection of band combinations, prediction of classification accuracies, and assessment of sample quality was determined.
Larger training data sample sizes produced classifications with higher average accuracies and lower variability. All three sampling methods had similar performance. Training data editing improved the average classification accuracies by a minimum of 5.45%, 5.31%, and 3.47%, respectively, for the three images. Band combinations with fewer than all six bands almost always produced the maximum classification accuracy for a single sample draw. The number and combination of bands that maximized classification accuracy depended on the characteristics of the individual training data sample draw, the image, the sample size, and, to a lesser extent, the sampling method. None of the three band selection measures selected band combinations that produced higher accuracies on average than all six bands. Cross-validation accuracies with sample size 500 had high correlations with classification accuracies and provided an indication of sample quality.
Collection of a high quality training data sample is key to the performance of the nearest neighbor classifier. Larger samples are necessary to guarantee classifier performance and the utility of cross-validation accuracies. Further research is needed to identify the characteristics of "good" training data samples. / Ph. D.
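
The band-combination search and cross-validation check described in this abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the use of leave-one-out cross-validation, the 1-nearest-neighbor rule with Euclidean distance, and the zero-based band indices are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def band_combinations(n_bands=6):
    """Return every non-empty subset of band indices 0..n_bands-1, smallest first."""
    return [c for r in range(1, n_bands + 1)
            for c in combinations(range(n_bands), r)]


def loo_accuracy(pixels, labels, band_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier (hypothetical helper).

    pixels   -- list of per-pixel band values
    labels   -- class label for each pixel (e.g. forest / nonforest)
    band_idx -- indices of the bands to use for the distance computation
    """
    sub = [[p[i] for i in band_idx] for p in pixels]
    correct = 0
    for i, x in enumerate(sub):
        # Nearest *other* training pixel by Euclidean distance.
        j = min((k for k in range(len(sub)) if k != i),
                key=lambda k: dist(x, sub[k]))
        correct += labels[j] == labels[i]
    return correct / len(sub)
```

With six bands, `band_combinations()` yields 63 candidate subsets, and `loo_accuracy` can be evaluated for each subset on a given training sample draw to compare band combinations.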
|
133 |
Precision Aggregated Local Models
Edwards, Adam Michael 28 January 2021
Gaussian process (GP) regression is infeasible for large data sets due to the cubic scaling of flops and quadratic storage involved in working with covariance matrices. Remedies in recent literature focus on divide-and-conquer, e.g., partitioning into sub-problems and inducing functional (and thus computational) independence. Such approximations can be speedy, accurate, and sometimes even more flexible than an ordinary GP. However, a big downside is loss of continuity at partition boundaries. Modern methods like local approximate GPs (LAGPs) imply effectively infinite partitioning and are thus pathologically good and bad in this regard. Model averaging, an alternative to divide-and-conquer, can maintain absolute continuity but often over-smooths, diminishing accuracy. Here I propose putting LAGP-like methods into a local experts-like framework, blending partition-based speed with model-averaging continuity, as a flagship example of what I call precision aggregated local models (PALM). Using N_C LAGPs, each selecting n from N data pairs, I illustrate a scheme that is at most cubic in n, quadratic in N_C, and linear in N, drastically reducing computational and storage demands. Extensive empirical illustration shows that PALM is at least as accurate as LAGP, can be much faster, and furnishes continuous predictive surfaces. Finally, I propose a sequential updating scheme that greedily refines a PALM predictor up to a computational budget, and several variations on the basic PALM that may provide predictive improvements. / Doctor of Philosophy / Occasionally, when describing the relationship between two variables, it may be helpful to use a so-called "non-parametric" regression that is agnostic to the function that connects them.
Gaussian Processes (GPs) are a popular method of non-parametric regression used for their relative flexibility and interpretability, but they have the unfortunate drawback of being computationally infeasible for large data sets. Past work on solving the scaling issues for GPs has focused on "divide and conquer" style schemes that spread the data out across multiple smaller GP models. While these models make GP methods much more accessible for large data sets, they do so at the expense of either local predictive accuracy or global surface continuity. Precision Aggregated Local Models (PALM) is a novel divide and conquer method for GP models that is scalable for large data while maintaining local accuracy and a smooth global model. I demonstrate that PALM can be built quickly and performs well predictively compared to other state-of-the-art methods. This document also provides a sequential algorithm for selecting the location of each local model, and variations on the basic PALM methodology.
|
134 |
Machine Learning Models in Fullerene/Metallofullerene Chromatography Studies
Liu, Xiaoyang 08 August 2019
Machine learning methods are now extensively applied in various scientific research areas to build models. Unlike conventional models, machine learning based models use a data-driven approach. Machine learning algorithms can learn patterns from available data that are otherwise hard to recognize. Data-driven approaches enhance the role of algorithms and computers and accelerate computation by offering alternative views. In this thesis, we explore the possibility of applying machine learning models to the prediction of chromatographic retention behaviors. Chromatographic separation is a key technique for the discovery and analysis of fullerenes. In previous studies, differential equation models have achieved great success in predicting chromatographic retention. However, most differential equation models require experimental measurements or theoretical computations for many parameters, which are not easy to obtain. Fullerenes/metallofullerenes are rigid and spherical molecules with only carbon atoms, which makes predicting their chromatographic retention behaviors and other properties much simpler than for flexible molecules that have more conformational variation. In this thesis, I propose that the polarizability of a fullerene molecule can be estimated directly from its structure. Structural motifs are used to simplify the model, and the models with motifs provide satisfactory predictions. The data set contains 31947 isomers and their polarizability data and is split into a training set with 90% of the data points and a complementary testing set. In addition, a second testing set of large fullerene isomers is prepared and used to test whether a model trained on small fullerenes can give good predictions for large fullerenes. / Machine learning models can be applied in a wide range of areas, such as scientific research.
In this thesis, machine learning models are applied to predict the chromatographic behavior of fullerenes based on their molecular structures. Chromatography is a common technique for separating mixtures; the separation arises from differences in the interactions between molecules and a stationary phase. In real experiments, a mixture usually contains a large family of different compounds, and identifying the target compound requires a great deal of work and resources. Therefore, models are extremely important for studies of chromatography. Traditional models are built on physical laws and involve several parameters. These physical parameters are measured experimentally or computed theoretically. However, both approaches are time-consuming and difficult to carry out. For fullerenes, my previous studies have shown that the chromatography model can be simplified so that only one parameter, polarizability, is required. A machine learning approach is introduced to enhance the model by predicting the molecular polarizabilities of fullerenes based on their structures. The structure of a fullerene is represented by several local structures. Several types of machine learning models are built and tested on our data set, and the results show that a neural network gives the best predictions.
|
135 |
Machine Learning for Malware Detection in Network Traffic
Omopintemi, A.H., Ghafir, Ibrahim, Eltanani, S., Kabir, Sohag, Lefoane, Moemedi 19 December 2023
No / Developing advanced and efficient malware detection systems is becoming increasingly important in light of the growing threat landscape in cybersecurity. This work aims to tackle the enduring problem of identifying malware and protecting digital assets from cyber-attacks. Conventional methods frequently fail to adapt to the ever-evolving field of harmful activity. Novel approaches are therefore needed that improve precision while accounting for the changing landscape of modern cybersecurity problems. To address this problem, this research focuses on the detection of malware in network traffic. It proposes a machine-learning-based approach for malware detection, with particular attention to the Random Forest (RF), Support Vector Machine (SVM), and Adaboost algorithms. The models' performance was evaluated using four metrics: Accuracy (AC) for overall performance, Precision (PC) for positive predictive value, Recall Score (RS) for true positives, and F1 Score (SC) for a balanced view. A performance comparison reveals that the model built with Adaboost performs best. The true positive rate (TPR) exceeds 97% and the false positive rate (FPR) is below 4% for each of the three classifiers. The model created in this paper has the potential to help organizations or experts anticipate and handle malware. The proposed model can be used to make forecasts and provide management solutions in the network's everyday operational activities.
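
The metrics named in this abstract (accuracy, precision, recall/TPR, F1, and FPR) all follow from the four cells of a binary confusion matrix. The Python sketch below shows the standard definitions; the function name `binary_metrics` is a hypothetical helper, not code from the paper, and it makes no claim about the paper's actual counts or results.

```python
def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn -- malware samples flagged / missed
    fp/tn -- benign samples flagged / correctly passed
    """
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),         # positive predictive value
        "recall": _ratio(tp, tp + fn),            # same quantity as the TPR
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
        "fpr": _ratio(fp, fp + tn),
    }
```

Reporting the FPR alongside the TPR, as the abstract does, matters for network traffic because benign flows usually far outnumber malicious ones, so even a small FPR can mean many false alarms.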
|
136 |
Undersökning om hjulmotorströmmar kan användas som alternativ metod för kollisiondetektering i autonoma gräsklippare. : Klassificering av hjulmotorströmmar med KNN och MLP. / Investigation if wheel motor currents can be used as an alternative method for collision detection in robotic lawn mowers
Bertilsson, Tobias, Johansson, Romario January 2019
Purpose – The purpose of the study is to expand the knowledge of how wheel motor currents can be combined with machine learning in a collision detection system for autonomous robots, in order to decrease the number of external sensors, open new design opportunities, and lower production costs. Method – The study is conducted as design science research in which two artefacts are developed in cooperation with Globe Tools Group. The artefacts are evaluated on how they categorize data from an autonomous robot into the two categories collision and non-collision. The artefacts are then tested on generated data to analyse their ability to categorize. Findings – Both artefacts showed 100% accuracy in detecting the collisions in the data from the autonomous robot. In the second part of the experiment, the artefacts show different decision boundaries in how they categorize the data, which makes them useful in different applications. Implications – The study contributes to expanding knowledge of how machine learning and wheel motor currents can be used in a collision detection system. The results can lead to lower production costs and new design opportunities. Limitations – The data used in the study were gathered by an autonomous robot that made only frontal collisions on an artificial lawn. Keywords – Machine learning, K-Nearest Neighbour, Multilayer Perceptron, collision detection, autonomous robots, collision detection based on current. / Purpose – The purpose of the study is to expand knowledge of how wheel motor currents can be combined with machine learning for collision detection in autonomous robots, in order to reduce the number of external sensors these robots require and thereby open up design possibilities and reduce production costs. Method – The study was carried out as design science research in which two artefacts were developed in cooperation with Globe Tools Group.
The artefacts were then evaluated on how they categorized collisions in a given data set generated by an autonomous lawn mower. The experiment then introduced data that were not part of that data set to see how the methods categorized them. Results – The artefacts detected collisions in the given data set with 100% accuracy. However, the two artefacts have different decision regions in how they categorize the data into collisions and non-collisions, which may suit them to different applications. Implications – The thesis contributes to increased knowledge of how machine learning and wheel motor currents can be used in a collision detection system. The results may contribute to lower production costs and new design possibilities. Limitations – The data used in the study were collected only by an autonomous lawn mower making frontal collisions on artificial turf. Keywords – Machine learning, K-nearest neighbor, Multi-layer perceptron, collision detection, autonomous robots
|
137 |
Machine Learning Algorithms to Predict Cost Account Codes in an ERP System : An Exploratory Case Study
Wirdemo, Alexander January 2023
This study aimed to investigate how Machine Learning (ML) algorithms can be used to predict the cost account code to use when handling invoices in an Enterprise Resource Planning (ERP) system commonly found in the Swedish public sector. This involved determining which of the tested algorithms performs best and what criteria must be met for it to do so. Previous studies on ML and its use in invoice classification have focused on either the accounts payable side or the accounts receivable side of the balance sheet. The studies have used a variety of methods, some involving not only common ML algorithms such as Random forest, Naïve Bayes, Decision tree, Support Vector Machine, Logistic regression, Neural network, or k-nearest Neighbor but also other classifiers such as rule classifiers and naïve classifiers. The general conclusion from previous studies is that several algorithms can classify invoices with a satisfactory accuracy score and that Random forest, Naïve Bayes, and Neural network have shown the most promising results. The study was performed as an exploratory case study. The case company was a small municipal community where the finance clerks handle received invoices through an ERP system. The accounting step of invoice handling involves selecting the proper cost account code before submitting the invoice for review and approval. The data used were invoice summaries holding the organization number, bankgiro, postgiro, and account code used. The algorithms selected for the task were the supervised learning algorithms Random forest and Naïve Bayes and the instance-based algorithm k-Nearest Neighbor (k-NN). The findings indicated that ML could be used to predict which cost account code to use by providing a pre-filled suggestion when the clerk opens the invoice. Among the algorithms tested, Random forest performed the best with 78% accuracy (Naïve Bayes and k-NN performed at 69% and 70% accuracy, respectively).
One reason for this is Random forest's ability to handle several input variables, to generate an unbiased estimate of the generalization error, and to give information about the relationship between the variables and the classification. However, a high level of support (a large number of observations) is needed for the algorithm to perform at its best, where 335 occurrences is a guiding number in this case. / The purpose of this study was to investigate how Machine Learning (ML) algorithms can be used to predict which account code to use when handling invoices in an ERP system commonly found in the Swedish public sector. This involved examining which of the tested algorithms performs best and what criteria must be met for it to perform best. Previous studies on ML and its use in invoice classification have focused on either the accounts payable side or the accounts receivable side of the balance sheet. The studies have used various methods, some involving not only common ML algorithms such as Random forest, Naive Bayes, decision tree, Support Vector Machine, logistic regression, neural network, or k-nearest Neighbour, but also other classifiers such as rule classifiers and naive classifiers. The general conclusion from previous studies is that several algorithms can classify invoices with satisfactory accuracy, and that Random forest, Naive Bayes, and neural networks have shown the most promising results. The study was carried out as an exploratory case study. The case company was a small municipality where finance assistants handle incoming invoices through an ERP system. The accounting step of invoice handling involves the user selecting the correct cost account code before the invoice is sent for review and approval. The data used were invoice summaries containing the organization number, bankgiro, postgiro, and account code.
The algorithms chosen for the task were the supervised learning algorithms Random forest and Naive Bayes and the instance-based algorithm k-Nearest Neighbour. The results indicate that ML could be used to predict which cost code to use by providing a pre-filled suggestion when the clerk opens the invoice. Among the algorithms tested, Random forest performed best with 78% accuracy (Naïve Bayes and k-Nearest Neighbour achieved 69% and 70% accuracy, respectively). One explanation is Random forest's ability to handle several input variables, to generate an unbiased estimate of the generalization error, and to give information about the relationship between the variables and the classification. However, a large number of observations is required for the algorithm to perform at its best, where 335 occurrences is a minimum in this case.
|
138 |
Improving dual-tree algorithms
Curtin, Ryan Ross 07 January 2016
This large body of work is entirely centered around dual-tree algorithms, a class of algorithms based on spatial indexing structures that often provide large amounts of acceleration for various problems. This work focuses on understanding dual-tree algorithms using a new, tree-independent abstraction, and using this abstraction to develop new algorithms. Stated more clearly, the thesis of this entire work is that we may improve and expand the class of dual-tree algorithms by focusing on and providing improvements for each of the three independent components of a dual-tree algorithm: the type of space tree, the type of pruning dual-tree traversal, and the problem-specific BaseCase() and Score() functions. This is demonstrated by expressing many existing dual-tree algorithms in the tree-independent framework, and focusing on improving each of these three pieces. The result is a formidable set of generic components that can be used to assemble dual-tree algorithms, including faster traversals, improved tree theory, and new algorithms to solve the problems of max-kernel search and k-means clustering.
|
139 |
使用最近鄰域法預測匯率—以美元兌新台幣為例 / Predicting exchange rates with nearest-neighbors method: The case of NTD/USD
郭依帆 Unknown Date
Building models to estimate exchange rates has a long history. Earlier exchange rate models performed poorly both in in-sample fit and in out-of-sample forecasting. Subsequent studies attributed this to the nonlinear behavior of exchange rates, which traditional linear models cannot describe. Nonparametric estimation tends to be used to capture this nonlinearity. This study therefore uses the nearest-neighbors method to forecast the USD/NTD exchange rate. In addition, many early studies found that the random walk model forecast exchange rates better than competing models, which set off a wave of attempts to "beat the random walk". This study continues that line of inquiry and uses the random walk model as the benchmark against which the nearest-neighbors model is compared. / The data are spot exchange rates at daily, weekly, and monthly frequencies. Each series is split into in-sample and out-of-sample parts, with the last one-third of the observations used for out-of-sample forecasting. Mean absolute error and root mean squared error are used to measure and compare forecast accuracy. The empirical results show that the nearest-neighbors model with locally weighted estimation fits better in-sample than the random walk model; in out-of-sample forecasting, however, the random walk model still holds a slight edge. / A wide variety of empirical exchange rate models have been estimated over the years. Earlier findings indicated that exchange rate equations do not fit particularly well and forecast no better. Later research offered a potential reason for the poor performance of traditional exchange rate models: exchange rates behave nonlinearly. Nonparametric techniques tend to be useful tools for handling nonlinearity. In this study, we use a nonparametric technique called the nearest-neighbors method to predict the NTD against the USD. In addition, many earlier papers found that forecasts from popular exchange rate models generally fail to improve upon the random walk out-of-sample, and "beating the random walk" became a prominent issue. This motivated the present research, and we therefore include the random walk as a linear benchmark. / The data set consists of the daily, weekly, and monthly spot rates for NTD/USD. We divide each data set into a fitting set and a prediction set for in-sample analysis and out-of-sample forecasting, respectively. The out-of-sample forecasts are calculated from the last one-third of each series. The mean absolute error (MAE) and root mean squared error (RMSE) are used as measures of performance. In our empirical results, we find that the nearest-neighbors model using local weights easily tops the random walk in-sample. However, when we turn to out-of-sample prediction, no model produces forecasts superior to the random walk. It seems difficult to beat the random walk out-of-sample in this study.
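
The comparison described in this abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the thesis's method: `nn_forecast` averages the successors of the k closest past windows with equal weights, whereas the thesis uses locally weighted estimation, and the window length `m`, neighbor count `k`, and all function names are hypothetical.

```python
from math import dist, sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def random_walk_forecast(history):
    """Random walk without drift: the next rate equals the latest observed rate."""
    return history[-1]


def nn_forecast(history, m=2, k=2):
    """Average the successors of the k past m-length windows closest to the latest one."""
    target = history[-m:]
    # Only windows with a known successor are candidates, so the latest window is excluded.
    candidates = [(dist(history[i:i + m], target), history[i + m])
                  for i in range(len(history) - m)]
    nearest = sorted(candidates)[:k]
    return sum(nxt for _, nxt in nearest) / len(nearest)


def evaluate(series, forecaster):
    """One-step-ahead forecasts over the last third of the series; returns (MAE, RMSE)."""
    split = len(series) - len(series) // 3
    preds = [forecaster(series[:t]) for t in range(split, len(series))]
    return mae(series[split:], preds), rmse(series[split:], preds)
```

Calling `evaluate(rates, random_walk_forecast)` and `evaluate(rates, nn_forecast)` on the same series of spot rates gives the two error pairs to compare; each forecast uses only observations before the date being predicted.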
|
140 |
Spatial Analysis of Retinal Pigment Epithelium Morphology
Huang, Haitao 12 August 2016
In patients with age-related macular degeneration, the retinal pigment epithelium (RPE), a monolayer of cells in the eye, differs in morphology from that of healthy eyes. It is therefore important to quantify these morphological changes, which will help us better understand the physiology, progression, and classification of the disease. Classification of RPE morphometry has previously been accomplished with whole tissue data. In this work, we focused on the spatial aspect of RPE morphometric analysis. We used second-order spatial analysis to reveal distinct patterns of cell clustering between normal and diseased eyes for both simulated and experimental human RPE data. We classified mouse genotype and age with the k-Nearest Neighbors algorithm. Radially aligned regions showed different classification power for several cell shape variables. Our proposed methods provide a useful, noninvasive addition to the classification and prognosis of eye disease.
|