11 |
探討三種分類方法來提升混合方式用在兩階段決策模式的準確率:以旅遊決策為例 / Improving the precision rate of the Two-stage Decision Model in the context of tourism decision-making via exploring Decision Tree, Multi-staged Binary Tree and Back Propagation of Error Neural Network
陳怡倩, Chen, Yi Chien Unknown Date
In a tourism recommendation system, a two-stage data mining technique for classification is needed to connect user perception, decision criteria and decision purpose. The existing literature proposes a hybrid data mining method that combines Decision Tree and K-nearest Neighbour approaches (DTKNN). It achieves a high precision rate of approximately 80% in the K-nearest Neighbour (KNN) stage but a much lower rate in the first stage, which uses a Decision Tree (Fu & Tu, 2011). This leaves room for two potential improvements to the two-stage technique: raising the precision rate of the first stage of DTKNN, and raising efficiency by decreasing the number of questions users must answer when searching the system for a recommendation. This paper investigates how to improve the first stage of DTKNN for full questionnaires and determines, based on precision rate, whether a dynamic questionnaire is suitable for future tourism recommendation systems. First, the study compares Decision Tree, Multi-staged Binary Tree and Back Propagation of Error Neural Network (BPNN) and chooses the method with the highest precision rate. The chosen method is then combined with KNN to form a new methodology. Second, the study determines the suitability of dynamic questionnaires for all three classification methods by decreasing the number of attributes. A suitable dynamic questionnaire uses the fewest attributes while keeping an appropriate precision rate. A tourism recommendation system is the chosen application because tourism selection is a two-stage decision: in the first stage the traveller determines the expected goal and experience of the tour, and in the second stage chooses the tour that best matches the outcome of stage one. The results indicate that, for the full questionnaire, Multi-staged Binary Tree has the highest precision rate at 74.167%, compared with 73.33% for Decision Tree and 65.47% for BPNN. This new approach will improve the effectiveness of the system by improving the first-stage precision rate of the current DTKNN method. For the dynamic questionnaire, Decision Tree is the most suitable method because its precision rate differs least from that of the full questionnaire: 1.33%, as opposed to 1.48% for BPNN and 4% for Multi-staged Binary Tree. The dynamic questionnaire will therefore also improve efficiency by decreasing the number of questions users are required to answer when searching for a recommendation, and it gives users the option of not answering some questions. It also increases the practicality of the non-dynamic questionnaire and therefore affects the ultimate precision rate.
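The two selection rules in this abstract (highest precision for the full questionnaire, smallest precision difference for the dynamic one) can be written out as a short sketch. The numbers are the ones quoted in the abstract; the dictionary and function names are illustrative assumptions, not taken from the thesis.

```python
# Precision rates (%) with the full questionnaire, as quoted in the abstract.
FULL_PRECISION = {
    "Decision Tree": 73.33,
    "Multi-staged Binary Tree": 74.167,
    "BPNN": 65.47,
}

# Reported difference (%) in precision rate between the full and the
# dynamic questionnaire, as quoted in the abstract.
PRECISION_DIFFERENCE = {
    "Decision Tree": 1.33,
    "Multi-staged Binary Tree": 4.0,
    "BPNN": 1.48,
}


def highest_precision(rates):
    """Return the method with the highest precision rate."""
    return max(rates, key=rates.get)


def smallest_difference(differences):
    """Return the method whose precision changes least."""
    return min(differences, key=differences.get)


if __name__ == "__main__":
    print("Full questionnaire:", highest_precision(FULL_PRECISION))
    print("Dynamic questionnaire:", smallest_difference(PRECISION_DIFFERENCE))
```

The two rules pick different methods, which is why the abstract recommends Multi-staged Binary Tree for the full questionnaire and Decision Tree for the dynamic one.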
|
12 |
Inomhuspositionering med bredbandig radio
Gustavsson, Oscar, Miksits, Adam January 2019
This report evaluates whether a higher-dimensional fingerprint vector increases the accuracy of an algorithm for indoor localisation. Many solutions use a Received Signal Strength Indicator (RSSI) to estimate a position. The report studies whether using the Channel State Information (CSI), i.e. the channel's frequency response, benefits accuracy. The localisation algorithm estimates the position of a new measurement by comparing it to previous measurements using k-Nearest Neighbour (k-NN) regression. The mean power was used as RSSI and 100 samples of the frequency response as CSI. Reducing the dimension of the CSI vector with statistical moments and Principal Component Analysis (PCA) was tested. No improvement in accuracy was observed from using a fingerprint vector of higher dimension than RSSI. A standardised Euclidean or Mahalanobis distance measure in the k-NN algorithm seemed to perform better than Euclidean distance. Taking the logarithm of the frequency response samples before any other calculation also seemed to improve accuracy.
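The fingerprinting approach in this abstract, k-NN regression with a standardised Euclidean distance, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the two-dimensional positions and the toy fingerprints are hypothetical, and the thesis may differ in details such as how neighbours are averaged.

```python
import math
from statistics import pstdev


def standardised_euclidean(a, b, scales):
    """Euclidean distance after dividing each feature by its spread."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


def knn_position(fingerprints, positions, query, k=3):
    """Estimate an (x, y) position as the mean of the k nearest fingerprints.

    fingerprints: stored feature vectors (e.g. RSSI or CSI-derived values)
    positions:    the known (x, y) position of each stored fingerprint
    query:        the feature vector of the new measurement
    """
    if not 1 <= k <= len(fingerprints):
        raise ValueError("k must be between 1 and the number of fingerprints")
    # Per-feature standard deviation; constant features fall back to 1.0
    # so that they do not cause a division by zero.
    scales = [
        pstdev([f[d] for f in fingerprints]) or 1.0 for d in range(len(query))
    ]
    ranked = sorted(
        range(len(fingerprints)),
        key=lambda i: standardised_euclidean(fingerprints[i], query, scales),
    )
    nearest = ranked[:k]
    x = sum(positions[i][0] for i in nearest) / k
    y = sum(positions[i][1] for i in nearest) / k
    return (x, y)


if __name__ == "__main__":
    # Toy data: two features per fingerprint, four reference points.
    stored = [[-40.0, 0.1], [-42.0, 0.2], [-70.0, 0.9], [-72.0, 1.0]]
    where = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
    print(knn_position(stored, where, [-41.0, 0.15], k=2))
```

Standardising each feature keeps a feature with a large numeric range, such as power in dBm, from dominating the distance, which is one plausible reason the abstract reports better results than with plain Euclidean distance.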
|
13 |
Prediktion av efterfrågan i filmbranschen baserat på maskininlärning
Liu, Julia, Lindahl, Linnéa January 2018
Machine learning is a central technology in data-driven decision making. This study investigates machine learning for demand forecasting in the motion picture industry from the film exhibitors' perspective. More specifically, it investigates to what extent the technology can help estimate public interest, in terms of revenue levels, in unreleased movies. Three machine learning models are implemented to forecast cumulative revenue levels during the opening weekend of movies released in Sweden in 2010-2017. The forecast is based on ten attributes, ranging from public online user-generated data to movie characteristics such as production budget and cast. The results indicate that the choice of attributes and models in this study was not optimal for the Swedish market: the values obtained for the relevant precision metrics were inadequate, though for valid underlying reasons.
|
14 |
Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease
Duan, Haoyang 15 May 2014
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection and Random Forest for predicting artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections and k-NN. Random Forest is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
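The abstract reports both an accuracy (0.6660) and an area under the ROC curve (0.8562). The two can differ widely because accuracy depends on one decision threshold while the area under the curve only depends on how the scores rank positives against negatives. A minimal sketch of both metrics, with hypothetical function names and toy data that are not from the thesis:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half).
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0]
    y_score = [0.9, 0.4, 0.35, 0.1]
    print("AUC:", roc_auc(y_true, y_score))
    print("Accuracy:", accuracy(y_true, y_score))
```

In the toy data every positive outranks every negative, so the ranking is perfect, yet one positive falls below the 0.5 threshold and is misclassified.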
|
15 |
Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease
Duan, Haoyang January 2014
|
16 |
Detektion och klassificering av äppelmognad i hyperspektrala bilder / Detection And Classification Of Apple Ripening In Hyperspectral Images
Andersson, Fanny, Furugård, Anna January 2021
This work presents a non-destructive method for detecting and classifying the ripeness of apples using hyperspectral images. Determining the ripeness of apples is of interest to, among others, apple growers and juice producers for storage and processing. Apple ripeness is also of interest in plant breeding. Determining ripeness today requires cutting into the fruit, a so-called destructive method. Hyperspectral images are used today in areas such as agriculture, environmental monitoring and military reconnaissance. / The thesis work was carried out at the Department of Science and Technology (ITN) at the Faculty of Science and Engineering, Linköping University.
|
17 |
Credit Scoring using Machine Learning Approaches
Chitambira, Bornvalue January 2022
This project explores machine learning approaches used in credit scoring. The study considers consumer rather than corporate credit scoring. It focuses on methods currently used in practice by banks, such as logistic regression and decision trees, and compares their performance against machine learning approaches such as support vector machines (SVM), neural networks and random forests. The models address important issues such as dataset imbalance, model overfitting and calibration of model probabilities. The six machine learning methods studied are support vector machine, logistic regression, k-nearest neighbour, artificial neural networks, decision trees and random forests. The models are implemented in Python and their performance is analysed on a credit dataset of 30,000 observations from Taiwan, extracted from the University of California Irvine (UCI) Machine Learning Repository.
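The abstract does not say how dataset imbalance was handled. One common heuristic, shown here purely as an assumption and not as the thesis's method, weights each class inversely to its frequency so that the minority class (for example, defaulting customers) contributes as much to the training loss as the majority class. The function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    With these weights every class has the same total weight,
    regardless of how many samples it contains.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Toy labels: 8 non-defaults (0) and 2 defaults (1).
    toy_labels = [0] * 8 + [1] * 2
    print(balanced_class_weights(toy_labels))
```

Many classifiers accept per-class or per-sample weights during training; resampling the data is an alternative that this sketch does not cover.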
|
18 |
Forecasting hourly electricity consumption for sets of households using machine learning algorithms
Linton, Thomas January 2015
To address inefficiency, waste, and the negative consequences of electricity generation, companies and government entities are looking to behavioural change among residential consumers. To drive behavioural change, consumers need better feedback about their electricity consumption. A monthly or quarterly bill provides the consumer with almost no useful information about the relationship between their behaviours and their electricity consumption. Smart meters are now widely deployed in developed countries and can provide electricity consumption readings at an hourly resolution, but this data is mostly used as a basis for billing and not as a tool to help consumers reduce their consumption. One component required to deliver innovative feedback mechanisms is the capability to forecast hourly electricity consumption at the household scale. This thesis evaluates how effectively a selection of kernel-based machine learning methods forecast the hourly aggregate electricity consumption of different-sized sets of households. It demonstrates that k-Nearest Neighbour Regression and Gaussian Process Regression are the most accurate methods within the constraints of the problem considered. In addition to accuracy, the advantages and disadvantages of each machine learning method are evaluated, and a simple comparison of each algorithm's computational performance is made.
|
19 |
Evaluating Random Forest and k-Nearest Neighbour Algorithms on Real-Life Data Sets / Utvärdering av slumpmässig skog och k-närmaste granne algoritmer på verkliga datamängder
Salim, Atheer, Farahani, Milad January 2023
Computers can be used to classify various types of data, for example to filter email messages, detect computer viruses or detect diseases. This thesis explores two classification algorithms, random forest and k-nearest neighbour, to understand how accurately and how quickly they classify data. A literature study was conducted to identify the various prerequisites and to find suitable data sets. Five data sets (leukemia, credit card, heart failure, mushrooms and breast cancer) were gathered and classified by each algorithm. A train split and a 4-fold cross-validation were used for each data set. The Rust library SmartCore, which includes numerous classification methods and tools, was used to perform the classification. The results indicated that the train split gave better classification results than 4-fold cross-validation. However, it could not be determined whether any attributes of a data set affect classification accuracy. Random forest achieved the best classification results on the heart failure and leukemia data sets, whilst k-nearest neighbour achieved the best results on the remaining three. In general, the two algorithms' classification results were similar. The execution time of random forest depended on the number of trees in the "forest": more trees meant a longer execution time. In contrast, a higher k value did not increase the execution time of k-nearest neighbour. Data sets with only binary values (0 and 1) also ran much faster than data sets with arbitrary values when using random forest. A larger number of instances in a data set increased the execution time of random forest even when the number of features was small. The same applied to k-nearest neighbour, where the number of features also affected execution time because time is needed to compute distances between data points. Random forest achieved the fastest execution time on the credit card and mushrooms data sets, whilst k-nearest neighbour executed faster on the remaining three. The difference in execution time between the algorithms varied considerably, depending on the parameter values chosen for each algorithm.
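The 4-fold cross-validation mentioned in this abstract partitions the data into four folds and uses each fold once as the test set while training on the other three. The thesis used the Rust library SmartCore; the sketch below is a library-free Python illustration of the index bookkeeping only, with a hypothetical function name, and it omits the shuffling that is usually applied first.

```python
def k_fold_indices(n_samples, k=4):
    """Return a list of (train_indices, test_indices) pairs, one per fold.

    Fold sizes differ by at most one sample when n_samples is not
    divisible by k. Samples are not shuffled here.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, k=4):
        print("train:", train_idx, "test:", test_idx)
```

Each sample is tested exactly once, so the averaged score uses all of the data for evaluation, whereas a single train/test split evaluates on one held-out portion only.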
|
20 |
Bedömning av elevuppsatser genom maskininlärning / Essay Scoring for Swedish using Machine Learning
Dyremark, Johanna, Mayer, Caroline January 2019
Today, essay scoring makes up a large part of a teacher's workload, and gradings vary considerably between teachers. This report examines what accuracy can be achieved with an automated essay scoring system for Swedish. Three machine learning models for classification are trained and tested with 5-fold cross-validation on essays from Swedish national tests: Linear Discriminant Analysis, K-Nearest Neighbour and Random Forest. Essays are classified based on 31 attributes related to language structure, such as token-based length measures, similarity to texts of different levels of formality, and use of grammar. The results show a maximal quadratic weighted kappa value of 0.4829 and a grade identical to the expert's assessment in 57.53% of all cases. These results were achieved by a model based on Linear Discriminant Analysis, which showed higher inter-rater reliability with expert grading than a local teacher did. Despite ongoing digitalisation within the Swedish educational system, a number of obstacles prevent complete automation of essay scoring, such as users' attitudes, ethical issues and the difficulty current techniques have in understanding semantics. Nevertheless, a partial integration of automatic essay scoring has the potential to identify essays suitable for double grading effectively, which can increase the consistency of large-scale tests at a low cost.
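The quadratic weighted kappa reported in this abstract measures agreement between two sets of ordinal grades and penalises a disagreement by the square of its distance on the grade scale. The sketch below implements the standard definition; the function name and the integer grade encoding (0 to n_classes - 1) are assumptions for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic weighted kappa for two lists of integer grades.

    Grades must be integers in range(n_classes). Returns 1.0 for perfect
    agreement and about 0.0 for agreement no better than chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of grades")
    if n_classes < 2:
        raise ValueError("at least two classes are required")
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count if the two raters graded independently.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0.0:
        # Both raters used a single identical grade throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    expert = [0, 1, 2]
    model = [0, 1, 1]
    print(quadratic_weighted_kappa(expert, model, n_classes=3))
```

Because of the squared weighting, a model that is off by one grade is penalised far less than one that is off by two, which suits ordinal grade scales better than plain exact-match accuracy.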
|