221

Weighing Machine Learning Algorithms for Accounting RWISs Characteristics in METRo: A comparison of Random Forest, Deep Learning & kNN

Landmér Pedersen, Jesper January 2019 (has links)
The numerical model for forecasting road conditions, the Model of the Environment and Temperature of Roads (METRo), laid the foundation for solving the energy balance and calculating the temperature evolution of roads. METRo does this by providing a numerical modelling system that makes use of Road Weather Information Stations (RWIS) and meteorological projections. While METRo accommodates tools for correcting errors at each station, such as regional differences or microclimates, this thesis proposes machine learning as a supplement to the METRo forecasts to account for station characteristics. Controlled experiments were conducted comparing four regression algorithms (recurrent and dense neural networks, random forest, and k-nearest neighbour) for predicting the squared deviation of METRo-forecasted road surface temperatures. The results reveal that the models utilising the random forest algorithm yielded the most reliable predictions of METRo deviations. However, the study also presents the promise of neural networks and the possible advantage of the seasonal adjustments the networks could offer.
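As a rough sketch of the setup described above, the following trains a random forest regressor to predict the squared deviation of forecasted road surface temperatures; the data and feature names are synthetic stand-ins, not the thesis's actual RWIS variables.

```python
# Hypothetical sketch: predict the squared deviation of METRo road surface
# temperature forecasts from station-level features. All features below are
# illustrative assumptions, not the thesis's variable list.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for RWIS observations and METRo output
X = np.column_stack([
    rng.normal(0, 5, n),    # air temperature (C)
    rng.normal(0, 3, n),    # dew point (C)
    rng.uniform(0, 15, n),  # wind speed (m/s)
    rng.normal(0, 4, n),    # METRo forecasted surface temperature (C)
])
# Target: squared deviation of the forecast (synthetic relationship)
squared_deviation = (0.3 * X[:, 3] - 0.1 * X[:, 2] + rng.normal(0, 1, n)) ** 2

X_train, X_test, y_train, y_test = train_test_split(
    X, squared_deviation, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```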
222

A comparative study of social bot classification techniques

Örnbratt, Filip, Isaksson, Jonathan, Willing, Mario January 2019 (has links)
With social media rising in popularity over recent years, so-called social bots are infiltrating platforms, spamming and manipulating people all over the world. Many different methods have been presented to solve this problem, with varying success. This study compares some of these methods on a dataset of Twitter account metadata to provide helpful information to companies deciding how to address the problem. Two machine learning algorithms, the supervised random forest algorithm and the unsupervised k-means algorithm, and a human survey are compared on their ability to classify accounts. Two ways of running these algorithms are also evaluated: the machine-learning-as-a-service platform BigML and the Python library Scikit-learn. Additionally, the metadata features most valuable to the supervised algorithm and to the human survey respondents are compared. Results show that supervised machine learning is the superior technique for social bot identification, with an accuracy of almost 99%. In conclusion, the right choice depends on the company's expertise and on whether a relevant training dataset is available, but in most cases supervised machine learning is recommended.
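A minimal sketch of the two compared algorithms in scikit-learn, which the study itself evaluates; the account-metadata features and their distributions are invented for illustration only.

```python
# Hedged sketch: supervised random forest vs. unsupervised k-means on
# synthetic Twitter-style account metadata. Feature meanings are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 2000
is_bot = rng.integers(0, 2, n)
X = np.column_stack([
    rng.poisson(50 + 400 * is_bot),        # statuses per day
    rng.poisson(200 - 150 * is_bot),       # follower count
    rng.uniform(0, 1, n) + 0.4 * is_bot,   # fraction of tweets with links
])

X_tr, X_te, y_tr, y_te = train_test_split(X, is_bot, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
print("random forest accuracy:", accuracy_score(y_te, rf.predict(X_te)))

# k-means sees no labels; score it by mapping clusters to the majority class
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_tr)
clusters = km.predict(X_te)
acc = max(accuracy_score(y_te, clusters), accuracy_score(y_te, 1 - clusters))
print("k-means accuracy:", acc)
```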
223

Fraud or Not?

Åkerblom, Thea, Thor, Tobias January 2019 (has links)
This paper uses statistical learning to examine and compare three statistical methods for predicting credit card fraud: logistic regression, k-nearest neighbour (k-NN) and random forest. They are applied and estimated on a data set of nearly 300,000 credit card transactions, with fraud classification as the outcome variable, to determine their performance. The three models have different properties and advantages. The k-NN model performed best in this paper, but it has the disadvantage of predicting the outcome accurately without explaining the data. Random forest explains the variables but performs less precisely. The logistic regression model appears unfit for this particular data set.
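The three compared methods map directly onto standard scikit-learn estimators; the following sketch runs them on a synthetic imbalanced dataset standing in for the roughly 300,000 real transactions, which are not reproduced here.

```python
# Illustrative comparison of the three methods on a synthetic, imbalanced
# fraud-style dataset (~2% positive class, mimicking rare fraud).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, model.predict(X_te), digits=3))
```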
224

Random forest och glesa datarepresentationer / Random forest using sparse data structures

Linusson, Henrik, Rudenwall, Robin, Olausson, Andreas January 2012 (has links)
In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rates, this process has become an important part of modern drug development. There are various ways of representing molecules; the problem that motivated this paper derives from collecting substructures of the chemical into what are known as fractional representations. Assembling large sets of molecules represented in this way results in sparse data, where a large portion of the set is null values. This consumes an excessive amount of computer memory, which limits the size of the data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for evaluating random forest implementations to be used for in silico predictive modelling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were made comparing our implementation to existing machine learning algorithms to establish its correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse datasets, with lower memory overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations it was evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets. / Program: Systemarkitekturutbildningen
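The memory argument can be illustrated with an off-the-shelf random forest trained directly on a SciPy sparse matrix; note this uses scikit-learn's forests (which accept CSR/CSC input), not the thesis's custom implementation.

```python
# Sketch of the memory argument: a random forest fit on sparse input without
# densifying it. The data are random stand-ins for fractional representations.
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_molecules, n_substructures = 2000, 5000
# ~1% non-zero entries, mimicking sparse substructure counts
X_sparse = sp.random(n_molecules, n_substructures, density=0.01,
                     format="csr", random_state=0)
y = rng.integers(0, 2, n_molecules)

print("sparse storage (bytes):", X_sparse.data.nbytes)
print("dense storage (bytes): ", X_sparse.toarray().nbytes)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_sparse, y)  # trains on the sparse matrix directly
```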
225

Transfer Learning for Medication Adherence Prediction from Social Forums Self-Reported Data

Kyle Haas (5931056) 17 January 2019 (has links)
Medication non-adherence and non-compliance left unaddressed can compound into severe medical problems for patients. Identifying patients that are likely to become non-adherent can help reduce these problems. Despite these benefits, monitoring adherence at scale is cost-prohibitive. Social forums offer an easily accessible, affordable, and timely alternative to the traditional methods based on claims data. This study investigates the potential of medication adherence prediction based on social forum data for diabetes and fibromyalgia therapies by using transfer learning from the Medical Expenditure Panel Survey (MEPS).

Predictive adherence models are developed by using both survey and social forums data and different random forest (RF) techniques. The first of these implementations uses binned inputs from k-means clustering. The second technique is based on ternary trees instead of the widely used binary decision trees. These techniques are able to handle missing data, a prevalent characteristic of social forums data.

The results of this study show that transfer learning between survey models and social forum models is possible. Using MEPS survey data and the techniques listed above to derive RF models, less than 5% difference in accuracy was observed between the MEPS test dataset and the social forum test dataset. Along with these RF techniques, another RF implementation with imputed means for the missing values was developed and shown to predict adherence for social forum patients with an accuracy >70%.

This thesis shows that a model trained with verified survey data can be used to complement traditional medical adherence models by predicting adherence from unverified, self-reported data in a dynamic and timely manner. Furthermore, this model provides a method for discovering objective insights from subjective social reports. Additional investigation is needed to improve the prediction accuracy of the proposed model and to assess biases that may be inherent to self-reported adherence measures in social health networks.
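A hedged sketch of the transfer setup: k-means bins and a random forest are fit on a source ("survey") sample and applied to a shifted target ("forum") sample. All data and feature meanings below are synthetic assumptions, not MEPS or social forum variables.

```python
# Sketch: fit k-means feature bins and a random forest on the source domain,
# then evaluate on a mildly shifted target domain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_domain(n, shift):
    # Four generic numeric features; labels depend on their sum
    X = rng.normal(shift, 1.0, size=(n, 4))
    y = (X.sum(axis=1) + rng.normal(0, 1, n) > 4 * shift).astype(int)
    return X, y

X_survey, y_survey = make_domain(3000, shift=1.0)
X_forum, y_forum = make_domain(500, shift=1.2)  # domain shift

# Bin each feature with k-means fitted on the source (survey) domain only
binners = [KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_survey[:, [j]])
           for j in range(X_survey.shape[1])]

def binned(X):
    return np.column_stack([b.predict(X[:, [j]]) for j, b in enumerate(binners)])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(binned(X_survey), y_survey)
print("survey accuracy:", accuracy_score(y_survey, rf.predict(binned(X_survey))))
print("forum accuracy: ", accuracy_score(y_forum, rf.predict(binned(X_forum))))
```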
226

Machine Learning : for Barcode Detection and OCR

Fridolfsson, Olle January 2015 (has links)
Machine learning can be utilized in many different ways in the field of automatic manufacturing and logistics. In this thesis, supervised machine learning has been utilized to train classifiers for detection and recognition of objects in images. The techniques AdaBoost and random forest, both based on decision trees, have been examined. The thesis considers two applications: barcode detection and optical character recognition (OCR). Supervised machine learning methods are highly appropriate in both applications, since both barcodes and printed characters generally are rather distinguishable. The first part of the thesis examines the use of machine learning for barcode detection in images, both traditional 1D barcodes and the more recent Maxi-codes, a type of two-dimensional barcode. Here the focus has been on training classifiers with AdaBoost. Maxi-code detection is mainly done with local binary pattern features, while features for 1D-code detection are calculated from the structure tensor. The classifiers have been evaluated on around 200 real test images containing barcodes and show promising results. The second part of the thesis involves optical character recognition. The focus here has been on training a random forest classifier using point pair features and comparing its performance with the more proven and widely used Haar features. Although the results show that Haar features are superior in terms of accuracy, the conclusion is that point pairs can be utilized as features for random forest in OCR.
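A minimal sketch of the boosting configuration named above, AdaBoost over shallow decision trees; the features are generic placeholders rather than local binary pattern or structure tensor responses.

```python
# Hedged sketch: AdaBoost with decision stumps as weak learners
# (the `estimator` keyword assumes scikit-learn >= 1.2).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Generic stand-in for image-patch feature vectors
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200,
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```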
227

Statistical Analysis, Modeling, and Algorithms for Pharmaceutical and Cancer Systems

Choi, Bong-Jin 27 May 2014 (has links)
The aim of the present study is to develop statistical algorithms and models associated with breast and lung cancer patients. In this study, we developed several statistical software packages, R packages, and models using our new statistical approach. First, we used the five-parameter logistic model for determining the optimal doses of pharmaceutical drugs, including dynamic initial points, an automatic process for outlier detection, and an algorithm that develops a graphical user interface (GUI) program. The developed statistical procedure assists medical scientists by reducing the time needed to determine the optimal dose of new drugs, and can also easily identify which drugs need more experimentation. Secondly, we developed a new classification method that is very useful in the health sciences. We used a new decision tree algorithm and a random forest method to rank our variables and to build a final decision tree model. The decision tree can identify and communicate complex data systems to scientists with minimal knowledge of statistics. Thirdly, we developed statistical packages using the Johnson SB probability distribution, which is important in parametrically studying a variety of health, environmental, and engineering problems. Scientists experience difficulties in obtaining estimates for the four parameters of this probability distribution. The developed algorithm combines several statistical procedures, such as Newton-Raphson, bisection, least squares estimation, and the regression method, into an R package with functions that generate random numbers, calculate probabilities and inverse probabilities, and estimate the four parameters of the Johnson SB probability distribution. Researchers can use the developed R package to build their own statistical models or perform desirable statistical simulations. The final aspect of the study involves building a statistical model for lung cancer survival time. In developing this model, we have taken into consideration the number of cigarettes the patient smoked per day, the duration of smoking, and the age at diagnosis of lung cancer; the response variable is the survival time, and the significant factors include interactions. The probability density function of the survival times has been obtained and the survival function determined. The analysis is based on groups involving gender and smoking factors, and a comparison with the ordinary survival function is given.
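SciPy already exposes a Johnson SB distribution, so the package's core functionality (random numbers, probabilities, inverse probabilities, and four-parameter estimation) can be sketched as follows; this is an illustration using scipy.stats.johnsonsb, not the author's R package.

```python
# Sketch: Johnson SB random variates, probabilities, quantiles, and
# maximum-likelihood estimation of the four parameters via SciPy.
from scipy.stats import johnsonsb

a, b, loc, scale = 1.0, 2.0, 0.0, 10.0  # two shape params, location, scale
sample = johnsonsb.rvs(a, b, loc=loc, scale=scale, size=5000, random_state=0)

print("P(X <= 3):", johnsonsb.cdf(3.0, a, b, loc=loc, scale=scale))
print("median:   ", johnsonsb.ppf(0.5, a, b, loc=loc, scale=scale))

# Estimate all four parameters from the sample
a_hat, b_hat, loc_hat, scale_hat = johnsonsb.fit(sample)
print("fitted parameters:", a_hat, b_hat, loc_hat, scale_hat)
```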
228

Evaluating the accuracy of imputed forest biomass estimates at the project level

Gagliasso, Donald 01 October 2012 (has links)
Various methods have been used to estimate the amount of above-ground forest biomass across landscapes and to create biomass maps for specific stands or pixels across ownership or project areas. Without an accurate estimation method, land managers might end up with incorrect biomass maps, which could lead them to make poorer decisions in their future management plans. Previous research has shown that nearest-neighbor imputation methods can accurately estimate forest volume across a landscape by relating variables of interest to ground data, satellite imagery, and light detection and ranging (LiDAR) data. Alternatively, parametric models, such as linear and non-linear regression and geographically weighted regression (GWR), have been used to estimate net primary production and tree diameter. The goal of this study was to compare various imputation methods for predicting forest biomass at a project planning scale (<20,000 acres) on the Malheur National Forest in eastern Oregon, USA. I compared the predictive performance of: 1) linear regression, GWR, gradient nearest neighbor (GNN), most similar neighbor (MSN), random forest imputation, and k-nearest neighbor (k-nn) for estimating biomass (tons/acre) and basal area (sq. feet per acre) across 19,000 acres on the Malheur National Forest, and 2) MSN and k-nn when imputing forest biomass at spatial scales ranging from 5,000 to 50,000 acres. To test the imputation methods, a combination of ground inventory plots, LiDAR data, satellite imagery, and climate data was analyzed, and the root mean square error (RMSE) and bias were calculated. Results indicate that for biomass prediction, k-nn (k=5) had the lowest RMSE and least bias, followed by k-nn (k=3), the GWR model, and random forest imputation; the GNN method was the least accurate. For basal area prediction, the GWR model had the lowest RMSE and least bias, followed by k-nn (k=5), k-nn (k=3), and the random forest method; the GNN method was again the least accurate. The accuracy of MSN, the imputation method currently used by the Malheur National Forest, and of k-nn (k=5), the most accurate imputation method from the second chapter, were then compared over six spatial scales: 5,000, 10,000, 20,000, 30,000, 40,000, and 50,000 acres. The root mean square difference (RMSD) and bias were calculated for each spatial scale sample to determine which method was more accurate. MSN was found to be more accurate at the 5,000 through 40,000 acre scales, while k-nn (k=5) was more accurate at the 50,000 acre scale. / Graduation date: 2013
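An illustrative k-nn (k=5) prediction of plot biomass with the two reported error measures, RMSE and bias; the predictors are invented LiDAR-style stand-ins, not the study's variable list.

```python
# Sketch: k-nn regression of biomass from LiDAR-style predictors,
# scored by RMSE and bias as in the study.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
height_p95 = rng.uniform(5, 45, n)   # LiDAR 95th-percentile height (m)
cover = rng.uniform(0.1, 0.95, n)    # canopy cover fraction
biomass = 2.5 * height_p95 * cover + rng.normal(0, 8, n)  # tons/acre

X = np.column_stack([height_p95, cover])
X_tr, X_te, y_tr, y_te = train_test_split(X, biomass, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
pred = knn.predict(X_te)
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
bias = np.mean(pred - y_te)
print(f"RMSE: {rmse:.2f} tons/acre, bias: {bias:.2f} tons/acre")
```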
229

Robust Image Segmentation applied to Magnetic Resonance and Ultrasound Images of the Prostate

Ghose, Soumya 01 October 2012 (has links) (PDF)
Prostate segmentation in trans-rectal ultrasound (TRUS) and magnetic resonance images (MRI) facilitates volume estimation, multi-modal image registration, surgical planning and image-guided prostate biopsies. The objective of this thesis is to develop shape and region prior deformable models for accurate, robust and computationally efficient prostate segmentation in TRUS and MRI images. The primary contribution of this thesis is in adopting a probabilistic learning approach to achieve soft classification of the prostate for automatic initialization and evolution of shape and region prior deformable models for prostate segmentation in TRUS images. Two deformable models are developed for this purpose. An explicit shape and region prior deformable model is derived from principal component analysis (PCA) of the contour landmarks obtained from the training images and PCA of the probability distribution inside the prostate region. An implicit deformable model is derived from PCA of the signed distance representation of the labeled training data, with curve evolution guided by the energy minimization framework of the Mumford-Shah (MS) functional; region-based energy is determined from region-based statistics of the posterior probabilities. A graph cut energy minimization framework is adopted for prostate segmentation in MRI. Posterior probabilities obtained in a supervised learning schema and from a probabilistic segmentation of the prostate using an atlas are fused in the logarithmic domain to reduce segmentation error. Finally, a graph cut energy minimization in the stochastic framework achieves prostate segmentation in MRI. Statistically significant improvements in segmentation accuracy are achieved compared to some of the works in the literature. Stochastic representation of the prostate region and use of the probabilities in optimization significantly improve segmentation accuracy.
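The explicit shape prior rests on PCA of contour landmarks: the mean shape plus a few modes of variation constrain the deformable model to plausible contours. A compact sketch on synthetic ellipse-like contours, standing in for prostate boundaries:

```python
# Sketch: a PCA shape model over flattened (x1, y1, ..., xN, yN) landmarks.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_shapes, n_landmarks = 40, 32
theta = np.linspace(0, 2 * np.pi, n_landmarks, endpoint=False)

shapes = []
for _ in range(n_shapes):
    rx, ry = rng.normal(1.0, 0.1), rng.normal(0.7, 0.1)  # per-subject radii
    contour = np.column_stack([rx * np.cos(theta), ry * np.sin(theta)])
    shapes.append(contour.ravel())  # flatten landmark coordinates
shapes = np.array(shapes)

pca = PCA(n_components=5)
coeffs = pca.fit_transform(shapes)
print("variance explained by 5 modes:", pca.explained_variance_ratio_.sum())

# Any shape in the model is the mean plus a weighted sum of the modes,
# which is what keeps the deformable model within plausible shapes.
reconstructed = pca.mean_ + coeffs[0] @ pca.components_
print("reconstructed landmark array shape:", reconstructed.reshape(-1, 2).shape)
```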
230

Automatic Red Tide Detection using MODIS Satellite Images

Cheng, Wijian 08 June 2009 (has links)
Red tides pose a significant economic and environmental threat in the Gulf of Mexico, and detecting them is important for understanding the phenomenon. In this thesis, machine learning approaches based on random forests, support vector machines and k-nearest neighbors are evaluated for red tide detection from MODIS satellite images, with detection results compared to ship-collected ground truth red tide data. This work has three major contributions. First, the machine learning approaches outperformed two of the latest thresholding red tide detection algorithms based on bio-optical characterization by more than 10% in terms of F-measure and more than 4% in terms of area under the ROC curve, and they are effective in more locations on the West Florida Shelf. Second, the thresholds developed in recent thresholding methods were introduced as input attributes to the machine learning approaches, and this strategy improved the F-measures of the random forest and k-nearest neighbors approaches. Third, voting across the machine learning and thresholding methods achieved better performance than machine learning alone, which implies that a combination of machine learning models and bio-optical characterization thresholding methods can be used to obtain effective red tide detection results.
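A hedged sketch of the second contribution, feeding a thresholding method's output to a random forest as an extra input attribute and scoring with F-measure and ROC AUC; the spectral bands and threshold values below are invented placeholders, not the thesis's bio-optical quantities.

```python
# Sketch: raw attributes plus a threshold-based detection flag as features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
n = 4000
chlorophyll = rng.lognormal(0.0, 1.0, n)  # stand-in MODIS-derived band
backscatter = rng.normal(1.0, 0.3, n)
red_tide = ((chlorophyll > 1.5) & (backscatter < 1.0)).astype(int)
red_tide ^= (rng.uniform(size=n) < 0.05).astype(int)  # 5% label noise

# The thresholding method's binary output becomes an extra attribute
threshold_flag = (chlorophyll > 1.5).astype(float)
X = np.column_stack([chlorophyll, backscatter, threshold_flag])

X_tr, X_te, y_tr, y_te = train_test_split(X, red_tide, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
prob = rf.predict_proba(X_te)[:, 1]
print("F-measure:", f1_score(y_te, rf.predict(X_te)))
print("ROC AUC:  ", roc_auc_score(y_te, prob))
```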
