Global ETD Search

231	Evaluating the accuracy of imputed forest biomass estimates at the project level Gagliasso, Donald 01 October 2012 Various methods have been used to estimate the amount of above ground forest biomass across landscapes and to create biomass maps for specific stands or pixels across ownership or project areas. Without an accurate estimation method, land managers might end up with incorrect biomass estimate maps, which could lead them to make poorer decisions in their future management plans. Previous research has shown that nearest-neighbor imputation methods can accurately estimate forest volume across a landscape by relating variables of interest to ground data, satellite imagery, and light detection and ranging (LiDAR) data. Alternatively, parametric models, such as linear and non-linear regression and geographic weighted regression (GWR), have been used to estimate net primary production and tree diameter. The goal of this study was to compare various imputation methods to predict forest biomass, at a project planning scale (<20,000 acres) on the Malheur National Forest, located in eastern Oregon, USA. In this study I compared the predictive performance of 1) linear regression, GWR, gradient nearest neighbor (GNN), most similar neighbor (MSN), random forest imputation, and k-nearest neighbor (k-nn) to estimate biomass (tons/acre) and basal area (sq. feet per acre) across 19,000 acres on the Malheur National Forest and 2) MSN and k-nn when imputing forest biomass at spatial scales ranging from 5,000 to 50,000 acres. To test the imputation methods, a combination of ground inventory plots, LiDAR data, satellite imagery, and climate data was analyzed, and the root mean square error (RMSE) and bias of each method were calculated. Results indicate that for biomass prediction, the k-nn (k=5) had the lowest RMSE and least amount of bias. The second most accurate method consisted of the k-nn (k=3), followed by the GWR model, and the random forest imputation. 
The GNN method was the least accurate. For basal area prediction, the GWR model had the lowest RMSE and least amount of bias. The second most accurate method was k-nn (k=5), followed by k-nn (k=3), and the random forest method. The GNN method, again, was the least accurate. The accuracy of MSN, the current imputation method used by the Malheur National Forest, and k-nn (k=5), the most accurate imputation method from the second chapter, were then compared over 6 spatial scales: 5,000, 10,000, 20,000, 30,000, 40,000, and 50,000 acres. The root mean square difference (RMSD) and bias were calculated for each of the spatial scale samples to determine which was more accurate. MSN was found to be more accurate at the 5,000, 10,000, 20,000, 30,000, and 40,000 acre scales. K-nn (k=5) was determined to be more accurate at the 50,000 acre scale. / Graduation date: 2013 LiDAR biomass Geographic Weighted Regression Gradient Nearest Neighbor Most Similar Neighbor random forest
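The comparison in record 231 rests on k-nn imputation scored by RMSE and bias. The following is a minimal sketch of that idea, not the thesis's implementation: it assumes unweighted Euclidean distance in predictor space and a plain mean over the k nearest reference plots, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def knn_impute(ref_X, ref_y, target_X, k=5):
    """Impute each target's response as the mean of its k nearest reference plots.

    Assumption: unweighted Euclidean distance on the raw predictors.
    """
    # Pairwise distances, shape (n_targets, n_references).
    dists = np.linalg.norm(target_X[:, None, :] - ref_X[None, :, :], axis=2)
    # Indices of the k closest reference plots for every target.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return ref_y[nearest].mean(axis=1)

def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def bias(pred, obs):
    """Mean signed difference (prediction minus observation)."""
    return float(np.mean(pred - obs))

if __name__ == "__main__":
    # Synthetic stand-in for plot data; not taken from the study.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 50 + 10 * X[:, 0] + rng.normal(scale=2, size=200)
    ref_X, ref_y, test_X, test_y = X[:150], y[:150], X[150:], y[150:]
    for k in (3, 5):
        pred = knn_impute(ref_X, ref_y, test_X, k=k)
        print(f"k={k}: RMSE={rmse(pred, test_y):.2f}, bias={bias(pred, test_y):.2f}")
```

Holding out plots and scoring the imputed values against them, as in the demo, is one common way to obtain the RMSE and bias figures that such comparisons report.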
232	Robust Image Segmentation applied to Magnetic Resonance and Ultrasound Images of the Prostate Ghose, Soumya 01 October 2012 Prostate segmentation in trans-rectal ultrasound (TRUS) and magnetic resonance images (MRI) facilitates volume estimation, multi-modal image registration, surgical planning and image guided prostate biopsies. The objective of this thesis is to develop shape and region prior deformable models for accurate, robust and computationally efficient prostate segmentation in TRUS and MRI images. The primary contribution of this thesis is in adopting a probabilistic learning approach to achieve soft classification of the prostate for automatic initialization and evolution of shape and region prior deformable models for prostate segmentation in TRUS images. Two deformable models are developed for the purpose. An explicit shape and region prior deformable model is derived from principal component analysis (PCA) of the contour landmarks obtained from the training images and PCA of the probability distribution inside the prostate region. Moreover, an implicit deformable model is derived from PCA of the signed distance representation of the labeled training data, and curve evolution is guided by the energy minimization framework of the Mumford-Shah (MS) functional. Region based energy is determined from region based statistics of the posterior probabilities. A graph cut energy minimization framework is adopted for prostate segmentation in MRI. Posterior probabilities obtained in a supervised learning schema and from a probabilistic segmentation of the prostate using an atlas are fused in the logarithmic domain to reduce segmentation error. Finally, a graph cut energy minimization in the stochastic framework achieves prostate segmentation in MRI. Statistically significant improvements in segmentation accuracy are achieved compared to some of the works in the literature. 
Stochastic representation of the prostate region and use of the probabilities in optimization significantly improve segmentation accuracies. Prostate segmentation TRUS MRI Random forest spectral clustering
233	Automatic Red Tide Detection using MODIS Satellite Images Cheng, Wijian 08 June 2009 Red tides pose a significant economic and environmental threat in the Gulf of Mexico. Detecting red tide is important for understanding this phenomenon. In this thesis, machine learning approaches based on Random Forests, Support Vector Machines and K-Nearest Neighbors have been evaluated for red tide detection from MODIS satellite images. Detection results using machine learning algorithms were compared to ship collected ground truth red tide data. This work has three major contributions. First, machine learning approaches outperformed two of the latest thresholding red tide detection algorithms based on bio-optical characterization by more than 10% in terms of F-measure and more than 4% in terms of area under the ROC curve. Machine learning approaches are effective in more locations on the West Florida Shelf. Second, the thresholds developed in recent thresholding methods were introduced as input attributes to the machine learning approaches, and this strategy improved the F-measures of the Random Forests and K-Nearest Neighbors approaches. Third, voting among the machine learning and thresholding methods achieved better performance than using machine learning alone, which implies that a combination of machine learning models and bio-optical characterization thresholding methods can be used to obtain effective red tide detection results. karenia brevis West Florida Shelf machine learning random forest support vector machine American Studies Arts and Humanities Computer Sciences
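Record 233 reports that voting across detectors improved the F-measure. The abstract does not say how the votes were combined, so the sketch below assumes a simple unweighted majority vote over binary detections; the detector outputs are made-up illustrative values, not results from the thesis.

```python
def f_measure(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def majority_vote(*predictions):
    """Label a sample 1 when more than half of the detectors say 1.

    Assumption: every detector carries equal weight.
    """
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]

if __name__ == "__main__":
    # Hypothetical outputs: 1 = red tide detected, 0 = not detected.
    truth = [1, 1, 0, 0, 1]
    forest = [1, 0, 0, 0, 1]      # stand-in for a machine learning detector
    svm = [1, 1, 1, 0, 0]         # stand-in for a second learner
    threshold = [0, 1, 0, 0, 1]   # stand-in for a thresholding method
    combined = majority_vote(forest, svm, threshold)
    print("forest F:", f_measure(truth, forest))
    print("voted F: ", f_measure(truth, combined))
```

With an odd number of detectors a majority always exists; an even number would need a tie-breaking rule, which this sketch resolves toward 0.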
234	System Complexity Reduction via Feature Selection January 2011 abstract: This dissertation transforms a set of system complexity reduction problems to feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the feature selection bias in tree models. Associative classifiers can achieve high accuracy, but the combination of many rules is difficult to interpret. Rule condition subset selection (RCSS) methods for associative classification are considered. RCSS aims to prune the rule conditions into a subset via feature selection. The subset then can be summarized into rule-based classifiers. Experiments show that classifiers after RCSS can substantially improve the classification interpretability without loss of accuracy. An ensemble feature selection method is proposed to learn Markov blankets for either discrete or continuous networks (without linear, Gaussian assumptions). The method is compared to a Bayesian local structure learning algorithm and to alternative feature selection methods in the causal structure learning problem. Feature selection is also used to enhance the interpretability of time series classification. Existing time series classification algorithms (such as nearest-neighbor with dynamic time warping measures) are accurate but difficult to interpret. This research leverages the time-ordering of the data to extract features, and generates an effective and efficient classifier referred to as a time series forest (TSF). The computational complexity of TSF is only linear in the length of time series, and interpretable features can be extracted. These features can be further reduced, and summarized for even better interpretability. Lastly, two variable importance measures are proposed to reduce the feature selection bias in tree-based ensemble models. 
It is well known that bias can occur when predictor attributes have different numbers of values. Two methods are proposed to solve the bias problem. One uses an out-of-bag sampling method called OOBForest, and the other, based on the new concept of a partial permutation test, is called a pForest. Experimental results show the existing methods are not always reliable for multi-valued predictors, while the proposed methods have advantages. / Dissertation/Thesis / Ph.D. Industrial Engineering 2011 Industrial Engineering Artificial Intelligence Information Technology associative classification attribute importance feature selection random forest time series classification
235	Prediction of mammalian essential genes based on sequence and functional features Kabir, Mitra January 2017 Essential genes are those whose presence is imperative for an organism's survival, whereas the functions of non-essential genes may be useful but not critical. Abnormal functionality of essential genes may lead to defects or death at an early stage of life. Knowledge of essential genes is therefore key to understanding development, maintenance of major cellular processes and tissue-specific functions that are crucial for life. Existing experimental techniques for identifying essential genes are accurate, but most of them are time consuming and expensive. Predicting essential genes using computational methods, therefore, would be of great value as they circumvent experimental constraints. Our research is based on the hypothesis that mammalian essential (lethal) and non-essential (viable) genes are distinguishable by various properties. We examined a wide range of features of Mus musculus genes, including sequence, protein-protein interactions, gene expression and function, and found 75 features that were statistically discriminative between lethal and viable genes. These features were used as inputs to create a novel machine learning classifier, allowing the prediction of a mouse gene as lethal or viable with cross-validation and blind test accuracies of ∼91% and ∼93%, respectively. The prediction results are promising, indicating that our classifier is an effective mammalian essential gene prediction method. We further developed the mouse gene essentiality study by analysing the association between essentiality and gene duplication. Mouse genes were labelled as singletons or duplicates, and their expression patterns over 13 developmental stages were examined. We found that lethal genes originating from duplicates are considerably lower in proportion than singletons. 
At all developmental stages a significantly higher proportion of singletons and lethal genes are expressed than duplicates and viable genes. Lethal genes were also found to be more ancient than viable genes. In addition, we observed that duplicate pairs with similar patterns of developmental co-expression are more likely to be viable; lethal gene duplicate pairs do not have such a trend. Overall, these results suggest that duplicate genes in mouse are less likely to be essential than singletons. Finally, we investigated the evolutionary age of mouse genes across development to see if the morphological hourglass pattern exists in the mouse. We found that in mouse embryos, genes expressed in early and late stages are evolutionarily younger than those expressed in mid-embryogenesis, thus yielding an hourglass pattern. However, the oldest genes are not expressed at the phylotypic stage stated in prior studies, but instead at an earlier time point - the egg cylinder stage. These results question the application of the hourglass model to mouse development. 572
236	Predicting Demographic and Financial Attributes in a Bank Marketing Dataset January 2016 abstract: Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach, in which individual customers are contacted by bank representatives with offers. These telemarketing strategies can be improved in combination with data mining techniques that allow predictability of customer information and interests. In this thesis, bank telemarketing data from a Portuguese banking institution were analyzed to determine the predictability of several client demographic and financial attributes and to find the most contributing factors in each. Data were preprocessed to ensure quality, and then data mining models were generated for the attributes with logistic regression, support vector machine (SVM) and random forest using Orange as the data mining tool. Results were analyzed using precision, recall and F1 score. / Dissertation/Thesis / Masters Thesis Computer Science 2016 Computer science Mathematics Industrial engineering classification data mining logistic regression random forest sensitivity analysis support vector machine
237	Learning from Asymmetric Models and Matched Pairs January 2013 abstract: With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also studied variable selection of matched data sets and proposed a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables. / Dissertation/Thesis / Ph.D. Industrial Engineering 2013 Statistics Computer science Information science asymmetry feature selection matched data random forest stratified data support vector machine
238	Machine learning in logistics : Increasing the performance of machine learning algorithms on two specific logistic problems / Maskininlärning i logistik : Öka prestandan av maskininlärningsalgoritmer på två specifika logistikproblem. Lind Nilsson, Rasmus January 2017 Data Ductus, a multinational IT-consulting company, wants to develop an AI that monitors a logistic system and looks for errors. Once trained enough, this AI will suggest a correction or automatically correct issues if they arise. This project presents how one works with machine learning problems and provides a deeper insight into how cross-validation and regularisation, among other techniques, are used to improve the performance of machine learning algorithms on the defined problem. Three techniques are tested and evaluated in our logistic system on three different machine learning algorithms, namely Naïve Bayes, Logistic Regression and Random Forest. The evaluation of the algorithms leads us to conclude that Random Forest, using cross-validated parameters, gives the best performance on our specific problems, with the other two falling behind in each tested category. It became clear to us that cross-validation is a simple, yet powerful tool for increasing the performance of machine learning algorithms. Machine learning confusion matrix performance random forest naïve bayes logistic regression cross-validation regularisation Computer Sciences Datavetenskap (datalogi)
239	以詞性組合為基礎之中文語言特徵研究 / A Study of Part-of-Speech Pair-based Language Features in Chinese Texts 江易倫, Jiang, Yi Lun Unknown Date In authorship attribution research, the choice of language features has always been important because it affects the overall prediction performance. Most commonly used language features, such as high-frequency words, n-grams, and punctuation, perform well in classification, but the phrases within these features cannot explain the causal relationships or the differences between categories. To address this problem, this thesis proposes three linguistically meaningful language features as auxiliary verification: part-of-speech pairs, negation-degree pairs, and modal-word pairs. Using the texts of the author Lei Chen (雷震) as a baseline, it examines whether these features are applicable in two research directions: "same topic, different authors" and "same author, different topics". The random forest algorithm is used to build the classification models, the OOB error rate is used to evaluate their classification performance, and feature importance values are used to find the weight of each phrase as a decision point. Finally, the thesis aims to identify, from the classification rules, the phrases that are distinctive of different authors and different genres, and to explain them. Authorship attribution Language features Random forest
240	Retrieval of Cloud Top Pressure Adok, Claudia January 2016 In this thesis, two predictive models, the multilayer perceptron and random forest, are evaluated for predicting cloud top pressure. The dataset used in this thesis contains brightness temperatures, reflectances and other useful variables to determine the cloud top pressure from the Advanced Very High Resolution Radiometer (AVHRR) instrument on the two satellites NOAA-17 and NOAA-18 during the time period 2006-2009. The dataset also contains numerical weather prediction (NWP) variables calculated using mathematical models. In the dataset there are also observed cloud top pressure and cloud top height estimates from the more accurate instrument on the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite. The predicted cloud top pressure is converted into an interpolated cloud top height. The predicted pressure and interpolated height are then evaluated against the more accurate observed cloud top pressure and cloud top height from the instrument on the CALIPSO satellite. The predictive models were fitted to the data using different sampling strategies to take into account the performance on individual cloud classes prevalent in the data. The multilayer perceptron is fitted using both the original response, cloud top pressure, and a log-transformed response, which avoids the negative output values that are prevalent when using the original response. Results show that overall the random forest model performs better than the multilayer perceptron in terms of root mean squared error and mean absolute error. neural networks multilayer perceptron random forest regression cloud top pressure cloud top height Computer and Information Sciences Data- och informationsvetenskap
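Record 240 mentions fitting on a log-transformed response so that predicted pressures cannot be negative. The sketch below illustrates only that transformation: it substitutes an ordinary least-squares line for the thesis's multilayer perceptron, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ...; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predictions on the original response scale; may be negative."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def fit_log_response(X, y):
    """Fit the same model to log(y); requires strictly positive responses."""
    return fit_linear(X, np.log(y))

def predict_log_response(coef, X):
    """Back-transform with exp, so every prediction is strictly positive."""
    return np.exp(predict_linear(coef, X))

if __name__ == "__main__":
    # Toy positive response that decays quickly with the predictor.
    X = np.array([[0.0], [1.0], [2.0]])
    y = np.array([100.0, 10.0, 1.0])
    new_X = np.array([[3.0]])
    print("raw response:", predict_linear(fit_linear(X, y), new_X))
    print("log response:", predict_log_response(fit_log_response(X, y), new_X))
```

One trade-off of this design is that errors are minimised on the log scale, so back-transformed predictions can be biased on the original scale.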
