Global ETD Search

161	Dense Neural Network Outperforms Other Machine Learning Models for Scaling-up Lichen Cover Maps in Eastern Canada Richardson, Galen 11 May 2023 (has links) Lichen mapping is vital for caribou management plans and sustainable land conservation. Previous studies have used Random Forest, dense neural network, and convolutional neural network (CNN) models for mapping lichen coverage with remote sensing data. However, to date, it is not clear how these models rank in the performance of this task. In this study, these machine learning models were evaluated on their ability to predict lichen percent coverage in Sentinel-2 imagery covering Québec and Labrador, NL. The models were trained on 10-m resolution lichen coverage (%) maps created from 20 drone surveys collected in July 2019 and 2022. The maps were divided into quadrant blocks and then split into train, validation, and test datasets. The quadrant-blocking approach exposed the models to a variety of different landscapes and reduced spatial autocorrelation between the training sites. All three models performed similarly when evaluated on the test set. However, the dense neural network achieved a higher accuracy than the other two, with a reported Mean Absolute Error (MAE) of 5.2% and an R2 of 0.76. By comparison, the Random Forest model returned an MAE of 5.5% (R2: 0.74) and the CNN had an MAE of 5.3% (R2: 0.74). The models were also evaluated on their ability to predict lichen coverage (%) for larger quadrant blocks consisting of, on average, 400 Sentinel-2 pixels. The Random Forest and dense neural network had an R2 of 0.93, while the CNN had an R2 of 0.90. The MAE in this assessment for the dense neural network, Random Forest, and CNN were 2.1%, 2.3%, and 3.1% respectively. A regional lichen map was created using the dense neural network and a Sentinel-2 image mosaic. Model predictions have larger errors for land covers that the model was not exposed to in training, such as mines and deep lakes. 
While the dense neural network requires more computational effort to train than a Random Forest model, the 5.9% performance gain in the test pixel comparison and 9.1% performance gain in the quadrant block comparison render it the most suitable for lichen mapping. This study represents progress toward determining the appropriate methodology for generating accurate lichen maps from satellite imagery for caribou conservation and sustainable land management. Neural Networks Artificial Intelligence Random Forest Remote Sensing Caribou Lichen Earth Observation
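The two metrics reported in the abstract above, Mean Absolute Error (MAE) and R2, can be illustrated with a minimal sketch. The function names and the five sample values are hypothetical and are not taken from the thesis; the code only shows how the two scores are conventionally defined.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical lichen cover values (%) for five pixels; not data from the thesis.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 50.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"MAE = {mae:.2f} percentage points, R2 = {r2:.3f}")
```

Because MAE is expressed in the units of the target, an MAE of 5.2% here means the prediction is off by about 5.2 percentage points of lichen cover on average, while R2 describes the share of variance explained.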
162	Statistical Tools for Efficient Confirmation of Diagnosis in Patients with Suspected Primary Central Nervous System Vasculitis Brooks, John 27 April 2023 (has links) The management of missing data is a major concern in classification model generation in all fields but poses a particular challenge in situations where there is only a small quantity of sparse data available. In the field of medicine, this is not an uncommon problem. While widely subscribed methodologies like logistic regression can, with minor modifications and potentially much labor, provide reasonable insights from the larger and less sparse datasets that are anticipated when analyzing diagnosis of common conditions, there are a multitude of rare conditions of interest. Primary angiitis of the central nervous system (PACNS) is a rare but devastating entity that, given its range of presenting symptoms, can be suspected in a variety of circumstances. It unfortunately continues to be a diagnosis that is hard to make. Aside from some general frameworks, there isn’t a rigorously defined diagnostic approach as is the case in other more common neuroinflammatory conditions like multiple sclerosis. Instead, clinicians currently rely on experience and clinical judgement to guide the reasonable exclusion of potential inciting entities and mimickers. In effect, this results in a smaller quantity of heterogeneous data that may not be optimally suited to more traditional classification methodology (e.g., logistic regression) without substantial contemplation and justification of appropriate data cleaning / preprocessing. It is therefore challenging to make and analyze systematic approaches that could direct clinicians in a way that standardizes patient care. In this thesis, a machine learning approach was presented to derive quantitatively justified insights into the factors that are most important to consider during the diagnostic process to identify conditions like PACNS. 
Modern categorization techniques (i.e., random forest and support vector machines) were used to generate diagnostic models identifying cases of PACNS from which key elements of diagnostic importance could be identified. A novel variant of a random forest (RF) approach was also demonstrated as a means of managing missing data in a small sample, a significant problem encountered when exploring data on rare conditions without clear diagnostic frameworks. A reduced need to hypothesize the reasons for missingness when generating and applying the novel variant was discussed. The application of such tools to diagnostic model generation for PACNS and other rare and / or emerging diseases, and their ability to provide objective feedback, was explored. This primarily centered around a structured assessment of how to prioritize testing to rapidly rule out conditions that require alternative management, and could be used to support future guidelines to optimize the care of these patients. The material presented herein had three components. The first centered around the example of PACNS. It described, in detail, an example of a relevant medical condition and explored why the data is both rare and sparse. Furthermore, the reasons for the sparsity are heterogeneous or non-monotonic (i.e., not conducive to modelling with a singular model). This component concluded with a search for candidate variables to diagnose the condition by means of a scoping review, for subsequent comparative demonstration of the novel variant of random forest construction that was proposed. The second component discussed machine learning model development and simulated data with varying degrees and patterns of missingness to demonstrate how the models could be applied to data with properties like what would be expected of PACNS-related data. 
Finally, the described techniques were applied to institutional data to separate a subset of patients with suspected PACNS from those with diagnosed PACNS, and future study was proposed to expand upon and ultimately verify these insights. Further development of the novel random forest approach is also discussed. Random Forest Machine Learning Missingness Feature Importance Gini Importance
163	Methods for network intrusion detection : Evaluating rule-based methods and machine learning models on the CIC-IDS2017 dataset Lindstedt, Henrik January 2022 (has links) Network intrusion detection is the task of identifying malicious network traffic. Malicious network traffic is generated when a perpetrator attacks a network or internet-connected device with the intent to disrupt, steal or destroy a service or information. Two approaches to this task are the rule-based method and the use of machine learning. The purpose of this paper was to contribute knowledge on how to evaluate and build better network intrusion detection systems (NIDS). This was fulfilled by comparing the detection ability of two machine learning models, a neural network and a random forest model, with a rule-based NIDS called Snort. The paper describes how the two models and Snort were constructed and how performance metrics were generated on a dataset called CIC-IDS2017. It also describes how we captured our own malicious network traffic and the models' ability to classify that data. The comparisons show that the neural network outperforms Snort and the random forest. We also present four factors that may influence which method should be used for intrusion detection. In addition, we conclude that we see potential in using CIC-IDS2017 to build NIDS based on machine learning. MLP random forest CIC-IDS2017 Snort Intrusion Detection System Information Systems
164	Using Transcriptomic Data to Predict Biomarkers for Subtyping of Lung Cancer Daran, Rukesh January 2021 (has links) Lung cancer is one of the most dangerous types of cancer. Several studies have explored the use of machine learning methods to predict and diagnose this cancer. This study explored the potential of decision tree (DT) and random forest (RF) classification models, in the context of a small transcriptome dataset, for outcome prediction of different subtypes of lung cancer. In the study we compared the three subtypes, adenocarcinomas (AC), small cell lung cancer (SCLC) and squamous cell carcinomas (SCC), with normal lung tissue by applying the two machine learning methods from the caret R package. The DT and RF models and their validation showed different results for each subtype of the lung cancer data. The DT found more features and validated them with better metrics. Analysis of the biological relevance was focused on the identified features for each of the subtypes AC, SCLC and SCC. The DT presented a detailed insight into the biological data, which was essential for classifying a feature as a biomarker. The identified features from this research may serve as potential candidate genes which could be explored further to confirm their role in corresponding lung cancer types and contribute to targeted diagnostics of different subtypes. lung cancer decision tree random forest accuracy cross-validation machine learning Bioinformatics and Systems Biology Bioinformatik och systembiologi
165	Detecting fraudulent users using behaviour analysis / Detektera artificiella användare med hjälp av beteendeanalys Jóhannsson, Jökull January 2017 (has links) With the increased global use of online media platforms, there are more opportunities than ever to misuse those platforms or perpetrate fraud. One such fraud is within the music industry, where perpetrators create automated programs that stream songs to generate revenue or increase the popularity of an artist. With growing annual revenue of the digital music industry, there are significant financial incentives for perpetrators with fraud in mind. The focus of the study is extracting user behavioral patterns and utilising them to train and compare multiple supervised classification methods to detect fraud. The machine learning algorithms examined are Logistic Regression, Support Vector Machines, Random Forest and Artificial Neural Networks. The study compares performance of these algorithms trained on imbalanced datasets carrying different fractions of fraud. The trained models are evaluated using the Precision Recall Area Under the Curve (PR AUC) and an F1-score. Results show that the algorithms achieve similar performance when trained on balanced and imbalanced datasets. They also show that Random Forest outperforms the other methods for all datasets tested in this experiment. / Med den ökande användningen av strömmande media ökar också möjligheterna till missbruk av dessa platformar samt bedrägeri. Ett typiskt fall av bedrägeri är att använda automatiserade program för att strömma media, och därigenom generera intäkter samt att öka en artist popularitet. Med den växande ekonomin kring strömmande media växer också incitamentet till bedrägeriförsök. Denna studies fokus är att finna användarmönster och använda denna kunskap för att träna modeller som kan upptäcka bedrägeriförsök. 
De maskininlärningsalgoritmer som undersökts är Logistic Regression, Support Vector Machines, Random Forest och Artificiella Neurala Nätverk. Denna studie jämför effektiviteten och precisionen av dessa algoritmer, som tränats på obalanserad data som innehåller olika procentandelar av bedrägeriförsök. Modellerna som genererats av de olika algoritmerna har sedan utvärderats med hjälp av Precision Recall Area Under the Curve (PR AUC) och F1-score. Resultaten av studien visar på liknande prestanda mellan modellerna som genererats av de utvärderade algoritmerna. Detta gäller både när de tränats på balanserad såväl som obalanserad data. Resultaten visar också att Random Forestbaserade modeller genererar bättre resultat för alla dataset som testats i detta experiment. fraud machine learning random forest neural network fraud detection music Computer Sciences Datavetenskap (datalogi)
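The abstract above evaluates classifiers on imbalanced data with precision-recall based scores rather than plain accuracy. A minimal sketch of why that choice matters follows; the function name and the ten labels are hypothetical and are not data from the thesis.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced sample: 2 fraudulent users (1) among 10 users.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```

In this toy sample the accuracy looks high because most users are legitimate, while precision, recall and F1 show that only half of the fraud cases are caught and only half of the fraud alerts are correct.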
166	Predicting Attrition in Financial Data with Machine Learning Algorithms / Förutsäga kundförluster i finansdata med maskininlärningstekniker Darnald, Johan January 2018 (has links) For most businesses there are costs involved when acquiring new customers, and having longer relationships with customers is therefore often more profitable. Predicting whether an individual is prone to leave the business is then a useful tool to help any company take actions to mitigate this cost. The event when a person ends their relationship with a business is called attrition or churn. Predicting people's actions is, however, hard, and many different factors can affect their choices. This paper investigates different machine learning methods for predicting attrition in the customer base of a bank. Four different methods are chosen based on the results they have shown in previous research, and these are then tested and compared to find which works best for predicting these events. Four different datasets from two different products and with two different applications are created from real-world data from a European bank. All methods are trained and tested on each dataset. The results of the tests are then evaluated and compared to find what works best. The methods found in previous research to most reliably achieve good results in predicting churn in banking customers are the Support Vector Machine, Neural Network, Balanced Random Forest, and the Weighted Random Forest. The results show that the Balanced Random Forest achieves the best results, with an average AUC of 0.698 and an average F-score of 0.376. The accuracy and precision of the model are concluded not to be sufficient for making definite decisions, but they can be used with other factors such as profitability estimations to improve the effectiveness of any actions taken to prevent the negative effects of churn. / För de flesta företag finns det en kostnad involverad i att skaffa nya kunder. 
Längre relationer med kunder är därför ofta mer lönsamma. Att kunna förutsäga om en kund är nära att lämna företaget är därför ett användbart verktyg för att kunna utföra åtgärder för att minska denna kostnad. Händelsen när en kund avslutar sin relation med ett företag kallas här efter kundförlust. Att förutsäga människors handlingar är däremot svårt och många olika faktorer kan påverka deras val. Denna avhandling undersöker olika maskininlärningsmetoder för att förutsäga kundförluster hos en bank. Fyra metoder väljs baserat på tidigare forskning och dessa testas och jämförs sedan för att hitta vilken som fungerar bäst för att förutsäga dessa händelser. Fyra dataset från två olika produkter och med två olika användningsområden skapas från verklig data ifrån en Europeisk bank. Alla metoder tränas och testas på varje dataset. Resultaten från dessa test utvärderas och jämförs sedan för att få reda på vilken metod som fungerar bäst. Metoderna som enligt tidigare forskning ger de mest pålitliga och bästa resultaten för att förutsäga kundförluster hos banker är stödvektormaskin, neurala nätverk, balanserad slumpmässig skog och vägd slumpmässig skog. Resultatet av testerna visar att en balanserad slumpmässig skog får bäst resultat med en genomsnittlig AUC på 0.698 och ett F-värde på 0.376. Träffsäkerheten och det positiva prediktiva värdet på metoden är inte tillräckligt för att ta definitiva handlingar med men kan användas med andra faktorer så som lönsamhetsuträkningar för att förbättra effektiviteten av handlingar som tas för att minska de negativa effekterna av kundförluster. Machine learning Random Forest Support Vector Machine Neural Network Computer Sciences Datavetenskap (datalogi)
167	A Machine Learning Ensemble Approach to Churn Prediction : Developing and Comparing Local Explanation Models on Top of a Black-Box Classifier / Maskininlärningsensembler som verktyg för prediktering av utträde : En studie i att beräkna och jämföra lokala förklaringsmodeller ovanpå svårförståeliga klassificerare Olofsson, Nina January 2017 (has links) Churn prediction methods are widely used in Customer Relationship Management and have proven to be valuable for retaining customers. To obtain a high predictive performance, recent studies rely on increasingly complex machine learning methods, such as ensemble or hybrid models. However, the more complex a model is, the more difficult it becomes to understand how decisions are actually made. Previous studies on machine learning interpretability have used a global perspective for understanding black-box models. This study explores the use of local explanation models for explaining the individual predictions of a Random Forest ensemble model. The churn prediction was studied on the users of Tink – a finance app. This thesis aims to take local explanations one step further by making comparisons between churn indicators of different user groups. Three sets of groups were created based on differences in three user features. The importance scores of all globally found churn indicators were then computed for each group with the help of local explanation models. The results showed that the groups did not have any significant differences regarding the globally most important churn indicators. Instead, differences were found for globally less important churn indicators, concerning the type of information that users stored in the app. In addition to comparing churn indicators between user groups, the result of this study was a well-performing Random Forest ensemble model with the ability of explaining the reason behind churn predictions for individual users. 
The model proved to be significantly better than a number of simpler models, with an average AUC of 0.93. / Metoder för att prediktera utträde är vanliga inom Customer Relationship Management och har visat sig vara värdefulla när det kommer till att behålla kunder. För att kunna prediktera utträde med så hög säkerhet som möjligt har den senaste forskningen fokuserat på alltmer komplexa maskininlärningsmodeller, såsom ensembler och hybridmodeller. En konsekvens av att ha alltmer komplexa modeller är dock att det blir svårare och svårare att förstå hur en viss modell har kommit fram till ett visst beslut. Tidigare studier inom maskininlärningsinterpretering har haft ett globalt perspektiv för att förklara svårförståeliga modeller. Denna studie utforskar lokala förklaringsmodeller för att förklara individuella beslut av en ensemblemodell känd som 'Random Forest'. Prediktionen av utträde studeras på användarna av Tink – en finansapp. Syftet med denna studie är att ta lokala förklaringsmodeller ett steg längre genom att göra jämförelser av indikatorer för utträde mellan olika användargrupper. Totalt undersöktes tre par av grupper som påvisade skillnader i tre olika variabler. Sedan användes lokala förklaringsmodeller till att beräkna hur viktiga alla globalt funna indikatorer för utträde var för respektive grupp. Resultaten visade att det inte fanns några signifikanta skillnader mellan grupperna gällande huvudindikatorerna för utträde. Istället visade resultaten skillnader i mindre viktiga indikatorer som hade att göra med den typ av information som lagras av användarna i appen. Förutom att undersöka skillnader i indikatorer för utträde resulterade denna studie i en välfungerande modell för att prediktera utträde med förmågan att förklara individuella beslut. Random Forest-modellen visade sig vara signifikant bättre än ett antal enklare modeller, med ett AUC-värde på 0.93. 
Machine learning Ensemble Random forest Churn prediction LIME Interpretability CRM Local explanations Computer Sciences Datavetenskap (datalogi)
168	Uncertainty Analysis : Severe Accident Scenario at a Nordic Nuclear Power Plant Hedly, Josefin, De Young, Mikaela January 2023 (has links) Nuclear Power Plants (NPP) undergo fault and sensitivity analysis with scenario modelling to predict catastrophic events, specifically releases of Cesium 137 (Cs-137). The purpose of this thesis is to find which of 108 input-features from the Modular Accident Analysis Program (MAAP) simulation code are important when there is a large release of Cs-137 emissions. The features are tested all together and in their groupings. To find important features, the machine learning (ML) model Random Forest (RF) is used, since it has a built-in attribute which identifies important features. The results of RF model classification are corroborated with Support Vector Machines (SVM) and K-Nearest Neighbor (KNN), and k-fold cross-validation is used to improve and validate the results, resulting in near 90% accuracy for the three ML models. RF is successful at identifying important features related to Cs-137 emissions, by using the classification model to first identify top features and then further training the models on those important input-features. The discovered input-features are important both within their individual groups and when including all features simultaneously. The large number of features included did not disrupt RF much, but the skewed dataset with few classified extreme events caused the accuracy to be lower, at near 90%. Nuclear power plant microdata analysis Random Forest k-Nearest Neighbor SVM Computer Sciences Datavetenskap (datalogi)
169	Classification of imbalanced disparate medical data using ontology / Klassificering av Obalanserad Medicinsk Data med Ontologier Karlsson, Ludvig, Wilhelm Kopp Sundin, Gustav January 2023 (has links) Through the digitization of healthcare, large volumes of data are generated and stored in healthcare operations. Today, a multitude of platforms and digital infrastructures are used for storage and management of data. The systems lack a common ontology, which limits the interoperability between datasets. Limited interoperability impacts various areas of healthcare, for instance sharing of data between entities and the possibilities for aggregated machine learning research incorporating distributed data. This study examines how a random forest classifier performs on two datasets consisting of phase III clinical trial studies on small-cell lung cancer where the datasets do not share a common ontology. The performance is then compared to the same classifier’s performance on one dataset consisting of a combination of the two previously mentioned sets where a common ontology is implemented. The study does not show unambiguous results indicating that a single ontology yields better performance for the random forest classifier. In addition, the conditions for entities within primary care in Sweden to undergo a transition to a new platform for storage of data are discussed, together with areas for future research. / Till följd av digitaliseringen inom hälso- och sjukvården genereras stora volymer data som lagras och används i verksamheten. Idag används en mängd olika plattformar för lagring och hantering av data. Systemen saknar en gemensam ontologi, vilket begränsar interoperabiliteten mellan datamängderna. Bristande interoperabilitet påverkar olika områden inom hälso- och sjukvården, till exempel delning av data mellan vårdinstanser och möjligheterna för forskning på en aggregerad nivå där maskininlärning används. 
Denna studie undersöker hur en random forest klassificerare presterar på två dataset bestående av fas III kliniska prövningar av småcellig lungcancer där dataseten inte delar en gemensam ontologi. Prestandan jämförs sedan med samma klassificerares prestanda på ett dataset som består av en anslutning mellan de två tidigare nämnda dataseten där en gemensam ontologi har implementerats. Studien visar inte entydiga resultat som indikerar att en gemensam eller icke-gemensam ontologi skapar bättre prestanda för en random forest klassificerare. Vidare diskuteras förutsättningarna och krav på förändringsprocessen för en övergång till Centrum för Datadriven Hälsas föreslagna plattform utifrån en klinik inom primärvårdens perspektiv. Ontology machine learning random forest imbalanced data oncology digital transformation Computer and Information Sciences Data- och informationsvetenskap
170	Tree-Based Methods and a Mixed Ridge Estimator for Analyzing Longitudinal Data With Correlated Predictors Eliot, Melissa Nicole 01 September 2011 (has links) Due to recent advances in technology that facilitate acquisition of multi-parameter defined phenotypes, new opportunities have arisen for predicting patient outcomes based on individual specific cell subset changes. The data resulting from these trials can be a challenge to analyze, as predictors may be highly correlated with each other or related to outcome within levels of other predictor variables. As a result, applying traditional methods like simple linear models and univariate approaches such as odds ratios may be insufficient. In this dissertation, we describe potential solutions including tree-based methods, ridge regression, mixed modeling, and a new estimator called a mixed ridge estimator with expectation-maximization (EM) algorithm. Data examples are provided. In particular, flow cytometry is a method of measuring a large number of particle counts at once by suspending them in a fluid and shining a beam of light onto the fluid. This is specifically relevant in the context of studying human immunodeficiency virus (HIV), where there exists a great potential to draw from the rich array of data on host cell-mediated response to infection and drug exposures, to inform and discover patient level determinants of disease progression and/or response to anti-retroviral therapy (ART). The data sets collected are often high dimensional with correlated columns, which can be challenging to analyze. We demonstrate the application and comparative interpretations of three tree-based algorithms for the analysis of data arising from flow cytometry in the first chapter of this manuscript. Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1 infected persons starting antiretroviral therapy with CD4 count between 200-350 cell/μl. 
The tree-based approaches, namely, classification and regression trees (CART), random forests (RF) and logic regression (LR), were designed specifically to uncover complex structure in high dimensional data settings. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. Specifically, application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T cell activation, may be a better predictor than any single T cell/innate cell subset analyzed. In the following chapter, tree-based methods are compared to each other via a simulation study. Each has its merits in particular circumstances; for example, RF is able to identify the order of importance of predictors regardless of whether there is a tree-like structure. It is able to adjust for correlation among predictors by using a machine learning algorithm, analyzing subsets of predictors and subjects over a number of iterations. CART is useful when variables are predictive of outcome within levels of other variables, and is able to find the most parsimonious model using pruning. LR also identifies structure within the set of predictor variables, and nicely illustrates relationships among variables. However, due to the vast number of combinations of predictor variables that would need to be analyzed in order to find the single best LR tree, an algorithm is used that only searches a subset of potential combinations of predictors. Therefore, results may be different each time the algorithm is used on the same data set. Next we use a regression approach to analyzing data with correlated predictors. Ridge regression is a method of accounting for correlated data by adding a shrinkage component to the estimators for a linear model. 
We perform a simulation study to compare ridge regression to linear regression over various correlation coefficients and find that ridge regression outperforms linear regression as correlation increases. To account for collinearity among the predictors along with longitudinal data, a new estimator that combines the applicability of ridge regression and mixed models using an EM algorithm is developed and compared to the mixed model. We find from a simulation study comparing our mixed ridge (MR) approach with a traditional mixed model that our new mixed ridge estimator is able to handle collinearity of predictor variables better than the mixed model, while accounting for random within-subject effects that regular ridge regression does not take into account. As correlation among predictors increases, power decreases more quickly for the mixed model than MR. Additionally, type I error rate is not significantly elevated when the MR approach is taken. The MR estimator gives us new insight into flow cytometry data and other data sets with correlated predictor variables that our tree-based methods could not give us. These methods all provide unique insight into our data that more traditional methods of analysis do not offer. CART Flow cytometry Logic Regression Mixed model Random Forest Ridge regression Biostatistics
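The shrinkage idea behind the ridge regression mentioned in the abstract above can be sketched for the simplest case of two correlated predictors and no intercept, using the standard closed form (X^T X + lambda I) beta = X^T y. This is not the dissertation's mixed ridge estimator or its EM algorithm; the function name, the data values and the penalty lambda = 1.0 are hypothetical choices for illustration.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors, no intercept.

    lam = 0 gives ordinary least squares; lam > 0 shrinks the coefficients.
    """
    a = sum(v * v for v in x1) + lam           # (X^T X)[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))     # (X^T X)[0][1]
    d = sum(v * v for v in x2) + lam           # (X^T X)[1][1] + lam
    c1 = sum(u * v for u, v in zip(x1, y))     # (X^T y)[0]
    c2 = sum(u * v for u, v in zip(x2, y))     # (X^T y)[1]
    det = a * d - b * b
    # Cramer's rule for the 2x2 linear system.
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)


# Hypothetical data: x2 is almost a copy of x1, so the predictors are
# strongly correlated and the least-squares fit is poorly conditioned.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.2, 3.9]
y = [2.0, 4.1, 6.1, 7.9]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)
print("least squares coefficients:", ols)
print("ridge coefficients:        ", ridge)
```

Adding the penalty to the diagonal makes the system better conditioned when predictors are nearly collinear, at the cost of pulling the coefficient vector toward zero.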

Search results