161

Intertidal resource cultivation over millennia structures coastal biodiversity

Cox, Kieran D. 22 December 2021 (has links)
Cultivation of marine ecosystems began in the early Holocene and has contributed vital resources to humans over millennia. Several more recent cultivation practices, however, erode biodiversity. Emerging lines of evidence indicate that certain resource management practices may promote favourable ecological conditions. Here, I use the co-occurrence of 24 First Nations clam gardens, shellfish aquaculture farms, and unmodified clam beaches to test several hypotheses concerning the ecological implications of managing intertidal bivalve populations. To do so, in 2015 and 2016, I surveyed epifaunal (surface) and bivalve communities and quantified each intertidal site's abiotic conditions, including sediment characteristics and substrate composition. In 2017, I generated three-dimensional models of each site using structure-from-motion photogrammetry and measured several aspects of habitat complexity. Statistical analyses use a combination of non-parametric multivariate statistics, multivariate regression trees, and random forests to quantify the extent to which intertidal resource cultivation structures nearshore biodiversity. Chapter 1 outlines a brief history of humanity's use of marine resources, the transition from extracting to cultivating aquatic taxa, and the emergence of the northeast Pacific’s most prevalent shellfish cultivation practices: clam gardens and shellfish farms. Chapter 2 evaluates the ability of epifaunal community assessment methods to capture species diversity by conducting a paired field experiment using four assessment methods: photo-quadrat, point-intercept, random subsampling, and full-quadrat assessments. Conducting each method concurrently within multiple intertidal sites allowed me to quantify the implications of varying sampling areas, subsampling, and photo surveys on detecting species diversity, abundance, and sample- and coverage-based biodiversity metrics. Species richness, density, and sample-based rarefaction varied between methods, despite assessments occurring at the same locations, with photo-quadrats detecting the lowest estimates and full-quadrat assessments the highest. Abundance estimates were consistent among methods, supporting the use of extrapolation. Coverage-based rarefaction and extrapolation curves confirmed that these dissimilarities were due to differences between the methods, not sample completeness. The top-performing method, random subsampling, was used to conduct Chapter 4’s surveys. Chapter 3 examines the connection between shellfish biomass and the ecological conditions clam gardens and shellfish farms foster. First, I established the methodological implications of varying sediment volume on the detection of bivalve diversity, abundance, shell length, and sample- and coverage-based biodiversity metrics. Similar to Chapter 2, this examination identified the most suitable method, which I used during the 2015 and 2016 bivalve surveys. The analyses quantified several interactions between each site's abiotic conditions and biological communities, including the influence of substrate composition, sediment characteristics, and physical complexity on bivalve communities, and whether bivalve richness and habitat complexity facilitate increases in bivalve biomass. Chapter 4 quantifies the extent to which managing intertidal bivalves enhances habitat complexity, fostering increased diversity in the epifaunal communities. This chapter combines 2015, 2016, and 2017 surveys of the sites' epifaunal communities and habitat complexity metrics, including fractal dimension at four resolutions and linear rugosity. Clam gardens enhance fine- and broad-scale complexity, while shellfish farms primarily increase fine-scale complexity, allowing for insights into parallel and divergent community responses. Chapter 5 presents an overview of shellfish as a marine subsidy to coastal terrestrial ecosystems along the Pacific coast of North America. I identified the vectors that transport shellfish-derived nutrients into coastal terrestrial environments, including birds, mammals, and over 13,000 years of marine resource use by local people. I also examined the abundance of shellfish-derived nutrients transported, the prolonged persistence of shellfish subsidies once deposited within terrestrial ecosystems, and the ecological implications for recipient ecosystems. Chapter 6 contextualizes the preceding chapters relative to the broader literature. The objective is to provide insight into how multiple shellfish cultivation systems influence biological communities and how ecological mechanisms facilitate biotic responses, and to summarize the implications for conservation planning, Indigenous resource sovereignty, and biodiversity preservation. It also explores future work, specifically the need to support efforts that pair Indigenous knowledge and ways of knowing with Western scientific insights to address conservation challenges. / Graduate / 2022-12-13
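One of the habitat-complexity metrics named above, linear rugosity, has a simple working definition: the chain (contoured) length of an elevation profile divided by its straight-line length. The sketch below is a minimal illustration of that calculation on a hypothetical transect; the profile values and the helper name are invented for the example and are not drawn from the thesis.

```python
import numpy as np

def linear_rugosity(x, z):
    """Linear rugosity: summed segment length of an elevation profile
    divided by its straight-line horizontal extent.
    A perfectly flat transect gives 1.0; rougher surfaces give larger values."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    chain = np.sum(np.hypot(np.diff(x), np.diff(z)))  # contoured length
    straight = x[-1] - x[0]                           # horizontal extent
    return chain / straight

# Hypothetical 1-m transect sampled every 10 cm (values are illustrative).
x = np.arange(0.0, 1.01, 0.1)
z = np.array([0.00, 0.05, 0.02, 0.10, 0.04, 0.12, 0.03, 0.08, 0.01, 0.06, 0.00])
print(f"rugosity = {linear_rugosity(x, z):.3f}")
```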
162

Dense Neural Network Outperforms Other Machine Learning Models for Scaling-up Lichen Cover Maps in Eastern Canada

Richardson, Galen 11 May 2023 (has links)
Lichen mapping is vital for caribou management plans and sustainable land conservation. Previous studies have used Random Forest, dense neural network, and convolutional neural network (CNN) models for mapping lichen coverage with remote sensing data. However, to date, it is not clear how these models rank in the performance of this task. In this study, these machine learning models were evaluated on their ability to predict lichen percent coverage in Sentinel-2 imagery covering Québec and Labrador, NL. The models were trained on 10-m resolution lichen coverage (%) maps created from 20 drone surveys collected in July 2019 and 2022. The maps were divided into quadrant blocks and then split into train, validation, and test datasets. The quadrant-blocking approach exposed the models to a variety of different landscapes and reduced spatial autocorrelation between the training sites. All three models performed similarly when evaluated on the test set. However, the dense neural network achieved a higher accuracy than the other two, with a reported Mean Absolute Error (MAE) of 5.2% and an R2 of 0.76. By comparison, the Random Forest model returned an MAE of 5.5% (R2: 0.74) and the CNN had an MAE of 5.3% (R2: 0.74). The models were also evaluated on their ability to predict lichen coverage (%) for larger quadrant blocks consisting of, on average, 400 Sentinel-2 pixels. The Random Forest and dense neural network had an R2 of 0.93, while the CNN had an R2 of 0.90. The MAEs in this assessment for the dense neural network, Random Forest, and CNN were 2.1%, 2.3%, and 3.1%, respectively. A regional lichen map was created using the dense neural network and a Sentinel-2 image mosaic. Model predictions have larger errors for land covers that the model was not exposed to in training, such as mines and deep lakes. While the dense neural network requires more computational effort to train than a Random Forest model, the 5.9% performance gain in the test pixel comparison and the 9.1% performance gain in the quadrant block comparison render it the most suitable for lichen mapping. This study represents progress toward determining the appropriate methodology for generating accurate lichen maps from satellite imagery for caribou conservation and sustainable land management.
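As a rough illustration of the model comparison described above, the sketch below fits a Random Forest regressor and a small dense (fully connected) network to synthetic stand-in features and scores them with MAE and R2. The synthetic data, the plain random split, and the hyperparameters are assumptions for brevity; the thesis itself used Sentinel-2 bands, drone-derived labels, and a quadrant-block split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5000, 10))  # stand-in for Sentinel-2 band reflectances
y = np.clip(60 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 5, 5000), 0, 100)  # percent lichen cover

# The thesis split by quadrant blocks to limit spatial autocorrelation;
# a plain random split is used here only to keep the sketch short.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "dense network": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MAE = {mean_absolute_error(y_te, pred):.2f}%  R2 = {r2_score(y_te, pred):.2f}")
```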
163

Statistical Tools for Efficient Confirmation of Diagnosis in Patients with Suspected Primary Central Nervous System Vasculitis

Brooks, John 27 April 2023 (has links)
The management of missing data is a major concern in classification model generation in all fields but poses a particular challenge in situations where there is only a small quantity of sparse data available. In the field of medicine, this is not an uncommon problem. While widely subscribed methodologies like logistic regression can, with minor modifications and potentially much labor, provide reasonable insights from the larger and less sparse datasets that are anticipated when analyzing the diagnosis of common conditions, there are a multitude of rare conditions of interest. Primary angiitis of the central nervous system (PACNS) is a rare but devastating entity that, given its range of presenting symptoms, can be suspected in a variety of circumstances. It unfortunately continues to be a diagnosis that is hard to make. Aside from some general frameworks, there isn't a rigorously defined diagnostic approach, as there is for other, more common neuroinflammatory conditions like multiple sclerosis. Instead, clinicians currently rely on experience and clinical judgement to guide the reasonable exclusion of potential inciting entities and mimickers. In effect, this results in a smaller quantity of heterogeneous data that may not be optimally suited to more traditional classification methodology (e.g., logistic regression) without substantial contemplation and justification of appropriate data cleaning / preprocessing. It is therefore challenging to develop and analyze systematic approaches that could direct clinicians in a way that standardizes patient care. In this thesis, a machine learning approach was presented to derive quantitatively justified insights into the factors that are most important to consider during the diagnostic process to identify conditions like PACNS. Modern categorization techniques (i.e., random forest and support vector machines) were used to generate diagnostic models identifying cases of PACNS from which key elements of diagnostic importance could be identified. A novel variant of a random forest (RF) approach was also demonstrated as a means of managing missing data in a small sample, a significant problem encountered when exploring data on rare conditions without clear diagnostic frameworks. A reduced need to hypothesize the reasons for missingness when generating and applying the novel variant was discussed. The application of such tools to diagnostic model generation for PACNS and other rare and / or emerging diseases, and their ability to provide objective feedback, was explored. This primarily centered around a structured assessment of how to prioritize testing to rapidly rule out conditions that require alternative management, and could be used to support future guidelines to optimize the care of these patients. The material presented herein has three components. The first centers around the example of PACNS. It describes, in detail, a relevant medical condition and explores why the data are both rare and sparse. Furthermore, the reasons for the sparsity are heterogeneous or non-monotonic (i.e., not conducive to modelling with a singular model). This component concludes with a search for candidate variables to diagnose the condition by means of a scoping review, for subsequent comparative demonstration of the proposed novel variant of random forest construction.
The second component discusses machine learning model development and simulates data with varying degrees and patterns of missingness to demonstrate how the models could be applied to data with properties like those expected of PACNS-related data. Finally, the described techniques are applied to separate a subset of patients with suspected PACNS from those with diagnosed PACNS using institutional data, and future studies are proposed to expand upon and ultimately verify these insights. Further development of the novel random forest approach is also discussed.
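For context on the missing-data problem discussed above, the sketch below shows a conventional baseline only: median imputation followed by a standard random forest, evaluated with cross-validation on a small synthetic clinical-style table with values removed at random. It is not the novel random forest variant proposed in the thesis; the data shapes and missingness rate are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 15))                       # small, wide clinical-style dataset
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 120) > 0).astype(int)
X[rng.random(X.shape) < 0.25] = np.nan               # 25% of values missing at random

# Median imputation + random forest as a conventional baseline;
# the thesis proposes a random-forest variant that avoids this imputation step.
clf = make_pipeline(SimpleImputer(strategy="median"),
                    RandomForestClassifier(n_estimators=300, random_state=1))
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))
```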
164

Methods for network intrusion detection : Evaluating rule-based methods and machine learning models on the CIC-IDS2017 dataset

Lindstedt, Henrik January 2022 (has links)
Network intrusion detection is a task aimed at identifying malicious network traffic. Malicious network traffic is generated when a perpetrator attacks a network or internet-connected device with the intent to disrupt, steal or destroy a service or information. Two approaches for this particular task are the rule-based method and the use of machine learning. The purpose of this paper was to contribute with knowledge on how to evaluate and build better network intrusion detection systems (NIDS). That was fulfilled by comparing the detection ability of two machine learning models, a neural network and a random forest model, with a rule-based NIDS called Snort. The paper describes how the two models and Snort were constructed and how performance metrics were generated on a dataset called CIC-IDS2017. It also describes how we captured our own malicious network traffic and the models' ability to classify that data. The comparisons show that the neural network outperforms Snort and the random forest. We also present four factors that may influence which method should be used for intrusion detection. In addition, we conclude that we see potential in using CIC-IDS2017 to build NIDS based on machine learning.
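A minimal sketch of the kind of model comparison described above, using synthetic flow-style features in place of CIC-IDS2017 and omitting the Snort baseline, which cannot be reproduced in a few lines. The feature distribution and the attack-labelling rule are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for CIC-IDS2017 flow features (durations, packet counts, byte counts, ...).
rng = np.random.default_rng(7)
X = rng.lognormal(mean=1.0, sigma=1.0, size=(20000, 12))
y = (X[:, 0] * X[:, 3] > np.median(X[:, 0] * X[:, 3])).astype(int)  # 1 = "attack" flow (toy rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=7)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=7),
    "neural network": make_pipeline(StandardScaler(),
                                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=7)),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```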
165

Using Transcriptomic Data to Predict Biomarkers for Subtyping of Lung Cancer

Daran, Rukesh January 2021 (has links)
Lung cancer is one of the most dangerous types of cancer. Several studies have explored the use of machine learning methods to predict and diagnose this cancer. This study explored the potential of decision tree (DT) and random forest (RF) classification models, in the context of a small transcriptome dataset, for outcome prediction of different subtypes of lung cancer. In the study we compared three subtypes, adenocarcinomas (AC), small cell lung cancer (SCLC) and squamous cell carcinomas (SCC), with normal lung tissue by applying the two machine learning methods from the caret R package. The DT and RF models and their validation showed different results for each subtype of the lung cancer data. The DT found more features and validated them with better metrics. Analysis of the biological relevance was focused on the identified features for each of the subtypes AC, SCLC and SCC. The DT presented a detailed insight into the biological data, which was essential for classifying features as biomarkers. The identified features from this research may serve as potential candidate genes which could be explored further to confirm their role in the corresponding lung cancer types and contribute to targeted diagnostics of different subtypes.
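A small sketch of the DT-versus-RF comparison on a transcriptome-like matrix. The thesis used the caret package in R; this example uses scikit-learn in Python as an analogue, and the synthetic expression matrix, class sizes, and "marker gene" shifts are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small transcriptome matrix: 60 samples x 500 genes,
# four classes (AC, SCLC, SCC, normal), with a few informative "marker" genes per class.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 500))
y = np.repeat(["AC", "SCLC", "SCC", "normal"], 15)
X[y == "AC", 0:5] += 2.0
X[y == "SCLC", 5:10] += 2.0
X[y == "SCC", 10:15] += 2.0

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=3)),
                  ("random forest", RandomForestClassifier(n_estimators=500, random_state=3))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: 5-fold accuracy = {acc:.2f}")
```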
166

Detecting fraudulent users using behaviour analysis / Detektera artificiella användare med hjälp av beteendeanalys

Jóhannsson, Jökull January 2017 (has links)
With the increased global use of online media platforms, there are more opportunities than ever to misuse those platforms or perpetrate fraud. One such fraud is within the music industry, where perpetrators create automated programs that stream songs to generate revenue or increase the popularity of an artist. With the growing annual revenue of the digital music industry, there are significant financial incentives for perpetrators with fraud in mind. The focus of the study is extracting user behavioral patterns and utilising them to train and compare multiple supervised classification methods to detect fraud.  The machine learning algorithms examined are Logistic Regression, Support Vector Machines, Random Forest and Artificial Neural Networks. The study compares the performance of these algorithms trained on imbalanced datasets carrying different fractions of fraud. The trained models are evaluated using the Precision Recall Area Under the Curve (PR AUC) and an F1-score. Results show that the algorithms achieve similar performance when trained on balanced and imbalanced datasets. They also show that Random Forest outperforms the other methods for all datasets tested in this experiment. / Med den ökande användningen av strömmande media ökar också möjligheterna till missbruk av dessa plattformar samt bedrägeri. Ett typiskt fall av bedrägeri är att använda automatiserade program för att strömma media, och därigenom generera intäkter samt att öka en artists popularitet. Med den växande ekonomin kring strömmande media växer också incitamentet till bedrägeriförsök. Denna studies fokus är att finna användarmönster och använda denna kunskap för att träna modeller som kan upptäcka bedrägeriförsök. De maskininlärningsalgoritmer som undersökts är Logistic Regression, Support Vector Machines, Random Forest och Artificiella Neurala Nätverk. Denna studie jämför effektiviteten och precisionen av dessa algoritmer, som tränats på obalanserad data som innehåller olika procentandelar av bedrägeriförsök. Modellerna som genererats av de olika algoritmerna har sedan utvärderats med hjälp av Precision Recall Area Under the Curve (PR AUC) och F1-score. Resultaten av studien visar på liknande prestanda mellan modellerna som genererats av de utvärderade algoritmerna. Detta gäller både när de tränats på balanserad såväl som obalanserad data. Resultaten visar också att Random Forest-baserade modeller genererar bättre resultat för alla dataset som testats i detta experiment.
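A minimal sketch of the evaluation described above: a classifier trained on an imbalanced toy dataset and scored with PR AUC (estimated here by average precision) and F1. The feature set and the 2% fraud rate are assumptions, and only the Random Forest model is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for per-user streaming-behaviour features
# (e.g. plays per hour, skip rate, session length); about 2% positive "fraud" class.
X, y = make_classification(n_samples=20000, n_features=15, weights=[0.98, 0.02], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=5)

clf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("PR AUC :", round(average_precision_score(y_te, scores), 3))  # area under the precision-recall curve
print("F1     :", round(f1_score(y_te, clf.predict(X_te)), 3))
```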
167

Predicting Attrition in Financial Data with Machine Learning Algorithms / Förutsäga kundförluster i finansdata med maskininlärningstekniker

Darnald, Johan January 2018 (has links)
For most businesses there are costs involved when acquiring new customers, and having longer relationships with customers is therefore often more profitable. Predicting if an individual is prone to leave the business is then a useful tool to help any company take actions to mitigate this cost. The event when a person ends their relationship with a business is called attrition or churn. Predicting people's actions is, however, hard, and many different factors can affect their choices. This paper investigates different machine learning methods for predicting attrition in the customer base of a bank. Four different methods are chosen based on the results they have shown in previous research, and these are then tested and compared to find which works best for predicting these events. Four different datasets from two different products and with two different applications are created from real-world data from a European bank. All methods are trained and tested on each dataset. The results of the tests are then evaluated and compared to find what works best. The methods found in previous research to most reliably achieve good results in predicting churn in banking customers are the Support Vector Machine, Neural Network, Balanced Random Forest, and the Weighted Random Forest. The results show that the Balanced Random Forest achieves the best results with an average AUC of 0.698 and an average F-score of 0.376. The accuracy and precision of the model are concluded not to be enough to make definite decisions, but the model can be used with other factors such as profitability estimations to improve the effectiveness of any actions taken to prevent the negative effects of churn. / För de flesta företag finns det en kostnad involverad i att skaffa nya kunder. Längre relationer med kunder är därför ofta mer lönsamma. Att kunna förutsäga om en kund är nära att lämna företaget är därför ett användbart verktyg för att kunna utföra åtgärder för att minska denna kostnad. Händelsen när en kund avslutar sin relation med ett företag kallas härefter kundförlust. Att förutsäga människors handlingar är däremot svårt och många olika faktorer kan påverka deras val. Denna avhandling undersöker olika maskininlärningsmetoder för att förutsäga kundförluster hos en bank. Fyra metoder väljs baserat på tidigare forskning och dessa testas och jämförs sedan för att hitta vilken som fungerar bäst för att förutsäga dessa händelser. Fyra dataset från två olika produkter och med två olika användningsområden skapas från verklig data ifrån en europeisk bank. Alla metoder tränas och testas på varje dataset. Resultaten från dessa test utvärderas och jämförs sedan för att få reda på vilken metod som fungerar bäst. Metoderna som enligt tidigare forskning ger de mest pålitliga och bästa resultaten för att förutsäga kundförluster hos banker är stödvektormaskin, neurala nätverk, balanserad slumpmässig skog och vägd slumpmässig skog. Resultatet av testerna visar att en balanserad slumpmässig skog får bäst resultat med en genomsnittlig AUC på 0.698 och ett F-värde på 0.376. Träffsäkerheten och det positiva prediktiva värdet på metoden är inte tillräckligt för att ta definitiva handlingar med men kan användas med andra faktorer så som lönsamhetsuträkningar för att förbättra effektiviteten av handlingar som tas för att minska de negativa effekterna av kundförluster.
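A short sketch of churn prediction on imbalanced data, scored with AUC and F-score as above. scikit-learn's class_weight="balanced_subsample" is used here as an approximation of the balanced and weighted random forests compared in the thesis; the toy data, class ratio, and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced churn data (about 5% of customers leave); features are stand-ins
# for account activity, product holdings, tenure, and similar attributes.
X, y = make_classification(n_samples=30000, n_features=20, weights=[0.95, 0.05], random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)

# class_weight="balanced_subsample" reweights each bootstrap sample toward the
# minority class, which is close in spirit to a weighted random forest.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced_subsample",
                             random_state=11).fit(X_tr, y_tr)
print("AUC    :", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
print("F-score:", round(f1_score(y_te, clf.predict(X_te)), 3))
```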
168

A Machine Learning Ensemble Approach to Churn Prediction : Developing and Comparing Local Explanation Models on Top of a Black-Box Classifier / Maskininlärningsensembler som verktyg för prediktering av utträde : En studie i att beräkna och jämföra lokala förklaringsmodeller ovanpå svårförståeliga klassificerare

Olofsson, Nina January 2017 (has links)
Churn prediction methods are widely used in Customer Relationship Management and have proven to be valuable for retaining customers. To obtain a high predictive performance, recent studies rely on increasingly complex machine learning methods, such as ensemble or hybrid models. However, the more complex a model is, the more difficult it becomes to understand how decisions are actually made. Previous studies on machine learning interpretability have used a global perspective for understanding black-box models. This study explores the use of local explanation models for explaining the individual predictions of a Random Forest ensemble model. Churn prediction was studied on the users of Tink – a finance app. This thesis aims to take local explanations one step further by making comparisons between churn indicators of different user groups. Three sets of groups were created based on differences in three user features. The importance scores of all globally found churn indicators were then computed for each group with the help of local explanation models. The results showed that the groups did not have any significant differences regarding the globally most important churn indicators. Instead, differences were found for globally less important churn indicators, concerning the type of information that users stored in the app. In addition to comparing churn indicators between user groups, the result of this study was a well-performing Random Forest ensemble model with the ability to explain the reasons behind churn predictions for individual users. The model proved to be significantly better than a number of simpler models, with an average AUC of 0.93. / Metoder för att prediktera utträde är vanliga inom Customer Relationship Management och har visat sig vara värdefulla när det kommer till att behålla kunder. För att kunna prediktera utträde med så hög säkerhet som möjligt har den senaste forskningen fokuserat på alltmer komplexa maskininlärningsmodeller, såsom ensembler och hybridmodeller. En konsekvens av att ha alltmer komplexa modeller är dock att det blir svårare och svårare att förstå hur en viss modell har kommit fram till ett visst beslut. Tidigare studier inom maskininlärningsinterpretering har haft ett globalt perspektiv för att förklara svårförståeliga modeller. Denna studie utforskar lokala förklaringsmodeller för att förklara individuella beslut av en ensemblemodell känd som 'Random Forest'. Prediktionen av utträde studeras på användarna av Tink – en finansapp. Syftet med denna studie är att ta lokala förklaringsmodeller ett steg längre genom att göra jämförelser av indikatorer för utträde mellan olika användargrupper. Totalt undersöktes tre par av grupper som påvisade skillnader i tre olika variabler. Sedan användes lokala förklaringsmodeller till att beräkna hur viktiga alla globalt funna indikatorer för utträde var för respektive grupp. Resultaten visade att det inte fanns några signifikanta skillnader mellan grupperna gällande huvudindikatorerna för utträde. Istället visade resultaten skillnader i mindre viktiga indikatorer som hade att göra med den typ av information som lagras av användarna i appen. Förutom att undersöka skillnader i indikatorer för utträde resulterade denna studie i en välfungerande modell för att prediktera utträde med förmågan att förklara individuella beslut. Random Forest-modellen visade sig vara signifikant bättre än ett antal enklare modeller, med ett AUC-värde på 0.93.
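To make the idea of a local explanation model concrete, the sketch below trains a Random Forest as the black box and then fits a LIME-style weighted linear surrogate around a single prediction. It is a generic illustration rather than the thesis's exact procedure; the toy data, kernel width, and perturbation scale are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Black-box churn classifier on toy data (features stand in for app-usage signals).
X, y = make_classification(n_samples=5000, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# LIME-style local surrogate: sample points around one user, weight them by
# proximity, and fit a linear model to the black-box churn probabilities.
rng = np.random.default_rng(42)
x0 = X_te[0]
Z = x0 + rng.normal(scale=0.5, size=(1000, X.shape[1]))      # local perturbations
p = rf.predict_proba(Z)[:, 1]                                 # black-box outputs
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 2.0)              # proximity kernel
surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)

# The largest surrogate coefficients act as local churn indicators for this user.
for i in np.argsort(-np.abs(surrogate.coef_))[:3]:
    print(f"feature {i}: local weight {surrogate.coef_[i]:+.3f}")
```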
169

Uncertainty Analysis : Severe Accident Scenario at a Nordic Nuclear Power Plant

Hedly, Josefin, De Young, Mikaela January 2023 (has links)
Nuclear Power Plants (NPP) undergo fault and sensitivity analysis with scenario modelling to predict catastrophic events, specifically releases of Cesium-137 (Cs-137). The purpose of this thesis is to find which of 108 input features from the Modular Accident Analysis Program (MAAP) simulation code are important when there is a large release of Cs-137. The features are tested all together and in their groupings. To find important features, the machine learning (ML) model Random Forest (RF) provides a built-in attribute that identifies them. The results of the RF classification are corroborated with Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), and k-fold cross-validation is used to improve and validate the results, resulting in near 90% accuracy for the three ML models. RF is successful at identifying important features related to Cs-137 emissions by first using the classification model to identify top features and then further training the models on identifying important input features. The discovered input features are important both within their individual groups and when all features are included simultaneously. The large number of features included did not disrupt RF much, but the skewed dataset, with few classified extreme events, caused the accuracy to be lower, at near 90%.
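A compact sketch of the workflow described above: a Random Forest's built-in feature_importances_ attribute ranks the inputs, and SVM and KNN classifiers are then cross-validated on the top-ranked features to corroborate them. The synthetic 108-feature dataset and the 10% positive rate are assumptions standing in for the MAAP simulation outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the MAAP input deck: 108 input features, a skewed binary
# label marking simulations with a large Cs-137 release.
X, y = make_classification(n_samples=3000, n_features=108, n_informative=10,
                           weights=[0.9, 0.1], random_state=2)

rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X, y)
top = np.argsort(-rf.feature_importances_)[:10]          # built-in importance attribute
print("top input features:", top)

# Corroborate: do SVM and KNN classify comparably using only those features?
for name, clf in [("SVM", make_pipeline(StandardScaler(), SVC())),
                  ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier()))]:
    acc = cross_val_score(clf, X[:, top], y, cv=5).mean()
    print(f"{name}: 5-fold accuracy = {acc:.2f}")
```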
170

Classification of imbalanced disparate medical data using ontology / Klassificering av Obalanserad Medicinsk Data med Ontologier

Karlsson, Ludvig, Wilhelm Kopp Sundin, Gustav January 2023 (has links)
Through the digitization of healthcare, large volumes of data are generated and stored in healthcare operations. Today, a multitude of platforms and digital infrastructures are used for the storage and management of data. The systems lack a common ontology, which limits the interoperability between datasets. Limited interoperability impacts various areas of healthcare, for instance the sharing of data between entities and the possibilities for aggregated machine learning research incorporating distributed data. This study examines how a random forest classifier performs on two datasets consisting of phase III clinical trial studies on small-cell lung cancer, where the datasets do not share a common ontology. The performance is then compared to the same classifier's performance on one dataset consisting of a combination of the two earlier mentioned sets in which a common ontology is implemented. The study does not show unambiguous results indicating that a common ontology creates better performance for the random forest classifier. In addition, the conditions for entities within primary care in Sweden to undergo a transition to a new platform for data storage are discussed together with areas for future research. / Till följd av digitaliseringen inom hälso- och sjukvården genereras stora volymer data som lagras och används i verksamheten. Idag används en mängd olika plattformar för lagring och hantering av data. Systemen saknar en gemensam ontologi, vilket begränsar interoperabiliteten mellan datamängderna. Bristande interoperabilitet påverkar olika områden inom hälso- och sjukvården, till exempel delning av data mellan vårdinstanser och möjligheterna för forskning på en aggregerad nivå där maskininlärning används. Denna studie undersöker hur en random forest klassificerare presterar på två dataset bestående av fas III kliniska prövningar av småcellig lungcancer där dataseten inte delar en gemensam ontologi. Prestandan jämförs sedan med samma klassificerares prestanda på ett dataset som består av en anslutning mellan de två tidigare nämnda dataseten där en gemensam ontologi har implementerats. Studien visar inte entydiga resultat som indikerar att en gemensam eller icke-gemensam ontologi skapar bättre prestanda för en random forest klassificerare. Vidare diskuteras förutsättningarna och krav på förändringsprocessen för en övergång till Centrum för Datadriven Hälsas föreslagna plattform utifrån en klinik inom primärvårdens perspektiv.
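As a toy illustration of why a shared ontology matters for pooling records, the sketch below maps differently named columns from two hypothetical trial tables onto common concepts and then trains a random forest on the pooled data. The column names, ontology mapping, and tiny sample are invented for the example and are not taken from the study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Two toy trial datasets whose columns name the same clinical concepts differently.
trial_a = pd.DataFrame({"age_yrs": [64, 71, 58, 69], "ecog_ps": [1, 2, 0, 1], "responder": [1, 0, 1, 0]})
trial_b = pd.DataFrame({"AgeAtEnrollment": [62, 75, 66, 70], "PerformanceStatus": [0, 2, 1, 2], "Outcome": [1, 0, 1, 0]})

# A shared ontology here is just an agreed vocabulary mapping local column
# names onto common concepts, so the records can be pooled into one table.
ontology = {"age_yrs": "age", "AgeAtEnrollment": "age",
            "ecog_ps": "performance_status", "PerformanceStatus": "performance_status",
            "responder": "outcome", "Outcome": "outcome"}
pooled = pd.concat([trial_a.rename(columns=ontology),
                    trial_b.rename(columns=ontology)], ignore_index=True)

X, y = pooled[["age", "performance_status"]], pooled["outcome"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("pooled 2-fold accuracy:", cross_val_score(clf, X, y, cv=2).mean())
```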
