  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
171

Tree-Based Methods and a Mixed Ridge Estimator for Analyzing Longitudinal Data With Correlated Predictors

Eliot, Melissa Nicole 01 September 2011 (has links)
Due to recent advances in technology that facilitate acquisition of multi-parameter defined phenotypes, new opportunities have arisen for predicting patient outcomes based on individual specific cell subset changes. The data resulting from these trials can be a challenge to analyze, as predictors may be highly correlated with each other or related to outcome within levels of other predictor variables. As a result, applying traditional methods like simple linear models and univariate approaches such as odds ratios may be insufficient. In this dissertation, we describe potential solutions including tree-based methods, ridge regression, mixed modeling, and a new estimator called a mixed ridge estimator with expectation-maximization (EM) algorithm. Data examples are provided. In particular, flow cytometry is a method of measuring large numbers of particles at once by suspending them in a fluid and shining a beam of light onto it. This is specifically relevant in the context of studying human immunodeficiency virus (HIV), where there exists great potential to draw from the rich array of data on host cell-mediated response to infection and drug exposures to inform and discover patient-level determinants of disease progression and/or response to anti-retroviral therapy (ART). The data sets collected are often high dimensional with correlated columns, which can be challenging to analyze. We demonstrate the application and comparative interpretations of three tree-based algorithms for the analysis of data arising from flow cytometry in the first chapter of this manuscript. Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1 infected persons starting antiretroviral therapy with a CD4 count between 200 and 350 cells/μl.
The tree-based approaches, namely, classification and regression trees (CART), random forests (RF) and logic regression (LR), were designed specifically to uncover complex structure in high dimensional data settings. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. Specifically, application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T cell activation, may be a better predictor than any single T cell/innate cell subset analyzed. In the following chapter, tree-based methods are compared to each other via a simulation study. Each has its merits in particular circumstances; for example, RF is able to identify the order of importance of predictors regardless of whether there is a tree-like structure. It is able to adjust for correlation among predictors by using a machine learning algorithm, analyzing subsets of predictors and subjects over a number of iterations. CART is useful when variables are predictive of outcome within levels of other variables, and is able to find the most parsimonious model using pruning. LR also identifies structure within the set of predictor variables, and nicely illustrates relationships among variables. However, due to the vast number of combinations of predictor variables that would need to be analyzed in order to find the single best LR tree, an algorithm is used that only searches a subset of potential combinations of predictors. Therefore, results may be different each time the algorithm is used on the same data set. Next, we take a regression approach to analyzing data with correlated predictors. Ridge regression is a method of accounting for correlated data by adding a shrinkage component to the estimators for a linear model.
We perform a simulation study to compare ridge regression to linear regression over various correlation coefficients and find that ridge regression outperforms linear regression as correlation increases. To account for collinearity among the predictors along with longitudinal data, a new estimator that combines the applicability of ridge regression and mixed models using an EM algorithm is developed and compared to the mixed model. We find from a simulation study comparing our mixed ridge (MR) approach with a traditional mixed model that our new mixed ridge estimator is able to handle collinearity of predictor variables better than the mixed model, while accounting for random within-subject effects that regular ridge regression does not take into account. As correlation among predictors increases, power decreases more quickly for the mixed model than MR. Additionally, type I error rate is not significantly elevated when the MR approach is taken. The MR estimator gives us new insight into flow cytometry data and other data sets with correlated predictor variables that our tree-based methods could not give us. These methods all provide unique insight into our data that more traditional methods of analysis do not offer.
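The ridge-versus-linear comparison described above can be illustrated with a small simulation. This is a hedged sketch, not the dissertation's actual design: the equicorrelated covariance, the coefficient vector, and the penalty `alpha=10.0` are made-up choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p, rho = 200, 5, 0.95
# Equicorrelated predictors: pairwise correlation rho induces strong collinearity
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
beta = np.array([1.0, 0.9, 1.1, 1.0, 1.0])  # hypothetical true coefficients

def coef_mse(model, reps=100):
    """Average squared error of the fitted coefficients over repeated samples."""
    errs = []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(p), cov, size=n)
        y = X @ beta + rng.normal(size=n)
        errs.append(np.mean((model.fit(X, y).coef_ - beta) ** 2))
    return float(np.mean(errs))

ols_mse = coef_mse(LinearRegression())
ridge_mse = coef_mse(Ridge(alpha=10.0))
print(ols_mse, ridge_mse)  # shrinkage stabilizes the estimates as rho grows
```

With correlation this high, the design matrix has near-zero eigenvalues, so the OLS coefficient variance blows up while ridge trades a small bias for a much smaller variance.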
172

Detecting early-stage Alzheimer’s disease with Machine Learning algorithms

Mukka, Jakob January 2023 (has links)
Alzheimer’s disease (AD) accounts for the majority of all cases of dementia and can be characterized as a disease that causes a progressive decline of cognitive functions. Detecting the disease at its earliest stage is important, as medical treatments can be more effective if they are applied before the disease has caused irreparable brain damage. However, making a correct diagnosis of AD can be difficult, especially in the early stage when the symptoms are still mild. Machine learning algorithms can help in this process, and the purpose of this study is to investigate just how accurately machine learning algorithms can detect early-stage AD. Three algorithms were selected for the study: Random Forest, AdaBoost and Logistic Regression, which were then evaluated on the accuracy of their predictions. The results showed that Random Forest had the best overall performance with an accuracy of 79.78%. AdaBoost attained an accuracy of 76.40% and Logistic Regression attained an accuracy of 74.16%. These results suggest that machine learning algorithms can be used to make relatively accurate predictions of AD even when the disease is in its early stage.
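A three-model comparison of this kind can be sketched with scikit-learn; the dataset below is a synthetic stand-in (neither the clinical features nor the reported accuracies are reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a clinical dataset (features and labels are made up)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
accs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accs[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {accs[name]:.4f}")
```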
173

A Knowledge Based Approach of Toxicity Prediction for Drug Formulation. Modelling Drug Vehicle Relationships Using Soft Computing Techniques

Mistry, Pritesh January 2015 (has links)
This multidisciplinary thesis is concerned with the prediction of drug formulations for the reduction of drug toxicity. Both scientific and computational approaches are utilised to make original contributions to the field of predictive toxicology. The first part of this thesis provides a detailed scientific discussion on all aspects of drug formulation and toxicity. Discussions are focused around the principal mechanisms of drug toxicity and how drug toxicity is studied and reported in the literature. Furthermore, a review of the current technologies available for formulating drugs for toxicity reduction is provided. Examples of studies reported in the literature that have used these technologies to reduce drug toxicity are also reported. The thesis also provides an overview of the computational approaches currently employed in the field of in silico predictive toxicology. This overview focuses on the machine learning approaches used to build predictive QSAR classification models, with examples from the literature provided. Two methodologies have been developed as part of the main work of this thesis. The first is focused on the use of directed bipartite graphs and Venn diagrams for the visualisation and extraction of drug-vehicle relationships from large un-curated datasets which show changes in the patterns of toxicity. These relationships can be rapidly extracted and visualised using the methodology proposed in chapter 4. The second proposed methodology involves mining large datasets for the extraction of drug-vehicle toxicity data. The methodology uses an area-under-the-curve principle to make pairwise comparisons of vehicles, which are classified according to the toxicity protection they offer and from which predictive classification models based on random forests and decision trees are built. The results of this methodology are reported in chapter 6.
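A directed bipartite drug-vehicle graph of the kind described for chapter 4 can be sketched with a generic graph library; the drugs, vehicles, and toxicity outcomes below are invented placeholders, not data from the thesis.

```python
import networkx as nx

# Hypothetical drug-vehicle records: (drug, vehicle, toxicity change on formulation)
records = [
    ("drugA", "saline", "decreased"),
    ("drugA", "corn oil", "increased"),
    ("drugB", "saline", "decreased"),
    ("drugB", "tween 80", "decreased"),
]

# Directed bipartite graph: edges run from drugs to the vehicles they were tested in
G = nx.DiGraph()
for drug, vehicle, change in records:
    G.add_node(drug, bipartite="drug")
    G.add_node(vehicle, bipartite="vehicle")
    G.add_edge(drug, vehicle, toxicity=change)

# Vehicles that reduced toxicity for at least one drug
protective = {v for u, v, d in G.edges(data=True) if d["toxicity"] == "decreased"}
print(sorted(protective))  # → ['saline', 'tween 80']
```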
174

Spatial Patterns and the Socioeconomic Determinants of COVID-19 Infections in Ottawa, Canada.

Laadhar, Brahim 15 December 2023 (has links)
This study uncovered the pattern and spatial relationships between socio-economic factors and aggregated COVID-19 rates in Ottawa, Canada, from July 2020 to December 2021 at the neighbourhood scale. Both top-down and bottom-up data mining approaches were used to predict COVID-19 rates. The top-down approach employed ordinary least squares regression (OLS), a spatial error model (SEM), geographically weighted regression (GWR) and multi-scale geographically weighted regression (MGWR). Model intercomparison was also undertaken. The pattern of COVID-19 in Ottawa exhibited a significant, moderately positive spatial structure among neighbourhoods (Moran's I = 0.39; p = 0.0001). Local Moran's analysis identified areas of low and high COVID-19 clustering, interspersed with cold spots. The OLS model used determinants based on a literature review. Determinants were tested for normality using the Shapiro-Wilk test; transformations to normality were applied to those that failed. Next, an OLS-based backward stepwise approach was used to select the optimal set of determinants based on goodness of fit, selecting the model with the lowest Akaike Information Criterion (AIC). The percentage of people who take public transit to work, percentage of people with no high school diploma, percentage of people over 65 years old, and percentage of people with a Bachelor level degree or above comprised the final set of determinants. A SEM model was created to account for residual spatial autocorrelation in the OLS model's residuals and yielded an adjusted R² = 0.63. Based on the SEM, a one-unit increase in the square root of the percentage of people with a bachelor's degree or above was associated with a 3.2% increase in COVID-19 rates, while the same unit increase in the square root of the percentage of people with no high school diploma was associated with a 10.6% increase in COVID-19 rates.
Conversely, a one percent increase in the percentage of people aged 65 and older was linked to a 34.6% decrease in COVID-19 rates. To examine local variations in the relationships between the determinants and COVID-19, a MGWR with a Bisquare kernel and an adaptive bandwidth was used to improve upon the overall explained variance of the SEM model. The residuals of the MGWR model exhibited no significant spatial autocorrelation (Moran's I = -0.04; p = 0.62) and residuals were approximately normal (W = 0.98; p > 0.25). The MGWR model yielded an adjusted R² = 0.75. Taking a data mining and bottom-up approach, an optimized Random Forest model provided a very different set of determinants as important when compared to the top-down regression approaches and accounted for 47.34% of the COVID-19 variance.
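The global Moran's I statistic reported above (I = 0.39) can be computed from first principles; the sketch below uses a made-up four-neighbourhood toy example rather than the Ottawa data.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I for values x and a spatial weights matrix W (zero diagonal)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                 # deviations from the mean
    num = z @ W @ z                  # sum_ij w_ij * z_i * z_j
    den = z @ z                      # sum_i z_i^2
    return (len(x) / W.sum()) * num / den

# Toy chain of four neighbourhoods: two "high" areas adjacent to two "low" areas
x = [10, 9, 1, 2]
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(round(morans_i(x, W), 3))  # → 0.323 (positive: similar values cluster)
```

Values near +1 indicate clustering of similar values, near −1 dispersion, and near 0 spatial randomness, which is how the reported I = 0.39 and the MGWR residual I = −0.04 are read.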
175

A comparison of forecasting techniques: Predicting the S&P500

Neikter, Axel, Sjöberg, Nils January 2023 (has links)
Accurately predicting the S&P 500 index means knowing where the US economy is heading. If there were a model that could predict the S&P 500 with even some accuracy, it would be extremely valuable. Machine learning techniques such as neural networks and Random forest have become more popular in forecasting. This thesis compares the more traditional forecasting methods ARIMA, Exponential smoothing, and Naïve against a Random forest regression model in predicting the S&P 500 index. The models are compared using the scale measures MAE and RMSE. The Diebold-Mariano test is used to evaluate whether the models' forecasts have significantly better accuracy than the last known observation (the Naïve method). The results showed that the Random forest model outperformed the other models on both RMSE and MAE, especially on a two-day forecast. Furthermore, the Random forest model was significantly better on all horizons at the five percent significance level, meaning that the model had better forecast accuracy than the last known observation. However, further research on this subject is needed to establish the effectiveness of the Random forest model when forecasting stock market indices.
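A simplified form of the Diebold-Mariano comparison can be sketched as follows; the two error series are synthetic stand-ins, and the full test for multi-step horizons uses a HAC-adjusted variance that is omitted here.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2):
    """Simplified DM test: squared-error loss, 1-step horizon, normal approximation."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2      # loss differential series
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))    # standardized mean of d
    p = 2 * (1 - stats.norm.cdf(abs(dm)))              # two-sided p-value
    return dm, p

rng = np.random.default_rng(1)
naive_err = rng.normal(scale=2.0, size=300)  # hypothetical naive-forecast errors
model_err = rng.normal(scale=1.0, size=300)  # hypothetical model-forecast errors
dm, p = diebold_mariano(naive_err, model_err)
print(dm > 0, p < 0.05)  # → True True: the second forecast is significantly better
```

A positive DM statistic with a small p-value means the second series has significantly lower loss, which is the direction of the thesis's Random forest vs. naïve comparison.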
176

Automatic processing of LiDAR point cloud data captured by drones / Automatisk bearbetning av punktmolnsdata från LiDAR infångat av drönare

Li Persson, Leon January 2023 (has links)
As automation is on the rise in the world at large, the ability to automatically differentiate objects in datasets via machine learning is of growing interest. This report details an experimental evaluation of supervised learning on point cloud data using random forest with varying setups. Acquired via airborne LiDAR using drones, the data holds a 3D representation of a landscape area containing power line corridors. Segmentation was performed with the goal of isolating data points belonging to power line objects from the rest of the surroundings. Pre-processing was performed on the data to extend the machine learning features with geometry-based features that are not inherent to the LiDAR data itself. Because of the scale of the data, the labels were generated by the customer, Airpelago, and supervised learning was applied using them. With their labels as benchmark, F1 scores of over 90% were obtained for both of the classes pertaining to power line objects. The best results were obtained when the data classes were balanced and both relevant intrinsic and extended features were used for training the classification models.
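One common way to derive geometry-based features for power-line segmentation is from the eigenvalues of a local neighbourhood's covariance matrix; whether the thesis used these exact descriptors is an assumption, so treat this as a generic sketch.

```python
import numpy as np

def shape_features(neighborhood):
    """Eigenvalue-based shape descriptors for a local point neighborhood (N x 3)."""
    pts = np.asarray(neighborhood, dtype=float)
    lam = np.sort(np.linalg.eigvalsh(np.cov(pts.T)))[::-1]  # λ1 >= λ2 >= λ3
    lam = lam / lam.sum()
    linearity = (lam[0] - lam[1]) / lam[0]   # high for wire-like neighborhoods
    planarity = (lam[1] - lam[2]) / lam[0]   # high for ground/roof patches
    sphericity = lam[2] / lam[0]             # high for volumetric clutter
    return linearity, planarity, sphericity

rng = np.random.default_rng(0)
t = rng.uniform(0, 10, 100)
# Points scattered tightly around a straight line, like a power-line segment
wire = np.column_stack([t, 0.01 * rng.normal(size=100), 0.01 * rng.normal(size=100)])
lin, pla, sph = shape_features(wire)
print(lin > 0.9)  # a wire-like neighborhood is strongly linear
```

Features like these, computed per point over k-nearest neighbourhoods, are the kind of extended input a random forest can use alongside intrinsic LiDAR attributes such as intensity and return number.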
177

Detecting Fraudulent User Behaviour : A Study of User Behaviour and Machine Learning in Fraud Detection

Gerdelius, Patrik, Hugo, Sjönneby January 2024 (has links)
This study aims to create a Machine Learning model and investigate its performance in detecting fraudulent user behaviour on an e-commerce platform. The user data was analysed to identify and extract critical features distinguishing regular users from fraudulent users. Two different types of user data were used, Event Data and Screen Data, spanning over four weeks. A Principal Component Analysis (PCA) was applied to the Screen Data to reduce its dimensionality. Feature Engineering was conducted on both Event Data and Screen Data. A Random Forest model, a supervised ensemble method, was used for classification. The data was imbalanced due to a significant difference in the number of frauds compared to regular users. Therefore, two different balancing methods were used: oversampling (SMOTE) and changing the Probability Threshold (PT) for the classification model. The best result was achieved with the resampled data where the threshold was set to 0.4. With this model, 80.88% of actual frauds were predicted as such, while 0.73% of the regular users were falsely predicted as frauds. While this result was promising, questions are raised regarding its validity, since there is a possibility that the model was overfitted to the data set. An indication of this was that the result was significantly less accurate without resampling. However, the overall conclusion from the result is that this study shows an indication that it is possible to distinguish frauds from regular users, with or without resampling. For future research, it would be interesting to collect data over a more extended period of time and train the model on real-time data to counter changes in fraudulent behaviour.
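The probability-threshold balancing method can be sketched as below; the imbalanced dataset is synthetic, and the 0.4 threshold mirrors the study's choice only illustratively.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in for the platform data: ~3% "fraud" (class 1)
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.97],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]         # fraud probability per test user

results = {}
for pt in (0.5, 0.4):                         # default vs lowered threshold
    pred = (proba >= pt).astype(int)
    recall = (pred[y_te == 1] == 1).mean()    # share of frauds caught
    fpr = (pred[y_te == 0] == 1).mean()       # regular users wrongly flagged
    results[pt] = (recall, fpr)
    print(f"PT={pt}: recall={recall:.3f}, FPR={fpr:.3f}")
```

Lowering the threshold can only add positive predictions, so recall rises at the cost of a higher false-positive rate, which is the trade-off the study tunes.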
178

Comparative Analysis of Surrogate Models for the Dissolution of Spent Nuclear Fuel

Awe, Dayo 01 May 2024 (has links) (PDF)
This thesis presents a comparative analysis of surrogate models for the dissolution of spent nuclear fuel, with a focus on the use of deep learning techniques. The study explores the accuracy and efficiency of different machine learning methods in predicting the dissolution behavior of nuclear waste, and compares them to traditional modeling approaches. The results show that deep learning models can achieve high accuracy in predicting the dissolution rate, while also being computationally efficient. The study also discusses the potential applications of surrogate modeling in the field of nuclear waste management, including the optimization of waste disposal strategies and the design of more effective containment systems. Overall, this research highlights the importance of surrogate modeling in improving our understanding of nuclear waste behavior and developing more sustainable waste management practices.
179

Android Malware Detection Using Machine Learning

Kesani, Rahul Sai January 2024 (has links)
Background. The Android smartphone, with its wide range of uses and excellent performance, has attracted numerous users. However, this dominance of the Android platform has also motivated attackers to develop malware. The traditional methodology, which detects malware based on signatures, is unfit to discover unknown applications. In this thesis, Static Analysis (SA) is used to detect whether an application is malware or not: all the permissions that an application asks for are taken as input to the machine learning models. Objectives. The objectives to address and fulfill the aim of this thesis are: to find or create the necessary data set containing malware for Android systems; to build classifiers using different machine learning (ML) algorithms such as Support Vector Machine (SVM) (Linear and RBF), Logistic Regression (LR), Random Forest (RF), Gaussian Naive Bayes (GNB) and Decision Tree (DT), and to compare their performance; and to evaluate and compare each of the chosen models using Accuracy, Precision, F1-Score and Recall in detecting Android malware with better accuracy in real-time scenarios. Methods. To answer the research question, an experiment was conducted to identify malware in Android systems. Results. The Sequential Neural Network (SNN) performed best on the dataset with 98.82 percent accuracy, compared to the other Machine Learning (ML) algorithms, making it the most fruitful algorithm for Android malware detection. Random Forest (RF) and Decision Tree (DT) were the second-best algorithms on the dataset with 97 percent. Conclusions.
Among Logistic Regression, KNN, SVM Linear, SVM RBF, Decision Tree, Random Forest, Gaussian Naive Bayes and Sequential Neural Network, Random Forest was declared the most efficient classical algorithm after comparing all the models on the performance metrics Precision, Recall, F1-Score and Accuracy.
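The permission-vector approach described in the thesis can be sketched as follows; the permission list, apps, and labels are invented toy data, and the model here is a plain scikit-learn Random Forest rather than the thesis's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical permission vocabulary and per-app permission requests (1 = malware)
permissions = ["INTERNET", "READ_SMS", "SEND_SMS", "READ_CONTACTS", "CAMERA"]
apps = [
    (["INTERNET"], 0),
    (["INTERNET", "CAMERA"], 0),
    (["INTERNET", "READ_SMS", "SEND_SMS"], 1),
    (["READ_SMS", "SEND_SMS", "READ_CONTACTS"], 1),
    (["INTERNET", "READ_CONTACTS"], 0),
    (["SEND_SMS", "READ_CONTACTS", "INTERNET"], 1),
]

# One-hot encode each app's requested permissions into a binary feature vector
X = np.array([[p in requested for p in permissions] for requested, _ in apps], int)
y = np.array([label for _, label in apps])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
query = np.array([[1, 1, 1, 0, 0]])  # asks for INTERNET, READ_SMS, SEND_SMS
print(clf.predict(query)[0])  # → 1: the SMS-heavy pattern looks malware-like
```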
180

An Investigation of How Well Random Forest Regression Can Predict Demand : Is Random Forest Regression better at predicting the sell-through of close to date products at different discount levels than a basic linear model?

Jonsson, Estrid, Fredrikson, Sara January 2021 (has links)
Allt eftersom klimatkrisen fortskrider ökar engagemanget kring hållbarhet inom företag. Växthusgaser är ett av de största problemen och matsvinn har därför fått mycket uppmärksamhet sedan det utnämndes till den tredje största bidragaren till de globala utsläppen. För att minska sitt bidrag rabatterar många matbutiker produkter med kort bästföredatum, vilket kommit att kräva en förståelse för hur priskänslig efterfrågan på denna typ av produkt är. Prisoptimering görs vanligtvis med så kallade Generalized Linear Models men då efterfrågan är ett komplext koncept har maskininlärningsmetoder börjat utmana de traditionella modellerna. En sådan metod är Random Forest Regression, och syftet med uppsatsen är att utreda ifall modellen är bättre på att estimera efterfrågan baserat på rabattnivå än en klassisk linjär modell. Vidare utreds det ifall ett tydligt linjärt samband existerar mellan rabattnivå och efterfrågan, samt ifall detta beror av produkttyp. Resultaten visar på att Random Forest tar bättre hänsyn till det komplexa samband som visade sig finnas, och i detta specifika fall presterar bättre. Vidare visade resultaten att det sammantaget inte finns något linjärt samband, men att vissa produktkategorier uppvisar svag linjäritet. / As the climate crisis continues to evolve many companies focus their development on becoming more sustainable. With greenhouse gases being highlighted as the main problem, food waste has obtained a great deal of attention after being named the third largest contributor to global emissions. One way retailers have attempted to improve is through offering close-to-date produce at discount, hence decreasing levels of food being thrown away. To minimize waste the level of discount must be optimized, and as the products can be seen as flawed the known price-to-demand relation of the products may be insufficient.
The optimization process historically involves generalized linear regression models, however demand is a complex concept influenced by many factors. This report investigates whether a Machine Learning model, Random Forest Regression, is better at estimating the demand of close-to-date products at different discount levels than a basic linear regression model. The discussion also includes an analysis on whether discounts always increase the will to buy and whether this depends on product type. The results show that Random Forest to a greater extent considers the many factors influencing demand and is superior as a predictor in this case. Furthermore it was concluded that there is generally not a clear linear relation however this does depend on product type as certain categories showed some linearity.
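The kind of nonlinear discount-demand relation discussed above can be used to show why a forest can beat a straight line; the kinked demand curve below is a made-up illustration, not the retailer's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
discount = rng.uniform(0, 0.7, 400).reshape(-1, 1)
# Hypothetical demand: flat until ~30% discount, then rising sharply (nonlinear)
demand = np.where(discount[:, 0] < 0.3, 20.0,
                  20.0 + 150.0 * (discount[:, 0] - 0.3))
demand = demand + rng.normal(scale=3.0, size=400)

split = 300  # simple holdout split
lin = LinearRegression().fit(discount[:split], demand[:split])
rf = RandomForestRegressor(random_state=0).fit(discount[:split], demand[:split])

lin_mse = mean_squared_error(demand[split:], lin.predict(discount[split:]))
rf_mse = mean_squared_error(demand[split:], rf.predict(discount[split:]))
print(lin_mse, rf_mse)  # the forest tracks the kink that a straight line cannot
```

A linear model averages over the kink, while the forest's piecewise-constant fit approximates each regime, which is the intuition behind the thesis's finding.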
