  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
211

Propuesta de sistema cloud para optimizar la selección de auditores y seguimiento de la ejecución de auditorías en una organización de certificación de procesos utilizando árboles de decisión, geolocalización y tableros BI / Cloud system proposal to optimize the selection of auditors and monitoring of the execution of audits in a process certification organization using decision trees, geolocation and BI dashboards

Ocrospoma Cuadros, Gerson Josue, Romaña Casas, Victor Arturo 16 December 2021 (has links)
This thesis project proposes a solution that allows a company in the "quality inspection and certification" sector to automate its auditor selection and audit execution monitoring processes by implementing a cloud system that uses machine learning (specifically decision trees), geolocation, and BI dashboards to support indicator measurement and decision making. The proposal is developed across six chapters. The first chapter defines the project: it presents the organization under study, the project objectives and their success indicators, and describes the identified problem. The second chapter presents the fulfillment of the student outcomes. The third chapter describes the theoretical foundations for the development of the project. The fourth chapter covers the development of the project: the analysis of the company's current situation, process engineering, the solution proposal, requirements analysis, system use case modeling, and the software architecture design of the proposal, for which the C4 model is used. The fifth chapter presents the project results with reference to the proposed solution. The sixth chapter addresses project management, following the guidance of the PMBOK® Guide. Through the proposed solution, the project aims to reduce the time needed to hire specialized auditors and to cut the cost overruns that audit projects incur through inadequate control and monitoring of audit execution events in inspection and certification services. / Tesis
212

Exploring factors that decide how a Business Intelligence tool is received by its users

Klaesson, Mårten January 2020 (has links)
Self-Service Business Intelligence (SSBI) is a service where users can create reports and analyze data on their own. It is an approach to decentralizing competence and knowledge within a company, and it has been shown to increase productivity and give employees more opportunities to make smart, data-driven decisions. I undertook this project to learn more about SSBI and specifically to explore which factors contribute to the user experience of working with SSBI. With the help of a survey I was able to reach the employees at If P&C Insurance. Among many other questions, I asked how satisfied they were with the SSBI solution at the company, how they experienced loading times, how active they were, and whether SSBI brought value to their day-to-day work. The survey data was analyzed for trends and correlations between answers to identify which parts employees were pleased with and which parts need more attention. This was done with the help of decision trees, correlation matrices, and extensive graph comparisons. The results answered my scientific question rather well. They show that most employees find working with SSBI at If P&C Insurance an enjoyable experience and believe that it adds real value to their work. There is an interest in further education in Tableau, the SSBI software used at If P&C Insurance, which shows that employees are eager to learn more, but also that the education available at the company has not reached all employees. There is also a major issue with loading times when browsing reports: users who experience loading times as slow or very slow are overrepresented in the group that is not pleased with the software. I recommend two solutions to the company for the slow loading times:
• Educate employees to create reports that require as little processing power as possible to browse. This is something a few employees asked for specifically.
• Increase the capacity of their servers. As using Tableau and creating reports has become more and more popular at the company, the servers have not been updated at the same pace, creating long delays when browsing and working with reports.
In general, I think If P&C Insurance has created a functioning environment for SSBI, and if they address the few issues I have mentioned they will have a thriving Tableau community within the company.
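The abstract above analyzes correlations between survey answers. As a minimal sketch of what one cell of such a correlation matrix looks like, the following computes Pearson's r between two hypothetical Likert-scale questions; the question names and answer values are invented for illustration, not taken from the thesis:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented 1-5 Likert answers: satisfaction tends to drop as
# loading times are experienced as slower.
satisfaction = [5, 4, 4, 3, 2, 2, 1]
slow_loading = [1, 2, 1, 3, 4, 5, 5]
r = pearson_r(satisfaction, slow_loading)  # strongly negative, about -0.95
```

A full correlation matrix is just this computed for every pair of questions, which is how the overrepresentation of slow-loading complaints among dissatisfied users would show up.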
213

Methods of Handling Missing Data in One Shot Response Based Power System Control

Dahal, Niraj 08 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / This thesis extends the work done in [1] [2] by Rovnyak et al., where the authors describe transient event prediction and response-based one-shot control using decision trees trained and tested on a 176-bus model of the WECC power system network. The thesis contains results from rigorous simulations performed to measure the robustness of the existing one-shot control when subjected to missing PMU data ranging from 0-10%. The thesis can be divided into two parts: the first covers the work done in [2] using another set of one-shot control combinations, labelled CC2, and the second measures their robustness under missing PMU data. Previous work in [2] uses decision trees for event detection, based on different indices, to classify a contingency as 'Fault' or 'No fault', and another set of decision trees that decides whether to actuate 'Control' or 'No control'. Actuating control here means applying a one-shot control combination that can bring the system to a new equilibrium point where it would otherwise lose synchronism. The work in [2] also assesses the performance of the one-shot control without event detection. The thesis is organized as follows. Chapter 1 highlights the effect of missing PMU data in a power system network and the need to address it appropriately; it also gives a general idea of transient stability and the response to a transient fault in a power system. Chapter 2 forms the foundation of the thesis, describing the work done in [1] [2] in detail: the power system model used, the contingency set, and the different indices used for the decision trees. It also describes the one-shot control combination (CC1) derived by Rovnyak et al., whose performance is later tested in this thesis under different missing-data scenarios.
In addition to CC1, the chapter describes another control combination (CC2) whose performance is tested under the same missing-data scenarios, and explains the control methodology used in [2]. Finally, the performance metrics of the decision trees are explained at the end of the chapter; these are the same metrics used in [2] to measure the robustness of the one-shot control. Chapter 2 is thus largely a literature review of previous work, plus a few simulation results obtained with CC2 using exactly the same model and control methodology. Chapter 3 describes different techniques for handling missing PMU data, most of which have been used in, and are referenced from, previous papers. Finally, Chapter 4 presents the results and analysis of the simulations. The thesis concludes with future enhancements and room for improvement.
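The 'Fault' / 'No fault' event detection described above is, in structure, a binary decision-tree classification over stability indices. The sketch below shows that structure in scikit-learn; the two feature columns and all values are invented stand-ins, not the thesis's actual WECC indices or PMU measurements:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented indices per contingency: [rate of change of a frequency
# index, voltage-dip index]. Faulted cases show much larger swings.
X = [[0.10, 0.02], [0.20, 0.05], [0.15, 0.03],   # stable cases
     [2.50, 0.40], [3.00, 0.50], [2.80, 0.45]]   # unstable cases
y = ["No fault", "No fault", "No fault", "Fault", "Fault", "Fault"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# A new contingency with large index values is classified as a fault;
# in the one-shot scheme a second tree would then decide Control / No control.
pred = clf.predict([[2.70, 0.42]])[0]
```

Missing PMU data would show up here as missing entries in rows of `X`, which is exactly what the imputation techniques of Chapter 3 have to handle before the tree can be applied.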
214

GLOBAL TRANSLATION OF MACHINE LEARNING MODELS TO INTERPRETABLE MODELS

Mohammad Naser Al-Merri (11794466) 07 January 2022 (has links)
The widespread and growing use of machine learning models, especially in highly critical areas such as law, predicates the need for interpretable models. Models that cannot be audited are vulnerable to inheriting biases from the dataset, and even locally interpretable models are vulnerable to adversarial attack. To address this issue, a new methodology is proposed to translate any existing machine learning model into a globally interpretable one. This methodology, MTRE-PAN, is designed as a hybrid SVM-decision tree model and leverages the interpretability of linear hyperplanes. MTRE-PAN uses this hybrid model to create polygons that act as intermediates for the decision boundary. MTRE-PAN is compared to a previously proposed model, TRE-PAN, on three non-synthetic datasets: Abalone, Census, and Diabetes. TRE-PAN translates a machine learning model into a 2-3 decision tree in order to provide global interpretability for the target model. Each dataset is used to train a neural network that represents the non-interpretable model. For all target models, the results show that MTRE-PAN generates interpretable decision trees that have fewer leaves and higher parity compared to TRE-PAN.
215

Automated Gravel Road Condition Assessment : A Case Study of Assessing Loose Gravel using Audio Data

Saeed, Nausheen January 2021 (has links)
Gravel roads connect sparse populations and provide highways for agriculture and the transport of forest goods. Gravel roads are an economical choice where traffic volume is low. In Sweden, 21% of all public roads are state-owned gravel roads, covering over 20,200 km. In addition, there are some 74,000 km of gravel roads and 210,000 km of forest roads that are owned by the private sector. The Swedish Transport Administration (Trafikverket) rates the condition of gravel roads according to the severity of irregularities (e.g. corrugations and potholes), dust, loose gravel, and gravel cross-sections. This assessment is carried out during the summertime when roads are free of snow. One of the essential parameters for gravel road assessment is loose gravel. Loose gravel can cause a tire to slip, leading to a loss of driver control.  Assessment of gravel roads is carried out subjectively by taking images of road sections and adding some textual notes. A cost-effective, intelligent, and objective method for road assessment is lacking. Expensive methods, such as laser profiler trucks, are available and can offer road profiling with high accuracy. These methods are not applied to gravel roads, however, because of the need to maintain cost-efficiency.  In this thesis, we explored the idea that, in addition to machine vision, we could also use machine hearing to classify the condition of gravel roads in relation to loose gravel. Several suitable classical supervised learning and convolutional neural networks (CNN) were tested. When people drive on gravel roads, they can make sense of the road condition by listening to the gravel hitting the bottom of the car. The more we hear gravel hitting the bottom of the car, the more we can sense that there is a lot of loose gravel and, therefore, the road might be in a bad condition. Based on this idea, we hypothesized that machines could also undertake such a classification when trained with labeled sound data. 
Machines can identify gravel and non-gravel sounds. In this thesis, we used traditional machine learning algorithms, such as support vector machines (SVM), decision trees, and ensemble classification methods. We also explored CNNs for classifying spectrograms of audio sounds and images of gravel roads. Both classical supervised learning and CNNs were used, and their results were compared in this study. Among the classical algorithms, ensemble bagged tree (EBT)-based classifiers performed best at classifying gravel and non-gravel sounds; EBT performance is also useful in reducing the misclassification of non-gravel sounds. The CNN achieved a 97.91% accuracy rate. Using a CNN makes the classification process more intuitive because the network architecture takes responsibility for selecting the relevant training features. Furthermore, the classification results can be visualized on road maps, which can help road monitoring agencies assess road conditions and schedule maintenance activities for a particular road. / Due to unforeseen circumstances the seminar was postponed from May 7 to 28, as duly stated in the new posting page.
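An ensemble bagged tree classifier of the kind the abstract reports can be sketched generically in scikit-learn (whose `BaggingClassifier` uses decision trees as its default base estimator). The two audio features and all values below are invented placeholders, not the thesis's actual acoustic features:

```python
from sklearn.ensemble import BaggingClassifier

# Invented per-clip features: [normalized loudness, spectral centroid in Hz].
# Loose-gravel impacts are assumed louder and brighter than road noise.
X = [[0.80, 3000], [0.90, 3200], [0.85, 3100],   # gravel sounds -> class 1
     [0.10,  800], [0.15,  900], [0.12,  850]]   # non-gravel sounds -> class 0
y = [1, 1, 1, 0, 0, 0]

# Bagging trains each tree on a bootstrap resample and majority-votes,
# which is what reduces the misclassification of non-gravel sounds.
ebt = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
pred = ebt.predict([[0.82, 3050]])[0]
```

In the thesis itself the inputs are labeled sound recordings (and spectrograms for the CNN), but the ensemble voting structure is the same.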
216

User authentication through behavioral biometrics using multi-class classification algorithms : A comprehensive study of machine learning algorithms for keystroke and mouse dynamics / Användarautentisering med beteendemässig biometri och användning av multi-class klassificeringsalgoritmer : En djupgående studie av maskininlärningsalgoritmer för tangentbords- och musdynamik

Lantz, Emil January 2023 (has links)
User authentication is vital in a secure system. Authentication is achieved through something a genuine user knows, has, or is. The latter is called biometrics, commonly associated with fingerprint and face modalities. It is also possible to identify a user based on their behavior, so-called behavioral biometrics. In this study, keyboard and mouse behavior were considered. Previous research indicates promise for this authentication method, but it is scarce, old, and often not comprehensive. This study focuses on two available data sets, the CMU keystroke dynamics dataset and the ReMouse data set. The data was used together with a comprehensive set of multi-class supervised classification machine learning algorithms from the scikit-learn library for Python. Through hyperparameter optimization, two optimal algorithms with modified hyperparameters were found that improved results compared with previous research. For keystroke dynamics, a classifier based on a neural network, the multi-layer perceptron, achieved an Equal Error Rate (EER) of 1.26%. For mouse dynamics, a decision tree classifier achieved an EER of 0.43%. The findings indicate that the produced biometric classifiers can be used in an authentication model and, importantly, to strengthen existing authentication models such as password-based login, as a safe alternative to traditional Multi-Factor Authentication (MFA).
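The EER figures quoted above (1.26% and 1.43% scale metrics) are the operating point where false accepts equal false rejects. A minimal sketch of computing an EER from classifier scores, with invented genuine-user and impostor scores, looks like this:

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds over the observed scores and
    return the (FAR + FRR) / 2 value where the two rates are closest."""
    best_gap, best_eer = 1.0, 0.5
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

genuine  = [0.90, 0.80, 0.85, 0.95, 0.70]  # invented scores, true user
impostor = [0.20, 0.30, 0.10, 0.40, 0.75]  # invented scores, other users
eer = equal_error_rate(genuine, impostor)
```

A lower EER means the genuine and impostor score distributions overlap less, which is why it is the standard single-number comparison for biometric classifiers.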
217

Stock picking via nonsymmetrically pruned binary decision trees with reject option

Andriyashin, Anton 06 July 2010 (has links)
Stock picking is the field of financial analysis that is of particular interest to many professional investors and researchers. There is a lot of research evidence supporting the fact that stock returns can be effectively forecasted. While various modeling techniques could be employed for stock price prediction, a critical analysis of popular methods is provided, including general equilibrium and asset pricing models; parametric, non- and semiparametric regression models; and popular black-box classification approaches. Due to the advantageous properties of binary classification trees, including an excellent level of interpretability of decision rules, the core of the trading algorithm is built using this modern nonparametric method. Optimal tree size is believed to be the crucial factor in the forecasting performance of classification trees. While there exists a set of widely adopted alternative tree induction and pruning techniques, which are critically examined in the study, one of the main contributions of this work is a novel methodology of nonsymmetrical tree pruning with reject option called Best Node Selection (BNS).
An important inverse propagation property of BNS is proven that provides an easy way to implement the search for the optimal tree size in practice. Traditional cost-complexity pruning shows similar performance in terms of tree accuracy when assessed against popular alternative techniques, and it is the default pruning method for many applications. BNS is compared with cost-complexity pruning empirically by composing two recursive portfolios out of DAX30 stocks. Performance forecasts for each of the stocks are provided by the constructed decision trees, which are updated when new market information becomes available. It is shown that BNS clearly outperforms the traditional approach according to both the backtesting results and the Diebold-Mariano test for statistical significance of the performance difference between the two forecasting methods. Another novel feature of this work is the use of individual decision rules for each stock, as opposed to the traditional pooling of learning samples. Empirical data in the form of individual decision rules for a randomly selected time point in the backtesting set justify this approach.
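The baseline that BNS is benchmarked against, cost-complexity pruning, is directly available in scikit-learn. The sketch below shows it on a toy dataset (iris, not the thesis's DAX stock features): a larger complexity penalty alpha collapses subtrees and yields fewer leaves.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees of the full tree are successively pruned away.
path = full.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # a mid-range penalty

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
leaves_full, leaves_pruned = full.get_n_leaves(), pruned.get_n_leaves()
```

Choosing alpha (typically by cross-validation) is the "search for the optimal tree size" that the abstract says BNS makes easier via its inverse propagation property.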
218

Predicting user churn using temporal information : Early detection of churning users with machine learning using log-level data from a MedTech application / Förutsägning av användaravhopp med tidsinformation : Tidig identifiering av avhoppande användare med maskininlärning utifrån systemloggar från en medicinteknisk produkt

Marcus, Love January 2023 (has links)
User retention is a critical aspect of any business or service. Churn is the continuous loss of active users. A low churn rate enables companies to focus more resources on providing better services instead of recruiting new users. Current published research on predicting user churn disregards the time of day and the time variability of events and actions through feature selection or data preprocessing. This thesis empirically investigates the practical benefits of including accurate temporal information for binary prediction of user churn by training a set of Machine Learning (ML) classifiers on differently prepared data. One data preparation approach was based on temporally sorted logs (the log-level data set), and the other on stacked aggregations (the aggregated data set) with additional engineered temporal features. The additional temporal features included information about relative time, time of day, and temporal variability. The inclusion of the temporal information was evaluated by training and evaluating the classifiers with the different features on a real-world dataset from a MedTech application. Artificial Neural Networks (ANNs), Random Forests (RFs), Decision Trees (DTs), and naïve approaches were applied and benchmarked. The classifiers were compared with, among other metrics, the Area Under the Receiver Operating Characteristics Curve (AUC), Positive Predictive Value (PPV), and True Positive Rate (TPR) (a.k.a. precision and recall). The PPV scores the classifiers by their accuracy among the positively labeled class, the TPR measures the recognized proportion of the positive class, and the AUC is a metric of general performance. The results demonstrate a statistically significant value of including time variation features overall, and in particular the classifiers performed better on the log-level data set. An ANN trained on temporally sorted logs performs best, followed by an RF on the same data set.
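The PPV and TPR definitions given in the abstract reduce to two ratios over the confusion matrix. A minimal sketch, with invented churn labels (1 = churned, 0 = retained):

```python
def ppv_tpr(y_true, y_pred):
    """Precision (PPV) and recall (TPR) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    ppv = tp / (tp + fp)  # accuracy among users predicted to churn
    tpr = tp / (tp + fn)  # share of actual churners that were caught
    return ppv, tpr

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # invented ground truth
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # invented classifier output
ppv, tpr = ppv_tpr(y_true, y_pred)  # 3/4 and 3/4
```

AUC, by contrast, is threshold-free: it summarizes the whole ROC curve rather than one operating point, which is why the thesis reports it alongside PPV and TPR.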
219

Sales Volume Forecasting of Ericsson Radio Units - A Statistical Learning Approach / Prognostisering av försäljningsvolymer för radioenheter - Statistisk modellering

Amethier, Patrik, Gerbaulet, André January 2020 (has links)
Demand forecasting is a well-established internal process at Ericsson, where employees from various departments collaborate to predict future sales volumes of specific products over horizons ranging from months to a few years. This study aims to evaluate current predictions regarding Ericsson's radio unit products, draw insights from historical volume data, and finally develop a novel statistical prediction approach. Specifically, a two-part statistical model consisting of a decision tree followed by a neural network is trained on previous sales data of radio units and then evaluated (also on historical data) for predictive accuracy. To test the hypothesis that mid-range volume predictions over a 1-3 year horizon made by data-driven statistical models can be more accurate, the two-part model makes predictions per individual radio unit product based on several predictive attributes, mainly historical volume data and information relating to geography, country, and customer trends. The majority of wMAPEs per product from the predictive model were shown to be less than 5% for the three prediction horizons, which can be compared to global wMAPEs from Ericsson's existing long-range forecast process of 9% for 1 year, 13% for 2 years, and 22% for 3 years. These results suggest the strength of the data-driven predictive model. However, care must be taken when comparing the two error measures, and one must take into account the large variances of the wMAPEs from the predictive model.
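The wMAPE figures quoted above are weighted mean absolute percentage errors: total absolute forecast error divided by total actual volume, so large-volume products dominate the score. A minimal sketch with invented volumes:

```python
def wmape(actual, forecast):
    """Weighted MAPE: sum of absolute errors over sum of actual volumes."""
    total_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return total_error / sum(abs(a) for a in actual)

actual   = [100, 200, 300, 400]  # invented true sales volumes
forecast = [110, 190, 280, 420]  # invented model predictions
err = wmape(actual, forecast)    # 60 / 1000 = 0.06, i.e. 6%
```

Unlike plain MAPE, wMAPE does not blow up on near-zero actuals, which makes it a common choice for intermittent-demand products like individual radio units.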
220

Supervised Learning for Prediction of Tumour Mutational Burden / Användning av statistisk inlärning för estimering av mutationsbörda

Hargell, Joanna January 2021 (has links)
Tumour Mutational Burden (TMB) is a promising biomarker for predicting response to immunotherapy. In this thesis, supervised statistical learning methods were used to predict TMB: GLM, decision trees, and SVM. Predictions were based on data from targeted DNA sequencing, using variants found in the exonic, intronic, UTR, and intergenic regions of human DNA. The project was exploratory in nature and performed in a pan-cancer setting; both regression and classification were considered. The purpose was to investigate whether variants found in these regions of the DNA sequence are useful for predicting TMB. Poisson regression and negative binomial regression were used within the framework of GLM. The results indicated deficiencies in the model assumptions, making the use of GLM for this application questionable. A single regression tree did not yield satisfactory prediction accuracy, but performance was improved by variance-reducing methods such as bagging and random forests; boosted regression trees did not yield any significant improvement. In the classification setting, both binary and multiple classes were considered, with the distinction between classes based on thresholds commonly used in clinical care for access to immunotherapy. SVM and classification trees yielded high prediction accuracy in the binary case: misclassification rates of 0.0242 and 0, respectively, on the independent test set. In the multiple classification setting, bagging and random forests were implemented but did not improve performance over the single classification tree. SVM produced a misclassification rate of 0.103, and the corresponding number for the single classification tree was 0.109. It was concluded that SVM and decision trees are suitable methods for predicting TMB based on targeted gene panels. However, to obtain reliable predictions, there is a need to move from a pan-cancer setting to a diagnosis-based setting. Furthermore, parameters affecting TMB, such as pre-analytical factors, need to be included in the statistical analysis.
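The binary classification target described above is a thresholded version of TMB, which is conventionally reported as somatic mutations per megabase of sequenced territory. The sketch below shows that conversion; the abstract does not state the thesis's exact clinical cut-off, so the 10 mutations/Mb value used here is an assumption based on a commonly cited threshold:

```python
def tmb(n_mutations, panel_size_mb):
    """TMB as somatic mutations per megabase of sequenced panel."""
    return n_mutations / panel_size_mb

def tmb_class(value, cutoff=10.0):
    """Binary label; 10 mut/Mb is an assumed, commonly used cut-off."""
    return "TMB-high" if value >= cutoff else "TMB-low"

# Invented example: 18 variants called on a 1.5 Mb targeted panel.
score = tmb(n_mutations=18, panel_size_mb=1.5)  # 12 mutations/Mb
label = tmb_class(score)
```

In the thesis, the interesting question is whether intronic, UTR, and intergenic variant counts help a model predict this score when only a small panel, rather than the whole exome, has been sequenced.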
