  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
211

A Machine Learning Estimation of the Occupancy of Padel Facilities in Sweden : An application of Random Forest algorithm on a padel booking dataset / Uppskattning av svenska padelanläggningars beläggningsgrad genom maskininlärning

Johansson, Michael, Gonzálvez Läth, Nadia January 2022
Padel is one of the fastest-growing sports in Sweden. Its popularity rose significantly during the Covid-19 pandemic in 2020, as many other types of sports facilities closed and people had more flexible work schedules due to remote work. This paper analyses the monthly occupancy of indoor padel facilities in Sweden between January 2018 and April 2022. It aims to answer to what degree a machine learning algorithm can predict the occupancy of a given padel facility and which key features have the largest impact on occupancy. With these findings, it is possible to estimate the revenue of a given padel facility and thereby identify which types of padel facilities have the best chance of succeeding economically. The paper reviews the literature on machine learning methods applied to booking systems and occupancy estimation. The reviewed literature also presents the most common evaluation metrics used for comparing machine learning models. The study analyses the relationship between the occupancy level of a given padel facility and 12 input features related to that facility, using a random forest regression model. The resulting model achieved an R2 score of 49% and a mean absolute error of 11%. The input features, ranked by their impact on the model’s estimation, are (with the mean of all absolute SHAP values in parentheses): Year (7.71), Month (5.23), Average Income in municipality (4.13), Driving Time from municipality Centre (2.35), Population of municipality (1.97), Padel Slots in municipality (1.27), Padel Slots in facility (1.27), Average Court Price (1.12), Tennis Slots in municipality (0.73), Badminton Slots in municipality (0.55), Squash Slots in municipality (0.44) and Golf Slots in municipality (0.26). Padel facilities had the highest average occupancy in 2020.
The Covid-19 pandemic is likely a significant contributor to this, due to the shutdown of offices and many types of training venues. Year therefore has the largest impact on the model’s estimation. Occupancy of indoor facilities follows a seasonal trend, tending to be highest in December and January and lowest in June and July. This trend can partly be explained by greater demand for indoor sports activities during winter and increased competition from outdoor padel facilities and other activities during summer. Because of this, Month had the second largest impact on the model’s estimation. / Padel is one of the fastest-growing sports in Sweden. Its popularity increased considerably during the Covid-19 pandemic in 2020, mainly because many other types of sports facilities were shut down and people had more flexible work schedules due to remote work. This thesis is an analysis of the monthly occupancy of indoor padel facilities in Sweden between January 2018 and April 2022. The study aims to answer to what degree a machine learning algorithm can predict the occupancy of a given padel facility and which key features have the greatest impact on occupancy. With these insights, it is possible to estimate the revenue of a given padel facility, and they can therefore be used to identify which types of padel facilities have the best chance of being successful from an economic perspective. The reviewed literature studies different machine learning methods applied in areas such as booking system analysis and occupancy studies, and presents the most common evaluation metrics used to compare the methods. This study analyses the relationship between the occupancy level of a given padel facility and 12 input parameters related to that facility, using a random forest regression algorithm. This work results in a model that achieved an R2 value of 49% and a mean absolute error of 11%.
The input parameters, ranked by their impact on the model’s estimation, are (with the mean of all absolute SHAP values in parentheses): Year (7.71), Month (5.23), Average Income in the municipality (4.13), Driving Time between the facility and the municipality centre (2.35), Population of the municipality (1.97), Padel Slots in the municipality (1.27), Padel Slots in the facility (1.27), Average Court Price (1.12), Tennis Slots in the municipality (0.73), Badminton Slots in the municipality (0.55), Squash Slots in the municipality (0.44) and Golf Slots in the municipality (0.26). Padel facilities had the highest average occupancy in 2020. The Covid-19 pandemic is probably a significant contributing cause, owing to the closure of offices and other sports facilities. The input parameter Year therefore has the largest impact on the model’s estimation. The occupancy of indoor facilities follows a seasonal trend, tending to be highest in January and lowest in July. This trend can partly be explained by greater demand for indoor sports activities during winter and increased competition from outdoor padel facilities and other activities during summer. Because of this, Month had the second largest impact on the model’s estimation.
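The feature ranking reported in the abstract above is based on the mean of the absolute SHAP values per feature. A minimal sketch of that aggregation step follows; the function name `rank_by_mean_abs_shap` and the small table of SHAP values are hypothetical and are not taken from the thesis, which does not describe its implementation.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean of their absolute SHAP values.

    shap_rows: one list of SHAP values per observation, ordered like
    feature_names. Returns (name, score) pairs, largest score first.
    """
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: three observations (rows), three features (columns).
features = ["Year", "Month", "Average Court Price"]
shap_rows = [
    [8.0, -5.0, 1.0],
    [-7.0, 6.0, -1.5],
    [9.0, -4.0, 0.5],
]

ranking = rank_by_mean_abs_shap(shap_rows, features)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking absolute values before averaging matters here: a feature that pushes estimates strongly up for some observations and strongly down for others would otherwise average out to a misleadingly small score.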
212

Machine Learning Algorithms to Predict Cost Account Codes in an ERP System : An Exploratory Case Study

Wirdemo, Alexander January 2023
This study aimed to investigate how Machine Learning (ML) algorithms can be used to predict the cost account code to use when handling invoices in an Enterprise Resource Planning (ERP) system commonly found in the Swedish public sector. This involved determining which of the tested algorithms performs best and which criteria need to be met for it to do so. Previous studies on ML and its use in invoice classification have focused on either the accounts payable side or the accounts receivable side of the balance sheet. These studies have used a variety of methods, some involving not only common ML algorithms such as Random forest, Naïve Bayes, Decision tree, Support Vector Machine, Logistic regression, Neural network or k-nearest Neighbor but also other classifiers such as rule classifiers and naïve classifiers. The general conclusion from previous studies is that several algorithms can classify invoices with satisfactory accuracy and that Random forest, Naïve Bayes and Neural network have shown the most promising results. The study was performed as an exploratory case study. The case company was a small municipality where the finance clerks handle received invoices through an ERP system. The accounting step of invoice handling involves selecting the proper cost account code before submitting the invoice for review and approval. The data used were invoice summaries holding the organization number, bankgiro, postgiro and account code used. The algorithms selected for the task were the supervised learning algorithms Random forest and Naïve Bayes and the instance-based algorithm k-Nearest Neighbor (k-NN). The findings indicated that ML could be used to predict which cost account code to use by providing a pre-filled suggestion when the clerk opens the invoice. Among the algorithms tested, Random forest performed best with 78% accuracy (Naïve Bayes and k-NN achieved 69% and 70% accuracy, respectively).
One reason for this is Random forest’s ability to handle several input variables, to generate an unbiased estimate of the generalization error, and to give information about the relationship between the variables and the classification. However, a high level of support is needed for the algorithm to perform at its best, with 335 occurrences as a guiding number in this case. / The purpose of this study was to investigate how Machine Learning (ML) algorithms can be used to predict which account code to use when handling invoices in an ERP system that is common in the Swedish public sector. This involved examining which of the tested algorithms performs best and which criteria must be met for it to perform best. Previous studies on ML and its use in invoice classification have focused on either the accounts payable side or the accounts receivable side of the balance sheet. The studies have used various methods, some involving not only common ML algorithms such as Random forest, Naive Bayes, decision tree, Support Vector Machine, logistic regression, neural network or k-nearest Neighbour, but also other classifiers such as rule classifiers and naive classifiers. The general conclusion from previous studies is that several algorithms can classify invoices with satisfactory accuracy, and that Random forest, Naive Bayes and neural networks have shown the most promising results. The study was carried out as an exploratory case study. The case company was a small municipality where finance assistants handle incoming invoices through an ERP system. The accounting step of invoice handling involves the user selecting the correct cost account code before the invoice is sent for review and approval. The data used were invoice summaries containing organization number, bankgiro, postgiro and account code.
The algorithms selected for the task were the supervised learning algorithms Random forest and Naive Bayes and the instance-based algorithm k-Nearest Neighbour. The results suggest that ML could be used to predict which cost code to use by providing a pre-filled suggestion when the clerk opens the invoice. Among the tested algorithms, Random forest performed best with 78% accuracy (Naïve Bayes and k-Nearest Neighbour achieved 69% and 70% accuracy, respectively). One explanation is Random forest’s ability to handle several input variables, to generate an unbiased estimate of the generalization error, and to give information about the relationship between the variables and the classification. However, a large number of observations is required for the algorithm to perform at its best, with 335 occurrences being a minimum in this case.
213

Modelling Customer Lifetime Value in the Retail Banking Industry / Modellering av Customer Lifetime Value inom retail banking-branschen

Völcker, Max, Stenfelt, Carl January 2021
This thesis was conducted in cooperation with the Swedish bank SEB, which expressed interest in gaining a better understanding of how the marketing measure Customer Lifetime Value could be implemented and used in the retail banking industry. Accordingly, the purpose of this thesis was to provide insight into how Customer Lifetime Value could be modelled appropriately in the retail banking industry and to provide a better understanding of the considerations necessary in the modelling process. First, performance requirements for models of Customer Lifetime Value in the retail banking industry were identified through literature analysis and interviews with SEB. These requirements were then used to evaluate six general modelling approaches: RFM, Probability, Econometric, Persistence, Diffusion/Growth and Computer science. Based on the evaluation, the computer science approach and the econometric approach were identified as suitable for further investigation. This was done by implementing and analysing the performance of two models chosen as examples of each approach. Specifically, a computer science model based on the random forest algorithm and an econometric model based on Markov chains were chosen. The results indicate that both approaches could be appropriate for the retail banking industry, but that an econometric approach could have the advantage of higher interpretability while a computer science approach could have the advantage of higher predictive accuracy. In conclusion, the results indicate that the specific considerations and performance requirements for models of Customer Lifetime Value in the retail banking context should be based on a specific use case and area of business application. However, the discussions, considerations and examples of implementations provided in this thesis could serve as a foundation for future research and model development in this context.
/ This work was carried out in cooperation with SEB, which expressed an interest in increasing its knowledge of how the marketing measure Customer Lifetime Value could be implemented and used in the retail banking industry. The purpose of this thesis was accordingly to provide a better understanding of what constitutes a suitable model of Customer Lifetime Value in the industry, and of the considerations necessary in the modelling process. This was done by first identifying existing model requirements through literature analysis and interviews with SEB. The requirements were then used to evaluate six general model types: RFM, Probability, Econometric, Persistence, Diffusion/Growth and Computer science. Based on the evaluation, Econometric and Computer science were identified as suitable model types for further investigation, which was done by implementing one model of each type. Specifically, a Computer science method based on the random forest algorithm and an Econometric method based on Markov chains were chosen. The results indicated that both model types are suitable for implementation in the retail banking industry, but that an Econometric method could offer greater interpretability and a Computer science method could offer better accuracy. In summary, considerations and requirements for models of Customer Lifetime Value in the retail banking industry should be shaped by the specific intended area of use. The discussions, considerations and implementation examples presented in this work can nevertheless serve as a basis for further research and model development in this context.
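The abstract above mentions an econometric Customer Lifetime Value model based on Markov chains but does not give its specification. As a generic illustration only, the sketch below computes the expected discounted margin per starting state for a finite horizon. The two-state setup (active, churned), the transition probabilities, the margin and the discount rate are all hypothetical, and `markov_clv` is an illustrative name, not the authors' model.

```python
def markov_clv(transition, margin, discount_rate, horizon):
    """Expected discounted margin per starting state over periods 0..horizon.

    transition[i][j]: probability of moving from state i to state j in one period.
    margin[j]: margin earned in a period spent in state j.
    """
    n = len(margin)
    # state_probs[i][j]: probability of being in state j after t periods,
    # given a start in state i (identity matrix at t = 0).
    state_probs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    clv = [0.0] * n
    for t in range(horizon + 1):
        factor = 1.0 / (1.0 + discount_rate) ** t
        for i in range(n):
            clv[i] += factor * sum(state_probs[i][j] * margin[j] for j in range(n))
        # Advance the state distribution by one period.
        state_probs = [
            [sum(state_probs[i][k] * transition[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return clv


# Hypothetical two-state example: state 0 = active customer, state 1 = churned (absorbing).
transition = [[0.8, 0.2],
              [0.0, 1.0]]
margin = [100.0, 0.0]

clv = markov_clv(transition, margin, discount_rate=0.1, horizon=2)
print(clv)
```

The interpretability advantage noted in the abstract shows up even in this toy version: every input is a quantity with a direct business meaning (retention probability, margin per period, discount rate).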
214

Predicting House Prices on the Countryside using Boosted Decision Trees / Förutseende av huspriser på landsbygden genom boostade beslutsträd

Revend, War January 2020
This thesis intends to evaluate the feasibility of supervised learning models for predicting house prices in the countryside of southern Sweden. It is essential for mortgage lenders to have accurate housing valuation algorithms, and the current model offered by Booli is not accurate enough when valuing residences in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared to traditional machine learning methods. These supervised learning models were implemented in order to find the best model with regard to relevant evaluation metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All these models were benchmarked against Booli's current housing valuation algorithms, which are based on a k-NN model. The results indicated that the LightGBM model is the optimal one, as it had the best overall performance with respect to the chosen evaluation metrics. Compared to the benchmark, the LightGBM model performed better overall: it had an RMSE score of 0.330 compared to 0.358 for the Booli model, indicating that there is potential in using boosted decision trees to improve the predictive accuracy of residence prices in the countryside. / This thesis aims to evaluate the feasibility of different supervised learning models for predicting house prices in the countryside of southern Sweden. It is important for mortgage lenders to have accurate algorithms when valuing homes, and the current model offered by Booli has poor accuracy when valuing homes in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared with traditional machine learning methods.
These different types of supervised learning models were implemented to find the best model with respect to relevant performance measures such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The supervised learning models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. The performance of all algorithms is compared with Booli's current housing valuation algorithm, which is based on a k-NN model. The results of this thesis show that the LightGBM model is the optimal model for valuing houses in the countryside, as it had the best overall performance with respect to the chosen evaluation metrics. The LightGBM model was compared with the Booli model, and its performance was better overall: the LightGBM model had an RMSE of 0.330 compared with 0.358 for the Booli model. This indicates that there is potential in using boosted decision trees to improve the accuracy of house price predictions in the countryside.
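The two evaluation metrics named in the abstract above, RMSE and MAPE, can be sketched as follows. The prices and predictions are hypothetical; the abstract does not state the scale on which its RMSE values (0.330 and 0.358) were computed, so no attempt is made to reproduce them.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires non-zero true values."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices and model predictions (arbitrary units).
prices = [100.0, 200.0, 400.0]
predictions = [110.0, 180.0, 400.0]

print(f"RMSE: {rmse(prices, predictions):.3f}")
print(f"MAPE: {mape(prices, predictions):.3f}%")
```

RMSE penalises large individual errors more heavily, while MAPE expresses error relative to each property's price, which is one reason the two are often reported together for valuation models.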
215

Loss Given Default Estimation with Machine Learning Ensemble Methods / Estimering av förlust vid fallissemang med ensembelmetoder inom maskininlärning

Velka, Elina January 2020
This thesis evaluates the performance of three machine learning methods in predicting the Loss Given Default (LGD). LGD can be seen as the complement of the recovery rate, i.e. the share of an outstanding loan that the lender would not be able to recover if the customer defaulted. The methods investigated are decision trees, random forest and boosted methods. All of the methods investigated performed well in predicting the cases where the loan is not recovered, LGD = 1 (100%), or the loan is fully recovered, LGD = 0 (0%). When the performance of the models was evaluated on a dataset from which the observations with LGD = 1 were removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset performed better on the test dataset that included values LGD = 1, and the random forest model built on a balanced training dataset performed better on the test set from which the observations with LGD = 1 were removed. The boosted models evaluated in this study gave less accurate predictions than the other methods. Overall, random forest models performed slightly better than decision tree models, although the computation time (the cost) was considerably longer when running the random forest models. Decision tree models are therefore suggested for prediction of the Loss Given Default. / This thesis examines and compares three machine learning methods for estimating Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the share of the outstanding loan that the lender would not recover if the customer defaulted. The machine learning methods examined in this work are decision trees, random forest and boosted methods. All methods worked well when estimating loans that are either not repaid, i.e. LGD = 1 (100%), or repaid in full, LGD = 0 (0%).
A clear decrease in the models' accuracy was observed when the models were run on a dataset from which observations with LGD = 1 had been removed. Random forest models built on an unbalanced training dataset performed better than the other models on test sets that included observations with LGD = 1. When observations with LGD = 1 were removed, random forest models built on a balanced training dataset performed better than the other models. Boosted models showed the weakest accuracy of the three methods examined in this study. Overall, the study showed that random forest models built on an unbalanced training dataset performed slightly better than decision tree models, but the computation time (the cost) was considerably longer when random forest models were run. Decision tree models would therefore be preferred for estimating loss given default.
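The relationship between LGD and the recovery rate described in the abstract above can be written as LGD = 1 - recovery rate. A minimal sketch follows, assuming the recovery rate is the recovered amount divided by the exposure at default; the function and parameter names are hypothetical and the thesis may define its target variable differently.

```python
def loss_given_default(exposure_at_default, recovered_amount):
    """LGD as the share of the outstanding loan that is not recovered."""
    if exposure_at_default <= 0:
        raise ValueError("exposure_at_default must be positive")
    recovery_rate = recovered_amount / exposure_at_default
    return 1.0 - recovery_rate


# Hypothetical loans with an outstanding amount of 1000 at default.
nothing_recovered = loss_given_default(1000.0, 0.0)     # LGD = 1 (100%)
fully_recovered = loss_given_default(1000.0, 1000.0)    # LGD = 0 (0%)
partly_recovered = loss_given_default(1000.0, 250.0)    # in between

print(nothing_recovered, fully_recovered, partly_recovered)
```

The two boundary cases correspond to the LGD = 1 and LGD = 0 observations that, according to the abstract, all of the evaluated methods predicted well.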
216

A Predictive Analysis of Customer Churn / En Prediktiv Analys av Kundbortfall

Eskils, Olivia, Backman, Anna January 2023
Churn refers to the discontinuation of a contract; consequently, customer churn occurs when existing customers stop being customers. Predicting customer churn is a challenging task in customer retention, but with the advances made in the field of artificial intelligence and machine learning, the feasibility of predicting customer churn has increased. Prior studies have demonstrated that machine learning can be used to forecast customer churn. The aim of this thesis was to develop and implement a machine learning model to predict customer churn and identify the customer features that have a significant impact on churn. This study was conducted in cooperation with the Swedish insurance company Bliwa, which expressed interest in gaining a better understanding of why customers choose to leave. Three models, Logistic Regression, Random Forest, and Gradient Boosting, were used and evaluated. Bayesian optimization was used to optimize the models. After obtaining an indication of their predictive performance during evaluation using cross-validation, it was concluded that LightGBM provided the best result in terms of PR-AUC, making it the most effective approach for the problem at hand. Subsequently, a SHAP analysis was carried out to gain insight into which customer features affect whether or not a customer churns. The outcome of the SHAP analysis revealed specific customer features that had a significant influence on churn. This knowledge can be used to proactively implement measures aimed at reducing the probability of churn. / Predicting customer churn is a challenging task in customer retention, but with the advances made in artificial intelligence and machine learning, the ability to predict customer churn has increased. Previous studies have shown that machine learning can be used to forecast customer churn.
The purpose of this study was to develop and implement a machine learning model to predict customer churn and to identify customer features that have a significant impact on whether or not a customer chooses to leave. This study was carried out in cooperation with the Swedish insurance company Bliwa, which expressed an interest in gaining a better understanding of why customers choose to leave. Three models, Logistic Regression, Random Forest and Gradient Boosting, were used and evaluated. Bayesian optimization was used to optimize these models. After predictive accuracy had been evaluated using cross-validation, it was concluded that LightGBM gave the best result in terms of PR-AUC, and it was therefore considered the most effective method for the problem at hand. A SHAP analysis was then carried out to provide insight into which customer features affect whether or not a customer is at risk of leaving. The result of the SHAP analysis showed that certain customer features stood out and appeared to have a significant impact on customer churn. This knowledge can be used to take proactive measures to reduce the probability of customer churn.
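The abstract above compares models by PR-AUC, the area under the precision-recall curve. One common way to summarise that curve is average precision, sketched below; whether the thesis used this estimator or another (for example trapezoidal integration) is not stated, and the labels and scores are hypothetical. Ties between scores are not given any special treatment in this sketch.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive,
    with observations ranked by descending predicted score.

    labels: 1 for churned customers, 0 otherwise.
    scores: predicted churn probabilities (higher = more likely to churn).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / total_positives


# Hypothetical churn labels and model scores for four customers.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

ap = average_precision(labels, scores)
print(f"Average precision: {ap:.4f}")
```

Precision-recall summaries are often preferred over ROC-AUC when churners are a small minority of customers, since they focus on how well the positive class is ranked.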
217

Early Warning System of Students Failing a Course : A Binary Classification Modelling Approach at Upper Secondary School Level / Förebyggande Varningssystem av elever med icke godkänt betyg : Genom applicering av binär klassificeringsmodell inom gymnasieskolan

Karlsson, Niklas, Lundell, Albin January 2022
Only 70% of Swedish students graduate from upper secondary school within the given time frame. Earlier research has shown that unfinished degrees disadvantage the individual student, policy makers and society. A first step in preventing dropouts is to identify students who are about to fail courses. The purpose is thus to identify tendencies indicating whether or not a student will pass a course. In addition, the thesis accounts for the development of an Early Warning System that can be applied to signal which students need additional support from a professional teacher. The algorithm used, Random Forest, functioned as a binary classification model of a failing grade against a passing grade. The data in the study are samples of approximately 700 students from an upper secondary school within the Stockholm municipality. The chosen method originates from a Design Science Research Methodology that allows the stakeholders to be involved in the process. The results showed that the most dominant indicators for correct classification were Absence, Previous grades and Mathematics diagnosis. Furthermore, variables from the Learning Management System were predominant indicators when the system was also used by teachers. The prediction accuracy of the algorithm indicates a positive tendency towards classifying correctly. On the other hand, the small number of data points raises doubt as to whether an Early Warning System can be applied in its current state. One conclusion is thus that further studies need to increase the number of data points. Suggestions for addressing the problem are given in the Discussion. Moreover, the results are analysed together with a review of the potential Early Warning System from a didactic perspective. Furthermore, the ethical aspects of the thesis are discussed thoroughly. / Only 70% of Swedish upper secondary students graduate within the given time frame. Previous research has shown that an unfinished upper secondary education disadvantages both the student and society at large.
A first step towards preventing students from dropping out of upper secondary school is to identify which students are heading towards a failing grade in courses. The purpose of the report is therefore to identify which trends best indicate whether or not a student will pass a course. In addition, the report describes the development of an early warning system that can be applied to signal which students need additional support from the teacher and the school. The algorithm used was Random Forest, which functions as a binary classification model of a failing grade versus a passing grade. The data used in the study are data points for approximately 700 students from an upper secondary school in the Stockholm area. The chosen method is based on a Design Science Research methodology, which allows stakeholders to be involved in the process. The results showed that the most important variables were absence, previous grades and results from Stockholmsprovet (a municipal mathematics diagnostic test). Furthermore, variables from the learning platform were an important indicator when the learning platform was used by the teacher. The accuracy of the algorithm indicated a positive trend towards correct classification. On the other hand, it is doubtful whether the early warning system can be used in its current state, as the amount of data used to train the algorithm was small. One conclusion is therefore that further studies need to increase the number of data points used. Suggestions for how to address the problem are given in the Discussion. In addition, the results are analysed together with an evaluation of the system from a didactic perspective. The ethical aspects of the report are also discussed throughout.
218

Exploring relationships between in-hospital mortality and hospital case volume using random forest: results of a cohort study based on a nationwide sample of German hospitals, 2016–2018

Roessler, Martin, Walther, Felix, Eberlein-Gonska, Maria, Scriba, Peter C., Kuhlen, Ralf, Schmitt, Jochen, Schoffer, Olaf 21 May 2024
Background: Relationships between in-hospital mortality and case volume were investigated for various patient groups in many empirical studies with mixed results. Typically, those studies relied on (semi-)parametric statistical models like logistic regression. Those models impose strong assumptions on the functional form of the relationship between outcome and case volume. The aim of this study was to determine associations between in-hospital mortality and hospital case volume using random forest as a flexible, nonparametric machine learning method. Methods: We analyzed a sample of 753,895 hospital cases with stroke, myocardial infarction, ventilation > 24 h, COPD, pneumonia, and colorectal cancer undergoing colorectal resection treated in 233 German hospitals over the period 2016–2018. We derived partial dependence functions from random forest estimates capturing the relationship between the patient-specific probability of in-hospital death and hospital case volume for each of the six considered patient groups. Results: Across all patient groups, the smallest hospital volumes were consistently related to the highest predicted probabilities of in-hospital death. We found strong relationships between in-hospital mortality and hospital case volume for hospitals treating a (very) small number of cases. Slightly higher case volumes were associated with substantially lower mortality. The estimated relationships between in-hospital mortality and case volume were nonlinear and nonmonotonic. Conclusion: Our analysis revealed strong relationships between in-hospital mortality and hospital case volume in hospitals treating a small number of cases. The nonlinearity and nonmonotonicity of the estimated relationships indicate that studies applying conventional statistical approaches like logistic regression should consider these relationships adequately.
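The partial dependence functions mentioned in the Methods above can be illustrated with a short sketch: for each candidate value of one feature, that feature is overwritten in every observation and the model's predictions are averaged. The `toy_mortality_model` below is a hypothetical stand-in for a fitted random forest, and its numbers are invented for illustration; they are not the study's estimates.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the data stay unchanged
            modified[feature_index] = value  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_mortality_model(row):
    """Hypothetical stand-in for a fitted model: returns a death probability.

    row = [hospital_case_volume, patient_age]
    """
    case_volume, age = row
    base = 0.05 + 0.001 * (age - 60)
    penalty = 0.10 if case_volume < 50 else 0.0  # invented low-volume effect
    return base + penalty


# Hypothetical patients: [case volume of treating hospital, age].
rows = [[30, 60], [200, 80]]

curve = partial_dependence(toy_mortality_model, rows, feature_index=0, grid=[10, 100])
print(curve)
```

Because the curve is built from model predictions rather than from a fixed functional form, it can take any shape, which is how nonlinear and nonmonotonic relationships like those reported above become visible.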
219

Addressing Issues in the Detection of Gene-Environment Interaction Through the Study of Conduct Disorder

Prom, Elizabeth Chin 01 January 2007
This work addresses issues in the study of gene-environment interaction (GxE) through research of conduct disorder (CD) among adolescents and extends the recent report of significant GxE and subsequent replication studies. A sub-sample of 1,299 individual participants/649 twin pairs and their parents from the Virginia Twin Study of Adolescent and Behavioral Development was used, for whom Monoamine Oxidase A (MAOA) genotype, diagnosis of CD, maternal antisocial personality symptoms, and household neglect were obtained. This dissertation (1) tested for GxE by gender using MAOA and childhood adversity, with multiple approaches to CD measurement and model assessment, (2) determined whether other mechanisms would explain differences in GxE by gender and (3) identified and assessed other genes and environments related to the interaction of MAOA and childhood adversity. Using a multiple regression approach, a main effect of the low/low MAOA genotype remained in females after controlling for other risk factors. However, the effects of GxE were modest and were removed by transforming the environmental measures. In contrast, there was no significant effect of the low activity MAOA allele in males, although significant GxE was detected and remained after transformation. The sign of the interaction for males was opposite to that for females, indicating that genetic sensitivity to childhood adversity may differ by gender. Upon further investigation, gender differences in GxE were due to genotype-sex interaction and may involve MAOA. A Markov Chain Monte Carlo approach incorporating a genetic Item Response Theory model treated CD as a trait with continuous liability, since false detection of GxE may result from measurement. In males and females, the inclusion of GxE while controlling for the other covariates was appropriate, but it yielded little improvement in model fit, and the effect sizes of GxE were small.
Other candidate genes functioning in the serotonin and dopamine neurotransmitter systems were tested for interaction with MAOA to affect risk for CD. Main genetic effects of dopamine transporter genotype and MAOA in the presence of comorbidity were detected. No epistatic effects were detected. The use of random forests systematically assessed the environment and produced several interesting environments that will require more thoughtful consideration before incorporation into a model testing GxE.
220

Interpretation, identification and reuse of models : theory and algorithms with applications in predictive toxicology

Palczewska, Anna Maria January 2014 (has links)
This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Results of this thesis are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures with their biological, chemical or environmental activity. There are many applications that offer an environment to build and store predictive models. Unfortunately, they do not provide advanced functionalities that allow for efficient model selection and for interpretation of model predictions for new data. This thesis aims to address these issues and proposes methodologies for dealing with three research problems: model governance (management), model identification (selection), and interpretation of model predictions. The combination of these methodologies can be employed to build more efficient systems for model reuse in QSAR modelling and other areas. The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed. Once a collection of validated, accepted and well-annotated models is available within a model governance framework, they can be applied to new data. It may happen that there is more than one model available for the same endpoint. Which one to choose? The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from the collection of existing models. 
The main idea is based on partitioning of the search space into groups and assigning a single model to each group. The construction of this partitioning is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points for the search space partition. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology. After having identified a model for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. An interpretation of model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For nonlinear models this information can be hidden inside the model structure. This thesis proposes an approach for interpretation of a random forest classification model. This approach allows for the determination of the influence (called feature contribution) of each variable on the model prediction for an individual data instance. Three methods are proposed in this part that allow analysis of feature contributions. Such analysis might lead to the discovery of new patterns that represent a standard behaviour of the model and allow additional assessment of the model reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the great potential of this methodology. The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
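
The abstract above does not spell out how feature contributions are computed, so the following is only a minimal sketch of one common formulation, not the rfFC implementation. It assumes that each tree node stores the fraction of positive-class training samples that reached it, and that a feature's contribution for one instance is the sum of the changes in that fraction along the instance's decision path, averaged over all trees. The `Node` class, the function names, and the two hand-built trees are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; `value` is the assumed positive-class fraction."""
    value: float
    feature: Optional[int] = None   # index of the split feature; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None  # branch taken otherwise


def tree_contributions(root: Node, x: List[float], n_features: int) -> Tuple[float, List[float]]:
    """Walk the decision path of x and credit each value change to the split feature."""
    contrib = [0.0] * n_features
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contrib[node.feature] += child.value - node.value
        node = child
    # The root value acts as the baseline shared by every instance.
    return root.value, contrib


def forest_contributions(trees: List[Node], x: List[float], n_features: int) -> Tuple[float, List[float]]:
    """Average the baseline and the per-feature contributions over all trees."""
    bias = 0.0
    total = [0.0] * n_features
    for tree in trees:
        b, c = tree_contributions(tree, x, n_features)
        bias += b
        total = [t + v for t, v in zip(total, c)]
    k = len(trees)
    return bias / k, [t / k for t in total]


# Two tiny hand-built trees (illustrative values, not fitted to any dataset).
tree_a = Node(0.5, feature=0, threshold=1.0,
              left=Node(0.2),
              right=Node(0.8, feature=1, threshold=3.0,
                         left=Node(0.6), right=Node(1.0)))
tree_b = Node(0.4, feature=1, threshold=2.0, left=Node(0.1), right=Node(0.7))

bias, contribs = forest_contributions([tree_a, tree_b], [2.0, 5.0], n_features=2)
print("baseline:", bias)
print("feature contributions:", contribs)
print("forest prediction:", bias + sum(contribs))
```

Under these assumptions the baseline plus the summed contributions equals the forest's averaged prediction for the instance, which is what makes the decomposition readable for a single prediction.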
