About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Interpretable Machine Learning in Alzheimer’s Disease Dementia

Kadem, Mason January 2023 (has links)
Alzheimer’s disease (AD) is among the top 10 causes of global mortality, and dementia imposes a yearly $1 trillion USD economic burden. Of particular importance, women and minoritized groups are disproportionately affected by AD, with females at higher risk of developing AD than male cohorts. Differentiating stable mild cognitive impairment (MCI-stable) from early-stage Alzheimer’s disease (MCI-AD) is vital worldwide. Despite genetic markers such as apolipoprotein E (APOE), identification of patients before they develop the early stages of MCI-AD, a critical period for possible pharmaceutical intervention, is not yet possible. Based on a review of the literature, three key limitations in existing AD-specific prediction models are apparent: 1) models developed with traditional statistics overlook nonlinear relationships and complex interactions between features, 2) machine learning models are based on difficult-to-acquire, occasionally invasive, manually selected, and costly data, and 3) machine learning models often lack interpretability. Rapid, accurate, low-cost, easily accessible, non-invasive, interpretable, and early clinical evaluation of AD is critical if an intervention is to have any hope of success. To support healthcare decision making and planning, and potentially reduce the burden of AD, this research leverages the Alzheimer’s Disease Neuroimaging Initiative (ADNI1/GO/2/3) database and a mathematical modelling approach based on supervised machine learning to identify 1) predictive markers of AD, and 2) patients at the highest risk of AD. Specifically, we implemented a supervised XGBoost classifier with diagnostic (Exp 1) and prognostic (Exp 2) objectives. In Experiment 1 (n=441), AD patients (n=72) were classified against healthy controls (n=369); Experiment 2 (n=738) involved classification of MCI-stable (n=444) against MCI-AD (n=294). In Experiment 1, machine learning tools identified three features (Everyday Cognition Questionnaire (study partner) total, Alzheimer’s Disease Assessment Scale (13 items), and Delayed Total Recall) with ROC AUC scores consistently above 97%. Low performance on delayed recall alone appears to distinguish most AD patients, a finding consistent with the pathophysiology of AD, in which individuals have problems storing new information into long-term memory. In Experiment 2, the algorithm identified the major indicators of MCI-to-AD progression by integrating genetic, cognitive-assessment, demographic, and brain-imaging features to achieve ROC AUC scores consistently above 87%. This speaks to the multi-faceted nature of MCI progression and the utility of comprehensive feature selection. These features are important because they are non-invasive and easily collected. As an important focus of this research, the interpretability of the ML models and their predictions was investigated. The interpretable models for both experiments matched the performance of their complex counterparts while improving interpretability, and they provide an intuitive explanation of the decision process, a vital step towards the clinical adoption of machine learning tools for AD evaluation. The models can reliably predict patient diagnosis (Exp 1) and prognosis (Exp 2). In summary, our work extends beyond the identification of high-risk factors for developing AD.
We identified accessible clinical features, together with clinically operable decision routes, to reliably and rapidly predict the patients at the highest risk of developing Alzheimer’s disease. We addressed the aforementioned limitations by providing an intuitive explanation of the decision process over the high-risk, non-invasive, and accessible clinical features that determine a patient’s risk. / Thesis / Master of Science in Biomedical Engineering / Early identification of patients at the highest risk of Alzheimer’s disease (AD) is crucial for possible pharmaceutical intervention. Existing prediction models have limitations, including inaccessible data and a lack of interpretability. This research used a machine learning approach to identify patients at the highest risk of Alzheimer’s disease and found that certain clinical features, such as specific executive-function-related cognitive testing (i.e., task switching), combined with genetic predisposition, brain imaging, and demographics, were important contributors to AD risk. The models were able to reliably predict patient diagnosis and prognosis and were designed to be low-cost, non-invasive, clinically operable, and easily accessible. The interpretable models provided an intuitive explanation of the decision process, making them a valuable tool for healthcare decision-making and planning.
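As a rough illustration of the modelling setup this abstract describes, the sketch below fits an XGBoost binary classifier and reports ROC AUC. The column names and synthetic data are invented stand-ins for the ADNI-derived features (the actual ADNI field names differ), so this shows the shape of the pipeline, not the thesis's code.

```python
# A minimal sketch of the diagnostic classifier described above, assuming a
# tabular dataset with the three cognitive features the thesis highlights;
# column names and data are illustrative, not the actual ADNI fields.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical stand-in for the ADNI-derived table (1 = AD, 0 = healthy control).
df = pd.DataFrame({
    "ecog_sp_total": np.random.rand(441) * 4,       # Everyday Cognition (study partner)
    "adas13": np.random.rand(441) * 85,             # ADAS-Cog 13-item score
    "delayed_total_recall": np.random.rand(441) * 10,
    "diagnosis": np.random.randint(0, 2, 441),
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="diagnosis"), df["diagnosis"], test_size=0.2, random_state=0
)

model = XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

# The thesis reports ROC AUC; on real ADNI-derived features it exceeded 97%.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```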
2

TOWARD ROBUST AND INTERPRETABLE GRAPH AND IMAGE REPRESENTATION LEARNING

Juan Shu (14816524) 27 April 2023 (has links)
Although deep learning models continue to gain momentum, their robustness and interpretability have always been a major concern because of the complexity of such models. In this dissertation, we studied several topics on the robustness and interpretability of convolutional neural networks (CNNs) and graph neural networks (GNNs). We first identified the structural problem of deep convolutional neural networks that leads to adversarial examples, and defined DNN uncertainty regions. We also argued that the generalization error, the large-sample theoretical guarantee established for DNNs, cannot adequately capture the phenomenon of adversarial examples. Secondly, we studied dropout in GNNs, an effective regularization approach to prevent overfitting. In contrast to CNNs, GNNs usually have a shallow structure, because a deep GNN normally sees performance degradation. We studied different dropout schemes and established a connection between dropout and over-smoothing in GNNs, and on that basis developed layer-wise compensation dropout, which allows a GNN to go deeper without suffering performance degradation. We also developed a heteroscedastic dropout that effectively deals with a large number of missing node features due to heavy experimental noise or privacy issues. Lastly, we studied the interpretability of graph neural networks. We developed a self-interpretable GNN structure that denoises useless edges or features, leading to a more efficient message-passing process. The GNN prediction and explanation accuracy were boosted compared with baseline models.
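The dissertation's layer-wise compensation dropout is its own contribution and is not reproduced here; the sketch below only shows the standard dropout-between-layers baseline it builds on, assuming PyTorch and torch_geometric are available.

```python
# A minimal sketch of the standard dropout baseline the dissertation studies
# (not its layer-wise compensation scheme); assumes torch and torch_geometric.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNWithDropout(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, p: float = 0.5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)
        self.p = p  # dropout probability, a regularizer against overfitting

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        # Dropout on node embeddings between message-passing layers; the
        # dissertation connects schemes like this to over-smoothing in deep GNNs.
        x = F.dropout(x, p=self.p, training=self.training)
        return self.conv2(x, edge_index)
```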
3

Enhancement of an Ad Reviewal Process through Interpretable Anomaly Detecting Machine Learning Models / Förbättring av en annonsgranskingsprocess genom tolkbara och avvikelsedetekterande maskinsinlärningsmodeller

Dahlgren, Eric January 2022 (has links)
Technological advancements made in recent decades in the fields of artificial intelligence (AI) and machine learning (ML) have led to further automation of tasks previously performed by humans. Manually reviewing and assessing content uploaded to social media and marketplace platforms is one such task, both tedious and expensive to perform, and could possibly be automated through ML-based systems. When introducing ML model predictions into a human decision-making process, the interpretability and explainability of models have been shown to be important factors in humans trusting individual sample predictions. This thesis project explores the performance of interpretable ML models used together with humans in an ad review process for a rental marketplace platform. Utilizing the XGBoost framework and SHAP for interpretable ML, a system was built with the ability to score an individual ad and explain the prediction with human-readable sentences based on feature importance. The model reached an ROC AUC score of 0.90 and an average precision score of 0.64 on a held-out test set. An end-user survey was conducted which indicated some trust in the model and an appreciation for the local prediction explanations, but low general impact and helpfulness. While most related work focuses on model performance, this thesis contributes a smaller model usability study which can provide grounds for utilizing interpretable ML software in any manual decision-making process.
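A minimal sketch of the scoring-plus-explanation loop the thesis describes might look as follows, using XGBoost and SHAP's TreeExplainer. The ad features and data are invented for illustration; the platform's real feature set is not public.

```python
# A minimal sketch of scoring an ad and explaining the prediction with SHAP,
# in the spirit of the system described above; feature names are invented.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Hypothetical ad features; the real system's feature set is not public.
X = pd.DataFrame({
    "price_deviation": np.random.randn(500),
    "description_length": np.random.randint(10, 2000, 500),
    "account_age_days": np.random.randint(0, 3000, 500),
})
y = np.random.randint(0, 2, 500)  # 1 = anomalous ad

model = XGBClassifier(n_estimators=100).fit(X, y)

# TreeExplainer gives local, per-feature contributions for a single ad.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])

# Turn the top contribution into a human-readable sentence, as the thesis does.
top = np.abs(shap_values[0]).argmax()
print(f"Feature '{X.columns[top]}' contributed most to this ad's score "
      f"({shap_values[0][top]:+.3f}).")
```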
4

The Dynamics of the Impacts of Automated Vehicles: Urban Form, Mode Choice, and Energy Demand Distribution

Wang, Kaidi 24 August 2021 (has links)
The commercial deployment of automated vehicles (AVs) is around the corner. With the development of automation technology, automobile and IT companies have started to test automated vehicles; Waymo, an automated driving technology development company, has recently opened its self-driving service to the public. The advancement of this emerging mobility option also drives transportation researchers and urban planners to conduct AV-related research, especially to gain insight into the impacts of AVs in order to inform policymaking. However, variation with urban form, the heterogeneity of mode choice, and impacts at disaggregated levels make the impacts of AVs dynamic, and these dynamics are not yet comprehensively understood. This dissertation therefore extends the existing knowledge base by examining these dynamics from three perspectives: (1) examining the role of urban form in the performance of shared AV (SAV) systems; (2) exploring the heterogeneity of AV mode choices across regions; and (3) investigating the distribution of energy consumption in the era of AVs. To examine the first aspect, SAV systems are simulated for 286 cities and the simulation outcomes are regressed on urban form variables that measure density, diversity, and design. The results suggest that compact development, a multi-core city pattern, a high level of diversity, and more pedestrian-oriented networks can promote the performance of SAVs, measured using service efficiency, trip pooling success rate, and extra VMT generation. The AV mode choice behaviors of private conventional vehicle (PCV) users in the Seattle and Kansas City metropolitan areas are examined using an interpretable machine learning framework based on an AV mode choice survey. The results suggest that attitudes and trip- and mode-specific attributes are the most predictive. Positive attitudes can promote the adoption of privately owned AVs (PAVs); longer PAV in-vehicle time encourages residents to keep their PCVs; and longer walking distance promotes the usage of SAVs. In addition, the effects of in-vehicle time and walking distance vary across the two examined regions due to distinct urban form, transportation infrastructure, and cultural backgrounds: Kansas City residents tolerate shorter walking distances before switching to SAV choices due to the car-oriented environment, while Seattle residents are more sensitive to in-vehicle travel time because of local congestion levels. The final part of the dissertation examines the demand for energy of AVs at disaggregated levels, incorporating the heterogeneity of AV mode choices. A three-step framework is employed, including the prediction of mode choice, the determination of vehicle trajectories, and the estimation of the demand for energy. The results suggest that the AV scenario can generate -0.36% to 2.91% extra emissions and consume 2.9% more energy if gasoline is used. The revealed distribution of traffic volume suggests that the demand for charging is concentrated around downtown areas and on highways if AVs consume electricity. In summary, the dissertation demonstrates that the impacts and performance of AVs are dynamic across regions, owing to varying urban form, infrastructure, and cultural environments, and to spatial heterogeneity within cities. / Doctor of Philosophy / Automated vehicles (AVs) have been a hot topic in recent years, especially after various IT and automobile companies announced their plans for making AVs.
Waymo, an automated driving technology development company, has recently opened its self-driving service to the public. Automated vehicles, defined as being able to self-drive, self-park, and automate routing, open the door to new business models: privately owned automated vehicles (PAVs) that serve trips within households; shared AVs (SAVs) that offer door-to-door service to the public, requested through app-based platforms; and SAVs with pooling, where multiple passengers may share a vehicle when picking them up and dropping them off sequentially does not require much of a detour. AVs can thus transform the transportation system, especially by reducing vehicle ownership and increasing travel distance. To plan for a sustainable future, it is important to understand the impacts of AVs under various scenarios, and a wealth of case studies explore the system performance of SAVs, such as served trips per SAV per day. However, the impacts of AVs are not static: they tend to vary across cities, depend on heterogeneous mode choices within regions, and may not be evenly distributed within a city. This dissertation therefore fills the research gaps by (1) investigating how urban features such as density may influence the system performance of SAVs; (2) exploring the heterogeneity of key factors that influence decisions about using AVs across regions; and (3) examining the distribution of the demand for energy in the era of AVs. The first study simulates SAVs serving trips within 286 cities and examines the relationship between the system performance of SAVs and city features such as density, diversity, and design. System performance is evaluated using served trips per SAV per day, the percent of pooled trips that allow ridesharing, and the percent of extra Vehicle Miles Traveled (VMT) relative to the VMT requested by the served trips. The results suggest that compact, diverse development patterns and pedestrian-oriented networks can promote the performance of SAVs. The second study uses an interpretable machine learning framework to understand the heterogeneous mode choice behaviors of private car users in the era of AVs in two regions. The framework trains machine learning models on an AV mode choice survey, in which respondents complete mode choice experiments given attributes of the trips. Accumulated Local Effects (ALE) plots are used to analyze the model results: ALE outputs the accumulated change in the probability of choosing a specific mode over small intervals across the range of the variable of interest. The results suggest that attitudes and trip-specific attributes such as in-vehicle time are the most important determinants. Positive attitudes, longer trips, and longer walking distances can promote the adoption of AV modes. In addition, the effects of in-vehicle time and walking distance vary across the two examined regions due to distinct urban form, transportation infrastructure, and cultural backgrounds: Kansas City residents tolerate shorter walking distances before switching to SAV choices due to the car-oriented environment, while Seattle residents are more sensitive to in-vehicle travel time because of local congestion levels. The final part of the dissertation examines the demand for energy of AVs at disaggregated levels, incorporating the heterogeneity of AV mode choices.
A three-step framework is employed, including the prediction of mode choice, the determination of vehicle trajectories, and the estimation of the demand for energy. The results suggest that the AV scenario can generate -0.36% to 2.91% extra emissions and consume 2.9% more energy than a business-as-usual (BAU) scenario if gasoline is used. The revealed distribution of traffic volume suggests that the demand for charging is concentrated around downtown areas and on highways if AVs consume electricity. In summary, the dissertation demonstrates that the impacts and performance of AVs are dynamic across regions, owing to varying urban form, infrastructure, and cultural environments, and to spatial heterogeneity within cities.
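Since ALE plots carry much of the interpretation in the mode-choice study, here is a toy, self-contained ALE computation for one feature. The model is a stand-in logistic response, not the dissertation's survey-trained model, and dedicated libraries would normally be used instead.

```python
# A toy sketch of Accumulated Local Effects (ALE) for a single feature, the
# interpretation tool the dissertation uses for AV mode-choice models.
# Model and data are stand-ins; the real study uses survey-trained models.
import numpy as np

def ale_1d(model_predict, X, feature_idx, n_bins=10):
    """Accumulate the average local change in prediction across feature bins."""
    z = np.quantile(X[:, feature_idx], np.linspace(0, 1, n_bins + 1))
    effects = []
    for lo, hi in zip(z[:-1], z[1:]):
        mask = (X[:, feature_idx] >= lo) & (X[:, feature_idx] <= hi)
        if not mask.any():
            effects.append(0.0)
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, feature_idx], X_hi[:, feature_idx] = lo, hi
        # Local effect: prediction change as the feature moves across the bin.
        effects.append(np.mean(model_predict(X_hi) - model_predict(X_lo)))
    ale = np.cumsum(effects)        # accumulate the local effects
    return z[1:], ale - ale.mean()  # center, as is conventional for ALE plots

# Example with a simple stand-in model: choice probability rises with feature 0
# (think of it as in-vehicle time entering a logistic mode-choice response).
X = np.random.rand(1000, 3)
predict = lambda X: 1 / (1 + np.exp(-(2 * X[:, 0] - 1)))
grid, ale = ale_1d(predict, X, feature_idx=0)
print(np.round(ale, 3))
```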
5

Mohou stroje vysvětlit akciové výnosy? / Can Machines Explain Stock Returns?

Chalupová, Karolína January 2021 (has links)
Recent research shows that neural networks predict stock returns better than any other model. The networks' mathematically complicated nature is both their advantage, enabling them to uncover complex patterns, and their curse, making them less readily interpretable, which obscures their strengths and weaknesses and complicates their usage. This thesis is one of the first attempts at overcoming this curse in the domain of stock return prediction. Using some of the recently developed machine learning interpretability methods, it explains the networks' superior return forecasts. This gives new answers to the long-standing question of which variables explain differences in stock returns, and clarifies the unparalleled ability of networks to identify future winners and losers among the stocks in the market. Building on 50 years of asset pricing research, this thesis is likely the first to uncover whether neural networks support the economic mechanisms proposed by the literature. To a finance practitioner, the thesis offers the transparency of decomposing any prediction into its drivers, while maintaining state-of-the-art profitability in terms of Sharpe ratio. Additionally, a novel metric is proposed that is particularly suited...
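As a concrete, if simplified, example of the kind of interpretability method the thesis applies, the sketch below computes input-gradient attributions for a toy return-predicting network. The network, features, and data are hypothetical; the thesis's actual methods and models are richer.

```python
# A minimal sketch of one common network-interpretability method, input-gradient
# attribution, applied to a toy return predictor. This only illustrates the
# general idea of decomposing a forecast into its drivers.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network: firm characteristics in, expected return out.
net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))

# One stock's characteristics (e.g., size, value, momentum...), hypothetical.
x = torch.randn(1, 5, requires_grad=True)
net(x).squeeze().backward()

# The gradient of the forecast w.r.t. each input approximates how much each
# characteristic drives this particular return prediction.
print(x.grad.squeeze())
```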
6

Towards Understanding slag build-up in a Grate-Kiln furnace: A study of what parameters in the Grate-Kiln furnace lead to increased slag build-up, in a modern pellet production kiln / Mot ökad förståelse av slaguppbyggnad i ett kulsintersverk

Olsson, Oscar, Österman, Uno January 2022 (has links)
As more data is gathered in industrial production facilities, interest in applying machine learning models to that data is growing. This includes the iron ore mining industry, and in particular the build-up of slag in grate-kiln furnaces. Slag is a byproduct of the pelletizing process within these furnaces that can cause production stops, quality issues, and unplanned maintenance. Previous studies on slag build-up have been done mainly by chemists and process engineers. While previous research has hypothesized contributing factors to slag build-up, the studies have mostly been conducted in simulation environments and thus have not applied machine learning models to real sensor data. Luossavaara-Kiirunavaara Aktiebolag (LKAB) has provided data from one of their grate-kiln furnaces: time-series sensor readings that were compressed before storage. A Scala package was built to ingest and interpolate the LKAB data and make it ready for machine learning experiments. The estimation of slag within the kiln was found to be too arbitrary for accurate predictions; therefore, three quality metrics tightly connected to the build-up of slag were selected as target variables instead. Independent and identically distributed (IID) units of data were created by isolating fuel usage, product type produced, and production rate. A further IID criterion adjusted the timestamp of each feature so that feature values could be compared for a single pellet in production: specifically, the time it takes for a pellet to travel from the feature sensor to the quality test was added to the original timestamp. This resulted in a table in which each row represents multiple features and quality measures for the same small batch of pellets. An IID unit of interest was then used to find the most contributing features using principal component analysis (PCA) and lasso regression; with these two methods, the feature set could be reduced to a smaller set of important features. Further, decision tree regression on this subset of important features performed similarly to lasso regression. Decision trees and lasso regression were chosen for interpretability, which was important in order to be able to discuss the contributing factors with LKAB process engineers.
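A compact sketch of the feature-reduction pipeline described above (PCA to gauge variance structure, lasso to shrink the feature set, then an interpretable decision tree) might look like this on synthetic data; the sensor features and target are invented.

```python
# A minimal sketch of the feature-reduction step described above: PCA to
# inspect variance structure, lasso to shrink the feature set, then a
# decision tree on the surviving features. Data here is purely synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))  # stand-in for interpolated sensor features
y = X[:, 0] * 2 - X[:, 5] + rng.normal(scale=0.1, size=2000)  # quality metric

# PCA: how many components explain most of the sensor variance?
pca = PCA(n_components=0.95).fit(X)
print("components for 95% variance:", pca.n_components_)

# Lasso: which original features survive L1 shrinkage?
lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("selected features:", selected)

# Interpretable downstream model on the reduced feature set.
tree = DecisionTreeRegressor(max_depth=4).fit(X[:, selected], y)
print("tree R^2:", round(tree.score(X[:, selected], y), 3))
```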
7

Interpretable Machine Learning for Insurance Risk Pricing / Förståbar Maskinlärning för Riskprissättning Inom Försäkring

Darke, Felix January 2023 (has links)
This Master's Thesis project set out to propose a machine learning model for predicting insurance risk at the level of an individual coverage, and to compare it with the existing models used by the project provider, Gjensidige Försäkring. The problem can be translated into a standard tabular regression task with well-defined target distributions. However, due to interpretability constraints, it was identified early that the set of feasible models does not contain pure black-box models such as XGBoost, LightGBM, and CatBoost, which are typical choices for tabular data regression. In the report, we explicitly formulate the interpretability constraints in sharp mathematical language. It is concluded that interpretability can be ensured by enforcing a particular structure on the Hilbert space over which we search for the model. Using this formalism, we consider two different approaches for fitting high-performing models that maintain interpretability, and conclude that Generalized Additive Models based on gradient-boosted regression trees in general, and the Explainable Boosting Machine in particular, are a promising model candidate consisting of functions within the Hilbert space of interest. The other approach considered is the basis-expansion approach currently used at the project provider. We argue that the gradient-boosted regression tree approach used by the Explainable Boosting Machine is a more suitable model type for an automated, data-driven modelling approach that is likely to generalize well outside the training set. Finally, we perform an empirical study on three internal datasets, in which the Explainable Boosting Machine is compared with the current production models. We find that the Explainable Boosting Machine systematically outperforms the current models on unseen test data. There are many potential explanations for this, but the main hypothesis put forward in the report is that the sequential model-fitting procedure enabled by the regression tree approach lets us effectively explore a larger portion of the Hilbert space containing all permitted models, compared with the basis-expansion approach.
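For concreteness, fitting an Explainable Boosting Machine with the interpret package looks roughly as follows. The policy features, target, and data here are invented; real risk pricing would use exposure-adjusted targets and the provider's internal data.

```python
# A minimal sketch of fitting an Explainable Boosting Machine, the glass-box
# GAM variant the thesis recommends; uses the interpret package. The insurance
# data and target distributions are internal and not modelled here.
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

# Hypothetical policy-level features; real risk models would use exposure,
# claims history, etc., and typically a Poisson/Tweedie-style target.
X = pd.DataFrame({
    "policyholder_age": np.random.randint(18, 90, 1000),
    "vehicle_age": np.random.randint(0, 20, 1000),
    "region_density": np.random.rand(1000),
})
y = 0.02 * X["policyholder_age"] + np.random.rand(1000)  # stand-in claim cost

ebm = ExplainableBoostingRegressor()
ebm.fit(X, y)

# Each learned term's shape function is directly inspectable, which is what
# makes the model a GAM: the prediction is a sum of per-feature contributions.
print(ebm.term_names_)
```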
8

[en] APPROXIMATE BORN AGAIN TREE ENSEMBLES / [pt] ÁRVORES BA APROXIMADAS

28 October 2021 (has links)
[en] Ensemble methods in machine learning such as random forest, boosting, and bagging have been thoroughly studied and proven to have better accuracy than using a single predictor. However, their drawback is that they give models that can be much harder to interpret than those given by, for example, decision trees. In this work, we approach in a principled way the problem of constructing a decision tree that approximately reproduces a tree ensemble, exploring the tradeoff between accuracy and interpretability that can be obtained once exact reproduction is relaxed. First, we formally define the problem of obtaining the decision tree of a given depth that is most adherent to a tree ensemble, and give a dynamic programming algorithm for solving this problem. We also prove that the decision trees obtained by this procedure satisfy generalization guarantees related to the generalization of the original tree ensembles, a crucial element for their effectiveness in practice. Since the computational complexity of the dynamic programming algorithm is exponential in the number of features, we also design heuristics to compute trees of a given depth with good adherence to a tree ensemble. Finally, we conduct a comprehensive computational evaluation of the algorithms proposed. The results indicate that in many situations there is little or no loss in accuracy in working with more interpretable classifiers: even restricting to depth-6 decision trees, our algorithms produce trees with average accuracies within 1 percent (for the dynamic programming algorithm) or 2 percent (heuristics) of the original random forest.
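The classic way to approximate this idea in a few lines is to train a surrogate tree on the ensemble's own predictions, as sketched below. Note this is the standard distillation baseline, not the thesis's dynamic programming algorithm, which instead finds the most adherent tree of a given depth with guarantees.

```python
# A minimal sketch of the general idea behind "born-again" trees: train a
# single shallow decision tree to mimic a random forest's predictions. This is
# the classic distillation baseline, not the thesis's exact dynamic
# programming algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Fit the surrogate on the ensemble's labels rather than the true labels,
# so the tree approximates the ensemble's decision function.
surrogate = DecisionTreeClassifier(max_depth=6, random_state=0)
surrogate.fit(X, forest.predict(X))

fidelity = (surrogate.predict(X) == forest.predict(X)).mean()
print(f"adherence to the forest on training data: {fidelity:.3f}")
```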
9

Applying Machine Learning to Explore Nutrients Predictive of Cardiovascular Disease Using Canadian Linked Population-Based Data / Machine Learning to Predict Cardiovascular Disease with Nutrition

Morgenstern, Jason D. January 2020 (has links)
McMaster University MASTER OF PUBLIC HEALTH (2020) Hamilton, Ontario (Health Research Methods, Evidence, and Impact) TITLE: Applying Machine Learning to Determine Nutrients Predictive of Cardiovascular Disease Using Canadian Linked Population-Based Data AUTHOR: Jason D. Morgenstern, B.Sc. (University of Guelph), M.D. (Western University) SUPERVISOR: Professor L.N. Anderson NUMBER OF PAGES: xv, 121 / The use of big data and machine learning may help to address some challenges in nutritional epidemiology. The first objective of this thesis was to explore the use of machine learning prediction models in a hypothesis-generating approach to evaluate how detailed dietary features contribute to CVD risk prediction. The second objective was to assess the predictive performance of the models. A population-based retrospective cohort study was conducted using linked Canadian data from 2004–2018. Study participants were adults aged 20 and older (n=12 130) who completed the 2004 Canadian Community Health Survey, Cycle 2.2, Nutrition (CCHS 2.2). Statistics Canada has linked the CCHS 2.2 data to the Discharge Abstracts Database and the Canadian Vital Statistics Death database, which were used to determine cardiovascular outcomes (stroke or ischemic heart disease events or deaths). Conditional inference forests were used to develop models. Then, permutation feature importance (PFI) and accumulated local effects (ALEs) were calculated to explore the contributions of nutrients to predicted disease. Supplement use (median PFI (M)=4.09 × 10^-4, IQR=8.25 × 10^-7 to 1.11 × 10^-3) and caffeine (M=2.79 × 10^-4, IQR=-9.11 × 10^-5 to 5.86 × 10^-4) had the highest median PFIs among nutrition-related features. Supplement use was associated with decreased predicted risk of CVD (accumulated local effects range (ALER)=-3.02 × 10^-4 to 2.76 × 10^-4), and caffeine was associated with increased predicted risk (ALER=-9.96 × 10^-4 to 0.035). The best-performing model had a logarithmic loss of 0.248. Overall, many non-linear relationships were observed, including threshold, J-shaped, and U-shaped relationships. The results of this exploratory study suggest that applying machine learning to the nutritional epidemiology of CVD, particularly using big datasets, may help elucidate risks and improve predictive models. Given the limited application thus far, work such as this could lead to improvements in public health recommendations and policy related to dietary behaviours. / Thesis / Master of Public Health (MPH) / This work explores the potential for machine learning to improve the study of diet and disease. In chapter 2, opportunities are identified for big data to make diet easier to measure. We also highlight how machine learning could find new, complex relationships between diet and disease. In chapter 3, we apply a machine learning algorithm called conditional inference forests to a unique Canadian dataset to predict whether people developed strokes or heart attacks. This dataset included responses to a health survey conducted in 2004, where participants' responses have been linked to administrative databases that record when people go to hospital or die, up until 2017. Using these techniques, we identified aspects of nutrition that predicted disease, including caffeine, alcohol, and supplement use. This work suggests that machine learning may be helpful in our attempts to understand the relationships between diet and health.
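A small sketch of permutation feature importance (PFI), one of the two interpretation tools used above: conditional inference forests are typically fit in R (e.g., partykit), so a scikit-learn random forest stands in here, and the nutrient features are invented for illustration.

```python
# A minimal sketch of permutation feature importance (PFI). A scikit-learn
# random forest stands in for the conditional inference forests used in the
# thesis, and the nutrient features and outcome are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "caffeine_mg": rng.gamma(2.0, 100.0, 5000),
    "alcohol_g": rng.gamma(1.5, 5.0, 5000),
    "supplement_use": rng.integers(0, 2, 5000),
})
# Toy outcome loosely tied to two of the features (1 = CVD event).
y = (X["caffeine_mg"] + 50 * X["supplement_use"]
     + rng.normal(0, 50, 5000) > 250).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# PFI: how much does shuffling each feature degrade held-out performance?
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.4f}")
```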
10

Assessment of Predictive Models for Improving Default Settings in Streaming Services / Bedömning av prediktiva modeller för att förbättra standardinställningar i streamingtjänster

Lattouf, Mouzeina January 2020 (has links)
Streaming services provide different settings where customers can choose sound and video quality based on personal preference. The majority of users never make an active choice; instead, they get a default quality setting chosen automatically for them based on parameters such as internet connection quality. This thesis explores personalising the default audio setting, with the aim of improving the user experience, by leveraging machine learning trained on the fraction of users who have actively changed the quality setting. The idea behind the work is that similarity among users who make an active choice can be leveraged to improve the experience of those who do not. The project studied which type of data, from the categories demographic, product, and consumption, is most predictive of a user's taste in sound quality. A case study was conducted to achieve these goals. Five predictive model prototypes were trained, evaluated, compared, and analysed using two different algorithms, XGBoost and logistic regression, targeting two regions: Sweden and Brazil. Feature importance analysis was conducted using SHapley Additive exPlanations (SHAP), a unified, game-theoretic framework for interpreting predictions, and by measuring coefficient weights to determine the most predictive features. Besides exploring feature impact, the thesis also answers, via hypothesis testing, how reasonable it is to generalise these models to non-selecting users. The project also covered bias analysis between users with and without active quality settings, and how that affects the models. The XGBoost models had higher performance. The results showed that demographic and product data had the highest impact on model predictions in both regions, although different regions did not share the same most-predictive features: differences in feature importance were observed between regions and also between platforms. The hypothesis testing did not indicate a valid reason to assume the models work for non-selecting users; however, the method is negatively affected by other factors, such as small changes in big datasets impacting statistical significance. Bias was found in some data features, indicating correlation but not the causation behind the patterns. The results of this thesis additionally show how machine learning can improve the user experience with respect to default sound quality settings, by leveraging similarity with users who have changed the sound quality to the setting most suitable for them.
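The two feature-importance routes the thesis compares, mean absolute SHAP values for XGBoost and coefficient weights for logistic regression, can be sketched as below on invented user features; the real models, features, and data are internal.

```python
# A minimal sketch of the two feature-importance routes described above:
# SHAP values for an XGBoost model and coefficient weights for a logistic
# regression. The user features below are invented stand-ins.
import numpy as np
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "account_age_days": rng.integers(0, 3000, 2000),
    "connection_quality": rng.random(2000),
    "daily_listening_min": rng.gamma(2.0, 30.0, 2000),
})
# Toy target: 1 = user chose the higher sound-quality setting.
y = (X["connection_quality"] + rng.normal(0, 0.2, 2000) > 0.5).astype(int)

# Route 1: XGBoost + SHAP global importance (mean absolute SHAP value).
xgb = XGBClassifier(n_estimators=100).fit(X, y)
shap_values = shap.TreeExplainer(xgb).shap_values(X)
print("mean |SHAP|:",
      dict(zip(X.columns, np.abs(shap_values).mean(axis=0).round(3))))

# Route 2: logistic regression coefficients on standardized features,
# so the weights are comparable across features.
lr = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
print("LR coefficients:", dict(zip(X.columns, lr.coef_[0].round(3))))
```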
