41

Improving the accuracy of statistics used in de-identification and model validation (via the concordance statistic) pertaining to time-to-event data

Caetano, Samantha-Jo January 2020 (has links)
Time-to-event data is very common in medical research. Thus, clinicians and patients need analysis of this data to be accurate, as it is often used to interpret disease screening results, inform treatment decisions, and identify at-risk patient groups (i.e., by sex, race, gene expression, etc.). This thesis tackles three statistical issues pertaining to time-to-event data.

The first issue arose from an Institute for Clinical and Evaluative Sciences lung cancer registry data set, which was de-identified by censoring patients at an earlier date. This resulted in an underestimate of the observed times of censored patients. Five methods were proposed to account for the underestimation incurred by de-identification. A subsequent simulation study was conducted to compare how effectively each method reduced the bias and mean squared error, and improved the coverage probabilities, of four different Kaplan-Meier (KM) estimates. The simulation results demonstrated that situations with relatively large numbers of censored patients required methodology with larger perturbation. In these scenarios, the fourth proposed method (which perturbed censored times such that they were censored in the final year of study) yielded estimates with the smallest bias and mean squared error and the largest coverage probability. Conversely, when there were smaller numbers of censored patients, any manipulation of the altered data set worsened the accuracy of the estimates.

The second issue arises when investigating model validation via the concordance (c) statistic. Specifically, the c-statistic is intended for measuring the accuracy of statistical models which assess the risk associated with a binary outcome. The c-statistic estimates the proportion of patient pairs in which the patient with the higher predicted risk experienced the event. The definition of the c-statistic cannot be uniquely extended to time-to-event outcomes, so many proposals have been made. The second project developed a parametric c-statistic which assumes the true survival times are exponentially distributed in order to invoke the memoryless property. A simulation study was conducted that included a comparative analysis with two other time-to-event c-statistics: three different definitions of concordance in the time-to-event setting were compared, as were three different c-statistics. The c-statistic developed by the authors yielded the smallest bias when censoring was present in the data, even when the exponential parametric assumption did not hold. The c-statistic developed by the authors appears to be the most robust to censored data. Thus, it is recommended to use this c-statistic to validate prediction models applied to censored data.

The third project in this thesis developed and assessed the appropriateness of an empirical time-to-event c-statistic that is derived by estimating the survival times of censored patients via the EM algorithm. A simulation study was conducted for various sample sizes, censoring levels, and correlation rates. A non-parametric bootstrap was employed, and the mean and standard error of the bias of four different time-to-event c-statistics were compared, including the empirical EM c-statistic developed by the authors. The newly developed c-statistic yielded the smallest mean bias and standard error in all simulated scenarios. The c-statistic developed by the authors appears to be the most appropriate when estimating the concordance of a time-to-event model.
Thus, it is recommended to use this c-statistic to validate prediction models applied to censored data. / Thesis / Doctor of Philosophy (PhD)
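The pairwise definition of concordance described in this abstract can be made concrete with a short sketch. The Python snippet below is an illustration only, not one of the estimators developed in the thesis: it computes a simple Harrell-style c-statistic for right-censored data, in which a pair is usable only when the earlier observed time is an event, and the pair is concordant when the patient with the higher predicted risk failed first.

```python
# Illustrative sketch (not the thesis's estimators): Harrell-style concordance
# for right-censored time-to-event data.
import numpy as np

def harrell_c(time, event, risk):
    """time: observed times; event: 1 if event, 0 if censored; risk: predicted risk scores."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            # Order the pair so that a is the earlier observed time.
            a, b = (i, j) if time[i] < time[j] else (j, i)
            if event[a] == 0:          # earlier observation censored -> pair not usable
                continue
            if time[a] == time[b] and event[b] == 1:
                continue               # tied event times: skipped in this simple version
            usable += 1
            if risk[a] > risk[b]:
                concordant += 1.0
            elif risk[a] == risk[b]:
                concordant += 0.5      # ties in predicted risk count as half
    return concordant / usable if usable else np.nan

# Toy example: higher predicted risk corresponds to earlier events.
t = [5, 8, 12, 20, 25]
e = [1, 1, 0, 1, 0]
r = [0.9, 0.7, 0.6, 0.4, 0.1]
print(harrell_c(t, e, r))   # 1.0 on this toy data
```

Parametric or EM-based variants, such as those proposed in the thesis, modify how censored pairs are handled rather than discarding them as this simple version does.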
42

Analytical Study of Advanced Gradient Boosting Algorithms for Machine Learning: A comparison of XGBoost, CatBoost, LightGBM, SnapBoost, KTBoost, AdaBoost and GBDT for classification and regression problems / Analytisk Studie av Avancerade Gradientförstärkningsalgoritmer för Maskininlärning : En jämförelse mellan XGBoost, CatBoost, LightGBM, SnapBoost, KTBoost, AdaBoost och GBDT för klassificering- och regressionsproblem

Wessman, Filip January 2021 (has links)
Machine learning (ML) is today a very relevant, popular and actively researched area. As a result, there now exists a large number of different advanced and modern ML algorithms, and the difficulty is to identify which of these is the most suitable to apply to one's area of application. Algorithms based on Gradient Boosting (GB) have been shown to have a very wide range of application areas, flexibility, high prediction performance, and low training and prediction times. The main purpose of this study is to evaluate and illustrate, on classification and regression datasets, the performance differences of five modern and two older GB algorithms. The goal is to determine which of these modern algorithms performs best on average across several evaluation metrics. Initially, a theoretical pre-study of the current research area was carried out. The algorithms XGBoost, LightGBM, CatBoost, AdaBoost, SnapBoost, KTBoost and GBDT were implemented on the Google Colab platform, where their respective training and prediction times were evaluated along with the performance metrics ROC-AUC and Log Loss for classification and R2 and RMSE for regression. The results showed that there were generally small differences between the tested algorithms, with the exception of AdaBoost, which in general had the worst performance by a larger margin. It was therefore not possible in this comparison to name a clear winner, although SnapBoost performed very well on several evaluation metrics. The model results are largely bound to the applied datasets, which makes it very difficult to generalize them to other datasets. This is reflected in the results by the difficulty of identifying an ML framework that excels and performs well in all scenarios.
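As a rough sketch of the kind of benchmarking loop the study describes, the snippet below compares boosting implementations on a synthetic classification dataset using 5-fold cross-validated ROC-AUC, Log Loss, and wall-clock time. It uses only scikit-learn's built-in boosting models as stand-ins; the thesis's actual comparison covers XGBoost, LightGBM, CatBoost, SnapBoost, KTBoost, AdaBoost and GBDT via their own packages, and its datasets and hyperparameters are not reproduced here.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              HistGradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "GBDT": GradientBoostingClassifier(random_state=0),
    "HistGB (LightGBM-style)": HistGradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

for name, model in models.items():
    start = time.perf_counter()
    cv = cross_validate(model, X, y, cv=5, scoring=["roc_auc", "neg_log_loss"])
    elapsed = time.perf_counter() - start
    print(f"{name:25s} ROC-AUC={cv['test_roc_auc'].mean():.3f} "
          f"LogLoss={-cv['test_neg_log_loss'].mean():.3f} "
          f"time={elapsed:.1f}s")
```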
43

Vitiligo image classification using pre-trained Convolutional Neural Network Architectures, and its economic impact on health care / Vitiligo bildklassificering med hjälp av förtränade konvolutionella neurala nätverksarkitekturer och dess ekonomiska inverkan på sjukvården

Bashar, Nour, Alsaid Suliman, MRami January 2022 (has links)
Vitiligo is a skin disease in which the pigment cells that produce melanin die or stop functioning, causing white patches to appear on the body. Although vitiligo is not considered a serious disease, there is a risk that something is wrong with a person's immune system. In recent years, the use of medical image processing techniques has grown, and research continues to develop new techniques for analysing and processing medical images. In many medical image classification tasks, deep convolutional neural networks have proven their effectiveness, which suggests that they may also perform well in vitiligo classification. Our study uses four deep convolutional neural networks to classify images of vitiligo and normal skin. The architectures selected are VGG-19, ResNeXt101, InceptionResNetV2 and InceptionV3. ROC and AUC metrics are used to assess each model's performance. In addition, the authors investigate the economic benefits that this technology may provide to the healthcare system and patients. To train and evaluate the CNN models, the authors used a dataset that contains 1341 images in total. Because the dataset is limited, 5-fold cross-validation is also employed to improve the models' predictions. The results demonstrate that InceptionV3 achieves the best performance in the classification of vitiligo, with an AUC value of 0.9111, while InceptionResNetV2 has the lowest AUC value of 0.8560.
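A minimal transfer-learning sketch in the spirit of the study is shown below, assuming a TensorFlow/Keras setup. The frozen InceptionV3 backbone, sigmoid head, and AUC metric are illustrative assumptions; the thesis's exact preprocessing, fine-tuning schedule, and 5-fold split are not reproduced.

```python
import tensorflow as tf

def build_vitiligo_classifier(input_shape=(299, 299, 3)):
    # Pretrained ImageNet backbone with the classification head removed.
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                            # freeze pretrained features
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.applications.inception_v3.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # vitiligo vs. normal skin
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

model = build_vitiligo_classifier()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # one fold of a 5-fold CV loop
```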
44

Jackknife Empirical Likelihood Method and its Applications

Yang, Hanfang 01 August 2012 (has links)
In this dissertation, we investigate jackknife empirical likelihood methods motivated by recent statistics research and other related fields. The computational intensity of empirical likelihood can be significantly reduced by using jackknife empirical likelihood methods without losing computational accuracy and stability. We demonstrate that the proposed jackknife empirical likelihood methods are able to handle several challenging and open problems, with elegant asymptotic properties and accurate simulation results in finite samples. These problems include ROC curves with missing data, the difference of two ROC curves in two-dimensional correlated data, a novel inference for the partial AUC, and the difference of two quantiles with one or two samples. In addition, empirical likelihood methodology can be successfully applied to the linear transformation model using adjusted estimating equations. Comprehensive simulation studies of coverage probabilities and average lengths for these topics demonstrate that the proposed jackknife empirical likelihood methods perform well in finite samples under various settings. Moreover, some related and attractive real problems are studied to support our conclusions. In the end, we provide an extensive discussion of some interesting and feasible ideas based on our jackknife EL procedures for future studies.
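The jackknife empirical likelihood construction itself is not spelled out in the abstract. As a rough illustration of its typical first step (an assumed textbook-style sketch, not the author's code), the snippet below computes jackknife pseudo-values of the two-sample AUC U-statistic; empirical likelihood can then be applied to these pseudo-values as if they were approximately independent observations.

```python
import numpy as np

def auc_ustat(x, y):
    """Mann-Whitney U-statistic estimate of AUC = P(Y > X)."""
    x, y = np.asarray(x), np.asarray(y)
    return (y[None, :] > x[:, None]).mean()

def jackknife_pseudo_values(x, y):
    """Leave-one-out pseudo-values of the AUC over the pooled sample."""
    m, n = len(x), len(y)
    N = m + n
    full = auc_ustat(x, y)
    pseudo = np.empty(N)
    for k in range(N):
        if k < m:
            loo = auc_ustat(np.delete(x, k), y)        # delete one control
        else:
            loo = auc_ustat(x, np.delete(y, k - m))    # delete one case
        pseudo[k] = N * full - (N - 1) * loo
    return pseudo

# Toy example: controls x, cases y with partial separation.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(1.0, 1.0, 50)
pv = jackknife_pseudo_values(x, y)
print(auc_ustat(x, y), pv.mean())   # pseudo-value mean reproduces the AUC estimate
```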
45

BAYESIAN-DERIVED VANCOMYCIN AUC24H THRESHOLD FOR NEPHROTOXICITY IN SPECIAL POPULATIONS

Ho, Dan 01 January 2021 (has links)
A Bayesian-derived 24-hour area under the concentration-time curve over minimum inhibitory concentration from broth microdilution (AUC24h/MICBMD) ratio of 400 to 600 is recommended as the new monitoring parameter for vancomycin to optimize efficacy and minimize nephrotoxicity. The AUC24h threshold of 600 mg*h/L for nephrotoxicity was extrapolated from studies that assessed the general population. It is unclear whether this upper threshold holds or varies in special populations such as critically ill patients, obese patients, patients with preexisting renal disease, and patients on concomitant nephrotoxins.

The purpose of this study is to investigate the generalizability of the proposed vancomycin AUC24h threshold of 600 mg*h/L for nephrotoxicity. The objective is to determine the optimal Bayesian-derived AUC24h threshold to minimize vancomycin-associated nephrotoxicity in special populations such as critically ill patients, obese patients, patients with preexisting renal disease, and patients on concomitant loop diuretics, ACEIs, ARBs, NSAIDs, aminoglycosides, piperacillin-tazobactam, and IV contrast dyes. The study design is a single-center, retrospective cohort study. For each patient, nephrotoxicity was assessed and the Bayesian-derived AUC24h was estimated. Using classification and regression tree (CART) analysis, the AUC24h threshold for nephrotoxicity was determined for each special population that had at least ten nephrotoxic patients. The predictive performance (e.g., positive predictive value [PPV], negative predictive value [NPV], sensitivity, specificity, and area under the receiver operating characteristic [ROC] curve) of each CART-derived threshold was then compared to that of the guideline threshold. PPV and sensitivity were given greater weight when comparing the thresholds.

Of the 336 patients, 29 (8.6%) developed nephrotoxicity after initiating vancomycin. Among the special populations of interest, critically ill patients, obese patients, patients with preexisting renal disease, and patients on concomitant loop diuretics included at least ten nephrotoxic patients and were thus further analyzed to determine the CART-derived AUC24h thresholds. The CART-derived AUC24h thresholds were 544 mg*h/L for critically ill patients (n=116), 586 mg*h/L for obese patients (n=111), 539 mg*h/L for patients with preexisting renal disease (n=54), and 543 mg*h/L for patients on concomitant loop diuretics (n=126). Compared to the guideline threshold of 600 mg*h/L, the CART-derived thresholds for critically ill patients, patients with preexisting renal disease, and patients on concomitant loop diuretics had comparable PPVs but significantly higher sensitivities. The CART-derived threshold for obese patients, on the other hand, did not have a significantly different PPV, NPV, sensitivity, specificity, or area under the ROC curve. For critically ill patients, patients with preexisting renal disease, and patients on concomitant loop diuretics, a lower vancomycin AUC24h threshold for nephrotoxicity, such as 544 mg*h/L, 539 mg*h/L, and 543 mg*h/L, respectively, may be considered to minimize the risk of nephrotoxicity. In contrast, this study supports the continued use of the guideline threshold of 600 mg*h/L to minimize the risk of nephrotoxicity in obese patients.
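The CART step amounts to finding the single split on AUC24h that best separates nephrotoxic from non-nephrotoxic patients. The sketch below uses simulated data and a depth-1 scikit-learn tree purely to illustrate the idea; the study's patient data, Bayesian AUC24h estimates, and statistical comparisons are not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
auc24h = rng.normal(500, 90, 300)                    # simulated Bayesian-derived AUC24h (mg*h/L)
p_tox = 1 / (1 + np.exp(-(auc24h - 600) / 40))       # simulated nephrotoxicity risk
nephrotoxic = rng.binomial(1, p_tox)

# A depth-1 classification tree yields the single AUC24h cut point that best
# separates nephrotoxic from non-nephrotoxic patients (the CART-derived threshold).
tree = DecisionTreeClassifier(max_depth=1).fit(auc24h.reshape(-1, 1), nephrotoxic)
threshold = tree.tree_.threshold[0]

predicted = (auc24h > threshold).astype(int)
tp = ((predicted == 1) & (nephrotoxic == 1)).sum()
fp = ((predicted == 1) & (nephrotoxic == 0)).sum()
fn = ((predicted == 0) & (nephrotoxic == 1)).sum()
ppv = tp / (tp + fp)
sensitivity = tp / (tp + fn)
print(f"CART-derived threshold ~ {threshold:.0f} mg*h/L, PPV={ppv:.2f}, sensitivity={sensitivity:.2f}")
```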
46

From War to Peace: A qualitative study of the Colombian conflict between 2002 and 2016 / Från krig till fred : En kvalitativ studie om Colombia konflikten mellan år 2002 - 2016.

Sabanovic, Amna January 2022 (has links)
The purpose of this study is to examine the Colombian conflict, more specifically how peace was achieved in Colombia after five decades. Due to its scope, the study is limited in time; the analysis focuses specifically on the years 2002 to 2016. This essay is a case study with the Colombian conflict as its focus. Furthermore, the theoretical framework for the thesis is mainly the rational choice model. The analysis is based on various analytical tools and an analysis model, which are applied consistently throughout the main investigation of the thesis. The theoretical perspectives are examined using the book "Essence of Decision: Explaining the Cuban Missile Crisis" by Graham Allison and Philip Zelikow. Furthermore, other relevant articles are drawn on to deepen the analysis of the Colombian conflict. Among other things, the articles are of great help in answering the study's questions: In what way can the theory of rational choice explain the conflict resolution in Colombia between the years 2002 and 2016? And which motives have primarily governed the state, the FARC guerrillas, and the AUC paramilitary group?
47

Early Stopping of a Neural Network via the Receiver Operating Curve.

Yu, Daoping 13 August 2010 (has links) (PDF)
This thesis presents the area under the ROC (Receiver Operating Characteristics) curve, abbreviated AUC, as an alternative measure for evaluating the predictive performance of ANN (Artificial Neural Network) classifiers. Conventionally, neural networks are trained until the total error converges to zero, which may give rise to over-fitting problems. To ensure that they do not overfit the training data and then fail to generalize well to new data, it appears effective to stop training as early as possible once the AUC is sufficiently large, by integrating ROC/AUC analysis into the training process. In order to reduce learning costs on imbalanced data sets with uneven class distributions, random sampling and k-means clustering are implemented to draw a smaller subset of representatives from the original training data set. Finally, a confidence interval for the AUC is estimated using a non-parametric approach.
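A minimal sketch of AUC-based early stopping is given below, assuming an incrementally trained scikit-learn MLP as a stand-in for the thesis's network; the target AUC, patience value, and synthetic imbalanced data are illustrative assumptions, and the random sampling and k-means steps are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
target_auc, best_auc, patience, stale = 0.95, 0.0, 5, 0

for epoch in range(200):
    net.partial_fit(X_tr, y_tr, classes=[0, 1])       # one incremental pass over the training data
    auc = roc_auc_score(y_val, net.predict_proba(X_val)[:, 1])
    if auc >= target_auc:                              # stop once the validation AUC is "sufficiently large"
        print(f"epoch {epoch}: AUC={auc:.3f} reached target, stopping")
        break
    if auc > best_auc:
        best_auc, stale = auc, 0
    else:
        stale += 1
        if stale >= patience:                          # or stop when the validation AUC plateaus
            print(f"epoch {epoch}: AUC plateaued at {best_auc:.3f}, stopping")
            break
```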
48

Exploring Alarm Data for Improved Return Prediction in Radios : A Study on Imbalanced Data Classification

Färenmark, Sofia January 2023 (has links)
The global tech company Ericsson has been tracking the return rate of its products for over 30 years, using it as a key performance indicator (KPI). These KPIs play a critical role in making sound business decisions, identifying areas for improvement, and planning. To enhance the customer experience, the company highly values the ability to predict the number of returns in advance each month. However, predicting returns is a complex problem affected by multiple factors that determine when radios are returned. Analysts at the company have observed indications of a potential correlation between alarm data and the number of returns. This paper aims to address the need for better prediction models to improve return rate forecasting for radios, utilizing alarm data. The alarm data, which is stored in an internal database, includes logs of activated alarms at various sites, along with technical and logistical information about the products, as well as historical records of returns.

The problem is approached as a classification task, where radios are classified as either "return" or "no return" for a specific month, using the alarm dataset as input. However, because the number of returned radios is much smaller than the number of distributed radios, the dataset suffers from a heavy class imbalance. The class imbalance problem has garnered considerable attention in the field of machine learning in recent years, as traditional classification models struggle to identify patterns in the minority class of imbalanced datasets. A method that specifically addresses the class imbalance problem was therefore required to construct an effective prediction model for returns. To this end, this paper adopts a systematic approach inspired by similar problems: it applies the feature selection methods LASSO and Boruta, along with the resampling technique SMOTE, and evaluates various classifiers, including the Support Vector Machine (SVM), Random Forest classifier (RFC), Decision Tree (DT), and a Neural Network (NN) with class weights, to identify the best-performing model. As accuracy is not a suitable evaluation metric for imbalanced datasets, the AUC and AUPRC values were calculated for all models to assess the impact of feature selection, weights, resampling techniques, and the choice of classifier.

The best model was determined to be the NN with weights, achieving a median AUC value of 0.93 and a median AUPRC value of 0.043. Likewise, both the LASSO+SVM+SMOTE and LASSO+RFC+SMOTE models demonstrated similar performance, with median AUC values of 0.92 and 0.93 and median AUPRC values of 0.038 and 0.041, respectively. The baseline AUPRC value for this dataset was 0.005. Furthermore, the results indicated that resampling techniques are necessary for successful classification of the minority class. Thorough pre-processing and a balanced split between the test and training sets are crucial before applying resampling, as this technique is sensitive to noisy data. While feature selection improved performance to some extent, it could also lead to unreadable results due to noise. The choice of classifier did not have as large an impact on model performance as resampling and feature selection.
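A hedged sketch of the resampling-and-evaluation pipeline described above is shown below, with synthetic data standing in for the alarm dataset and a random forest as one of the candidate classifiers. Note that SMOTE is applied only to the training split, so the test split keeps the true class ratio and the AUPRC baseline remains the positive rate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from imblearn.over_sampling import SMOTE

# Heavy class imbalance, roughly mimicking "return" vs "no return".
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.995, 0.005],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split; the test split keeps the true class ratio.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
scores = clf.predict_proba(X_te)[:, 1]

print("AUC  :", roc_auc_score(y_te, scores))
print("AUPRC:", average_precision_score(y_te, scores))   # baseline = positive rate
```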
49

Statistical approaches to learning: boosting and ranking / Approches statistiques en apprentissage : boosting et ranking

Vayatis, Nicolas 09 December 2006 (has links) (PDF)
Over the last decade or so, statistical learning theory has expanded considerably. The advent of highly effective algorithms for classifying high-dimensional data, such as boosting and kernel machines (SVMs), raised many statistical questions that Vapnik-Chervonenkis (VC) theory could not resolve. Indeed, the Empirical Risk Minimization principle does not account for practical learning methods, and the combinatorial complexity concept of VC dimension cannot explain the generalization ability of algorithms that select an estimator from a massive class such as the convex hull of a VC class. In the first part of the thesis, we recall the interpretation of boosting algorithms as implementations of convex risk minimization principles and study their properties from this angle. In particular, we show the importance of regularization for obtaining consistent strategies. We also develop a new class of stochastic gradient-type algorithms, called mirror descent algorithms with averaging, and evaluate their behavior through computer simulations. After presenting the fundamental principles of boosting, the second part turns to more advanced questions such as the development of oracle inequalities. In particular, we study the precise calibration of penalties as a function of the cost criteria used. We present non-asymptotic results on the performance of penalized boosting estimators, notably fast rates under Mammen-Tsybakov-type margin conditions, and we describe the approximation capabilities of boosting based on decision stumps. The third part of the thesis explores the ranking problem. A key issue in applications such as document mining or credit scoring is to order instances rather than to categorize them. We propose a simple formulation of this problem that allows ranking to be interpreted as classification over pairs of observations. The difference in this case is that the empirical criteria are U-statistics, and we therefore develop the theory of classification adapted to this setting. We also explore how the ranking error can be generalized in order to incorporate prior knowledge about the order of the instances, as in the case where only the "best" instances are of interest.
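The ranking-as-pairwise-classification viewpoint from the third part can be illustrated with a small toy construction (an assumption for illustration, not the author's formulation): each ordered pair of observations becomes one training example, and the empirical ranking error, the proportion of discordant pairs, is a U-statistic of degree two.

```python
import numpy as np

def pairwise_dataset(X, y):
    """Turn (X, y) with real-valued labels into pairs ((x_i - x_j), sign(y_i - y_j))."""
    idx_i, idx_j = np.triu_indices(len(y), k=1)
    keep = y[idx_i] != y[idx_j]                     # drop ties: no preference to learn
    diff = X[idx_i[keep]] - X[idx_j[keep]]          # a linear scorer only needs the difference
    labels = np.sign(y[idx_i[keep]] - y[idx_j[keep]])
    return diff, labels

def empirical_ranking_error(scores, y):
    """Proportion of discordant pairs: a U-statistic of degree 2."""
    idx_i, idx_j = np.triu_indices(len(y), k=1)
    keep = y[idx_i] != y[idx_j]
    s_diff = scores[idx_i[keep]] - scores[idx_j[keep]]
    y_diff = y[idx_i[keep]] - y[idx_j[keep]]
    return np.mean(s_diff * y_diff < 0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -0.5, 0.2, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

pairs, labels = pairwise_dataset(X, y)
w_hat = np.linalg.lstsq(pairs, labels, rcond=None)[0]   # crude linear pairwise fit
print("empirical ranking error:", empirical_ranking_error(X @ w_hat, y))
```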
50

Processo alternativo para obtenção de tetrafluoreto de urânio a partir de efluentes fluoretados da etapa de reconversão de urânio / Dry uranium tetrafluoride process preparation using the uranium hexafluoride reconversion process effluents

SILVA NETO, JOAO B. da 09 October 2014 (has links)
Dissertação (Mestrado) / IPEN/D / Instituto de Pesquisas Energeticas e Nucleares - IPEN/CNEN-SP
