41 |
Lyme Disease and Forest Fragmentation in the Peridomestic Environment
Telionis, Pyrros A. 14 May 2020 (has links)
Over the last 20 years, Lyme disease has grown to become the most common vector-borne disease affecting Americans. Spread in the eastern U.S. primarily by the bite of Ixodes scapularis, the black-legged tick, the disease affects an estimated 329,000 Americans per year. Originally confined to New England, it has since spread across much of the east coast and has become endemic in Virginia. Since 2010 the state has averaged 1200 cases per year, with 200 annually in the New River Health District (NRHD), the location of our study.
Efforts to geographically model Lyme disease primarily focus on landscape and climatic variables. The disease depends heavily on the survival of the tick vector and of the white-footed mouse, its primary reservoir. Both depend on the existence of forest-herbaceous edge habitats, as well as warm summer temperatures, mild winter lows, and summer wetness. While many studies have investigated the effect of forest fragmentation on Lyme disease, none have made use of high-resolution land cover data to do so at the peridomestic level.
To fill this knowledge gap, we made use of the Virginia Geographic Information Network’s 1-meter land cover dataset and identified forest-herbaceous edge habitats for the NRHD. We then calculated the density of these edge habitats at 100-, 200-, and 300-meter radii, representing the peridomestic environment. We also calculated the density of <2-hectare forest patches at the same distance thresholds. To avoid confounding from climatic variation, we also calculated mean summer temperatures, total summer rainfall, and the number of consecutive days below freezing in the prior winters. To these we added elevation, terrain shape index, slope, and aspect, included lags on each of the climatic variables, and created environmental niche models of Lyme in the NRHD. We did so using both Boosted Regression Trees (BRT) and Maximum Entropy (MaxEnt) modeling, the two most common niche modeling algorithms in the field today.
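The edge-density predictors described above can be computed directly from a binary land cover raster. The sketch below is a minimal illustration of one way to do this in Python, assuming hypothetical `forest` and `herbaceous` arrays on a 1-meter grid and defining an edge cell as a forest cell with at least one herbaceous neighbour; the 100-, 200-, and 300-meter radii follow the abstract, but the array names and the exact edge definition are assumptions rather than the authors' procedure.

```python
import numpy as np
from scipy import ndimage
from scipy.signal import fftconvolve

def edge_density_maps(forest, herbaceous, radii_m=(100, 200, 300), cell_size_m=1.0):
    """Fraction of forest-herbaceous edge cells within circular neighborhoods."""
    # Edge cells: forest cells with at least one herbaceous neighbour (8-connectivity).
    herb_dilated = ndimage.binary_dilation(herbaceous, structure=np.ones((3, 3), bool))
    edge = ((forest == 1) & herb_dilated).astype(float)

    maps = {}
    for r in radii_m:
        k = int(round(r / cell_size_m))
        y, x = np.ogrid[-k:k + 1, -k:k + 1]
        disk = ((x * x + y * y) <= k * k).astype(float)   # circular window of radius r
        maps[r] = fftconvolve(edge, disk, mode="same") / disk.sum()
    return maps

# Example on a small synthetic grid (a real analysis would read the 1-m VGIN raster).
rng = np.random.default_rng(0)
forest = (rng.random((500, 500)) > 0.6).astype(int)
herbaceous = ((forest == 0) & (rng.random((500, 500)) > 0.5)).astype(int)
density_100m = edge_density_maps(forest, herbaceous, radii_m=(100,))[100]
print(density_100m.shape, density_100m.max())
```

The same neighbourhood convolution can be reused for the density of small forest patches by substituting a patch mask for the edge mask.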
We found that Lyme disease is strongly associated with a higher density of developed-herbaceous edges within 100 meters of the home. Forest patch density was also significant at both the 100-meter and 300-meter levels. This supports the notion that the fine-scale peridomestic environment matters for Lyme outcomes and must be considered even after accounting for fragmentation at wider scales and for variations in climate and terrain. / M.S. / Lyme disease is the most common vector-borne disease in the United States today. Infecting about 330,000 Americans per year, the disease continues to spread geographically. Originally found only in New England, the disease is now common in Virginia. The New River Health District, where we did our study, sees over 200 cases per year.
Lyme disease is mostly spread by the bite of the black-legged tick. As such, we can predict where Lyme cases might be found if we understand the environmental needs of these ticks. The ticks themselves depend on warm summer temperatures, mild winter lows, and summer wetness. But they are also affected by forest fragmentation, which drives up the population of white-footed mice, the tick’s primary host. The mice are particularly fond of the interface between forests and open fields. These edge habitats provide food and cover for the mice, and in turn support a large population of ticks.
Many existing studies have demonstrated this link, but all have done so across broad scales such as counties or census tracts. To our knowledge, no such studies have investigated forest fragmentation near the homes of known Lyme cases. To fill this gap in our knowledge, we made use of high-resolution forest cover data to identify forest-field edge habitats and small isolated forest patches. We then calculated the total density of both within 100, 200, and 300 meters of the homes of known Lyme cases, and compared these to values from non-cases using statistical modeling. We also included winter and summer temperatures, rainfall, elevation, slope, aspect, and terrain shape.
We found that a high density of forest-field edge within 100 meters of a home increases the risk of Lyme disease to residents of that home. The same can be said for isolated forest patches. Even after accounting for all other variables, this effect was still significant. This information can be used by health departments to predict which neighborhoods may be most at risk for Lyme disease. They can then increase surveillance in those areas, warn local doctors, or send out educational materials.
|
42 |
Development of a Surface Roughness Prediction & Optimization Framework for CNC Turning
Bennett, Kristin S. January 2024 (has links)
Computer numerical control (CNC) machining is an integral element of the manufacturing industry for the production of components that must meet several outcome requirements. The surface roughness (Ra) of a workpiece is one of the most important outcomes in finish machining processes due to its direct impact on the functionality and lifespan of components in their intended applications. Several factors contribute to the creation of Ra in machining, including, but not limited to, the machining parameters, properties of the workpiece, and tool geometry and wear. As an alternative to the traditional selection of machining parameters using existing standards and/or expert knowledge, current studies in the literature have examined methods that consider these factors for the prediction and optimization of machining parameters to minimize Ra. These methods span many approaches, including theoretical modelling and simulation, design of experiments, and statistical and machine learning methods. Despite the abundance of research in this area, challenges remain regarding the generalizability of models to multiple machining conditions and the lengthy training requirements of methods based solely on machine learning. Furthermore, many machine learning methods focus on static cutting parameters rather than considering properties of the tool and workpiece and dynamic factors such as tool wear.
The main contribution of this research was to develop a prediction and optimization model framework to minimize Ra for finish turning that combines theoretical and machine learning methods and can be practically utilized by CNC machine operators for parameter decision making. The presented research work was divided into four distinct objectives. The first objective focused on analyzing the relationship between the machining parameters and Ra for three different materials with varying properties (AISI 4340, AISI 316, and CGI 450). The second objective targeted the development of an Ra prediction framework that combined a kinematics-based prediction model with an ensemble gradient boosted regression tree (GBRT) to create a multi-material model with justified results, while strengthening accuracy with the machine learning component. The results demonstrated that the multi-material model was able to provide predictions with a root-mean-square error (RMSE) of 0.166 μm, with 70% of testing predictions falling within the limits set by the ASME B46.1-2019 standard. This standard was utilized as an efficient evaluation tool for determining whether the prediction accuracy was within an acceptable range.
The remaining objectives focused on investigating the relationship between tool wear and Ra through a focused study on AISI 316, followed by application of the prediction model framework as the fitness function for testing three different metaheuristic optimization algorithms to minimize Ra. The results revealed a significant relationship between tool wear and Ra, which enabled improvement of the prediction framework by using the tool’s total cutting distance as an indicator of tool wear and as an input to the prediction model. Significant prediction improvement was achieved, demonstrated by an RMSE of 0.108 μm and 87% of predictions falling within the ASME B46.1-2019 limits. The improved prediction model was used as the fitness function to compare the performance of a genetic algorithm (GA), particle swarm optimization (PSO), and simulated annealing (SA) under constrained and unconstrained conditions. SA demonstrated superior performance, with less than 5% error between the optimal and experimental Ra when constrained to the experimental data set during validation testing. The overall results of this research establish the feasibility of a framework that could be applied in an industrial setting both to predict Ra for multiple materials and to support the determination of parameters for minimizing Ra while considering the dynamic nature of tool wear. / Thesis / Master of Applied Science (MASc) / The surface quality produced on a workpiece via computer numerical control (CNC) machining is influenced by many factors, including the machining parameters, characteristics of the workpiece, and the cutting tool’s geometry and wear. When the optimal machining parameters are not used, manufacturing companies may incur unexpected costs associated with scrapped components, as well as the time and materials required for re-machining the component. This research focuses on developing a model to indirectly predict surface roughness (Ra) in CNC turning and to provide operators guidance regarding the optimal machining parameters to ensure the machined surface is within specifications. A multi-material Ra prediction model was produced to allow for use under multiple machining conditions. This was enhanced by comparing three different optimization algorithms to evaluate their suitability with the prediction framework for providing recommendations on the optimal machining parameters, considering an indicator of tool wear as an input factor.
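As a rough illustration of the hybrid prediction-plus-optimization loop described above (not the author's actual implementation), the sketch below combines a textbook kinematic estimate of ideal turning roughness, Ra ≈ f²/(31.2·r) with f the feed and r the nose radius, with a gradient boosted regression tree that learns the residual from simulated measurements, and then uses simulated annealing to choose cutting parameters that minimize the predicted Ra; the parameter ranges, the synthetic data, and the residual-learning setup are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from scipy.optimize import dual_annealing

def kinematic_ra(feed_mm_rev, nose_radius_mm):
    """Textbook ideal-roughness estimate for turning, in micrometres."""
    return (feed_mm_rev ** 2) / (31.2 * nose_radius_mm) * 1000.0

# Hypothetical training data: [feed (mm/rev), speed (m/min), nose radius (mm), cutting distance (m)].
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.uniform(0.05, 0.30, 400),   # feed
    rng.uniform(80, 300, 400),      # cutting speed
    np.full(400, 0.8),              # nose radius
    rng.uniform(0, 4000, 400),      # total cutting distance (tool-wear proxy)
])
# Synthetic "measured" Ra: kinematic term plus wear-dependent deviation and noise.
ra_measured = kinematic_ra(X[:, 0], X[:, 2]) + 0.0001 * X[:, 3] + rng.normal(0, 0.05, 400)

# GBRT learns the residual between measurement and the kinematic baseline.
residual_model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
residual_model.fit(X, ra_measured - kinematic_ra(X[:, 0], X[:, 2]))

def predict_ra(feed, speed, nose_radius=0.8, cutting_distance=1000.0):
    x = np.array([[feed, speed, nose_radius, cutting_distance]])
    return kinematic_ra(feed, nose_radius) + residual_model.predict(x)[0]

# Simulated annealing over (feed, speed) with predicted Ra as the fitness function.
result = dual_annealing(lambda p: predict_ra(p[0], p[1]), bounds=[(0.05, 0.30), (80, 300)], seed=2)
print("suggested feed/speed:", result.x, "predicted Ra (um):", result.fun)
```

Carrying the tool's accumulated cutting distance along as an extra model input, as done here, is one simple way to let the same framework account for tool wear when recommending parameters.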
|
43 |
Understanding spatial patterns of land-system change in Europe
Levers, Christian 27 April 2016 (has links)
Die Nutzung von terrestrischen Ökosystemen zur Befriedigung der Grundbedürfnisse der Menschheit hat tiefgreifende Auswirkungen auf das Erdsystem und führte zur Ausprägung von anthropogen dominierten Landsystemen. Diese sind von hoher Komplexität, da sie aus einer Vielzahl von unterschiedlichsten Einflussfaktoren angetriebenen Landnutzungsveränderungen hervorgegangen sind. Aktuelle Forderungen nach einer nachhaltigen zukünftigen Landnutzung erfordern ein fundiertes und integratives Verständnis dieser Komplexität. Das Hauptziel dieser Arbeit ist es, ein besseres Verständnis der raum-zeitlichen Muster und Determinanten des Landsystemwandels, insbesondere der Landnutzungsintensität, in Europa zwischen 1990 und 2010 zu erlangen. Europa ist ein interessantes Studiengebiet, da es jüngst starke Landnutzungsveränderungen erlebte und seine Heterogenität zu einer Vielfalt von Landsystemen und Landsystemveränderungen führte. Das Ziel der Arbeit wurde durch (i) die Kartierung von Intensitätsmustern und deren Veränderungen in Forst- und Agrarsystemen sowie der Ermittlung der dafür einflussreichsten räumlichen Determinanten und (ii) die Kartierung und Charakterisierung archetypischer Muster und Entwicklungsverläufe von Landsystemen untersucht. Die Ergebnisse dieser Arbeit zeigten einen deutlichen Ost-West-Unterschied in Landsystemmustern und -veränderungen in Europa, mit intensiv genutzten und intensivierenden Regionen vor allem in Westeuropa. Dennoch wurde Europa vor allem durch relativ stabile Landsystemmuster gekennzeichnet und (De-)Intensivierungstrends waren nur von untergeordneter Bedeutung. Intensitätsmuster und -veränderungen waren stark an Standortbedingungen gebunden, vor allem an edaphische, klimatische, und länderspezifische Besonderheiten. Diese Arbeit erweitert das Verständnis des Landsystemwandels in Europa und kann zur Entwicklung wissenschaftlicher und politikbezogener Maßnahmen sowie zur Erreichung einer nachhaltigeren Landnutzung in Europa beitragen. / The utilisation of terrestrial ecosystems to satisfy the basic needs of humankind has profound impacts on the Earth System and led to the development of human-dominated land systems. These are substantially complex as they evolved from a multitude of land-change pathways driven by a variety of influential factors. Current calls for a more sustainable future land-use require a sound and integrative understanding of this complexity. The main goal of this thesis is to better understand the spatio-temporal patterns and the determinants of land-system change in Europe between 1990 and 2010, especially with regard to land-use intensity. Europe serves as an interesting study region as it recently experienced a period of marked land-use change, and since its large environmental, political, and socio-economic heterogeneity resulted in a diversity of land systems and land-change pathways. Land-system changes in Europe were examined by (i) mapping patterns and changes in forestry and agricultural intensity and identifying the most influential spatial determinants related to these changes, and (ii) mapping and characterising archetypical patterns and trajectories of land systems considering both land-use extent and intensity indicators. Results revealed a distinct east-west divide in Europe’s land-system patterns and change trajectories, with intensively used and intensifying regions particularly located in Western Europe. 
However, Europe was mainly characterised by relatively stable land-system patterns, with (de-)intensification trends being only of minor importance. Land-use intensity levels and changes were strongly related to site conditions, especially with regard to soil and climate, as well as to country-specific characteristics. By fostering the understanding of land-system change, this thesis has the potential to contribute to scientific and policy-related actions that address current efforts to guide future land systems in Europe towards more sustainable use.
|
44 |
Telecommunications Trouble Ticket Resolution Time Modelling with Machine Learning / Modellering av lösningstid för felanmälningar i telenät med maskininlärning
Björling, Axel January 2021 (has links)
This report explores whether machine learning methods such as regression and classification can be used with the goal of estimating the resolution time of trouble tickets in a telecommunications network. Historical trouble ticket data from Telenor were used to train different machine learning models. Three different machine learning classifiers were built: a support vector classifier, a logistic regression classifier and a deep neural network classifier. Three different machine learning regressors were also built: a support vector regressor, a gradient boosted trees regressor and a deep neural network regressor. The results from the different models were compared to determine what machine learning models were suitable for the problem. The most important features for estimating the trouble ticket resolution time were also investigated. Two different prediction scenarios were investigated in this report. The first scenario uses the information available at the time of ticket creation to make a prediction. The second scenario uses the information available after it has been decided whether a technician will be sent to the affected site or not. The conclusion of the work is that it is easier to make a better resolution time estimation in the second scenario compared to the first scenario. The differences in results between the different machine learning models were small. Future work can include more information and data about the underlying root cause of the trouble tickets, more weather data and power outage information in order to make better predictions. A standardised way of recording and logging ticket data is proposed to make a future trouble ticket time estimation easier and reduce the problem of missing data. / Den här rapporten undersöker om maskininlärningsmetoder som regression och klassificering kan användas för att uppskatta hur lång tid det tar att lösa en felanmälan i ett telenät. Data från tidigare felanmälningar användes för att träna olika maskininlärningsmodeller. Tre olika klassificerare byggdes: en support vector-klassificerare, en logistic regression-klassificerare och ett neuralt nätverk-klassificerare. Tre olika regressionsmodeller byggdes också: en support vector-regressor, en gradient boosted trees-regressor och ett neuralt nätverk-regressor. Resultaten från de olika modellerna jämfördes för att se vilken modell som är lämpligast för problemet. En undersökning om vilken information och vilka datavariabler som är viktigast för att uppskatta tiden det tar att lösa felanmälan utfördes också. Två olika scenarion för att uppskatta tiden har undersökts i rapporten. Det första scenariot använder informationen som är tillgänglig när en felanmälan skapas. Det andra scenariot använder informationen som finns tillgänglig efter det har bestämts om en tekniker ska skickas till den påverkade platsen. Slutsatsen av arbetet är att det är lättare att göra en bra tidsuppskattning i det andra scenariot jämfört med det första scenariot. Skillnaden i resultat mellan de olika maskininlärningsmodellerna var små. Framtida arbete inom ämnet kan använda information och data om de bakomliggande orsakerna till felanmälningarna, mer väderdata och information om elavbrott. En standardiserad metod för att samla in och logga data för varje felanmälan föreslås också för att göra framtida tidsuppskattningar bättre och undvika problemet med datapunkter som saknas.
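A minimal sketch of the kind of regression comparison described above, using scikit-learn; the ticket fields (`site_type`, `alarm_count`, `technician_dispatched`) and the target `resolution_hours` are invented placeholders rather than Telenor's actual schema, and the two feature sets mirror the two prediction scenarios discussed in the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

# Hypothetical ticket table; a real study would load historical trouble tickets.
rng = np.random.default_rng(0)
n = 2000
tickets = pd.DataFrame({
    "site_type": rng.choice(["urban", "rural", "highway"], n),
    "alarm_count": rng.integers(1, 20, n),
    "technician_dispatched": rng.integers(0, 2, n),   # known only in scenario 2
})
tickets["resolution_hours"] = (
    4 + 2 * tickets["alarm_count"] + 30 * tickets["technician_dispatched"]
    + rng.normal(0, 5, n)
)

def evaluate(feature_cols, model):
    pre = ColumnTransformer([("cat", OneHotEncoder(), ["site_type"])], remainder="passthrough")
    pipe = make_pipeline(pre, model)
    scores = cross_val_score(pipe, tickets[feature_cols], tickets["resolution_hours"],
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()

scenario1 = ["site_type", "alarm_count"]               # information at ticket creation
scenario2 = scenario1 + ["technician_dispatched"]      # information after the dispatch decision
for name, model in [("GBT", GradientBoostingRegressor()), ("SVR", SVR())]:
    print(name, "MAE scenario 1:", round(evaluate(scenario1, model), 2),
          "| scenario 2:", round(evaluate(scenario2, model), 2))
```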
|
45 |
Radar based tank level measurement using machine learning : Agricultural machines / Nivåmätning av tank med radar sensorer och maskininlärning
Thorén, Daniel January 2021 (has links)
Agriculture is becoming more dependent on computerized solutions to make the farmer’s job easier. The big step that many companies are working towards is fully autonomous vehicles that work the fields. To that end, the equipment fitted to said vehicles must also adapt and become autonomous. Making this equipment autonomous takes many incremental steps, one of which is developing an accurate and reliable tank level measurement system. In this thesis, a system for tank level measurement in a seed planting machine is evaluated. Traditional systems use load cells to measure the weight of the tank; however, these types of systems are expensive to build and cumbersome to repair. They also add a lot of weight to the equipment, which increases the fuel consumption of the tractor. Thus, this thesis investigates the use of radar sensors together with a number of Machine Learning algorithms. Fourteen radar sensors are fitted to a tank at different positions, data is collected, and a preprocessing method is developed. Then, the data is used to test the following Machine Learning algorithms: Bagged Regression Trees (BG), Random Forest Regression (RF), Boosted Regression Trees (BRT), Linear Regression (LR), Linear Support Vector Machine (L-SVM), and Multi-Layer Perceptron Regressor (MLPR). The model with the best 5-fold cross-validation scores was Random Forest, closely followed by Boosted Regression Trees. A robustness test, using 5 previously unseen scenarios, revealed that the Boosted Regression Trees model was the most robust. The radar position analysis showed that 6 sensors together with the MLPR model gave the best RMSE scores. In conclusion, the models performed well on this type of system, which shows that they might be a competitive alternative to load cell based systems.
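The robustness test described above, i.e. evaluating on scenarios the model has never seen, can be mimicked with grouped cross-validation. The sketch below is a hedged illustration in scikit-learn; the simulated 14-sensor readings, the fill-level target, and the scenario labels are invented stand-ins for the thesis's measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(42)
n_samples, n_sensors = 1200, 14
fill_level = rng.uniform(0, 100, n_samples)        # target: tank fill level in percent
scenario = rng.integers(0, 6, n_samples)           # e.g. different tilt / material / vibration setups
# Simulated radar distances: shorter readings when the tank is fuller, plus scenario-dependent offsets.
X = (200 - 1.5 * fill_level[:, None]
     + 5 * scenario[:, None]
     + rng.normal(0, 3, (n_samples, n_sensors)))

logo = LeaveOneGroupOut()  # each fold holds out one whole scenario, mimicking "previously unseen" conditions
for name, model in [("Random Forest", RandomForestRegressor(n_estimators=200)),
                    ("Boosted Trees", GradientBoostingRegressor())]:
    rmse = -cross_val_score(model, X, fill_level, groups=scenario, cv=logo,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: held-out-scenario RMSE = {rmse.mean():.2f} ± {rmse.std():.2f}")
```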
|
46 |
Recherche de Supersymétrie à l’aide de leptons de même charge électrique dans l’expérience ATLAS
Trépanier, Hubert 08 1900 (has links)
La théorie de la Supersymétrie est étudiée ici en tant que théorie complémentaire au Modèle Standard, sachant que celui-ci n'explique qu'environ 5% de l'univers et est incapable de répondre à plusieurs questions fondamentales en physique des particules. Ce mémoire contient les résultats d'une recherche de Supersymétrie effectuée avec le détecteur ATLAS et utilisant des états finaux contenant entre autres une paire de leptons de même charge électrique ou trois leptons. Les données proviennent de collisions protons-protons à 13 TeV d'énergie dans le centre-de-masse produites au Grand Collisionneur de Hadrons (LHC) en 2015. L'analyse n'a trouvé aucun excès significatif au-delà des attentes du Modèle Standard mais a permis tout de même de poser de nouvelles limites sur la masse de certaines particules supersymétriques. Ce mémoire contient aussi l'étude exhaustive d'un bruit de fond important pour cette analyse, soit le bruit de fond provenant des électrons dont la charge est mal identifiée. L'extraction du taux d'inversion de charge, nécessaire pour connaître combien d'événements seront attribuables à ce bruit de fond, a démontré que la probabilité pour que la charge d'un électron soit mal identifiée par ATLAS variait du dixième de pourcent à 8-9% selon l'impulsion transverse et la pseudorapidité des électrons. Puis, une étude fut effectuée concernant l'élimination de ce bruit de fond via l'identification et la discrimination des électrons dont la charge est mal identifiée. Une analyse multi-variée se servant d'une méthode d'apprentissage par arbres de décision, basée sur les caractéristiques distinctives de ces électrons, montra qu'il était possible de conserver un haut taux d'électrons bien identifiés (95%) tout en rejetant la grande majorité des électrons possédant une charge mal identifiée (90-93%). / Since the Standard Model only explains about 5% of our universe and leaves us with a lot of open questions in fundamental particle physics, a new theory called Supersymmetry is studied as a complementary model to the Standard Model. A search for Supersymmetry with the ATLAS detector and using final states with same-sign leptons or three leptons is presented in this master thesis. The data used for this analysis were produced in 2015 by the Large Hadron Collider (LHC) using proton-proton collisions at 13 TeV of center-of-mass energy. No excess was found above the Standard Model expectations but we were able to set new limits on the mass of some supersymmetric particles. This thesis describes in detail the topic of the electron charge-flip background, which arises when the electric charge of an electron is mis-measured by the ATLAS detector. This is an important background to take into account when searching for Supersymmetry with same-sign leptons. The extraction of charge-flip probabilities, which is needed to determine the number of charge-flip events among our same-sign selection, was performed and found to vary from less than a percent to 8-9% depending on the transverse momentum and the pseudorapidity of the electron. The last part of this thesis consists in a study for the potential of rejection of charge-flip electrons. It was performed by identifying and discriminating those electrons based on a multi-variate analysis with a boosted decision tree method using distinctive properties of charge-flip electrons. It was found that we can reject the wide majority of mis-measured electrons (90-93%) while keeping a very high level of efficiency for well-measured ones (95%).
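As a rough, non-ATLAS illustration of the charge-flip discrimination described above, the sketch below trains a boosted decision tree on toy electron-level features and then chooses a threshold that keeps 95% of well-measured electrons, reporting how many mis-measured ones are rejected. The feature names and distributions are assumptions, and the real analysis was performed within the ATLAS software environment rather than with scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 20000
is_flip = rng.random(n) < 0.05                      # ~5% charge-flip electrons in the toy sample
# Toy discriminating variables: charge-flip electrons tend to show more bremsstrahlung-like
# behaviour, e.g. larger E/p, worse track-cluster matching, larger curvature pulls.
X = np.column_stack([
    rng.normal(1.0 + 0.6 * is_flip, 0.3),           # E/p
    rng.normal(0.01 + 0.02 * is_flip, 0.01),        # track-cluster matching distance
    rng.normal(0.0 + 1.5 * is_flip, 1.0),           # q/p pull
])
y = is_flip.astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
bdt = GradientBoostingClassifier(n_estimators=300, max_depth=3, learning_rate=0.1)
bdt.fit(X_tr, y_tr)

scores = bdt.predict_proba(X_te)[:, 1]              # higher score = more charge-flip-like
# Choose the cut that keeps 95% of well-measured electrons (score below threshold).
threshold = np.quantile(scores[y_te == 0], 0.95)
eff_good = np.mean(scores[y_te == 0] <= threshold)
rej_flip = np.mean(scores[y_te == 1] > threshold)
print(f"well-measured kept: {eff_good:.1%}, charge-flip rejected: {rej_flip:.1%}")
```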
|
47 |
Intensive poultry production and highly pathogenic avian influenza H5N1 in Thailand: statistical and process-based models / Production intensive de volailles et influenza aviaire hautement pathogène H5N1 en Thaïlande: approches statistiques et mécanistiques
Van Boeckel, Thomas 26 September 2013 (has links)
The highly pathogenic avian influenza (HPAI) H5N1 virus, which emerged in China in 1996, constitutes a threat to human health because of its endemic circulation in domestic poultry and its zoonotic potential. The severity of HPAI H5N1 infection varies among bird species: some ducks are healthy, asymptomatic carriers of the virus, whereas in chicken flocks HPAI is highly contagious and characterized by mortality rates above 90%. In humans, the impact of HPAI H5N1 has so far remained moderate (630 human cases, including 375 deaths; World Health Organization, June 2013) owing to the low transmissibility of the virus from poultry to humans and from human to human. However, given the high case-fatality rate (>50%), a change in the mode of transmission could lead to a much greater impact.
Since its emergence, HPAI H5N1 has had a substantial economic impact in many Southeast Asian countries. Thailand, one of the world's leading exporters of poultry meat, was severely affected by multiple epidemic waves between 2003 and 2005. These episodes affected the incomes of small and medium producers and also caused major economic losses in the intensive poultry production sector because of the embargo imposed by the main export markets.
The objective of this work is to study quantitatively the association between intensive poultry production and the spatio-temporal distribution of HPAI H5N1 in Thailand. Two approaches were developed: on the one hand, statistical models aimed at identifying the determinants of HPAI H5N1 risk; on the other hand, process-based (mechanistic) models aimed at simulating epidemic trajectories based on knowledge of HPAI H5N1 transmission mechanisms, the structure of the poultry production sector, and the intervention measures put in place.
Using environmental and anthropogenic factors, we show that: (i) the distribution of domestic ducks in Asia can be predicted using non-linear regression models, and (ii) poultry production can be disaggregated into extensive and intensive production based on the number of birds per holder. Finally, (iii) using Boosted Regression Trees (BRT), we show that the main determinants of the distribution of HPAI H5N1 risk are ducks raised in intensive systems, the number of rice cropping cycles, and the proportion of water in the landscape. We also illustrate the potential of process-based models to assess the effectiveness of the intervention measures implemented, to test alternative intervention scenarios, and to identify optimal prevention and intervention strategies against future epidemics. / Doctorat en Sciences agronomiques et ingénierie biologique / info:eu-repo/semantics/nonPublished
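A hedged sketch of the Boosted Regression Trees step mentioned above: fitting a BRT to outbreak presence/absence against landscape predictors and reading off the relative influence of each variable. It uses scikit-learn's gradient boosting rather than the R gbm/dismo tooling typical of such studies, and the predictor names and simulated data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 5000
subdistricts = pd.DataFrame({
    "intensive_ducks_km2": rng.gamma(2.0, 50, n),
    "rice_crop_cycles": rng.integers(0, 3, n),
    "water_fraction": rng.beta(2, 8, n),
    "human_pop_km2": rng.gamma(2.0, 100, n),
})
# Synthetic outbreak probability increasing with intensive ducks, rice cycles, and water cover.
logit = (-4 + 0.004 * subdistricts["intensive_ducks_km2"]
         + 0.8 * subdistricts["rice_crop_cycles"]
         + 3.0 * subdistricts["water_fraction"])
outbreak = rng.random(n) < 1 / (1 + np.exp(-logit))

brt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01, max_depth=3, subsample=0.75)
brt.fit(subdistricts, outbreak)

# Relative influence of each predictor, analogous to BRT variable-importance summaries.
for name, imp in sorted(zip(subdistricts.columns, brt.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:22s} {imp:.2f}")
```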
|
48 |
Rare dileptonic B meson decays at LHCb
Morda, Alessandro 28 September 2015 (has links)
Les désintégrations rares B0(s)→ll sont générées par des courants neutres avec changement de la saveur. Pour cette raison, ainsi qu'à cause de la suppression d'hélicité, leurs taux de désintégration sont très petits dans le Modèle Standard (MS), mais la présence de particules virtuelles de Nouvelle Physique peut radicalement modifier cette prédiction. Une partie du travail original présenté dans cette thèse est dédié à l'optimisation de l'algorithme d'Analyse Multi Varié (MVA) pour la recherche de la désintégration B0(s)→μμ avec l'échantillon collecté par l'expérience LHCb. Cet échantillon a été combiné avec celui collecté par l'expérience CMS et pour la première fois la désintégration B0(s)→μμ a été observée. En vue d'améliorer la sensibilité au mode B0(s)→μμ de nouvelles études ont également été menées pour augmenter la performance des analyses multivariées. Une autre partie du travail original présenté dans cette thèse concerne la définition d'une chaine de sélection pour la recherche des désintégrations B0(s)→τ τ. L'état final où les deux τ vont en trois π chargés et un ν est étudié. La présence des deux ν dans l'état final de la désintégration rend difficile une reconstruction des impulsions des deux τ. Cependant, la possibilité de mesurer les deux vertex de désintégration des τ ainsi que le vertex d'origine du candidat B, permet d'imposer des contraintes géométriques qui peuvent être utilisées dans la reconstruction des impulsions des deux τ. En particulier, un nouvel algorithme pour la reconstruction complète, événement par événement, de ces impulsions et de leurs variables associées est présenté et discuté. / The B0(s)→ll decays are generated by Flavor Changing Neutral Currents, hence they can proceed only through loop processes. For this reason, and because of an additional helicity suppression, their branching ratios are predicted to be very small in the Standard Model (SM). A part of the original work presented in this thesis has been devoted to the optimization of the Multi Variate Analysis (MVA) classifier for the search for the B0(s)→μμ decay with the full dataset collected at LHCb. This dataset has also been combined with the one collected by CMS, and the first observation of the B0(s)→μμ decay has been obtained. In view of the update of the analysis aiming to improve the sensitivity to the B0(s)→μμ mode, a new isolation variable, exploiting a topological vertexing algorithm, has been developed, and additional studies for a further optimization of the MVA classifier performance have been carried out. Another part of the original work concerns the definition of a selection chain for the search for B0(s)→τ τ decays in the final state where both τ decay to three charged π and a ν. The presence of two ν in the final state makes the reconstruction of the momenta of the two τ difficult. Nevertheless, the possibility of measuring the two decay vertices of the τ, as well as the B candidate production vertex, makes it possible to impose geometrical constraints that can be used in the reconstruction of the τ momenta. In particular, a new algorithm for the full event-by-event reconstruction of these momenta and of their related variables is presented and discussed.
|
49 |
Free-text Informed Duplicate Detection of COVID-19 Vaccine Adverse Event Reports
Turesson, Erik January 2022 (has links)
To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates negatively affect the quality of data and its analysis. Thus, efforts should be made to detect and clean them automatically. Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for existing solutions employed in VigiBase. This thesis project explores methods for this task, explicitly focusing on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean-pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings. We apply a Gradient Boosted Decision Tree (GBDT) model for classifying report pairs as duplicates or non-duplicates. For a more calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model successfully implements a ruleset to find reports whose narratives mention a unique identifier of its duplicate. The best performing model achieves 73.3% recall and zero false positives on a controlled testing dataset for an F1-score of 84.6%, vastly outperforming VigiBase’s previously implemented model's F1-score of 60.1%. Further, when manually annotated by three reviewers, it reached an average 87% precision when fully deduplicating 11756 reports amongst records relating to hearing disorders.
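A minimal sketch of the narrative-similarity feature described above: mean-pooling token embeddings from a SapBERT checkpoint into sentence embeddings, taking the cosine similarity of two narratives, and feeding it, alongside other pair features, to a GBDT classifier. The model identifier, the example narratives, and the extra pair features are assumptions; the thesis additionally fine-tunes SapBERT on known duplicate pairs and calibrates the GBDT leaf values with logistic regression, neither of which is shown here.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import GradientBoostingClassifier

# Publicly available SapBERT checkpoint (assumed here; the thesis fine-tunes its own copy).
MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(texts):
    """Mean-pool token embeddings (ignoring padding) into one vector per narrative."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (n, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy report pair; real pairs would come from VigiBase candidate matching.
r1 = "Patient developed fever and myalgia two days after the first vaccine dose."
r2 = "Two days after dose one the patient reported muscle pain and fever."
e1, e2 = embed([r1, r2])
narrative_similarity = cosine(e1, e2)

# The similarity becomes one feature among others (here: made-up age gap and same-sex flag)
# for a GBDT duplicate-vs-non-duplicate classifier trained on invented example pairs.
X_train = np.array([[0.95, 0, 1], [0.40, 12, 0], [0.88, 1, 1], [0.30, 30, 0]])
y_train = np.array([1, 0, 1, 0])
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("similarity:", round(narrative_similarity, 3),
      "duplicate probability:", clf.predict_proba([[narrative_similarity, 0, 1]])[0, 1])
```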
|
50 |
Anticipating bankruptcies among companies with abnormal credit risk behaviour : A case study adopting a GBDT model for small Swedish companies / Förutseende av konkurser bland företag med avvikande kreditrisks beteende : En fallstudie som använder en GBDT-modell för små svenska företag
Heinke, Simon January 2022 (has links)
The field of bankruptcy prediction has experienced a notable increase in interest in recent years. Machine Learning (ML) models have been an essential component of developing more sophisticated models. Previous studies within bankruptcy prediction have not evaluated how well ML techniques adapt to data sets of companies with higher credit risks. This study introduces a binary decision rule for identifying companies with higher credit risks (abnormal companies). Two categories of abnormal companies are explored based on the activity of: (1) abnormal credit risk analysis (”AC”, herein) and (2) abnormal payment remarks (”AP”, herein) among small Swedish limited companies. Companies not fulfilling the abnormality criteria are considered normal (”NL”, herein). The abnormal companies showed a significantly higher risk for future payment defaults than NL companies. Previous studies have mainly used financial features for bankruptcy prediction. This study evaluates the contribution of different feature categories: (1) financial, (2) qualitative, (3) performed credit risk analysis, and (4) payment remarks. Implementing a Light Gradient Boosting Machine (LightGBM), the study shows that bankruptcies are easiest to anticipate among abnormal companies compared to NL and all companies (full data set). LightGBM predicted bankruptcies with an average Area Under the Precision Recall Curve (AUCPR) of 45.92% and 61.97% for the AC and AP data sets, respectively. This performance is 6.13 to 27.65 percentage points higher than the AUCPR achieved on the NL and full data sets. SHapley Additive exPlanations (SHAP) values indicate that financial features are the most critical category. However, qualitative features contribute strongly to anticipating bankruptcies for the NL companies and the full data set. The features of performed credit risk analysis and payment remarks are primarily useful for the AC and AP data sets. Finally, two directions are suggested for the field of bankruptcy prediction: (1) evaluating whether bankruptcies among companies with other forms of credit risk can be anticipated with even higher predictive performance, and (2) testing whether other qualitative features bring even better predictive performance to bankruptcy prediction. / Konkursklassificering har upplevt en anmärkningsvärd ökning av intresse de senaste åren. I denna utveckling har maskininlärningsmodeller utgjort en nyckelkompentent i utvecklingen mot mer sofistikerade modeller. Tidigare studier har inte utvärderat hur väl maskininlärningsmodeller kan appliceras för att förutspå konkurser bland företag med högre kreditrisk. Denna studie introducerar en teknik för att definiera företag med högre kreditrisk, det vill säga avvikande företag. Två olika kategorier av avvikande företag introduceras baserat på företagets aktivitet av: (1) kreditrisksanalyser på företaget (”AK”, hädanefter), samt (2) betalningsanmärkningar (”AM”, hädanefter) för små svenska aktiebolag. Företag som inte uppfyller kraven för att vara ett avvikande företag klassas som normala (”NL”, hädanefter). Studien utvärderar sedan hur väl konkurser kan förutspås för avvikande företag i relation till NL och alla företag. Tidigare studier har primärt utvärdera finansiella variabler för konkursförutsägelse. Denna studie utvärderar ett bredare spektrum av variabler: (1) finansiella, (2) kvalitativa, (3) kreditrisks analyser, samt (4) betalningsanmärkningar för konkursförutsägelse. Genom att implementera LightGBM finner studien att konkurser förutspås med högst noggrannhet bland AM företag.
Modellen presenterar bättre för samtliga avvikande företag i jämförelse med både NL företag och för hela datasetet. LightGBM uppnår ett genomsnittligt AUC-PR om 45.92% och 61.97% för AK och AM dataseten. Dessa resultat är 6.13-27.65 procentenheter högre i jämförelse med det AUC-PR som uppnås för NL och hela datasetet. Genom att analysera modellens variabler med SHAP-värden visar studien att finansiella variabler är mest betydelsefulla för modells prestation. Kvalitativa variabler har däremot en stor betydelse för hur väl konkurser kan förutspås för NL företag samt alla företag. Variabelkategorierna som indikerar företagets historik av genomförda kreditrisksanalyser samt betalningsanmärkningar är primärt betydelsefulla för konkursklassificering av AK samt AM företag. Detta introducerar området av konkursförutsägelse till att: (1) undersöka om konkurser bland företag med andra kreditrisker kan förutspås med högre noggrannhet och (2) test om andra kvalitativa variabler ger bättre prediktive prestandard för konkursförutsägelse.
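A minimal sketch of the evaluation loop described above: a LightGBM classifier scored with the area under the precision-recall curve and explained with SHAP values. The feature names and simulated data are placeholders for the study's financial, qualitative, credit-risk-analysis, and payment-remark feature categories, and the `lightgbm` and `shap` packages are assumed to be available.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 20000
companies = pd.DataFrame({
    "equity_ratio": rng.normal(0.3, 0.2, n),          # financial
    "num_board_members": rng.integers(1, 8, n),        # qualitative
    "credit_checks_12m": rng.poisson(1.5, n),          # performed credit risk analyses
    "payment_remarks_12m": rng.poisson(0.3, n),        # payment remarks
})
# Synthetic bankruptcy label: rarer and more likely for weak, flagged companies.
logit = (-4 - 3 * companies["equity_ratio"]
         + 0.4 * companies["payment_remarks_12m"] + 0.2 * companies["credit_checks_12m"])
bankrupt = rng.random(n) < 1 / (1 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(companies, bankrupt, test_size=0.3,
                                           random_state=1, stratify=bankrupt)
model = lgb.LGBMClassifier(n_estimators=400, learning_rate=0.05)
model.fit(X_tr, y_tr)

# AUC-PR (average precision) suits the heavily imbalanced bankruptcy label.
aucpr = average_precision_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC-PR: {aucpr:.3f}")

# SHAP values rank the features by their contribution to the predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
mean_abs = np.abs(shap_values if isinstance(shap_values, np.ndarray) else shap_values[1]).mean(axis=0)
for name, val in sorted(zip(companies.columns, mean_abs), key=lambda t: -t[1]):
    print(f"{name:22s} mean |SHAP| = {val:.3f}")
```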
|