1 |
Combining Partial Least Squares and the Gradient-Boosting Method for Soil Property Retrieval Using Visible Near-Infrared Shortwave Infrared Spectra
Liu, Lanfa; Ji, Min; Buchroithner, Manfred F. 06 June 2018
Soil spectroscopy is increasingly used for soil property characterisation, both in the laboratory and from space (imaging spectroscopy). Partial least squares (PLS) regression is one of the most common approaches for calibrating soil properties against soil spectra. Besides functioning as a calibration method, PLS can also be used as a dimension reduction tool, a role that has scarcely been studied in soil spectroscopy. PLS components retained from high-dimensional spectral data can then be explored further with the gradient-boosted decision tree (GBDT) method. Three soil sample categories were extracted from the Land Use/Land Cover Area Frame Survey (LUCAS) soil library according to the type of land cover (woodland, grassland, and cropland). First, PLS regression and GBDT were applied separately to build spectroscopic models for soil organic carbon (OC), total nitrogen content (N), and clay for each soil category. Then, PLS-derived components were used as input variables for the GBDT model. The results demonstrate that the combined PLS-GBDT approach performs better than PLS or GBDT alone. The relative variable importances for soil property estimation revealed by the proposed method indicate that PLS is a useful dimension reduction tool for soil spectra that retains target-related information.
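The two-stage idea in this abstract (PLS components first, then boosted trees on those components) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the synthetic "spectra", the helper names `pls1_scores`, `fit_stump`, `fit_gbdt` and `predict`, the use of depth-1 trees, and the hyperparameters. It is not the authors' implementation and does not use the LUCAS data.

```python
import random

def pls1_scores(X, y, n_components):
    """Return PLS1 scores (one row per sample) via NIPALS-style deflation."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - col_mean[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                   # centred y
    T = [[] for _ in range(n)]
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        for i in range(n):
            for j in range(p):
                E[i][j] -= t[i] * load[j]  # deflate X
            f[i] -= q * t[i]               # deflate y
            T[i].append(t[i])
    return T

def fit_stump(X, r):
    """Best single-split regression stump on residuals r (squared error)."""
    best, n = None, len(X)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[i] for i in range(n) if X[i][j] <= thr]
            right = [r[i] for i in range(n) if X[i][j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]

def fit_gbdt(X, y, n_rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        j, thr, lm, rm = fit_stump(X, resid)
        stumps.append((j, thr, lm, rm))
        pred = [pi + lr * (lm if row[j] <= thr else rm) for pi, row in zip(pred, X)]
    return base, lr, stumps

def predict(model, X):
    base, lr, stumps = model
    return [base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)
            for row in X]

# Hypothetical data: 25 "bands" that all respond to one latent soil property.
rng = random.Random(0)
n_samples, n_bands = 40, 25
latent = [rng.uniform(-1, 1) for _ in range(n_samples)]
X = [[z * (0.5 + b / n_bands) + rng.gauss(0, 0.05) for b in range(n_bands)] for z in latent]
y = [z ** 3 + 0.5 * z + rng.gauss(0, 0.02) for z in latent]  # nonlinear target

T = pls1_scores(X, y, n_components=2)   # 25 bands reduced to 2 components
model = fit_gbdt(T, y)                  # boosted stumps on the PLS scores
pred = predict(model, T)
y_bar = sum(y) / len(y)
baseline_mse = sum((v - y_bar) ** 2 for v in y) / len(y)
train_mse = sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)
print(f"baseline MSE {baseline_mse:.4f}, PLS-GBDT training MSE {train_mse:.4f}")
```

The sketch reports training error only, so it says nothing about generalisation; a real comparison of PLS, GBDT and PLS-GBDT would need held-out data, as in the study.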
|
3 |
Free-text Informed Duplicate Detection of COVID-19 Vaccine Adverse Event Reports
Turesson, Erik January 2022
To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates negatively affect the quality of the data and its analysis. Thus, efforts should be made to detect and clean them automatically. Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for the existing solutions employed in VigiBase. This thesis project explores methods for this task, focusing specifically on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean-pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings. We apply a Gradient Boosted Decision Tree (GBDT) model for classifying report pairs as duplicates or non-duplicates. For a more calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model successfully implements a ruleset to find reports whose narratives mention a unique identifier of their duplicate. The best performing model achieves 73.3% recall and zero false positives on a controlled testing dataset for an F1-score of 84.6%, vastly outperforming the 60.1% F1-score of the model previously implemented in VigiBase. Further, when manually annotated by three reviewers, it reached an average precision of 87% when fully deduplicating 11,756 reports amongst records relating to hearing disorders.
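The similarity features and the headline F1 figure in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: whitespace tokenisation, toy two-dimensional word vectors standing in for SapBERT output, and hypothetical function names (`jaccard`, `mean_pool`, `cosine`, `f1_score`). It is not the thesis pipeline.

```python
import math

def jaccard(text_a, text_b):
    """Jaccard similarity between the token sets of two narratives."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0  # convention chosen here for two empty narratives
    return len(a & b) / len(a | b)

def mean_pool(word_vectors):
    """Average word vectors element-wise into one sentence embedding."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Two hypothetical narratives that share two of four distinct tokens.
sim_tokens = jaccard("fever after vaccine", "fever after dose")

# Toy word vectors (assumed, not real SapBERT embeddings).
emb_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])
emb_b = mean_pool([[1.0, 1.0], [1.0, 1.0]])
sim_embed = cosine(emb_a, emb_b)

# Zero false positives means precision 1.0; with recall 0.733 the
# F1-score comes out at roughly 0.846, matching the reported 84.6%.
reported_f1 = f1_score(1.0, 0.733)
print(sim_tokens, sim_embed, round(reported_f1, 3))
```

In a pipeline like the one described, such similarity scores would be features for the pair classifier rather than decisions in themselves.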
|
4 |
Anticipating bankruptcies among companies with abnormal credit risk behaviour: A case study adopting a GBDT model for small Swedish companies / Förutseende av konkurser bland företag med avvikande kreditrisks beteende : En fallstudie som använder en GBDT-modell för små svenska företag
Heinke, Simon January 2022
The field of bankruptcy prediction has attracted notably more interest in recent years, and Machine Learning (ML) techniques have been essential to developing more sophisticated models. Previous studies within bankruptcy prediction have not evaluated how well ML techniques adapt to data sets of companies with higher credit risks. This study introduces a binary decision rule for identifying companies with higher credit risks (abnormal companies). Two categories of abnormal companies are explored among small Swedish limited companies, based on the activity of: (1) abnormal credit risk analysis ("AC", herein) and (2) abnormal payment remarks ("AP", herein). Companies not fulfilling the abnormality criteria are considered normal ("NL", herein). The abnormal companies showed a significantly higher risk of future payment defaults than NL companies. Previous studies have mainly used financial features for bankruptcy prediction. This study evaluates the contribution of different feature categories: (1) financial, (2) qualitative, (3) performed credit risk analysis, and (4) payment remarks. Implementing a Light Gradient Boosting Machine (LightGBM), the study shows that bankruptcies are easier to anticipate among abnormal companies than among NL companies or all companies (full data set). LightGBM predicted bankruptcies with an average Area Under the Precision-Recall Curve (AUCPR) of 45.92% and 61.97% for the AC and AP data sets, respectively. This performance is 6.13 to 27.65 percentage points higher than the AUCPR achieved on the NL and full data sets. The SHapley Additive exPlanations (SHAP) values indicate that financial features are the most critical category. However, qualitative features contribute strongly to anticipating bankruptcies among the NL companies and in the full data set. The features of performed credit risk analysis and payment remarks are primarily useful for the AC and AP data sets.
Finally, the study opens two directions for further research in bankruptcy prediction: (1) evaluating whether bankruptcies among companies with other forms of credit risk can be anticipated with even higher predictive performance and (2) testing whether other qualitative features bring even better predictive performance to bankruptcy prediction. / Bankruptcy classification has experienced a notable increase in interest in recent years. In this development, machine learning models have been a key component in the move towards more sophisticated models. Previous studies have not evaluated how well machine learning models can be applied to predict bankruptcies among companies with higher credit risk. This study introduces a technique for defining companies with higher credit risk, that is, abnormal companies. Two categories of abnormal companies are introduced for small Swedish limited companies, based on the company's activity of: (1) credit risk analyses performed on the company ("AK", hereafter) and (2) payment remarks ("AM", hereafter). Companies that do not meet the criteria for being abnormal are classified as normal ("NL", hereafter). The study then evaluates how well bankruptcies can be predicted for abnormal companies relative to NL companies and all companies. Previous studies have primarily evaluated financial variables for bankruptcy prediction. This study evaluates a broader spectrum of variables: (1) financial, (2) qualitative, (3) credit risk analyses, and (4) payment remarks. By implementing LightGBM, the study finds that bankruptcies are predicted with the highest accuracy among AM companies. The model performs better for all abnormal companies than for both NL companies and the full data set. LightGBM achieves an average AUC-PR of 45.92% and 61.97% for the AK and AM data sets. These results are 6.13 to 27.65 percentage points higher than the AUC-PR achieved for NL and the full data set.
By analysing the model's variables with SHAP values, the study shows that financial variables matter most for the model's performance. Qualitative variables, however, are highly important for how well bankruptcies can be predicted for NL companies and for all companies. The variable categories indicating the company's history of performed credit risk analyses and payment remarks are primarily important for bankruptcy classification of AK and AM companies. This opens the field of bankruptcy prediction to: (1) investigating whether bankruptcies among companies with other credit risks can be predicted with higher accuracy and (2) testing whether other qualitative variables give better predictive performance for bankruptcy prediction.
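The AUCPR metric reported in this abstract can be approximated by average precision, which sums precision weighted by each increase in recall as the decision threshold is lowered. The sketch below is a plain-Python illustration with hypothetical labels and scores; the function name `average_precision` and the tie handling are assumptions, and the thesis may have computed its AUCPR differently (for example by interpolation).

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    y_true: 0/1 labels (1 = bankrupt); scores: higher means more likely bankrupt.
    Items with tied scores are treated as one threshold step.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(order):
        j = i
        # Consume every item sharing the current score before updating the curve.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if y_true[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        recall = tp / total_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap

# Hypothetical example: two bankruptcies among four companies.
labels = [1, 0, 1, 0]
model_scores = [0.9, 0.8, 0.7, 0.1]
ap_example = average_precision(labels, model_scores)  # 0.5 * 1 + 0.5 * (2/3)

# A ranking that places every bankruptcy first reaches the maximum of 1.0.
ap_perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(ap_example, ap_perfect)
```

Because a random ranking scores roughly the share of positives in the data, AUCPR values are best compared against the bankruptcy rate of each data set rather than against a fixed 50% baseline.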
|