Global ETD Search

1	Moderní predikční metody pro finanční časové řady / Modern predictive methods for financial time series Herrmann, Vojtěch January 2021 (has links) This thesis compares two approaches to modelling and predicting time series: a traditional one (the ARIMAX model) and a modern one (gradient-boosted decision trees within the framework of the XGBoost library). In the first part of the thesis we introduce the theoretical framework of supervised learning, the ARIMAX model and gradient boosting in the context of decision trees. In the second part we fit ARIMAX and XGBoost models that both predict a specific time series, the daily volume of the S&P 500 index, a crucial task in many fields. We then compare the results of the two approaches, describe the advantages of the XGBoost model that presumably lead to its better results in this specific simulation study, and show the importance of hyperparameter optimization. Afterwards, we compare the practicality of the methods, especially with regard to their computational demands. In the last part of the thesis, a theory of hybrid models is derived and algorithms for finding the optimal hybrid model are proposed. These algorithms are then applied to the prediction problem above. The optimal hybrid model combines the ARIMAX and XGBoost models and performs better than either individual model on its own.
2	[pt] MODELAGEM E SIMULAÇÃO DA POLIMERIZAÇÃO DO 1,3-BUTADIENO VIA CATALISADOR DE MÚLTIPLOS SÍTIOS / [en] MODELING AND SIMULATION OF POLYMERIZATION OF 1,3-BUTADIENE VIA MULTI-SITE CATALYST FRANCISCO RENAN LOPES FARIAS 25 January 2023 (has links) [pt] A indústria da borracha sintética tem grande importância e está presente no cotidiano da sociedade mundial. A borracha de butadieno ou polibutadieno é um dos polímeros mais utilizados neste campo, principalmente na produção de pneus. Portanto, o controle das condições operacionais e das propriedades finais do polímero formado são pontos importantes a serem estudados, pois são um desafio para a indústria. Assim, o presente trabalho tem como foco simular a polimerização em solução de polibutadieno utilizando o software Aspen Plus, onde foram utilizados 1,3-butadieno, tetracloreto de titânio, trietilalumínio e hexano como monômero, catalisador, cocatalisador e solvente, respectivamente. Nesta parte do trabalho, obtiveram-se gráficos de distribuição de massa molar que apresentaram propriedades semelhantes aos polibutadienos comerciais e alguns polibutadienos sintetizados em escala de bancada encontrados na literatura. Além disso, em uma segunda parte do trabalho, estuda-se e explica-se a técnica de distribuição instantânea e como foi gerada uma base de dados para um modelo de aprendizagem de máquina chamado de XGBoost, onde pontos dos gráficos da MMD (molar mass distribution) do polímero serviram como entrada do modelo a fim de prever as constantes cinéticas da polimerização. Ambos os estudos e simulações mostram que três e quatro sítios de catalisadores ativos são capazes de sintetizar polímeros com propriedades semelhantes aos polibutadienos comerciais e em escala de bancada. / [en] The synthetic rubber industry is of great importance and is present in the daily life of world society. Butadiene rubber or polybutadiene is one of the most used polymers in this field, mainly in the production of tires. 
Therefore, controlling the operating conditions and the final properties of the polymer formed are important points to be studied, as they are a challenge for the industry. Thus, the present work focuses on simulating the solution polymerization of polybutadiene using the Aspen Plus software, where 1,3-butadiene, titanium tetrachloride, triethylaluminum and hexane were used as monomer, catalyst, cocatalyst and solvent, respectively. In this part of the work, molar mass distribution graphs were obtained that showed properties similar to those of commercial polybutadienes and of some bench-scale polybutadienes reported in the literature. Furthermore, a second part of the work studies and explains the instantaneous distribution technique and describes how a database was generated for a machine learning model called XGBoost, where points from the MMD (molar mass distribution) graphs of the polymer served as input to the model in order to predict the kinetic constants of polymerization. Both studies and simulations show that catalysts with three and four active sites are able to synthesize polymers with properties similar to commercial and bench-scale polybutadienes. [pt] SIMULACAO [pt] POLIMERIZACAO DO POLIBUTADIENO [pt] ASPEN PLUS [pt] XGBOOST [en] SIMULATION [en] POLYMERIZATION OF POLYBUTADIENE [en] ASPEN PLUS [en] XGBOOST
3	Contributions statistiques à l'analyse de mégadonnées publiques / Statical contributions to the analysis of public big data Sainct, Benoît 12 June 2018 (has links) L'objectif de cette thèse est de proposer un ensemble d'outils méthodologiques pour répondre à deux problématiques : la prédiction de masse salariale des collectivités, et l'analyse de leurs données de fiscalité. Pour la première, les travaux s'articulent à nouveau autour de deux thèmes statistiques : la sélection de modèle de série temporelle, et l'analyse de données fonctionnelles. Du fait de la complexité des données et des fortes contraintes de temps de calcul, un rassemblement de l'information a été privilégié. Nous avons utilisé en particulier l'Analyse en Composantes Principales Fonctionnelle et un modèle de mélanges gaussiens pour faire de la classification non-supervisée des profils de rémunération. Ces méthodes ont été appliquées dans deux prototypes d'outils qui représentent l'une des réalisations de cette thèse. Pour la seconde problématique, le travail a été effectué en trois temps : d'abord, des méthodes novatrices de classification d'une variable cible ordinale ont été comparées sur des données publiques déjà analysées dans la littérature, notamment en exploitant des forêts aléatoires, des SVM et du gradient boosting. Ensuite, ces méthodes ont été adaptées à la détection d'anomalies dans un contexte ciblé, ordinal, non supervisé et non paramétrique, et leur efficacité a été principalement comparée sur des jeux de données synthétiques. C'est notre forêt aléatoire ordinale par séparation de classes qui semble présenter le meilleur résultat. Enfin, cette méthode a été appliquée sur des données réelles de bases fiscales, où les soucis de taille et de complexité des données sont plus importants. Destinée aux directions des collectivités territoriales, cette nouvelle approche de l'examen de leur base de données constitue le second aboutissement de ces travaux de thèse. 
/ The aim of this thesis is to provide a set of methodological tools to address two problems: the prediction of the payroll of local authorities, and the analysis of their tax data. For the first, the work revolves around two statistical themes: time series model selection and functional data analysis. Because of the complexity of the data and the heavy computation time constraints, a clustering approach has been favored. In particular, we used Functional Principal Component Analysis and a Gaussian mixture model to perform unsupervised classification of remuneration profiles. These methods have been applied in two prototype tools that represent one of the achievements of this thesis. For the second problem, the work was done in three stages: first, innovative methods for classifying an ordinal target variable were compared on public data already analyzed in the literature, notably using random forests, SVM and gradient boosting. Then, these methods were adapted to outlier detection in a targeted, ordinal, unsupervised and non-parametric context, and their efficiency was mainly compared on synthetic datasets. Our ordinal random forest by class separation seems to give the best result. Finally, this method has been applied to real tax base data, where the concerns of data size and complexity are greater. Aimed at the management of local authorities, this new approach to examining their database is the second outcome of this work. Détection d'anomalies Classification Forêt aléatoire SVM XGBoost Variable ordinale
4	Using XGBoost to classify the Beihang Keystroke Dynamics Database Blomqvist, Johanna January 2018 (has links) Keystroke Dynamics enable biometric security systems by collecting and analyzing computer keyboard usage data. There are different approaches to classifying keystroke data, and a method that has been gaining a lot of attention in the machine learning industry lately is the decision tree framework of XGBoost. XGBoost has won several Kaggle competitions in the last couple of years, but its capacity in the keystroke dynamics field has not yet been widely explored. Therefore, this thesis has attempted to classify the existing Beihang Keystroke Dynamics Database using XGBoost. To do this, keystroke features such as dwell time and flight time were extracted from the dataset, which contains 47 usernames and passwords. XGBoost was then applied to a binary classification problem, where the model attempts to distinguish keystroke feature sequences of genuine users from those of 'impostors'. In this way, the ratio of inaccurately and accurately labeled password inputs can be analyzed. The results showed that, after tuning of the hyperparameters, XGBoost yielded Equal Error Rates (EER) at best 0.31 percentage points better than the SVM used in the original study of the database at 11.52%, and a highest AUC of 0.9792. The scores achieved by this thesis are, however, significantly worse than many others in the same field, but so were the results in the original study. The results varied greatly depending on the user tested. These results suggest that XGBoost may be a useful tool, provided it is tuned, but that a better dataset is needed to benchmark it sufficiently. Also, the quality of the model is greatly affected by variance among the users. For future research purposes, one should make sure that the database used is of good quality.
To create a security system utilizing XGBoost, one should be careful about the setting and quality requirements when collecting training data. Keystroke XGBoost machine learning biometrics keyboard Computer Sciences Datavetenskap (datalogi)
5	Automated event prioritization for security operation center using graph-based features and deep learning Jindal, Nitika 06 April 2020 (has links) A security operation center (SOC) is a cybersecurity clearinghouse responsible for monitoring, collecting and analyzing security events from organizations’ IT infrastructure and security controls. Despite their popularity, SOCs are facing increasing challenges and pressure due to the growing volume, velocity and variety of the IT infrastructure and security data observed on a daily basis. Due to the mixed performance of current technological solutions, e.g. intrusion detection systems (IDS) and security information and event management (SIEM), there is an over-reliance on manual analysis of the events by human security analysts. This creates huge backlogs and considerably slows down the resolution of critical security events. Obvious solutions include increasing the accuracy and efficiency of crucial aspects of the SOC automation workflow, such as event classification and prioritization. In the current thesis, we present a new approach for SOC event classification and prioritization by identifying a set of new machine learning features using graph visualization and graph metrics. Using a real-world SOC dataset and by applying different machine learning classification techniques, we demonstrate empirically the benefit of using the graph-based features in terms of improved classification accuracy. Three different classification techniques are explored, namely, logistic regression, XGBoost and deep neural network (DNN). For the DNN, the best-performing classifier, the experimental evaluation shows area under the curve (AUC) values of 91% for the baseline feature set and 99% for the augmented feature set that includes the graph-based features, a net improvement of 8 percentage points in classification performance. / Graduate Logistic regression XGBoost Deep Neural Network Security operation center (SOC)
6	Investigation of Interfacial Property with Imperfection: A Machine Learning Approach Ferdousi, Sanjida 07 1900 (has links) Interfacial mechanical properties of adhesive joints are crucial in broad applications, including composites, multilayer structures, and biomedical devices. Establishing traction-separation (T-S) relations for interfacial adhesion makes it possible to evaluate mechanical and structural reliability, robustness, and failure criteria. Because interfacial adhesion acts over a short range, from the micro- to the nanoscale, accurate measurements of T-S relations remain challenging. Machine learning (ML) has emerged as a promising tool to predict material behavior and establish data-driven mechanical models. In this study, we integrated a state-of-the-art ML method, finite element analysis (FEA), and standard experiments to develop data-driven models for characterizing the interfacial mechanical properties precisely. Macroscale force-displacement curves are derived from FEA incorporating double cantilever beam tests to generate the dataset for the ML model. The eXtreme Gradient Boosting (XGBoost) multi-output regression and classifier models are used to determine T-S relations with an R² score of 98.8% and to locate imperfections at the interface with an accuracy of around 80.8%. The XGBoost models demonstrated accurate predictions and fast calculation speed, outperforming several other ML methods. Using 3D printed double cantilever beam specimens, the performance of the ML models is validated experimentally for different materials. Furthermore, an XGBoost model-based package is designed to obtain T-S relations for different adhesive materials without creating a database or training a model. Traction-separation relation XGBoost
7	A Comparative Study of Machine Learning Models for Multivariate NextG Network Traffic Prediction with SLA-based Loss Function Baykal, Asude 20 October 2023 (has links) As Next Generation (NextG) networks become more complex, the need to develop a robust, reliable network traffic prediction framework for intelligent network management increases. This study compares the performance of machine learning models in network traffic prediction using a custom Service-Level Agreement (SLA) - based loss function to ensure SLA violation constraints while minimizing overprovisioning. The proposed SLA-based parametric custom loss functions are used to maintain the SLA violation rate percentages the network operators require. Our approach is multivariate, spatiotemporal, and SLA-driven, incorporating 20 Radio Access Network (RAN) features, custom peak traffic time features, and custom mobility-based clustering to leverage spatiotemporal relationships. In this study, five machine learning models are considered: one recurrent neural network (LSTM) model, two encoder-decoder architectures (Transformer and Autoformer), and two gradient-boosted tree models (XGBoost and LightGBM). The prediction performance of the models is evaluated based on different metrics such as SLA violation rate constraints, overprovisioning, and the custom SLA-based loss function parameter. According to our evaluations, Transformer models with custom peak time features achieve the minimum overprovisioning volume at 3% SLA violation constraint. Gradient-boosted tree models have lower overprovisioning volumes at higher SLA violation rates. / Master of Science / As the Next Generation (NextG) networks become more complex, the need to develop a robust, reliable network traffic prediction framework for intelligent network management increases. This study compares the performance of machine learning models in network traffic prediction using a custom loss function to ensure SLA violation constraints. 
The proposed SLA-based custom loss functions are used to maintain the SLA violation rate percentages required by the network operators while minimizing overprovisioning. Our approach is multivariate, spatiotemporal, and SLA-driven, incorporating 20 Radio Access Network (RAN) features, custom peak traffic time features, and mobility-based clustering to leverage spatiotemporal relationships. We use five machine learning and deep learning models for our comparative study: one recurrent neural network (RNN) model, two encoder-decoder architectures, and two gradient-boosted tree models. The prediction performance of the models was evaluated based on different metrics such as SLA violation rate constraints, overprovisioning, and the custom SLA-based loss function parameter. Cellular traffic prediction 5G and beyond LSTM Transformer Autoformer XGBoost LightGBM
8	Enhancing Student Graduation Rates by Mitigating Failure, Dropout, and Withdrawal in Introduction to Statistical Courses Using Statistical and Machine Learning Abbaspour Tazehkand, Shahabeddin 01 January 2024 (has links) (PDF) The elevated rates of failure, dropout, and withdrawal (FDW) in introductory statistics courses pose a significant barrier to students' timely graduation from college. Identifying actionable strategies to support instructors in facilitating student success by reducing FDW rates is paramount. This thesis undertakes a comprehensive approach, leveraging various machine learning algorithms to address this pressing issue. Drawing from three years of data from an introductory statistics course at one of the largest universities in the USA, this study examines the problem in depth. Numerous predictive classification models have been developed, showcasing the efficacy of machine learning techniques in this context. Actionable insights gleaned from these statistical and machine learning models have been consolidated, offering valuable guidance for instructors. Moreover, the complete analytical framework, encompassing data identification, integration, feature engineering, model development, and report generation, is meticulously outlined. By sharing this methodology, the aim is to empower researchers in the field to extend these approaches to similarly critical courses, fostering a more supportive learning environment. Ultimately, this endeavor seeks to enhance student retention and success, thereby contributing to the broader goal of promoting timely graduation from college. Identifying At-Risk Students Machine Learning XGBoost Statistics Undergraduate
9	Probability of Default Term Structure Modeling : A Comparison Between Machine Learning and Markov Chains Englund, Hugo, Mostberg, Viktor January 2022 (has links) In recent years, numerous so-called Buy Now, Pay Later companies have emerged, a type of financial institution offering short-term consumer credit contracts. As these institutions have gained popularity, the credit risk they undertake has increased vastly. At the same time, the IFRS 9 regulatory requirements must be complied with. Specifically, the Probability of Default (PD) for the entire lifetime of such a contract must be estimated. The collection of incremental PDs over the entire course of the contract is called the PD term structure. Accurate estimates of the PD term structures are desirable since they aid in steering business decisions based on a given risk appetite, while staying compliant with current regulations. In this thesis, the efficiency of Machine Learning within PD term structure modeling is examined. Two categories of Machine Learning algorithms, in five variations each, are evaluated: (1) Deep Neural Networks; and (2) Gradient Boosted Trees. The Machine Learning models are benchmarked against a traditional Markov Chain model. The performance of the models is measured by a set of calibration and discrimination metrics, evaluated at each time point of the contract as well as aggregated over the entire time horizon. The results show that Machine Learning can be used efficiently within PD term structure modeling. The Deep Neural Networks outperform the Markov Chain model in all performance metrics, whereas the Gradient Boosted Trees are better in all except one metric. For short-term predictions, the Machine Learning models barely outperform the Markov Chain model. For long-term predictions, however, the Machine Learning models are superior. / Flertalet s.k. Köp nu, betala senare-företag har växt fram under de senaste åren. 
En sorts finansiell institution som erbjuder kortsiktiga konsumentkreditskontrakt. I samband med att dessa företag har blivit alltmer populära, har deras åtagna kreditrisk ökat drastiskt. Samtidigt måste de regulatoriska kraven ställda av IFRS 9 efterlevas. Specifikt måste fallisemangsrisken för hela livslängden av ett sådant kontrakt estimeras. Samlingen av inkrementell fallisemangsrisk under hela kontraktets förlopp kallas fallisemangsriskens terminsstruktur. Precisa estimat av fallisemangsriskens terminsstruktur är önskvärda eftersom de understödjer verksamhetsbeslut baserat på en given riskaptit, samtidigt som de nuvarande regulatoriska kraven efterlevs. I denna uppsats undersöks effektiviteten av Maskininlärning för modellering av fallisemangsriskens terminsstruktur. Två kategorier av Maskinlärningsalgoritmer, i fem variationer vardera, utvärderas; (1) Djupa neuronnät; och (2) Gradient boosted trees. Maskininlärningsmodellerna jämförs mot en traditionell Markovkedjemodell. Modellernas prestanda mäts via en uppsättning kalibrerings- och diskrimineringsmått, utvärderade i varje tidssteg av kontraktet samt aggregerade över hela tidshorisonten. Resultaten visar att Maskininlärning är effektivt för modellering av fallisemangsriskens terminsstruktur. De djupa neuronnäten överträffar Markovkedjemodellen i samtliga prestandamått, medan Gradient boosted trees är bättre i alla utom ett mått. För kortsiktiga prediktioner är Maskininlärningsmodellerna knappt bättre än Markovkedjemodellen. För långsiktiga prediktioner, däremot, är Maskininlärningsmodellerna överlägsna. Machine Learning Deep Neural Networks XGBoost Probability of Default Term Structure Modeling IFRS 9 Maskininlärning Djupa neuronnät XGBoost Fallisemangsrisk Terminsstruktursmodellering IFRS 9 Mathematics Matematik
10	Improving Visibility Forecasts in Denmark Using Machine Learning Post-processing / Förbättring av siktprognoser i Danmark med hjälp av maskininlärning Thomasson, August January 2023 (has links) Accurate fog prediction is an important task facing forecast centers since low visibility can affect anthropogenic systems, such as aviation. Therefore, this study investigates the use of Machine Learning classification algorithms for post-processing the output of the Danish Meteorological Institute’s operational Numerical Weather Prediction (NWP) model to improve visibility prediction. Two decision tree ensemble methods, XGBoost and Random Forest, were trained on more than 4 years of archived forecast data and visibility observations from 28 locations in Denmark. Observations were classified into eight classes, while models were optimized with resampling and Bayesian optimization. On an independent 15-month period, the Machine Learning methods show an improvement in balanced accuracy, F1-score, and Extremal Dependency Index compared to the NWP and persistence models. XGBoost performs slightly better. However, both methods suffer from an increase in overprediction of the low visibility classes. The models are also discussed regarding usability, coping with model changes and preservation of spatial features. Finally, the study shows how the interpretation of the post-processing models may be included operationally. Future research recommendations include incorporating more variables, using alternative class imbalance methods and further analyzing the models’ implementation and usage. Overall, the study demonstrates the potential of these models to improve visibility point forecasts in an operational setting. / Limited visibility can affect societies and nature in various ways. For example, fog can disrupt both air and road traffic. It is therefore important to be able to predict visibility. 
Since traditional forecasting methods, such as numerical weather models, are not always reliable for this purpose, it is important to explore alternative methods. This study investigates the use of machine learning to improve numerical forecasts of visibility. Two different machine learning algorithms were used to post-process the Danish Meteorological Institute's numerical weather model, and they were trained on visibility observations from 28 different locations. The results show that the machine learning methods improve on the numerical weather model by 10-30%, depending on the measure used. However, the algorithms have a slight tendency to predict low visibility too often, and both perform better at coastal locations. The better-performing of the two algorithms manages to identify the meteorological conditions expected in connection with low visibility. In addition, a method for improving the understanding of the post-processed models is presented. However, challenges remain in implementing the method operationally. It is therefore suggested that future studies investigate, among other things, whether the algorithms perform better with more weather parameters and how they perform at new locations, and analyze in more depth how they handle updates to the numerical weather model. In summary, the study shows that machine learning is a promising tool for improving numerical forecasts of visibility. visibility forecast fog machine learning numerical weather prediction XGBoost Random Forest siktprognos dimma maskininlärning numerisk vädermodell XGBoost Random Forest Meteorology and Atmospheric Sciences Meteorologi och atmosfärforskning
