  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Evolutionary algorithms in statistical learning : Automating the optimization procedure / Evolutionära algoritmer i statistisk inlärning : Automatisering av optimeringsprocessen

Sjöblom, Niklas January 2019
Scania has been working with statistics for a long time but has more recently invested in becoming a data-driven company, and now uses data science in almost all business functions. The algorithms developed by its data scientists need to be optimized to be fully utilized, and traditionally this is a manual and time-consuming process. This thesis investigates whether, and how well, evolutionary algorithms can be used to automate the optimization process. The evaluation was done by implementing and analyzing four variants of genetic algorithms with different levels of complexity and tuning parameters. The algorithm subject to optimization was XGBoost, a gradient-boosted tree model, applied to data that had previously been modelled in a competition. The results show that evolutionary algorithms are applicable to finding good models, but they also emphasize the importance of proper data preparation.
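The genetic-algorithm tuning described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the cross-validated XGBoost error on Scania's data is replaced by a stand-in loss function with a known optimum, and the parameter names, ranges, and operator choices (truncation selection, uniform crossover, Gaussian mutation) are illustrative assumptions.

```python
import random

def genetic_search(objective, bounds, pop_size=20, generations=30, seed=0):
    """Minimal genetic algorithm: truncation selection, uniform crossover,
    Gaussian mutation of one parameter. `bounds` maps name -> (low, high)."""
    rng = random.Random(seed)
    names = sorted(bounds)

    def random_individual():
        return {n: rng.uniform(*bounds[n]) for n in names}

    def crossover(a, b):
        # Each parameter is inherited from one of the two parents.
        return {n: a[n] if rng.random() < 0.5 else b[n] for n in names}

    def mutate(ind):
        child = dict(ind)
        n = rng.choice(names)
        low, high = bounds[n]
        child[n] = min(high, max(low, child[n] + rng.gauss(0, 0.1 * (high - low))))
        return child

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=objective)[: pop_size // 2]  # lower loss is better
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children  # elitism: best parents survive unchanged
    return min(pop, key=objective)

# Stand-in for a cross-validated model loss: a smooth bowl whose
# "best hyperparameters" sit at learning_rate=0.1, max_depth=6.
def loss(params):
    return (params["learning_rate"] - 0.1) ** 2 + ((params["max_depth"] - 6) / 10) ** 2

best = genetic_search(loss, {"learning_rate": (0.01, 0.5), "max_depth": (1, 12)})
```

In a real run the `loss` callback would train and cross-validate the model for each candidate parameter set, which is why the thesis treats computational cost as part of the comparison.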
2

Fog and fog deposition: A novel approach to estimate the occurrence of fog and the amount of fog deposition: a case study for Germany

Körner, Philipp 07 December 2021
This thesis is written as a cumulative dissertation. It presents methods and results that contribute to an improved understanding of the spatio-temporal variability of fog and fog deposition. The questions to be answered are: when and where does fog occur, how much of it is there, and how much fog water is deposited on the vegetation as fog precipitation? Freely available data sets serve as the database. The meteorological input data are obtained from the Climate Data Center (CDC) of the German Meteorological Service (DWD): station data for temperature, relative humidity and wind speed in hourly resolution, with visibility data used for validation. Furthermore, Global Forest Heights (GFH) data from the National Aeronautics and Space Administration (NASA) are used as vegetation height data, and data from NASA's Shuttle Radar Topography Mission (SRTM) serve as a digital elevation model.

The first publication deals with gap filling and data compression for further calculations. This is necessary since the station density for hourly data is relatively low, especially before the 2000s; in addition, gaps are more frequent in hourly data than in, for instance, daily data, and can thus be filled. It is shown that gradient boosting (gb) enables high-quality gap filling in a short computing time. The second publication deals with the determination of fog, in particular the liquid water content (lwc). The focus here is on correcting measurement errors in relative humidity, and methods of spatial interpolation are also treated. The resulting lwc data for Germany, with a temporal resolution of one hour and a spatial resolution of one kilometre, are validated against measured lwc data as well as visibility data from the DWD. The last publication builds on the data and methods of the two previous ones: the vegetation and wind speed data are used together with the lwc data to determine fog precipitation. This is validated using data from other publications and water balance calculations. In addition to measured precipitation, the fog precipitation data are used as an input variable for modelling. This is also one of the possible applications: determining precipitation from fog, which is not recorded by standard measuring methods, and thus making water balance modelling more realistic.

Contents: 1 Motivation; 2 Problem definition and target setting; 3 Structure; 4 Model limits; 5 Publications; 6 Outlook
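The gap-filling step in the first publication can be illustrated with a much simpler stand-in. The thesis uses gradient boosting over many predictor stations; the sketch below fits an ordinary least-squares line to a single neighbouring station on the hours where both report, then predicts the missing hours. All series and values are invented for illustration.

```python
def fill_gaps(target, neighbour):
    """Fill None entries in `target` (an hourly series) from a parallel
    `neighbour` series, using a least-squares line fitted on the hours
    where both stations report. A simplified stand-in for the
    gradient-boosting gap filling described above."""
    pairs = [(n, t) for t, n in zip(target, neighbour)
             if t is not None and n is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    intercept = my - slope * mx
    return [t if t is not None else intercept + slope * nb
            for t, nb in zip(target, neighbour)]

# Hourly temperatures with two missing hours; the neighbour runs ~1 degree lower.
station   = [10.0, 11.0, None, 13.0, None, 15.0]
neighbour = [ 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
filled = fill_gaps(station, neighbour)  # → [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
```

Gradient boosting replaces the single fitted line with an ensemble of shallow trees over many stations and covariates, which is what makes the high-quality hourly gap filling feasible.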
3

Breeding white storks in former East Prussia : comparing predicted relative occurrences across scales and time using a stochastic gradient boosting method (TreeNet), GIS and public data

Wickert, Claudia January 2007
Different habitat models were created for the White Stork (Ciconia ciconia) in the region of the former German province of East Prussia (approximately the current Russian oblast Kaliningrad and the Polish voivodeship Warmia-Masuria). Several historical data sets describing the occurrence of the White Stork in the 1930s, as well as selected variables describing landscape and habitat, were employed. The processing and modelling of the data sets was done with a geographical information system (ArcGIS) and a statistical modelling approach from the disciplines of machine learning and data mining (TreeNet by Salford Systems Ltd.). Using the historical habitat descriptors and the occurrence data, models were created on two scales: (i) a point-scale model on a raster with a cell size of 1 km² and (ii) an administrative-district-scale model based on the division of the former province into its districts. Evaluation of the models shows that the occurrence of White Stork nesting grounds in former East Prussia is largely determined by the variables 'forest', 'settlement area', 'pasture land' and 'proximity to coastline'. From this it can be assumed that a good food supply, such as the White Stork finds in pastures and meadows, together with proximity to human settlements, is decisive for its choice of nesting sites in East Prussia, while dense forest areas appear unsuitable as nesting grounds. The strong influence of the variable 'coastline' is most likely explained by the pronounced landscape structuring of East Prussia parallel to the coastline, and should be seen as a proximate factor for the distribution of breeding White Storks.

In a second step, the models on both scales were used to make predictions for the period 1981 to 1993. On the point scale, a decline in potential nesting habitat was predicted; in contrast, the predicted White Stork density increases under the administrative-district-scale model. The difference between the two predictions presumably stems from the use of different scales (density versus suitability as breeding ground) and partly different explanatory variables; further studies are needed to clarify this. The model predictions for 1981 to 1993 were also compared descriptively with the censuses available for that period. The predicted numbers are higher than those established by the censuses, so the models describe rather the capacity of the habitat (the potential niche). Other factors that determine population size, such as breeding success or mortality, should be included in future investigations. A feasible approach was thus demonstrated for building valuable habitat models from historical data with the methods presented here, and for assessing the effects of land-use change on the White Stork. The models are a first step and can be refined with further data on habitat structure and more exact, spatially explicit information on nesting sites. In a further step, a habitat model for the present day should also be created. This would allow a better comparison of the effects of changes in land use and relevant environmental conditions on the White Stork in the region of former East Prussia and across its entire range, for instance in the light of coming landscape changes brought by the European Union (EU).
4

Avaliação do algoritmo Gradient Boosting em aplicações de previsão de carga elétrica a curto prazo

Mayrink, Victor Teixeira de Melo 31 August 2016
Funding: FAPEMIG - Fundação de Amparo à Pesquisa do Estado de Minas Gerais

The storage of electrical energy is still not feasible on a large scale due to technical and economic constraints. Therefore, all energy to be consumed must be produced instantly; it is not possible to store surplus production, or to cover supply shortages with safety stocks, even for a short period of time. Thus, one of the main challenges of energy planning is computing accurate forecasts of future demand. This work presents a model for short-term load forecasting. The methodology consists of composing a prediction committee by applying the Gradient Boosting algorithm in combination with decision tree models and the exponential smoothing technique. This strategy comprises a supervised learning method that fits the forecasting model to historical energy consumption data, recorded temperatures and calendar variables. The proposed models were tested on two different datasets and showed good performance when compared with results published in other recent works.
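The exponential smoothing component of the committee above is simple enough to show directly. This is the textbook simple-exponential-smoothing recurrence, not the thesis's code, and the load values are invented for illustration.

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the current observation (weight alpha) and the previous
    smoothed value (weight 1 - alpha)."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hourly load in MW (illustrative values).
load = [100.0, 120.0, 110.0, 130.0]
smoothed_load = exponential_smoothing(load, alpha=0.5)  # → [100.0, 110.0, 110.0, 120.0]
```

In a forecasting committee, the last smoothed value serves as a naive one-step-ahead forecast that the gradient-boosted trees can combine with temperature and calendar features.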
5

Strojové učení v algoritmickém obchodování / Machine Learning in Algorithmic Trading

Bureš, Michal January 2021
This thesis is dedicated to the application of machine learning methods to algorithmic trading. We take inspiration from intraday traders and implement a system that predicts future price based on candlestick patterns and technical indicators. Using forex and US stocks tick data, we create multiple aggregated bar representations. From these bars we construct original features based on candlestick-pattern clustering by K-Means, and long-term features derived from standard technical indicators. We then set up regression and classification tasks for Extreme Gradient Boosting models and extract buy and sell trading signals from their predictions. We perform experiments with eight different configurations over multiple assets and trading strategies using walk-forward validation. The results report Sharpe ratios and mean profits for all combinations. We discuss the results and recommend suitable configurations; overall, our strategies outperform randomly selected strategies. Furthermore, we identify and discuss several opportunities for further research.
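The walk-forward validation mentioned above keeps the temporal order of the data: the model is always evaluated on a window that starts after its training window ends, and the windows then slide forward. One common formulation is sketched below; the thesis's exact window sizes are not given, so the numbers here are illustrative.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that slide forward in
    time, so each test window lies strictly after its training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # slide forward by one test window

# 10 bars, train on 4, evaluate on the next 2, then slide.
splits = list(walk_forward_splits(10, train_size=4, test_size=2))
# → [([0,1,2,3],[4,5]), ([2,3,4,5],[6,7]), ([4,5,6,7],[8,9])]
```

This is the time-series analogue of cross-validation: ordinary shuffled folds would leak future bars into training, which inflates backtest performance.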
6

Sentimentanalys av svenskt aktieforum för att förutspå aktierörelse / Sentiment analysis of Swedish stock trading forum for predicting stock market movement

Ouadria, Michel Sebastian, Ciobanu, Ann-Stephanie January 2020
This study examines the possibility of predicting daily stock movement with sentiment analysis of posts in a Swedish stock trading forum. Sentiment analysis is used to extract subjectivity in the form of emotions (sentiment) from text. Text data was extracted from the forum to predict the movement of the related share, with all data aggregated over a fixed period of two years. The analysis uses machine learning to train three models on the text data and stock data. The results showed no clear correlation between sentiment and stock movement, and did not replicate the accuracy reported by previous work in the field. The highest accuracy achieved with the models was 64%.
7

Modelling default probabilities: The classical vs. machine learning approach / Modellering av fallissemang: Klassisk metod vs. maskininlärning

Jovanovic, Filip, Singh, Paul January 2020
Fintech companies that offer Buy Now, Pay Later products depend heavily on accurate default probability models, since they bear the risk of customers not fulfilling their obligations. To minimize the losses incurred when customers default, several machine learning algorithms can be applied, but in an era in which machine learning is gaining popularity there is a vast number of algorithms to choose from. This thesis addresses this issue by applying three fundamentally different machine learning algorithms and comparing them on a selection of metrics such as ROC AUC and precision-recall AUC. The algorithms compared are Logistic Regression, Random Forest and CatBoost, all benchmarked against Klarna's current XGBoost model. The results indicate that the CatBoost model is optimal according to the main metric of comparison, the ROC AUC score: it outperformed the Logistic Regression model by seven percentage points, the Random Forest model by three percentage points and the XGBoost model by one percentage point.
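The ROC AUC used as the main metric above has a direct probabilistic reading: it is the probability that a randomly chosen defaulter receives a higher risk score than a randomly chosen non-defaulter. That definition can be computed literally, as a small sanity-check implementation (the labels and scores below are invented; production comparisons would use a library routine):

```python
def roc_auc(labels, scores):
    """ROC AUC computed as the probability that a randomly chosen
    positive (label 1) is scored above a randomly chosen negative
    (label 0); ties count one half. O(n^2), fine for a sanity check."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two defaulters (1) and two non-defaulters (0) with model risk scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```

A one-percentage-point gap, as between CatBoost and XGBoost here, means roughly one extra correctly ordered defaulter/non-defaulter pair per hundred.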
8

Customer acquisition and onboarding at an online grocery company

Borg, Ida January 2022
This master thesis was carried out in collaboration with a Swedish online grocery company. Its goal is to investigate whether it is possible to explain the underlying factors that lead new customers to be retained. Because of the difficulty of defining churn and retention in non-contractual settings, most of the literature focuses on contractual and subscription settings; there are few studies attempting to predict customer churn in non-contractual businesses, and even fewer that emphasize retention. This thesis aims to contribute to the field of retention in non-contractual business and to highlight the assumptions and drawbacks of churn-related tasks. To achieve this goal, a literature review was carried out together with two statistical learning approaches: a logistic regression model and an extreme gradient boosting model. The results show that it is possible to find the underlying factors that drive customers to be retained. The strongest drivers that increase the probability of retaining new customers are the number of days between the first and second order, the value of the second order, and the total order value.
9

Predicting the area of industry : Using machine learning to classify SNI codes based on business descriptions, a degree project at SCB / Att prediktera näringsgrensindelning : Ett examensarbete om tillämpningavmaskininlärning för att klassificeraSNI-koder utifrån företagsbeskrivningarhos SCB

Dahlqvist-Sjöberg, Philip, Strandlund, Robin January 2019
This study is part of an experimental project at Statistics Sweden which aims to use natural language processing and machine learning to predict Swedish businesses' area-of-industry codes from their business descriptions. The response to predict consists of the 30 most frequent of the 88 main groups of Swedish standard industrial classification (SNI) codes, each representing a unique area of industry. The transformation from business description text to numerical features was done with the bag-of-words model. SNI codes are set when companies are founded, and due to the human factor errors can occur; today these errors are corrected manually. Using data from the Swedish Companies Registration Office, the purpose is to determine whether gradient boosting can provide high enough classification accuracy to automatically correct SNI codes that differ from the actual response. The best gradient boosting model correctly classified 52 percent of the observations, which is not considered high enough to implement automatic code correction in a production environment.
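The bag-of-words transformation mentioned above maps each business description to a vector of word counts over a shared vocabulary. A minimal sketch, with invented English descriptions standing in for the Swedish originals:

```python
from collections import Counter

def bag_of_words(descriptions):
    """Turn free-text descriptions into count vectors over a shared,
    sorted vocabulary -- the bag-of-words step described above. The
    resulting matrix would feed a gradient-boosting classifier."""
    tokenised = [d.lower().split() for d in descriptions]
    vocab = sorted({w for toks in tokenised for w in toks})
    matrix = []
    for toks in tokenised:
        counts = Counter(toks)
        matrix.append([counts[w] for w in vocab])  # Counter returns 0 for absent words
    return vocab, matrix

docs = ["bakery and cafe", "software consulting", "cafe and bakery services"]
vocab, X = bag_of_words(docs)
# vocab → ['and', 'bakery', 'cafe', 'consulting', 'services', 'software']
# X[0]  → [1, 1, 1, 0, 0, 0]
```

Word order is discarded, which is the model's defining simplification: "bakery and cafe" and "cafe and bakery" produce identical vectors.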
10

Srovnání heuristických a konvenčních statistických metod v data miningu / Comparison of Heuristic and Conventional Statistical Methods in Data Mining

Bitara, Matúš January 2019
The thesis deals with the comparison of conventional and heuristic data mining methods for binary classification. In the theoretical part, four different models are described and their classification behaviour is demonstrated on simple examples. In the practical part, the models are compared on real data; this part also covers data cleaning, outlier removal, two different transformations and dimensionality reduction. The last part describes the methods used to assess model quality.
