151

[en] NON-PARAMETRIC ESTIMATIONS OF INTEREST RATE CURVES: MODEL SELECTION CRITERION, PERFORMANCE DETERMINANT FACTORS AND BID-ASK SPREAD / [pt] ESTIMAÇÕES NÃO PARAMÉTRICAS DE CURVAS DE JUROS: CRITÉRIO DE SELEÇÃO DE MODELO, FATORES DETERMINANTES DE DESEMPENHO E BID-ASK SPREAD

ANDRE MONTEIRO DALMEIDA MONTEIRO 11 June 2002 (has links)
[pt] Esta tese investiga a estimação de curvas de juros sob o ponto de vista de métodos não-paramétricos. O texto está dividido em dois blocos. O primeiro investiga a questão do critério utilizado para selecionar o método de melhor desempenho na tarefa de interpolar a curva de juros brasileira em uma dada amostra. Foi proposto um critério de seleção de método baseado em estratégias de re-amostragem do tipo leave-k-out cross validation, onde 1 ≤ k ≤ K e K é função do número de contratos observados a cada curva da amostra. Especificidades do problema reduzem o esforço computacional requerido, tornando o critério factível. A amostra tem freqüência diária: janeiro de 1997 a fevereiro de 2001. O critério proposto apontou o spline cúbico natural - utilizado como método de ajuste perfeito aos dados - como o método de melhor desempenho. Considerando a precisão de negociação, este spline mostrou-se não viesado. A análise quantitativa de seu desempenho identificou, contudo, heterocedasticidades nos erros simulados. A partir da especificação da variância condicional destes erros e de algumas hipóteses, foi proposto um esquema de intervalo de segurança para a estimação de taxas de juros pelo spline cúbico natural, empregado como método de ajuste perfeito aos dados. O backtest sugere que o esquema proposto é consistente, acomodando bem as hipóteses e aproximações envolvidas. O segundo bloco investiga a estimação da curva de juros norte-americana construída a partir dos contratos de swaps de taxas de juros dólar-Libor pela Máquina de Vetores Suporte (MVS), parte do corpo da Teoria do Aprendizado Estatístico. A pesquisa em MVS tem obtido importantes avanços teóricos, embora ainda sejam escassas as implementações em problemas reais de regressão. A MVS possui características atrativas para a modelagem de curva de juros: é capaz de introduzir já na estimação informações a priori sobre o formato da curva e sobre aspectos da formação das taxas e liquidez de cada um dos contratos a partir dos quais ela é construída. Estas últimas são quantificadas pelo bid-ask spread (BAS) de cada contrato. A formulação básica da MVS é alterada para assimilar diferentes valores do BAS sem que as propriedades dela sejam perdidas. É dada especial atenção ao levantamento de informação a priori para seleção dos parâmetros da MVS a partir do formato típico da curva. A amostra tem freqüência diária: março de 1997 a abril de 2001. Os desempenhos fora da amostra de diversas especificações da MVS foram confrontados com aqueles de outros métodos de estimação. A MVS foi o método que melhor controlou o trade-off entre viés e variância dos erros. / [en] This thesis investigates interest rate curve estimation from a non-parametric standpoint. The text is divided into two parts. The first one focuses on the criterion used to select the best-performing method for interpolating the Brazilian interest rate curve in a given sample. A selection criterion is proposed that measures out-of-sample performance through leave-k-out cross-validation resampling strategies applied to all curves in the sample, where 1 ≤ k ≤ K and K is a function of the number of contracts observed in each curve. Particularities of the problem substantially reduce the required computational effort, making the proposed criterion feasible. The sample is daily, from January 1997 to February 2001. The proposed criterion selected the natural cubic spline, used as a perfect-fitting estimation method. Considering the precision at which rates are traded, this spline proved unbiased.
However, a quantitative analysis of the performance determinant factors revealed heteroskedasticity in the out-of-sample errors. From a conditional variance specification of these errors, a security-interval scheme is proposed for interest rates estimated by the perfect-fitting natural cubic spline. A backtest showed that the proposed scheme is consistent, accommodating the assumptions and approximations involved. The second part estimates the US fixed-for-floating (dollar-Libor) interest rate swap curve using the Support Vector Machine (SVM), a method derived from Statistical Learning Theory. SVM research has achieved important theoretical results, but implementations on real regression problems are still scarce. The SVM has attractive characteristics for interest rate curve modeling: it can incorporate, directly in the estimation process, a priori information about the shape of the curve and about liquidity and price-formation aspects of the contracts that generate the curve. The latter information is quantified by the bid-ask spread of each contract. The basic SVM formulation is modified to incorporate different bid-ask spread values without losing its properties. Special attention is given to how a priori information can be extracted from the typical shape of the swap curve for use in SVM parameter selection. The sample is daily, from March 1997 to April 2001. The out-of-sample performance of several SVM specifications is compared with that of other estimation methods. The SVM achieved the best control of the trade-off between bias and variance of the out-of-sample errors.
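The leave-k-out selection criterion described above can be sketched in a few lines. The snippet below is an illustration only (the thesis's code is not published): it holds out interior contracts of one synthetic yield curve, refits a natural cubic spline to the remaining points, and averages the squared interpolation errors. Maturities, rates and the error measure are assumptions made for the example.

```python
# Minimal sketch, not the author's implementation: leave-k-out cross-validation
# for judging a yield-curve interpolation method on a single curve.
import numpy as np
from itertools import combinations
from scipy.interpolate import CubicSpline

def leave_k_out_error(maturities, rates, k=1):
    """Average squared error when k interior points are held out and
    re-predicted by a natural cubic spline fitted to the remaining points."""
    n = len(maturities)
    interior = range(1, n - 1)          # keep the end points so the spline stays defined
    errors = []
    for held_out in combinations(interior, k):
        keep = [i for i in range(n) if i not in held_out]
        spline = CubicSpline(maturities[keep], rates[keep], bc_type="natural")
        pred = spline(maturities[list(held_out)])
        errors.append(np.mean((pred - rates[list(held_out)]) ** 2))
    return float(np.mean(errors))

# one synthetic daily curve: maturities in years, rates in % p.a.
maturities = np.array([0.08, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0])
rates = np.array([19.0, 19.3, 19.8, 20.5, 21.2, 21.6, 22.0])
print(leave_k_out_error(maturities, rates, k=1))
```

In the thesis the criterion aggregates such errors over every curve in the sample and over k up to a curve-dependent K; the sketch shows only the inner step for one curve.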
152

Klientų duomenų valdymas bankininkystėje / Client data management in banking

Žiupsnys, Giedrius 09 July 2011 (has links)
Darbas apima banko klientų kredito istorinių duomenų dėsningumų tyrimą. Pirmiausia nagrinėjamos banko duomenų saugyklos, siekiant kuo geriau perprasti bankinius duomenis. Vėliau naudojant banko duomenų imtis, kurios apima kreditų grąžinimo istoriją, siekiama įvertinti klientų nemokumo riziką. Tai atliekama adaptuojant algoritmus bei programinę įrangą duomenų tyrimui, kuris pradedamas nuo informacijos apdorojimo ir paruošimo. Paskui pritaikant įvairius klasifikavimo algoritmus, sudarinėjami modeliai, kuriais siekiama kuo tiksliau suskirstyti turimus duomenis, nustatant nemokius klientus. Taip pat siekiant įvertinti kliento vėluojamų mokėti paskolą dienų skaičių pasitelkiami regresijos algoritmai bei sudarinėjami prognozės modeliai. Taigi darbo metu atlikus numatytus tyrimus, pateikiami duomenų vitrinų modeliai, informacijos srautų schema. Taip pat nurodomi klasifikavimo ir prognozavimo modeliai bei algoritmai, geriausiai įvertinantys duotas duomenų imtis. / This work analyses regularities in bank clients' historical credit data. First, the bank's data repositories are examined in order to understand the banking data. Then, using bank data sets that describe credit repayment history, the clients' insolvency risk is estimated. Data mining algorithms and software are adapted for this purpose, starting with information preprocessing and preparation. Various classification algorithms are then applied to build models that classify the data sets and identify insolvent clients as accurately as possible. Besides classification, regression algorithms are analysed and prediction models are created; these models estimate how many days a client will be late in repaying a loan. Based on the research carried out, data mart models and an information flow schema are presented, together with the classification and prediction models and algorithms that best fit the given data sets.
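As a rough illustration of the two modelling tasks described above (classifying insolvent clients and predicting days of delay), the sketch below uses synthetic features and generic scikit-learn models; the bank's actual attributes, algorithms and software are not public and are not reproduced here.

```python
# Illustrative sketch only: synthetic features stand in for the bank's
# credit-history attributes.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                       # e.g. income, debt ratio, past arrears, ...
insolvent = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 1.0).astype(int)
days_late = np.clip(30 * (X[:, 0] + rng.normal(size=500)), 0, None)

# classification: flag likely insolvent clients
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("accuracy:", cross_val_score(clf, X, insolvent, cv=5).mean())

# regression: estimate how many days a payment will be late
reg = RandomForestRegressor(n_estimators=200, random_state=0)
print("R^2:", cross_val_score(reg, X, days_late, cv=5).mean())
```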
153

Validation croisée et pénalisation pour l'estimation de densité / Cross-validation and penalization for density estimation

Magalhães, Nelo 26 May 2015 (has links)
Cette thèse s'inscrit dans le cadre de l'estimation d'une densité, considéré du point de vue non-paramétrique et non-asymptotique. Elle traite du problème de la sélection d'une méthode d'estimation à noyau. Celui-ci est une généralisation, entre autres, du problème de la sélection de modèle et de la sélection d'une fenêtre. Nous étudions des procédures classiques, par pénalisation et par rééchantillonnage (en particulier la validation croisée V-fold), qui évaluent la qualité d'une méthode en estimant son risque. Nous proposons, grâce à des inégalités de concentration, une méthode pour calibrer la pénalité de façon optimale pour sélectionner un estimateur linéaire et prouvons des inégalités d'oracle et des propriétés d'adaptation pour ces procédures. De plus, une nouvelle procédure rééchantillonnée, reposant sur la comparaison entre estimateurs par des tests robustes, est proposée comme alternative aux procédures basées sur le principe d'estimation sans biais du risque. Un second objectif est la comparaison de toutes ces procédures du point de vue théorique et l'analyse du rôle du paramètre V pour les pénalités V-fold. Nous validons les résultats théoriques par des études de simulations. / This thesis takes place in the density estimation setting, from a nonparametric and nonasymptotic point of view. It concerns the statistical algorithm selection problem, which generalizes, among others, the problems of model selection and bandwidth selection. We study classical procedures, such as penalization or resampling procedures (in particular V-fold cross-validation), which evaluate an algorithm by estimating its risk. Thanks to concentration inequalities, we provide an optimal penalty for selecting a linear estimator, and we prove oracle inequalities and adaptivity properties for resampling procedures. Moreover, a new resampling procedure, based on comparing estimators by means of robust tests, is introduced as an alternative to procedures relying on the unbiased risk estimation principle. A second goal of this work is to compare these procedures from a theoretical point of view and to understand the role of V in V-fold penalization. We validate the theoretical results with simulation studies.
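A minimal sketch of the V-fold cross-validation step discussed above, applied to bandwidth selection for a Gaussian kernel density estimator; scikit-learn's held-out log-likelihood score is used here as a simple stand-in for the risk estimators analysed in the thesis.

```python
# Sketch, assuming a Gaussian kernel: V-fold cross-validation to choose the
# bandwidth of a kernel density estimator.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])[:, None]

V = 5  # number of folds
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1.5, 0.5, 30)},
                    cv=V)                      # scores each candidate on held-out folds
grid.fit(x)
print("selected bandwidth:", grid.best_params_["bandwidth"])
```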
154

Improving Efficiency of Prevention in Telemedicine / Zlepšování učinnosti prevence v telemedicíně

Nálevka, Petr January 2010 (has links)
This thesis employs data-mining techniques and modern information and communication technology to develop methods that may improve the efficiency of prevention-oriented telemedical programs. In particular, it uses the ITAREPS program as a case study and demonstrates that an extension of the program based on the proposed methods may significantly improve the program's efficiency. ITAREPS itself is a state-of-the-art telemedical program operating since 2006. It has been deployed in 8 different countries around the world, and in the Czech Republic alone it has helped prevent schizophrenic relapse in over 400 participating patients. The outcomes of this thesis are widely applicable not just to schizophrenic patients but also to other psychotic or non-psychotic diseases that follow a relapsing course and satisfy certain preconditions defined in this thesis. Two main areas of improvement are proposed. First, the thesis studies various temporal data-mining methods to improve relapse prediction based on the history of diagnostic data. Second, the latest telecommunication technologies are used to improve the quality of the gathered diagnostic data directly at the source.
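The relapse-prediction part can be caricatured as a windowed time-series classification task. The sketch below is hypothetical: ITAREPS questionnaire data are not public, so the weekly symptom scores, the window length and the relapse rule are all invented for illustration.

```python
# Hypothetical sketch: predict an imminent "relapse" from a short history of
# weekly symptom-questionnaire scores (all quantities are synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
weeks, window = 200, 4
scores = rng.poisson(3, size=weeks).astype(float)   # weekly symptom scores

X, y = [], []
for t in range(weeks - 2 * window):
    X.append(scores[t:t + window])                                  # history window
    y.append(int(scores[t + window:t + 2 * window].sum() > 15))     # elevated scores soon after?
X, y = np.array(X), np.array(y)

clf = LogisticRegression(max_iter=1000)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```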
155

臺灣地區的人口推估研究 / The study of population projection: a case study in Taiwan area

黃意萍 Unknown Date (has links)
台灣地區的人口隨著生育率及死亡率的雙重下降而呈現快速老化，其中生育率的降低影響尤為顯著。民國50年時，台灣平均每位婦女生育5.58個小孩，到了民國70年卻只生育1.67個小孩，去年(民國90年)生育率更創歷年新低，只有1.4。死亡率的下降可由平均壽命的延長看出，民國75年時男性為70.97歲，女性為75.88歲；到了民國90年，男性延長到72.75歲，女性延長到78.49歲。由於生育率的變化幅度高於死亡率，對人口結構的影響較大，因此本文分成兩個部份，主要在研究台灣地區15至49歲婦女生育率的變化趨勢，再將研究結果用於台灣地區未來人口總數及其結構的預測。本研究第一部分是生育率的研究，引進Gamma函數、Gompertz函數、Lee-Carter法三種模型及單一年齡組個別估計法，以民國40年至84年(西元1951年至1995年)的資料為基礎，民國85年至89年(西元1996年至2000年)資料為檢測樣本，比較模型的優劣，尋求較適合台灣地區生育率的模型，再以最合適的模型預測民國91年至140年(西元2002年至2051年)的生育率。第二部分是人口推估，採用人口變動要素合成方法(Cohort Component Projection Method)推估台灣地區未來50年的人口總數及其結構，其中生育率採用上述最適合台灣地區的模型、死亡率則引進國外知名的Lee-Carter法及SOA法(Society of Actuaries)，探討人口結構，並與人力規劃處的結果比較之。 / Both the fertility rate and the mortality rate have been decreasing dramatically in recent years. As a result, population aging has become one of the major concerns in the Taiwan area, and the proportion of the elderly (age 65 and over) increased rapidly from 2.6% in 1965 to 8.8% in 2001. The decrease in the fertility rate is especially significant. For example, the total fertility rate was 5.58 in 1961 and then decreased dramatically to 1.67 in 1981 (1.4 in 2001), a reduction of almost 70% within 20 years. The goal of this paper is to study population aging in the Taiwan area, in particular the fertility pattern. The first part of this paper explores fertility models and decides which model is the most suitable based on age-specific fertility rates in Taiwan. The models considered are the Gamma function, the Gompertz function, the Lee-Carter method and individual age-group estimation. We use data from 1951 to 1995 as pilot data and from 1996 to 2000 as test data to judge which model fits well. The second part of this study projects the Taiwan population for the next 50 years, i.e. 2002-2051. The projection method used is the Cohort Component Projection method, assuming the population in the Taiwan area is closed. We also compare our projection results with those of the Council for Economic Planning and Development, Executive Yuan of the Republic of China.
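The cohort component projection method named above advances a population by applying survival rates to existing cohorts and adding new births at each step. The toy sketch below uses three coarse age groups with made-up rates, not the thesis's Taiwan estimates.

```python
# Toy cohort-component step with three age groups; rates are illustrative only.
import numpy as np

pop = np.array([1.8, 2.1, 0.9])            # millions in ages 0-19, 20-64, 65+
survival = np.array([0.995, 0.97, 0.85])   # chance of surviving into the next step
fertility = 0.35                           # births per person aged 20-64 per step

def project(pop, steps=1):
    """Each step: births enter the youngest group, survivors age forward,
    and surviving elderly remain in the open-ended 65+ group."""
    for _ in range(steps):
        births = fertility * pop[1]
        pop = np.array([births,
                        survival[0] * pop[0],
                        survival[1] * pop[1] + survival[2] * pop[2]])
    return pop

print(project(pop, steps=2))
```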
156

以部分法修正地理加權迴歸 / A conditional modification to geographically weighted regression

梁穎誼, Leong, Yin Yee Unknown Date (has links)
在二十世紀九十年代,學者提出地理加權迴歸(Geographically Weighted Regression;簡稱GWR)。GWR是一個企圖解決空間非穩定性的方法。此方法最大的特性,是模型中的迴歸係數可以依空間的不同而改變,這也意味著不同的地理位置可以有不同的迴歸係數。在係數的估計上,每個觀察值都擁有一個固定環寬,而估計值可以由環寬範圍內的觀察值取得。然而,若變數之間的特性不同,固定環寬的設定可能會產生不可靠的估計值。 為了解決這個問題,本文章提出CGWR(Conditional-based GWR)的方法嘗試修正估計值,允許各迴歸變數有不同的環寬。在估計的程序中,CGWR運用疊代法與交叉驗證法得出最終的估計值。本文驗證了CGWR的收斂性,也同時透過電腦模擬比較GWR, CGWR與local linear法(Wang and Mei, 2008)的表現。研究發現,當迴歸係數之間存有正相關時,CGWR比其他兩個方法來的優異。最後,本文使用CGWR分析台灣高齡老人失能資料,驗證CGWR的效果。 / Geographically weighted regression (GWR), first proposed in the 1990s, is a modelling technique used to deal with spatial non-stationarity. The main characteristic of GWR is that it allows regression coefficients to vary across space, and so the values of the parameters can vary depending on locations. The parameters for each location can be estimated by observations within a fixed range (or bandwidth). However, if the parameters differ considerably, the fixed bandwidth may produce unreliable or even unstable estimates. To deal with the estimation of greatly varying parameter values, we propose Conditional-based GWR (CGWR), where a different bandwidth is selected for each independent variable. The bandwidths for the independent variables are derived via an iteration algorithm using cross-validation. In addition to showing the convergence of the algorithm, we also use computer simulation to compare the proposed method with the basic GWR and a local linear method (Wang and Mei, 2008). We found that the CGWR outperforms the other two methods if the parameters are positively correlated. In addition, we use elderly disability data from Taiwan to demonstrate the proposed method.
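For readers unfamiliar with GWR, the sketch below shows the basic local weighted least-squares fit with a single fixed Gaussian bandwidth shared by all variables; CGWR, as described above, would instead iterate such fits with a separate cross-validated bandwidth per coefficient. Data and bandwidth are synthetic.

```python
# Minimal GWR sketch: one weighted least-squares fit per location,
# with Gaussian spatial weights and a single fixed bandwidth.
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    Xd = np.column_stack([np.ones(len(y)), X])       # add an intercept
    betas = []
    for c in coords:
        d2 = np.sum((coords - c) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))       # Gaussian spatial weights
        W = np.diag(w)
        betas.append(np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y))
    return np.array(betas)

rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(100, 2))
X = rng.normal(size=(100, 2))
# the first coefficient drifts with location -- the situation GWR targets
y = (1 + 0.3 * coords[:, 0]) * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)
print(gwr_coefficients(coords, X, y, bandwidth=2.0)[:3])
```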
157

High angular resolution diffusion-weighted magnetic resonance imaging: adaptive smoothing and applications

Metwalli, Nader 07 July 2010 (has links)
Diffusion-weighted magnetic resonance imaging (MRI) has allowed unprecedented non-invasive mapping of brain neural connectivity in vivo by means of fiber tractography applications. Fiber tractography has emerged as a useful tool for mapping brain white matter connectivity prior to surgery or in an intraoperative setting. The advent of high angular resolution diffusion-weighted imaging (HARDI) techniques in MRI for fiber tractography has allowed mapping of fiber tracts in areas of complex white matter fiber crossings. Raw HARDI images, as a result of elevated diffusion-weighting, suffer from depressed signal-to-noise ratio (SNR) levels. The accuracy of fiber tractography is dependent on the performance of the various methods extracting dominant fiber orientations from the HARDI-measured noisy diffusivity profiles. These methods will be sensitive to and directly affected by the noise. In the first part of the thesis this issue is addressed by applying an objective and adaptive smoothing to the noisy HARDI data via generalized cross-validation (GCV) by means of the smoothing splines on the sphere method for estimating the smooth diffusivity profiles in three dimensional diffusion space. Subsequently, fiber orientation distribution functions (ODFs) that reveal dominant fiber orientations in fiber crossings are then reconstructed from the smoothed diffusivity profiles using the Funk-Radon transform. Previous ODF smoothing techniques have been subjective and non-adaptive to data SNR. The GCV-smoothed ODFs from our method are accurate and are smoothed without external intervention facilitating more precise fiber tractography. Diffusion-weighted MRI studies in amyotrophic lateral sclerosis (ALS) have revealed significant changes in diffusion parameters in ALS patient brains. With the need for early detection of possibly discrete upper motor neuron (UMN) degeneration signs in patients with early ALS, a HARDI study is applied in order to investigate diffusion-sensitive changes reflected in the diffusion tensor imaging (DTI) measures axial and radial diffusivity as well as the more commonly used measures fractional anisotropy (FA) and mean diffusivity (MD). The hypothesis is that there would be added utility in considering axial and radial diffusivities which directly reflect changes in the diffusion tensors in addition to FA and MD to aid in revealing neurodegenerative changes in ALS. In addition, applying adaptive smoothing via GCV to the HARDI data further facilitates the application of fiber tractography by automatically eliminating spurious noisy peaks in reconstructed ODFs that would mislead fiber tracking.
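The generalized cross-validation criterion used above to set the smoothing level can be illustrated on a generic linear smoother. The sketch below applies GCV to a one-dimensional penalized polynomial fit rather than to smoothing splines on the sphere; that substitution, and the synthetic data, are assumptions made to keep the example short.

```python
# GCV sketch for a generic linear smoother y_hat = H(lambda) @ y.
import numpy as np

def gcv_score(y, H):
    """GCV(lambda) = n * ||(I - H) y||^2 / (n - tr(H))^2 for hat matrix H."""
    n = len(y)
    resid = y - H @ y
    return n * np.sum(resid ** 2) / (n - np.trace(H)) ** 2

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 80)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

B = np.vander(x, 8, increasing=True)        # modest polynomial basis
lams = np.logspace(-8, 2, 60)
best = min(lams, key=lambda lam: gcv_score(
    y, B @ np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T)))
print("GCV-selected penalty:", best)
```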
158

Three Essays on Application of Semiparametric Regression: Partially Linear Mixed Effects Model and Index Model / Drei Aufsätze über Anwendung der Semiparametrischen Regression: Teilweise Lineares Gemischtes Modell und Index Modell

Ohinata, Ren 03 May 2012 (has links)
No description available.
159

Méthode non-paramétrique des noyaux associés mixtes et applications / Non parametric method of mixed associated kernels and applications

Libengue Dobele-kpoka, Francial Giscard Baudin 13 June 2013 (has links)
Nous présentons dans cette thèse, l'approche non-paramétrique par noyaux associés mixtes, pour les densités à supports partiellement continus et discrets. Nous commençons par rappeler d'abord les notions essentielles d'estimation par noyaux continus (classiques) et noyaux associés discrets. Nous donnons la définition et les caractéristiques des estimateurs à noyaux continus (classiques) puis discrets. Nous rappelons aussi les différentes techniques de choix de paramètres de lissage et nous revisitons les problèmes de supports ainsi qu'une résolution des effets de bord dans le cas discret. Ensuite, nous détaillons la nouvelle méthode d'estimation de densités par les noyaux associés continus, lesquels englobent les noyaux continus (classiques). Nous définissons les noyaux associés continus et nous proposons la méthode mode-dispersion pour leur construction, puis nous illustrons ceci sur les noyaux associés non-classiques de la littérature, à savoir bêta et sa version étendue, gamma et son inverse, gaussien inverse et sa réciproque, le noyau de Pareto ainsi que le noyau lognormal. Nous examinons par la suite les propriétés des estimateurs qui en sont issus, plus précisément le biais, la variance et les erreurs quadratiques moyennes ponctuelles et intégrées. Puis, nous proposons un algorithme de réduction de biais que nous illustrons sur ces mêmes noyaux associés non-classiques. Des études par simulations sont faites sur trois types d'estimateurs à noyaux lognormaux. Par ailleurs, nous étudions les comportements asymptotiques des estimateurs de densité à noyaux associés continus. Nous montrons d'abord les consistances faibles et fortes ainsi que la normalité asymptotique ponctuelle. Ensuite nous présentons les résultats des consistances faibles et fortes globales en utilisant les normes uniformes et L1. Nous illustrons ceci sur trois types d'estimateurs à noyaux lognormaux. Par la suite, nous étudions les propriétés minimax des estimateurs à noyaux associés continus. Nous décrivons d'abord le modèle puis nous donnons les hypothèses techniques avec lesquelles nous travaillons. Nous présentons ensuite nos résultats minimax tout en les appliquant sur les noyaux associés non-classiques bêta, gamma et lognormal. Enfin, nous combinons les noyaux associés continus et discrets pour définir les noyaux associés mixtes. De là, les outils d'unification d'analyses discrètes et continues sont utilisés, pour montrer les différentes propriétés des estimateurs à noyaux associés mixtes. Une application sur un modèle de mélange des lois normales et de Poisson tronquées est aussi donnée. Tout au long de ce travail, nous choisissons le paramètre de lissage uniquement avec la méthode de validation croisée par les moindres carrés. / We present in this thesis the non-parametric approach using mixed associated kernels, for densities whose supports are partially continuous and discrete. We first recall the essential concepts of classical continuous and discrete kernel density estimators. We give the definition and characteristics of these estimators. We also recall the various techniques for choosing smoothing parameters, and we revisit the support problems as well as a resolution of the edge effects in the discrete case. Then, we describe a new method of continuous associated kernels for estimating densities with bounded support, which includes the classical continuous kernel method. We define the continuous associated kernels and we propose the mode-dispersion method for their construction. Moreover, we illustrate this on the non-classical associated kernels of the literature, namely beta and its extended version, gamma and its inverse, inverse Gaussian and its reciprocal, the Pareto kernel and the lognormal kernel. We subsequently examine the properties of the resulting estimators, specifically the bias, the variance and the pointwise and integrated mean squared errors. Then, we propose a bias-reduction algorithm that we illustrate on these non-classical associated kernels. Some simulation studies are performed on three types of lognormal kernel estimators. Also, we study the asymptotic behavior of the continuous associated kernel density estimators. We first show the pointwise weak and strong consistencies as well as the asymptotic normality. Then, we present the results of the global weak and strong consistencies using uniform and L1 norms. We illustrate this on three types of lognormal kernel estimators. Subsequently, we study the minimax properties of the continuous associated kernel estimators. We first describe the model and give the technical assumptions with which we work. Then we present our minimax results and apply them to some non-classical associated kernels, more precisely the beta, gamma and lognormal kernel estimators. Finally, we combine continuous and discrete associated kernels to define the mixed associated kernels. Using the tools of the unification of discrete and continuous analysis, we show the different properties of the mixed associated kernel estimators. Throughout this work, we choose the smoothing parameter using only the least squares cross-validation method.
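A sketch of the least-squares cross-validation step on positive data with a lognormal-type associated kernel follows. The parameterization used here (sigma equal to the bandwidth, location chosen so the kernel's mode sits at the target point) is one convention from the associated-kernel literature and may differ from the thesis's exact construction; the data are synthetic.

```python
# Hedged sketch: least-squares cross-validation of the bandwidth for a
# lognormal-type associated kernel density estimator on positive data.
import numpy as np
from scipy.stats import lognorm

def fhat(x, data, h):
    """Associated-kernel estimate at target x: average over the observations of a
    lognormal kernel whose mode is placed at x (mode = exp(mu - h^2) = x)."""
    mu = np.log(x) + h ** 2
    return lognorm.pdf(data, s=h, scale=np.exp(mu)).mean()

def lscv(h, data, grid):
    """LSCV(h) = integral of fhat^2 minus twice the mean leave-one-out estimate."""
    f_grid = np.array([fhat(x, data, h) for x in grid])
    int_f2 = np.sum(f_grid ** 2) * (grid[1] - grid[0])
    loo = np.array([fhat(xi, np.delete(data, i), h) for i, xi in enumerate(data)])
    return int_f2 - 2 * loo.mean()

rng = np.random.default_rng(5)
data = rng.gamma(shape=2.0, scale=1.5, size=200)
grid = np.linspace(1e-3, data.max() * 1.5, 400)
hs = np.linspace(0.05, 1.0, 20)
print("LSCV bandwidth:", hs[np.argmin([lscv(h, data, grid) for h in hs])])
```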
160

Preprocesserings påverkan på prediktiva modeller : En experimentell analys av tidsserier från fjärrvärme / Impact of preprocessing on predictive models : An experimental analysis of time series from district heating

Andersson, Linda, Laurila, Alex, Lindström, Johannes January 2021 (has links)
Värme står för det största energibehovet inom hushåll och andra byggnader i samhället och olika tekniker används för att kunna reducera mängden energi som går åt för att spara på både miljö och pengar. Ett angreppssätt på detta problem är genom informatiken, där maskininlärning kan användas för att analysera och förutspå värmebehovet. I denna studie används maskininlärning för att prognostisera framtida energiförbrukning för fjärrvärme utifrån historisk fjärrvärmedata från ett fjärrvärmebolag tillsammans med exogena variabler i form av väderdata från Sveriges meteorologiska och hydrologiska institut. Studien är skriven på svenska och utforskar effekter av preprocessering hos prediktionsmodeller som använder tidsseriedata för att prognostisera framtida datapunkter. Stegen som utförs i studien är normalisering, interpolering, hantering av numeric outliers och missing values, datetime feature engineering, säsongsmässighet, feature selection, samt korsvalidering. Maskininlärningsmodellen som används i studien är Multilayer Perceptron som är en subkategori av artificiellt neuralt nätverk. Forskningsfrågan som besvaras fokuserar på effekter av preprocessering och feature selection för prediktiva modellers prestanda inom olika datamängder och kombinationer av preprocesseringsmetoder. Modellerna delades upp i tre olika datamängder utifrån datumintervall: 2009, 2007–2011, samt 2007–2017, där de olika kombinationerna utgörs av preprocesseringssteg som kombineras inom en iterativ process. Procentuella ökningar på R2-värden för dessa olika intervall har uppnått 47,45% för ett år, 9,97% för fem år och 32,44% för 11 år. I stora drag bekräftar och förstärker resultatet befintlig teori som menar på att preprocessering kan förbättra prediktionsmodeller. Ett antal mindre observationer kring enskilda preprocesseringsmetoders effekter har identifierats och diskuterats i studien, såsom DateTime Feature Engineerings negativa effekter på modeller som tränats med ett mindre antal iterationer. / Heat accounts for the greatest energy needs in households and other buildings in society. Effective production and distribution of heat energy require techniques for minimising economic and environmental costs. One approach to this problem is through informatics where machine learning is used to analyze and predict the heating needs with the help of historical data from a district heating company and exogenous variables in the form of weather data from Sweden's Meteorological and Hydrological Institute (SMHI). This study is written in Swedish and explores the importance of preprocessing practices before training and using prediction models which utilizes time-series data to predict future energy consumption. The preprocessing steps explored in this study consists of normalization, interpolation, identification and management of numerical outliers and missing values, datetime feature engineering, seasonality, feature selection and cross-validation. The machine learning model used in this study is Multilayer Perceptron which is a subcategory of artificial neural network. The research question focuses on the effects of preprocessing and feature selection for predictive model performance within different datasets and combinations of preprocessing methods. The models were divided into three different data sets based on date ranges: 2009, 2007–2011, and 2007–2017, where the different combinations consist of preprocessing steps that are combined within an iterative process. 
Percentage increases in R2 values for these different ranges reached 47.45% for one year, 9.97% for five years and 32.44% for 11 years. The results broadly confirm and reinforce the existing theory that preprocessing can improve prediction models. A few minor observations about the effects of individual preprocessing methods have been identified and discussed in the study, such as DateTime Feature Engineering having a detrimental effect on models with very few training iterations.
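The preprocessing-plus-MLP pipeline studied above can be sketched as follows. The data are synthetic, and the feature set, network size and fold scheme are assumptions for the example, not the study's actual configuration.

```python
# Sketch: normalisation, calendar (datetime) features, an exogenous weather
# variable and an MLP regressor evaluated with time-ordered cross-validation.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

idx = pd.date_range("2009-01-01", periods=2000, freq="h")
rng = np.random.default_rng(6)
temp = 5 + 10 * np.sin(2 * np.pi * idx.dayofyear / 365) + rng.normal(0, 2, len(idx))
load = 50 - 1.5 * temp + 5 * np.sin(2 * np.pi * idx.hour / 24)   # synthetic heat demand

X = pd.DataFrame({"temp": temp, "hour": idx.hour, "weekday": idx.weekday, "month": idx.month})
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0))
print("R2 per fold:", cross_val_score(model, X, load, cv=TimeSeriesSplit(n_splits=5), scoring="r2"))
```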
