  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
441

Adaptive sequential feature selection in visual perception and pattern recognition

Avdiyenko, Liliya 15 September 2014 (has links)
In the human visual system, one of the most prominent functions of the extensive feedback from the higher brain areas within and outside of the visual cortex is attentional modulation. The feedback helps the brain to concentrate its resources on visual features that are relevant for recognition, i.e., it iteratively selects certain aspects of the visual scene for refined processing by the lower areas until the inference process in the higher areas converges to a single hypothesis about this scene. In order to minimize the number of required selection-refinement iterations, one has to find a short sequence of maximally informative portions of the visual input. Since the feedback is not static, the selection process is adapted to the scene being recognized. To find a scene-specific subset of informative features, the adaptive selection process uses at every iteration the results of previous processing to reduce the remaining uncertainty about the visual scene. This phenomenon inspired us to develop a computational algorithm for visual classification that incorporates this principle of adaptive feature selection. The approach is especially interesting because feature selection methods are usually not adaptive: they define a single set of informative features for a task and use it for classifying all objects. An adaptive algorithm, in contrast, selects the features that are most informative for the particular input, so the selection process is driven by statistics of the environment concerning the current task and the object to be classified. Applied to a classification task, our adaptive feature selection algorithm favors features that maximally reduce the current class uncertainty, which is iteratively updated with the values of the previously selected features observed on the testing sample.
In information-theoretical terms, the selection criterion is the mutual information of the class variable and a candidate feature, conditioned on the already selected features, which take the values observed on the current testing sample. The main question investigated in this thesis is whether, and in which situations, the proposed adaptive way of selecting features is advantageous over conventional feature selection. Further, we studied whether the proposed adaptive information-theoretical selection scheme, a computationally complex algorithm, is employed by humans while they perform a visual classification task. For this, we constructed a psychophysical experiment in which people had to select the image parts they considered relevant for classifying the images. We present an analysis of the behavioral data in which we investigate whether human strategies of task-dependent selective attention can be explained by a simple ranker based on mutual information, by a more complex feature selection algorithm based on conventional static mutual information, or by the adaptive feature selector proposed here, which mimics the mechanism of iterative hypothesis refinement. The main contribution of this work is the adaptive feature selection criterion based on conditional mutual information. It is also shown that such an adaptive selection strategy is indeed used by people while performing visual classification. Contents: 1. Introduction; 2. Conventional feature selection; 3. Adaptive feature selection; 4. Experimental investigations of ACMIFS; 5. Information-theoretical strategies of selective attention; 6. Discussion; Appendix; Bibliography.
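The adaptive criterion described above (pick, at each step, the feature with the highest mutual information with the class, conditioned on the values already observed on the test sample) can be sketched as follows. This is an illustrative reconstruction, not the thesis' ACMIFS implementation; for discrete data, conditioning on observed values amounts to restricting the training set to the rows consistent with those values.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) of two discrete arrays, in nats."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((px[xv] / n) * (py[yv] / n)))
    return mi

def adaptive_select(X, y, x_test, k):
    """Greedily pick k features. At each step, a candidate's score is its MI
    with the class computed only on the training rows that agree with the test
    sample on the already selected features, i.e. conditioning on the observed
    values rather than on the feature variables as a whole."""
    selected, mask = [], np.ones(len(y), dtype=bool)
    for _ in range(k):
        rest = [f for f in range(X.shape[1]) if f not in selected]
        scores = {f: mutual_information(X[mask, f], y[mask]) for f in rest}
        best = max(scores, key=scores.get)
        selected.append(best)
        mask &= (X[:, best] == x_test[best])  # condition on the observed value
        if mask.sum() == 0:                   # no consistent training rows left
            break
    return selected
```

A static selector would score each candidate on the full training set once; the adaptive version re-scores after every observation, so different test samples can receive different feature sequences.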
442

Dynamic prediction of repair costs in heavy-duty trucks

Saigiridharan, Lakshidaa January 2020 (has links)
Pricing of repair and maintenance (R&M) contracts is one of the most important processes carried out at Scania. Repair costs are currently predicted with experience-based methods that rely on average repair costs of contracts terminated in the recent past rather than on statistical modelling, an approach that is difficult to apply to a reference population of rigid Scania trucks. The purpose of this study is therefore to develop suitable statistical models for predicting the repair costs of four variants of rigid Scania trucks. The study gathers repair data from multiple sources and performs feature selection using the Akaike Information Criterion (AIC) to extract the most significant features influencing repair costs for each truck variant. The study shows that including operational features as a factor can further inform the pricing of contracts. The hurdle Gamma model, which is widely used to handle zero inflation in Generalized Linear Models (GLMs), is fitted to the data, which consist of numerous zero and non-zero values. Due to the inherent hierarchical structure within the data, expressed by individual chassis, a hierarchical hurdle Gamma model is also implemented. Both statistical models are found to perform much better than the experience-based prediction method, as evaluated with the mean absolute error (MAE) and root mean square error (RMSE) statistics. A final model comparison is conducted using the AIC to draw conclusions based on the goodness of fit and predictive performance of the two statistical models. On assessing the models with these statistics, the hierarchical hurdle Gamma model was found to produce the best predictions.
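The hurdle Gamma idea, a Bernoulli part for whether any repair cost occurs at all and a Gamma part for its size when it does, can be sketched in a minimal intercept-only form. The thesis fits full GLMs with covariates and a hierarchical chassis level; the function below is an illustrative simplification using method-of-moments for the Gamma part.

```python
import numpy as np

def fit_hurdle_gamma(costs):
    """Intercept-only hurdle Gamma fit: a Bernoulli part for P(cost > 0) and a
    Gamma part (method of moments) for the positive costs. The expected cost
    decomposes as E[cost] = P(cost > 0) * E[cost | cost > 0]."""
    costs = np.asarray(costs, dtype=float)
    p_pos = np.mean(costs > 0)            # hurdle (Bernoulli) part
    pos = costs[costs > 0]
    mean_pos = pos.mean()                 # Gamma mean
    shape = mean_pos**2 / pos.var()       # method-of-moments shape k
    return {"p_pos": p_pos, "mean_pos": mean_pos, "shape": shape,
            "expected_cost": p_pos * mean_pos}
```

In the full model both parts are regressions on truck features, and the hierarchical variant adds chassis-level random effects; the two-part decomposition above stays the same.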
443

Predicting Subprime Customers' Probability of Default Using Transaction and Debt Data from NPLs

Wong, Lai-Yan January 2021 (has links)
This thesis aims to predict the probability of default (PD) of non-performing loan (NPL) customers using transaction and debt data, as part of developing a credit scoring model for Hoist Finance. Many NPL customers are conventionally considered bad customers because of their defaults and therefore face financial exclusion. Hoist Finance is a company that manages NPLs and believes that not all conventionally considered subprime customers are high-risk customers; it wants to offer them financial inclusion through favourable loans. In this thesis, logistic regression was used to model the PD of NPL customers at Hoist Finance based on 12 months of data. Different feature selection (FS) methods were explored, and the best model utilized l1-regularization for FS and predicted with 85.71% accuracy that 6,277 out of 27,059 customers had a PD between 0% and 10%, which supports this belief. Analysis of the PD showed that it increased almost linearly with an increase in debt quantity, original total claim amount, or number of missed payments. The analysis also showed that the payment behaviour in the last quarter had the most predictive power. At the same time, analysis of the type II error showed that the model was unable to capture some bad payment behaviour because it placed too much emphasis on the last quarter.
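The use of l1-regularization for feature selection in logistic regression can be sketched with a small proximal-gradient (ISTA) implementation: coefficients soft-thresholded to exactly zero correspond to discarded features. This is an illustrative sketch under synthetic data, not the model built for Hoist Finance.

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.1, iters=2000):
    """L1-regularised logistic regression via proximal gradient (ISTA).
    The soft-threshold step drives uninformative coefficients exactly to
    zero, which is what makes the penalty act as a feature selector.
    The bias term is left unpenalised."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        g = X.T @ (p - y) / n                    # gradient of the log-loss
        b -= lr * np.mean(p - y)
        w -= lr * g
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w, b
```

With an informative and a pure-noise feature, the noise coefficient is shrunk to (near) zero while the informative one survives, mirroring the FS behaviour described above.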
444

Benchmarking Renewable Energy Supply Forecasts

Ulbricht, Robert 19 July 2021 (has links)
The ability to generate precise numerical forecasts is important for enterprises preparing for uncertain future developments. For utility companies, forecasts of prospective energy demand are a crucial component for maintaining the physical stability and reliability of electricity grids. The constantly increasing capacity of fluctuating renewable energy sources creates a challenge in balancing power supply and demand. To allow for better integration, supply forecasting has become an important topic in the research field of energy data management, and many new forecasting methods have been proposed in the literature. However, choosing the optimal solution for a specific forecasting problem remains a time- and work-intensive task, as meaningful benchmarks are rare and there is still no standard, easy-to-use, and robust approach. Many of the models in use are obtained by executing black-box machine learning tools and are then manually optimized by human experts via trial and error towards the requirements of the underlying use case. Due to the lack of standardized evaluation methodologies and access to experimental data, these results are not easily comparable. In this thesis, we address the topic of systematic benchmarks for renewable energy supply forecasts. These usually involve two stages, requiring a weather and an energy forecast model. The latter can be selected among the classes of physical, statistical, and hybrid models, and selecting an appropriate model is one of the major tasks in the forecasting process. We conducted an empirical analysis to assess the most popular forecasting methods. In contrast to the classical time- and resource-intensive, mostly manual evaluation procedure, we developed a more efficient decision-support solution.
With the inclusion of contextual information, our heuristic approach HMR is able to identify suitable examples in a case base and to generate a recommendation from the results of already existing solutions. The use of time series representations reduces the dimensionality of the original data, allowing for an efficient search in large data sets. A context-aware evaluation methodology is introduced to assess a forecast's quality based on its monetary return in the corresponding market environment; results usually evaluated with statistical accuracy criteria become more interpretable through the estimation of real-world impacts. Finally, we introduce the ECAST framework as an open and easy-to-use online platform that supports the benchmarking of energy time series forecasting methods. It aids inexperienced practitioners by supporting the execution of automated tasks, making complex benchmarks much more efficient and easier to handle. The integration of modules like the Ensembler, the Recommender, and the Evaluator provides additional value for forecasters. Reliable benchmarks can be conducted on this basis, while analytical functions for output explanation provide transparency for the user. Contents: 1. Introduction; 2. Energy Data Management Challenges; 3. Energy Supply Forecasting Concepts; 4. Relevance of Renewable Energy Forecasting Methods; 5. Heuristic Model Recommendation; 6. Value-based Result Evaluation Methodology; 7. ECAST Benchmark Framework; 8. Conclusions.
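The case-base search behind a recommender like HMR (compress each series to a low-dimensional representation, then recommend the method that worked on the most similar stored case) can be sketched as follows. The chosen representation, piecewise aggregate approximation, and the case-base layout are illustrative assumptions, not the exact HMR procedure.

```python
import numpy as np

def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of equal-length segments,
    one classic way to reduce a time series' dimensionality before searching
    a case base."""
    series = np.asarray(series, dtype=float)
    return np.array([chunk.mean() for chunk in np.array_split(series, segments)])

def recommend(case_base, query, segments=4):
    """Return the forecasting method stored with the most similar case,
    comparing PAA sketches by Euclidean distance. `case_base` is assumed
    to be a list of (series, method_name) pairs."""
    q = paa(query, segments)
    dists = [np.linalg.norm(paa(s, segments) - q) for s, _ in case_base]
    return case_base[int(np.argmin(dists))][1]
```

Because distances are computed on the short PAA sketches rather than the raw series, the lookup stays cheap even when the case base is large.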
445

Insurance Fraud Detection using Unsupervised Sequential Anomaly Detection

Hansson, Anton, Cedervall, Hugo January 2022 (has links)
Fraud is a common crime within the insurance industry, and insurance companies want to identify fraudulent claimants quickly, as fraud often results in higher premiums for honest customers. With the digital transformation, the sheer volume and complexity of available data have grown, and manual fraud detection is no longer suitable. This work aims to automate the detection of fraudulent claimants and to gain practical insights into fraudulent behavior using unsupervised anomaly detection, which, compared to supervised methods, allows for a more cost-efficient and practical application in the insurance industry. To obtain interpretable results and to benefit from the temporal dependencies in human behavior, we propose two variations of LSTM-based autoencoders to classify sequences of insurance claims. Autoencoders can provide feature importances that give insight into the models' predictions, which is essential when models are put into practice. This approach relies on the assumption that outliers in the data are fraudulent. The models were trained and evaluated on a dataset we engineered using data from a Swedish insurance company, where the few labeled frauds that existed were used solely for validation and testing. Experimental results show state-of-the-art performance, and further evaluation shows that the combination of autoencoders and LSTMs is efficient but performs similarly to the employed baselines. This thesis provides an entry point for interested practitioners to learn key aspects of anomaly detection within fraud detection by thoroughly discussing the subject at hand and the details of our work.
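The core detection principle, training an autoencoder on mostly normal data and flagging samples that reconstruct poorly, can be illustrated with a linear autoencoder (equivalent to PCA) as a minimal stand-in for the LSTM autoencoders used in the thesis.

```python
import numpy as np

def fit_linear_ae(X, k):
    """PCA as a linear autoencoder: the encoder/decoder are the top-k right
    singular vectors of the centred data. A deliberately simple stand-in for
    an LSTM autoencoder; the detection idea is identical."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k]

def anomaly_scores(X, mu, components):
    """Reconstruction error per sample; anomalies score high because the
    model was shaped by the (mostly normal) training data."""
    Z = (X - mu) @ components.T               # encode
    X_hat = Z @ components + mu               # decode
    return np.square(X - X_hat).sum(axis=1)   # squared reconstruction error
```

A threshold on the scores (e.g. a high quantile) then separates suspected frauds from normal claim sequences, matching the outliers-are-fraudulent assumption stated above.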
446

A multi-sensor approach for land cover classification and monitoring of tidal flats in the German Wadden Sea

Jung, Richard 07 April 2016 (has links)
Sand and mud traversed by tidal inlets and channels that split into subtle branches, salt marshes at the coast, the tide, harsh weather conditions, and a high diversity of fauna and flora characterize the ecosystem of the Wadden Sea. No other landscape on Earth changes in such a dynamic manner. Land cover classification and monitoring of vulnerable ecosystems is therefore one of the most important applications of remote sensing and has drawn much attention in recent years. The Wadden Sea in the southeastern part of the North Sea is one such vulnerable ecosystem, highly dynamic and diverse. Its tidal flats form the zone of interaction between marine and terrestrial environments and are at risk from climate change, pollution, and anthropogenic pressure. The European Union has therefore implemented various directives that formulate objectives such as achieving or maintaining a good environmental status or a favourable conservation status within a given time. In this context, permanent observation is needed to estimate the ecological condition; changes can then be tracked or even foreseen and an appropriate response becomes possible. It is thus important to distinguish between short-term changes, which are related to the dynamic nature of the ecosystem, and long-term changes, which are the result of extraneous influences. Accessibility from both sea and land is very poor, which makes monitoring and mapping of tidal flat environments from in situ measurements very difficult and cost-intensive. For the monitoring of large areas, time-saving applications are needed. In this context, remote sensing offers great possibilities, since it provides large spatial coverage and non-intrusive measurements of the Earth's surface.
Previous studies in remote sensing have focused on the use of electro-optical and radar sensors for remote sensing of tidal flats, whereas microwave systems using synthetic aperture radar (SAR) can be a complementary tool for tidal flat observation, especially due to their high spatial resolution and all-weather imaging capability. Nevertheless, the repetitive tidal event and dynamic sedimentary processes make an integrated observation of tidal flats from multi-sourced datasets essential for mapping and monitoring. The main challenge for remote sensing of tidal flats is to isolate the sediment, vegetation or shellfish bed features in the spectral signature or backscatter intensity from interference by water, the atmosphere, fauna and flora. In addition, optically active materials, such as plankton, suspended matter and dissolved organics, affect the scattering and absorption of radiation. Tidal flats are spatially complex and temporally quite variable and thus mapping tidal land cover requires satellites or aircraft imagers with high spatial and temporal resolution and, in some cases, hyperspectral data. In this research, a hierarchical knowledge-based decision tree applied to multi-sensor remote sensing data is introduced and the results have been visually and numerically evaluated and subsequently analysed. The multi-sensor approach comprises electro-optical data from RapidEye, SAR data from TerraSAR-X and airborne LiDAR data in a decision tree. Moreover, spectrometric and ground truth data are implemented into the analysis. The aim is to develop an automatic or semi-automatic procedure for estimating the distribution of vegetation, shellfish beds and sediments south of the barrier island Norderney. The multi-sensor approach starts with a semi-automatic pre-processing procedure for the electro-optical data of RapidEye, LiDAR data, spectrometric data and ground truth data. 
The decision tree classification is based on a set of hierarchically structured algorithms that use object and texture features. In each decision, one satellite dataset is applied to estimate a specific class. This helps to overcome the drawbacks that arise from a combined usage of all remote sensing datasets for one class. This could be shown by the comparison of the decision tree results with a popular state-of-the-art supervised classification approach (random forest). Subsequent to the classification, a discrimination analysis of various sediment spectra, measured with a hyperspectral sensor, has been carried out. In this context, the spectral features of the tidal sediments were analysed and a feature selection method has been developed to estimate suitable wavelengths for discrimination with very high accuracy. The developed feature selection method ‘JMDFS’ (Jeffries-Matusita distance feature selection) is a filter-based supervised band elimination technique and is based on the local Euclidean distance and the Jeffries-Matusita distance. An iterative process is used to subsequently eliminate wavelengths and calculate a separability measure at the end of each iteration. If distinctive thresholds are achieved, the process stops and the remaining wavelengths are applied in the further analysis. The results have been compared with a standard feature selection method (ReliefF). The JMDFS method obtains similar results and runs 216 times faster. Both approaches are quantitatively and qualitatively evaluated using reference data and standard methodologies for comparison. The results show that the proposed approaches are able to estimate the land cover of the tidal flats and to discriminate the tidal sediments with moderate to very high accuracy. The accuracies of each land cover class vary according to the dataset used. 
Furthermore, it is shown that specific reflection features can be identified that help in discriminating tidal sediments and which should be used in further applications in tidal flats.
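The separability measure at the heart of JMDFS, the Jeffries-Matusita distance, can be sketched under the usual Gaussian class assumption; the full JMDFS iteration, band elimination, and thresholds are not reproduced here.

```python
import numpy as np

def jeffries_matusita(m1, c1, m2, c2):
    """JM distance between two classes assumed Gaussian with means m1, m2 and
    covariances c1, c2. Built on the Bhattacharyya distance B via
    JM = 2*(1 - exp(-B)), so it saturates at 2 for perfectly separable
    classes, which is what makes it a convenient band-ranking score."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    c = (np.asarray(c1, float) + np.asarray(c2, float)) / 2.0
    diff = m1 - m2
    b = (diff @ np.linalg.solve(c, diff)) / 8.0 \
        + 0.5 * np.log(np.linalg.det(c)
                       / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return 2.0 * (1.0 - np.exp(-b))
```

In a band-elimination scheme like JMDFS, wavelengths whose removal keeps the pairwise JM distances above a threshold can be dropped, and the process stops once separability would degrade.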
447

Development of statistical methods for genetic data analysis: identification of genetic polymorphisms potentially involved in skin aging

Bernard, Anne 20 December 2013 (has links)
New technologies developed in recent years in the field of genetics have generated very high-dimensional databases, in particular of Single Nucleotide Polymorphisms (SNPs); these databases are often characterized by a number of variables much larger than the number of individuals. The goal of this dissertation was to develop statistical methods suited to such high-dimensional data and able to select the variables most relevant to the biological problem at hand. The first part presents a state of the art of unsupervised and supervised variable selection methods for two or more blocks of variables. In the second part, two new unsupervised sparse variable selection methods are proposed: Group Sparse Principal Component Analysis (GSPCA) and sparse Multiple Correspondence Analysis (sparse MCA). Formulated as regression problems with a group LASSO penalization, these methods lead to the selection of blocks of quantitative and qualitative variables, respectively. The third part is devoted to interactions between SNPs, and a method dedicated to detecting such interactions, logic regression, is presented. Finally, the last part presents an application of these methods to a real SNP dataset, studying the possible influence of genetic polymorphism on the expression of facial skin aging in adult women. The methods developed gave promising results that met the biologists' expectations and open interesting new research perspectives.
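The flavor of sparse PCA can be illustrated with a soft-thresholded power iteration for the first sparse component. Note that GSPCA additionally uses a group penalty to zero out whole blocks of variables, which this single-variable sketch omits; the thresholding rule below is an illustrative choice.

```python
import numpy as np

def sparse_pc(X, lam=0.1, iters=200):
    """First sparse principal component via soft-thresholded power iteration:
    alternate a power step on the covariance matrix with a soft-threshold
    that zeroes small loadings (here, those below lam times the largest
    loading), then renormalise. Variables whose loadings end at exactly
    zero are deselected."""
    S = np.cov(X, rowvar=False)
    v = np.ones(S.shape[0]) / np.sqrt(S.shape[0])
    for _ in range(iters):
        v = S @ v                                               # power step
        v = np.sign(v) * np.maximum(np.abs(v) - lam * np.abs(v).max(), 0.0)
        v /= np.linalg.norm(v)                                  # renormalise
    return v
```

On data where two variables share a strong common factor and a third is independent low-variance noise, the third loading is driven exactly to zero, which is the selection effect the sparse penalty is after.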
448

Data analysis for aircraft trajectory optimization

Rommel, Cédric 26 October 2018 (has links)
This thesis deals with the use of flight data for the optimization of climb trajectories with respect to fuel consumption. We first focus on methods for identifying the aircraft dynamics, in order to plug them into the trajectory optimization problem. We suggest a static formulation of the identification problem, which we interpret as a structured multi-task regression problem. In this framework, we propose parametric models and use different maximum likelihood approaches to learn the unknown parameters. Furthermore, polynomial models are considered, and an extension of the bootstrap Lasso to the structured multi-task setting is used to make a consistent selection of the monomials despite the high correlations among them. Next, we consider the problem of assessing the optimized trajectories with respect to the validity region of the identified models. For this, we propose a probabilistic criterion for quantifying the closeness between an arbitrary curve and a set of trajectories sampled from the same stochastic process. We propose a class of estimators of this quantity and prove their consistency in some sense. A nonparametric implementation based on kernel density estimators, as well as a parametric implementation based on Gaussian mixtures, are presented. We introduce the latter as a penalty term in the trajectory optimization problem, which allows us to control the trade-off between trajectory acceptability and fuel consumption reduction.
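The nonparametric (kernel) variant of such a closeness criterion can be sketched as the average density of a curve's points under a Gaussian kernel estimate built from the sampled trajectories. The actual criterion in the thesis is time-dependent; this sketch ignores the time index and pools all trajectory points.

```python
import numpy as np

def kde_closeness(curve, trajectories, bandwidth=1.0):
    """Average Gaussian-kernel density of the curve's points under the cloud
    of points sampled from the reference trajectories. Curves far from the
    sampled process score low; a penalised objective could then use
    cost(curve) - mu * log(kde_closeness(curve, trajectories))."""
    pts = np.concatenate([np.asarray(t, float) for t in trajectories])
    curve = np.asarray(curve, float)
    d2 = (curve[:, None] - pts[None, :]) ** 2        # squared distances
    dens = np.exp(-d2 / (2 * bandwidth**2)).mean(axis=1) \
        / np.sqrt(2 * np.pi * bandwidth**2)          # kernel density per point
    return dens.mean()
```

Used as a penalty, the weight mu controls the trade-off described above: larger mu keeps the optimized trajectory inside the region supported by observed flights.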
449

Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset

Zoghi, Zeinab 30 November 2020 (has links)
No description available.
450

An Approach To Cluster And Benchmark Regional Emergency Medical Service Agencies

Kondapalli, Swetha 06 August 2020 (has links)
No description available.
