Global ETD Search

71	Forêts uniformément aléatoires et détection des irrégularités aux cotisations sociales / Detection of irregularities in social contributions using random uniform forests Ciss, Saïp 20 June 2014 In this thesis we present an application of statistical learning to the detection of irregularities in social contributions. Statistical learning aims to model problems in which there is a relationship, generally non-deterministic, between variables and the phenomenon to be evaluated. An essential aspect of this modeling is the prediction of unknown occurrences of the phenomenon from data already observed. For social contributions, the problem is represented by postulating a relationship between the contribution declarations filed by companies and the audits carried out by the collection agencies. Audit inspectors certify whether a certain number of declarations are accurate or inaccurate and, where applicable, notify the companies concerned of an adjustment. Through a model, the learning algorithm "learns" the relationship between declarations and audit results, then produces an evaluation of all declarations not yet audited. The first part of the evaluation labels each declaration as regular or irregular, with a certain probability. The second estimates the expected adjustment amount for each declaration. Within the URSSAF (Union de Recouvrement des cotisations de Sécurité sociale et d'Allocations Familiales) of Île-de-France, and under a CIFRE contract (Conventions Industrielles de Formation par la Recherche), we developed a model for detecting irregularities in social contributions, which we present and detail throughout the thesis. The algorithm runs in the free software R. It is fully operational and was tested under real conditions during 2012. Guaranteeing its properties and results requires probabilistic and statistical tools, and we discuss the theoretical aspects that accompanied its design. In the first part of the thesis, we give a general presentation of the problem of detecting irregularities in social contributions. In the second, we address detection specifically, through the data used to define and evaluate irregularities. In particular, the available data alone suffice to model detection. We also present there a new random forest algorithm, named "random uniform forest", which serves as the detection engine. In the third part, we detail the theoretical properties of random uniform forests. In the fourth, we present an economic point of view on the case where irregularities in social contributions are intentional, in the context of the fight against undeclared work. In particular, we examine the link between the financial situation of companies and social contribution fraud. The last part is devoted to the experimental and real-world results of the model, which we discuss. Each chapter of the thesis can be read independently of the others, and some notions are repeated to make the content easier to explore. / We present in this thesis an application of machine learning to irregularities in social contributions. In France, these are all the contributions due from employees and companies to the "Sécurité sociale", the French social welfare system (replacement income in case of unemployment, Medicare, pensions, ...). Social contributions are paid by companies to the URSSAF network, which is in charge of collecting them. Our main goal was to build a model able to detect irregularities with a low false positive rate. We begin the thesis by presenting the URSSAF, how irregularities can appear, how we can handle them, and what data we can use. Then, we discuss a new machine learning algorithm we developed for this purpose, "random uniform forests" (and its R package "randomUniformForest"), which are a variant of Breiman's "random Forests" (tm): they share the same principles but implement them in a different way. We present the theoretical background of the model and provide several examples. Then, we use it to show, when irregularities are fraud, how the financial situation of firms can affect their propensity for fraud. In the last chapter, we provide a full evaluation of the social contribution declarations of all firms in Île-de-France for the year 2013, using the model to predict whether declarations present irregularities.
Statistical learning Machine learning Ensemble learning Classification Regression Random uniform forests Decision trees Irregularities Fraud Social contributions URSSAF of Île-de-France 510 330
72	Horseshoe RuleFit : Learning Rule Ensembles via Bayesian Regularization Nalenz, Malte January 2016 This work proposes Hs-RuleFit, a learning method for regression and classification that combines rule ensemble learning based on the RuleFit algorithm with Bayesian regularization through the horseshoe prior. To this end, theoretical properties and potential problems of this combination are studied. A second step is the implementation, which uses recent sampling schemes to make Hs-RuleFit computationally feasible. Additionally, changes to the RuleFit algorithm are proposed, such as decision rule post-processing and the use of decision rules generated via Random Forest. Hs-RuleFit addresses the problem of finding highly accurate yet interpretable models. The method proves capable of finding compact sets of informative decision rules that give good insight into the data. Through the careful choice of prior distributions, the horseshoe prior proves superior to the Lasso in this context. In an empirical evaluation on 16 real data sets, Hs-RuleFit shows excellent performance in regression and outperforms the popular methods Random Forest, BART and RuleFit in terms of prediction error. The interpretability is demonstrated on selected data sets. This makes Hs-RuleFit a good choice for scientific domains in which interpretability is desired. Problems are found in classification, regarding the use of the horseshoe prior and rule ensemble learning in general. A simulation study is performed to isolate the problems, and potential solutions are discussed. Arguments are presented that the horseshoe prior could be a good choice in other machine learning areas, such as artificial neural networks and support vector machines.
Bayesian Statistics Regularization Ensemble Learning Decision Rules Horseshoe prior Machine Learning Knowledge Discovery Probability Theory and Statistics Computer Sciences Bioinformatics (Computational Biology) Other Computer and Information Science
73	Creation of a vocal emotional profile (VEP) and measurement tools Aghajani, Mahsa 10 1900 Speech is the dominant means of communication among humans. Voice signals carry both the information and the emotion of the speaker. Combining the two helps the receiver better understand what the speaker means and decreases the probability of misunderstandings. Robots and computers can also benefit from this mode of communication. The ability to recognize emotions in a speaker's voice helps computers serve human needs better. This improvement in communication between humans and computers leads to increased user satisfaction. In this study we propose several approaches to detect emotions from speech or voice computationally. We investigate how different machine learning and deep learning techniques and classifiers perform in detecting emotions from speech. The classifiers are trained on commonly used, well-known audio emotion datasets together with a custom dataset. This custom dataset was recorded from non-actor, non-expert people while trying to trigger the related emotions in them. This dataset matters because it makes the model proficient at recognizing emotions in people who are not as skilled as actors at reflecting emotions in their voices. Results from several machine learning and deep learning classifiers recognizing seven emotions (anger, happiness, sadness, neutrality, surprise, fear and disgust) are reported and analyzed. Models were evaluated with and without the custom dataset to show the effect of employing an imperfect dataset. In this study, deep learning techniques and ensemble learning methods surpassed the other techniques. Our best classifiers obtained accuracies of 90.41% and 91.96%, using recurrent neural networks and majority-voting ensemble classifiers, respectively.
Machine Learning Deep Learning Ensemble Learning Voice Emotion Recognition Brain Computer Interface Affective Computing
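The majority-voting ensemble mentioned in this abstract can be illustrated with a minimal sketch. The function name `majority_vote`, the tie-breaking rule, and the toy predictions are assumptions made for illustration; the abstract does not describe the actual implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list per classifier, all of the
    same length. Ties go to the label predicted by the earliest classifier
    (an assumed rule; the abstract does not specify one).
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # First label, in classifier order, that reaches the top count wins.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


# Hypothetical outputs of three classifiers on four utterances.
clf_a = ["anger", "fear", "neutral", "sadness"]
clf_b = ["anger", "surprise", "neutral", "happiness"]
clf_c = ["disgust", "fear", "sadness", "fear"]
ensemble = majority_vote([clf_a, clf_b, clf_c])
```

The last utterance is a three-way tie, so the assumed rule falls back to the first classifier's label.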
74	Complex Vehicle Modeling: A Data Driven Approach Schoen, Alexander C. 12 1900 Indiana University-Purdue University Indianapolis (IUPUI) / This thesis proposes an artificial neural network (NN) model to predict fuel consumption in heavy vehicles. The model uses predictors derived from vehicle speed, mass, and road grade. These variables are readily available from telematics devices, which are becoming an integral part of connected vehicles. The model predictors are aggregated over a fixed distance traveled (i.e., a window) instead of a fixed time interval. It was found that 1 km windows are most appropriate for the vocations studied in this thesis. Two vocations were studied: refuse trucks and delivery trucks. The proposed NN model was compared to two traditional models. The first is a parametric model similar to one found in the literature. The second is a linear regression model that uses the same features developed for the NN model. The confidence levels of the models built with these three methods were calculated in order to evaluate the models' variances. It was found that the NN models produce lower point-wise error. However, the stability of the NN models is not as high as that of the regression models. In order to improve the variance of the NN models, an ensemble based on the average of 5-fold models was created. Finally, the confidence level of each model is analyzed in order to understand how much error is expected from each model. The mean training error was used to correct the ensemble predictions for the five K-fold models. The ensemble K-fold model predictions are more reliable than those of the single NN, and the ensemble has a lower confidence interval than both the parametric and regression models.
Neural Network Prediction Fuel Consumption Improvement Ensemble Learning Refuse Truck Complex System Modeling Delivery Truck Vehicle Routing SAE J1321 Synthetic Data Generation Aerodynamic Speed Characteristic Acceleration Feature Importance Influence of Weights Machine Learning Point-wise Error Artificial Neural Network
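The idea of aggregating predictors over a fixed distance traveled rather than a fixed time interval can be sketched as follows. This is only an illustration: the function name `aggregate_by_distance`, the use of a plain mean, and the sample trace are assumptions, not the thesis's feature pipeline.

```python
from collections import defaultdict


def aggregate_by_distance(samples, window_m=1000.0):
    """Average a signal over fixed-distance windows.

    samples: iterable of (cumulative_distance_m, value) pairs.
    Returns a list of (window_index, mean_value) ordered by window index.
    """
    if window_m <= 0:
        raise ValueError("window_m must be positive")
    buckets = defaultdict(list)
    for distance_m, value in samples:
        # Samples in [k * window_m, (k + 1) * window_m) fall into window k.
        buckets[int(distance_m // window_m)].append(value)
    return [(idx, sum(vals) / len(vals)) for idx, vals in sorted(buckets.items())]


# Hypothetical telematics trace: (cumulative distance in metres, speed in km/h).
trace = [(0.0, 20.0), (400.0, 40.0), (900.0, 60.0), (1200.0, 50.0), (1900.0, 30.0)]
windows = aggregate_by_distance(trace)
```

With the default 1 km window, the first three samples form window 0 and the last two form window 1, regardless of how long the vehicle took to cover each kilometre.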
75	Algorithmic Methods for Multi-Omics Biomarker Discovery Li, Yichao January 2018 No description available. Bioinformatics Computer Science Motif Diabetes Transcription Factor HiC Set Cover Machine Learning Ensemble Learning HbA1C Glycated Peptide Motif Discovery Motif Pair 3D Genome Organization DREAM challenge Python Data Analytics Hist1 Clustering Analysis Cross Validation
76	Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 Dataset Zoghi, Zeinab 30 November 2020 No description available. Mathematics Computer Engineering Computer Science Engineering Statistics UNSW-NB15 Ensemble Learning Ensemble Classification XGBoost Random Forest Balanced Bagging Bagging Boosting Hellinger Distance Elastic Net Sequential Feature Selection Anomaly Detection System Machine Learning Cybersecurity Data Science
77	Data Driven Energy Efficiency of Ships Taspinar, Tarik January 2022 Decreasing the fuel consumption, and thus the greenhouse gas emissions, of vessels has emerged as a critical topic for both ship operators and policy makers in recent years. The speed of vessels has long been recognized as having the highest impact on fuel consumption. Proposed solutions such as "speed optimization" and "speed reduction" are topics of ongoing discussion at the International Maritime Organization. The aim of this study is to develop a speed optimization model using time-constrained genetic algorithms (GA). In addition, this study presents the application of machine learning (ML) regression methods to build a model for predicting the fuel consumption of vessels. The local outlier factor algorithm is used to eliminate outliers in the prediction features. In boosting and tree-based regression methods, overfitting is observed after hyperparameter tuning. Early stopping is applied to the overfitted models. In this study, speed is also found to be the most important feature for the fuel consumption prediction models. On the other hand, the GA evaluation results showed that random modifications of the default speed profile can increase GA performance, and thus fuel savings, more than constant speed limits during voyages. The GA results also indicate that using high crossover rates and low mutation rates can increase fuel savings. Further research is recommended to include fuel and bunker prices in order to determine fuel efficiency more accurately. Local outlier factor k-nearest neighbors random forest gradient boosting support vector machines ensemble learning ship speed optimization genetic algorithm DEAP HyperOpt Other Electrical Engineering and Electronics
78	Ensembles of Single Image Super-Resolution Generative Adversarial Networks / Ensembler av generative adversarial networks för superupplösning av bilder Castillo Araújo, Victor January 2021 Generative Adversarial Networks (GANs) have been used to obtain state-of-the-art results for low-level computer vision tasks like single image super-resolution; however, they are notoriously difficult to train due to the instability related to the competing minimax framework. Additionally, traditional ensembling mechanisms cannot be effectively applied to these types of networks due to the resources they require at inference time and the complexity of their architectures. In this thesis, an alternative method for creating ensembles of individual models that are more stable and easier to train, by interpolating in the parameter space of the models, is found to produce better results than the initial individual models when evaluated using perceptual metrics as a proxy for human judges. This method can be used as a framework to train GANs with competitive perceptual results in comparison to state-of-the-art alternatives.
Computer and Information Sciences
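Interpolation in the parameter space of models, as described in this abstract, can be sketched as a weighted average of matching parameters. The sketch assumes linear interpolation between two models with identical architectures; the function name `interpolate_parameters`, the dictionary layout, and the parameter names are hypothetical and not taken from the thesis.

```python
def interpolate_parameters(params_a, params_b, alpha):
    """Return (1 - alpha) * params_a + alpha * params_b, parameter by parameter.

    params_a, params_b: dicts mapping a parameter name to a flat list of floats.
    alpha: interpolation coefficient in [0, 1]; 0 returns a copy of model A,
    1 returns a copy of model B.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    blended = {}
    for name, a in params_a.items():
        b = params_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return blended


# Hypothetical flattened weights of two generators with the same architecture.
model_a = {"conv1.weight": [0.0, 2.0], "conv1.bias": [1.0]}
model_b = {"conv1.weight": [4.0, 2.0], "conv1.bias": [3.0]}
blended = interpolate_parameters(model_a, model_b, alpha=0.25)
```

Unlike a conventional ensemble, the blended model is a single network, so inference costs the same as for one of the original models.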
79	[pt] CONJUNTOS ONLINE PARA APRENDIZADO POR REFORÇO PROFUNDO EM ESPAÇOS DE AÇÃO CONTÍNUA / [en] ONLINE ENSEMBLES FOR DEEP REINFORCEMENT LEARNING IN CONTINUOUS ACTION SPACES RENATA GARCIA OLIVEIRA 01 February 2022 [en] This work seeks to use ensembles of deep reinforcement learning algorithms from a new perspective. In the literature, the ensemble technique is used to improve performance, but, for the first time, this research aims to use ensembles to minimize the dependence of deep reinforcement learning performance on hyperparameter fine-tuning, in addition to making learning more precise and robust. Two approaches are researched; one considers pure action aggregation, while the other also takes the value functions into account. In the first approach, an online learning framework based on the ensemble's continuous action choice history is created, aiming to flexibly integrate different scoring and aggregation methods for the agents' actions. In essence, the framework uses past performance to combine only the actions of the best policies. In the second approach, the policies are evaluated using their expected performance as estimated by their value functions. Specifically, we weigh the ensemble's value functions by their expected accuracy as calculated by the temporal difference error. Value functions with lower error have higher weight. To measure the influence on the hyperparameter tuning effort, groups consisting of a mix of different amounts of well and poorly parameterized algorithms were created. To evaluate the methods, classic environments such as the inverted pendulum, cart pole and double cart pole are used as benchmarks. In validation, the Half Cheetah v2, a biped robot, and Swimmer v2 simulation environments showed superior and consistent results, demonstrating the ability of the ensemble technique to minimize the effort needed to tune the hyperparameters of the algorithms.
[en] REINFORCEMENT LEARNING [en] ENSEMBLE LEARNING [en] CONTINUOUS ACTION ENSEMBLE [en] DEEP DETERMINISTIC POLICY GRADIENT [en] HYPERPARAMETER OPTIMIZATION
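The second approach in this abstract gives higher weight to value functions with lower temporal difference (TD) error. A minimal sketch of one way to do that follows. The inverse-error weighting, the function names `td_error_weights` and `select_action`, and all numbers are assumptions for illustration; the abstract does not state the exact weighting formula.

```python
def td_error_weights(td_errors, eps=1e-8):
    """Map each critic's mean absolute TD error to a normalized weight.

    Uses inverse-error weighting (an assumed choice): lower error gives a
    higher weight, and the weights sum to 1.
    """
    inverse = [1.0 / (abs(err) + eps) for err in td_errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def select_action(candidate_actions, q_values, weights):
    """Pick the candidate action with the highest weighted value estimate.

    q_values[i][j] is the value that critic i assigns to candidate action j.
    """
    scores = [
        sum(w * critic[j] for w, critic in zip(weights, q_values))
        for j in range(len(candidate_actions))
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_actions[best]


# Hypothetical ensemble of two agents, each proposing one continuous action.
weights = td_error_weights([0.1, 0.4])   # critic 0 is the more accurate one
actions = [-0.5, 0.3]                    # e.g. torques proposed by the agents
q_values = [[1.0, 2.0], [3.0, 0.0]]      # each critic scores both proposals
chosen = select_action(actions, q_values, weights)
```

In this toy case an unweighted average of the two critics would favor the first action, while the error-based weights let the more accurate critic tip the choice to the second.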
80	Shoppin’ in the Rain : An Evaluation of the Usefulness of Weather-Based Features for an ML Ranking Model in the Setting of Children’s Clothing Online Retailing / Handla i regnet : En utvärdering av användbarheten av väderbaserade variabler för en ML-rankningsmodell inom onlineförsäljning av barnkläder Lorentz, Isac January 2023 Online shopping offers numerous benefits, but large product catalogs make it difficult for shoppers to know of the existence and characteristics of every item for sale. To simplify the decision-making process, online retailers use ranking models to recommend products relevant to each individual user. Contextual user data, such as location, time, or local weather conditions, can serve as valuable features for ranking models, enabling personalized real-time recommendations. Little research has been published on the usefulness of weather-based features for ranking models in online clothing retailing, which makes additional research into this topic worthwhile. Using Swedish sales and customer data from Babyshop, an online retailer of children’s fashion, this study examined possible correlations between local weather data and sales. This was done by comparing differences in daily weather and differences in daily shares of sold items per clothing category for two cities: Stockholm and Göteborg. With Malmö added as a third city, historical observational weather data from one location in each of Stockholm, Göteborg, and Malmö was then featurized and used along with the customers’ postal towns, sales features, and sales trend features to train a LightGBM learning-to-rank model (gradient-boosted decision trees) with weather features and to evaluate its ranking relevancy. The ranking relevancy was compared against a LightGBM baseline that omitted the weather features and a naive baseline: a popularity-based ranker. Several possible correlations were found between a clothing category (such as shorts, rainwear, shell jackets, or winter wear) and a weather variable (such as feels-like temperature, solar energy, wind speed, precipitation, snow, or snow depth). The ranking relevancy was evaluated using the mean reciprocal rank and the mean average precision @ 10 on a small dataset consisting only of customer data from the postal towns Stockholm, Göteborg, and Malmö, and also on a larger dataset where customers in postal towns from larger geographical areas had their home locations approximated as Stockholm, Göteborg, or Malmö. The LightGBM rankers beat the naive baseline in three out of four configurations, and the ranker with weather features outperformed the LightGBM baseline by 1.1 to 2.2 percent across all configurations. The findings can potentially help online clothing retailers create more relevant product recommendations.
Statistical analysis regression analysis recommender systems ensemble learning electronic commerce LightGBM learning to rank feature selection weather-based features fashion Computer and Information Sciences
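The two ranking metrics named in this abstract, mean reciprocal rank and mean average precision @ 10, can be computed as in the sketch below. The function names, the toy rankings, and the choice to normalize average precision by min(number of relevant items, k) are assumptions; several normalization conventions exist and the abstract does not say which one the thesis uses.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1 / position of the first relevant item, or 0.0 if none appears."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k items of a duplicate-free ranking.

    Normalized by min(len(relevant), k), one common convention (assumed here).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / min(len(relevant), k)


# Hypothetical rankings for two customers, with the items they actually bought.
sessions = [
    (["shorts", "rainwear", "jacket"], {"rainwear"}),
    (["jacket", "shorts", "rainwear"], {"jacket", "rainwear"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in sessions) / len(sessions)
map_at_10 = sum(average_precision_at_k(r, rel) for r, rel in sessions) / len(sessions)
```

Mean reciprocal rank rewards placing the first relevant item early, while mean average precision also accounts for where every other relevant item lands in the top 10.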
