Global ETD Search

101	Personalizing the post-purchase experience in online sales using machine learning. / Personalisering av efterköpsupplevelsen inom onlineförsäljning med hjälp av maskininlärning. Kamau, Nganga; Dehoky, Dylan January 2021
Advances in machine learning, together with an abundance of available data, have led to an explosion in personalized offerings and in the ability to predict what consumers want and need without them having to ask for it. During the last decade, personalization has become a multi-billion-dollar industry and a capability on which many of the leading tech companies rely in their business models. Indeed, in today's business world, it is not only a source of competitive advantage but in many cases a matter of survival. This thesis aims to create a machine learning model able to predict which customers are interested in an upselling opportunity, namely changing their payment method after completing a purchase with the Swedish payment solutions company Klarna Bank. Hence, the overall aim is to personalize the customer experience on the confirmation page. Two gradient boosting methods and one deep learning method were trained, evaluated and compared for this task. A logistic regression model was also trained and used as a baseline. The results showed that all models performed better than the baseline, with the gradient boosting methods showing the best performance. All of the models also outperformed the current solution without personalization, with the best model reducing the number of false positives by 50%. / Access to large amounts of data, together with advances in machine learning, has resulted in an explosive increase in personalized offerings and in the ability to predict customers' needs. Over the past decade this has developed into a multi-billion industry and a capability that many of the world's leading tech companies rely on in their operations. In many cases it is even a prerequisite for surviving in today's industrial landscape. This degree project aims to create a machine learning model capable of predicting customers' interest in "upgrading" their payment method after a completed purchase with the Swedish payment solutions company Klarna Bank. The concept of offering a customer an upgrade to an already chosen product or service is known in English as upselling. The overall aim of the project is therefore to create a personalized customer experience on Klarna's confirmation page. Accordingly, two gradient boosting methods and one deep learning method were implemented and evaluated. In addition, a logistic regression model was implemented as a baseline against which to compare the other models. The results show that all models outperformed the baseline, with the gradient boosting methods yielding better results than the deep learning method. Furthermore, all models show an improvement over the current solution on Klarna's confirmation page, which has no personalization, with the best model improving the outcome by 50%.
Personalization, Upselling, Optimization, Machine learning, Binary classification, Gradient boosting, Deep learning, Supervised learning, Imbalanced data. Computational Mathematics
102	Modeling Melodic Accents in Jazz Solos / Modellering av melodiska accenter i jazzsolon Berrios Salas, Misael January 2023
This thesis looks at how accurately one can model accents in jazz solos, more specifically the sound level. A better understanding of the structure of jazz solos can provide a way of pedagogically presenting differences between music styles and even between performers. Some studies have tried to model perceived accents in different music styles, that is, to model how listeners perceive some tones as accentuated and more important than others. Other studies have looked at how the sound level correlates with other attributes of the tone. To our knowledge, however, no other study has modeled actual accents within jazz solos, nor has any other study had such a large amount of training data. The training data is a set of 456 solos from the Weimar Jazz Database, which contains tone data and metadata from monophonic solos performed on multiple instruments. The features used for the training algorithms are features obtained from the software Director Musices, created at the Royal Institute of Technology in Sweden; features obtained from the software "melfeature", created at the University of Music Franz Liszt Weimar in Germany; and features built upon tone data or solo metadata from the Weimar Jazz Database. A comparison between these is made. Three learning algorithms are used: Multiple Linear Regression (MLR), Support Vector Regression (SVR), and eXtreme Gradient Boosting (XGBoost). The first two are simpler regression models, while the last is an award-winning tree boosting algorithm. In the tests, XGBoost achieved the highest accuracy when using all available features except those that were removed because they did not improve the accuracy. The accuracy was around 27%, with a high standard deviation. This indicates considerable variation between solos: some were predicted with an accuracy of about 67%, while for others not a single tone in the entire solo was predicted correctly. As a general model, the accuracy is too low for practical use. Either the methods were not optimal, or jazz solos differ too much for a general pattern to be found. / This degree project examines how well accents in jazz solos can be modeled, more specifically the sound level. A broader understanding of the structure of jazz solos can provide a way to pedagogically present the differences between music styles and also between performers. Other studies have tried to model perceived accents in different music styles, that is, how listeners experience certain tones as accentuated and more important than others. Other studies have examined how the sound level correlates with other attributes of the tone. As far as we know, however, there are no other studies that model actual accents within jazz solos, or that have had an equally large amount of training data. The training data is a set of 456 solos taken from the Weimar Jazz Database. The database contains tone data and metadata from monophonic solos performed on various instruments. The features used for the training algorithms are features obtained from the software Director Musices, created at the Royal Institute of Technology in Sweden; features obtained from the software "melfeature", created at the University of Music Franz Liszt Weimar in Germany; and features created from the data in the Weimar Jazz Database. A comparison between these has also been made. Three learning algorithms were used: Multiple Linear Regression (MLR), Support Vector Regression (SVR), and eXtreme Gradient Boosting (XGBoost). The first two are simpler regression algorithms, while the last is an award-winning tree boosting algorithm. The tests resulted in XGBoost producing the model with the highest accuracy, given all available features as training data minus certain features that were removed because they do not improve the accuracy. The accuracy obtained was around 27%, with a high standard deviation. This points to large differences in how well the sound level can be predicted across solos. Some solos gave an accuracy of around 67%, while for others not a single sound level in the entire solo was predicted correctly. As a general model, however, the accuracy is too low to be used in practice. Either the chosen methods are not the best, or jazz solos are too different for a general, predictable pattern to be found.
Accents, Jazz Solo, Support Vector Regression (SVR), eXtreme Gradient Boosting (XGBoost), Multiple Linear Regression (MLR), Dynamic. Computer and Information Sciences
103	Identifying Optimal Throw-in Strategy in Football Using Logistic Regression / Identifiering av Optimal Inkaststrategi i Fotboll med Logistisk Regression Nieto, Stephan January 2023
Set-pieces such as free-kicks and corners have been thoroughly examined in football analytics studies in recent years. However, little focus has been put on the most frequently occurring set-piece: the throw-in. This project aims to investigate how football teams can optimize their throw-in tactics in order to improve the chance of taking a successful throw-in. Two different definitions of a successful throw-in are considered: first, whether possession of the ball is kept, and second, whether a goal chance is created after the throw-in. The analysis is conducted using logistic regression, as this model is highly interpretable, making it easier for players and coaches to gain direct insights from the results. Substantial attention is given to investigating the logistic regression assumptions, with the greatest emphasis on the linearity assumption. The results suggest that long throws directed towards the opposition's goal are the most effective for creating goal-scoring opportunities from throw-ins taken in the attacking third of the pitch. However, if the throw-in is taken in the middle or defensive regions of the pitch, the results interestingly indicate that throwing the ball backwards leads to an increased chance of scoring. When it comes to retaining possession, the results suggest that throwing the ball backwards is an effective strategy regardless of the pitch position. Moreover, the project outlines how feature transformations can be used to improve the fit of the logistic regression model. However, it turns out that the most significant improvement in the accuracy of logistic regression occurs when additional relevant features are incorporated into the model. In that case, the logistic regression model achieves a predictive power comparable to that of more advanced machine learning methods. / Set-pieces such as free-kicks and corners have been well studied in football analytics research in recent years. Little focus, however, has been placed on the most frequently occurring set-piece: the throw-in. This project aims to investigate how football teams can optimize their throw-in tactics to improve the chances of a successful throw-in. Two different definitions of a successful throw-in are considered: whether possession is retained, and whether a goal chance is created after the throw-in. The analysis uses logistic regression because this model is highly interpretable, which makes it easier for players and coaches to gain direct insights from the results. Great emphasis is placed on examining the assumptions of logistic regression, most of all the linearity assumption. The results indicate that long throws directed towards the opponents' goal are the most favorable for creating a goal chance from throw-ins taken in the attacking third of the pitch. If the throw-in is instead taken from the middle or defensive parts of the pitch, the results interestingly indicate that throws directed backwards lead to an increased chance of scoring. When it comes to retaining possession, the results show that throwing backwards is a favorable strategy regardless of where on the pitch the throw-in is taken. The project further shows how variable transformations can be used to improve the model fit of logistic regression. It turns out, however, that the clearest improvement is obtained when more relevant variables are added to the model. In that case, logistic regression achieves a predictive ability comparable to that of more advanced machine learning methods.
Set-piece, throw-in, football analytics, optimal strategy, logistic regression, model assumptions, feature importance, feature transformations, gradient boosting. Other Mathematics
104	Predicting House Prices on the Countryside using Boosted Decision Trees / Förutseende av huspriser på landsbygden genom boostade beslutsträd Revend, War January 2020
This thesis evaluates the feasibility of supervised learning models for predicting house prices in the countryside of southern Sweden. It is essential for mortgage lenders to have accurate housing valuation algorithms, and the current model offered by Booli is not accurate enough when valuing residences in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared with that of traditional machine learning methods. The models were implemented in order to find the best one with regard to relevant evaluation metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All of these models were benchmarked against Booli's current housing valuation algorithm, which is based on a k-NN model. The results indicate that the LightGBM model is the optimal one, as it had the best overall performance with respect to the chosen evaluation metrics. Compared with the benchmark, the LightGBM model performed better overall, with an RMSE of 0.330 against 0.358 for the Booli model, indicating that boosted decision trees have the potential to improve the predictive accuracy of residence prices in the countryside. / This thesis aims to evaluate the feasibility of various supervised learning models for predicting house prices in the countryside of southern Sweden. It is important for mortgage lenders to have accurate algorithms when valuing homes, and the current model offered by Booli has poor precision when valuing homes in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared with traditional machine learning methods. These supervised learning models were implemented in order to find the best model with respect to relevant performance measures such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The supervised learning models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. The performance of all algorithms is compared with Booli's current housing valuation algorithm, which is based on a k-NN model. The results of this thesis show that the LightGBM model is the optimal model for valuing houses in the countryside, since it had the best overall performance with respect to the chosen evaluation metrics. Compared with the Booli model, the LightGBM model performed better overall, with an RMSE of 0.330 against 0.358 for the Booli model. This indicates that there is potential in using boosted decision trees to improve the accuracy of house price predictions in the countryside.
Machine Learning, Predicting House Prices, Shrinkage Methods, Random Forest, Decision Tree, AdaBoost, Gradient Boosting, LightGBM, CatBoost, XGBoost. Probability Theory and Statistics
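The two evaluation metrics named in entry 104, RMSE and MAPE, can be sketched in a few lines of plain Python. This is only an illustration of the standard definitions: the function names and the example prices are made up here, and the abstract does not say on which scale (raw or transformed prices) the thesis computed its reported scores.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error: sqrt of the mean squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 == 10%).

    Undefined when a true value is zero, so that case is rejected.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Hypothetical sale prices and model predictions (illustrative only).
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(rmse(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large individual errors more heavily, while MAPE expresses the error relative to each price, which is why valuation studies often report both.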
105	A Predictive Analysis of Customer Churn / En Prediktiv Analys av Kundbortfall Eskils, Olivia; Backman, Anna January 2023
Churn refers to the discontinuation of a contract; consequently, customer churn occurs when existing customers stop being customers. Predicting customer churn is a challenging task in customer retention, but with the advances made in artificial intelligence and machine learning, predicting customer churn has become more feasible. Prior studies have demonstrated that machine learning can be used to forecast customer churn. The aim of this thesis was to develop and implement a machine learning model to predict customer churn and to identify the customer features that have a significant impact on churn. The study was conducted in cooperation with the Swedish insurance company Bliwa, which expressed interest in gaining a better understanding of why customers choose to leave. Three models, Logistic Regression, Random Forest, and Gradient Boosting, were used and evaluated. Bayesian optimization was used to optimize the models. After their predictive performance had been assessed using cross-validation, it was concluded that LightGBM provided the best result in terms of PR-AUC, making it the most effective approach for the problem at hand. Subsequently, a SHAP analysis was carried out to gain insight into which customer features affect whether or not a customer churns. The SHAP analysis revealed specific customer features that had a significant influence on churn. This knowledge can be used to proactively implement measures aimed at reducing the probability of churn. / Predicting customer churn is a challenging task in customer retention, but with the advances made in artificial intelligence and machine learning, the ability to predict customer churn has increased. Previous studies have shown that machine learning can be used to forecast customer churn. The aim of this study was to develop and implement a machine learning model to predict customer churn and to identify customer characteristics that have a significant impact on why a customer chooses to leave or not. The study was carried out in cooperation with the Swedish insurance company Bliwa, which expressed interest in gaining a better understanding of why customers choose to leave. Three models, Logistic Regression, Random Forest and Gradient Boosting, were used and evaluated. Bayesian optimization was used to optimize these models. After predictive accuracy had been evaluated using cross-validation, it was concluded that LightGBM gave the best result in terms of PR-AUC, and it was therefore considered the most effective method for the problem at hand. A SHAP analysis was then carried out to provide insight into which customer characteristics affect whether or not a customer is at risk of leaving. The results of the SHAP analysis showed that certain customer characteristics stood out and appeared to have a significant effect on customer churn. This knowledge can be used to take proactive measures to reduce the probability of churn.
Churn prediction, CRM optimization, applied mathematics, machine learning, gradient boosting, random forest, logistic regression, insurance industry. Probability Theory and Statistics
106	Conception et mise en œuvre d'algorithmes de vision temps-réel pour la vidéo surveillance intelligente Ghorayeb, Hicham 12 September 2007
Our objective is to study the vision algorithms used at the different levels of an intelligent video processing chain. We prototyped a generic processing chain dedicated to analyzing the content of a video stream. Building on this processing chain, we developed a pedestrian detection and tracking application, which is an integral part of the PUVAME project. The generic processing chain consists of several stages: object detection, classification and tracking. Other, higher-level stages are envisaged, such as action recognition, identification, semantic description and the fusion of data from several cameras. We focused on the first two stages. We explored background segmentation algorithms for video streams from a fixed camera, and implemented and compared algorithms based on adaptive background modeling. We also explored visual object detection based on machine learning using the boosting technique. We developed a library named LibAdaBoost, intended to serve as a prototyping environment for machine learning algorithms. We prototyped the boosting technique within this library and distributed LibAdaBoost under the LGPL license. The library is unique in the functionality it offers. We explored the use of graphics cards to accelerate vision algorithms. We ported the visual object detector, based on a classifier generated by boosting, to run on the graphics processor; we were the first to carry out this port. We found that the graphics processor architecture is the best suited to this kind of algorithm. The processing chain was implemented and integrated into the RTMaps environment. We evaluated these algorithms on well-defined scenarios, which were defined within the PUVAME project.
[MATH] Mathematics. Video surveillance, Boosting, Automatic pattern recognition, Intelligent transportation system, Machine learning, Moving object detection, Monte Carlo method
107	Méthodes d'apprentissage pour l'estimation de la pose de la tête dans des images monoculaires Bailly, Kévin 09 July 2010
This thesis is part of PILE, a medical project analyzing the gaze, gestures and vocal productions of young children. In this context, we designed and developed methods for determining head orientation, the cornerstone of gaze direction estimation systems. From a methodological point of view, we proposed BISAR (Boosted Input Selection Algorithm for Regression), a feature selection method suited to regression problems. It consists of iteratively selecting the inputs of an incremental neural network. Each input is associated with a descriptor selected using an original criterion that measures the functional dependence between a descriptor and the values to be predicted. The complementarity of the descriptors is ensured by a boosting process that modifies, at each iteration, the distribution of the weights associated with the training examples. This algorithm was validated experimentally through two head pose estimation methods. The first approach directly learns the relationship between the appearance of a face and its pose. The second aligns a face model in an image and then geometrically estimates the orientation of this model. The alignment process relies on a cost function that evaluates the quality of the alignment; this function is learned by BISAR from examples of more or less well-aligned models. Evaluations of these methods gave results equal or superior to state-of-the-art methods on various databases exhibiting strong variations in pose, identity, illumination and image capture conditions.
Head pose, deformable model, alignment, descriptor selection, regression, incremental neural network, machine learning, boosting
108	Classificação com algoritmo AdaBoost.M1: o mito do limiar de erro de treinamento Leães Neto, Antônio do Nascimento 20 November 2017
The accelerated growth of data repositories in different areas of activity opens space for research in data mining, in particular on methods for classification and for combining classifiers. Boosting is one such method: it combines the results of several classifiers in order to obtain better results. The main purpose of this dissertation is to experiment with alternatives for increasing the effectiveness and performance of the AdaBoost.M1 algorithm, the implementation most often employed for Boosting. An empirical study was performed, taking stochastic aspects into account, in an attempt to shed some light on an obscure internal parameter: the algorithm's creators and other researchers assumed that the training error threshold should be correlated with the number of classes in the target data set and that, logically, most data sets should use a value of 0.5. We present empirical evidence that this is not a fact but probably a myth originating from the mistaken application of the theoretical assumption of the joint effect. To achieve this goal, adaptations of the algorithm were proposed, focusing on finding a better suggestion for defining this threshold in the general case. / The accelerated growth of data repositories across various fields opens room for research in data mining, specifically on classification methods and methods for combining classifiers. Boosting is one of these methods; it combines the results of several classifiers with the aim of obtaining better results. The central purpose of this dissertation is to answer the research question by experimenting with alternatives to increase the effectiveness and performance of the AdaBoost.M1 algorithm, which is the implementation frequently employed for Boosting. An empirical study was carried out, taking stochastic aspects into account, in an attempt to shed some light on an obscure internal parameter: the creators of the algorithm and other researchers assumed that the training error threshold should be correlated with the number of classes in the target data set and that, logically, most data sets should use a value of 0.5. In this work, we present empirical evidence that this is not a fact but probably a myth originating from the application of the first definition of the algorithm. To achieve this goal, adaptations of the algorithm were proposed, focusing on finding a better suggestion for defining this threshold in the general case.
Data Mining, Classification, Combination of classifiers, Boosting, AdaBoost.M1, Ensemble Methods
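To make the "training error threshold" discussed in entry 108 concrete, here is a minimal sketch of one weight-update round of the textbook AdaBoost.M1 algorithm, with the threshold exposed as a parameter. The function name, its signature and the return convention are assumptions made for illustration; this is not the adapted algorithm proposed in the dissertation.

```python
def adaboost_m1_round(weights, correct, threshold=0.5):
    """One AdaBoost.M1 reweighting step (illustrative sketch).

    weights   -- current example weights (any positive scale)
    correct   -- correct[i] is True if the weak classifier got example i right
    threshold -- training error at or above which boosting stops; 0.5 is the
                 conventional value that the dissertation calls into question

    Returns (new_weights, error); new_weights is None when boosting should
    stop (error >= threshold, or a perfect weak classifier with error 0).
    """
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to a distribution
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error >= threshold or error == 0.0:
        return None, error
    beta = error / (1.0 - error)  # beta < 1 whenever error < 0.5
    # Shrink the weight of correctly classified examples, then renormalize,
    # so misclassified examples carry more weight in the next round.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    norm = sum(updated)
    return [w / norm for w in updated], error


new_weights, err = adaboost_m1_round([0.25] * 4, [True, True, True, False])
print(err, new_weights)
```

With the default threshold, a weak classifier whose weighted error reaches 0.5 ends the boosting loop; passing a different `threshold` is the kind of variation the abstract describes studying empirically.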
109	Nouvelles contributions du boosting en apprentissage automatique Suchier, Henri-Maxime 21 June 2006
Machine learning aims to produce a hypothesis that models a concept from examples, notably in order to predict whether or not new observations belong to that concept. Among learning algorithms, ensemble methods combine base hypotheses (called "weak") into a better-performing global hypothesis.
Boosting, with its AdaBoost algorithm, is an ensemble method that has been studied extensively for several years: its remarkable experimental performance rests on rigorous theoretical foundations. It builds base hypotheses adaptively and iteratively, focusing learning at each new iteration on the examples that were difficult to learn in previous iterations. However, AdaBoost is relatively ill-suited to real-world data. In this thesis, we concentrate in particular on noisy data and on heterogeneous data.
In the case of noisy data, not only can the method become very slow, but above all AdaBoost memorizes the data, and the predictive power of the global hypotheses generated is severely degraded. We therefore studied an adaptation of boosting for handling noisy data. Our solution exploits information from a confidence oracle, which makes it possible to cancel out the damaging effects of noise. We show that our new algorithm preserves the theoretical properties of standard boosting. We put this new method into practice on numerical data and, more originally, on textual data.
In the case of heterogeneous data, no adaptation of boosting had been proposed until now. Yet such data, characterized by multiple attributes of different natures (such as images, sound, text, etc.), are extremely common, on the web for example. We therefore developed a new boosting algorithm that can make use of them. Rather than combining independently boosted hypotheses, we construct a new boosting scheme in which algorithms specialized for each type of attribute collaborate during learning. We prove that the exponential decrease of the errors is still guaranteed by this new model, from both a theoretical and an experimental point of view.
[INFO] Computer Science. Machine learning, ensemble methods, boosting, noisy data, heterogeneous data
110	SVM-Based Negative Data Mining to Binary Classification Jiang, Fuhua 03 August 2006
The properties of a training data set, such as its size, distribution and number of attributes, contribute significantly to the generalization error of a learning machine. A poorly distributed data set is prone to produce a partially overfitted model. The two approaches proposed in this dissertation for binary classification enhance useful data information by mining negative data. First, an error-driven compensating hypothesis approach is based on Support Vector Machines (SVMs) with (1+k)-iteration learning, where the base learning hypothesis is iteratively compensated k times. This approach produces a new hypothesis on a new data set in which each label is a transformation of the label from the negative data set, further producing the positive and negative child data subsets in subsequent iterations. This procedure refines the base hypothesis with the k child hypotheses created in the k iterations. A prediction method is also proposed to trace the relationship between the negative subsets and the testing data set using a vector similarity technique. Second, a statistical negative example learning approach based on theoretical analysis improves the performance of the base learning algorithm, the learner, by creating one or two additional hypotheses, audit and booster, to mine the negative examples output by the learner. The learner employs a regular Support Vector Machine to classify the main examples and recognize which examples are negative. The audit works on the negative training data created by the learner to predict whether an instance is negative. The booster is applied when the audit is not accurate enough to judge the learner correctly; it works on the training data subsets on which the learner and the audit do not agree. The classifier for testing is the combination of learner, audit and booster. For a specific instance, the classifier returns the learner's result if the audit acknowledges the learner's result or the learner agrees with the audit's judgment; otherwise it returns the booster's result. The error of the classifier is decreased to O(e^2), compared with the error O(e) of the base learning algorithm.
Data partition, Data classification, Vector similarity, Multiple passes learning, Machine learning, Bagging, Boosting, Support vector machines, Data preparation. Computer Sciences
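The decision rule that entry 110 describes for combining learner, audit and booster can be sketched as below. Everything here is a hypothetical simplification: the three components are passed in as plain callables, the audit is assumed to return True when it accepts the learner's label, and the toy stand-ins replace the SVM-based models of the dissertation.

```python
def combined_predict(x, learner, audit, booster):
    """Return the learner's label when the audit accepts it, else the booster's.

    learner(x)      -> label proposed by the base classifier
    audit(x, label) -> True if the audit judges the learner's label reliable
                       (assumed interface, not taken from the dissertation)
    booster(x)      -> fallback label used when learner and audit disagree
    """
    label = learner(x)
    if audit(x, label):
        return label
    return booster(x)


# Toy stand-ins on one-dimensional inputs (illustrative only).
def toy_learner(x):
    return 1 if x > 0 else -1


def toy_audit(x, label):
    return abs(x) >= 1.0  # distrust the learner near the decision boundary


def toy_booster(x):
    return 1 if x > -0.5 else -1


print([combined_predict(x, toy_learner, toy_audit, toy_booster)
       for x in (2.0, -0.2, -3.0)])
```

The booster is consulted only for the instances the audit flags, which matches the abstract's description of it working on the cases where learner and audit do not agree.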
