Global ETD Search

11	Strojové učení v klasifikaci obrazu / Machine Learning in Image Classification Král, Jiří January 2011 (has links) This project deals with the analysis and testing of algorithms and statistical models that could potentially improve the results of FIT BUT in the ImageNet Large Scale Visual Recognition Challenge and TRECVID. A multinomial model was tested. The Phonotactic Intersession Variation Compensation (PIVCO) model was used to reduce random effects in the image representation and for dimensionality reduction. PIVCO dimensionality reduction achieved the best mean average precision while reducing the representation to one-twentieth of its original dimension. A KPCA model was tested to approximate a kernel SVM. All statistical models were tested on the Pascal VOC 2007 dataset.
12	Generalized Modeling and Estimation of Rating Classes and Default Probabilities Considering Dependencies in Cross and Longitudinal Section Tillich, Daniel 30 March 2017 (has links) Our sample (Xit; Yit) consists of pairs of variables. The real variable Xit measures the creditworthiness of individual i in period t. The Bernoulli variable Yit is the default indicator of individual i in period t. The objective is to estimate a credit rating system, i.e., in particular, to divide the range of creditworthiness into several rating classes, each with a homogeneous default risk. The field of change point analysis provides a way to estimate the breakpoints between the rating classes. So far, the literature only considers models without dependencies or with dependence only in cross section. This contribution proposes multi-period models that include dependencies in cross section as well as in longitudinal section. Furthermore, estimators for the model parameters are suggested. The estimators are applied to a data set from a German credit bureau. info:eu-repo/classification/ddc/330 ddc:330
13	An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing / En undersökning av kodningstekniker för diskreta variabler inom maskininlärning: binär mot one-hot och feature hashing Seger, Cedric January 2018 (has links) Machine learning methods can be used to solve important binary classification tasks in domains such as display advertising and recommender systems. In many of these domains categorical features are common and often of high cardinality. Using one-hot encoding in such circumstances leads to very high-dimensional vector representations, causing memory and computational concerns for machine learning models. This thesis investigated the viability of a binary encoding scheme in which categorical values were mapped to integers that were then encoded in a binary format. This binary scheme allowed categorical features to be represented using log2(d)-dimensional vectors, where d is the dimension associated with a one-hot encoding. To evaluate the performance of the binary encoding, it was compared against one-hot and feature-hashed representations using linear logistic regression and neural network based models. These models were trained and evaluated using data from two publicly available datasets: Criteo and Census. The results showed that a one-hot encoding with a linear logistic regression model gave the best performance according to the PR-AUC metric. This was, however, at the expense of using 118- and 65,953-dimensional vector representations for Census and Criteo respectively. A binary encoding led to lower performance but used only 35 and 316 dimensions respectively. For Criteo, binary encoding suffered significantly in performance and feature hashing was perceived as a more viable alternative. It was also found that employing a neural network helped mitigate any loss in performance associated with using binary and feature-hashed representations. 
Computer and Information Sciences
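The three encodings compared in the abstract above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy category list, the bucket count, and the use of CRC32 as the hash function are all assumptions made for illustration. The binary width is rounded up to the next whole bit, which is one way to read the log2(d) figure.

```python
import math
import zlib


def one_hot(index, d):
    """Return a d-dimensional vector with a single 1 at position `index`."""
    vec = [0] * d
    vec[index] = 1
    return vec


def binary_encode(index, d):
    """Encode `index` in ceil(log2(d)) bits, most significant bit first."""
    width = max(1, math.ceil(math.log2(d)))
    return [(index >> bit) & 1 for bit in reversed(range(width))]


def feature_hash(value, n_buckets):
    """Hash a category string into one of n_buckets positions.

    CRC32 is an illustrative choice; distinct categories may collide.
    """
    vec = [0] * n_buckets
    vec[zlib.crc32(value.encode("utf-8")) % n_buckets] = 1
    return vec


# Hypothetical categorical feature with d = 5 distinct values.
categories = ["red", "green", "blue", "cyan", "pink"]
index_of = {name: i for i, name in enumerate(categories)}
d = len(categories)

one_hot_blue = one_hot(index_of["blue"], d)        # 5 dimensions
binary_blue = binary_encode(index_of["blue"], d)   # 3 dimensions
hashed_blue = feature_hash("blue", 4)              # 4 dimensions, fixed in advance
```

The one-hot vector grows linearly with the number of categories and the binary vector grows logarithmically. The hashed vector has a size chosen up front, at the cost of possible collisions between categories.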
14	Estimation in discontinuous Bernoulli mixture models applicable in credit rating systems with dependent data Tillich, Daniel, Lehmann, Christoph 30 March 2017 (has links) (PDF) Objective: We consider the following problem from credit risk modeling: Our sample (Xi; Yi), 1 ≤ i ≤ n, consists of pairs of variables. The first variable Xi measures the creditworthiness of individual i. The second variable Yi is the default indicator of individual i. It has two states: Yi = 1 indicates a default, Yi = 0 a non-default. A default occurs if individual i cannot meet its contractual credit obligations, i.e., it cannot pay back its outstanding debts regularly. In a first step, our objective is to estimate the threshold between good and bad creditworthiness in the sense of dividing the range of Xi into two rating classes: one class with good creditworthiness and a low probability of default and another class with bad creditworthiness and a high probability of default. Methods: Given observations of individual creditworthiness Xi and defaults Yi, the field of change point analysis provides a natural way to estimate the breakpoint between the rating classes. In order to account for dependency between the observations, the literature proposes a combination of three model classes: a breakpoint model, a linear one-factor model for the creditworthiness Xi, and a Bernoulli mixture model for the defaults Yi. We generalize the dependency structure further and use a generalized link between the systematic factor and the idiosyncratic factor of creditworthiness. The systematic factor can thus change not only the location but also the form of the distribution of creditworthiness. Results: For the case of two rating classes, we propose several estimators for the breakpoint and for the default probabilities within the rating classes. We prove the strong consistency of these estimators in the given non-i.i.d. framework. The theoretical results are illustrated by a simulation study. 
Finally, we give an overview of research opportunities. regression with jump change point split point credit risk rating classification default probability dependence strong consistency ddc:330 rvk:QH 400
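The idea of estimating a single breakpoint between two rating classes can be sketched in Python as follows. This is only a simplified illustration that assumes independent observations. It is not one of the estimators proposed in the abstract above, which are designed for dependent data. The function names and the toy data are hypothetical.

```python
import math


def bernoulli_loglik(defaults):
    """Maximized Bernoulli log-likelihood of a list of 0/1 default indicators."""
    n = len(defaults)
    k = sum(defaults)
    if k == 0 or k == n:
        return 0.0  # estimated probability is 0 or 1, so the likelihood is 1
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def estimate_breakpoint(scores, defaults):
    """Return (threshold, default rate below, default rate above).

    Every midpoint between neighboring distinct scores is tried as a
    threshold; the one that maximizes the two-class likelihood is kept.
    """
    pairs = sorted(zip(scores, defaults))
    best = None
    for split in range(1, len(pairs)):
        if pairs[split - 1][0] == pairs[split][0]:
            continue  # no threshold can separate equal scores
        low = [y for _, y in pairs[:split]]
        high = [y for _, y in pairs[split:]]
        loglik = bernoulli_loglik(low) + bernoulli_loglik(high)
        if best is None or loglik > best[0]:
            threshold = (pairs[split - 1][0] + pairs[split][0]) / 2
            best = (loglik, threshold, sum(low) / len(low), sum(high) / len(high))
    if best is None:
        raise ValueError("need at least two distinct scores")
    return best[1:]


# Toy sample: low creditworthiness scores default more often.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
defaults = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, p_bad_class, p_good_class = estimate_breakpoint(scores, defaults)
```

On this toy sample the search places the threshold between the scores 0.4 and 0.6, so the class below it has three defaults in four observations and the class above it has none.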
15	Klasifikace testovacích manévrů z letových dat / Classification of Testing Maneuvers from Flight Data Funiak, Martin January 2015 (has links) A flight data recorder is a device that records flight data from various sensors in an aircraft. Flight data analysis plays an important role in the development and testing of avionics. Aircraft characteristics are often tested and evaluated using test maneuvers. The data measured during one flight are stored in a single flight record, which may contain several test maneuvers. The aim of this thesis is to identify basic test maneuvers using the measured flight data. The theoretical part describes flight maneuvers and the format of the measured flight data. The analytical part describes research in the field of classification based on statistics and probability theory, which is needed to understand complex Gaussian mixture models. The thesis presents an implementation in which Gaussian mixture models are used to classify test maneuvers. The proposed solution was tested on data obtained from a flight simulator and from a real aircraft. Gaussian mixture models proved to be a suitable solution for this task. Possible further development of the work is described in the final chapter.
16	Estimation in discontinuous Bernoulli mixture models applicable in credit rating systems with dependent data Tillich, Daniel, Lehmann, Christoph 30 March 2017 (has links) Objective: We consider the following problem from credit risk modeling: Our sample (Xi; Yi), 1 ≤ i ≤ n, consists of pairs of variables. The first variable Xi measures the creditworthiness of individual i. The second variable Yi is the default indicator of individual i. It has two states: Yi = 1 indicates a default, Yi = 0 a non-default. A default occurs if individual i cannot meet its contractual credit obligations, i.e., it cannot pay back its outstanding debts regularly. In a first step, our objective is to estimate the threshold between good and bad creditworthiness in the sense of dividing the range of Xi into two rating classes: one class with good creditworthiness and a low probability of default and another class with bad creditworthiness and a high probability of default. Methods: Given observations of individual creditworthiness Xi and defaults Yi, the field of change point analysis provides a natural way to estimate the breakpoint between the rating classes. In order to account for dependency between the observations, the literature proposes a combination of three model classes: a breakpoint model, a linear one-factor model for the creditworthiness Xi, and a Bernoulli mixture model for the defaults Yi. We generalize the dependency structure further and use a generalized link between the systematic factor and the idiosyncratic factor of creditworthiness. The systematic factor can thus change not only the location but also the form of the distribution of creditworthiness. Results: For the case of two rating classes, we propose several estimators for the breakpoint and for the default probabilities within the rating classes. We prove the strong consistency of these estimators in the given non-i.i.d. framework. The theoretical results are illustrated by a simulation study. 
Finally, we give an overview of research opportunities. info:eu-repo/classification/ddc/330 ddc:330
17	Fouille de données d'usage du Web : Contributions au prétraitement de logs Web Intersites et à l'extraction des motifs séquentiels avec un faible support Tanasa, Doru 03 June 2005 (has links) (PDF) The past fifteen years have seen exponential growth of the Web, both in the number of available Web sites and in the number of their users. This growth has generated very large volumes of data on how Internet users use the Web, recorded in Web log files. In addition, site owners have expressed the need to understand their visitors better in order to meet their expectations better. Web Usage Mining (WUM), a fairly recent research field, is precisely the knowledge discovery in databases (KDD) process applied to Web usage data. It comprises three main steps: data preprocessing, pattern discovery, and analysis (or interpretation) of the results. A WUM process extracts behavior patterns from usage data and, possibly, from information about the site (structure and content) and about its users (profiles). The volume of usage data to analyze and its low quality (in particular the lack of structure) are the main problems in WUM. Classical data mining algorithms applied to these data generally give disappointing results in terms of users' practices (for example, obvious sequential patterns of no interest). In this thesis, we make two important contributions to the WUM process, both implemented in our AxisLogMiner toolbox. 
We propose a general methodology for preprocessing Web logs and a general divisive methodology with three approaches (and associated concrete methods) for discovering sequential patterns with low support. Our first contribution concerns the preprocessing of Web usage data, a topic still rarely addressed in the literature. The originality of the proposed preprocessing methodology lies in the fact that it takes into account the multi-site aspect of WUM, which is essential for understanding the practices of users who navigate transparently across, for example, several Web sites of the same organization. Besides integrating the main existing work on this topic, our methodology proposes four distinct steps: merging the log files, cleaning, structuring, and aggregating the data. In particular, we propose several heuristics for removing Web robots, aggregated variables describing sessions and visits, and the storage of these data in a relational model. Several experiments were carried out, showing that our methodology allows a strong reduction (up to 10 times) in the number of initial requests and provides richer structured logs for the subsequent data mining step. Our second contribution targets the discovery, from a large preprocessed log file, of minority behaviors corresponding to sequential patterns with very low support. To this end, we propose a general methodology that divides the preprocessed log file into sub-logs, with three approaches to extracting low-support sequential patterns (Sequential, Iterative, and Hierarchical). These were implemented in concrete hybrid methods that combine classification algorithms with sequential pattern extraction algorithms. 
Several experiments on logs from academic sites allowed us to discover interesting sequential patterns with very low support, which a classical Apriori-type algorithm could not discover. Finally, we propose a toolbox called AxisLogMiner, which supports our preprocessing methodology and, currently, two concrete hybrid methods for discovering sequential patterns in WUM. This toolbox has been used for many log file preprocessing runs and for experiments with our implemented methods. Web usage mining (WUM) Web access logs WUM methodology WUM preprocessing multi-site WUM Web data mining data mining sequential pattern extraction low support unsupervised classification divisive methodology WUM toolbox Apriori-GST AxisLogMiner
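The motivation for dividing a log into sub-logs can be illustrated with a small Python sketch. A pattern that is rare in the whole log may be frequent inside one sub-log. The sketch is not the AxisLogMiner method: grouping sessions by their entry page is a hypothetical stand-in for the classification step, and the function names and toy sessions are assumptions.

```python
from collections import defaultdict


def contains_subsequence(session, pattern):
    """True if the pages of `pattern` occur in `session` in order (gaps allowed)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def support(pattern, sessions):
    """Fraction of sessions that contain `pattern` as a subsequence."""
    if not sessions:
        return 0.0
    hits = sum(contains_subsequence(s, pattern) for s in sessions)
    return hits / len(sessions)


def split_by_entry_page(sessions):
    """Divide the log into sub-logs keyed by the first page of each session.

    Assumes every session is non-empty; a real system would use a proper
    classification step here instead of this simple grouping.
    """
    sub_logs = defaultdict(list)
    for session in sessions:
        sub_logs[session[0]].append(session)
    return dict(sub_logs)


# Toy log: a minority behavior hidden among many similar sessions.
sessions = [["home", "news", "contact"]] * 8 + [
    ["jobs", "phd", "apply"],
    ["jobs", "phd", "faq", "apply"],
]
pattern = ["phd", "apply"]

global_support = support(pattern, sessions)
sub_logs = split_by_entry_page(sessions)
local_support = support(pattern, sub_logs["jobs"])
```

With a minimum support threshold of, say, 0.5, the pattern would be discarded on the full log but retained in the "jobs" sub-log.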
18	Sentiment-Driven Topic Analysis Of Song Lyrics Sharma, Govind 08 1900 (has links) (PDF) Sentiment Analysis is an area of Computer Science that deals with the impact a document makes on a user. The field is further subdivided into Opinion Mining and Emotion Analysis, the latter of which is the basis for the present work. Work on songs is aimed at building affective interactive applications such as music recommendation engines. Using song lyrics, we are interested in both supervised and unsupervised analyses, each of which has its own pros and cons. For an unsupervised analysis (clustering), we use a standard probabilistic topic model called Latent Dirichlet Allocation (LDA). It mines topics, which are simply probability distributions over the vocabulary of words, from songs. Some of the topics seem sentiment-based, motivating us to continue with this approach. We evaluate our clusters using a gold dataset collected from an apt website and get positive results. This approach would be useful in the absence of a supervised dataset. In another part of our work, we argue for the inescapable existence of supervision in terms of having to manually analyse the topics returned. Further, we have also used explicit supervision in the form of a training dataset from which a classifier learns sentiment-specific classes. This analysis helps reduce dimensionality and improve classification accuracy. We get excellent dimensionality reduction using Support Vector Machines (SVM) for feature selection. For re-classification, we use the Naive Bayes Classifier (NBC) and SVM, both of which perform well. We also use Non-negative Matrix Factorization (NMF) for classification, but observe that the results coincide with those of NBC, with no exceptions. This drives us towards establishing a theoretical equivalence between the two. 
Song Lyrics Non-negative Matrix Factorization (NMF) Music Information Retrieval Music Recommendation Engine Support Vector Machine (SVM) Naive Bayes Classifier (NBC) Sentiment Analysis Emotion Analysis Latent Dirichlet Allocation (LDA) Sentiment Clustering Sentiment Classification k-Nearest Neighbour Classifier (k-NNC) Computer Science
19	Bambine e ragazzi bilingui nelle classi multietniche di Torino / Il sistema scolastico a confronto con opportunità, complessità e sfide del plurilinguismo Ritucci, Raffaella 24 October 2018 (has links) The student register of the Italian Ministry of Education (MIUR) records that today in Italy more than one in ten students is not an Italian citizen, although the majority of them were born there. Several statistical surveys indicate that "foreign" students, compared with their Italian classmates, show poorer performance in Italian and lower academic achievement. This exploratory fieldwork, carried out in schools in Turin (5th to 8th grade), analyzed data obtained through semi-structured interviews with 121 students and 26 parents as well as 141 questionnaires filled in by 27 teachers of Italian and of family languages. It showed that many students are "bilingual natives", as they grow up acquiring both Italian and another language. However, although the interviewees rate this polyglottism very positively, schools do not usually offer targeted support in either language. Within the cohort, the broadest competences in Italian are found first among those with an Italian-speaking parent, then among those who arrived in Italy at pre-school age and attended kindergarten there. This latter group shows higher competences than those who were born in Italy and attended nursery there, as is also observed in the INVALSI tests. As far as the family language is concerned, the data show that teaching it improves competence in it without harming competence in Italian: quite the opposite, in fact. These results confirm the significant role played by the "other" language in successful language education. MIUR is therefore called upon to add linguistic data to its student register, so as to redefine its curricula according to EU guidelines and to identify specific procedures and resources for multilingual classes. 
This new policy would reduce the current costs of placing students in a lower grade, grade retention, and drop-outs, and would promote school success, equal opportunities, and multilingualism, with positive consequences both for individuals and for the national economy. education educational success disadvantage bilingual bilingualism didactics study field research family language support language of origin heritage language integration INVALSI Italy grade retention classes costs Ministry of Education MIUR multilingual multilingualism migrant migration multiethnic multicultural mother tongue polyglot dropout school curricula school pupils students children school language success performance practice teaching diversity acquisition use skills competences citizenship Turin school system "bilingual natives" second language school register downgrading equal opportunities national economy foreign student register native academic achievement first language speaker Italian guidelines sociolinguistics contact linguistics language contact
370 Education 375 Curricula 379 Education policy 400 Language 410 Linguistics 372 Primary education 373 Secondary education 408 Groups of people ER 930 ddc:370 ddc:375 ddc:379 ddc:400 ddc:410 ddc:450 ddc:325 ddc:363 ddc:371 ddc:372 ddc:373 ddc:408 ddc:409 ddc:418 ddc:457 ddc:458

Search results