Global ETD Search

261	Aplicação de métodos não supervisionados: estudo empírico com os dados de segurança pública do estado do Rio de Janeiro Nascimento, Otto Tavares 20 December 2016 (has links) This multidisciplinary work applies methods from applied mathematics, specifically unsupervised learning, to public security data. We seek to identify similarities between police battalions using clustering methods, numerically optimizing McClain's evaluation criterion. Beyond the optimization, we give an intuitive treatment of the hierarchical clustering model, then extract an ordering of the clusters by criminal pattern and, finally, fit an ordered logit (OLogit) classification model using variables that characterize those clusters. We find evidence of clustering in the data and find that socioeconomic and policing variables are significant in ordering the clusters. In summary, the higher the police force per inhabitant and the minimum income HDI in a given battalion, the greater the probability of being in a cluster of lower criminal incidence. Unsupervised learning Public security Clusters Similarity Learning from Data McClain index Mathematics Sociology - Economic aspects
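To illustrate the evaluation criterion named in this abstract, the following is a minimal sketch of the McClain-Rao index as it is commonly defined: the mean within-cluster pairwise distance divided by the mean between-cluster pairwise distance, with lower values indicating better-separated clusters. The function name `mcclain_rao`, the Euclidean distance and the toy points are assumptions made for illustration; they are not taken from the thesis.

```python
from itertools import combinations
from math import dist


def mcclain_rao(points, labels):
    """Mean within-cluster distance divided by mean between-cluster distance."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Each unordered pair of points goes into exactly one of the two lists.
        (within if lp == lq else between).append(dist(p, q))
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mcclain_rao(points, [0, 0, 1, 1])  # labels follow the two groups
bad = mcclain_rao(points, [0, 1, 0, 1])   # labels cut across the groups
print(good < bad)
```

A clustering procedure that optimizes this criterion would search over candidate partitions for the one with the smallest index value; the search strategy itself is not shown here.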
262	Automatic Segmentation of Swedish Medical Words with Greek and Latin Morphemes : A Computational Morphological Analysis Lindström, Mathias January 2015 (has links) The growth of raw text data online has increased the need for artificial systems that can process raw data efficiently and at low cost in the field of natural language processing (NLP). A well-developed morphological analysis is an important cornerstone of NLP, in particular when word look-up is an important stage of processing. Morphological analysis has many advantages, including reducing the number of word forms to be stored computationally, as well as being cost-efficient and time-efficient. NLP is relevant in the field of medicine, especially in automatic text analysis, which is a relatively young field for Swedish medical texts. Much of the stored information is highly unstructured and disorganized. Using raw corpora, this paper aims to contribute to automatic morphological segmentation by experimenting with state-of-the-art tools for unsupervised and semi-supervised word segmentation of Swedish words in medical texts. The results show that a reasonable segmentation depends more on a high number of word types than on a particular type of corpus. The results also show that semi-supervised word segmentation, that is, the use of annotated training data, greatly increases performance. automatic word segmentation Swedish medical word segmentation morpheme segmentation morphology induction morphological analysis unsupervised learning natural language processing General Language Studies and Linguistics
263	Abordagem semi-supervisionada para detecção de módulos de software defeituosos OLIVEIRA, Paulo César de 31 August 2015 (has links) As market competition increases, high-quality applications are required to automate services. To ensure software quality, testing with the aim of finding failures as early as possible is essential in the development lifecycle. The purpose of software testing is to find failures that can be fixed and, consequently, to increase the quality of the software under development. As software grows, more tests are needed to prevent or find defects. However, the more tests are designed and executed, the more human and infrastructure resources are needed. Moreover, the time available for testing activities is usually insufficient, which allows defects to escape. Companies are constantly looking for cheaper and more effective ways to detect software defects. In recent years, many researchers have sought mechanisms to predict software defects automatically. Machine learning techniques have been a research target as a way of finding defects in software modules. Many supervised approaches have been used for this purpose, but labeling software modules as defective or not defective for training a classifier is very expensive and can make the use of machine learning impractical. In this context, this work aims to analyze and compare unsupervised and semi-supervised approaches to detecting defective software modules. To do so, unsupervised (anomaly detection) methods and semi-supervised methods based on the AutoMLP and Naive Bayes classifiers were used. To evaluate and compare these approaches, NASA datasets available in the PROMISE Software Engineering Repository were used. Machine Learning Software Defect Detection Semi-Supervised Learning Unsupervised Learning Software Testing Anomaly Detection
264	Exploration of an Automated Motivation Letter Scoring System to Emulate Human Judgement Munnecom, Lorenna, Pacheco, Miguel Chaves de Lemos January 2020 (has links) As the popularity of the master’s in data science at Dalarna University increases, so does the number of applicants. The aim of this thesis was to explore different approaches to building an automated motivation letter scoring system that could emulate human judgement and automate the process of candidate selection. Several steps, such as image processing and text processing, were required to retrieve numerous features that could lead to the identification of the factors graded by the program managers. Grammar-based features and advanced textual features were extracted from the motivation letters, followed by the application of topic modelling methods to extract the probability of each topic occurring within a motivation letter. Furthermore, correlation analysis was applied to quantify the association between the features and the different factors graded by the program managers, followed by ordinal logistic regression and random forest to build models with the most impactful variables. Finally, the Naïve Bayes algorithm, random forest and support vector machine were used, first for classification and then for prediction purposes. The results were not promising, as the factors were not accurately identified. Nevertheless, the authors suspect that the factors may be strongly related to the emphasis on specific topics within a motivation letter, which could motivate further research. Natural Language Processing Machine Learning Supervised Learning Unsupervised Learning Automation Feature Extraction Image Processing Text Processing Text Exploration Motivation Letter Dalarna University Student Application Topic Modelling Business Intelligence Data Science Computer and Information Sciences
265	Inference and applications for topic models / Inférence et applications pour les modèles thématiques Dupuy, Christophe 30 June 2017 (has links) Most current recommendation systems are based on ratings (i.e., numbers between 0 and 5) and try to suggest content (a movie, a restaurant...) to a user. These systems usually allow users to provide a text review of the content in addition to a rating. It is hard to extract useful information from raw text, while a rating alone carries little information about the content and the user. In this thesis, we tackle the problem of suggesting personalized, readable text to users to help them quickly form an opinion about a piece of content. More specifically, we first build a topic model that predicts a personalized movie description from text reviews. Our model separates qualitative topics (i.e., those that convey opinion) from descriptive topics by combining text reviews and numerical movie ratings in a joint probabilistic model. We evaluate our model on an IMDB dataset and illustrate its performance through a comparison of topics. We then study parameter inference in large-scale latent variable models, which include most topic models. We propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist and Bayesian methods. We also propose a novel inference method for the frequentist estimation of parameters that adapts MCMC methods to online inference of latent variable models with the proper use of local Gibbs sampling. For the specific latent Dirichlet allocation topic model, we provide an extensive set of experiments and comparisons with existing work, in which our new approach outperforms all previously proposed methods. Finally, we propose a new class of determinantal point processes (DPPs) that can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and DPPs defined on exponentially many items. We apply this new class to modelling text documents as samples from a DPP over sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on 2 to the power 500 items, where the summaries are composed of readable sentences. Topic models Online learning Latent variable models Unsupervised learning Determinantal point processes Latent Dirichlet allocation 006.3
266	Description des variétés berbères en danger du Sud-Oranais (Algérie) - Étude dialectologique, phonologique et phonétique du système consonantique / Description of endangered Berber varieties of Sud-Oranais (Algeria) - A Dialectological, phonetic and phonological study of the consonantic system El Idrissi, Mohamed 08 December 2017 (has links) There are several Berber varieties in south-west Algeria. Some of them are located in the region known as Sud-Oranais and can be classified as endangered languages. I therefore set out to describe these varieties before they disappear, and carried out several field surveys to do so. However, this work of linguistic documentation and cultural heritage conservation is only one aspect of the thesis, which lies at the crossroads of several disciplines. I used methods from Geographic Information Science (GIS) and Data Science (DS) to carry out a dialectological study. Using GIS, I conducted a geolinguistic study that made it possible to visualize, on linguistic maps, the distribution of the linguistic variation of certain consonants. Based on these data, I discuss the phonological reality of the simple and geminate consonants. Building on this, a dialectometric study was carried out using data partitioning methods. I used unsupervised learning methods (HAC, k-means, MDS, ...) and supervised learning methods (CART) known in DS. The results are displayed as figures (linguistic maps, dendrogram, heatmap, tree, ...) for visual exploration of the data. All of these studies were carried out through computer processing (R language). I then undertook a phonetic analysis based on an acoustic study of the alveolar rhotics [ɾ], [r], [ɾˤ] and [rˤ]. These phonic units are distinguished by their temporal properties and their articulatory realization. The spectrograms made it possible to examine the distribution of these sounds and to distinguish what pertains to phonetics from what pertains to phonology. Finally, the thesis concludes with a phonetic and statistical study focused on the obstruction made by the tip of the tongue and on the nature of the vocoids that accompany the alveolar rhotics in the environment of a consonant. Unsupervised Learning Supervised Learning R Phonetics Phonology Linguistic map GIS Computer processing Rhotic Trill Tap
267	Machine learning techniques for content-based information retrieval / Méthodes d’apprentissage automatique pour la recherche par le contenu de l’information Chafik, Sanaa 22 December 2017 (has links) The amount of media data is growing rapidly with the fast growth of the Internet and media resources. Performing an efficient similarity (nearest neighbor) search in such a large collection of data is a very challenging problem that the scientific community has been attempting to tackle. One of the most promising solutions to this fundamental problem is Content-Based Media Retrieval (CBMR) systems, which perform the retrieval task in large media databases based on the content of the data. CBMR systems consist essentially of three major units: a Data Representation unit for feature representation learning, a Multidimensional Indexing unit for structuring the resulting feature space, and a Nearest Neighbor Search unit to perform efficient search. Media data (e.g. image, text, audio, video) can be represented by meaningful numeric information (i.e. a multidimensional vector), called a feature descriptor, describing the overall content of the input data. The task of the second unit is to structure the resulting feature descriptor space into an index structure, in which the third unit performs the nearest neighbor search. In this work, we address the problem of nearest neighbor search by proposing three Content-Based Media Retrieval approaches. All three approaches are unsupervised, and thus can be applied to both labeled and unlabeled real-world datasets. They are based on a hashing indexing scheme to perform efficient high-dimensional nearest neighbor search. Unlike most recent hashing approaches, which favor indexing in Hamming space, our proposed methods provide index structures adapted to a real-space mapping. Although Hamming-based hashing methods achieve a good accuracy-speed tradeoff, their accuracy drops owing to information loss during the binarization process. By contrast, real-space hashing approaches provide a more accurate approximation of the original space in the mapped real space, as they avoid hard binary approximations, but they generally incur additional computation time; this last problem is addressed in the third contribution. Our proposed approaches can be classified into shallow and deep approaches. In the former category, we propose two shallow hashing-based approaches, namely "Symmetries of the Cube Locality Sensitive Hashing" (SC-LSH) and "Cluster-based Data Oriented Hashing" (CDOH), based respectively on randomized hashing and shallow learning-to-hash schemes. The SC-LSH method provides a solution to the storage space problem faced by most randomized hashing approaches. It consists of a semi-random scheme that partially reduces the randomness of randomized hashing approaches, and thus their memory footprint, while maintaining their efficiency in structuring heterogeneous spaces. The CDOH approach proposes to eliminate the randomness by combining unsupervised machine learning techniques with the hashing concept. CDOH outperforms the randomized hashing approaches in terms of computation time, memory space and search accuracy. The third approach is a deep learning-based hashing scheme named "Unsupervised Deep Neuron-per-Neuron Hashing" (UDN2H). The UDN2H approach proposes to index individually the output of each neuron of the top layer of a deep unsupervised model, namely a deep autoencoder, with the aim of capturing the high-level individual structure of each neuron output. Our three approaches, SC-LSH, CDOH and UDN2H, were proposed sequentially as the thesis progressed, with an increasing level of complexity in the models developed and increasing effectiveness and performance on large real-world datasets. Multidimensional indexing Unsupervised learning Hashing Approximate nearest neighbor search Deep learning
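The contrast this abstract draws between Hamming-space and real-space hashing can be sketched with textbook random-projection hashing. This is not SC-LSH, CDOH or UDN2H, whose details the abstract does not give; the function names, the Gaussian projections, the bucket `width` and the sample vectors are all assumptions chosen for illustration.

```python
import random
from math import floor


def make_projections(dim, n_hashes, seed=0):
    """Draw Gaussian projection vectors and offsets (seeded for repeatability)."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]
    offsets = [rng.random() for _ in range(n_hashes)]
    return vectors, offsets


def project(x, vectors):
    """Dot product of x with every projection vector."""
    return [sum(a * xi for a, xi in zip(v, x)) for v in vectors]


def binary_code(x, vectors):
    """Hamming-space code: keep only the sign of each projection."""
    return tuple(1 if p >= 0 else 0 for p in project(x, vectors))


def real_code(x, vectors, offsets, width=1.0):
    """Real-space bucket: quantize each shifted projection into cells of size `width`."""
    return tuple(floor((p + b * width) / width)
                 for p, b in zip(project(x, vectors), offsets))


vectors, offsets = make_projections(dim=3, n_hashes=4)
x = [1.0, 2.0, 3.0]
x_scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude

# The sign code cannot tell the two vectors apart, because scaling by a
# positive factor never changes the sign of a projection.
print(binary_code(x, vectors) == binary_code(x_scaled, vectors))
print(real_code(x, vectors, offsets), real_code(x_scaled, vectors, offsets))
```

The sign code discards magnitude, which is one simple instance of the information loss attributed to binarization, while the quantized projections retain it at the cost of larger keys.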
268	A concept of an intent-based contextual chat-bot with capabilities for continual learning Strutynskiy, Maksym January 2020 (has links) Chat-bots are computer programs designed to conduct textual or audible conversations with a single user. The job of a chat-bot is to find the best response to any request the user issues. The best response is one that answers the question and contains relevant information while following grammatical and lexical rules. Modern chat-bots often have trouble accomplishing all these tasks. State-of-the-art approaches, such as deep learning, and large datasets help chat-bots tackle this problem better. While a number of different approaches can be applied to different kinds of bots, datasets of suitable size are not always available. In this work, we introduce and evaluate a method of expanding the size of datasets. This allows chat-bots, in combination with a good learning algorithm, to achieve higher precision in handling their tasks. The expansion method uses a continual learning approach that allows the bot to expand its own dataset while holding conversations with its users. We test continual learning with the IBM Watson Assistant chat-bot as well as with a custom case-study chat-bot implementation. We conduct the testing using a smaller and a larger dataset to find out whether continual learning stays effective as the dataset size increases. The results show that the more conversations the chat-bot holds, the better it gets at guessing the intent of the user. They also show that continual learning works well for both larger and smaller datasets, but the effect depends on the specifics of the chat-bot implementation. While continual learning makes good results better, it also makes bad results worse, so the chat-bot should be manually calibrated if the precision of the original results, measured before the expansion, decreases. Machine learning intent based chat-bot dialogue systems rule based Python TensorFlow TFLearn continual learning online learning supervised learning unsupervised learning IBM Watson Watson Assistant Computer Sciences
269	Automatická klasifikace obrazů / Automatic image classification Ševčík, Zdeněk January 2020 (has links) The aim of this thesis is to explore unsupervised machine learning clustering algorithms that can be used to classify an image database by similarity. A theoretical basis is given for the chosen clustering algorithms. To improve classification of the database used, the thesis examines different image preprocessing methods, which are used to extract features from the images. The thesis then covers the implementation of the preprocessing methods and the practical application of the clustering algorithms. In the practical part, an application is written in the Python programming language that classifies the image database into classes by similarity. The thesis tests all of the methods used and concludes with an overview of the results.
270	Aplikace metody učení bez učitele na hledání podobných grafů / Application of Unsupervised Learning Methods in Graph Similarity Search Sabo, Jozef January 2021 (has links) The goal of this master's thesis was to design, in cooperation with the company Avast, a system that can extract knowledge from a database of graphs. The graphs used for data mining describe the behaviour of computer systems and are inserted anonymously into the company's database from the systems of users of the company's products. Each graph in the database can be assigned one of two labels: clean or malware (malicious). The task of the proposed self-learning system is to find clusters of graphs in the graph database in which the classes of graphs do not mix. Graph clusters containing only one class of graphs can be interpreted as different types of clean or malware graphs, and they are a useful starting point for further analysis of the graphs. To evaluate the quality of the clusters, a custom metric named monochromaticity was designed. The metric evaluates the quality of the clusters based on how much clean and malware graphs are mixed within them. The best results for the metric were obtained when vector representations of the graphs were created by a deep learning model (a variational graph autoencoder with two relational graph convolution operators) and the parameterless MeanShift method was used to cluster the vectors.
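The abstract describes monochromaticity only informally, as a measure of how much clean and malware graphs mix within clusters. The sketch below uses a purity-style stand-in: the share of graphs that carry the majority label of their own cluster. The formula, the function name and the sample data are assumptions for illustration and should not be read as the thesis's actual definition.

```python
from collections import Counter, defaultdict


def monochromaticity(cluster_ids, labels):
    """Share of items that carry the majority label of their own cluster.

    ASSUMPTION: this purity-style formula is a stand-in; the abstract does not
    state the thesis's actual definition of monochromaticity.
    """
    if not cluster_ids or len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must be non-empty and equally long")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    # Sum the majority counts over all clusters, then divide by the item count.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical example: six graphs assigned to two clusters.
labels = ["clean", "clean", "clean", "malware", "malware", "malware"]
pure = monochromaticity([0, 0, 0, 1, 1, 1], labels)   # clusters match the labels
mixed = monochromaticity([0, 1, 0, 1, 0, 1], labels)  # labels mixed in each cluster
print(pure, mixed)
```

A score of this kind reaches its maximum trivially when every graph forms its own cluster, so in practice it would be read alongside the number of clusters that the clustering method produces.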

Search results