  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
131

Feature Extraction for the Cardiovascular Disease Diagnosis

Tang, Yu January 2018 (has links)
Cardiovascular disease is a serious, life-threatening condition that can occur suddenly and progress rapidly. Identifying the right disease features at an early stage is important for reducing mortality and ensuring that patients can fully recover. Among the available examination methods, recording heart activity as a signal is the most cost-effective, and the ECG is the best choice because it is safer, faster and more convenient than other examinations. However, the ECG has limitations: not all of its features are clear and easily interpreted, and frequency features are absent from the traditional time-domain trace. To address these problems, the project uses an optimized CWT algorithm to transform the data from the time domain into the time-frequency domain. The result is evaluated with three data mining algorithms based on different mechanisms; the evaluation shows that the ECG features are successfully extracted and that important diagnostic information is preserved. A user interface is designed to increase efficiency and facilitate the implementation.
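A minimal sketch of the time-to-time-frequency step described in this abstract, assuming CWT refers to the continuous wavelet transform; the wavelet, scale range and sampling rate below are illustrative choices, not the optimized settings of the thesis, and the synthetic signal merely stands in for a real ECG record.

```python
import numpy as np
import pywt

def ecg_time_frequency(signal, fs, scales=np.arange(1, 128), wavelet="morl"):
    """Map a 1-D ECG signal to the time-frequency domain with a continuous
    wavelet transform; rows correspond to scales, columns to time samples."""
    coeffs, freqs = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coeffs), freqs  # magnitude scalogram and the frequency of each row

# Example: a synthetic 5-second trace sampled at 360 Hz (illustrative values)
fs = 360
t = np.arange(0, 5, 1.0 / fs)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)
scalogram, freqs = ecg_time_frequency(signal, fs)
```

The scalogram rows can then be fed to classifiers as time-frequency features, which is the role the data mining algorithms play in the evaluation described above.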
132

Estabilidade de atividade basal, recuperação e formação de memórias em redes de neurônios / Stability of basal activity, retrieval and formation of memories in networks of spiking neurons

Agnes, Everton João January 2014 (has links)
The brain, through complex electrical activity, is able to process different types of information, which are encoded, stored and retrieved. The processing is based on the activity of neurons that communicate primarily by discrete events in time: the action potentials. These action potentials can be observed with experimental techniques; for example, it is possible to record the spike times of hundreds of neurons in living mice. However, the strength of the connections among these neurons is not fully accessible, which, among other factors, precludes a more complete understanding of the neural network. Thus, computational neuroscience has an important role in understanding the processes involved in the brain, at various levels of detail. Within this field, this work presents a study on the acquisition and retrieval of memories given by spatial patterns, where space is defined by the neurons of the simulated network. First, Hebb's rule is used to build networks of spiking neurons with static connections chosen from these spatial patterns. If memories are stored in the connections between neurons, then synaptic weights should be plastic so that learning is possible. Synaptic plasticity rules that allow memory formation (Hebbian rules) usually introduce instabilities in the neurons' activity; therefore, homeostatic plasticity rules were developed that stabilize baseline activity regimes in networks of spiking neurons. The thesis ends with analytical and numerical studies of plasticity rules that allow unsupervised learning by increasing the firing rate of specific neurons. With a plasticity rule based on experimental evidence, retrieval of learned patterns is shown to be possible, with either supervised or spontaneous recall.
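A minimal sketch of the two ingredients named in the abstract: Hebb-style storage of binary spatial patterns in a weight matrix, and a simple homeostatic scaling step that nudges each neuron's firing rate toward a target. The network size, number of patterns, target rate and learning rate are illustrative assumptions, and this rate-based scaling is only a stand-in for the spiking-network rules developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_patterns = 200, 5          # illustrative sizes, not from the thesis
patterns = rng.integers(0, 2, size=(n_patterns, n_neurons))  # binary spatial patterns

# Hebb-like storage: neurons that are co-active in a pattern get a positive weight
W = np.zeros((n_neurons, n_neurons))
for p in patterns:
    W += np.outer(p - p.mean(), p - p.mean())
np.fill_diagonal(W, 0.0)
W /= n_patterns

def homeostatic_step(W, rates, target_rate=5.0, eta=1e-3):
    """Scale each neuron's incoming weights toward a target firing rate (Hz)."""
    scaling = 1.0 + eta * (target_rate - rates) / target_rate
    return W * scaling[:, None]   # row i holds the inputs onto neuron i

# Example: neurons firing above the target have their incoming weights scaled down
rates = rng.uniform(0.0, 10.0, n_neurons)
W = homeostatic_step(W, rates)
```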
134

A General Framework for Discovering Multiple Data Groupings

Sweidan, Dirar January 2018 (has links)
Clustering helps users gain insight from their data by discovering hidden structures in an unsupervised way. Unlike classification tasks, which are evaluated against well-defined target labels, clustering is an intrinsically subjective task: it depends on the interpretation, needs and interests of users. In many real-world applications, multiple meaningful clusterings can be hidden in the data, and different users are interested in exploring different perspectives and use cases of the same data. Despite this, most existing clustering techniques attempt to produce only a single clustering of the data, which can be too restrictive. In this thesis, a general method is proposed to discover multiple alternative clusterings of the data and let users select the clustering(s) they are most interested in. To cover a large set of possible clustering solutions, a diverse set of clusterings is first generated from various projections of the data. Similar clusterings are then grouped, filtered, and aggregated into one representative clustering each, so that the user only needs to explore a small set of non-redundant representative clusterings. The proposed method is compared against others, and its advantages and disadvantages are analyzed on artificial and real-world datasets, as well as on images that enable a visual assessment of the meaningfulness of the discovered clustering solutions. In addition, extensive studies and analyses are carried out on the various techniques used within the method. Results show that the proposed method is able to discover multiple interesting and meaningful clustering solutions.
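A minimal sketch of the generate-then-aggregate idea described above, using scikit-learn. The choice of Gaussian random projections, k-means as the base algorithm, the adjusted Rand index as the clustering-similarity measure, and the number of views and representatives are illustrative assumptions rather than the configuration used in the thesis.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def alternative_clusterings(X, n_views=20, k=3, n_representatives=4, seed=0):
    """Generate diverse clusterings on random projections of X, group similar
    ones, and return one representative labelling per group."""
    rng = np.random.default_rng(seed)
    labelings = []
    for i in range(n_views):
        proj = GaussianRandomProjection(n_components=2,
                                        random_state=int(rng.integers(1_000_000)))
        Xp = proj.fit_transform(X)
        labelings.append(KMeans(n_clusters=k, n_init=10, random_state=i).fit_predict(Xp))

    # Pairwise dissimilarity between clusterings (1 - adjusted Rand index)
    n = len(labelings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - adjusted_rand_score(labelings[i], labelings[j])

    # Group similar clusterings and keep one representative per group
    groups = AgglomerativeClustering(n_clusters=n_representatives,
                                     metric="precomputed",
                                     linkage="average").fit_predict(dist)
    return [labelings[np.flatnonzero(groups == g)[0]] for g in range(n_representatives)]
```

The user would then inspect only the returned representatives instead of all generated clusterings, which is the non-redundancy point the abstract makes.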
135

Arcabouço para reconhecimento de locutor baseado em aprendizado não supervisionado / Speaker recognition framework based on unsupervised learning

Campos, Victor de Abreu [UNESP] 31 August 2017 (has links)
The huge amount of multimedia content accumulated daily has demanded the development of effective retrieval approaches. In this context, speaker recognition tools capable of automatically identifying a person through their voice are of great relevance. This work presents a novel speaker recognition approach modelled as a retrieval scenario and using recent unsupervised learning methods. The proposed approach considers Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction coefficients (PLPs) as features, along with multiple modelling approaches, namely Vector Quantization, Gaussian Mixture Models and i-vectors, to compute distances among audio recordings. Next, rank-based unsupervised learning methods are used to improve the effectiveness of the retrieval results and, based on a K-Nearest Neighbors classifier, an identity decision is taken. Experiments were conducted on three public datasets from different scenarios, carrying noise from various sources. Experimental results demonstrate that the proposed approach can achieve very high effectiveness. In addition, relative effectiveness gains of up to +318% were obtained by the unsupervised learning procedure in the speaker retrieval task, and relative accuracy gains of up to +7.05% in the speaker identification task across recordings from different domains. / Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP): 2015/07934-4
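A minimal sketch of the kind of pipeline the abstract describes, assuming librosa for MFCC extraction and scikit-learn for the Gaussian Mixture Model and K-Nearest Neighbors stages. The file names, model sizes and the GMM-mean "supervector" summary are illustrative assumptions, and the rank-based re-ranking step of the thesis is not reproduced here.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path, n_mfcc=20):
    """Load an audio file and return its frame-level MFCC matrix (frames x coefficients)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def gmm_supervector(mfcc, n_components=8):
    """Summarize a recording by the stacked means of a small GMM fit to its MFCC frames."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(mfcc)
    return gmm.means_.ravel()

# Hypothetical enrollment data: (file, speaker-id) pairs
enrollment = [("spk1_a.wav", "spk1"), ("spk1_b.wav", "spk1"),
              ("spk2_a.wav", "spk2"), ("spk2_b.wav", "spk2")]
X = np.vstack([gmm_supervector(mfcc_features(f)) for f, _ in enrollment])
y = [spk for _, spk in enrollment]

# Identity decision with a K-Nearest Neighbors classifier over the supervectors
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([gmm_supervector(mfcc_features("unknown.wav"))]))
```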
137

Higher-Ordered Feedback Architectures : a Comparison

Jason, Henrik January 2002 (has links)
The aim of this dissertation is to investigate the application of higher-ordered feedback architectures, used as the control system of an autonomous robot, to delayed response tasks in the area of evolutionary robotics. For the two architectures of interest, a theoretical and a practical experimental study are conducted to examine how these architectures cope with the road-sign problem and extended versions of it. The theoretical study focuses on the features of the architectures and on how they have behaved in different kinds of road-sign problem environments in earlier work. Based on this study, two problem environments are chosen for practical experiments: the three-way and the multiple stimuli road-sign problems. Both architectures appear to cope with the three-way road-sign problem, but both are shown to have difficulties solving the multiple stimuli road-sign problem with the experimental settings used here. This work leads to two insights into how these architectures behave in the three-way road-sign problem environment and in delayed response tasks, where the robot appears to learn to explicitly relate its actions to the different stimulus settings it is exposed to. First, both architectures form higher, abstracted representations of the inputs from the environment; these representations guide the robot's actions in situations where the raw input alone is not enough to select the correct action. Second, it appears to be sufficient to keep two internal representations of stimulus settings and to offload the remaining ones, relying on the raw input from the environment, in order to solve the three-way road-sign problem. The dissertation serves as an overview for new researchers in the area and as a starting point for further investigations into the use of higher-ordered feedback architectures.
138

Locally Optimized Mapping of Slum Conditions in a Sub-Saharan Context: A Case Study of Bamenda, Cameroon

Anchang, Julius 18 November 2016 (has links)
Despite being an indicator of modernization and macro-economic growth, urbanization in regions such as Sub-Saharan Africa is tightly interwoven with poverty and deprivation. This has manifested physically as slums, which represent the worst residential urban areas, marked by lack of access to good quality housing and basic services. To effectively combat the slum phenomenon, local slum conditions must be captured in quantitative and spatial terms. However, there are significant hurdles to this: slum detection and mapping requires readily available and reliable data, as well as a proper conceptualization of measurement and scale. Using Bamenda, Cameroon, as a test case, this dissertation research was designed as a three-pronged attack on the slum mapping problem. The overall goal was to investigate locally optimized slum mapping strategies and methods that utilize high-resolution satellite image data, household survey data, simple machine learning and regionalization theory. The first major objective of the study was to tackle a "measurement" problem. The aim was to explore a multi-index approach to measure and map local slum conditions, the rationale being that prior sub-Saharan slum research too often used simplified measurement techniques, such as a single unweighted composite index, to represent diverse local slum conditions. In this study, six household indicators relevant to the United Nations criteria for defining slums were extracted from a 2013 Bamenda household survey data set and aggregated for 63 local statistical areas. The extracted variables were the percent of households having the following attributes: more than two residents per room, non-owner, occupying a single room or studio, having no flush toilet, having no piped water, and having no drainage. Hierarchical variable clustering was used as a surrogate for exploratory factor analysis to determine fewer latent slum factors from these six variables. Variable groups were formed such that the most correlated variables fell in the same group while non-correlated variables fell in separate groups. Each group was then examined to see whether it suggested a conceptually meaningful slum factor that could be quantified as a stand-alone "high"/"low" binary slum index. Results showed that the slum indicators in the study area could be replaced by at least two meaningful and statistically uncorrelated latent factors. One factor reflected home occupancy conditions (tenancy status, overcrowding and living space) and was quantified, using K-means clustering of units, as an 'occupancy disadvantage index' (Occ_D). The other reflected the state of utilities access (piped water and flush toilet) and was quantified as a utilities disadvantage index (UT_D). Location attributes were used to examine and validate both indices. Independent t-tests showed that units with high Occ_D were on average closer to the nearest town markets and major roads than units with low Occ_D. This was consistent with theory, as typical slum residents (in this case overcrowded and non-owner households) are expected to favor accessibility to areas of high economic activity. However, this was not the case for UT_D, which showed no such strong pattern. The second major objective was to tackle a "learning" problem. The purpose was to explore the potential of unsupervised machine learning to detect or "learn" slum conditions from image data.
The rationale was that such an approach would be efficient and less reliant on prior knowledge and expertise. A 2012 GeoEye image scene of the study area was subjected to image classification, from which the following physical settlement attributes were quantified for each of the 63 statistical areas: percent roof area, percent open space area, percent bare soil, percent paved road surface, percent dirt road surface, and building shadow-to-roof area ratio. The shadow-to-roof ratio was an innovative measure used to capture the size and density attributes of buildings. In addition to the six image-derived variables, the mean slope of each area was calculated from a digital elevation dataset. All seven attributes were subjected to principal component analysis, from which the first two components were extracted and used for hierarchical clustering of statistical areas to derive physical types. Results show that area units could be optimally classified into four physical types, labelled generically as Categories 1-4, each with at least one defining physical characteristic. Kruskal-Wallis tests comparing the physical types in terms of household and location attributes showed that at least two physical types differed in terms of aggregated household slum conditions and location attributes. Category 4 areas, located on steep slopes and having a high shadow-to-roof ratio, had the highest distribution of non-owner households and were located close to the nearest town markets; they were thus the most likely slum candidates in the city. Category 1 units, on the other hand, located at the outskirts and having abundant open space, were the least likely to have slum conditions. The third major objective was to tackle the problem of "spatial scale". Neighborhoods, by their very nature of contiguity and homogeneity, represent an ideal scale for urban spatial analysis and mapping. Unfortunately, in most areas, neighborhoods are not objectively defined, and slum mapping often relies on the use of arbitrary spatial units which do not capture the true extent of the phenomenon. The objective was thus to explore the use of analytic regionalization to quantitatively derive the neighborhood unit for mapping slums. Analytic neighborhoods were created by spatially constrained clustering of statistical areas using the minimum spanning tree algorithm. Unlike previous studies that relied on socio-economic and/or demographic information, this study innovatively used multiple land cover and terrain attributes as neighborhood homogenizing factors. Five analytic neighborhoods (labeled Regions 1-5) were created this way and compared using Kruskal-Wallis tests for differences in household slum attributes, in order to determine the largest possible contiguous areas that could be labeled as slum or non-slum neighborhoods. The results revealed that at least two analytic regions were significantly different in terms of aggregated household indicators. Region 1 stood apart as having significantly higher distributions of overcrowded and non-owner households and could thus be viewed as the largest potential slum neighborhood in the city. In contrast, Region 3 (located at higher elevation and separated from the rest of the city by a steep escarpment) was generally associated with a low distribution of household slum attributes and could be considered the strongest model of a non-slum or formal neighborhood. Both Regions 1 and 3 were also qualitatively correlated with two locally recognized (vernacular) neighborhoods.
These neighborhoods, "Sisia" (for Region 1) and "Up Station" (for Region 3), are commonly perceived by local folk as occupying opposite ends of the socio-economic spectrum. The results obtained by successfully carrying out the three major objectives have major implications for future research and policy. In the case of the multi-index analysis of slum conditions, they affirm the notion that the slum phenomenon is diverse in the local context and that remediation efforts must be compartmentalized to be effective. The results of unsupervised mapping of slums from imagery show that it is a tool with high potential for rapid slum assessment even when there is no supporting field data. Finally, the results of analytic regionalization showed that the true extent of contiguous slum neighborhoods can be delineated objectively using land cover and terrain attributes, presenting an opportunity for local planning and policy actors to consider redesigning the city's neighborhood districts as analytic units. Quantitatively derived neighborhoods are likely to be more useful in the long term, be it for spatial sampling, mapping or planning purposes.
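A minimal sketch of the first objective's index construction: a 2-means split of standardized household indicators into "high" and "low" groups, in the spirit of the occupancy disadvantage index (Occ_D). The data frame below is a synthetic stand-in for the aggregated 2013 survey indicators; only the number of statistical areas (63) comes from the abstract, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for percent-of-households indicators per statistical area
rng = np.random.default_rng(1)
areas = pd.DataFrame({
    "pct_overcrowded": rng.uniform(0, 100, 63),
    "pct_non_owner":   rng.uniform(0, 100, 63),
    "pct_single_room": rng.uniform(0, 100, 63),
})

def binary_disadvantage_index(df, columns):
    """Split areas into 'high'/'low' groups with 2-means on standardized indicators,
    mimicking an occupancy-disadvantage style index."""
    X = StandardScaler().fit_transform(df[columns])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Call the cluster with the larger mean standardized indicator values "high"
    high = int(X[labels == 1].mean() > X[labels == 0].mean())
    return np.where(labels == high, "high", "low")

areas["Occ_D"] = binary_disadvantage_index(
    areas, ["pct_overcrowded", "pct_non_owner", "pct_single_room"])
```

The resulting binary index per area is the kind of variable the dissertation then validates against location attributes with independent t-tests.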
139

Sélection de corpus en traduction automatique statistique / Efficient corpus selection for statistical machine translation

Abdul Rauf, Sadaf 17 January 2012 (has links)
In our world of international communication, machine translation has become an essential key technology. Several approaches exist, but in recent years the so-called Statistical Machine Translation (SMT) has been considered the most promising. In this approach, knowledge is automatically extracted from examples of translations, called parallel texts, and from monolingual data in the target language. Statistical machine translation is a data-driven process. This is commonly put forward as a great advantage of statistical approaches, since no bilingual human intervention is required, but it can also turn into a problem when the data needed to develop a system are not available, are too small, or do not match the target domain. The research presented in this thesis is an attempt to overcome one of the barriers to massive deployment of statistical machine translation systems: the lack of parallel corpora. A parallel corpus is a collection of sentences in the source and target languages that are aligned at the sentence level. Most existing parallel corpora were produced by professional translators, an expensive task in terms of money, human resources and time. This thesis provides methods to overcome this need by exploiting the huge, easily available comparable and monolingual data collections, and presents two effective architectures to achieve this. In the first part of this thesis, we worked on the use of comparable corpora to improve statistical machine translation systems. A comparable corpus is a collection of texts in multiple languages, collected independently, but often containing parts that are mutual translations. The size and quality of the parallel content may vary considerably from one comparable corpus to another, depending on various factors, including the method used to construct the corpus; in any case, it is not easy to identify the parallel parts automatically. As part of this thesis, we developed an approach that is entirely based on freely available tools. The main idea of our approach is to use a statistical machine translation system to translate all sentences of the source-language side of the comparable corpus into the target language. Each of these translations is then used as a query to identify potentially parallel sentences in the target-language side of the comparable corpus, using an information retrieval toolkit. In a second step, the retrieved sentences are compared to the automatic translation to determine whether they are actually parallel to the corresponding source-language sentence. Several criteria were evaluated, such as the word error rate, the translation edit rate (TER) and TERp. We conducted a very detailed experimental analysis to demonstrate the benefit of our approach. We worked on comparable corpora from the news domain, more specifically from multilingual news agencies such as Agence France-Presse (AFP), Associated Press and Xinhua News, which publish daily news in several languages. We were able to extract parallel texts from large collections of over three hundred million words for the French-English and Arabic-English language pairs. These parallel texts significantly improved our statistical translation systems. We also present a theoretical comparison of the model developed in this thesis with another approach presented in the literature. Various extensions are also discussed: automatic extraction of unknown words and the creation of a dictionary, detection and suppression of extra information, etc. In the second part of this thesis, we examined the possibility of using monolingual data to improve the translation model of a statistical system. The idea is to replace parallel data with monolingual source- or target-language data. This research is thus placed in the context of unsupervised learning, since the missing translations are produced by an automatic translation system and, after various filtering steps, reinjected into the system...
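A minimal sketch of the mining loop described above: machine-translated source sentences are used as queries against an index of target-language sentences, and retrieved candidates are kept only if their edit rate with respect to the query falls below a threshold. The TF-IDF retrieval, the simple word-error-rate filter and the threshold value are illustrative stand-ins for the IR toolkit and the TER/TERp criteria used in the thesis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def word_error_rate(hyp, ref):
    """Word-level edit distance between two sentences, normalized by reference length."""
    h, r = hyp.split(), ref.split()
    d = np.zeros((len(h) + 1, len(r) + 1), dtype=int)
    d[:, 0] = np.arange(len(h) + 1)
    d[0, :] = np.arange(len(r) + 1)
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(h), len(r)] / max(len(r), 1)

def mine_parallel(translated_queries, target_sentences, max_wer=0.6):
    """For each machine-translated source sentence, retrieve the closest target
    sentence by TF-IDF cosine similarity and keep it if the edit rate is low."""
    vec = TfidfVectorizer().fit(target_sentences)
    index = vec.transform(target_sentences)
    pairs = []
    for qi, query in enumerate(translated_queries):
        sims = cosine_similarity(vec.transform([query]), index).ravel()
        best = int(sims.argmax())
        if word_error_rate(query, target_sentences[best]) <= max_wer:
            pairs.append((qi, best))
    return pairs  # (source index, target index) of likely parallel sentences
```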
140

Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos / Evaluation of unsupervised feature selection methods for Text Mining

Bruno Magalhães Nogueira 27 March 2009 (has links)
Feature selection is an activity that is sometimes necessary to obtain good results in machine learning tasks. In Text Mining, reducing the number of features in a text base is essential for the effectiveness of the process and for the comprehensibility of the extracted knowledge, since it deals with high-dimensional, sparse spaces. When the text collection is unlabeled, unsupervised methods for feature reduction have to be used. However, there are no general predefined feature quality measures for unsupervised methods, which demands a greater effort in their execution. This work therefore addresses unsupervised feature selection through an exploratory study of methods of this kind, comparing their efficacy in reducing the number of features in the Text Mining process. Ten methods are compared - Ranking by Term Frequency, Ranking by Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Luhn's Method, the LuhnDF Method, Salton's Method and Zone-Scored Term Frequency - two of which, the LuhnDF Method and Zone-Scored Term Frequency, are proposed in this work. The evaluation is done in two ways: supervised, through the accuracy of four classifiers (C4.5, SVM, KNN and Naïve Bayes), and unsupervised, using the Expected Mutual Information Measure. The evaluation results are submitted to the Kruskal-Wallis statistical test to determine the statistical significance of the performance differences among the compared feature selection methods. Six text bases are used in the experimental evaluation, each related to one broad domain and containing subdomains, which correspond to the classes used for supervised evaluation. With this study, this work aims to contribute to a Text Mining application that extracts topic taxonomies from unlabeled text collections by selecting the most representative features in a text collection. The evaluation results show that there is no statistically significant difference among the compared unsupervised feature selection methods. Moreover, comparisons of these unsupervised methods with supervised ones (Gain Ratio and Information Gain) indicate that it is possible to use the unsupervised methods in supervised Text Mining activities with effectiveness comparable to the supervised methods, since no statistical difference was detected in these comparisons, and at a lower computational cost.
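A minimal sketch of two of the simpler ranking criteria mentioned above, document frequency and term variance, applied to a bag-of-words matrix. The toy corpus and the number of selected terms are illustrative, and the more elaborate methods compared in the thesis (Term Contribution, the Luhn-based methods, Zone-Scored Term Frequency, etc.) are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "unsupervised feature selection for text mining",
    "text mining deals with sparse high dimensional data",
    "feature selection reduces the number of attributes",
]

vec = CountVectorizer()
X = vec.fit_transform(corpus).toarray()        # documents x terms
terms = np.array(vec.get_feature_names_out())

# Ranking by Document Frequency: in how many documents does each term occur?
doc_freq = (X > 0).sum(axis=0)

# Term Variance: variance of each term's frequency across documents
term_var = X.var(axis=0)

k = 5  # keep the k top-ranked terms for each criterion
print("top by document frequency:", terms[np.argsort(doc_freq)[::-1][:k]])
print("top by term variance:     ", terms[np.argsort(term_var)[::-1][:k]])
```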
