1 |
Reordering metrics for statistical machine translation
Birch, Alexandra, January 2011
Natural languages display a great variety of different word orders, and one of the major challenges facing statistical machine translation is modelling these differences. This thesis is motivated by a survey of 110 different language pairs drawn from the Europarl project, which shows that word order differences account for more variation in translation performance than any other factor. This wide-ranging analysis provides compelling evidence for the importance of research into reordering. There has already been a great deal of research into improving the quality of the word order in machine translation output. However, there has been very little analysis of how best to evaluate this research. Current machine translation metrics are largely focused on evaluating the words used in translations, and their ability to measure the quality of word order has not been demonstrated. In this thesis we introduce novel metrics for quantitatively evaluating reordering. Our approach isolates the word order in translations by using word alignments. We reduce alignment information to permutations and apply standard distance metrics to compare the word order in the reference to that of the translation. We show that our metrics correlate more strongly with human judgements of word order quality than current machine translation metrics. We also show that a combined lexical and reordering metric, the LRscore, is useful for training translation model parameters. Humans prefer the output of models trained using the LRscore as the objective function over those trained with the de facto standard translation metric, the BLEU score. The LRscore thus provides researchers with a reliable metric for evaluating the impact of their research on the quality of word order.
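The idea of comparing word order through distances between permutations can be illustrated with a minimal sketch. It assumes the translation's word order has already been reduced to a permutation of reference positions; the encoding and the function names are hypothetical, and Kendall's tau and Hamming distance are shown only as two standard permutation distances, not necessarily in the exact form the thesis uses.

```python
from itertools import combinations


def kendall_tau_distance(perm):
    """Normalised Kendall's tau distance of `perm` from the identity order.

    `perm[i]` is the reference position of the i-th translated word
    (a hypothetical encoding chosen for this sketch).
    0.0 means identical word order, 1.0 means fully reversed order.
    """
    n = len(perm)
    if n < 2:
        return 0.0
    # Count word pairs whose relative order is inverted.
    discordant = sum(1 for i, j in combinations(range(n), 2) if perm[i] > perm[j])
    return discordant / (n * (n - 1) / 2)


def hamming_distance(perm):
    """Fraction of words that are not in their reference position."""
    if not perm:
        return 0.0
    return sum(1 for i, p in enumerate(perm) if i != p) / len(perm)
```

Swapping the first two words of a four-word sentence, `[1, 0, 2, 3]`, inverts one of six word pairs but displaces two of four words, so the two distances penalise the same reordering differently.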
|
2 |
Machine Learning-Based Ontology Mapping Tool to Enable Interoperability in Coastal Sensor Networks
Bheemireddy, Shruthi, 11 December 2009
In today’s world, ontologies are being widely used for data integration tasks and solving information heterogeneity problems on the web because of their capability in providing explicit meaning to the information. The growing need to resolve the heterogeneities between different information systems within a domain of interest has led to the rapid development of individual ontologies by different organizations. These ontologies designed for a particular task could be a unique representation of their project needs. Thus, integrating distributed and heterogeneous ontologies by finding semantic correspondences between their concepts has become the key point to achieve interoperability among different representations. In this thesis, an advanced instance-based ontology matching algorithm has been proposed to enable data integration tasks in ocean sensor networks, whose data are highly heterogeneous in syntax, structure, and semantics. This provides a solution to the ontology mapping problem in such systems based on machine-learning methods and string-based methods.
|
3 |
APPROCHE INTELLIGENTE À BASE DE RAISONNEMENT À PARTIR DE CAS POUR LE DIAGNOSTIC EN LIGNE DES SYSTÈMES AUTOMATISÉS DE PRODUCTION / Intelligent case based reasoning approach for online diagnosis of automated production systems
Ben Rabah, Nourhène, 14 December 2018
Automated production systems (APS) represent an important class of industrial systems that are increasingly complex given the large number of interactions and interconnections between their components. As a result, they are more susceptible to malfunctions, whose consequences can be significant in terms of productivity, safety and quality of production. A major challenge is to develop an intelligent approach that can be used to diagnose these systems and ensure their operational safety. In this thesis, we are interested only in the diagnosis of APS with discrete dynamics. In the first chapter, we present these systems, their possible malfunctions and the diagnosis terminology used. We then present a state of the art of the existing methods for diagnosing this class of systems, together with a synthesis of these methods. This synthesis motivated us to choose a data-based approach that relies on a machine learning technique, Case-Based Reasoning (CBR). Accordingly, the second chapter presents a state of the art on machine learning and its different methods, focusing mainly on CBR and its uses for the diagnosis of industrial systems. This study allowed us to propose in Chapter 3 a case-based decision support system for the diagnosis of APS. This system is based on an offline block and an online block. The offline block defines a case representation format and builds a Normal Case Base (NCB) and a Faulty Case Base (FCB) from a historical database. The online block helps human monitoring operators make the most appropriate diagnosis decision. The results of experiments performed on a sorting system highlighted the critical points of this approach, which lie in the proposed case representation format and in the case base used. To solve these problems and improve the results, a new case representation format is proposed in Chapter 4. According to this format, and from the data acquired from the simulated system after its emulation in normal and faulty modes, the cases of the initial case base are built. A reasoning and incremental learning phase is then presented. This phase allows both the diagnosis of the monitored system and the enrichment of the case base following the appearance of new, unknown behaviors. The experiments presented in Chapter 5, performed on the "turntable", a subsystem of the "sorting system", showed the improvement in the results and made it possible to evaluate and compare the performance of the proposed approach against some machine learning approaches and against a model-based approach to turntable diagnosis.
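The retrieval step of a case-based diagnosis with a normal and a faulty case base can be sketched as follows. This is only an illustration under stated assumptions: the feature-vector case format, the Euclidean distance, the `threshold` parameter and the `"unknown"` outcome are all hypothetical choices for this sketch, not the case representation or reasoning procedure defined in the thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diagnose(observation, normal_cases, faulty_cases, threshold=1.0):
    """Retrieve the nearest stored case and return (label, distance).

    normal_cases: list of feature vectors (hypothetical case format).
    faulty_cases: list of (feature vector, fault label) pairs.
    If no stored case lies within `threshold`, the behaviour is reported as
    "unknown" so that an operator could label it and enrich the case base.
    """
    candidates = [(euclidean(observation, c), "normal") for c in normal_cases]
    candidates += [(euclidean(observation, c), label) for c, label in faulty_cases]
    if not candidates:
        return ("unknown", math.inf)
    distance, label = min(candidates, key=lambda item: item[0])
    if distance > threshold:
        return ("unknown", distance)
    return (label, distance)
```

Returning "unknown" for distant observations mirrors, in a very simplified way, the incremental learning idea described above: behaviours that match no stored case are the ones worth adding to the case base.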
|
4 |
Indexing presentations using multiple media streams
Ruddarraju, Ravikrishna, 15 August 2006
This thesis presents novel techniques to index multiple media streams in a digitally captured presentation. These media streams are related by the common content in a presentation. We use relevance curves to represent these relationships. These relevance curves are generated by using a mix of text processing techniques and distance measures for sparse vocabularies. These techniques are used to automatically detect slide boundaries in a presentation. Accuracy of detecting these boundaries is evaluated as a function of word error rates.
|
5 |
Parametric kernels for structured data analysis
Shin, Young-in, 04 May 2015
Structured representation of input physical patterns as a set of local features has been useful for a variety of robotics and human computer interaction (HCI) applications. It enables a stable understanding of the variable inputs. However, this representation does not fit the conventional machine learning algorithms and distance metrics because they assume vector inputs. Learning from input patterns with variable structure is thus challenging. To address this problem, I propose a general and systematic method to design distance metrics between structured inputs that can be used in conventional learning algorithms. Based on the observation that the geometric distributions of local features over the physical patterns are stable across similar inputs, this is done by combining the local similarities and the conformity of the geometric relationship between local features. The produced distance metrics, called “parametric kernels”, are positive semi-definite and require almost linear time to compute. To demonstrate the general applicability and the efficacy of this approach, I designed and applied parametric kernels to handwritten character recognition, on-line face recognition, and object detection from laser range finder sensor data. Parametric kernels achieve recognition rates competitive with state-of-the-art approaches in these tasks.
|
6 |
Aprendizado semi-supervisionado utilizando modelos de caminhada de partículas em grafos / Semi-supervised learning using walking particles model in graphs
Guerreiro, Lucas [UNESP], 01 September 2017
Machine learning is an area that has been growing in recent years and is one of the highlights of the field of Artificial Intelligence. Currently, one of the most studied subareas is semi-supervised learning, mainly because of its lower cost in labeling sample data. The most active category in this subarea is that of graph-based models, which make use of complex network structures. The Particle Competition and Cooperation in Networks algorithm (PCC) is one of the techniques in this field. The algorithm provides classification accuracy comparable to that of state-of-the-art algorithms, at a lower computational cost than most methods in the same category. This work presents a study of semi-supervised learning, with emphasis on graph-based models and, in particular, on the PCC algorithm. The objective of this work is to propose a new particle competition and cooperation algorithm based on this model, with changes to the walk through the graph in which dominance information is kept on the edges instead of the nodes, and which may improve classification accuracy or runtime in some scenarios. A methodology is also proposed for evaluating the network obtained with the particle competition and cooperation model, in order to identify the best distance metric to apply in each case. In the experiments presented in this work, the proposed algorithm achieved better accuracy than PCC on some datasets, while the distance metric evaluation method also reached a good level of precision in most cases.
|
7 |
Machine Learning Methods For Using Network Based Information In Microrna Target Prediction
Sualp, Merter, 01 February 2013
Computational microRNA (miRNA) target identification in animal genomes is a challenging problem due to the imperfect pairing of the miRNA with the target site. Techniques based on sequence alone are prone to produce many false positive interactions. Therefore, integrative techniques have been developed to utilize additional genomic and structural features and evolutionary conservation information for reducing the high false positive rate. We propose that the context of a putative miRNA target in a protein-protein interaction (PPI) network can be used as an additional filter in a computational miRNA target prediction algorithm. We compute several graph theoretic measures on the human PPI network as indicators of network context. We assess the performance of individual and combined contextual measures in increasing the precision of a popular miRNA target prediction tool, TargetScan, using low throughput and high throughput datasets of experimentally verified human miRNA targets. We used classification algorithms for that assessment. Since only miRNA targets exist as training samples, this problem becomes a One Class Classification (OCC) problem. We devised a novel OCC method, DiVo, based on simple distance metrics and voting. Comparative analysis with state-of-the-art methods shows that DiVo attains better classification performance. Our eventual results indicate that topological properties of target gene products in PPI networks are valuable sources of information for filtering out false positive miRNA target genes. We show that, for targets of a number of miRNAs, network context correlates better with being a target than a sequence-based score provided by the prediction tool.
|
9 |
Diff pro multimediální dokumenty / Multimedia Document Type Diff
Lang, Jozef, January 2012
The development of the Internet and its massive spread have resulted in an increased volume of multimedia data. This increase raises the need for efficient similarity detection between multimedia files, for the purpose of preventing and detecting violations of copyright licenses or for detecting similar or duplicate files. This thesis discusses the current options in the field of content-based image and video comparison and focuses on feature extraction techniques, distance metrics, and the design and implementation of the mediaDiff application module for the content-based comparison of video files.
|
10 |
Viewership forecast on a Twitch broadcast: Using machine learning to predict viewers on sponsored Twitch streams
Malm, Jonas, Friberg, Martin, January 2022
Today, the video game industry is larger than the sports and film industries combined, and Twitch, the largest streaming platform with an average of 2.8 million concurrent viewers, offers gaming and non-gaming brands the possibility to market their products. Estimating streamers’ viewership is central to these marketing campaigns, but no large-scale studies have previously been conducted to predict viewership. This paper evaluates three machine learning algorithms with regard to three error metrics, MAE, MAPE and RMSE, and presents novel features for predicting viewership. Different models are chosen through recursive feature elimination using k-fold cross-validation with respect to MAE and MAPE separately. The models are evaluated on an independent test set and show promising results, on par with manual expert predictions. None of the models can be said to be significantly better than another. XGBoost optimized for MAPE obtained the lowest MAE of 282.54 and the lowest MAPE of 41.36% on the test set, in comparison to expert predictions with 288.06 MAE and 83.05% MAPE. Furthermore, the study illustrates the importance of past viewership and streamer variety in predicting future viewership.
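The three error metrics named in the abstract have standard definitions, sketched below for reference. The function names and the list-based interface are choices made for this sketch, and MAPE is written under the assumption that no actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same unit as the data (here: viewers)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero.
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalises large misses more than MAE does."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Because MAPE scales each error by the actual value, misses on small streams weigh heavily, which is how two sets of predictions can have nearly equal MAE (282.54 versus 288.06) yet very different MAPE (41.36% versus 83.05%).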
|