241 |
Exploring the Usage of Neural Networks for Repairing Static Analysis Warnings / Utforsking av användningen av neurala nätverk för att reparera varningar för statisk analys
Lohse, Vincent Paul, January 2021
C# provides static analysis libraries for template-based code analysis and code fixing. These libraries have been used by the open-source community to generate numerous NuGet packages for different use cases. However, due to the unstructured vastness of these packages, it is difficult to find the ones required for a project, and new analyzers and fixers take time and effort to create. This thesis therefore proposes a neural network that firstly imitates existing fixers and secondly extrapolates to fixes of unseen diagnostics. To do so, the state of the art of static analysis NuGet packages is examined and used to generate a dataset of 24,622 data points, each pairing a diagnostic with its corresponding code fix. Since many C# fixers apply formatting changes, all formatting is preserved in the dataset. Furthermore, since the fixers also apply identifier changes, the tokenization of the dataset is varied between splitting identifiers by camelCase and preserving them. The neural network uses a sequence-to-sequence learning approach with the Transformer model; it takes file context, diagnostic message, and location as input and predicts a diff as output. It is capable of imitating 46.3% of the fixes, normalized by diagnostic type, and for data points with unseen diagnostics it is able to extrapolate to 11.9% of normalized data points. For both experiments, splitting identifiers by camelCase produces the best results. Lastly, it is found that a higher proportion of formatting tokens in the input has a minimal positive impact on prediction success rates, whereas the proportion of formatting in the output has no impact.
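As an illustration of the tokenization variant the abstract describes, the following minimal Python sketch splits identifiers by camelCase while keeping whitespace as explicit tokens so that formatting is preserved. It is an illustrative assumption of how such preprocessing could look, not the thesis's actual pipeline; the placeholder tokens (<SP>, <NL>, <TAB>) and the regular expressions are invented for the example.

```python
import re

def split_camel_case(identifier):
    """Split an identifier into camelCase sub-tokens,
    e.g. 'maxLineLength' -> ['max', 'Line', 'Length']."""
    return re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', identifier)

def tokenize(source):
    """Tokenize source code while keeping whitespace as explicit tokens,
    so that a predicted diff can reproduce the original formatting."""
    tokens = []
    # Coarse lexer: identifiers, numbers, whitespace runs, or single punctuation chars.
    for match in re.finditer(r'[A-Za-z_][A-Za-z0-9_]*|\d+|\s+|.', source):
        tok = match.group()
        if tok.isspace():
            # Preserve formatting by encoding each whitespace character explicitly.
            tokens.append(tok.replace(' ', '<SP>').replace('\n', '<NL>').replace('\t', '<TAB>'))
        elif tok[0].isalpha() or tok[0] == '_':
            tokens.extend(split_camel_case(tok))
        else:
            tokens.append(tok)
    return tokens

print(tokenize("int maxValue = GetMaxValue();"))
# ['int', '<SP>', 'max', 'Value', '<SP>', '=', '<SP>', 'Get', 'Max', 'Value', '(', ')', ';']
```

The "preserving identifiers" variant would simply skip the `split_camel_case` call, which is the comparison the thesis's experiments vary.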
|
242 |
[pt] O JOGO DA AVALIAÇÃO: UM ESTUDO PRÁTICO SOBRE TRADUÇÃO AUTOMÁTICA / [en] THE ASSESSMENT GAME: A PRACTICAL STUDY ON MACHINE TRANSLATION
17 July 2019
[en] This work discusses translation assessment practices, most importantly machine translation (MT) assessment. Relevant literature on MT assessment is discussed, from the evaluation of MT systems to the design of error typologies for MT assessment. The study encompasses a series of three MT quality assessment experiments based on the language games described by Wittgenstein in his Philosophical Investigations. Game 1 involved post-editing two patent documents using different MT systems, then evaluating the translator's productivity and analyzing the edits made. The quantitative results showed the positive impact of MT use on productivity. The qualitative results included a typology of the edits made during the post-editing task in both MT systems, and they showed that the increased productivity did not come at the expense of final translation quality. Game 2 compares two translations of the same text produced several years apart by the same translator: the former using a translation memory (CAT) tool, the latter by post-editing MT output. Both served as the basis for discussing the quality of translations produced in different systems and under different working conditions. Game 3 compares the post-edited documents produced in Game 1 to the corresponding patent applications as filed with Brazil's National Institute of Industrial Property (INPI), in order to evaluate the impact of MT on the quality of the post-edited documents. The results of Games 2 and 3 pointed to a superior quality of the post-edited translations produced in this study compared with translations produced in professional contexts.
|
243 |
Exploiting Multilingualism and Transfer Learning for Low Resource Machine Translation / 低リソース機械翻訳における多言語性と転移学習の活用
Prasanna, Raj Noel Dabre, 26 March 2018
Kyoto University / 0048 / Doctor by coursework (new system) / Doctor of Informatics / Degree No. Kō 21210 / Jōhaku No. 663 / Call number 新制||情||114 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Prof. Sadao Kurohashi; Prof. Tatsuya Kawahara; Prof. Shinsuke Mori / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
244 |
An Initial Investigation of Neural Decompilation for WebAssembly / En Första Undersökning av Neural Dekompilering för WebAssembly
Benali, Adam, January 2022
WebAssembly is a new World Wide Web standard used as a compilation target and meant to enable high-performance applications. As it becomes more popular, the need for corresponding decompilers increases, for instance for security purposes. However, building an accurate decompiler capable of restoring the original source code is a complicated task. Recently, Neural Machine Translation (NMT) has been proposed as an alternative to traditional decompilers, which involve a lot of manual and laborious work. We investigate the viability of NMT for decompiling WebAssembly binaries to C source code. We experiment with the state-of-the-art Transformer and with LSTM sequence-to-sequence (Seq2Seq) models with attention. We build a custom, randomly generated dataset of WebAssembly-to-C source code pairs and use different metrics to quantitatively evaluate the performance of the models. Our implementation consists of several processing steps with the WebAssembly input and the C output as endpoints. The results show that the Transformer outperforms the LSTM-based neural model. Moreover, while the model restores the syntax and control-flow structure with up to 95% accuracy, it is incapable of recovering the data flow. The different benchmarks on which we run our evaluation indicate a drop in decompilation accuracy as the cyclomatic complexity and the nesting of the programs increase. Nevertheless, our approach has a lot of potential, encouraging its use in future work.
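The reported drop in accuracy as cyclomatic complexity and nesting grow can be checked with a simple bucketing analysis. The Python sketch below is a rough illustration under stated assumptions: cyclomatic complexity is approximated by counting branch constructs and nesting by brace depth, which only loosely stands in for a real control-flow analysis of the benchmark programs, and the sample data is invented.

```python
import re
from collections import defaultdict

BRANCHES = re.compile(r'\b(?:if|for|while|case)\b|&&|\|\|')

def cyclomatic_complexity(c_source):
    """McCabe-style proxy: 1 + number of branching constructs in the text."""
    return 1 + len(BRANCHES.findall(c_source))

def max_nesting(c_source):
    """Maximum brace-nesting depth; could serve as an alternative bucket key."""
    depth = peak = 0
    for ch in c_source:
        if ch == '{':
            depth += 1
            peak = max(peak, depth)
        elif ch == '}':
            depth -= 1
    return peak

def accuracy_by_complexity(samples):
    """samples: iterable of (c_source, decompiled_correctly) pairs."""
    buckets = defaultdict(lambda: [0, 0])
    for source, correct in samples:
        stats = buckets[cyclomatic_complexity(source)]
        stats[0] += int(correct)
        stats[1] += 1
    return {cc: hits / total for cc, (hits, total) in sorted(buckets.items())}

samples = [
    ("int f(int x) { if (x > 0) { return x; } return -x; }", True),
    ("int g(int n) { for (int i = 0; i < n; i++) { if (i % 2) { n++; } } return n; }", False),
]
print(accuracy_by_complexity(samples))  # {2: 1.0, 3: 0.0}
```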
|
245 |
[en] AUTOMATIC TRANSLATION IN TRANSLATION MEMORY SYSTEMS: A COMPARATIVE STUDY OF TWO WORK METHODS / [pt] TRADUÇÃO AUTOMÁTICA EM AMBIENTES DE MEMÓRIA DE TRADUÇÃO: UM ESTUDO COMPARATIVO DE DOIS MÉTODOS DE TRABALHO
JORGE MARIO DAVIDSON, 26 October 2021
[en] This dissertation addresses the use of machine translation in translation memory (CAT) systems, a fast-growing modality of work in today's specialized translation market. An experimental study was conducted with four professional translators specializing in the field of computing. Each professional translated two texts, one about technology marketing and the other a highly technical document, using different modalities of work. The purpose of the study was to identify any differences between using machine translation with segment-level post-editing and using machine translation as sub-segment-level suggestions. The translations were analyzed with computational-linguistics resources using the following metrics: lexical variety, lexical density, edit distance computed over sequences of part-of-speech tags, and productivity. For comparative purposes, the experimental study included 100 percent human translations and machine translations that were not post-edited. The metrics employed revealed differences in results attributable to the modalities of work and allowed for comparing the effects on the different types of texts translated. Finally, the multiple translations of one of the texts were submitted to readers for evaluation, to determine their preferences.
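Two of the metrics mentioned, lexical variety (type-token ratio) and lexical density (the share of content words), are simple to compute. The following Python sketch assumes a naive whitespace tokenizer and a small illustrative function-word list; the study's actual computational-linguistics tooling is not specified in the abstract.

```python
def lexical_metrics(text, function_words):
    """Return (lexical variety, lexical density) for a text."""
    tokens = [t.lower() for t in text.split()]   # naive whitespace tokenization
    variety = len(set(tokens)) / len(tokens)     # type-token ratio
    content = [t for t in tokens if t not in function_words]
    density = len(content) / len(tokens)         # proportion of content words
    return variety, density

# Tiny illustrative closed-class list for Portuguese (an assumption).
FUNCTION_WORDS_PT = {"o", "a", "os", "as", "de", "do", "da", "em", "que", "e", "um", "uma"}

texto = "a tradução automática produz um texto que o tradutor revisa"
print(lexical_metrics(texto, FUNCTION_WORDS_PT))  # (1.0, 0.6)
```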
|
246 |
Проблемы эквивалентного машинного перевода фразеологизмов (на материале вьетнамского и русского языков) : магистерская диссертация / Problems of equivalent machine translation of phraseological units (on the material of the Vietnamese and Russian languages): master's thesis
Нгуен, Т. Т. Х. (Nguyen, T. T. H.), January 2019
The paper deals with the problems of achieving equivalence when translating phraseological units in the Vietnamese-Russian language pair. People use phraseological units both in speech and in writing, so it is necessary that machine translation systems translate them correctly and clearly, selecting appropriate equivalents where possible. The paper presents tools for the machine translation of phraseological units, analyzes the past, present, and future of machine translation, and outlines the prospects for translating phraseological units by machine.
|
247 |
Head-to-head Transfer Learning Comparisons made Possible: A Comparative Study of Transfer Learning Methods for Neural Machine Translation of the Baltic Languages
Stenlund, Mathias, January 2023
The struggle to train adequate MT models with data-hungry NMT frameworks for low-resource language pairs has created a need to alleviate the scarcity of sufficiently large parallel corpora. Different transfer learning methods have been introduced as possible solutions to this problem, in which a new model for a target task is initialized with parameters learned from some other high-resource task. Many of these methods are claimed to increase the translation quality of NMT systems in some low-resource environments; however, the evidence typically comes from experiments that differ in parent and child language pairs, data sizes, NMT frameworks, and training hyperparameters, which makes the methods impossible to compare. In this thesis project, three such transfer learning methods are put head-to-head in a controlled environment where the target task is translating from the under-resourced Baltic languages Lithuanian and Latvian into English. In this controlled environment, the same parent language pairs, data sizes, data domains, Transformer framework, and training parameters are used to ensure fair comparisons between the three transfer learning methods. The experiments involve training and testing models using all combinations of transfer learning method, parent language pair, and in-domain or out-domain data, for an extensive study in which different strengths and weaknesses are observed. The results show that Multi-Round Transfer Learning improves overall translation quality the most but, at the same time, requires by far the longest training time. The parameter-freezing method provides a marginally lower overall improvement in translation quality but requires only half the training time, while Trivial Transfer Learning improves quality the least. Both Polish and Russian work well as parents for the Baltic languages, while web-crawled data improves out-domain translations the most. The results suggest that all three transfer learning methods are effective in a simulated low-resource environment; however, none of them can compete with simply having a larger target-language-pair dataset, as none of them overcomes the strong higher-resource baseline.
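Of the three methods compared, parameter freezing is the most direct to illustrate in code. The sketch below uses PyTorch (an assumption; the abstract names only a Transformer framework) and freezes the decoder after initializing from a parent checkpoint, so that only the remaining parameters adapt to the low-resource child pair. Which components to freeze is a design choice, not the thesis's exact recipe, and the checkpoint path is hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-in for an encoder-decoder NMT model.
model = nn.ModuleDict({
    "encoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=2),
    "decoder": nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=256, nhead=4), num_layers=2),
})

# 1) Initialize from a high-resource parent checkpoint (hypothetical path,
#    e.g. a Polish-to-English parent model).
# model.load_state_dict(torch.load("parent_pl_en.pt"))

# 2) Freeze the decoder: the target side stays English in parent and child,
#    so only source-side parameters are allowed to adapt.
for param in model["decoder"].parameters():
    param.requires_grad = False

# 3) Train only the unfrozen parameters on the low-resource child pair.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=5e-4)
print(f"trainable tensors: {len(trainable)}")
```

Trivial transfer learning corresponds to step 1 alone (fine-tune everything), while multi-round transfer learning repeats the initialize-and-train cycle through several intermediate pairs, which explains its longer training time.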
|
248 |
Towards Digitization and Machine Learning Automation for Cyber-Physical System of Systems
Javed, Saleha, January 2022
Cyber-physical systems (CPS) connect the physical and digital domains and are often realized as spatially distributed. CPS are built on the Internet of Things (IoT) and Internet of Services, which use cloud architecture to link a swarm of devices over a decentralized network. Modern CPS are undergoing a foundational shift as Industry 4.0 continually expands its boundaries of digitization. From automating the industrial manufacturing process to interconnecting sensor devices within buildings, Industry 4.0 is about developing solutions for the digitized industry. Extensive engineering effort goes into designing dynamically scalable and robust automation solutions that have the capacity to integrate heterogeneous CPS. Such heterogeneous systems must be able to communicate and exchange information with each other in real time even if they are based on different underlying technologies, protocols, or semantic definitions in the form of ontologies. This development is subject to interoperability challenges and knowledge gaps that engineers and researchers address; in particular, machine learning approaches are considered to automate costly engineering processes. For example, challenges related to predictive maintenance operations and to the automatic translation of messages transmitted between heterogeneous devices are investigated using supervised and unsupervised machine learning approaches. In this thesis, a machine-learning-based, collaboration- and automation-oriented IIoT framework named Cloud-based Collaborative Learning (CCL) is developed. CCL is based on a service-oriented architecture (SOA), offering a scalable CPS framework that provides Machine-Learning-as-a-Service (MLaaS). Furthermore, interoperability in the context of the IIoT is investigated. I consider the ontology of an IoT device to be its language, and the structure of that ontology to be its grammar. In particular, the use of aggregated language and structural encoders is investigated to improve the alignment of entities in heterogeneous ontologies. Existing entity-alignment techniques are based on different approaches to integrating structural information, and they overlook the fact that even if a node pair has similar entity labels, the nodes may not belong to the same ontological context, and vice versa. To address these challenges, an iterative model for the alignment of heterogeneous IIoT ontologies is developed, based on a modification of the BERT_INT model that operates on graph triples and enables the alignment of both nodes and relations. Compared to the state-of-the-art BERT_INT on the DBP15K language dataset, the developed model exceeds the baseline by 2.1% in (HR@1/10, MRR). This motivated the development of a proof of concept for an empirical investigation of the developed model for alignment between heterogeneous IIoT ontologies. For this purpose, a dataset was generated from smart-building systems and from the SOSA and SSN ontology graphs. Experiments and analysis, including an ablation study of the proposed language and structural encoders, demonstrate the effectiveness of the model. The suggested approach also highlights prospective future studies that may extend beyond the scope of a single thesis. For instance, to strengthen the ablation study, a generalized IoT ontology designed for any type of IoT device (beyond sensors), such as SAREF, could be tested for ontology alignment. Another potential direction is to conduct a crowdsourcing process for generating a validation dataset for IIoT ontology alignment and annotation. Lastly, this work can be considered a step towards enabling translation between heterogeneous IoT sensor devices; the proposed model could therefore be extended into a translation module that, based on the ontology graph of any device, interprets the messages transmitted by that device. This idea is still at an abstract level and needs extensive effort and empirical study to reach full maturity.
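The central alignment idea, scoring candidate entity pairs by combining a language (label) similarity with a structural (neighborhood) similarity in the style of BERT_INT-like interaction models, can be sketched as follows. This is a schematic under loud assumptions: character-trigram bags stand in for BERT label embeddings, Jaccard overlap of neighbor sets stands in for the structural encoder, and the entity names and weighting are invented for illustration.

```python
from collections import Counter
import math

def label_vec(label):
    """Character-trigram bag as a cheap stand-in for a BERT label embedding."""
    return Counter(label[i:i + 3] for i in range(len(label) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def structural_sim(neigh_a, neigh_b):
    """Jaccard overlap of neighbor label sets (stand-in for a structure encoder)."""
    union = neigh_a | neigh_b
    return len(neigh_a & neigh_b) / len(union) if union else 0.0

def alignment_score(ent_a, ent_b, graph_a, graph_b, alpha=0.7):
    """Convex combination of label similarity and neighborhood similarity."""
    lang = cosine(label_vec(ent_a), label_vec(ent_b))
    struct = structural_sim(graph_a[ent_a], graph_b[ent_b])
    return alpha * lang + (1 - alpha) * struct

# Entity -> set of neighbor labels, extracted from (invented) ontology triples.
graph_a = {"TemperatureSensor": {"Room", "Celsius"}}
graph_b = {"TempSensor": {"Room", "Celsius"}, "Actuator": {"Valve"}}
for cand in graph_b:
    print(cand, round(alignment_score("TemperatureSensor", cand, graph_a, graph_b), 3))
```

The point the abstract makes is visible here: two nodes with similar labels but disjoint neighborhoods (or vice versa) score lower than pairs that agree on both signals, which is why aggregating the two encoders helps.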
|
249 |
Classroom Translanguaging Practices and Secondary Multilingual Learners in Indiana
Woongsik Choi, 20 July 2023
Many multilingual learners who use a language other than English at home face academic challenges stemming from the English monolingualism prevalent in the U.S. school system. English as a New Language (ENL) programs teach English to these learners while playing a role in reinforcing English monolingualism. For educational inclusivity and equity for multilingual learners, it is imperative to center their holistic language repertoires in ENL classrooms; however, this can be challenging due to individual and contextual factors. Using translanguaging as a conceptual framework, this qualitative case study explores how high school multilingual learners' languages are flexibly used in ENL classes and how the students think about such classroom translanguaging practices. I used ethnographic methods to observe ENL classroom activities and instructional practices, interview the participants, and collect photos and documents in a high school in Indiana for a semester. The participants were an English-Spanish proficient ENL teacher and four students from Puerto Rico, Mexico, Honduras, and the Democratic Republic of the Congo, whose language repertoires included Spanish, Lingala, French, Arabic, and English. The findings describe the difficulties and possibilities of incorporating all students' multilingual-multisemiotic repertoires in ENL classes. The classroom language practices consisted primarily of Spanish and drawing; some instructional activities and practices, such as the multigenre identity project and the teacher's use of Google Translate, integrated the students' multilingual-multisemiotic repertoires well. When the students engaged in English writing, they frequently used machine translation, such as Google Translate, through dynamic processes involving evaluation. While the students perceived such classroom translanguaging practices generally positively, they considered using machine translation as a problem, a resource, or an opportunity. With these findings, I argue that multilingual learners' competence in using their own languages and machine translation technology freely and flexibly is a valuable resource for learning and should be encouraged and developed in ENL classrooms. To do so, ENL teachers should use instructional activities and practices that take students' dynamic multilingualism into account. TESOL teacher education should develop such competence in teachers, and more multilingual resources should be provided to them. In the case of a multilingual classroom with singleton students, building mutual understanding, empathy, and equity-mindedness among class members should be prioritized. Finally, I recommend that evolving multilingual technologies, such as machine translation, be actively used as teaching and learning resources for multilingual learners.
|
250 |
Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática / Contributions to Connectionist Language Modeling and Its Application to Sequence Recognition and Machine Translation
Zamora Martínez, Francisco Julián, 07 December 2012
Natural language processing is an application area of artificial intelligence, in particular of pattern recognition, that studies, among other things, how to incorporate syntactic information (a language model) about how the words of a given language should be combined, so that recognition/translation systems can decide which hypothesis is the best "common sense" one. It is a very broad area, and this work focuses only on the part related to language modeling and its application to several tasks: sequence recognition with hidden Markov models, and statistical machine translation.

Specifically, this thesis centers on so-called connectionist language models, that is, language models based on neural networks. The good results of these models in several areas of natural language processing motivated this study.

Due to certain computational problems from which connectionist language models suffer, the systems that appear in the literature are built in two completely decoupled stages. In the first stage, a set of feasible hypotheses is found with a standard language model, assuming that this set is representative of the search space in which the best hypothesis lies. In the second stage, the connectionist language model is applied to this set and the highest-scoring hypothesis is extracted. This procedure is called "rescoring"; a sketch of it follows below.

This scenario motivates the main objectives of this thesis:
- To propose a technique that can drastically reduce this computational cost while degrading the quality of the solution found as little as possible.
- To study the effect of integrating connectionist language models into the search process of the proposed tasks.
- To propose modifications of the original model that improve its quality.

Zamora Martínez, FJ. (2012). Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18066
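The two-stage "rescoring" procedure described above, generating an n-best list with a standard model and then re-ranking it with the connectionist language model, can be sketched in a few lines of Python. The toy scoring function below merely stands in for a real neural language model, and the log-linear interpolation weight is an assumed hyperparameter rather than the thesis's configuration.

```python
import math

def neural_lm_logprob(sentence):
    """Placeholder for a connectionist language model score; a real system
    would return the log-probability assigned by the neural network."""
    return -sum(math.log(1 + len(word)) for word in sentence.split())

def rescore(nbest, weight=0.5):
    """nbest: list of (hypothesis, baseline_score) from the first decoding pass.
    Returns hypotheses re-ranked by a log-linear interpolation of the baseline
    score (e.g. from an HMM recognizer or SMT decoder) and the neural LM score."""
    rescored = [(hyp, (1 - weight) * base + weight * neural_lm_logprob(hyp))
                for hyp, base in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

nbest = [("the cat sat on the mat", -4.1),
         ("the cat sat in the mat", -3.9),
         ("the cat sits on the mat", -4.4)]
for hyp, score in rescore(nbest):
    print(f"{score:8.3f}  {hyp}")
```

The computational cost the thesis targets comes from evaluating `neural_lm_logprob` on every hypothesis; the proposed techniques aim to cut that cost, or to integrate the neural model directly into the first-pass search instead of this decoupled second pass.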
|