371 |
Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries
Black, Kevin P, 01 November 2015
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continually refining models has seldom been done except for the simplest of models which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since dictionary entry headwords are usually lemmas. 
The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation as would be done by simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries over time (i.e., new labels) and also facilitates learning from partially annotated sentences which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
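The hybrid design described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: known tokens are resolved by lookup in previously annotated data, and unknown tokens are passed to a separate model. A trivial suffix-stripping rule stands in for the MEMM, and every name (`HybridLemmatizer`, `annotate`, `suggest`, `suffix_fallback`) and the English example data are hypothetical.

```python
from collections import Counter, defaultdict


def suffix_fallback(token):
    """Stand-in for the unknown-word model (an MEMM in the thesis)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


class HybridLemmatizer:
    """Known tokens: most frequent annotated lemma. Unknown tokens: fallback model."""

    def __init__(self, fallback=suffix_fallback):
        self.counts = defaultdict(Counter)  # token -> Counter of lemmas seen so far
        self.fallback = fallback

    def annotate(self, token, lemma):
        """Record one new annotation incrementally, without rebuilding the table."""
        self.counts[token][lemma] += 1

    def suggest(self, token):
        """Return a pre-annotation (suggested lemma) for a token."""
        if token in self.counts:
            return self.counts[token].most_common(1)[0][0]
        return self.fallback(token)


lemmatizer = HybridLemmatizer()
print(lemmatizer.suggest("walked"))  # unknown token: handled by the fallback rule
lemmatizer.annotate("went", "go")    # the annotator supplies a label
print(lemmatizer.suggest("went"))    # known token: handled by lookup
```

Because each annotation only updates a counter, suggestions improve as the annotator works, which is the interactive behaviour the abstract describes; the real MEMM needs the incremental training algorithm for the same effect.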
|
372 |
[en] AUTOMATIC INFORMATION EXTRACTION: A DISTANT READING OF THE BRAZILIAN HISTORICAL-BIOGRAPHICAL DICTIONARY (DHBB) / [pt] EXTRAÇÃO AUTOMÁTICA DE INFORMAÇÕES: UMA LEITURA DISTANTE DO DICIONÁRIO HISTÓRICO-BIOGRÁFICO BRASILEIRO (DHBB)
SUEMI HIGUCHI, 10 September 2021
This research applies natural language processing (NLP) techniques to the domain of history. Its object of investigation is the Brazilian Historical-Biographical Dictionary (DHBB), an encyclopedic work conceived by the Centro de Pesquisa e Documentação de História Contemporânea do Brasil (CPDOC) of Fundação Getulio Vargas (FGV). The goal was to create, from the DHBB, an annotated corpus for automatic information extraction that is relevant to the Digital Humanities and enables "distant readings" of contemporary Brazilian politics. The complete process comprises morphosyntactic analysis of the material, identification of entities relevant to the domain, addition of semantic annotation to the corpus, definition of semantic relations of interest to the research, and mapping of the lexical-syntactic patterns found in these relations. These steps prepare the texts for the identification of structures of interest, isolating the relevant information and presenting it in structured form. To test and evaluate the productivity of a set of patterns on the DHBB, three topics were selected: the age at which the biographees entered political life, academic training, and family ties. The assumption is that lexical-syntactic patterns make it possible to extract high-quality information for the domain of History from an annotated corpus of the encyclopedic genre. In the evaluation, the patterns for extracting the biographees' year of birth reached an F-measure of 99 per cent, the patterns for family relationships reached an F-measure of 84 per cent, and the accuracy for information on academic training was 99.1 per cent.
These extractions, in turn, allowed a distant reading of the DHBB data that shows: i) a drop in the average age at which politicians enter public life, increasingly below 40, mainly among those born from the 1960s onward; ii) a sharp decline in military training, especially for the post-1920 generations, showing that civilian training was replacing military training as a path to important political positions; and iii) family ties in politics as a phenomenon that persists over time at very significant rates, often representing more than 50 per cent of the members of certain categories. The main contributions of the thesis are: the creation of an annotated corpus of the encyclopedic genre, made available for linguistic and humanities studies; a methodology based on a philosophy of cyclic enrichment, in which new information is added to the corpus itself as it is obtained, improving extraction; and a set of productive patterns that can be adapted to any corpus containing the same type of annotations.
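As a rough illustration of pattern-based extraction and its evaluation, the sketch below applies one lexical pattern to invented biography snippets and scores the result with a balanced F-measure. It is only a sketch under stated assumptions: the regular expression, the function names (`extract_birth_year`, `f_measure`), the sample entries and the gold years are all hypothetical, and the thesis works with patterns over an annotated Portuguese corpus rather than raw English text.

```python
import re

# Hypothetical pattern: the first four-digit year after "was born" in the same sentence.
BIRTH_PATTERN = re.compile(r"was born\b[^.]*?\b(1[89]\d{2}|20\d{2})\b")


def extract_birth_year(text):
    """Return the first year following 'was born', or None if the pattern fails."""
    match = BIRTH_PATTERN.search(text)
    return int(match.group(1)) if match else None


def f_measure(predicted, gold):
    """Balanced F-measure; a prediction counts only if the extracted year is correct."""
    true_pos = sum(1 for key, year in predicted.items() if gold.get(key) == year)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example entries and gold annotations (not taken from the DHBB).
entries = {
    "a": "Ana Silva was born in Recife on 3 May 1921. She studied law.",
    "b": "Born to a family of farmers, João Lima entered politics in 1950.",
    "c": "Carlos Souza was born in 1934 in Porto Alegre.",
}
gold = {"a": 1921, "b": 1925, "c": 1934}

predicted = {}
for key, text in entries.items():
    year = extract_birth_year(text)
    if year is not None:
        predicted[key] = year

print(predicted)
print(round(f_measure(predicted, gold), 2))
```

Entry "b" shows the usual trade-off: a strict pattern keeps precision high but misses biographies that phrase the birth differently, which lowers recall.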
|
373 |
Информационно-коммуникационные технологии как средства преподавания основ православной культуры младшим подросткам : магистерская диссертация / Information and communication technologies in teaching Foundations of orthodox culture to junior teenagers: master's thesis
Потапова, А. В., Potapova, A. V., January 2016
The dissertation focuses on the challenges of teaching the “Foundations of Orthodox Culture” course to junior teenagers in general education schools. Based on an analysis of the age-specific psychological, pedagogical and cultural traits of younger teenagers as a distinct cultural group, the author examines the main challenges of teaching the “Foundations of Orthodox Culture” subject in Russian general education schools. As a way of addressing some of these challenges, the author has designed a digital encyclopedic dictionary aimed at presenting course material in a modern way (i.e., using information and communication technologies) both in lessons and in extracurricular activities.
|
374 |
Разработка англо-русского частотного словаря стоматологических терминов : магистерская диссертация / The development of an English-Russian frequency dictionary of dental terms: master's thesis
Грачева, Е. В., Gracheva, E. V., January 2022
The thesis is devoted to the development of an English-Russian frequency dictionary of dental terms, built by lemmatizing a corpus of texts compiled in the Sketch Engine program. The theoretical part, which covers the specifics of compiling frequency dictionaries and gives a detailed analysis of dental terminology and its translation, served as the basis for the practical work. Thanks to lemmatization and subsequent manual work, only specialized dental vocabulary was included in the frequency dictionary, leaving out commonly used vocabulary; this reduced the size of the dictionary and improved its quality.
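The core idea, counting lemmas rather than word forms and keeping only domain terms, can be sketched as follows. This is an illustration under stated assumptions rather than the author's workflow: the lemma table, the small glossary and the function name `frequency_dictionary` are hypothetical stand-ins for the lemmatizer output and the manual term selection described in the abstract.

```python
from collections import Counter

# Hypothetical lemma table and term glossary; a real project would rely on a
# lemmatizer and a manually curated term list.
LEMMAS = {"teeth": "tooth", "crowns": "crown", "implants": "implant", "patients": "patient"}
GLOSSARY = {"tooth": "зуб", "crown": "коронка", "implant": "имплантат"}


def frequency_dictionary(tokens):
    """Count lemmas, keep only domain terms, and sort by descending frequency."""
    counts = Counter(LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens)
    terms = [(lemma, GLOSSARY[lemma], n) for lemma, n in counts.items() if lemma in GLOSSARY]
    return sorted(terms, key=lambda row: (-row[2], row[0]))


text = "Crowns protect teeth . The implant and the crown restore a tooth for patients"
for lemma, translation, count in frequency_dictionary(text.split()):
    print(lemma, translation, count)
```

Merging "crowns" with "crown" and "teeth" with "tooth" before counting is what keeps the dictionary short: each term appears once, with the combined frequency of all its forms.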
|
375 |
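La terminologia catalana dels incendis forestals. Recerca, anàlisi i proposta de diccionari especialitzat català-castellà-anglès [Catalan forest fire terminology: research, analysis and a proposal for a specialized Catalan-Spanish-English dictionary]
Gil Puig, Adriana, 06 September 2022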
Forest fires are a highly topical issue in the Mediterranean basin. They have gone from being a traditional element of agroforestry management to being the main threat to the survival of forests, as a consequence of climate change and the abandonment of rural areas. This thesis addresses the Catalan terminology of this field with three aims: to compile it, to characterize it, and to offer a terminological application adapted to professional needs. From the theoretical point of view, it falls within the framework of the Communicative Theory of Terminology (Cabré, 1999) and builds on research on specialized languages, terminology and terminological work, which takes terms as its main object of study and dictionaries as its prototypical product. From the empirical point of view, it carries out systematic monolingual terminographic research with equivalents, following a methodology that is widely tested in Catalan terminology. First, the terminological needs of forest fire specialists and their professional context are analyzed. Second, a corpus of specialized texts is built, from which terms are extracted semi-automatically with the terminological workstation Terminus, following a previously established field tree. A terminological file of more than a thousand terms selected from the field is then compiled and completed with definitions, subfields, contexts, equivalents and other information. Third, the resulting list of terms is analyzed from a formal, semantic, neological and contrastive point of view in order to establish its distinctive traits. Finally, a proposal is developed for a Catalan-Spanish-English terminological dictionary of forest fires aimed at expert staff.
Gil Puig, A. (2022). La terminologia catalana dels incendis forestals. Recerca, anàlisi i proposta de diccionari especialitzat català-castellà-anglès [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/186183
|
376 |
Slovotvorba v německo-českém slovníku / Word Formation in German-Czech Dictionary
Šemelík, Martin, January 2014
The present thesis is based on my experience as one of the contributors to the Large Academic Dictionary German-Czech (LADGC). It discusses the role of word formation in German-Czech dictionaries, focusing on the presentation of word formation in outer texts, macrostructural ordering procedures, the treatment of word-forming elements, special word-formation sections of dictionary entries, and the possibilities of typography as a means of describing word formation in a bilingual dictionary. The approach taken is both contemplative and transformative. The thesis rests on the study of existing German-Czech dictionaries, most of them published after 1945 and some between 1802 and 1945. Concrete function-based proposals centred on the supposed target users of the LADGC are discussed. A considerable part of the thesis deals with German derived nouns in Ge-...(-e), seen from a corpus-linguistic point of view.
|
377 |
K problémům členění na významy v překladovém slovníku / On Sense Division in a Bilingual Dictionary
Hagenhoferová, Lucie, January 2014
This thesis deals with the division of a dictionary entry into several "sub-meanings" (sense division) and with the closely related ordering of these "sub-meanings" (sense ordering), focusing on the bilingual passive German-Czech dictionary. It also deals briefly with the discrimination of senses by various means (sense discrimination). These three subjects, which follow the chronological order of lexicographic decisions, are linked to the question of equivalence and to the understanding of the term "meaning". The theoretical part introduces the specifics of sense division and sense ordering in monolingual and bilingual lexicography. The practical part illustrates the possibilities of sense division and sense ordering on ten selected noun lemmas prepared for the Large German-Czech Academic Dictionary, which is in progress. For every lemma the most suitable arrangement is suggested and then compared with the corresponding entry in the source dictionary, Duden - Deutsches Universalwörterbuch. The differences between the arrangements of the microstructure illustrate the necessity of revising and, where appropriate, modifying the adopted structure...
|
378 |
Systémové a překladové ekvivalenty německých privativ na -frei a -los / Systemic and Translation Equivalents of German Adjectives Ending on -frei and -los
Bernasová, Mariana, January 2016
This thesis uses language corpora to analyze Czech translations of German privatives ending in -frei and -los, a linguistically asymmetric phenomenon, from three perspectives: translation typology (micro-stylistics and macro-stylistics), Popovič's stylistic adequacy (shifts of expression: intensification, attenuation and correspondence of expression), and the potentially intrinsic tendency of German privatives to present the fact of absence ("privation") as positive or negative. Privatives are adjectives that express the absence of the substance or quality represented by their first (left) component; in the context of this work they are limited to adjectives ending in -frei, -los, -arm and -leer. The Introduction outlines the current state of research and explains its temporal and geographical limitations. It emphasizes that the corpus is used here not only for translation as such but also for the theory of translation. The hypothesis starts from the assumption that German privatives, as a grammatical phenomenon, have no equivalent at this level in Czech; a direct translation equivalent is therefore often missing. For this reason it is also probable that the translator will have to choose a Czech translation of a single German privative that comprises several words or even a whole...
|
379 |
Méthodes modernes d'analyse de données en biophysique analytique : résolution des problèmes inverses en RMN DOSY et SM / New methods of data analysis in analytical biophysics : solving the inverse ill-posed problems in DOSY NMR and MS
Cherni, Afef, 20 September 2018
This thesis proposes new algorithmic approaches to solving the inverse problem in biophysics. First, we address DOSY NMR: a new hybrid regularization approach is proposed, together with a new algorithm, PALMA (http://palma.labo.igbmc.fr/). This algorithm analyzes real DOSY data with high precision, whatever their type. Second, we turn to mass spectrometry. We propose a new dictionary-based approach dedicated to proteomic analysis, using the averagine model and constrained minimization of a sparsity-inducing penalty. To improve the accuracy of the information obtained, we propose a new method, SPOQ, based on a new penalty function and solved with a new variable-metric Forward-Backward algorithm with a locally adjusted metric. All our algorithms come with theoretical convergence guarantees and have been validated experimentally on synthetic spectra and real data.
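For readers unfamiliar with Forward-Backward schemes, the sketch below shows the basic iteration on a small sparse-recovery problem. It is a deliberately simplified stand-in, not the thesis's method: it uses a plain ℓ1 penalty instead of the SPOQ penalty and a fixed scalar step instead of a locally adjusted variable metric, and the function names, the matrix `A` and the signal `y` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.| (the backward step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def forward_backward(A, y, lam, iterations=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient steps."""
    rows, cols = len(A), len(A[0])
    # The squared Frobenius norm bounds the Lipschitz constant of the gradient,
    # so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - y[i] for i in range(rows)]
        gradient = [sum(A[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        # Forward (gradient) step followed by backward (proximal) step.
        x = [soft_threshold(x[j] - step * gradient[j], step * lam) for j in range(cols)]
    return x


# Hypothetical 3x4 dictionary and a signal built from columns 0 and 3.
A = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.5, 1.0],
]
y = [2.0, 0.0, -1.0]
x = forward_backward(A, y, lam=0.1)
print([round(v, 3) for v in x])
```

The soft-thresholding step is what sets small coefficients exactly to zero; replacing it with the proximal operator of another penalty changes the sparsity model without changing the overall loop.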
|
380 |
Non-linear dimensionality reduction and sparse representation models for facial analysis / Réduction de la dimension non-linéaire et modèles de la représentations parcimonieuse pour l’analyse du visage
Zhang, Yuyao, 20 February 2014
Face analysis techniques commonly require a proper representation of images by means of dimensionality reduction leading to embedded manifolds, which aims at capturing relevant characteristics of the signals. In this thesis, we first provide a comprehensive survey of the state of the art of embedded manifold models. Then, we introduce a novel non-linear embedding method, Kernel Similarity Principal Component Analysis (KS-PCA), into Active Appearance Models (AAMs) in order to model face appearance under variable illumination. The proposed algorithm outperforms the traditional linear PCA transform both in capturing the salient features produced by illumination changes and in reconstructing the illuminated faces with high accuracy. We also consider the problem of automatically classifying face poses from views with varying illumination, occlusion and noise. Based on sparse representation methods, we propose two dictionary-learning frameworks for this pose classification problem. The first is Adaptive Sparse Representation pose Classification (ASRC). It trains the dictionary via a linear model, Incremental Principal Component Analysis (Incremental PCA), which tends to decrease the intra-class redundancy that may affect classification performance while keeping the inter-class redundancy that is critical for sparse representation. The second is the Dictionary-Learning Sparse Representation model (DLSR), which learns the dictionary so that it is consistent with the classification criterion; this training goal is achieved with the K-SVD algorithm. A series of experiments shows the performance of the two dictionary-learning methods, which are based on a linear transform and a sparse representation model respectively.
Finally, we propose a novel Dictionary Learning framework for Illumination Normalization (DL-IN). DL-IN is based on sparse representation in terms of coupled dictionaries. The dictionary pairs are jointly optimized from pairs of normally illuminated and irregularly illuminated face images. We further use a Gaussian Mixture Model (GMM) to enhance the framework's capability of modeling data with complex distributions: the GMM adapts each model to a part of the samples and then fuses them together. Experimental results demonstrate the effectiveness of sparsity as a prior for patch-based illumination normalization of face images.
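The general idea behind sparse-representation classification, assigning a sample to the class whose dictionary atoms reconstruct it with the smallest residual, can be sketched in a few lines. This is a toy illustration under stated assumptions, not ASRC or DLSR: the sparse code is restricted to a single atom (one matching-pursuit step), the dictionary is fixed by hand rather than learned with Incremental PCA or K-SVD, and the class labels, vectors and function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def one_atom_residual(atoms, y):
    """Smallest residual norm when y is coded with a single unit-norm atom."""
    best = float("inf")
    for atom in atoms:
        coeff = sum(a * v for a, v in zip(atom, y))  # projection coefficient
        residual = math.sqrt(sum((v - coeff * a) ** 2 for a, v in zip(atom, y)))
        best = min(best, residual)
    return best


def classify(dictionary, y):
    """Assign y to the class whose sub-dictionary reconstructs it best."""
    return min(dictionary, key=lambda label: one_atom_residual(dictionary[label], y))


# Hypothetical 3-dimensional "pose" features with two atoms per class.
dictionary = {
    "frontal": [normalize([1.0, 0.1, 0.0]), normalize([0.9, 0.0, 0.2])],
    "profile": [normalize([0.0, 1.0, 0.1]), normalize([0.1, 0.8, 0.3])],
}
print(classify(dictionary, [0.8, 0.1, 0.1]))
print(classify(dictionary, [0.0, 0.9, 0.2]))
```

With real image features the code would use several atoms per sample and a learned dictionary, but the decision rule, comparing class-wise reconstruction residuals, stays the same.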
|