11 |
Annotating figurative language: another perspective for digital Altertumswissenschaften
Beyer, Stefan, Di Biase-Dyson, Camilla, Wagenknecht, Nina January 2016
Whereas past and current digital projects in ancient language studies have been concerned with the annotation of linguistic elements and metadata, there is now an increased interest in the annotation of elements above the linguistic level that are determined by context – like figurative language. Such projects bring their own set of problems (the automatisation of annotation is more difficult, for instance), but also allow us to develop new ways of examining the data. For this reason, we have attempted to take an already annotated database of Ancient Egyptian texts and develop a complementary tagging layer rather than starting from scratch with a new database. In this paper, we present our work in developing a metaphor annotation layer for the Late Egyptian text database of Projet Ramsès (Université de Liège) and in so doing address more general questions: 1) How to ‘tailor-make’ annotation layers to fit other databases? (Workflow) 2) How to make annotations that are flexible enough to be altered in the course of the annotation process? (Project design) 3) What kind of potential do such layers have for integration with existing and future annotations? (Sustainability)
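The idea of a complementary tagging layer can be pictured as stand-off annotation: the new layer stores only pointers to token identifiers in the existing database, so the base annotation stays untouched and the tag set can be revised independently. The sketch below is an illustration under assumptions, not the project's actual schema; the class names, the `s1.t2`-style identifiers, the English placeholder tokens, and the `source_domain` attribute are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    token_id: str  # identifier owned by the existing database (hypothetical format)
    form: str      # surface form
    pos: str       # an existing linguistic annotation


@dataclass
class MetaphorTag:
    token_ids: list[str]  # pointers into the base layer; no text is copied
    label: str            # category from a tag set that may still change
    attrs: dict[str, str] = field(default_factory=dict)


def resolve(tag: MetaphorTag, base: list[Token]) -> list[Token]:
    """Look up the base-layer tokens that a stand-off tag points to."""
    index = {t.token_id: t for t in base}
    return [index[i] for i in tag.token_ids]


# Placeholder data for illustration only.
base_layer = [Token("s1.t1", "time", "NOUN"), Token("s1.t2", "flies", "VERB")]
tag = MetaphorTag(["s1.t2"], "metaphor", {"source_domain": "motion"})
forms = [t.form for t in resolve(tag, base_layer)]
```

Because the tag holds identifiers rather than text, renaming a label or adding an attribute touches only the new layer, which is one way to keep annotations alterable during the annotation process.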
|
12 |
[en] BUILDING AND EVALUATING A GOLD-STANDARD TREEBANK / [pt] CONSTRUÇÃO E AVALIAÇÃO DE UM TREEBANK PADRÃO OURO
ELVIS ALVES DE SOUZA 29 May 2023
This thesis reports on the development process of PetroGold, a gold-standard annotated corpus with morphosyntactic information – a treebank
– for the oil and gas domain. The development of the resource is seen
from two perspectives: on the linguistic side, we study the grammatical
literature and make linguistically motivated decisions to ensure the quality
of corpus annotation; on the computational side, we evaluate the resource
considering its usefulness for natural language processing (NLP). Resources like
PetroGold receive special importance in the current context, where statistical
NLP has benefited from domain-specific gold-standard resources to train
machine learning models. However, the treebank is also useful for tasks such as
evaluating rule-based annotation systems and for linguistic studies. PetroGold
was annotated according to the guidelines of the Universal Dependencies
project, having as theoretical assumptions the idea that the annotation of
a corpus is an interpretative process, on the one hand, and using the empirical
linguistics paradigm, on the other. In addition to describing the annotation
itself, we apply some methods to find errors in the annotation of treebanks
and present a tool created specifically for searching, editing and evaluating
annotated corpora. Finally, we evaluate the impact of revising each of the
treebank's linguistic categories on the training of a model that uses
PetroGold, and we make the third version of the corpus publicly available.
In an intrinsic evaluation of a model trained on the corpus, this version
achieves metrics up to 2.55 percent better than the previous version.
|
13 |
Broad-domain Quantifier Scoping with RoBERTa
Rasmussen, Nathan Ellis 10 August 2022
No description available.
|
14 |
Sentiment Annotation for Lessing’s Plays: Towards a Language Resource for Sentiment Analysis on German Literary Texts
Schmidt, Thomas, Burghardt, Manuel, Dennerlein, Katrin, Wolff, Christian 05 June 2024
We present first results of an ongoing research project on sentiment annotation of historical plays
by German playwright G. E. Lessing (1729-1781). For a subset of speeches from six of his most
famous plays, we gathered sentiment annotations by two independent annotators for each play. The
annotators were nine students from a Master’s program of German Literature. Overall, we gathered
annotations for 1,183 speeches. We report sentiment distributions and agreement metrics and put
the results in the context of current research. A preliminary version of the annotated corpus of
speeches is publicly available online and can be used for further investigations, evaluations and
computational sentiment analysis approaches.
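The abstract does not say which agreement metrics were reported. As an illustration only, Cohen's kappa is a common choice when exactly two annotators label the same items, and a minimal sketch of it looks like this; the label names and the six example labels are hypothetical, not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for six speeches (illustrative data).
annotator_1 = ["neg", "neg", "pos", "neu", "neg", "pos"]
annotator_2 = ["neg", "pos", "pos", "neu", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw percentage agreement for the agreement two annotators would reach by chance, which matters when one sentiment class dominates the data.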
|
15 |
Classification automatique de commentaires synchrones dans les vidéos de danmaku
Peng, Youyang 01 1900
Danmaku denotes synchronized comments that are displayed and scroll directly on top of videos as they play. Although danmaku offers viewers an innovative way to share their sentiments, knowledge, interpretations, and predictions about the plot of a series, and to interact with each other, the way the comments are displayed can harm the viewing experience when so many comments appear in a given timespan that they completely hide the picture or distract the audience.
Currently, Chinese video websites mainly resort to keyword approaches based on regular expressions to filter out undesired comments. These approaches run a high risk of overgeneralizing, and thus deleting interesting comments that happen to contain certain keywords, or, conversely, of undergeneralizing, because they cannot detect occurrences of these keywords disguised as homophones. Moreover, existing research on the automatic classification of danmaku focuses mainly on recognizing the polarity of the sentiments expressed in comments. Hence, we sought to group comments into functional classes and to evaluate the robustness of such a classification and the feasibility of automating it, with the aim of developing better comment-filtering systems. Building on the theory of speech acts and the theory of gratifications in media usage, we proposed a new taxonomy of danmaku comments and applied it to produce an annotated corpus. A fragment of the corpus was co-annotated to estimate inter-annotator agreement for manual classification. Finally, we performed several automatic classification experiments. These involved three steps: 1) binary classification experiments evaluating whether the machine can distinguish the majority class from the minority classes, 2) coarse-grained multi-class classification experiments aiming to classify comments into the main categories of our taxonomy, and 3) fine-grained classification experiments on specific subcategories. We experimented with both supervised and semi-supervised learning methods with different features.
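The two failure modes of keyword filtering described in this abstract can be reproduced with a few lines of code. The sketch below is an English stand-in, not a reconstruction of any site's filter: the blocklist, the comments, and the sound-alike spelling "spoyler" (playing the role of a Chinese homophone) are all hypothetical.

```python
import re

# Hypothetical blocklist; the abstract does not give real keyword lists.
BLOCKED = re.compile(r"spoiler", re.IGNORECASE)


def keep_comment(comment: str) -> bool:
    """Return True if the keyword filter lets the comment through."""
    return BLOCKED.search(comment) is None


comments = [
    "Spoiler: he dies in episode 3",         # unwanted, and caught
    "First watch, no spoiler talk please!",  # harmless, but removed as well
    "spoyler: he dies in episode 3",         # unwanted, sound-alike spelling slips through
]
kept = [c for c in comments if keep_comment(c)]
```

The second comment illustrates overgeneralization (a harmless comment is deleted because it contains the keyword) and the third illustrates undergeneralization (the disguised keyword is not matched), which is the motivation the abstract gives for classifying comments by function instead.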
|
16 |
A practical approach to the standardisation and elaboration of Zulu as a technical language
Van Huyssteen, Linda 30 November 2003
The lack of terminology in Zulu can be overcome if it is developed to meet international scientific and technical demands. This lack of terminology can be traced back to the absence of proper language policy implementation with regard to the African languages. Even though Zulu possesses the basic elements that are necessary for its development, such as orthographical standards, dictionaries, grammars and published literature, a number of problems exist within the technical elaboration and standardisation processes:
* Inconsistencies in the application of standard rules, in relation to both orthography and terminology.
* The lack of standardisation of the (technical) word-formation patterns in Zulu. (Generally the role of culture in elaboration has largely been overlooked).
* The avoidance of exploiting written technical text corpora as a resource for terminology. (Text encoding by means of corpus query tools in term extraction has just begun in Zulu and needs to be properly exemplified).
* The avoidance of introducing oral technical corpora as a resource for improving the acceptability of technical terminology by, for instance, designing a type of reusable corpus annotation.
This study contributes towards solving these problems by offering a practical approach within the context of the real written, standard and oral Zulu language, mainly within the medical terminological domain. This approach offers a reusable methodological foundation with proper language exemplification that can guide terminologists in terminological research, or to some extent even train them, to achieve effective technical elaboration and eventual standardisation.
This thesis aims at attaining consistent standardisation on the orthographical level in order to ease the elaboration task of the terminologist. It also aims at standardising the methods of word- (term-) formation, linking them to cultural factors such as taboo. The thesis also emphasises the significance of using written and oral technical corpora as a terminology resource. This is made possible, for instance, through the application of corpus linguistics: in semi-automatic term extraction from a written technical corpus to aid lemmatisation (listing entries), and in corpus annotation to improve the acceptability of terminology, based on the comparison of standard terms with oral terms. / Linguistics / D. Litt et Phil. (Linguistics)
|
18 |
A critical investigation of deaf comprehension of signed tv news interpretation
Wehrmeyer, Jennifer Ella January 2013
This study investigates factors hampering comprehension of sign language interpretations rendered on South African TV news bulletins in terms of Deaf viewers’ expectancy norms and corpus analysis of authentic interpretations. The research fills a gap in the emerging discipline of Sign Language Interpreting Studies, specifically with reference to corpus studies. The study presents a new model for translation/interpretation evaluation based on the introduction of Grounded Theory (GT) into a reception-oriented model. The research question is addressed holistically in terms of target audience competencies and expectations, aspects of the physical setting, interpreters’ use of language and interpreting choices. The South African Deaf community are incorporated as experts into the assessment process, thereby empirically grounding the research within the socio-dynamic context of the target audience. Triangulation in data collection and analysis was provided by applying multiple mixed data collection methods, namely questionnaires, interviews, eye-tracking and corpus tools. The primary variables identified by the study are the small picture size and use of dialect. Secondary variables identified include inconsistent or inadequate use of non-manual features, incoherent or non-simultaneous mouthing, careless or incorrect sign execution, too fast signing, loss of visibility against skin or clothing, omission of vital elements of sentence structure, adherence to source language structures, meaningless additions, incorrect referencing, oversimplification and violations of Deaf norms of restructuring, information transfer, gatekeeping and third person interpreting. The identification of these factors allows the construction of a series of testable hypotheses, thereby providing a broad platform for further research. 
Apart from pioneering corpus-driven sign language interpreting research, the study makes significant contributions to present knowledge of evaluative models, interpreting strategies and norms and systems of transcription and annotation. / Linguistics / Thesis (D. Litt. et Phil. (Linguistics))
|