1 |
A formal framework for linguistic tree query
Lai, Catherine. Unknown Date
The analysis of human communication, in all its forms, increasingly depends on large collections of texts and transcribed recordings. These collections, or corpora, are often richly annotated with structural information. Because the datasets are extremely large, manual analysis is only successful up to a point. Significant effort has therefore been invested recently in automatic techniques for extracting information from and analyzing these massive data sets. However, further progress on analytical tools faces three major challenges. First, we need the right data model. Second, we need to understand the theoretical foundations of query languages on that data model. Finally, we need to know the expressive requirements for a general-purpose query language with respect to linguistics. This thesis addresses all three of these issues. / Specifically, this thesis studies formalisms used by linguists and database theorists to describe tree-structured data, namely propositional dynamic logic and monadic second-order logic. These formalisms have been used to reason about a number of tree querying languages and their applicability to the linguistic tree query problem. We identify a comprehensive set of linguistic tree query requirements and the level of expressiveness needed to implement them. The main result of this study is that the required level of expressiveness of linguistic tree query is that of the first-order predicate calculus over trees. / This formal approach has resulted in a convergence between two seemingly disparate fields of study. Further work in the intersection of linguistics and database theory should also pave the way for theoretically well-founded future work in this area. This, in turn, will lead to better tools for linguistic analysis and data management, and more comprehensive theories of human language.
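To make the notion of a first-order query over trees concrete, the following is a minimal sketch and not the formalism or any query language from the thesis. The `Node` class, the helper functions, the category labels and the example sentence structure are all hypothetical. The query it evaluates is: find every NP node x such that some VP node y is a following sibling of x and no PP node z is dominated by x.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    """Hypothetical tree node: a category label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        # Wire up parent pointers so sibling relations can be checked.
        for child in self.children:
            child.parent = self


def descendants(node: Node) -> Iterator[Node]:
    """Yield all proper descendants of node (the 'dominates' relation)."""
    for child in node.children:
        yield child
        yield from descendants(child)


def following_siblings(node: Node) -> List[Node]:
    """Return the siblings that come after node under the same parent."""
    if node.parent is None:
        return []
    siblings = node.parent.children
    position = next(i for i, s in enumerate(siblings) if s is node)
    return siblings[position + 1:]


def query(root: Node) -> List[Node]:
    """NP(x) and exists y: VP(y), y follows x as sibling; no PP z below x."""
    nodes = [root, *descendants(root)]
    return [
        x for x in nodes
        if x.label == "NP"
        and any(y.label == "VP" for y in following_siblings(x))
        and not any(z.label == "PP" for z in descendants(x))
    ]


# Toy tree: [S [NP Det N] [VP V [NP N [PP P [NP N]]]]]
subject = Node("NP", [Node("Det"), Node("N")])
tree = Node("S", [
    subject,
    Node("VP", [
        Node("V"),
        Node("NP", [Node("N"), Node("PP", [Node("P"), Node("NP", [Node("N")])])]),
    ]),
])
matches = query(tree)
```

Only the subject NP satisfies the query here: the object NP contains a PP, and the NP inside the PP has no following VP sibling. The quantifiers map directly onto `any` and `not any`, which is why queries of this shape stay within first-order expressiveness.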
|
2 |
[en] BUILDING AND EVALUATING A GOLD-STANDARD TREEBANK / [pt] CONSTRUÇÃO E AVALIAÇÃO DE UM TREEBANK PADRÃO OURO
ELVIS ALVES DE SOUZA. 29 May 2023
This thesis reports on the development of PetroGold, a gold-standard corpus annotated with morphosyntactic information (a treebank) for the oil and gas domain. The development of the resource is approached from two perspectives: on the linguistic side, we study the grammatical literature and make linguistically motivated decisions to ensure the quality of the corpus annotation; on the computational side, we evaluate the resource in terms of its usefulness for natural language processing (NLP). Resources like PetroGold are especially relevant in the current context, in which statistical NLP has benefited from domain-specific gold-standard resources to train machine learning models. However, the treebank is also useful for tasks such as evaluating rule-based annotation systems and for linguistic studies. PetroGold was annotated according to the guidelines of the Universal Dependencies project, on the assumption that corpus annotation is an interpretative process, on the one hand, and within the paradigm of empirical linguistics, on the other. In addition to describing the annotation itself, we apply several methods for finding errors in treebank annotation and present a tool created specifically for searching, editing and evaluating annotated corpora. Finally, we evaluate the impact of revising each of the treebank's linguistic categories on the training of a model on PetroGold, and we make the third version of the corpus publicly available; in an intrinsic evaluation of a model that uses the corpus, this version achieves metrics up to 2.55 percent better than the previous version.
|
3 |
Školní větné rozbory jako možný zdroj závislostních korpusů (?) / A school analysis as a possible source of treebanks (?)
Konárová, Marie. January 2012
The aim of this thesis is to explore the possibility of using data from school sentence analyses to annotate words in language corpora. To test this hypothesis, a set of sentences was selected from a common Czech language textbook. Students of selected primary and secondary schools were asked to perform a syntactic analysis of these sentences. The data were collected using Capek, a prototype sentence analysis editor. The editor is still being developed, partly on the basis of feedback from the students and teachers who used it during data collection. Several transformation rules were developed for converting data from the school sentence analyses into the data structures used within the Prague Dependency corpus. The accuracy of the conversion using the proposed rules was tested together with the accuracy of the students' results.
|
4 |
Významová reprezentace elipsy / Semantic representation of ellipsis
Mikulová, Marie. January 2012
This dissertation answers the question of what is and what is not ellipsis, and specifies criteria for identifying elliptical sentences. It reports on an analysis of types of ellipsis from the point of view of the semantic (semantico-syntactic) representation of sentences. It does not deal with the conditions and causes that give rise to elliptical positions in sentences (when and why it is possible to omit something in a sentence); it focuses exclusively on the identification of elliptical positions (whether something is omitted, and what) and on their semantic representation, specifically their representation on the tectogrammatical level of the Prague Dependency Treebanks. The dissertation also compares the dependency approach (used in the Prague Dependency Treebanks) with the generative approach (used in the Penn Treebank). This comparison can be used in the (automatic) conversion from constituency trees to dependency trees.
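The conversion from constituency trees to dependency trees mentioned above is commonly done by head percolation. The sketch below illustrates that generic technique only; it is not the procedure of the dissertation, and it ignores ellipsis and the tectogrammatical level entirely. The `HEAD_RULES` table, the nested-tuple tree encoding and the example sentence are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy head table: for each phrase label, the child categories
# to try, in priority order, when choosing the head child.
HEAD_RULES: Dict[str, List[str]] = {
    "S": ["VP", "NP"],
    "VP": ["V", "VP"],
    "NP": ["N", "NP"],
    "PP": ["P"],
}


def is_leaf(tree: tuple) -> bool:
    """A leaf is a (part-of-speech, word) pair."""
    return len(tree) == 2 and isinstance(tree[1], str)


def to_dependencies(tree: tuple) -> List[Tuple[str, Optional[str]]]:
    """Return (dependent, head) word pairs; the root has head None.

    Words are assumed to be unique within the sentence, which holds for
    the toy example but not for real text (token indices would be needed).
    """
    deps: List[Tuple[str, Optional[str]]] = []

    def head_word(node: tuple) -> str:
        if is_leaf(node):
            return node[1]
        label, children = node[0], node[1:]
        heads = [head_word(child) for child in children]
        head_index = 0  # fall back to the leftmost child
        for category in HEAD_RULES.get(label, []):
            found = [i for i, c in enumerate(children) if c[0] == category]
            if found:
                head_index = found[0]
                break
        # Every non-head child's lexical head depends on the head child's.
        for i, word in enumerate(heads):
            if i != head_index:
                deps.append((word, heads[head_index]))
        return heads[head_index]

    deps.append((head_word(tree), None))
    return deps


sentence = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "chased"), ("NP", ("Det", "a"), ("N", "cat"))),
)
dependencies = dict(to_dependencies(sentence))
```

In this toy sentence the verb becomes the root and both nouns attach to it. Elliptical positions are exactly where such a purely structural conversion has nothing to percolate, which is one reason the comparison of the two approaches matters.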
|
5 |
Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT
Mašek, Jan. January 2015
We studied the treebanks included in HamleDT and partially unified their label sets. Afterwards, we used a method based on variation n-grams to automatically detect errors in morphological and dependency annotation. Then we used the output of a part-of-speech tagger / dependency parser trained on each treebank to correct the detected errors. The performance of both the detection and the correction of errors on both annotation levels was manually evaluated on randomly selected samples of suspected errors from several treebanks.
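The idea behind variation n-grams can be sketched as follows: if the same word sequence recurs in a corpus and one of its words receives different labels in different occurrences, that position is a candidate annotation error. The code below is a simplified illustration for part-of-speech tags only, under assumptions of my own; the function name, the fixed n-gram length and the toy corpus are hypothetical, and the thesis's actual method (including its treatment of dependency annotation) is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, tag)


def variation_ngrams(
    sentences: List[List[Token]], n: int = 3
) -> Dict[Tuple[Tuple[str, ...], int], Set[str]]:
    """Map (word n-gram, position) to the tags seen there, keeping only
    positions that were tagged in more than one way across the corpus."""
    tags_seen: Dict[Tuple[Tuple[str, ...], int], Set[str]] = defaultdict(set)
    for sentence in sentences:
        words = [word for word, _ in sentence]
        for start in range(len(sentence) - n + 1):
            gram = tuple(words[start:start + n])
            for offset in range(n):
                tag = sentence[start + offset][1]
                tags_seen[(gram, offset)].add(tag)
    return {key: tags for key, tags in tags_seen.items() if len(tags) > 1}


# Toy corpus: "can" is tagged inconsistently in an identical context.
corpus = [
    [("the", "DET"), ("can", "NOUN"), ("fell", "VERB")],
    [("the", "DET"), ("can", "VERB"), ("fell", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
]
suspects = variation_ngrams(corpus)
```

The third sentence does not contribute a suspect: "can" is tagged differently there, but in a different context, so the variation is treated as genuine ambiguity and not as an inconsistency. Longer shared contexts make a flagged position more likely to be a real error, at the cost of fewer hits.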
|
6 |
Construction de ressources linguistiques arabes à l’aide du formalisme de grammaires de propriétés en intégrant des mécanismes de contrôle / Building Arabic linguistic resources using the property grammar formalism by integrating control mechanisms
Bensalem, Raja. 14 December 2017
Building Arabic linguistic resources that are rich in syntactic information is a major issue for the development of new machine processing tools. In this thesis we propose an approach for creating an Arabic treebank that integrates a new type of information based on the Property Grammar formalism. A syntactic property is a relation between two units of a given syntactic structure. The grammar is automatically induced from the Arabic treebank ATB. We enriched this resource with the property representations of this grammar while retaining its qualities. We also applied this enrichment to the parsing results of a state-of-the-art analyzer, the Stanford Parser, which makes possible an evaluation based on a set of measures computed from this resource. We structured the tags of the units in this grammar according to a type hierarchy, which makes it possible to vary the granularity of these units and, consequently, the precision of the information. Using this grammar, we were thus able to construct other Arabic linguistic resources. In particular, based on this new resource, we developed a probabilistic syntactic parser based on syntactic properties, the first analyzer of this type applied to Arabic. Its learning model integrates a probabilistic lexicalized property grammar that may positively affect the parsing result and characterize its syntactic structures with the properties of this model. Finally, we evaluated the parsing results of this approach by comparing them to those of the Stanford Parser.
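As an illustration of what inducing properties from a treebank can look like, the sketch below derives two property types often listed for Property Grammars, linearity (one category always precedes another) and uniqueness (a category occurs at most once in a constituent), from observed local trees. It is a simplified assumption-laden sketch, not the induction procedure of the thesis: the function name, the data layout and the toy category labels are hypothetical, and no ATB data is used.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

LocalTree = Tuple[str, List[str]]  # (parent label, ordered child labels)


def induce_properties(local_trees: Iterable[LocalTree]) -> Dict[str, dict]:
    """Induce linearity and uniqueness properties per parent category."""
    observed_order: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    repeated: Dict[str, Set[str]] = defaultdict(set)
    categories: Dict[str, Set[str]] = defaultdict(set)

    for parent, children in local_trees:
        categories[parent].update(children)
        for i, j in combinations(range(len(children)), 2):
            first, second = children[i], children[j]
            if first == second:
                repeated[parent].add(first)  # occurs twice in one constituent
            else:
                observed_order[parent].add((first, second))

    properties: Dict[str, dict] = {}
    for parent, cats in categories.items():
        orders = observed_order[parent]
        properties[parent] = {
            # a precedes b holds only if b was never observed before a.
            "linearity": {(a, b) for a, b in orders if (b, a) not in orders},
            # categories never repeated within a single constituent.
            "uniqueness": cats - repeated[parent],
        }
    return properties


# Toy local trees with hypothetical labels.
observations = [
    ("NP", ["Det", "N"]),
    ("NP", ["Det", "Adj", "N"]),
    ("NP", ["N", "Adj"]),
    ("NP", ["Det", "Adj", "Adj", "N"]),
]
grammar = induce_properties(observations)
```

Because each property relates only two categories of a constituent, the resulting description can be checked property by property against a parser's output, which is the kind of fine-grained comparison the abstract alludes to.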
|