1 |
A System for Building Corpus Annotated With Semantic Roles
Rahimi Rastgar, Sanaz; Razavi, Niloufar. January 2013
Semantic role labelling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations, which can then be used in a range of NLP tasks. The goal of this master's thesis is to investigate how to support the novel method proposed by He Tan for building a corpus annotated with semantic roles. This goal provides the context for developing a general framework for the work and, as a result, for implementing a supporting system based on that framework. The system is implemented in Java. Its features reflect the use of frame semantics in understanding and explaining the meaning of lexical items. The prototype system was run on a biomedical corpus, which served as the dataset for the evaluation. Our supporting environment can create frames with all related associations through XML, update frames and their related information (definition, elements and example sentences), and annotate the example sentences of a frame. The output of annotation is a semi-structured schema in which the tokens of a sentence are labelled. We evaluated our system by means of two surveys. The results showed that our framework and system fulfilled users' expectations and satisfied them to a good degree. User feedback also identified new areas of improvement for this supporting environment.
|
2 |
Construction de corpus généraux et spécialisés à partir du Web / Ad hoc and general-purpose corpus construction from web sources
Barbaresi, Adrien. 19 June 2015
The first chapter opens with a description of the interdisciplinary context. The concept of corpus is then presented with regard to the state of the art. The need for evidence that is linguistic in nature yet spans several disciplines is illustrated by a number of research scenarios. Several key stages of corpus construction are retraced, from the pre-digital corpora of the late 1950s to the web corpora of the 2000s and 2010s. The continuities and changes between the linguistic tradition and corpora drawn from the web are set out. The second chapter gathers methodological considerations. The state of the art in text quality estimation is described. The methods used in readability studies and in automatic text classification are then summarized. Common denominators are identified. Finally, text visualization demonstrates the value of corpus analysis for the digital humanities. The reasons for striking a balance between quantitative analysis and corpus linguistics are discussed. The third chapter summarizes the contribution of the thesis to research on corpora drawn from the internet. The question of data collection is examined with particular attention, especially the case of source URLs. The notion of web corpus preprocessing is introduced and its major steps are outlined. The impact of preprocessing on the results is assessed. The simplicity and reproducibility of corpus construction are emphasized. The fourth part describes the contribution of the thesis to corpus construction proper, through the question of sources and the problem of invalid or undesirable documents. An approach that uses a light scout to prepare the web crawl is presented.
Next, the work on document selection just before inclusion in a corpus is summarized: findings from readability studies as well as machine learning techniques can be used during corpus construction. A set of textual features tested on annotated samples evaluates the effectiveness of the process. Finally, the work on corpus visualization is addressed: features are extracted at the scale of a corpus in order to give indications of its composition and quality. / At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web native corpora are exposed. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification are used as a paragon of methods to find salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. As a conclusion, guiding principles for research practice are listed, and reasons are given to find a balance between quantitative analysis and corpus linguistics, in an environment which is spanned by technological innovation and artificial intelligence techniques. Third, current research on web corpora is summarized.
I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is an important task for linguists to learn new skills in order to confront the whole data gathering and preprocessing phase. I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects, first the question of corpus sources (or prequalification), and secondly the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization consisting of extracting certain corpus characteristics in order to give indications on corpus contents and quality.
|
3 |
Anchoring Events to the Time Axis toward Storyline Construction / ストーリーライン生成のための時間と事象情報の対応付け
Sakaguchi, Tomohiro. 25 March 2019
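Degree program to be noted: Collaborative Graduate Program in Design / Kyoto University / 0048 / New system, course-based doctorate / Doctor of Informatics / Degree No. Kō 21912 / Informatics Doctorate No. 695 / Call number: 新制||情||119 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Sadao Kurohashi, Professor Toyoaki Nishida, Professor Takashi Kusumi / Pursuant to Article 4, Paragraph 1 of the Degree Regulations / DFAM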
|
4 |
Advancing Dialogue Systems through Corpus Construction Focusing on User Internal States and External Knowledge / ユーザ内部状態と外部知識に着目したコーパス構築による対話システムの高度化
Kodama, Takashi. 25 March 2024
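Kyoto University / New system, course-based doctorate / Doctor of Informatics / Degree No. Kō 25422 / Informatics Doctorate No. 860 / Call number: 新制||情||144 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Program-Specific Professor Sadao Kurohashi, Professor Tatsuya Kawahara, Professor Shin'ya Nishida / Pursuant to Article 4, Paragraph 1 of the Degree Regulations / DFAM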
|
5 |
A Variationist Approach to Cross-register Language Variation and Change
Jankowski, Bridget Lynn. 10 January 2014
The comparative method of variationist sociolinguistics has demonstrated that frequency changes are not reliable determinants of whether grammatical change is taking place. Frequency changes can be the result of extra-linguistic register changes, changes within the underlying grammar, or a combination (Szmrecsanyi, 2011; Tagliamonte, 2002). This work examines two variables known to vary along the written-to-spoken continuum, relative clause pronouns and the genitive construction, across three registers of English and 100 years, with the goal of furthering our understanding of the relationship between spoken and written language. The s-genitive (e.g. "Canada's government" vs. "the government of Canada") is on the rise in the 20th century (Hinrichs and Szmrecsanyi, 2007; Rosenbach, 2007). Statistical modeling confirms that the press register leads this increase, which is a register change. Examination of internal linguistic constraints over time indicates simultaneous grammatical change, with the s-genitive increasing with certain inanimate subtypes.
The WH-forms ("who", "which") of the relative pronouns have become increasingly restricted to written registers (e.g. Romaine, 1982; Tottie, 1997), leaving "that" as the variant used most for subject function in vernacular speech (D'Arcy and Tagliamonte, 2010). Although "who" continues to be used for animates, "which" is shown to have lost any grammatical conditioning that it once had and to be undergoing lexical replacement by "that" for non-human subject antecedents. Unlike the genitives, though, examination of internal linguistic factors reveals no evidence of grammatical change. The methodology employed here provides a way to tease apart grammatical change from register change, with register-internal change shown to be a motivating factor in change from above. While the vernacular is "the most systematic data for our analysis of linguistic structure" (Labov, 1972a:208), it is not necessarily the most innovative, nor is it always the locus of change. With that in mind, this work provides a model of language change that integrates change across speech and writing.
|