Towards discourse classification for Chinese, a resource-poor language
January 2014
Discourse raises issues about semantics, especially the nature of coherence and cohesion in texts. Like part-of-speech tagging and syntactic parsing, discourse classification is a fundamental task in computational linguistics, yet it remains comparatively understudied. The lack of annotated corpora limits research on discourse classification for most languages other than English (e.g., Chinese), and manual annotation is complex, time-consuming and costly. One alternative is to explore unsupervised learning methods. However, previous work on English showed that unsupervised methods could handle only coarse-grained discourse relations and suffered from low precision. Another option is to draw on discourse classification capabilities from languages that have rich discourse corpora, but cross-language discourse classification remains an open problem. Using Chinese as the target language, this thesis presents the first study of discourse classification for a resource-poor language. We also annotate the first open discourse treebank for Chinese, which covers 890 news articles. / We first propose a novel unsupervised bootstrapping method for discourse classification based on semantic sequential representation (SSR). SSR is a new representation for discourse instances that integrates basic bag-of-words information with lexical, semantic and word-order information. Our method starts with a small set of cue-phrase-based patterns to collect a large number of discourse instances, which are then converted to SSRs. We then propose an unsupervised SSR learner that generates, weights and filters new SSRs without cue phrases for recognizing discourse relations. Experimental results show that our method outperforms the previous unsupervised method by 7% in F-score. 
We also show that SSRs are effective features for supervised learning methods. / The SSR-based method (F-score = 0.63) ignores the ambiguities of discourse connectives. As a result, it suffers from low recall (recall = 0.49). To discover and eliminate these ambiguities, we further propose a cross-language framework for discourse classification. In our framework, discourse classification for Chinese is achieved in two steps: (1) discourse connective/trigger identification and (2) sense classification. The English Penn Discourse Treebank 2 (PDTB2) and Chinese-English parallel data are coupled to provide the training data for a co-training-based framework. Experimental results show that our method achieves a significant improvement over the SSR-based method. The proposed framework is practical and effective, especially in coping with the inter community problem, which is common in cross-language discourse classification. Moreover, the proposed framework does not use any language-specific features, making it theoretically applicable to other languages. / Every language has its own characteristics, and our cross-language framework, which focuses on characteristics common across languages, is ineffective at detecting characteristics specific to Chinese. As a result, we package the corpus used in this research to form the Discourse Treebank for Chinese (DTBC). DTBC adopts the principles of PDTB2 and, at the same time, incorporates the linguistic characteristics of Chinese. The annotation work adds a discourse layer to 890 articles from the Penn Chinese Treebank 5 (CTB5). DTBC is the first open Chinese discourse treebank and will be an invaluable linguistic resource for future research on Chinese discourse. 
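The two figures reported for the SSR-based method can be cross-checked with a little arithmetic. The sketch below assumes that the quoted F-score is the balanced F1 measure and that the F-score and recall come from the same evaluation; the abstract states neither explicitly. The function name `precision_from_f1` is illustrative only.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for the precision P.

    Assumes a balanced F1 score; rearranging gives P = F1 * R / (2R - F1).
    """
    denom = 2 * recall - f1
    if denom <= 0:
        # No precision in (0, 1] can produce this F1 at this recall.
        raise ValueError("F1 and recall are inconsistent")
    return f1 * recall / denom


# Figures quoted in the abstract for the SSR-based method.
precision = precision_from_f1(f1=0.63, recall=0.49)
print(round(precision, 2))
```

Under these assumptions the implied precision is roughly 0.88, which fits the abstract's point that the SSR-based method is limited by recall rather than by precision.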
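/ Discourse raises questions about semantic understanding, in particular the cohesion and coherence of texts. Like lexical analysis and syntactic analysis, discourse classification is one of the fundamental problems of computational linguistics. Compared with other problems in the field, research on discourse classification is still at an early stage. For the vast majority of languages other than English, research on discourse classification is severely limited by the lack of discourse-annotated data. Annotating discourse data is highly complex and takes a great deal of time. To overcome this difficulty, one approach is to explore unsupervised discourse classification methods. However, prior research on English shows that unsupervised discourse classification methods have low precision and can handle only coarse-grained discourse relations. Another approach is to transfer discourse classification techniques from a source language with abundant annotated data to other target languages. However, current cross-language discourse classification techniques are still immature. Taking Chinese as the target language, this thesis is the first to study discourse classification for Chinese when local annotated data are very limited (resource-poor). In addition, we annotate the first open discourse treebank for Chinese, comprising 890 news articles. / To overcome the shortcomings of previous unsupervised methods, we first propose a novel unsupervised method based on Semantic Sequential Representation (SSR). SSR is a new way of representing discourse instances that integrates bag-of-words information, lexical information, semantic information and word-order information. Our method starts from a small set of patterns based on discourse connectives and collects a large number of discourse instances from raw Chinese text; we represent these instances as SSRs. We then propose an unsupervised method that mines, scores and filters SSRs without considering discourse connectives. Experimental results show that the proposed method improves on the previous method by 7% in F-score. We also show that SSRs can serve as effective features for supervised discourse classification methods. / The unsupervised method based on mining SSRs (F-score = 0.63) ignores the ambiguity of discourse connectives, and its recall is therefore low. To resolve this ambiguity, we further propose a cross-language discourse classification framework. In our framework, the Chinese discourse classification task consists of two steps: (1) identification of discourse connectives/triggers; (2) discourse relation classification. We combine the English discourse treebank (PDTB2: Penn Discourse TreeBank 2.0) and the Chinese treebank (CTB5: Chinese TreeBank 5.0) as training data, which serve as input to a co-training framework. Experimental results show that the proposed cross-language method achieves a very significant improvement in F-score over the method that uses SSRs alone. This indicates that the proposed cross-language framework can, through the bridging role of bilingual parallel corpora, effectively identify what discourse classification has in common across languages. Notably, the proposed framework does not require any language-specific features, so it can readily be extended and applied to other languages. / Every language has its own characteristics. The cross-language method we propose focuses mainly on discovering characteristics shared between languages, and therefore cannot effectively discover characteristics unique to Chinese discourse classification. We summarized and consolidated the Chinese discourse analysis data annotated in our experiments to form the Discourse TreeBank for Chinese (DTBC). DTBC inherits the construction principles of the English discourse treebank and, at the same time, is extensively localized for characteristics unique to Chinese. Our annotation work adds a discourse information layer to all 890 news articles of the Chinese treebank (CTB5: The Chinese TreeBank 5.0). DTBC is the first open, large-scale Chinese discourse treebank corpus. It provides essential foundational annotated data for future research on Chinese discourse analysis. / Zhou, Lanjun. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2014. / Includes bibliographical references (leaves 98-104). / Abstracts also in Chinese. / Title from PDF title page (viewed on 20, December, 2016). / Detailed summary in vernacular field only.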