1

應用平行語料建構中文斷詞組件 / Applications of Parallel Corpora for Chinese Segmentation

王瑞平, Wang, Jui Ping
在本論文,我們建構一個基於中英平行語料的中文斷詞系統,並透過該系統對不同領域的語料斷詞。提供我們的系統不同領域的中英平行語料後,系統可以自動化地產生品質不錯的訓練語料,以節省透過人工斷詞方式取得訓練語料所耗費的時間、人力。 在產生訓練語料時,首先對中英平行語料中的所有中文句,透過查詢中文辭典的方式產生句子的各種斷詞組合,再利用英漢翻譯的資訊處理交集型歧異,將錯誤的斷詞組合去除。此外本研究從中英平行語料中擷取新的中英詞對與未知詞,並分別將其擴充至英漢辭典模組與中文辭典模組,以提升我們的系統之斷詞效能。 我們透過兩部分的實驗進行斷詞效能評估,而在實驗中會使用三種不同領域的實驗語料。在第一部分,我們以人工斷詞的測試語料進行斷詞效能評估。在第二部分,我們藉由漢英翻譯的翻譯品質間接地評估我們的系統之斷詞效能。由實驗結果顯示,我們的系統可以有一定的斷詞效能。 / In this paper, we construct a Chinese word segmentation system based on a Chinese-English parallel corpus, and corpora in different domains can be segmented with it. Given a Chinese-English parallel corpus, the system automatically produces a training corpus of reasonable quality, saving the time and manpower that manual segmentation would require; a segmentation model is then trained on the produced corpus. To generate the training corpus, every Chinese sentence in the parallel corpus is first expanded into its possible segmentations by dictionary lookup, and the Chinese translations of words in the aligned English sentence are used to resolve overlapping ambiguity and discard incorrect segmentations. We also extract new translation pairs and unknown words from the parallel corpus and add them to the English-Chinese and Chinese dictionary modules to improve segmentation performance. In the evaluation, two sets of experiments are conducted with data from three domains. In the first, manually segmented Chinese sentences serve as test data; in the second, segmentation performance is assessed indirectly through Chinese-English translation quality. Experimental results show that our system achieves acceptable segmentation performance.
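The abstract describes the pipeline only at a high level. The following is a minimal sketch of the central idea under toy assumptions: enumerate the dictionary-based segmentations of a Chinese sentence, then use the Chinese translations of the aligned English words to choose among them and resolve overlapping ambiguity. The dictionaries, example sentence, and scoring function are illustrative placeholders, not the thesis's actual resources.

```python
# Hypothetical sketch: enumerate dictionary-based segmentations of a Chinese
# sentence, then prefer the one most consistent with the Chinese translations
# of the aligned English sentence. Toy lexicons only.
from functools import lru_cache

CH_DICT = {"中文", "斷詞", "系統", "中", "文", "斷", "詞"}            # toy Chinese lexicon
EN_CH_DICT = {"chinese": {"中文"}, "segmentation": {"斷詞"},          # toy English-Chinese lexicon
              "system": {"系統"}}

def segmentations(sent):
    """Enumerate every segmentation whose segments are all dictionary words."""
    @lru_cache(maxsize=None)
    def _seg(i):
        if i == len(sent):
            return [[]]
        results = []
        for j in range(i + 1, len(sent) + 1):
            word = sent[i:j]
            if word in CH_DICT:
                results += [[word] + rest for rest in _seg(j)]
        return results
    return _seg(0)

def pick_by_translation(ch_sent, en_sent):
    """Prefer the segmentation sharing the most segments with the English side's translations."""
    translations = set()
    for token in en_sent.lower().split():
        translations |= EN_CH_DICT.get(token, set())
    candidates = segmentations(ch_sent)
    return max(candidates, key=lambda segs: sum(s in translations for s in segs))

print(pick_by_translation("中文斷詞系統", "chinese segmentation system"))
# -> ['中文', '斷詞', '系統']
```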
2

社群媒體新詞偵測系統 以PTT八卦版為例 / Chinese new words detection from social media

王力弘, Wang, Li Hung
近年來網路社群非常活躍,非常多的網民都以社群媒體來分享與討論時事。 不僅於此,網路上的群聚力量已經漸漸從虛擬走向現實,社群媒體的傳播力已經可以與大眾傳媒比擬。像台大 PTT 的八卦版就是一個這樣具指標性的社群媒體,許多新聞或是事件都從此版開始討論,然後擴散至主流媒體。透過觀察,網路鄉民常常會以略帶詼諧的方式,發明新的詞彙去討論時事與人物,例如:割闌尾、祭止兀、婉君、貫老闆...等。這些新詞的出現,很可能代表一個新的熱門話題正在醞釀中。但若以傳統的關鍵詞搜索,未必能找到含有此類新詞的討論文章。因此,本研究提出一個基於「滑動視窗(Sliding window)」的技巧來輔助中文斷詞,以利找出這些新詞,並進而透過這些新詞來探詢社群媒體中的新興話題。我們以此技巧修改知名的Jieba斷詞工具,加上新詞偵測的機制,並以PTT的八卦版為監測對象。經過長期的監測後,結果顯示我們的系統可以正確地找出絕大多數的新詞。此外,經過與主流媒體交叉比對,本系統發現的新詞與新話題的確有極高的相關性。 / Internet users like to share and discuss current events on social media, and their collective influence now reaches beyond the virtual world into reality. For example, the Gossip board (八卦版) of the 台大 PTT BBS is an influential community where many posts turn into TV news stories every day. We observed that users like to coin new, often playful words to discuss current topics and public figures. This thesis builds a system to detect such new words from social media. Because detecting Chinese new words among unknown words is a thorny problem, we propose a "Sliding Window" technique to improve new word detection on top of Jieba's Chinese word segmentation. In testing, the system achieved a 96.94% correct rate, and by cross-checking the detected words against news reports and Google Trends, we show that new word detection is a reasonable way to discover emerging topics.
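The abstract does not spell out the sliding-window mechanics, so the sketch below illustrates one plausible reading using the real jieba package with hypothetical window and threshold settings: merge adjacent Jieba segments inside a fixed-size window and keep frequent merged strings that Jieba itself would split apart as new-word candidates.

```python
# Hypothetical sketch of sliding-window new-word detection over Jieba output.
# Window size and frequency threshold are illustrative, not the thesis settings.
from collections import Counter
import jieba

def new_word_candidates(posts, window=2, min_count=5):
    """Count merged adjacent segments; frequent merges unknown to Jieba are candidates."""
    counts = Counter()
    for post in posts:
        tokens = jieba.lcut(post)
        for i in range(len(tokens) - window + 1):
            merged = "".join(tokens[i:i + window])
            # Only consider merges that Jieba itself would split apart.
            if len(jieba.lcut(merged)) > 1:
                counts[merged] += 1
    return [w for w, c in counts.most_common() if c >= min_count]

# Example: feed in a batch of PTT post titles/bodies (strings) and inspect candidates.
# candidates = new_word_candidates(ptt_posts, window=2, min_count=10)
```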
3

結合中文斷詞系統與雙分群演算法於音樂相關臉書粉絲團之分析:以KKBOX為例 / Combining Chinese text segmentation system and co-clustering algorithm for analysis of music related Facebook fan page: A case of KKBOX

陳柏羽, Chen, Po Yu
近年智慧型手機與網路的普及,使得社群網站與線上串流音樂蓬勃發展。臉書(Facebook)用戶截至去年止每月總體平均用戶高達18.6億人,粉絲專頁成為公司企業特別關注的行銷手段。粉絲專頁上的貼文能夠在短時間內經過點閱、分享傳播至用戶的頁面,達到比起電視廣告更佳的效果,也節省了許多的成本。本研究提供了一套針對臉書粉絲專頁貼文的分群流程,考量到貼文字詞的複雜性,除了抓取了臉書粉絲專頁的貼文外,也抓取了與其相關的KKBOX網頁資訊,整合KKBOX網頁中的資料,對中文斷詞系統(Jieba)的語料庫進行擴充,以提高斷詞的正確性,接著透過雙分群演算法(Minimum Squared Residue Co-Clustering Algorithm)對貼文進行分群,並利用鑑別率(Discrimination Rate)與凝聚率(Agglomerate Rate)配合主成份分析(Principal Component Analysis)所產生的分佈圖來對分群結果進行評估,選出較佳的分群結果進一步去分析,進而找出分類的根據。在結果中,發現本研究的方法能夠有效的區分出不同類型的貼文,甚至能夠依據使用字詞、語法或編排格式的不同來進行分群。 / In recent years, as smartphones and the Internet have become more popular, social network sites and music streaming services have grown rapidly. Facebook's monthly active users reached 1.86 billion last year, and Facebook fan pages have become a marketing channel that companies pay particular attention to. Posts on a fan page can spread to users' feeds within a short time through likes and shares, achieving better reach than television advertising at a much lower cost. This study presents a procedure for clustering posts on Facebook fan pages. Considering the complexity of the wording in posts, we crawled not only the fan-page posts but also the related pages on the KKBOX website, and used the KKBOX data to expand the corpus of the Chinese word segmentation system (Jieba) so as to improve segmentation accuracy. We then clustered the posts with the Minimum Squared Residue Co-Clustering algorithm and evaluated the results using the Discrimination Rate and Agglomerate Rate together with the scatter plots produced by Principal Component Analysis, selecting the better clustering result for further analysis to identify the basis of the classification. The results show that our method can effectively separate different types of posts, and can even cluster them by differences in wording, syntax, or formatting.
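A rough sketch of the pipeline described above, with placeholders for data files and parameters. jieba.load_userdict is the standard way to extend Jieba's dictionary; SpectralCoclustering from scikit-learn stands in for the Minimum Squared Residue Co-Clustering algorithm actually used in the thesis, and PCA provides the 2-D view used for inspecting the clusters.

```python
# Hypothetical sketch: expand Jieba's dictionary with KKBOX terms, vectorize
# fan-page posts, co-cluster, and project with PCA for inspection.
# SpectralCoclustering is a stand-in for Minimum Squared Residue Co-Clustering;
# file names and parameters are placeholders.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering
from sklearn.decomposition import PCA

jieba.load_userdict("kkbox_terms.txt")        # one song/artist/album name per line

def tokenize(text):
    return [t for t in jieba.lcut(text) if t.strip()]

def cocluster_posts(posts, n_clusters=5):
    vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
    X = vec.fit_transform(posts)                               # posts x terms matrix
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(X)                                               # co-cluster posts and terms
    coords = PCA(n_components=2).fit_transform(X.toarray())    # 2-D view of the posts
    return model.row_labels_, coords

# labels, coords = cocluster_posts(facebook_posts)
```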
4

基於領域詞典之詞彙-語義網路建構方法研究 - 以財務金融領域詞典為例 / The Construction of a Lexical-semantic Network Based on Domain Dictionary: Dictionary of Finance and Banking as an Example

曾建勛, Tzeng, Jian Shuin
領域詞典包含許多專業的詞彙以及對詞彙的定義,但詞典中詞彙間的關係是被隱藏起來的,本研究運用自然語言處理的相關技術,提出運用領域詞典找出詞彙間關係、建構特定領域語義網路的方法。 / A domain dictionary contains many professional words and their definitions, but the relations among those words are hidden in the dictionary. In this thesis, we apply natural language processing techniques to uncover these relations and propose a method for constructing a domain-specific lexical-semantic network from a domain dictionary.
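The abstract states the goal rather than the algorithm; the sketch below shows one plausible, simplified reading of it: segment each definition with Jieba and link a headword to the other headwords that appear in its definition, producing a directed lexical-semantic network. The toy entries stand in for the finance-domain dictionary.

```python
# Hypothetical sketch: build a lexical-semantic network by linking each headword
# to the other headwords that occur in its dictionary definition. The toy
# dictionary below is a placeholder for the finance-domain dictionary.
import jieba
import networkx as nx

domain_dict = {
    "選擇權": "一種衍生性金融商品,買方有權利在未來以約定價格買入或賣出標的資產",
    "衍生性金融商品": "價值由標的資產衍生而來的金融契約",
}

def build_network(dictionary):
    graph = nx.DiGraph()
    for headword in dictionary:
        jieba.add_word(headword)                  # make sure headwords segment as units
    for headword, definition in dictionary.items():
        graph.add_node(headword)
        for token in jieba.lcut(definition):
            if token != headword and token in dictionary:
                graph.add_edge(headword, token)   # headword is defined in terms of token
    return graph

g = build_network(domain_dict)
print(list(g.edges()))   # e.g. [('選擇權', '衍生性金融商品')]
```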
