1

一個對單篇中文文章擷取關鍵字之演算法 / A Keyword Extraction Algorithm for Single Chinese Document

吳泰勳, Wu, Tai Hsun. Unknown Date.
數位典藏與數位學習國家型科技計畫14年來透過數位化方式典藏國家文物,例如:生物、考古、地質等15項主題,為了能讓數位典藏資料與時事互動故使用關鍵字作為數位典藏資料與時事的橋樑,由於時事資料會出現新字詞,因此,本研究將提出一個演算法在不使用詞庫或字典的情況下對單一篇中文文章擷取主題關鍵字,此演算法是以Bigram的方式斷詞因此字詞最小單位為二個字,例如:「中文」,隨後挑選出頻率詞並採用分群的方式將頻率詞進行分群最後計算每個字詞的卡方值並產生主題關鍵字,在文章中字詞共現的分佈是很重要的,假設一字詞與所有頻率詞的機率分佈中,此字詞與幾個頻率詞的機率分佈偏差較大,則此字詞極有可能為一關鍵字。在字詞的呈現方面,中文句子裡不像英文句子裡有明顯的分隔符號隔開每一個字詞,造成中文在斷詞處理上產生了極大的問題,與英文比較起來中文斷詞明顯比英文來的複雜許多,在本研究將會比較以Bigram、CKIP和史丹佛中文斷詞器為斷詞的工具,分別進行過濾或不過濾字詞與對頻率詞分群或不分群之步驟,再搭配計算卡方值或詞頻後所得到的主題關鍵字之差異,實驗之資料將採用中央研究院數位典藏資源網的文章,文章的標準答案則來自於中央研究院資訊科學研究所電腦系統與通訊實驗室所開發的撈智網。從實驗結果得知使用Bigram斷詞所得到的主題關鍵字部分和使用CKIP或史丹佛中文斷詞器所得到的主題關鍵字相同,且部分關鍵字與文章主題的關聯性更強,而使用Bigram斷詞的主要優點在於不用詞庫。最後,本研究所提出之演算法是基於能將數位典藏資料推廣出去的前提下所發展,希望未來透過此演算法能從當下熱門話題的文章擷取出主題關鍵字,並透過主題關鍵字連結到相關的數位典藏資料,進而帶動新一波「數典潮」。 / In the past 14 years, the Taiwan e-Learning and Digital Archives Program has digitized national cultural artifacts into archives covering 15 topics, such as organisms, archaeology, and geology. The goal of the work presented in this thesis is to automatically extract keywords from documents in the digital archives, and the techniques developed along with this work can be used to build a connection between the digital archives and news articles. Because new words and new uses of words constantly appear in news articles, we propose an algorithm that extracts keywords from a single Chinese document without using a corpus or dictionary. Given a Chinese document, the algorithm first divides it into bigrams of Chinese characters, so the smallest lexical unit is two characters (e.g., 「中文」). Next, it calculates term frequencies of the bigrams, filters out those with low frequencies, and clusters the remaining frequent terms. Finally, it calculates chi-square values to produce the keywords most related to the topic of the document. The distribution of word co-occurrence is an important signal: if a term's co-occurrence distribution over the frequent terms deviates strongly for a few of them, that term is very likely a keyword.
Unlike English, where words are separated by explicit delimiters, Chinese text contains no spaces between characters, which makes word segmentation a challenging task. The proposed algorithm performs Chinese word segmentation with a bigram-based approach, and we compare the segmented words with those produced by CKIP and the Stanford Chinese Segmenter. In this thesis, we present comparisons across different settings: one considers whether infrequent terms are filtered out, the other considers whether frequent terms are clustered by a clustering algorithm, and each is combined with ranking by either chi-square values or term frequencies. The dataset used in the experiments is downloaded from the Academia Sinica Digital Resources site, and the ground truth is provided by Gainwisdom, developed by the Computer Systems and Communication Lab at Academia Sinica. According to the experimental results, some of the keywords given by the bigram-based approach are the same as those given by CKIP or the Stanford Chinese Segmenter, while others have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it requires no corpus or dictionary.
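The bigram-plus-chi-square procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the sentence-level co-occurrence counting, and the parameter values are all invented for the sketch, and the clustering of frequent terms is omitted for brevity.

```python
from collections import Counter

def bigrams(text):
    # Dictionary-free segmentation: overlapping two-character units.
    return [text[i:i + 2] for i in range(len(text) - 1)]

def chi_square_keywords(sentences, top_frequent=10, top_k=5):
    """Rank bigrams by how much their co-occurrence with the frequent
    terms deviates from what chance (the expected counts) would give."""
    sent_grams = [set(bigrams(s)) for s in sentences]
    freq = Counter(g for grams in sent_grams for g in grams)
    frequent = [g for g, _ in freq.most_common(top_frequent)]
    n_sent = len(sentences)
    # How many sentences each frequent term appears in.
    n_with = {f: sum(1 for grams in sent_grams if f in grams) for f in frequent}
    scores = {}
    for term in freq:
        n_term = sum(1 for grams in sent_grams if term in grams)
        chi2 = 0.0
        for f in frequent:
            if f == term:
                continue
            observed = sum(1 for g in sent_grams if term in g and f in g)
            expected = n_term * n_with[f] / n_sent
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        scores[term] = chi2
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A term whose chi-square score is high co-occurs with a few frequent terms far more often than chance predicts, which is exactly the biased-distribution criterion the abstract describes.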
2

設計與實作一個針對遊戲論壇的中文文章整合系統 / Design and Implementation of a Chinese Document Integration System for Game Forums

黃重鈞, Huang, Chung Chun. Unknown Date.
現今網路發達便利,人們資訊交換的方式更多元,取得資訊的方式,不再僅是透過新聞,透過論壇任何人都可以快速地、較沒有門檻地分享資訊。也因為這個特性造成資訊量暴增,就算透過搜尋引擎,使用者仍需要花費許多精力蒐集、過濾與處理特定的主題。本研究以巴哈姆特電玩資訊站─英雄聯盟哈拉討論板為例,期望可以為使用者提供一個全面且精要的遊戲角色描述,讓使用者至少對該角色有大概的認知。 本研究參考網路論壇探勘及新聞文件摘要系統,設計適用於論壇多篇文章的摘要系統。首先必須了解並分析論壇的特性,實驗如何從論壇挖掘出潛藏的資訊,並認識探勘論壇會遭遇的困難。根據前面的論壇分析再設計系統架構大致可分為三階段:1. 資料前處理:論壇文章與新聞文章不同,很難直接將名詞、動詞作為關鍵字,因此使用TF-IDF篩選出論壇文章中有代表性的詞彙,作為句子的向量空間維度。2. 分群:使用K-Means分群法分辨哪些句子是比較相似的,並將相似的句子分在同一群。 3. 句子挑選:根據句子的分群結果,依句子的關鍵字含量及TF-IDF選擇出最能代表文件集的句子。 我們發現實驗分析過程中可以看到一些有用的相關資訊,在論文的最後提出可能的改善方法,期望未來可以開發更好的論壇文章分類方式。 / With today's well-developed network infrastructure, anyone can share information on forums quickly and with a low barrier to entry, so news is no longer the only channel. As a result, the amount of information has exploded: even with search engines, users still need considerable effort to collect, filter, and process articles on a specific topic, which is usually beyond manual processing. In this study, we take the League of Legends board of the Bahamut game forum as an example and design a tool that gives users a comprehensive yet concise description of a game character. Drawing on forum mining and news summarization systems, we design a summarization system for multiple forum articles. Our method is divided into three phases. The first phase discovers representative keywords by TF-IDF rather than by part of speech, since forum articles, unlike news articles, rarely yield usable keywords from nouns and verbs alone, and uses these keywords as the dimensions of a vector space model. The second phase represents sentences in that vector space and applies the K-means clustering algorithm to gather sentences with the same sense into the same cluster. The third phase weights sentences by two features, the number of keywords a sentence contains and its TF-IDF score, and selects the sentences that best represent the document set. We conducted an experiment with data collected from the game forum and found useful related information through the analysis. At the end of the thesis we propose possible improvements, hoping that better ways of organizing forum articles can be developed in the future.
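The three-phase pipeline above (TF-IDF weighting, K-means clustering, sentence selection) can be sketched as follows. This is a dependency-free illustration under explicit assumptions: whitespace tokenization stands in for Chinese word segmentation, centroids are seeded deterministically with the first k sentences, and sentence selection uses summed TF-IDF as a stand-in for the thesis's keyword-count feature.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    # Phase 1: TF-IDF weights over whitespace tokens as vector dimensions.
    docs = [s.split() for s in sentences]
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    vocab = sorted(df)
    vecs = [[tf[t] / len(d) * math.log(n / df[t]) if t in tf else 0.0
             for t in vocab]
            for d, tf in ((d, Counter(d)) for d in docs)]
    return vocab, vecs

def kmeans(vecs, k, iters=20):
    # Phase 2: minimal k-means; first k vectors seed the centroids.
    centers = [list(v) for v in vecs[:k]]
    labels = [0] * len(vecs)
    for _ in range(iters):
        for i, v in enumerate(vecs):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centers[c])))
        for c in range(k):
            members = [vecs[i] for i in range(len(vecs)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def summarize(sentences, k=2):
    # Phase 3: from each cluster, keep the highest-weighted sentence.
    k = min(k, len(sentences))
    _, vecs = tfidf_vectors(sentences)
    labels = kmeans(vecs, k)
    picks = []
    for c in range(k):
        members = [i for i, a in enumerate(labels) if a == c]
        if members:
            picks.append(max(members, key=lambda i: sum(vecs[i])))
    return [sentences[i] for i in sorted(picks)]
```

In practice one would replace the toy tokenizer with a real segmenter and the deterministic seeding with random restarts; the sketch only shows how the three phases fit together.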
3

以型態組合為主的關鍵詞擷取技術在學術寫作字彙上的研究 / A pattern approach to keyword extraction for academic writing vocabulary

邵智捷, Shao, Chih Chieh. Unknown Date.
隨著時間的推移演進,人們瞭解到將知識經驗著作成文獻典籍保存下來供後人研究開發的重要性。時至今日,以英語為主的學術寫作論文成為全世界最主要的研究交流媒介。而對於英語為非母語的研究專家而言,在進行英語學術寫作上常常會遇到用了不適當的字彙或搭配詞導致無法確切的傳達自己的研究成果,或是在表達上過於貧乏的問題,因此英語學術寫作字彙與搭配詞的學習與使用就顯得相當重要。 在本研究中,我們藉由收集大量不同國家以及不同研究領域的學術論文為基礎,建構現實中實際使用的語料庫,並且建立數種詞性標籤型態,使用關鍵詞擷取(Keyword Extraction)技術從中擷取出學術著作中常用的學術寫作字彙候選詞,當作是學術常用寫作字彙之初步結果,隨即將候選詞導入關鍵詞分析的指標形態模型,將候選詞依照指標特徵選出具有代表指標意義的進一步候選詞。 在實驗方面,透過對不同範圍的樣本資料進行篩選,並導入統計上的方法對字彙進行不同領域共通性的分析檢證,再加上輔助篩選的機制後,最後求得名詞和動詞分別在學術寫作中常用的字彙,也以此字彙為基礎,發掘出語料庫中常用的搭配詞組合,提出以英語為外國語的研究學者以及學生在學術寫作上的常用字彙與搭配詞組合作為參考,在學術寫作上能夠提供更多樣性且正確的研究論述的協助。 / Over time, people have come to understand the importance of preserving knowledge and experience in written works for later research. Today, academic papers written in English are the world's primary medium of research communication. Researchers whose native language is not English often use inappropriate vocabulary or collocations, which prevents them from conveying their results precisely or leaves their expression impoverished, so learning and using appropriate English academic vocabulary and collocations is essential. In this study, we construct a corpus of real academic theses collected from different countries and research fields, define several part-of-speech tag patterns, and apply keyword extraction to obtain candidate terms commonly used in academic writing as an initial result. The candidates are then fed into an index-based analysis model and filtered by their index characteristics into a smaller set of representative candidates. In the experiments, we filter sample data of different scopes and use statistical methods to verify which vocabulary items are common across fields. With an additional auxiliary filtering mechanism, we obtain the nouns and verbs most commonly used in academic writing and, based on these, discover the common collocations in the corpus. The resulting vocabulary and collocation lists are offered as a reference for researchers and students who use English as a foreign language, to support more varied and accurate academic writing.
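The pattern-based candidate extraction with a cross-field commonality check might be sketched like this. The POS tag patterns, field labels, and threshold below are illustrative assumptions only; the thesis defines its own tag templates and statistical verification, which this sketch does not reproduce.

```python
from collections import Counter

# Hypothetical tag templates: adjective+noun, noun+noun, verb+noun.
PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("VB", "NN")]

def extract_candidates(tagged_docs, min_fields=2):
    """tagged_docs: iterable of (field_label, [(word, pos_tag), ...]).
    Collect word sequences whose tags match a pattern, then keep those
    shared by at least `min_fields` fields (a rough stand-in for the
    thesis's cross-field commonality test)."""
    by_field = {}
    for field, tagged in tagged_docs:
        found = by_field.setdefault(field, set())
        for i in range(len(tagged)):
            for pat in PATTERNS:
                if i + len(pat) <= len(tagged) and all(
                        tagged[i + j][1] == pat[j] for j in range(len(pat))):
                    found.add(tuple(w for w, _ in tagged[i:i + len(pat)]))
    # Count in how many distinct fields each candidate occurs.
    counts = Counter(c for cands in by_field.values() for c in cands)
    return [c for c, n in counts.most_common() if n >= min_fields]
```

Counting each candidate once per field (via the per-field sets) is what turns raw pattern matches into a commonality measure: a phrase frequent in only one discipline never clears the threshold.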
