Global ETD Search

1	應用平行語料建構中文斷詞組件 / Applications of Parallel Corpora for Chinese Segmentation 王瑞平, Wang, Jui Ping Unknown Date In this thesis, we build a Chinese word segmentation system based on Chinese-English parallel corpora and use it to segment text from different domains. Given Chinese-English parallel corpora from a domain, the system automatically produces training data of reasonable quality, saving the time and manpower that manual segmentation requires; a segmentation model can then be trained on the produced data. To produce the training data, the system first generates all candidate segmentations of each Chinese sentence by looking words up in a Chinese dictionary, and then uses the Chinese translations of the words in the parallel English sentence to resolve overlapping ambiguity and discard wrong candidates. We also extract new Chinese-English translation pairs and unknown words from the parallel corpora and add them to the English-Chinese dictionary module and the Chinese dictionary module, respectively, to improve segmentation performance. We evaluate segmentation performance in two experiments, using data from three domains. In the first experiment, manually segmented Chinese sentences serve as test data. In the second, segmentation performance is assessed indirectly through the quality of Chinese-to-English translation. The experimental results show that our system achieves acceptable segmentation performance. Keywords: Chinese word segmentation; Chinese-English parallel corpus; unknown words; overlapping ambiguity
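The candidate-generation and disambiguation steps described in entry 1 can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names, the toy Chinese lexicon, the toy English-Chinese lexicon, and the scoring rule (count of words that match a translation, with fewer words as a tie-break) are all assumptions made for the example.

```python
def enumerate_segmentations(sentence, lexicon):
    """Return every way to split `sentence` into lexicon words.

    Single characters are always allowed (an assumption of this sketch),
    so at least one segmentation exists for any sentence.
    """
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if end == 1 or word in lexicon:
            for rest in enumerate_segmentations(sentence[end:], lexicon):
                results.append([word] + rest)
    return results


def pick_by_translation(candidates, english_words, en_zh):
    """Choose the candidate whose words best match the Chinese
    translations of the words in the parallel English sentence."""
    expected = set()
    for word in english_words:
        expected.update(en_zh.get(word, ()))

    def score(segmentation):
        matches = sum(1 for w in segmentation if w in expected)
        # More matches first; among equals, prefer fewer (longer) words.
        return (matches, -len(segmentation))

    return max(candidates, key=score)


# Hypothetical toy data: a classic overlapping ambiguity,
# 研究/生命/起源 versus 研究生/命/起源.
lexicon = {"研究", "研究生", "生命", "起源"}
en_zh = {"study": ["研究"], "origin": ["起源"], "life": ["生命"]}

candidates = enumerate_segmentations("研究生命起源", lexicon)
best = pick_by_translation(candidates, "study the origin of life".split(), en_zh)
print("/".join(best))
```

Exhaustive enumeration grows quickly with sentence length; it is used here only because it keeps the idea visible in a few lines.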
2	社群媒體新詞偵測系統以PTT八卦版為例 / Chinese new words detection from social media 王力弘, Wang, Li Hung Unknown Date In recent years online communities have become very active, and many Internet users share and discuss current events on social media. Moreover, the collective power of online crowds has gradually moved from the virtual world into the real one, and the reach of social media is now comparable to that of the mass media. The Gossiping board (八卦版) of National Taiwan University's PTT BBS is one such bellwether: many news stories and events are first discussed there and then spread to the mainstream media. We observed that users often coin new, slightly humorous words to discuss current events and public figures, for example 割闌尾, 祭止兀, 婉君, and 貫老闆. The appearance of such words may signal that a new hot topic is emerging, but traditional keyword search may not find the posts that contain them. Detecting new Chinese words among unknown words is a thorny problem, so this study proposes a sliding-window technique to assist Chinese word segmentation, find these new words, and use them to explore emerging topics on social media. We modified the well-known Jieba segmentation tool with this technique, added a new-word detection mechanism, and monitored the PTT Gossiping board over a long period. In testing, the system reached a 96.94% correct rate, and cross-validation of the detected words against news media and Google Trends showed that the new words are highly correlated with new topics. Keywords: Chinese Words Segmentation; New Words Detection; Social Media Data Analysis
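The abstract for entry 2 does not spell out how its sliding window works, so the following is only a generic sketch of the idea, not the thesis's method and not a modification of Jieba. It slides windows of two to four characters over each post, counts the resulting strings, and reports frequent strings that are missing from a known vocabulary. The function names, the thresholds, the fragment filter, and the sample posts are all assumptions.

```python
from collections import Counter


def sliding_windows(text, min_len=2, max_len=4):
    """Yield every substring of `text` whose length is in [min_len, max_len]."""
    for size in range(min_len, max_len + 1):
        for start in range(len(text) - size + 1):
            yield text[start:start + size]


def detect_new_words(posts, known_words, min_count=3, min_len=2, max_len=4):
    """Return {candidate: count} for frequent strings not in `known_words`."""
    counts = Counter()
    for post in posts:
        counts.update(sliding_windows(post, min_len, max_len))

    frequent = {
        w: c for w, c in counts.items()
        if c >= min_count and w not in known_words
    }
    # Drop fragments: a shorter candidate that only ever occurs inside a
    # longer candidate (same count) adds no information.
    return {
        w: c for w, c in frequent.items()
        if not any(w != other and w in other and c == frequent[other]
                   for other in frequent)
    }


# Hypothetical posts and vocabulary, for illustration only.
posts = ["大家一起割闌尾", "割闌尾行動開始", "支持割闌尾"]
known_words = {"大家", "一起", "行動", "開始", "支持", "闌尾"}
print(detect_new_words(posts, known_words))
```

In a real pipeline the known vocabulary would come from the segmenter's dictionary, and the count threshold would need tuning against the volume of posts.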
3	結合中文斷詞系統與雙分群演算法於音樂相關臉書粉絲團之分析：以KKBOX為例 / Combing Chinese text segmentation system and co-clustering algorithm for analysis of music related Facebook fan page: A case of KKBOX 陳柏羽, Chen, Po Yu Unknown Date In recent years, the spread of smartphones and the Internet has driven rapid growth in social network sites and music streaming services. Facebook averaged 1.86 billion monthly users as of last year, and fan pages have become a marketing channel that companies pay special attention to. Posts on a fan page can reach users' pages in a short time through views and shares, achieving better results than television advertising at much lower cost. This study presents a process for clustering posts on Facebook fan pages. Because of the complexity of the words used in posts, we collected not only the fan page posts but also related information from the KKBOX website, and used the KKBOX data to expand the corpus of the Chinese word segmentation system Jieba and improve segmentation accuracy. We then clustered the posts with the Minimum Squared Residue Co-Clustering Algorithm and evaluated the clustering results using the Discrimination Rate and Agglomerate Rate together with distribution plots produced by Principal Component Analysis, selecting the better results for further analysis to find the basis of the classification. The results show that the method effectively separates different types of posts and can even cluster posts by differences in wording, syntax, or layout. Keywords: Co-clustering; Chinese text segmentation system; Facebook fan page
4	轉換年報資料以擷取企業評價模型之非財務性資料項 / A Transformation Approach to Extract Annual Report for Non-Financial Category in Business Valuation 吳思宏, Wu, Szu-Hung Unknown Date Following the recent wave of mergers and acquisitions, questions such as "How much is a business worth?" and "Does the business still have prospects?" concern not only investors but also accountants and business valuators. Moreover, in today's knowledge-based economy, businesses no longer create value mainly from fixed assets such as land, plants, and equipment, but from intangible assets such as services, brands, and patents, which raises the question of how a business's value should be estimated. These questions all point to the importance of business valuation. Before a valuation is performed, obtaining the data items of the valuation model largely determines the quality of the final result. Valuation data items are either financial or non-financial. Because financial data items are clearly defined, they are easier to collect than non-financial ones. We found that past data collection methods are inadequate for collecting non-financial valuation data items, and that most current practice processes the data manually, which consumes considerable time and cost and risks input errors that greatly reduce data correctness. This thesis therefore proposes an approach for automatically extracting non-financial business valuation data items from annual reports using data extraction technology, with the aim of simplifying data collection and improving data correctness. Keywords: Business valuation; Data extraction; Portable Document Format (PDF); Information Retrieval; Word Segmentation
5	基於領域詞典之詞彙-語義網路建構方法研究－以財務金融領域詞典為例 / The Construction of a Lexical-semantic Network Based on Domain Dictionary: Dictionary of Finance and Banking as an Example 曾建勛, Tzeng, Jian Shuin Unknown Date A domain dictionary contains many specialized terms and their definitions, but the relations among the terms in the dictionary are hidden. In this thesis, we apply natural language processing techniques to uncover these relations and propose a method that uses a domain dictionary to construct a domain-specific lexical-semantic network. Keywords: Chinese word segmentation; Feature vector; Word space; Semantic network; Semantic similarity
6	支援數位人文研究之文本自動標註系統發展與使用評估研究 / Development and evaluation of an automatic text annotation system for supporting digital humanities research 劉鎮宇, Liu, Chen Yu Unknown Date In traditional humanities research, scholars have mostly worked with texts in paper form, such as rare ancient books and historical documents. With the arrival of the information society, many research institutes have digitized these paper materials and built digital archive databases, greatly changing the humanities research environment and the channels for acquiring knowledge; text research based on digital reading has become an inevitable trend. This study therefore develops an "automatic text annotation system" to support digital humanities research. Using the concept of Linked Data, the system gathers and integrates resources from different databases and annotates texts automatically, so that users can consult resources from other databases immediately while interpreting a text. It also provides a friendly reading interface with text annotations to help humanities scholars interpret materials through reading. An experiment compared the "automatic text annotation system" with the "MARKUS semi-automatic text annotation system" to determine whether reading effectiveness and technology acceptance differ significantly when the systems support humanities scholars in interpreting texts. Semi-structured in-depth interviews were also conducted to understand scholars' opinions and perceptions of the automatic text annotation system, and the relationships among its reading effectiveness, technology acceptance, and user behavior processes were analyzed. The results show that reading effectiveness with the automatic text annotation system was higher than with MARKUS, but the difference was not significant, whereas its technology acceptance was significantly better than that of MARKUS. The interviews indicate that the reading interface of the automatic text annotation system is simple and clear and better suited to reading than that of MARKUS, and that the ease of use and usefulness of the reading interface are key factors in whether humanities scholars accept a system for assisting digital humanities research. A comparison of similar functions in the two systems also found that vocabulary lookup, linking to source websites, and adding annotations are more intuitive and easier to use in the automatic text annotation system. In addition, humanities scholars generally considered sentence segmentation more important than automatic word segmentation, and found Moedict (萌典) the most helpful of the linked source databases. Finally, there was no significant correlation between reading effectiveness and user behavior processes with the automatic text annotation system. Keywords: Digital humanities; Automatic annotation system; Automatic Chinese word segmentation; Linked data
