1 |
應用主題探勘與標籤聚合於標籤推薦之研究 / Application of Topic Mining and Tag Clustering for Tag Recommendation
高挺桂 (Kao, Ting Kuei), Unknown Date
Social tagging has been a popular way for users to interpret and share information since Web 2.0. As an alternative to traditional classification, it is convenient and flexible, so users can easily attach tags to content. It also has drawbacks: a considerable amount of content carries no tags at all, and many existing tags are vague or imprecise, which weakens a system's ability to organize and classify content by tag. To address these two problems, this study proposes an automated tag recommendation method that combines topic mining and tag clustering, with the aim of establishing tag recommendation rules that need no manual steps and recommend suitable tags to users.
We collected 2,500 popular Chinese-language articles from Pixnet (痞客邦) blogs, each with more than 5,000 views. After preprocessing, 1,939 articles were used to train the model and 400 served as the test corpus. In the topic mining part, we use the LDA topic model to compute the topic semantics of each article and associate them with existing tags, so that the topics of a new article can be predicted and topic-related tags recommended for it. Perplexity, a measure of model performance, is used to help select the number of LDA topics, which reduces the need to choose that number subjectively. In the tag clustering part, we use hierarchical clustering to group tags that have appeared together, in order to find tags with similar semantic concepts. The stopping condition is a minimum co-occurrence count of 1, which removes the need to fix the number of clusters in advance and lets the method find a suitable number of clusters automatically.
The results show that recommending tags according to an article's topic semantics is feasible to a certain degree, and that the number of topics selected with the help of perplexity gives more consistent results than the other topic numbers tested. The tags grouped together by hierarchical clustering do share similar conceptual semantics. Finally, combining topic mining with tag clustering raised Top-1 to Top-5 accuracy by an average of 14.1%, and Top-1 accuracy reached 72.25%. This indicates that an approach based on how users write articles and assign tags does help improve tag recommendation accuracy, and that the study has established automated recommendation rules that suggest suitable tags, so that users can tag an article more conveniently and precisely after writing it.
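The tag clustering step described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not state the linkage criterion, so single linkage (the strongest co-occurrence between any two members) is assumed here, and the function names `cooccurrence_counts` and `cluster_tags` are hypothetical. Only the stopping rule, a minimum co-occurrence count of 1, comes from the abstract.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_articles):
    """Count how often each unordered pair of tags appears on the same article."""
    counts = Counter()
    for tags in tagged_articles:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_tags(tagged_articles, min_cooccurrence=1):
    """Agglomerative clustering that stops when no two clusters co-occur enough."""
    counts = cooccurrence_counts(tagged_articles)
    # Start with one singleton cluster per distinct tag.
    clusters = [{t} for t in sorted({t for tags in tagged_articles for t in tags})]

    def linkage(c1, c2):
        # Single linkage (an assumption): strongest co-occurrence between members.
        return max(counts.get(tuple(sorted((a, b))), 0) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage score.
        best_score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < min_cooccurrence:
            break  # stopping condition: no remaining pair has co-occurred
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

Because merging stops on a co-occurrence threshold rather than a target cluster count, the number of clusters falls out of the data, which is the property the abstract highlights.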
|
2 |
階層式分群法在民事裁判要旨分群上之應用 / An Application of Hierarchical Clustering of Documents for Civil Judgments
何君豪 (Ho, Jim How), Unknown Date
The Judicial Yuan regularly engages senior judges to extract legal opinions of reference value from civil judgments and compile them into purports of civil judgments (民事裁判要旨). Judges consult these purports when hearing similar cases, so searching them is an indispensable task in judicial practice. However, with the development of information technology and the growing number of judgments, a single search may return several hundred purports, and judges must spend a great deal of time reading them. If data mining techniques could cluster the retrieved purports, and the clustering reached a sufficient level of accuracy, judges would save considerable reading time. In this study we apply hierarchical clustering to purports of civil judgments and use terms that appear in statutory provisions as weighted main keywords, evaluating whether this weighting improves the clustering, in order to assess the feasibility and effectiveness of hierarchical clustering for this task.
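The keyword weighting idea can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's method: documents are assumed to be already tokenized, the multiplicative `boost` factor and the cosine similarity measure are choices made for this example, and the names `weighted_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def weighted_vector(tokens, legal_terms, boost=2.0):
    """Term-frequency vector in which statutory terms count `boost` times."""
    tf = Counter(tokens)
    return {t: c * (boost if t in legal_terms else 1.0) for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0
```

With a boost above 1, two purports that share a statutory term score as more similar than they would on raw term frequency, so a hierarchical clustering built on these similarities would tend to merge them earlier.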
|
3 |
中文訴訟文書檢索系統雛形實作 / A Prototype of Information Services for Chinese Judicial Documents
藍家樑 (Lan, Chia Liang), Unknown Date
The number of litigation cases grows daily, and reading every case is clearly impractical, so users need a more capable retrieval system to assist them. Building on prior research, we implement a prototype of a clustering retrieval system for Chinese judicial documents. It searches for relevant cases according to the query and outputs the results in clusters, so that users can explore each cluster, reducing the reading burden while still giving a fuller picture of the results. The system also provides document marking and annotation functions, with which users can build a personal database for later retrieval.
When the query is a keyword, the system clusters the results with hierarchical clustering. It also uses an index built on co-occurring terms to list possibly related terms that users can query next. The query may instead be a description of the facts of a crime, in which case the system applies the k-nearest neighbor approach to find similar cases and groups them by cause of action. Users can also view the distribution of prison sentence lengths and retrieve the documents that fall within a specific interval.
A formal evaluation of the system is difficult because it is interactive, and there is no clear standard for judging whether it is helpful. We therefore analyze the efficiency of human subjects operating the system, together with statistics on similarity within the clustering results and on the distribution of prison sentence lengths, to examine how the system benefits users and where it needs improvement.
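The k-nearest neighbor retrieval mentioned in this abstract can be sketched as below. The abstract does not say how similarity between a crime description and past cases is measured, so Jaccard overlap on token sets is an assumption, input is assumed to be already tokenized, and the names `jaccard` and `knn_cause` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap between two token collections (an assumed similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def knn_cause(query_tokens, labeled_cases, k=3):
    """Return the majority cause of action among the k most similar cases,
    together with those cases. `labeled_cases` is a list of (tokens, cause)."""
    ranked = sorted(
        labeled_cases,
        key=lambda case: jaccard(query_tokens, case[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(cause for _, cause in neighbors)
    return votes.most_common(1)[0][0], neighbors
```

Returning the neighbors alongside the predicted cause mirrors the described behavior of both classifying the description and showing the user similar cases.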
|