Global ETD Search

1	電腦輔助簡易刑事判決技術之探討 / An Exploration of Computer Assisted Criminal Summary Judgments 張正宗, Cheng-Tsung Chang Unknown Date (has links) 我們以機器學習(Machine Learning)的方法，建立rule-based與case-based的instances，再藉由這些 instances來判斷起訴書的案由和法條，其最好的正確率只比人工建立的rules與cases所判斷的結果低7%而已。由於在我們最基本的方法中，一個判例就會被建立成一個instance，如此，我們將需要大量的空間來儲存instances，針對這個問題，我們也提出了instances clustering與刪除部份較不重要詞這兩個方法，來降低instances所佔的空間，經過簡化的系統的正確率不但與原本未刪減instances時差不多，還可以減少將近一半左右的儲存空間；而且如果我們將這兩個刪減instances的方法混合使用，甚至可以找到一個更好的解，不但能些微提升正確率，還可以把儲存instances所需的空間，降低為原本的四分之一左右。 / I apply machine learning techniques to construct rule-based and case-based reasoning systems. These systems determine the prosecution reasons and applicable articles of lawsuits, and at best achieve an accuracy only 7% lower than that of a manually built system. The baseline method constructs one instance for each prior lawsuit, so storing all instances takes a lot of space. To reduce the storage space, I propose two methods: clustering instances and removing less important words from instances. Each method keeps the accuracy close to the original while reducing the storage space by about half. When the two methods are combined, the accuracy improves slightly and the storage space drops to about one quarter of the original. 自然語言處理; 法資訊學; Machine Learning
2	利用詞組檢索中文訴訟文書之研究 / An Exploration of Indexing Chinese Judicial Documents with Term Pairs 謝淳達, Hsieh,Chwen-Dar Unknown Date (has links) 本文將針對相似訴訟文書之搜尋進行研究與探討。在這裡所說的「相似案件」指的是有著相同犯罪行為的案件。判例是法院對於訴訟案件所作的確定判決的先例。在法律案件審判的過程中，對法官和律師而言，與目前的新案件案情相似的過去判例有時是有參考價值的。這意味著我們可以透過判例來推測新的訴訟案件可能的判決方向，因此搜尋過去判例是有其價值的。與一般常用的資訊檢索方法中以單一詞彙作為索引不同的是，我們嘗試以案件事實段中的詞組（兩個詞彙的組合）集合為基礎，由於詞組所包含的資訊比詞彙還多，我們希望透過詞組集合的比對，能夠更精確地找出類似於新案件的過去判例，藉此幫助一般人搜尋過去的相似判例，並能夠從過去判例中自行推測所遇上的法律糾紛可能的判決方向。然而，由於既有的電子詞典並未包含所有可能的詞彙，尤其是訴訟文件中常出現的一些特定詞彙，因此我們提出了一個可以從文件中自動擷取可能的中文詞彙的方法，並且利用這些擷取而得的詞彙協助我們分析判決書的事實段文字。此外我們將相似案件搜尋系統應用在實作「案件分類器」上，用以猜測新案件可能的案件類型。在我們的實驗中，我們提出的中文詞彙擷取方法TermSpotter所擷取出來的詞彙中，詞頻為30次以上的擷取正確率（人工判定為有用的詞彙數量╱程式輸出詞彙數量）為56.3%，而且這些詞彙經過人工過濾後，有三分之一的詞彙（953個）是HowNet電子詞典中所沒有的詞彙。而我們實作的案件分類器，在竊盜、搶奪、強盜、贓物、恐嚇、傷害、賭博七大類型案件的案由分類實驗有89.3%的正確率，而賭博罪的法條分類實驗也有81.9%的正確率。至於相似案件搜尋實驗中，我們以人工判斷其效果，目前所搜尋到的過去判例只有42%是值得參考的，未來仍有空間需要繼續嘗試改進。 / I study information retrieval methods for retrieving similar judicial documents. Here, “similar judicial documents” refers to cases that involve similar criminal acts. For judges and lawyers, it is sometimes worth referring to prior cases that are similar to a new case during the judgment process. Information about the judgments in similar prior cases gives people a rough picture of how the new case might be judged. In this work, I use phrases (pairs of words) rather than individual words as indices of Chinese judicial documents. Phrases provide a better foundation for indexing and retrieving documents than individual words, because the constituents of a phrase make its other component words less ambiguous than when the words appear separately. I expect the system to help people who are not legal experts retrieve similar prior cases on their own. Existing electronic dictionaries do not contain all possible words, especially words that appear in domain-specific documents. 
Hence, I put forth an algorithm, “TermSpotter”, that automatically extracts candidate words from the corpus, and I use these words as the basis for constructing phrases in the system. Moreover, I implement a case classifier that automatically classifies new cases into several prosecution categories. In the experiments, 56.3% of the extracted words that occur at least 30 times were judged useful by manual inspection. About one third of these useful words are not included in HowNet, and some of them are specific to the legal domain. The implemented case classifier categorizes new cases into seven prosecution categories: larceny, robbery, robbery by threatening or disabling the victims, receiving stolen property, causing bodily harm, intimidation, and gambling. It reaches an accuracy of 89.3%. The classifier can also categorize cases by the criminal articles that are violated. In the experiment on classifying gambling cases into four combinations of three articles, it reaches an accuracy of 81.9%. In the experiment on retrieving prior cases similar to a new case, only 42% of the retrieved cases were judged worth consulting by a practicing judge, so considerable work remains to improve the system. 法學資訊; 自然語言處理; Machine Learning
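The term-pair indexing idea in entry 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes the fact description has already been segmented into words, forms a pair from any two distinct words in the same sentence (kept in text order), and scores similarity with the Jaccard coefficient. The function names and the toy cases are hypothetical.

```python
from itertools import combinations

def term_pairs(sentences):
    """Collect term pairs from pre-segmented sentences (lists of words)."""
    pairs = set()
    for words in sentences:
        # combinations() keeps text order: (earlier word, later word)
        for a, b in combinations(words, 2):
            if a != b:
                pairs.add((a, b))
    return pairs

def jaccard(p, q):
    """Jaccard similarity between two sets of term pairs."""
    if not p and not q:
        return 0.0
    return len(p & q) / len(p | q)

def rank_prior_cases(query, prior_cases):
    """Return (case_id, score) tuples sorted by descending similarity to the query."""
    q = term_pairs(query)
    scored = [(cid, jaccard(q, term_pairs(doc))) for cid, doc in prior_cases.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical toy data: each case is a list of segmented sentences.
prior = {
    "case_A": [["defendant", "steal", "motorcycle"]],
    "case_B": [["defendant", "operate", "gambling", "machine"]],
}
query = [["defendant", "steal", "wallet"]]
ranking = rank_prior_cases(query, prior)
```

Comparing sets of pairs rather than single words means that a shared word such as "defendant" contributes nothing on its own; only a shared combination such as ("defendant", "steal") counts as evidence of similarity.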
3	英文技術文獻中動詞與其受詞之中文翻譯的語境效用 / Collocational Influences on the Chinese Translations of English Verbs and Their Objects in Technical Documents 莊怡軒, Chuang, Yi Hsuan Unknown Date (has links) 本研究使用英漢平行語料庫，試圖從中找尋英文與中文之間的翻譯情形，我們將英文及中文的動名詞組合 (V-N-collocation) 作為觀察對象。本研究各別分析英漢專利平行文句語料庫及科學人雜誌英漢對照電子書兩套語料庫，將中英文互為翻譯的文件視為一體，觀察英文及中文語言其中的特定結構及共現性 (collocation) ，建構由真實世界的語料所反應的語言翻譯模型。我們使用技術名詞表將平行語料庫進行技術名詞斷詞，再將句子進行結構剖析得到關係樹 (dependency tree) ，並利用關係樹結構及近義詞典取得英漢動名詞組合。本研究運用英漢動名詞組合建立英文動詞與名詞的翻譯模型，我們的系統可以根據不同的模型推薦翻譯，並比較這些翻譯模型的成效；最後也加入中文語言使用者翻譯英文動詞的實驗與本研究的翻譯模型效果作比較，結果顯示本研究的翻譯模型比起受試者，可以有較好的推薦效果。 / In this investigation, we study English verb-noun collocations (V-N collocations) and the corresponding usage in Chinese. Discovering English-Chinese V-N collocations requires a rich corpus, so we obtained one million English-Chinese parallel patent sentence pairs and seven years of the bilingual Scientific American as two corpora for analysis. We trained translation models to capture the usage of V-N collocations in English and Chinese. Given an English V-N collocation and the corresponding Chinese information, our system recommends translations of the English verb or object in the collocation according to the translation models. We experimented with ten formulas for training the models on the two corpora and observed similar trends in the analyses. Preliminary comparisons of the translation quality of human subjects and our system indicated that our system could offer better recommendations for the translation tasks. 機器翻譯; 特徵評比; 自然語言處理
4	以範例為基礎之英漢TIMSS試題輔助翻譯 / Using Example-based Translation Techniques for Computer Assisted Translation of TIMSS Test Items 張智傑, Chang, Chih Chieh Unknown Date (has links) 本論文應用以範例為基礎的機器翻譯技術，應用英漢雙語對應的結構輔助英漢單句語料的翻譯。翻譯範例是運用一種特殊的結構，此結構包含來源句的剖析樹、目標句的字串、以及目標句和來源句詞彙對應關係。將翻譯範例建立資料庫，以提供來源句作詞序交換的依據，接著透過字典翻譯，以及利用統計式中英詞彙對列和語言模型來選詞，最後填補缺少的量詞，產生建議的翻譯。我們是以2003年國際數學與科學教育成就趨勢調查測驗試題為主要翻譯的對象，以期提升翻譯的一致性和效率。以NIST 和BLEU 的評比方式，來評估和比較Google Translate 和Yahoo!線上翻譯系統及本系統所達成的翻譯品質。我們的系統經過詞序調動以及填補量詞後，翻譯品質比我們前一代系統要佳，但整體效果沒有比Google Translate 和Yahoo!線上翻譯的品質要佳。 / This paper presents an example-based machine translation approach built on the bilingual structured string tree correspondence (BSSTC). The BSSTC structure includes a parse tree in the source language, a string in the target language, and the correspondence between the source language tree and the target language string. We designed an English-to-Chinese computer-assisted translation system for the Trends in International Mathematics and Science Study (TIMSS) that combines BSSTC-based reordering, dictionary translation, statistical selection of word translations, and measure word generation. We evaluated our system with the BLEU and NIST scores and compared it with Google Translate and Yahoo! Translate. By reordering selected word sequences and inserting measure words in the default translations, the current system achieved higher-quality default translations than the previous implementation from our research group, but its overall quality still lags behind that of Google and Yahoo!. 自然語言處理; 試題翻譯; 機器翻譯; Natural language processing; Item translation; Machine translation; TIMSS
5	中文訴訟文書檢索系統雛形實作 / A Prototype of Information Services for Chinese Judicial Documents 藍家樑, Lan, Chia Liang Unknown Date (has links) 訴訟案件與日俱增，欲閱讀完所有案件顯然不容易，此時便需要一套較完善的檢索系統來輔助使用者。我們整合前人的相關研究成果，實作一套分群式檢索系統的雛形，依檢索條件搜尋相關案件，並將結果分群輸出，便於使用者對各群集進行查詢，以期減少使用者閱讀案件上的負擔，同時獲得較完整資訊。另設計文件標記與註解功能，供使用者建立個人化資料庫，便於日後檢索。當輸入為關鍵詞時我們利用階層式分群法來為結果作分群，也以共現詞彙的概念建立的索引，列出可能的相關詞彙提供使用者作查詢；檢索條件亦可輸入一段犯罪事實，系統透過k最近鄰居法的概念，找到相似的案件，依照案由分群。另外也可以透過判決刑期分佈針對特定區間作檢索。本系統難以進行較正規的實驗，因為這是一個使用者互動的系統，而適不適用也難有一個評定標準。我們從使用者的執行效率，以及對於分群結果的相似度與判決刑期統計來分析與討論，檢驗本系統對使用者的助益以及討論系統本身須要再改善之處。 / Because the number of judgments keeps growing, it is clearly not easy for users to read all judicial documents, and they need a handier system for retrieving judgment information. We present a prototype of a clustering-based retrieval system for Chinese judicial documents. The system automatically clusters and integrates search results, so users can focus on the information they need and skip the rest. While reading a judicial document, users can mark passages of interest or annotate them with comments, which lets them build a personalized database and search more easily later. When the user types a keyword, the system applies hierarchical clustering to the search results. Users can also view words that may be related to the keyword from lists of co-occurring words. In addition, users can input a crime description, and the system applies the k-nearest neighbor method to classify the crime under a prosecution reason and return similar cases. Moreover, the system lets users view the distribution of prison sentence lengths and the documents within a specific interval. A formal evaluation of the system is difficult because it is interactive, and there is no definite standard for judging whether it is helpful. We evaluated the efficiency of the system through the operations of human subjects. 
We also computed statistics on the similarity and the distribution of prison sentence lengths in the clustering results, and we discuss how the system helps users and how it could be improved. 法學資訊系統; 自然語言處理; 階層式分群法; k最近鄰居法
6	詞彙向量的理論與評估基於矩陣分解與神經網絡 / Theory and evaluation of word embedding based on matrix factorization and neural network 張文嘉, Jhang, Wun Jia Unknown Date (has links) 隨著機器學習在越來越多任務中有突破性的發展，特別是在自然語言處理問題上，得到越來越多的關注，近年來，詞向量是自然語言處理研究中最令人興奮的部分之一。在這篇論文中，我們討論了兩種主要的詞向量學習方法。一種是傳統的矩陣分解，如奇異值分解，另一種是基於神經網絡模型（具有負採樣的Skip-gram模型（Mikolov等人提出，2013）），它是一種迭代演算法。我們提出一種方法來挑選初始值，透過使用奇異值分解得到的詞向量當作是Skip-gram模型的初始值，結果發現替換較佳的初始值，在某些自然語言處理的任務中得到明顯的提升。 / Word embedding has recently been one of the most exciting areas of research in natural language processing. In this thesis, we discuss the two major learning approaches for word embedding. One is traditional matrix factorization, such as singular value decomposition; the other is based on neural network models (e.g. the Skip-gram model with negative sampling (Mikolov et al., 2013b)), which are trained by an iterative algorithm. It is known that an iterative process is sensitive to its initial values. We present an approach that runs the Skip-gram model with negative sampling from initial values obtained by singular value decomposition. Furthermore, we show that the refined starting points improve performance on the analogy task and succeed in capturing fine-grained semantic and syntactic regularities using vector arithmetic. 矩陣分解; 初始值; 自然語言處理; 神經網絡; Matrix factorization; Initialization; Natural language processing; Neural network
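The two ingredients named in entry 6 can be written out as follows. This is a sketch under assumptions: the first line is the standard Skip-gram negative-sampling objective for one (word, context) pair with k negative samples, while the choice of the factored matrix M (for example a PPMI word-context matrix) and the symmetric split of the singular values are illustrative assumptions, since the abstract does not state them. The symbols u, v, W, C, and P_n are notation introduced here.

```latex
\begin{align*}
% Skip-gram with negative sampling: objective for one (word w, context c) pair
\ell(w, c) &= \log \sigma\!\left(\mathbf{u}_w^{\top} \mathbf{v}_c\right)
  + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[\log \sigma\!\left(-\mathbf{u}_w^{\top} \mathbf{v}_{c_i}\right)\right] \\
% Rank-d truncated SVD of a word-context matrix M (choice of M is an assumption)
M &\approx U_d \Sigma_d V_d^{\top} \\
% One possible initialization of the word and context vectors (an assumption)
W^{(0)} &= U_d \Sigma_d^{1/2}, \qquad C^{(0)} = V_d \Sigma_d^{1/2}
\end{align*}
```

Under this reading, the rows of W^{(0)} and C^{(0)} replace the usual random starting vectors u_w and v_c, and the iterative updates then proceed as usual.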
7	英文聽力字彙與聽寫練習題輔助出題系統之研究 / Computer-Assisted Item Generation for Listening Cloze and Dictation Practice in English 黃上銘, Huang, Shang-Ming Unknown Date (has links) 以Web為基礎的語言測驗（Web-based language test）目前多為靜態網頁或是互動性低的動態網頁，如何提昇測驗系統的互動性，仍有待探討。目前已有學者利用自然語言處理技術來自動產生試題，並分析學生的作答錯誤。然而，相關研究仍多著重於字彙測驗與文法評量上。本研究針對聽力測驗，發展一套線上練習系統，並運用自然語言處理技術來自動產生聽力選擇題，以及分析學生的聽寫錯誤。我們分析語料庫中的詞彙，將發音相近的單字歸為一類，系統要產生聽力選擇題時，即從單字所屬的群組中，隨機挑選其他發音相近的單字當作誘答選項。而在分析聽寫錯誤方面，我們從語尾變化、拼字、拼音三種層次來分析學生的作答；系統批改完後，會按照學生的錯誤類型，提供不同種類的練習。本練習系統共有聽寫測驗、單字聽力選擇題、單字組聽力選擇題三種題型。系統可以輔助教師大量地編製試題與測驗，並輔助批改學生的作答。學生則可透過網路來接受測驗，並針對自己的弱點單字來作練習。 / Most Web-based language tests are currently static pages or dynamic pages with low interactivity, and how to enhance the interactivity of test systems remains an open question. Some researchers have used natural language processing techniques for automatic item generation and intelligent error analysis of students’ answers. However, most known research places more emphasis on vocabulary tests and grammar correction. In this paper, we propose a Web-based practice system for English dictation and listening cloze. We use natural language processing techniques to automatically generate multiple-choice items and analyze students’ dictation errors. First, we analyze the words in the corpus and group them by the similarity of their phones. When the system generates a multiple-choice item, it looks up the cluster containing the target word and randomly picks other words with similar phones as distractors. Second, we analyze students’ dictation at three levels: inflected forms, spelling errors, and phoneme structures. After analyzing students’ answers, the system provides different practice items according to the error types. The practice system has three item types: one for dictation and two for multiple-choice listening cloze. The system can assist teachers in authoring items and grading test results. Students can take tests over the Internet and practice the words they are weak on. 
電腦輔助教學; 自然語言處理; 線上測驗; 聽力克漏字; 聽寫測驗; Computer-Assisted Learning; Natural language processing; Web-based test; listening cloze; dictation test
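The distractor selection described in entry 7 (group words that sound alike, then pick other members of the target's group at random) can be sketched as below. This is an illustration under assumptions, not the thesis's method: the tiny pronunciation table `PRON` is made up for the example, and "similar phones" is approximated as a phoneme-level edit distance of at most one, a measure the abstract does not specify.

```python
import random

# Hypothetical pronunciations (ARPAbet-style phoneme lists), for illustration only.
PRON = {
    "ship":  ["SH", "IH", "P"],
    "sheep": ["SH", "IY", "P"],
    "chip":  ["CH", "IH", "P"],
    "shop":  ["SH", "AA", "P"],
    "table": ["T", "EY", "B", "AH", "L"],
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def similar_words(target, max_dist=1):
    """Words whose pronunciation is within max_dist phoneme edits of the target."""
    return sorted(w for w in PRON
                  if w != target and edit_distance(PRON[target], PRON[w]) <= max_dist)

def make_item(target, n_distractors=2, seed=0):
    """Build a multiple-choice listening item: the answer plus random similar-sounding distractors."""
    rng = random.Random(seed)
    pool = similar_words(target)
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = distractors + [target]
    rng.shuffle(options)
    return {"answer": target, "options": options}

item = make_item("ship")
```

Fixing the seed makes an item reproducible for grading, while a different seed per student would vary the distractors drawn from the same group.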
8	探索美國財務報表的主觀性詞彙與盈餘的關聯性:意見分析之應用 / Exploring the relationships between annual earnings and subjective expressions in US financial statements: opinion analysis applications 陳建良, Chen, Chien Liang Unknown Date (has links) 財務報表中的主觀性詞彙往往影響市場中的參與者對於報導公司價值和獲利能力衡量的決策判斷。因此，公司的管理階層往往有高度的動機小心謹慎的選擇用詞以隱藏負面的消息而宣揚正面的消息。然而使用人工方式從文字量極大的財務報表挖掘有用的資訊往往不可行，因此本研究採用人工智慧方法驗證美國財務報表中的主觀性多字詞 (subjective MWEs) 和公司的財務狀況是否具有關聯性。多字詞模型往往比傳統的單字詞模型更能掌握句子中的語意情境，因此本研究應用條件隨機域模型 (conditional random field) 辨識多字詞形式的意見樣式。另外，本研究的實證結果發現一些跡象可以印證一般人對於財務報表的文字揭露往往與真實的財務數字存在有落差的印象；更發現在負向的盈餘變化情況下，公司管理階層通常輕描淡寫當下的短拙卻堅定地承諾璀璨的未來。 / Subjective assertions in financial statements influence the judgments of market participants when they assess the value and profitability of the reporting corporations. Hence, corporate management may attempt to conceal the negative and accentuate the positive with "prudent" wording. To uncover this accounting phenomenon hidden in financial statements, we designed an artificial-intelligence-based strategy to investigate the linkage between financial status, measured by annual earnings, and subjective multi-word expressions (MWEs). We applied conditional random field (CRF) models to identify opinion patterns in the form of MWEs, and our approach outperformed previous work that employed unigram models. Moreover, our novel algorithms lead the way in uncovering evidence that supports the common belief that the implications of the written statements are inconsistent with the reality indicated by the figures in the financial statements. Unexpected negative earnings are often accompanied by ambiguous and mild statements, and sometimes by promises of a glorious future. 意見探勘; 自然語言處理; 語意分析; 財務報表文字探勘; 資訊擷取; opinion mining; natural language processing; sentiment analysis; financial text mining; information extraction
9	中文詞彙集的來源與權重對中文裁判書分類成效的影響 / Exploring the Influences of Lexical Sources and Term Weights on the Classification of Chinese Judgment Documents 鄭人豪, Cheng, Jen-Hao Unknown Date (has links) 國外法學資訊系統已研究多年，嘗試利用科技幫助提昇司法審判的效率。重要的議題包括輔助判決，法律文件分類，或是相似案件搜尋等。本研究將針對中文裁判書的分類做進一步探討。在文件特徵表示方面，我們以有序詞組來表達中文裁判書，我們嘗試比較採用不同的詞彙來源對於分類效果的影響。實驗中我們分別採用一般通用的電子詞典建立一般詞組；以及以演算法取出法學專業詞彙集建立專業詞組。並依tf-idf(term frequency – inverse document frequency)的概念，設計兩種詞組權重tpf-idf(term pair frequency – inverse document frequency)以及tpf-icf(term pair frequency – inverse category frequency)，來計算特徵詞組權重。在文件分類演算法方面，我們實作以相似度為基礎的k最近鄰居法作為系統分類機制，藉由裁判書的案由欄位，將案例分為七種類別，分別為竊盜、搶奪、強盜、贓物、傷害、恐嚇以及賭博。並藉由觀察案例資料庫的相似度分佈，以找出恰當的參數，進一步得到較佳的分類正確率與較低的拒絕率。我們並依照自省式學習法的精神，建立權重調整的機制。企圖藉由自省式學習法提昇分類效果，以及找出對分類有影響的詞組。而我們以案例資料庫的相似度差異值以及距離差異值，分析調整前後案例資料庫的變化，藉以觀察自省式學習法的效果。 / Legal information systems for non-Chinese languages have been studied intensively for many years. Topics under discussion include judgment assistance, legal document classification, and similar case search. This thesis studies the classification of Chinese judgment documents. I use ordered phrases (term pairs) as the indices for documents, and I compare the influence of different lexical sources for segmenting Chinese text. One lexical source is a general machine-readable dictionary, HowNet, and the other is a set of terms algorithmically extracted from legal documents. Based on the concept of tf-idf, I design two kinds of phrase weights: tpf-idf (term pair frequency, inverse document frequency) and tpf-icf (term pair frequency, inverse category frequency). In the experiments, I use the k-nearest neighbor method to classify Chinese judgment documents into seven categories based on their prosecution reasons: larceny (竊盜), robbery (搶奪), robbery by threatening or disabling the victims (強盜), receiving stolen property (贓物), causing bodily harm (傷害), intimidation (恐嚇), and gambling (賭博). To achieve high accuracy with low rejection rates, I examine the distribution of similarity among the training documents to select appropriate parameters. 
In addition, I conduct an analogous set of experiments on classifying gambling cases by their cited legal articles. To improve classification performance, I apply the introspective learning technique to adjust the weights of phrases. I use intra-cluster and inter-cluster similarity to evaluate the effects of weight adjustment in the experiments on classifying documents by prosecution reason and by cited articles. 法學資訊系統; 自然語言處理; k最近鄰居法; 自省式學習法; Legal information system; Natural language processing; k nearest neighbor; introspective learning
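The weighting and classification scheme in entry 9 can be sketched as follows. The abstract names tpf-idf and tpf-icf but gives no formulas, so this sketch assumes they follow the standard tf-idf form with term pairs in place of terms, and with category counts in place of document counts for icf. Cosine similarity, the similarity-weighted vote, and the `reject_below` threshold are likewise assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """log(N / df) for every term pair, where df counts documents containing the pair."""
    df = Counter(p for d in docs for p in set(d))
    return {p: math.log(len(docs) / c) for p, c in df.items()}

def icf_table(docs, labels):
    """log(C / cf) for every term pair, where cf counts categories containing the pair."""
    cats_of = {}
    for d, y in zip(docs, labels):
        for p in d:
            cats_of.setdefault(p, set()).add(y)
    n_cats = len(set(labels))
    return {p: math.log(n_cats / len(c)) for p, c in cats_of.items()}

def weigh(doc, table):
    """Term-pair frequency times the table weight; pairs unseen in training are dropped."""
    return {p: f * table[p] for p, f in Counter(doc).items() if p in table}

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, docs, labels, table, k=3, reject_below=0.1):
    """Similarity-weighted k-NN vote; returns None (rejection) if the best match is too weak."""
    q = weigh(query, table)
    top = sorted(((cosine(q, weigh(d, table)), y) for d, y in zip(docs, labels)),
                 reverse=True)[:k]
    if not top or top[0][0] < reject_below:
        return None
    votes = Counter()
    for sim, y in top:
        votes[y] += sim
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: each document is a list of ordered term pairs.
docs = [
    [("steal", "motorcycle"), ("night", "steal")],
    [("steal", "wallet"), ("night", "steal")],
    [("gambling", "machine"), ("operate", "gambling")],
    [("gambling", "machine"), ("bet", "money")],
]
labels = ["larceny", "larceny", "gambling", "gambling"]
query = [("night", "steal"), ("steal", "bicycle")]
predicted = knn_classify(query, docs, labels, icf_table(docs, labels))
```

Returning `None` when the best similarity falls below the threshold is one way to model the rejection rate mentioned in the abstract: raising the threshold rejects more queries but should leave fewer misclassified ones.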
10	電腦輔助試題翻譯：以國際數學與科學教育成就趨勢調查為例 / Computer Aided Item Translation for the Trends in International Mathematics and Science Study 呂明欣, Lu,Ming-Shin Unknown Date (has links) 由國際教育學習成就調查委員會統一命題之國際數學與科學教育成就趨勢調查測驗，為便於台灣中小學生施測與理解，英文原文試題內容需要經過許多人工討論及翻譯時間。為了增進翻譯內容一致性及其效率，我們設計一套符合測驗試題的輔助翻譯系統，將不同格式的試題文件，經執行語法分析式的片語擷取和字典查詢，透過使用者介面，選擇合適的片語詞彙翻譯選項和詞序調整，以及提供目前常用之線上翻譯服務、回顧翻譯類似句、以及加減詞彙等功能。為了能提昇翻譯詞彙的選擇正確性，我們記錄翻譯者選詞動作，讓翻譯者能回顧過去曾處理過的翻譯類似句，並且按照系統提供之選詞頻率資訊、科學領域的期刊語料之詞頻統計，以及利用統計式中英詞彙對列和語言模型，更改選詞的優先順序。我們嘗試以過去試題為實驗對象，按年級及學科區分6大試題類別，搭配4種選詞策略，透過BLEU及NIST之翻譯評估指標比較線上翻譯系統和本系統，實驗結果顯示在各實驗組的評估上均有優於線上翻譯系統的效果。 / Test items used in the Trends in International Mathematics and Science Study (TIMSS) are designed by the International Association for the Evaluation of Educational Achievement to help education scientists measure students’ competence in science and mathematics. Translating the English items into Chinese demands a lot of work. Therefore, we offer a computer-aided translation environment to improve the consistency and efficiency of the translation process. Through the user interface, translators can load test items in different document formats, use phrase analysis and dictionary lookup to find candidate phrase translations, and adjust word order. Users can also obtain translations from the online services provided by Google and Yahoo, look up previously translated items that contain specific word patterns, and so on. To select appropriate Chinese translations for English words, we consider users’ past selections, word frequencies in relevant corpora, and other language-related information from parallel corpora. We used test items from TIMSS 1999 and TIMSS 2003 to evaluate the effectiveness of the translation environment. Translations recommended by our system were compared with the actual Chinese translations of the test data, and the similarity was measured with the BLEU and NIST metrics. 
Experimental results indicate that our system performed better than or comparably to the Google and Yahoo online translation systems. 自然語言處理; 電腦輔助教學; 受限語言; 試題翻譯; 機器翻譯; Natural language processing; Computer-aided learning; Controlled-language; Item translation; Machine translation; TIMSS
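Entries 4 and 10 both score translations with BLEU. The sketch below shows the core of that metric in a simplified sentence-level form: clipped n-gram precision combined with a brevity penalty, against a single reference and without smoothing. It is an illustration only. The evaluations in these theses may have used corpus-level scores, multiple references, or a standard toolkit, and the NIST metric (which weights n-grams by informativeness) is not shown. Chinese output would first need to be segmented into tokens.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # The brevity penalty discourages candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the area of the triangle is six square centimeters".split()
shorter = "the area of the triangle is six".split()
score_same = bleu(reference, reference)
score_short = bleu(shorter, reference, max_n=2)
```

In this example every n-gram of the shorter candidate appears in the reference, so its score is reduced only by the brevity penalty.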

1	電腦輔助簡易刑事判決技術之探討 / An Exploration of Computer Assisted Criminal Summary Judgments 張正宗, Cheng-Tsung Chang Unknown Date (has links) 我們以機器學習(Machine Learning)的方法，建立rule-based與case-based的instances，再藉由這些 instances來判斷起訴書的案由和法條，其最好的正確率只比人工建立的rules與cases所判斷的結果低7%而已。由於在我們最基本的方法中，一個判例就會被建立成一個instance，如此，我們將需要大量的空間來儲存instances，針對這個問題，我們也提出了instances clustering與刪除部份較不重要詞這兩個方法，來降低instances所佔的空間，經過簡化的系統的正確率不但與原本未刪減instances時差不多，還可以減少將近一半左右的儲存空間；而且如果我們將這兩個刪減instances的方法混合使用，甚致可以找到一個更好的解，不但能些微提升正確率，還可以把儲存instances所需的空間，降低為原本的四分之一左右。 / I apply machine learning techniques to constructing rule-based and case-based reasoning systems. These systems determine the prosecution reasons and applicable articles of lawsuits, and may achieve an accuracy that is just 7% lower than that achieved by a manually-built system. The baseline method constructs one instance for each prior lawsuit, so it takes much space to store all instances. To reduce the storage space, I propose two methods – clustering instance and removing some less important words in instances. The effects of these methods not only maintain the original accuracy, but also reduce the storage space by half. When I integrated all proposed methods, I can even improve the accuracy slightly and reduce the storage space by three quarters. 自然語言處理法資訊學 Machine Learning
2	利用詞組檢索中文訴訟文書之研究 / An Exploration of Indexing Chinese Judicial Documents with Term Pairs 謝淳達, Hsieh,Chwen-Dar Unknown Date (has links) 本文將針對相似訴訟文書之搜尋進行研究與探討。在這裡所說的「相似案件」指的是有著相同犯罪行為的案件。判例是法院對於訴訟案件所作的確定判決的先例。在法律案件審判的過程中，對法官和律師而言，與目前的新案件案情相似的過去判例有時是有參考價值的。這意味著我們可以透過判例來推測新的訴訟案件可能的判決方向，因此搜尋過去判例是有其價值的。與一般常用的資訊檢索方法中以單一詞彙作為索引不同的是，我們嘗試以案件事實段中的詞組（兩個詞彙的組合）集合為基礎，由於詞組所包含的資訊比詞彙還多，我們希望透過詞組集合的比對，能夠更精確地找出類似於新案件的過去判例，藉此幫助一般人搜尋過去的相似判例，並能夠從過去判例中自行推測所遇上的法律糾紛可能的判決方向。然而，由於既有的電子詞典並未包含所有可能的詞彙，尤其是訴訟文件中常出現的一些特定詞彙，因此我們提出了一個可以從文件中自動擷取可能的中文詞彙的方法，並且利用這些擷取而得的詞彙協助我們分析判決書的事實段文字。此外我們將相似案件搜尋系統應用在實作「案件分類器」上，用以猜測新案件可能的案件類型。在我們的實驗中，我們提出的中文詞彙擷取方法TermSpotter所擷取出來的詞彙中，詞頻為30次以上的擷取正確率（人工判定為有用的詞彙數量╱程式輸出詞彙數量）為56.3%，而且這些詞彙經過人工過濾後，有三分之一的詞彙（953個）是HowNet電子詞典中所沒有的詞彙。而我們實作的案件分類器，在竊盜、搶奪、強盜、贓物、恐嚇、傷害、賭博七大類型案件的案由分類實驗有89.3%的正確率，而賭博罪的法條分類實驗也有81.9%的正確率。至於相似案件搜尋實驗中，我們以人工判斷其效果，目前所搜尋到的過去判例只有42%是值得參考的，未來仍有空間需要繼續嘗試改進。 / I study information retrieval methods for retrieving similar judicial documents. Here “similar judicial documents” refers to “cases that have a similar process of criminal violation”. For judges and lawyers, it is sometimes worth referring to prior cases which are similar to the new case in the process of judgment. Information about the judgments of the similar prior cases helps people to obtain a rough picture about how the new cases might be judged. In this work, I use phrases, rather than individual words as indices of Chinese judicial documents. Phrases provide a better foundation for indexing and retrieving documents than individual words. Constituents of phrases make other component words in the phrase less unambiguous than when the words appear separately. I expect the system could help anyone who is not a legal expert to retrieve similar prior cases on their own. The existing electronic dictionary does not collect all the possible words, especially the words that appear in specific-domain documents. 
Hence, I put forth an algorithm to automatically retrieve possible words in the corpus, and we will use these words as the basis to construct phrases in our system. Moreover, I implement the case classifier to automatically classify new cases into several different prosecution categories. I put forth the algorithm “TermSpotter” to automatically retrieve possible words that occur more than 30 times. In the experiments, 56.3% of the retrieved words are considered as useful words after manual filtration. Among these useful words, about one third of the words are not included in HowNet, and some of them are legal-domain-specific words. The implemented case classifier categorizes new cases into seven different prosecution categories: larceny, robbery, robbery by threatening or disabling the victims, receiving stolen property, causing bodily harm, intimidation, and gambling. It reaches 89.3% in accuracy. The classifier can also categorize cases based on what criminal articles are violated. In the experiment of classifying gambling cases into four combinations of three articles, it reaches 81.9% in accuracy. In the experiment of retrieving prior cases which are similar to the new case, it only reaches 42% in accuracy judged by a practicing judge, so there is a lot of work to do to improve the classifier. 法學資訊自然語言處理 Machine Learning
3	英文技術文獻中動詞與其受詞之中文翻譯的語境效用 / Collocational influences on the chinese translations of english verbs and their objects in technical documents 莊怡軒, Chuang, Yi Hsuan Unknown Date (has links) 本研究使用英漢平行語料庫，試圖從中找尋英文與中文之間的翻譯情形，我們將英文及中文的動名詞組合 (V-N-collocation) 作為觀察對象。本研究各別分析英漢專利平行文句語料庫及科學人雜誌英漢對照電子書兩套語料庫，將中英文互為翻譯的文件視為一體，觀察英文及中文語言其中的特定結構及共現性 (collocation) ，建構由真實世界的語料所反應的語言翻譯模型。　　我們使用技術名詞表將平行語料庫進行技術名詞斷詞，再將句子進行結構剖析得到關係樹 (dependency tree) ，並利用關係樹結構及近義詞典取得英漢動名詞組合。本研究運用英漢動名詞組合建立英文動詞與名詞的翻譯模型，我們的系統可以根據不同的模型推薦翻譯，並比較這些翻譯模型的成效；最後也加入中文語言使用者翻譯英文動詞的實驗與本研究的翻譯模型效果作比較，結果顯示本研究的翻譯模型比起受試者，可以有較好的推薦效果。 / In our investigation, we are interested in English Verb-Noun collocation (V-N collocation) and the corresponding usage in Chinese. To discover English-Chinese V-N collocation, a rich corpus is needed; therefore, we obtained one million English-Chinese parallel patent sentence pairs and seven years of bilingual Scientific American as two corpora to analyze. We trained translation models to find the usage of V-N collocations in English and Chinese. Given English V-N collocation and corresponding Chinese information, our system can recommend the proper translations of the English verb or object in collocation according to the translation models. We experimented ten formulas to train our models using two corpora, and observed similar trends in the analyses. Preliminary comparisons of the translation quality of human subjects and our system indicated that our system could offer better recommendations for the translation tasks. 機器翻譯特徵評比自然語言處理
4	以範例為基礎之英漢TIMSS詴題輔助翻譯 / Using Example-based Translation Techniques for Computer Assisted Translation of TIMSS Test Items 張智傑, Chang, Chih Chieh Unknown Date (has links) 本論文應用以範例為基礎的機器翻譯技術，應用英漢雙語對應的結構輔助英漢單句語料的翻譯。翻譯範例是運用一種特殊的結構，此結構包含來源句的剖析樹、目標句的字串、以及目標句和來源句詞彙對應關係。將翻譯範例建立資料庫，以提供來源句作詞序交換的依據，接著透過字典翻譯，以及利用統計式中英詞彙對列和語言模型來選詞，最後填補缺少的量詞，產生建議的翻譯。我們是以2003年國際數學與科學教育成就趨勢調查測驗詴題為主要翻譯的對象，以期提升翻譯的一致性和效率。以NIST 和BLEU 的評比方式，來評估和比較Google Translate 和Yahoo!線上翻譯系統及本系統所達成的翻譯品質。我們的系統經過詞序調動以及填補量詞後，翻譯品質比我們前一代系統要佳，但整體效果沒有比Google Translate 和Yahoo!線上翻譯的品質要佳。 / This paper presents an example-based machine translation based on bilingual structured string tree correspondence (BSSTC). The BSSTC structure includes a parse tree in source language, a string in target language and the correspondence between the source language tree and the target language string. / We designed an English to Chinese computer assisted translation system for Trends in International Mathematics and Science Study (TIMSS), through the BSSTC structure reordering, directory translation, choosing translation statistics model and measure word generation. / We evaluated our system by the BLEU and NIST score and compared with Google Translate and Yahoo! Translate. By reordering selected word sequences and inserting measure words in the default translations, the current system achieved a higher quality of default translations than the previous implementation of our research group, but the overall effects still lag behind that achieved by Google and Yahoo!. 自然語言處理試題翻譯機器翻譯 Natural language processing Item translation Machine translation TIMSS
5	中文訴訟文書檢索系統雛形實作 / A Prototype of Information Services for Chinese Judicial Documents 藍家樑, Lan, Chia Liang Unknown Date (has links) 訴訟案件與日俱增，欲閱讀完所有案件顯然不容易，此時便需要一套較完善的檢索系統來輔助使用者。我們整合前人的相關研究成果，實作一套分群式檢索系統的雛形，依檢索條件搜尋相關案件，並將結果分群輸出，便於使用者對各群集進行查詢，以期減少使用者閱讀案件上的負擔，同時獲得較完整資訊。另設計文件標記與註解功能，供使用者建立個人化資料庫，便於日後檢索。當輸入為關鍵詞時我們利用階層式分群法來為結果作分群，也以共現詞彙的概念建立的索引，列出可能的相關詞彙提供使用者作查詢；檢索條件亦可輸入一段犯罪事實，系統透過k最近鄰居法的概念，找到相似的案件，依照案由分群。另外也可以透過判決刑期分佈針對特定區間作檢索。本系統難以進行較正規的實驗，因為這是一個使用者互動的系統，而適不適用也難有一個評定標準。我們從使用者的執行效率，以及對於分群結果的相似度與判決刑期統計來分析與討論，檢驗本系統對使用者的助益以及討論系統本身須要再改善之處。 / Because cumulative number of the judgments grows unceasingly, it is obviously not easy for the users to read all the judicial documents. They need a handier system to retrieve the judgment information. We present a prototype of clustering retrieval system for Chinese judicial documents. The system can automatically cluster and integrate the search results. It is easy for the users to focus on the information they need and pass over the others. When they read a judicial document, they can mark some parts of sentences or annotate some comments if they are interested in. We let them create the personalized database and search more easily. We can type a keyword, and then our system executes the hierarchical clustering method to cluster search results. We also can view some words which may be relative to the keyword from the collocation word lists. Besides we can input a crime description, and then our system executes the k-nearest neighbor method to classify the crime into some prosecution reason and provide the similar cases. Moreover, our system lets the users view the distribution of prison sentence lengths and the documents in the specific interval. A formal evaluation of our system is not easy because this is an interactive system. We cannot definitely judge whether it is helpful or unhelpful. We evaluated the efficiency of our system by the operations of human subjects. 
Besides we made some statistics about the similarity and the distribution of prison sentence lengths from the clustering results. We tried to discuss the help by our system for users and how to improve the system. 法學資訊系統自然語言處理階層式分群法 k最近鄰居法
6	詞彙向量的理論與評估基於矩陣分解與神經網絡 / Theory and evaluation of word embedding based on matrix factorization and neural network 張文嘉, Jhang, Wun Jia Unknown Date (has links) 隨著機器學習在越來越多任務中有突破性的發展，特別是在自然語言處理問題上，得到越來越多的關注，近年來，詞向量是自然語言處理研究中最令人興奮的部分之一。在這篇論文中，我們討論了兩種主要的詞向量學習方法。一種是傳統的矩陣分解，如奇異值分解，另一種是基於神經網絡模型（具有負採樣的Skip-gram模型（Mikolov等人提出，2013），它是一種迭代演算法。我們提出一種方法來挑選初始值，透過使用奇異值分解得到的詞向量當作是Skip-gram模型的初始直，結果發現替換較佳的初始值，在某些自然語言處理的任務中得到明顯的提升。 / Recently, word embedding is one of the most exciting part of research in natural language processing. In this thesis, we discuss the two major learning approaches for word embedding. One is traditional matrix factorization like singular value decomposition, the other is based on neural network model (e.g. the Skip-gram model with negative sampling (Mikolov et al., 2013b)) which is an iterative algorithm. It is known that an iterative process is sensitive to initial starting values. We present an approach for implementing the Skip-gram model with negative sampling from a given initial value that is using singular value decomposition. Furthermore, we show that refined initial starting points improve the analogy task and succeed in capturing fine-gained semantic and syntactic regularities using vector arithmetic. 矩陣分解初始值自然語言處理神經網絡 Matrix factorization Initalization Natural language processing Neural network
7	英文聽力字彙與聽寫練習題輔助出題系統之研究 / Computer-Assisted Item Generation for Listening Cloze and Dictation Practice in English 黃上銘, Huang, Shang-Ming Unknown Date (has links) 以Web為基礎的語言測驗（Web-based language test）目前多為靜態網頁或是互動性低的動態網頁，如何提昇測驗系統的互動性，仍有待探討。目前已有學者利用自然語言處理技術來自動產生試題，並分析學生的作答錯誤。然而，相關研究仍多著重於字彙測驗與文法評量上。本研究針對聽力測驗，發展一套線上練習系統，並運用自然語言處理技術來自動產生聽力選擇題，以及分析學生的聽寫錯誤。我們分析語料庫中的詞彙，將發音相近的單字歸為一類，系統要產生聽力選擇題時，即從單字所屬的群組中，隨機挑選其他發音相近的單字當作誘答選項。而在分析聽寫錯誤方面，我們從語尾變化、拼字、拼音三種層次來分析學生的作答；系統批改完後，會按照學生的錯誤類型，提供不同種類的練習。本練習系統共有聽寫測驗、單字聽力選擇題、單字組聽力選擇題三種題型。系統可以輔助教師大量地編製試題與測驗，並輔助批改學生的作答。學生則可透過網路來接受測驗，並針對自己的弱點單字來作練習。 / Most Web-based language tests are presently static pages or dynamic pages with low interactivity. How to enhance the interactivity of the test systems is still under development. Today some researchers have utilized natural language processing techniques for automatic item generation and intelligent error analysis of students’ answers. However, most known research works put more emphases on vocabulary tests and grammar correction. In this paper, we propose a Web-based practice system for English dictation and listening cloze. We use natural language processing techniques to automatically generate multiple-choice item and analyze students’ dictation errors. First, we analyze the words in the corpus and group them based on the similarity of the phones. When the system generates multiple-choice items, it will search the matched cluster and randomly pick other words with similar phones. Second, we analyze students’ dictation from three aspects: inflected form, spelling errors and phoneme structures. After analyzing students’ answers, the system will provide different practice items according to the error types. Our practice system has three item types: one for dictation and two for multiple-choice listening cloze. The system can assist teachers in authoring items and grading the test results. Students can take the test through the Internet and practice on their weak words. 
電腦輔助教學自然語言處理線上測驗聽力克漏字聽寫測驗 Computer-Assisted Learning Natural language processing Web-based test listening cloze dictation test
8	探索美國財務報表的主觀性詞彙與盈餘的關聯性:意見分析之應用 / Exploring the relationships between annual earnings and subjective expressions in US financial statements: opinion analysis applications 陳建良, Chen, Chien Liang Unknown Date (has links) 財務報表中的主觀性詞彙往往影響市場中的參與者對於報導公司價值和獲利能力衡量的決策判斷。因此，公司的管理階層往往有高度的動機小心謹慎的選擇用詞以隱藏負面的消息而宣揚正面的消息。然而使用人工方式從文字量極大的財務報表挖掘有用的資訊往往不可行，因此本研究採用人工智慧方法驗證美國財務報表中的主觀性多字詞 (subjective MWEs) 和公司的財務狀況是否具有關聯性。多字詞模型往往比傳統的單字詞模型更能掌握句子中的語意情境，因此本研究應用條件隨機域模型 (conditional random field) 辨識多字詞形式的意見樣式。另外，本研究的實證結果發現一些跡象可以印證一般人對於財務報表的文字揭露往往與真實的財務數字存在有落差的印象；更發現在負向的盈餘變化情況下，公司管理階層通常輕描淡寫當下的短拙卻堅定地承諾璀璨的未來。 / Subjective assertions in financial statements influence the judgments of market participants when they assess the value and profitability of the reporting corporations. Hence, corporate management may attempt to conceal the negative and accentuate the positive with "prudent" wording. To uncover this accounting phenomenon hidden behind financial statements, we designed an artificial-intelligence-based strategy to investigate the linkage between financial status, measured by annual earnings, and subjective multi-word expressions (MWEs). We applied conditional random field (CRF) models to identify opinion patterns in the form of MWEs, and our approach outperformed previous work employing unigram models. Moreover, our novel algorithms take the lead in uncovering evidence that supports the common belief that there are inconsistencies between the implications of the written statements and the reality indicated by the figures in the financial statements. Unexpected negative earnings are often accompanied by ambiguous and mild statements and sometimes by promises of a glorious future. 意見探勘自然語言處理語意分析財務報表文字探勘資訊擷取 opinion mining natural language processing sentiment analysis financial text mining information extraction
9	中文詞彙集的來源與權重對中文裁判書分類成效的影響 / Exploring the Influences of Lexical Sources and Term Weights on the Classification of Chinese Judgment Documents 鄭人豪, Cheng, Jen-Hao Unknown Date (has links) 國外法學資訊系統已研究多年，嘗試利用科技幫助提昇司法審判的效率。重要的議題包括輔助判決，法律文件分類，或是相似案件搜尋等。本研究將針對中文裁判書的分類做進一步探討。在文件特徵表示方面，我們以有序詞組來表達中文裁判書，我們嘗試比較採用不同的詞彙來源對於分類效果的影響。實驗中我們分別採用一般通用的電子詞典建立一般詞組；以及以演算法取出法學專業詞彙集建立專業詞組。並依tf-idf(term frequency – inverse document frequency)的概念，設計兩種詞組權重tpf-idf(term pair frequency – inverse document frequency)以及tpf-icf(term pair frequency – inverse category frequency)，來計算特徵詞組權重。在文件分類演算法方面，我們實作以相似度為基礎的k最近鄰居法作為系統分類機制，藉由裁判書的案由欄位，將案例分為七種類別，分別為竊盜、搶奪、強盜、贓物、傷害、恐嚇以及賭博。並藉由觀察案例資料庫的相似度分佈，以找出恰當的參數，進一步得到較佳的分類正確率與較低的拒絕率。我們並依照自省式學習法的精神，建立權重調整的機制。企圖藉由自省式學習法提昇分類效果，以及找出對分類有影響的詞組。而我們以案例資料庫的相似度差異值以及距離差異值，分析調整前後案例資料庫的變化，藉以觀察自省式學習法的效果。 / Legal information systems for non-Chinese languages have been studied intensively for many years. Topics under discussion include judgment assistance, legal document classification, and similar-case search. This thesis studies the classification of Chinese judgment documents. I use phrases as the indices for documents, and I compare the influences of different lexical sources for segmenting Chinese text. One of the lexical sources is a general machine-readable dictionary, HowNet, and the other is the set of terms algorithmically extracted from legal documents. Based on the concept of tf-idf, I design two kinds of phrase weights: tpf-idf and tpf-icf. In the experiments, I use the k-nearest neighbor method to classify Chinese judgment documents into seven categories based on their prosecution reasons: larceny (竊盜), robbery (搶奪), robbery by threatening or disabling the victims (強盜), receiving stolen property (贓物), causing bodily harm (傷害), intimidation (恐嚇), and gambling (賭博). To achieve high accuracy with low rejection rates, I observe and discuss the distribution of similarity among the training documents to select appropriate parameters. 
In addition, I conduct a set of analogous experiments for classifying gambling cases based on their cited legal articles. To improve classification performance, I apply the introspective learning technique to adjust the weights of phrases. I use intra-cluster and inter-cluster similarity to evaluate the effects of weight adjustment in the experiments on classifying documents by prosecution reason and by cited article. 法學資訊系統自然語言處理 k最近鄰居法自省式學習法 Legal information system Natural language processing k nearest neighbor introspective learning
10	電腦輔助試題翻譯：以國際數學與科學教育成就趨勢調查為例 / Computer Aided Item Translation for the Trends in International Mathematics and Science Study 呂明欣, Lu,Ming-Shin Unknown Date (has links) 由國際教育學習成就調查委員會統一命題之國際數學與科學教育成就趨勢調查測驗，為便於台灣中小學生施測與理解，英文原文試題內容需要經過許多人工討論及翻譯時間。為了增進翻譯內容一致性及其效率，我們設計一套符合測驗試題的輔助翻譯系統，將不同格式的試題文件，經執行語法分析式的片語擷取和字典查詢，透過使用者介面，選擇合適的片語詞彙翻譯選項和詞序調整，以及提供目前常用之線上翻譯服務、回顧翻譯類似句、以及加減詞彙等功能。為了能提昇翻譯詞彙的選擇正確性，我們記錄翻譯者選詞動作，讓翻譯者能回顧過去曾處理過的翻譯類似句，並且按照系統提供之選詞頻率資訊、科學領域的期刊語料之詞頻統計，以及利用統計式中英詞彙對列和語言模型，更改選詞的優先順序。我們嘗試以過去試題為實驗對象，按年級及學科區分6大試題類別，搭配4種選詞策略，透過BLEU及NIST之翻譯評估指標比較線上翻譯系統和本系統，實驗結果顯示在各實驗組的評估上均有優於線上翻譯系統的效果。 / Test items used in the Trends in International Mathematics and Science Study (TIMSS) are designed by the International Association for the Evaluation of Educational Achievement to help education researchers measure students’ competence in science and mathematics. Translating the English items into Chinese demands a lot of work. Therefore, we offer a computer-aided translation environment to improve the consistency and efficiency of the translation process. Through the user interface, translators can load test items in different document formats, use phrase analysis and dictionary lookup to find candidate phrase translations, and adjust word order. Users of our system may also obtain translations from the on-line translation services provided by Google and Yahoo, look up previously translated items that contain specific word patterns, and so on. To select appropriate Chinese translations for English words, we considered users’ past selections, word frequencies in relevant corpora, and other language-related information in parallel corpora. We employed test items used in TIMSS 1999 and TIMSS 2003 to evaluate the effectiveness of our translation environment. Translations recommended by our system were compared with the actual Chinese translations of the test data, and the similarity was measured with the BLEU and NIST metrics. 
Experimental results indicate that our system performed as well as or better than the Google and Yahoo on-line translation systems. 自然語言處理電腦輔助教學受限語言試題翻譯機器翻譯 Natural language processing Computer-aided learning Controlled-language Item translation Machine translation TIMSS
