Global ETD Search

1	以詞性組合為基礎之中文語言特徵研究 / A Study of Part-of-Speech Pair-based Language Features in Chinese Texts
江易倫 (Jiang, Yi Lun), Unknown Date
In authorship attribution, the choice of language features is critical because it directly affects prediction performance. Commonly used features such as high-frequency words, n-grams, and punctuation classify well, but the word groups within these features cannot explain the causal relationships between classes or how the classes differ. To address this, the thesis proposes three linguistically meaningful feature types as supporting evidence: part-of-speech pairs, negation-degree pairs, and modal-word pairs. Using the texts of the author Lei Chen (雷震) as the baseline, it examines whether these features are applicable in two settings: "same topic, different authors" and "same author, different topics". A random forest classifier is built, its classification performance is evaluated with the out-of-bag (OOB) error rate, and feature importance values are used to find the weight of each word group as a decision point. Finally, the thesis aims to identify, from the classification rules, the word groups that distinguish different authors and different genres, and to explain them.
Keywords: authorship attribution; language features; random forest
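
The evaluation described in this abstract (out-of-bag error plus feature importance from a random forest) can be illustrated with a minimal, standard-library sketch. It is not the thesis's implementation: the trees are simplified to depth-1 stumps, the data are synthetic, and the names `fit_stump`, `predict_stump`, and `forest_oob` are hypothetical. In the thesis setting the rows would presumably be texts, the columns counts of word groups, and the labels authors or genres.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y, features):
    """Fit a depth-1 tree using only the candidate features (assumed simplification)."""
    parent = gini(y)
    majority = Counter(y).most_common(1)[0][0]
    best = {"feature": None, "gain": 0.0, "default": majority}
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            gain = parent - child  # impurity decrease of this split
            if gain > best["gain"]:
                best = {"feature": f, "threshold": thr, "gain": gain,
                        "left": Counter(left).most_common(1)[0][0],
                        "right": Counter(right).most_common(1)[0][0],
                        "default": majority}
    return best

def predict_stump(stump, row):
    """Predict one label; fall back to the majority class if no split was found."""
    if stump["feature"] is None:
        return stump["default"]
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]

def forest_oob(X, y, n_trees=50, mtry=2, seed=0):
    """Return (OOB error rate, normalized impurity-based feature importance)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = [Counter() for _ in range(n)]
    importance = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        feats = rng.sample(range(p), mtry)          # random feature subset
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if stump["feature"] is not None:
            importance[stump["feature"]] += stump["gain"]
        for i in range(n):
            if i not in in_bag:                     # only out-of-bag trees vote
                votes[i][predict_stump(stump, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(1 for i in scored if votes[i].most_common(1)[0][0] != y[i])
    oob_error = wrong / len(scored) if scored else float("nan")
    total = sum(importance)
    if total > 0:
        importance = [v / total for v in importance]
    return oob_error, importance

# Synthetic data: feature 0 separates the two classes, features 1 and 2 are noise.
rng = random.Random(1)
X, y = [], []
for label in ("A", "B"):
    for _ in range(20):
        signal = rng.random() + (2.0 if label == "B" else 0.0)
        X.append([signal, rng.random(), rng.random()])
        y.append(label)

oob_error, importance = forest_oob(X, y)
print("OOB error:", oob_error)
print("importance:", importance)
```

Each sample is scored only by the trees whose bootstrap sample did not contain it, so the OOB error estimates generalization without a separate test set. The importance vector ranks features by how much impurity their splits removed.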
2	隨機森林分類方法於基因組顯著性檢定上之應用 / Assessing the Significance of a Gene Set
卓達瑋, Unknown Date
Microarray data analysis has become an important topic in biomedical research. One major goal is to use quantitative data from gene experiments to study the relationship between gene expression and specific phenotypes. Most methods developed so far are single-gene methods, which use only the information in individual genes and cannot properly account for correlation among genes. This research addresses gene set analysis and proposes a statistical test for the significance of a given gene set with respect to a phenotype. To capture the relationship between the whole gene set and the phenotype, the test combines traditional hypothesis testing with classification: the test error of a Random Forests classifier serves as the test statistic, and the statistical conclusion is drawn from its permutation-based p-value. In simulation studies comparing the proposed test with seven other gene set analysis methods, it performs well both in controlling the type I error rate and in achieving high power. Finally, the method is applied to several real gene expression data sets and the results are discussed.
Keywords: phenotypes; gene set analysis; Random Forests; permutation-based p-value
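
The test described in this abstract uses a classifier's test error as the statistic and obtains a p-value by permutation. The sketch below shows that general idea under stated assumptions, not the thesis's exact procedure: a nearest-centroid classifier stands in for Random Forests to keep the example dependency-free, the train/test split is a single random half split, the phenotype labels are the quantity being permuted, and the names `holdout_error` and `permutation_p_value` are hypothetical.

```python
import random

def centroid_fit(X, y):
    """Mean expression vector per class (stand-in classifier, NOT Random Forests)."""
    sums, counts = {}, {}
    for row, lab in zip(X, y):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def centroid_predict(model, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(sorted(model), key=lambda lab: dist2(model[lab]))

def holdout_error(X, y, train_idx, test_idx):
    """Test statistic: misclassification rate on the held-out samples."""
    model = centroid_fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    wrong = sum(centroid_predict(model, X[i]) != y[i] for i in test_idx)
    return wrong / len(test_idx)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Return (observed test error, permutation-based p-value)."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train_idx, test_idx = idx[:half], idx[half:]
    observed = holdout_error(X, y, train_idx, test_idx)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any link between gene set and phenotype
        if holdout_error(X, y_perm, train_idx, test_idx) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic gene set: 5 genes, 30 samples per phenotype, shifted means for "case".
rng = random.Random(42)
X, y = [], []
for label, shift in (("control", 0.0), ("case", 2.0)):
    for _ in range(30):
        X.append([rng.gauss(shift, 1.0) for _ in range(5)])
        y.append(label)

observed, p_value = permutation_p_value(X, y)
print(f"observed test error = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value means that few label permutations produce a test error as low as the observed one, which is evidence that the gene set as a whole is associated with the phenotype.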
