  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world.
1

線性維度縮減應用質譜儀資料之研究 / A Study of Linear Dimension Reduction Applied to Mass Spectrometry Data

陳柏宇, Unknown Date
Advances in computing and the maturing of databases have increased the demand for processing large amounts of data, giving rise to bioinformatics, a field that combines biomedicine with statistics and information science. Data sets in this new discipline are characterized by huge numbers of observations and variables, but an excess of data often interferes with the screening of information and can even paralyze the analysis, so appropriate data reduction becomes necessary. Data reduction is commonly carried out through dimension reduction. The best-known linear dimension reduction method is principal component analysis (PCA), an unsupervised learning method; linear supervised learning methods include SIR (Sliced Inverse Regression), SAVE (Sliced Average Variance Estimate), and pHd (Principal Hessian Directions). PCA seeks a few dimensions that explain the variation of the explanatory variables, whereas the supervised methods SIR, SAVE, and pHd take the relationship between the explanatory variables and the response into account while reducing dimension, finding directions that explain the response. To address the high dimensionality of protein mass spectrometry data, this study applies these linear dimension reduction methods and then uses four classifiers, CART (Classification and Regression Tree), KNN (K-Nearest Neighbor), SVM (Support Vector Machine), and ANN (Artificial Neural Network), comparing the misclassification rates of the dimension reduction methods by cross validation. We find that among the four dimension reduction methods, PCA and SIR yield relatively stable and consistent misclassification rates across classifiers, while SAVE and pHd are less satisfactory. We also find that PCA and SIR behave differently under different classifiers: pairing the more accurate classifiers (SVM and ANN) with PCA, and the less accurate classifiers (CART and KNN) with SIR, gives better results. In addition, we experiment with meta analysis, combining several linear dimension reduction methods, and propose the Marginal Training Effect Method and the Meta Weighted Method. When the Marginal Training Effect Method can pick out effective dimensions, it improves the overall model under different classifiers, while the Meta Weighted Method keeps the accuracy of the classification model relatively stable across classifiers. We also propose the Overlap Correlation Method to address the problem of choosing the number of dimensions.
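The unsupervised reduction described in this abstract can be sketched with a minimal PCA via the singular value decomposition. The synthetic "spectra" below stand in for mass spectrometry intensities; the data, dimensions, and function names are illustrative assumptions, not the thesis's actual pipeline.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components.

    Returns (scores, components, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)                 # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)             # variance along each component
    return Xc @ Vt[:k].T, Vt[:k], var[:k] / var.sum()

# Toy example: 200 "spectra" with 500 intensity variables, where most
# variance is concentrated in three latent directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 500))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 500))

scores, comps, evr = pca_reduce(X, 3)
print(scores.shape)         # (200, 3): 500 variables reduced to 3
print(evr.sum())            # close to 1: little variance is lost
```

The reduced `scores` matrix, rather than the raw intensities, would then be fed to a classifier such as KNN or SVM, which is the pattern the abstract compares across methods.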
2

基於資訊理論熵之特徵選取 / Entropy based feature selection

許立農, Unknown Date
Feature selection is a common preprocessing technique in machine learning. Although a large pool of feature selection techniques exists, no single method dominates on all datasets. Given the variety of data formats today, establishing new methods can bring more insight into data, and applying techniques suited to the characteristics of each dataset is the better practice. In this study we use the concept of entropy from information theory to define the correlation between features according to how they cluster in a data cloud geometry tree (DCG-tree), and we select features accordingly. Each core cluster of the DCG-tree consists of rather uniform variables that share similar covariate information; with these core clusters we reduce the dimension of a high-dimensional dataset. We assess our method by comparing it with FCBF (which also builds on entropy), Lasso, F-score, random forest, and a genetic algorithm. Prediction performance is demonstrated on real-world datasets using hierarchical clustering with a majority-vote classifier. The results show that our entropy method yields more stable improvements in prediction accuracy across the different datasets, while the reduced dimension also remains relatively stable.
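The entropy-based notion of feature relatedness can be illustrated with symmetric uncertainty, the normalized mutual information that FCBF (mentioned above) is built on. This is a generic sketch on discretized features, not the thesis's DCG-tree construction; the toy sequences are invented for illustration.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    mi = hx + hy - hxy                      # mutual information I(X; Y)
    denom = hx + hy
    return 2 * mi / denom if denom > 0 else 0.0

a = [0, 0, 1, 1, 0, 1, 0, 1]
b = [1, 1, 0, 0, 1, 0, 1, 0]   # a deterministic function of a
c = [0, 1, 0, 1, 0, 1, 0, 1]   # only weakly related to a
print(symmetric_uncertainty(a, b))  # 1.0: perfectly dependent features
print(symmetric_uncertainty(a, c))  # much smaller
```

A pairwise SU matrix of this kind is one way to quantify which features "share similar covariate information" before grouping them into clusters.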
3

SIR、SAVE、SIR-II、pHd等四種維度縮減方法之比較探討 / A Comparative Study of Four Dimension Reduction Methods: SIR, SAVE, SIR-II, and pHd

方悟原, Fang, Wu-Yuan, Unknown Date
The focus of this study is dimension reduction and an overview of four methods frequently cited in the literature: SIR, SAVE, SIR-II, and pHd. The definitions of dimension reduction proposed by Li (1991), y = g(x, ε) = g1(βx, ε), and by Cook (1994), via the conditional density f(y | x) = f(y | βx), are briefly reviewed, along with Cook's (1994) discussion of the minimum dimension reduction subspace. In addition, we propose a possible definition, E(y | x) = E(y | βx), i.e. the conditional expectation of y remains the same in the original and reduced subspaces, which seems more appropriate where pHd is concerned; the subspace induced by this definition turns out to be contained in the subspace defined by Cook (1994). We then take a closer look at the ideas behind the four methods, supplementing explanations and proofs where necessary. Equivalent conditions under which each method locates the "right" directions are presented, and two models (y = bx + ε and y = |z| + ε) are used to test whether each method recovers the correct direction. Comparing the methods through these equivalent conditions, we find that when x is multivariate normal, none of the four methods retains directions that can be reduced, but directions that should be retained are not always preserved. SAVE preserves at least as many directions as any one of the other three used alone, and using SIR together with SIR-II is exactly equivalent to using SAVE. We also find that the prerequisite that E(y | x) be twice differentiable does not seem to be necessary when pHd is applied.
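Sliced inverse regression, the first of the four methods compared above, can be sketched in a few lines: standardize x, slice the data on y, and eigen-decompose the covariance of the slice means. The simulation follows the linear model y = bx + ε discussed in the abstract, with hypothetical parameters.

```python
import numpy as np

def sir_directions(X, y, n_slices=10):
    """Estimate SIR directions: eigenvectors of Cov(E[Z | y-slice]),
    mapped back to the scale of the original x."""
    n, p = X.shape
    mu, cov = X.mean(axis=0), np.cov(X.T)
    # Whiten: Z = (X - mu) @ cov^{-1/2}
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - mu) @ inv_sqrt
    # Slice observations by the order of y; average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for chunk in np.array_split(order, n_slices):
        m = Z[chunk].mean(axis=0)
        M += (len(chunk) / n) * np.outer(m, m)
    evals, evecs = np.linalg.eigh(M)            # ascending eigenvalues
    dirs = inv_sqrt @ evecs[:, ::-1]            # largest first, back-transformed
    return dirs / np.linalg.norm(dirs, axis=0), evals[::-1]

# Model y = b'x + noise with b = e1: SIR should recover direction e1.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + 0.2 * rng.normal(size=2000)
dirs, evals = sir_directions(X, y)
print(abs(dirs[0, 0]))   # close to 1: the first SIR direction aligns with e1
```

Note that for a symmetric model such as y = |z| + ε the slice means cancel and SIR fails, which is precisely why second-moment methods like SAVE and SIR-II are brought into the comparison.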
4

充分維度縮減於整體性檢定之應用 / Application of sufficient dimension reduction to global test

徐碩亨, Hsu, Shuo Heng, Unknown Date
As technology advances, the amount of data to be processed keeps growing, and dimension reduction helps improve efficiency in the analysis of massive data. This thesis introduces the sliced average variance estimation method of dimension reduction and applies it to the problem of testing global association. We consider the marginal dimension test within sliced average variance estimation and use permutation resampling to construct the null distribution of the test statistic, from which a permutation p-value is computed for statistical inference. This global association test can be applied to gene set analysis, to examine the degree of association between a specific gene set and a phenotype variable. Finally, we simulate the type I error rate and power of the test and compare it with previously proposed methods.
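The permutation construction of the null distribution described above can be sketched generically. The statistic below (the sum of squared marginal correlations between y and each predictor) is a simple stand-in for the SAVE marginal dimension test statistic, which is more involved; the data and function names are illustrative assumptions.

```python
import numpy as np

def global_assoc_pvalue(X, y, n_perm=999, rng=None):
    """Permutation p-value for 'is y associated with any column of X?'.

    Statistic: sum of squared sample correlations between y and each
    predictor. Permuting y breaks any association, giving the null."""
    rng = rng or np.random.default_rng()
    def stat(yy):
        yc = (yy - yy.mean()) / yy.std()
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        return np.sum((Xc.T @ yc / len(yy)) ** 2)
    observed = stat(y)
    null = [stat(rng.permutation(y)) for _ in range(n_perm)]
    # +1 correction so the permutation p-value is never exactly zero.
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))          # 8 hypothetical "gene" predictors
y = X[:, 0] + rng.normal(size=100)     # associated through column 0 only
pval = global_assoc_pvalue(X, y, rng=rng)
print(pval)                            # small: association is detected
```

Replacing `stat` with the SAVE-based marginal dimension test statistic, while keeping the same permutation scaffolding, would give the test studied in this thesis.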
5

維度縮減應用於蛋白質質譜儀資料 / Dimension Reduction on Protein Mass Spectrometry Data

黃靜文, Huang, Ching-Wen, Unknown Date
In this thesis we study a serum protein data set for prostate cancer acquired by the Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) technique, used to determine whether a subject has cancer. The subjects fall into four populations: normal, benign tumor, early-stage cancer, and late-stage cancer. The set includes both raw data, with about 48,000 interval variables, and preprocessed data, manually screened down to 779 variables; each has about 650 observations. Because of the high dimensionality, the data are difficult to analyze and costly to compute, so the goal of this study is to find efficient dimension reduction methods that minimize the misclassification rate. We first compare three classification methods: support vector machine, artificial neural network, and classification and regression tree; the better two, the support vector machine and the artificial neural network, are then applied to classify the dimension-reduced data. The dimension reduction methods considered are the discrete wavelet transform, principal component analysis, and principal component analysis networks. In our results, the discrete wavelet transform and principal component analysis perform well, while principal component analysis networks are barely satisfactory. Beyond assessing these methods individually, we also propose an overlap method that combines the linear reduction method, principal component analysis, with the nonlinear reduction method, principal component analysis networks, in the hope of further lowering the misclassification rate obtained with any single reduction method. The improvement from the overlap method is significant in the preprocessed data but not in the raw data.
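One level of the discrete wavelet transform used for reduction above can be sketched with the Haar wavelet: pairing adjacent intensities halves the dimension, while the detail coefficients keep the residual energy. This is a generic sketch, not tied to the thesis's actual wavelet or decomposition level.

```python
import numpy as np

def haar_level(x):
    """One level of the Haar DWT: returns (approximation, detail),
    each half the length of x (x must have even length)."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = pairs.sum(axis=1) / np.sqrt(2)            # low-pass / smooth part
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high-pass / detail part
    return approx, detail

# Keeping only the approximation halves the dimension per level,
# e.g. 48,000 spectrum bins -> 24,000 -> 12,000 -> ...
signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0])
a, d = haar_level(signal)
print(len(a))   # 4: half the original length
# The Haar transform is orthogonal, so energy is preserved exactly:
print(np.allclose((a ** 2).sum() + (d ** 2).sum(), (signal ** 2).sum()))
```

Because spectra are smooth at fine scales, discarding the detail coefficients loses little classification-relevant information, which is why the wavelet route competes well with PCA here.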
6

重疊法應用於蛋白質質譜儀資料 / Overlap Technique on Protein Mass Spectrometry Data

徐竣建, Hsu, Chun-Chien, Unknown Date
Cancer has been the number one cause of death in Taiwan for the past 24 years. Because patients treated in the early stages of cancer have higher survival rates, early detection, diagnosis, and treatment can reduce mortality. The database adopted in this study consists of protein mass spectrometry data acquired by the Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) technique, including two high-dimensional data sets: one for prostate cancer and one for head/neck cancer. Because protein mass spectrometry data contain a very large number of variables, storage and computation are burdensome; the purpose of this thesis is therefore to find an analysis method that reduces the dimension while minimizing the misclassification rate, in the hope of improving the accuracy of cancer classification. The study is divided into an experimental group and a control group. In the experimental group, principal component analysis (PCA) is used for dimension reduction, a support vector machine (SVM) is then used for classification, and finally the overlap method is applied to improve the classification; in the control group, the SVM is applied directly for classification. The empirical results indicate that the improvement from the overlap method is significant for the prostate cancer data but not for the head/neck cancer data. We also examine the mass range of the spectrometry data, to check whether the range suggested by experts is consistent with our analysis. In the raw prostate cancer data, important information seems to remain hidden outside the mass range suggested by the experts, whereas in the raw head/neck cancer data the region outside the suggested range does not contribute substantively to the analysis.
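The experimental/control comparison above rests on estimating misclassification rates. A k-fold cross-validation loop around a simple nearest-centroid classifier (standing in for the SVM, which would require an external library) sketches how such pipelines are scored; the data and parameters are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class with the closest mean vector."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def cv_error(X, y, k=5, rng=None):
    """k-fold cross-validated misclassification rate."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(y))
    errors = 0
    for fold in np.array_split(idx, k):            # held-out indices per fold
        train = np.setdiff1d(idx, fold)
        pred = nearest_centroid_predict(X[train], y[train], X[fold])
        errors += (pred != y[fold]).sum()
    return errors / len(y)

# Two well-separated synthetic classes: the CV error should be near zero.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(4, 1, (100, 10))])
y = np.repeat([0, 1], 100)
err = cv_error(X, y, rng=rng)
print(err)   # near 0.0
```

In the thesis's setting, the experimental pipeline would insert a PCA step before the classifier inside each training fold, and the two groups would be compared on this cross-validated error.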
