1 |
台股股利完全填權息關鍵影響因素之研究 / The key influencing factors of Taiwan stock price successfully remaining previous price after dividend payment
陳人豪, Chen, Jen Hao. Unknown Date (has links)
This study takes the constituent stocks of the Taiwan 50 and Taiwan Mid-Cap 100 indices as its subjects and applies data-mining feature selection to identify the key factors that determine whether a stock achieves full ex-dividend price recovery, that is, whether its price returns to the pre-ex-dividend (or pre-ex-rights) level after the dividend is paid. These key factors are then used to build a model that predicts full recovery, and the findings are compared with those of earlier research.
The model is built in five stages. (1) Defining full recovery: using historical prices and dividend information retrieved from the TEJ database, we compute the prices before and after the ex-dividend date and label each stock as fully recovered or not fully recovered. (2) Collecting candidate factors: drawing on factors that earlier literature found to affect short-term excess returns around ex-dividend recovery, and on fundamental factors that affect stock prices, we collect dividend-related indicators and the public financial statement data used in fundamental analysis. (3) Feature selection analysis: Sequential Forward Selection (SFS) combined with a classification algorithm is applied to all candidate factors to find the key ones. (4) Building the prediction model: based on the feature selection results, we use the Weka software to train support vector machine and decision tree classification models. (5) Comparing and analyzing model accuracy: the resulting model can help dividend-oriented, buy-and-hold investors identify stocks that pay high dividends without a loss in share price, and it serves as a reference for stock selection.
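Stage (3) can be illustrated with a minimal sketch of Sequential Forward Selection. The abstract does not say which classifier or scoring scheme was paired with SFS, so the nearest-centroid classifier, the use of training-set accuracy, the toy data, and every function name below are assumptions made only for illustration.

```python
def centroid_accuracy(X, y, features):
    """Training-set accuracy of a nearest-centroid classifier
    restricted to the given feature indices (illustrative scorer)."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, label in zip(X, y):
        def dist(c):
            return sum((x[f] - m) ** 2 for f, m in zip(features, centroids[c]))
        correct += min(classes, key=dist) == label
    return correct / len(y)


def sequential_forward_selection(n_features, score, max_features=None):
    """Greedily add the feature that most improves score(subset);
    stop as soon as no remaining feature gives a strict improvement."""
    selected, best_score = [], float("-inf")
    remaining = list(range(n_features))
    limit = max_features or n_features
    while remaining and len(selected) < limit:
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best_score:
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


# Hypothetical toy data: feature 0 separates the classes, 1 and 2 do not.
X = [[1.0, 5.0, 0.2], [1.2, 1.0, 0.1], [0.9, 3.0, 0.3],
     [3.0, 4.0, 0.2], [3.1, 2.0, 0.1], [2.9, 0.0, 0.3]]
y = [0, 0, 0, 1, 1, 1]

selected, best = sequential_forward_selection(
    3, lambda fs: centroid_accuracy(X, y, fs))
print(selected, best)
```

In practice the score would normally be a cross-validated accuracy rather than training accuracy, since scoring on the training set favors overfitted subsets.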
|
2 |
使用Meta-Learning在蛋白質質譜資料特徵選取之探討 / Feature Selection via Meta-Learning on Proteomic Mass Spectrum Data
陳詩佳. Unknown Date (has links)
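Cancer ranks first among the ten leading causes of death nationally. Because patients treated at an early stage have a higher survival rate, early detection, diagnosis, and treatment can lower mortality. This study analyzes preprocessed prostate cancer protein mass spectra collected by Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS). The aim is to combine classifiers through meta-learning and to select features stepwise, so that a small set of representative features classifies the data with high accuracy. Classification accuracy determines the order in which variables enter the stepwise selection; Elastic Net and the coefficient of determination are further used to rank the features, which mitigates strong collinearity among the variables. Two ways of combining classifiers are considered: voting (majority voting and weighted voting) and cascading (cascading several different classifiers, or cascading a single classifier). The results show that ordering the candidate features by the coefficient of determination and cascading Support Vector Machines (SVM) performs well in every classification task and is the better feature selection approach.
Keywords: feature selection, cascading, proteomic mass spectrum, meta-learning, support vector machine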
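The two voting schemes named in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name, the tie-breaking rule, and the example weights are assumptions, and cascading is not shown.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine the class labels predicted by several classifiers.

    With weights=None every classifier counts equally (majority voting);
    otherwise each vote is scaled by its weight, for example the
    classifier's validation accuracy (an assumed choice of weight).
    Ties go to the label that sorts first.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(sorted(totals), key=totals.get)


votes = ["cancer", "normal", "normal"]
print(weighted_vote(votes))                   # majority voting
print(weighted_vote(votes, [0.9, 0.3, 0.3]))  # weighted voting
```

Weighted voting lets one reliable classifier outweigh several weak ones, which is why the two calls above can disagree on the same set of votes.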
|
3 |
基於資訊理論熵之特徵選取 / Entropy based feature selection
許立農. Unknown Date (has links)
Feature selection is a common preprocessing technique in machine learning. Many feature selection algorithms exist, but none outperforms all the others on every dataset. Because data today come in many forms, developing new methods can reveal more about the data, and choosing a selection algorithm that suits the characteristics of the data is the better practice.
In this study, we used the concept of entropy from information theory to build a similarity matrix between features, and we constructed a data cloud geometry tree (DCG-tree) to separate the variables into clusters. Each core cluster consists of fairly uniform variables that share similar covariate information. Using the core clusters, we reduced the dimension of high-dimensional datasets. We assessed the method by comparing it with FCBF (which is also entropy-based), Lasso, F-score, random forest, and a genetic algorithm. Prediction performance was evaluated on real-world datasets, using hierarchical clustering with majority voting as the classifier. The results show that the entropy method improves prediction accuracy more stably across datasets, and that the amount of dimension reduction it achieves is also relatively stable.
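An entropy-based similarity between two discrete features can be sketched as below. The abstract does not give the thesis's own similarity formula, so this example swaps in symmetric uncertainty, the normalized mutual information used by FCBF, one of the comparison methods; the function names and toy vectors are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1].

    0 means the two features share no information; 1 means each one
    fully determines the other.
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:  # both features are constant
        return 0.0
    mutual_information = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual_information / (hx + hy)


a = [0, 0, 1, 1]
b = [1, 1, 0, 0]  # a relabeled copy of a
c = [0, 1, 0, 1]  # independent of a
print(symmetric_uncertainty(a, b), symmetric_uncertainty(a, c))
```

A matrix of such pairwise values is the kind of input a clustering step could use to group redundant features; continuous features would first need to be discretized.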
|
4 |
使用AUC特徵選取方法在蛋白質質譜儀資料分類之應用 / An AUC criterion for feature selection on classifying proteomic spectra data
葉勝宗. Unknown Date (has links)
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) produces high-dimensional protein mass spectra and is used mainly to detect the expression of protein molecules. Because of the limitations of the SELDI technique, the measured spectra often contain errors and noise, so the raw data usually undergo low-level preprocessing before analysis: baseline subtraction, normalization, peak detection, and peak alignment. The prostate cancer data examined in this thesis fall into four classes: healthy men, benign prostate hyperplasia, early-stage prostate cancer, and late-stage prostate cancer. We analyzed and compared two preprocessed versions of the data, one that we preprocessed ourselves and one preprocessed by Adam et al. To address the location shifts and isotope problems that often arise when SELDI measures molecular mass, we propose using mass-to-charge (m/z) segments as new features. The area under the ROC curve (AUC) serves as the criterion for selecting important m/z segments, and a support vector machine (SVM) is used for classification. In the four-class problem, our own preprocessed data gave 89% accuracy on the training data and 63% on the test data, whereas the data preprocessed by Adam et al. gave 94% on the training data and 86% on the test data.
These results indicate that the preprocessing method does affect the classification result, and they confirm that analysis based on feature segments is feasible.
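The AUC criterion can be illustrated with a short sketch that scores each feature by how well it alone separates two classes. The thesis works with four classes and m/z segments; the two-class simplification, the per-column scoring, the function names, and the toy numbers here are all assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve for one feature: the probability that a
    random positive sample has a higher value than a random negative
    sample, counting ties as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features_by_auc(X, labels):
    """Order feature indices from most to least discriminative.

    An AUC near 0 separates the classes as well as one near 1 (the
    direction is simply reversed), so the strength is max(AUC, 1 - AUC).
    """
    strength = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], labels)
        strength.append((max(a, 1 - a), j))
    return [j for _, j in sorted(strength, key=lambda t: (-t[0], t[1]))]


# Hypothetical intensities for two features over four spectra.
X = [[0.10, 5.0], [0.40, 1.0], [0.35, 3.0], [0.80, 2.0]]
labels = [0, 0, 1, 1]
print(auc([row[0] for row in X], labels))
print(rank_features_by_auc(X, labels))
```

For more than two classes, one common extension is to average the pairwise or one-versus-rest AUC values; the abstract does not say which variant was used.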
|
5 |
兩階段特徵選取法在蛋白質質譜儀資料之應用 / A Two-Stage Approach of Feature Selection on Proteomic Spectra Data
王健源, Wang,Chien-yuan. Unknown Date (has links)
Early detection and diagnosis can effectively reduce cancer mortality, so discovering biomarkers associated with cancerous change is an important task. In this study, a real set of proteomic spectra from prostate cancer patients and healthy individuals was analyzed. The data were collected in a protein-chip experiment using Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (SELDI-TOF MS), a technology that effectively captures the protein features of a biological sample. Without suitable preprocessing steps to remove experimental noise, a single mass spectrum may contain hundreds or even thousands of peaks. To narrow the search for possible protein biomarkers, only features that can distinguish cancer patients from healthy individuals are considered.
The Genetic Algorithm (GA) is a global optimization procedure modeled on the genetic evolution of biological organisms, and it searches complex high-dimensional spaces effectively. In this study, we use a GA-Like algorithm (GAL) to select features from proteomic spectra data in order to distinguish prostate cancer patients from healthy individuals. In addition, we propose two types of Two-Stage GAL algorithm (TSGAL) that attempt to remedy the shortcomings of GAL.
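A generic genetic search over feature subsets can be sketched as below. This is a plain textbook GA, not the thesis's GAL or TSGAL, whose details the abstract does not give; the truncation selection, one-point crossover, elitism, parameter values, and the toy fitness function are all assumptions.

```python
import random


def genetic_feature_search(n_features, fitness, pop_size=20,
                           generations=30, mutation_rate=0.05, seed=0):
    """Evolve boolean masks (True = feature kept) toward higher fitness.

    Requires n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0][:]]           # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with a small probability (mutation).
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


def toy_fitness(mask):
    """Hypothetical fitness: reward the informative features 0 and 3,
    penalize the size of the subset."""
    return sum(mask[i] for i in (0, 3)) - 0.1 * sum(mask)


best_mask = genetic_feature_search(6, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a real application the fitness would be a classifier's cross-validated accuracy on the selected peaks, which makes each generation far more expensive than in this toy setting.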
|
6 |
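對於高維度資料進行特徵選取-應用於分類蛋白質質譜儀資料 [Feature Selection for High-Dimensional Data: An Application to Classifying Proteomic Mass Spectrometry Data]
黃仁澤. Unknown Date (has links)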
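Traditional tumor-marker screening often has limited sensitivity, availability, and specificity, so it cannot deliver accurate and timely diagnoses. Current cancer research uses proteomics to observe, through spectra and images, how protein expression changes across the stages of cancer, in the hope of developing better diagnostic tools. This study focuses on two sets of protein mass spectra from prostate cancer patients, collected with protein chips and Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS). Our aim is to select, from a large number of protein features, a subset of features that helps classification. We propose minimum misclassification rate feature selection and minimum p-value feature selection ( test, Kruskal-Wallis test) to rank and select features by their discriminative power, and we further develop k-means extraction, maximum correlation coefficient extraction, and coefficient of determination extraction to alleviate severe collinearity among the variables. We use the Support Vector Machine to classify the data and evaluate the results, extracting the discriminative protein features for each classification task in order to determine the best feature set. The results show that minimum misclassification rate feature selection and minimum p-value feature selection, each combined with coefficient of determination extraction, perform well in every classification task and are the better feature selection approaches.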
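The idea of using the coefficient of determination to reduce collinearity can be sketched as a greedy filter over an already ranked feature list. The abstract does not describe the actual extraction rule, so the pairwise R-squared (the squared Pearson correlation), the 0.9 threshold, and the function names below are assumptions.

```python
def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y
    on x, which equals the squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # a constant column explains nothing
        return 0.0
    return sxy * sxy / (sxx * syy)


def filter_collinear(columns, ranked, threshold=0.9):
    """Walk the features in ranked order and keep one only if its
    R-squared with every feature already kept is below the threshold."""
    kept = []
    for j in ranked:
        if all(r_squared(columns[j], columns[k]) < threshold for k in kept):
            kept.append(j)
    return kept


# Hypothetical feature columns: column 1 is exactly twice column 0.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 0, 1, 0]]
print(filter_collinear(columns, ranked=[0, 1, 2]))
```

Because the list is walked in rank order, the stronger member of each collinear group survives and its near-duplicates are dropped.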
|