  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

監督式學習與R 語言 / An overview of supervised learning with R

趙蘭益 Unknown Date (has links)
This study compares the strengths and weaknesses of the main supervised learning methods, illustrates each with the statistical software R, and compares their classification accuracy, with the aim of finding the most suitable classifier for different types of data. The greatest advantage of the decision tree is that the classification process is transparent, so users can understand and interpret it. The random forest remedies the decision tree's relatively low predictive accuracy and can also be used to reduce data dimensionality. The k-nearest-neighbors algorithm is simple and easy to understand; the naive Bayes classifier can predict accurately even with a small amount of training data; logistic regression is built on solid mathematical theory and is one of the traditional, widely used classification methods, distinguished by predicting a class-probability value for every observation. Artificial neural networks have strongly influenced the development of artificial intelligence: they learn with high precision, recall quickly, and can output both continuous and discrete values. SVMs show particular advantages in small-sample, nonlinear, and high-dimensional pattern-recognition problems and have been applied to practical tasks such as handwriting recognition, 3-D object recognition, face recognition, and text and image classification. For continuous or mixed-type data, the k-nearest-neighbors method is recommended, as it yields better predictive accuracy; for categorical data, random forests and support vector machines are recommended.
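The k-nearest-neighbors method recommended above for continuous data can be sketched in a few lines; the toy data and function names below are illustrative, not from the thesis:

```python
from collections import Counter
import math

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), as in the k-nearest-neighbors method."""
    dists = sorted((math.dist(p, x), y) for p, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy continuous data: two well-separated clusters.
train = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.1, 0.1)))  # → a
```

Because only distances are computed, the method applies directly to continuous or mixed data once the variables are put on comparable scales.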
2

數據幾何特徵的機器學習 / A study of Data Geometry-based Learning

劉憲忠, Liu, Hsien Chung Unknown Date (has links)
This study focuses on the geometric patterns of data in order to understand the relationships among variables, and examines whether weighting the distance matrix with the coefficients obtained from a fitted statistical model can effectively improve accuracy. The study mainly uses data-cloud geometry (DCG) trees, cosine similarity, and a sampling-based majority-voting method to predict class labels, and compares the classification results with hierarchical clustering, support vector machines, and a hybrid method on three different datasets. Two of them, an animal-behavior assessment dataset and the Wisconsin diagnostic breast-cancer dataset, are classified with supervised learning; the third, a simulated two-moons dataset, is classified with semi-supervised learning to predict labels for new data. Finally, the strengths and weaknesses of each method and the reasons behind them are discussed and summarized, showing that different data geometries do require trying different formulas and algorithms to achieve good machine-learning results. / The study focuses on data-geometry-based learning to discover the interdependence patterns among covariate vectors. In order to discover these patterns and improve classification accuracy, the distance functions are modified to better capture the geometric patterns and to measure the association between variables. The performance of the proposed learning rule is compared with other machine-learning techniques on three datasets. In the end, I demonstrate why the concept of geometric patterns is essential.
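The two distance notions the study combines, cosine similarity and a coefficient-weighted distance, might be sketched as follows (the weights and vectors are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def weighted_distance(u, v, w):
    """Euclidean distance with per-variable weights, e.g. coefficients
    taken from a fitted statistical model, so that variables the model
    deems important count more in the distance matrix."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, u, v)))

u, v = [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]
print(round(cosine_similarity(u, v), 4))           # ≈ 0.8165
print(weighted_distance(u, v, [1.0, 4.0, 1.0]))    # → 2.0
```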
3

基於資料科學方法之巨量蛋白質功能預測 / Applying Data Science to High-throughput Protein Function Prediction

劉義瑋, Liu, Yi-Wei Unknown Date (has links)
Since the completion of the Human Genome Project and the advent of next-generation sequencing, biological data have grown explosively, and protein sequences are among the gene products being discovered in large numbers. Experimentally assaying and annotating protein function is extremely time-consuming, however, so there are many proteins whose sequences are known but whose functions are not. Using computers to predict likely functions before experiments helps biologists prioritize function assays for different proteins and thus speeds up protein function annotation. The Gene Ontology (GO) is a widely used scheme for describing the functions and properties of gene products; it is divided into three branches, biological process, cellular component, and molecular function, each of which is a hierarchical tree composed of GO terms. Protein function prediction assigns GO terms to a protein from its sequence and can therefore be treated as a multi-label machine-learning classification problem. We propose a sequence-homology-based machine-learning prediction framework that can also incorporate protein-family information, and we design several voting methods to resolve the multi-label prediction problem. / Biological data have grown explosively with the completion of the Human Genome Project and the advent of next-generation sequencing. Annotating protein function with wet-lab experiments is time-consuming, so many proteins' functions are still unknown. Fortunately, computational function prediction can help wet labs formulate biological hypotheses and prioritize experiments. The Gene Ontology (GO) is the framework for unifying the representation of gene function, classifying functions into three domains: the Biological Process, Cellular Component, and Molecular Function ontologies. Each domain is a hierarchical tree composed of labels known as GO terms. Protein function prediction can thus be considered a multi-label classification problem: given a protein sequence, predict its GO terms. We propose a machine-learning framework that predicts protein function from homologous sequence structure, which is believed to carry protein-family information, and design various voting mechanisms to resolve the multi-label prediction problem.
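The homology-voting idea — collecting the GO terms of a query's homologs and keeping those that win enough votes — can be sketched like this (the homolog annotations and the threshold rule are illustrative assumptions, not the thesis's exact mechanism):

```python
from collections import Counter

def predict_go_terms(homolog_annotations, threshold=0.5):
    """Given the GO-term sets of a query protein's homologs, keep each
    term annotated in at least `threshold` of the homologs — a simple
    vote that naturally handles the multi-label setting."""
    n = len(homolog_annotations)
    votes = Counter(t for terms in homolog_annotations for t in set(terms))
    return {t for t, c in votes.items() if c / n >= threshold}

# Hypothetical homologs of a query protein and their GO annotations.
homologs = [
    {"GO:0003677", "GO:0005634"},
    {"GO:0003677", "GO:0005634", "GO:0046872"},
    {"GO:0003677"},
]
print(sorted(predict_go_terms(homologs)))  # → ['GO:0003677', 'GO:0005634']
```

Raising the threshold trades recall for precision, which is one axis along which different voting mechanisms can be designed.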
4

基於資訊理論熵之特徵選取 / Entropy based feature selection

許立農 Unknown Date (has links)
Feature selection is a common data-preprocessing step in machine learning. Many feature-selection algorithms already exist, but no single algorithm outperforms all others on every dataset, and given the wide variety of data today, developing new methods that bring more information about the data, and choosing a selection algorithm according to the data's characteristics, is the better practice. This study uses the information-theoretic concept of entropy to define correlations between variables according to the clusters produced by a data-cloud geometry (DCG) tree over the variables, selects features accordingly, and compares the result with FCBF (which also builds on entropy), Lasso, F-score, random forests, and genetic algorithms. Hierarchical clustering with majority voting is applied to real datasets to assess predictive accuracy. The results show that the proposed entropy method delivers more stable improvements in predictive accuracy across the different datasets, while the degree of dimensionality reduction is also relatively stable. / Feature selection is a common preprocessing technique in machine learning. Although a large pool of feature-selection techniques exists, no single method dominates on all datasets. Because of the complexity of data formats, establishing a new method can bring more insight into the data, and applying the proper technique to each dataset is the best choice. In this study, we use the concept of entropy from information theory to build a similarity matrix between features. We then construct a DCG-tree to separate the variables into clusters. Each core cluster consists of rather uniform variables that share similar covariate information. With the core clusters, we reduce the dimension of a high-dimensional dataset. We assess our method by comparing it with FCBF, Lasso, F-score, random forest, and a genetic algorithm. Prediction performance is demonstrated on real-world datasets using hierarchical clustering with a voting algorithm as the classifier. The results show that our entropy method yields more stable prediction performance while substantially reducing the dimensionality of the datasets.
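The entropy-based correlation between two discrete features can be illustrated with symmetrical uncertainty, the measure FCBF builds on; this is a generic sketch of the idea, not the thesis's exact DCG-tree formulation:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) of a discrete sample, in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)).
    0 means the features look independent in the sample; 1 means one
    determines the other — a natural entry for a feature-similarity
    matrix."""
    hx, hy = entropy(xs), entropy(ys)
    hxy = entropy(list(zip(xs, ys)))
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy else 0.0

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, [0, 0, 1, 1]))  # identical features → 1.0
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # unrelated here → 0.0
```

A matrix of such pairwise values can then be clustered, with redundant features (high SU with each other) grouped and pruned.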
5

貓狗影像辨識之特徵萃取 / Feature extraction in dogs and cats image recognition

鍾立強, Chung, Li Chiang Unknown Date (has links)
In recent years many websites that require high security have used distorted strings of letters or digits as CAPTCHAs to protect the site or system from massive brute-force attacks. In 2007 Microsoft proposed Asirra, a new CAPTCHA system based on images of cats and dogs; for a computer, recognizing cat and dog images is harder than recognizing character strings. This study attempts to build an automatic cat/dog image classifier for the Asirra image data in order to assess the effectiveness of this CAPTCHA system. Images are known to contain a great deal of noise; using the raw data makes computation difficult and recognition poor, so extracting the key features is an important research problem. This thesis considers Histograms of Oriented Gradients (HOG) and Principal Components Analysis (PCA) to screen important variables, and uses the selected features to build a Support Vector Machine (SVM) classifier. In the empirical analysis we find that combining the two feature-extraction methods not only greatly reduces computation time but also achieves good predictive accuracy. / In recent years, many websites that require a high standard of security have used CAPTCHAs to avoid massive brute-force attacks from hackers. A CAPTCHA uses strings of twisted and deformed letters or numbers as an identification code. In 2007, Microsoft proposed Asirra, a new image-based recognition system that uses images of dogs and cats as the identification code. Recognizing dog and cat images is no harder than recognizing strings of letters or numbers for humans, but it is more challenging for computers. In this thesis, we aim to develop a classification method for images from Asirra. An image is represented by an enormous number of pixels, of which only a few carry important feature information; most are noise. The abundance of noise leads to computational inefficiency and, worse, may result in inaccurate recognition. Feature extraction is therefore an essential step before constructing a classifier. We consider HOG (Histograms of Oriented Gradients) and PCA (Principal Components Analysis) to select important features, and use the features to construct an SVM (Support Vector Machine) classifier. In a real example, we find that combining the two feature-extraction methods dramatically reduces computational time while achieving satisfactory predictive accuracy.
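The core of the HOG descriptor — a histogram of gradient orientations weighted by gradient magnitude — can be sketched on a tiny grayscale image; this is a simplified single-cell version without the block normalization a full HOG pipeline would add:

```python
import math

def orientation_histogram(img, bins=9):
    """Histogram of gradient orientations for a grayscale image given as
    a list of rows — the building block of HOG descriptors. Gradients
    use central differences; orientations are binned over [0, 180)."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * bins) % bins] += mag
    return hist

# A tiny image with a vertical edge: every gradient points horizontally.
img = [[0, 0, 9, 9]] * 4
hist = orientation_histogram(img)
print(hist.index(max(hist)))  # → 0 (the 0°, horizontal-gradient bin)
```

Concatenating such histograms over a grid of cells yields the long HOG feature vector, which PCA can then compress before the SVM is trained.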
6

利用動態訊號資料庫以減少測量數之無線網路定位系統 / Reducing Calibration Effort for WLAN Locating System with Dynamic Radio Map

簡盧德, Chien Lu,Te Unknown Date (has links)
With the rise of wireless networks, many related research topics have emerged, and locating and tracking users over wireless networks (WLAN) is among the most popular. After years of development, the room for improving indoor WLAN positioning error has nearly been exhausted, mainly because of limits imposed by the physics of radio-signal propagation. However, most positioning systems with good accuracy are built on an impractical amount of manual effort, so we focus on reducing the labor spent collecting large amounts of signal data while maintaining good accuracy; the manpower consumed in obtaining AP positions is also part of our consideration. We therefore propose a new positioning system: we first establish a small number of calibration points, then complete an initial radio map by inferring the positions of access points and by interpolation. While localizing, the system collects the consecutive signal strengths received by the user and updates the radio map with a model built on a hidden Markov chain together with other algorithms. Experimental results show that, compared with two other positioning systems, ours reduces the calibration effort the most while achieving competitive positioning accuracy. In addition, we analyze the positioning results the system can deliver with an old radio map and in different experimental environments. / Following the rise of wireless LAN networks, many related research issues have appeared. Tracking and locating mobile users in RF-based WLANs (IEEE 802.11) is an important issue for location-based applications. The error distance of indoor WLAN positioning decreased to approximately 1.5 meters in recent years; however, further improvement in accuracy is limited by the nature of radio propagation. Much research achieving precise accuracy was based on the impractical effort of collecting large amounts of signal data, usually called "calibration" in this area. This thesis therefore focuses on reducing the calibration effort without losing too much accuracy. Confirming the locations of access points is another kind of calibration effort we consider. Consequently, we propose a new positioning system: we first calibrate a few points and complete the radio map by inferring AP positions and by interpolation. During the location-estimation phase, the radio map is updated dynamically by a learning mechanism modeled with an HMM and other algorithms. The experimental results show that our system maintains comparable accuracy while requiring much less calibration effort than two other positioning systems. Besides, we analyze the performance of our system with an older radio map and in two different experimental environments.
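The HMM the system relies on can be illustrated with a tiny Viterbi decoder over hypothetical rooms and coarse signal-strength readings; all states, probabilities, and observations below are invented for the sketch:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state (location) sequence for an observation sequence
    under a hidden Markov model — the decoding a WLAN positioning system
    can apply to consecutive signal-strength readings."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Two hypothetical rooms and coarse "strong"/"weak" RSSI readings.
states = ("roomA", "roomB")
start_p = {"roomA": 0.5, "roomB": 0.5}
trans_p = {"roomA": {"roomA": 0.8, "roomB": 0.2},
           "roomB": {"roomA": 0.2, "roomB": 0.8}}
emit_p = {"roomA": {"strong": 0.9, "weak": 0.1},
          "roomB": {"strong": 0.2, "weak": 0.8}}
print(viterbi(["strong", "strong", "weak"], states, start_p, trans_p, emit_p))
```

The transition probabilities encode that a user rarely jumps between rooms, which is what lets consecutive readings correct one another and, over time, refresh the radio map.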
7

財報文字分析之句子風險程度偵測研究 / Risk-related Sentence Detection in Financial Reports

柳育彣, Liu, Yu-Wen Unknown Date (has links)
This thesis applies textual sentiment-analysis techniques to sentence-level risk assessment of the financial reports of U.S. listed companies. Most previous financial-report text analysis has focused on word-level risk detection, yet financial vocabulary is highly dependent on its context, and a single word read in isolation may not fully convey the underlying financial message. We therefore raise the level of analysis from words to sentences: using two embedding-based sentence-representation models, fastText and Siamese CBOW, which represent a target word through its associations with surrounding words, we extract deeper financial meaning from report sentences and learn sentence embeddings better suited to financial text analysis. For validation, we train financial-risk classifiers on 10-K filings together with the financially labeled dataset proposed in this thesis, using the traditional bag-of-words model as a baseline and comparing by measures such as accuracy and precision. The results confirm that the embedding-based representations predict financial risk more accurately than the bag-of-words model. With the arrival of the big-data era, the volume of information on the web has grown sharply, and analyzing massive financial information with limited manpower in a short time has become difficult, so helping professionals make efficient financial judgments and decisions is an important issue. To this end, the thesis also presents RiskFinder, a sentence-level risk-detection system for financial reports: after learning risk-sentence classifiers from 10-K filings and the manually labeled dataset with the fastText and Siamese CBOW models, it automatically predicts sentence-level risk in the financial reports of U.S. listed companies from 1996 to 2013, letting financial professionals efficiently extract meaningful information from large volumes of financial text. The system also dynamically presents stock-trading information and metadata keyed to each company's report release date, so users can relate textual and numerical financial data along the stock-price timeline. / The main purpose of this thesis is to evaluate the risk of the financial reports of listed companies at the sentence level. Most past sentiment-analysis studies focused on word-level risk detection. However, most financial keywords are highly context-sensitive, which can yield biased results. To advance the understanding of financial textual information, this thesis therefore broadens the analysis from the word level to the sentence level. We use two sentence-level models, fastText and Siamese CBOW, to learn sentence embeddings and facilitate financial risk detection. In our experiments, we use the 10-K corpus and a financial sentiment dataset labeled by financial professionals to train our financial risk classifier. We adopt the bag-of-words model as a baseline and use accuracy, precision, recall, and F1-score to evaluate the performance of financial risk prediction. The experimental results show that the embedding models lead to better performance than the bag-of-words model. In addition, this thesis proposes RiskFinder, a web-based financial risk detection system built on the fastText and Siamese CBOW models. The system contains 40,708 financial reports in total, and each risk-related sentence is highlighted according to the different sentence-embedding models. It also provides metadata and a visualization of financial time-series data for the corresponding company according to the release date of each report. The system considerably facilitates case studies in the field of finance and is of great help in capturing valuable insight within large amounts of textual information.
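A Siamese-CBOW-style sentence representation averages the word vectors of a sentence; here is a minimal sketch with made-up two-dimensional vectors (real models learn high-dimensional embeddings from the corpus):

```python
def sentence_embedding(sentence, word_vectors):
    """Average the word vectors of a sentence — the simple composition
    that Siamese CBOW trains its word embeddings to be good at."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 2-d vectors: risk-laden words point one way, neutral words another.
word_vectors = {
    "litigation": [1.0, 0.0], "risk": [0.9, 0.1],
    "revenue": [0.0, 1.0], "grew": [0.1, 0.9],
}
risky = sentence_embedding("litigation risk", word_vectors)
neutral = sentence_embedding("revenue grew", word_vectors)
print(risky, neutral)
```

A risk classifier then operates on these sentence vectors instead of bag-of-words counts, which is what lets context-sensitive wording be captured.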
8

機器學習與房地產估價 / Machine learning and appraisal of real estate

蔡育展, Tsai, Yu Chang Unknown Date (has links)
In recent years real-estate investment and trading have been widespread, and real estate remains one of the directions people invest in. Artificial neural networks, which belong to the field of artificial intelligence, have the ability to learn and can generalize to produce the desired estimates; they are also well suited to nonlinear problems. Earlier machine-learning models based on neural networks, however, performed all computation on the CPU and spent large amounts of time on training when the computational load was heavy, while the rise of the GPU has accelerated machine learning. This study uses a resistant learning procedure together with the envelope-module concept to build a neural-network system, implemented with GPU hardware and the machine-learning toolkit TensorFlow. Using 1,276 residential transaction records from Taipei City in 2015, 60% of the data are randomly selected as training examples, and the network is trained twice, once assuming that 5% of the data may be outliers and once assuming none, with 11 variables that influence housing prices as inputs. The empirical results show that the neural network's speed improves significantly; the run assuming 5% outliers predicts better; and, after grouping the data by price, the network predicts mid-priced and higher properties better. In practical terms, since neural networks suit nonlinear problems, the approach can serve as a reference for future real-estate appraisal support systems. / Real-estate investment and transactions have prevailed in recent years, and real estate is still one of the choices for people to invest in. The neural network, which belongs to the field of artificial intelligence, has the ability to learn, can generalize to reach a target estimate, and is suitable for nonlinear problems. However, earlier machine-learning models based on neural networks used the CPU for computation and spent a lot of time on training when the computation was large; the rise of the GPU speeds up machine learning. This study implements a resistant learning procedure with the envelope-bulk concept to build a neural-network system, using TensorFlow and a graphics processing unit (GPU) to accelerate the original system. From the 2015 real-estate transaction data of Taipei City, 1,276 records are used; 60% of the data are picked at random as training data for two experiments, one assuming that 5% of the data are outliers and one assuming none, and 11 variables that may affect real-estate value are selected as inputs. The experimental results show that training of the neural network speeds up considerably, and the experiment assuming 5% outliers predicts better than the other. We also obtain better predictions for the higher-priced part of the data after dividing it into six groups by price. Since neural networks are good at solving nonlinear problems, the system can serve as a reference for future real-estate appraisal support systems.
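The resistant-learning idea of allowing about 5% of observations to be outliers can be illustrated with a simple least-squares fit that refits after discarding the largest residuals; the data and the trimming rule below are a crude stand-in for the thesis's envelope-module procedure, not its actual algorithm:

```python
def fit_line(pts):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    a = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return a, my - a * mx

def resistant_fit(pts, outlier_frac=0.05):
    """Fit, drop the outlier_frac of points with the largest residuals,
    then refit — so gross outliers cannot drag the model."""
    a, b = fit_line(pts)
    keep = sorted(pts, key=lambda p: abs(p[1] - (a * p[0] + b)))
    keep = keep[: max(2, int(len(pts) * (1 - outlier_frac)))]
    return fit_line(keep)

# y = 2x with one gross outlier among 20 points.
pts = [(x, 2.0 * x) for x in range(19)] + [(19, 500.0)]
a, b = resistant_fit(pts, outlier_frac=0.05)
print(round(a, 3))  # slope recovered ≈ 2.0
```

The plain fit is badly tilted by the single outlier; assuming a 5% outlier fraction lets the refit recover the true slope, mirroring the study's finding that the 5%-outlier run predicted better.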
9

基於筆畫與結構分析之中文書法美感評估 / Aesthetic Evaluation of Chinese Calligraphy Based on Stroke and Structural Analysis

林育如, Lin, Yuh Ru Unknown Date (has links)
Over its long history Chinese calligraphy has evolved from a means of recording information into an art form. From ancient times to the present, many master calligraphers and aestheticians have written treatises on calligraphy, but Chinese calligraphy theory mostly discusses technique in abstract terms, and with little related literature it is difficult to quantify aesthetics concretely. This thesis analyzes the strokes and structure of Chinese calligraphy from a computer-vision perspective, identifies the visual elements that affect how beautiful calligraphy appears, quantifies them, and, through machine learning, gives the computer a basic capacity for calligraphy appreciation. Unlike previous studies, we propose six features that describe the overall aesthetics of a regular-script (Kai) calligraphy work: layout neatness, control of character spacing, character offset, stability of character size, consistency of stroke style, and stroke smoothness. We collected 100 works from calligraphy competitions and amateur writers; each was rated by native Chinese-speaking evaluators, and the scores were used as sample labels. An SVM classifies the samples into three levels and into five levels, with good recognition in both settings. Furthermore, we convert the classification results into an aesthetic score, which also faithfully reflects the human ratings. We hope the results can offer calligraphy beginners a baseline reference for their work. / After a long history of evolution, Chinese calligraphy has transformed from a tool for writing into a unique form of art. Many publications on calligraphy writing techniques and appreciation have emerged along the way. Although the theory of Chinese calligraphy aesthetics is profound, it is difficult to define measures that quantify "beauty" or "taste." The objective of this research is to explore and extract relevant visual features for the aesthetic evaluation of Chinese calligraphy using computer-vision and machine-learning techniques. Specifically, we propose six visual features to describe the quality of calligraphy work in Kai style: layout, word separation, character offset, size regularity, style consistency, and stroke uniformity. We then employ a support vector machine (SVM) classifier to categorize each work into three or five levels of expertise; in both cases, good recognition results are achieved. Furthermore, an aesthetic score can be obtained by converting the classification result with weighting factors. We hope that the evaluation results can assist beginners in identifying flaws in their writing and provide constructive suggestions for improving their skill in Chinese calligraphy.
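One of the six features, size regularity, can be approximated by the coefficient of variation of the characters' bounding-box areas; the formula below is a guess at the idea for illustration, not the thesis's actual definition:

```python
import math

def size_stability(areas):
    """Score character-size uniformity via the coefficient of variation
    (std / mean) of character areas, mapped to (0, 1] by 1 / (1 + cv) so
    a perfectly uniform work scores 1.0 and sloppier works score lower."""
    n = len(areas)
    mean = sum(areas) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in areas) / n)
    return 1.0 / (1.0 + std / mean)

uniform = [100, 100, 100, 100]   # equal-sized characters
sloppy = [60, 140, 90, 110]      # wildly varying sizes
print(size_stability(uniform))                            # → 1.0
print(size_stability(sloppy) < size_stability(uniform))   # → True
```

Each of the six features can be reduced to a scalar in this way, and the resulting feature vector is what the SVM classifies into expertise levels.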
10

透過Spark平台實現大數據分析與建模的比較:以微博為例 / Accomplish Big Data Analytic and Modeling Comparison on Spark: Weibo as an Example

潘宗哲, Pan, Zong Jhe Unknown Date (has links)
The rapid growth and change of data, together with ever-evolving analysis tools, increase the challenge of data analysis. Through a complete machine-learning pipeline, this study aims to provide a reference blueprint for academia and industry when adopting big-data analytics. We use Spark as the computing framework for big-data analysis and build machine-learning models with MLlib's two packages, Spark.ml and Spark.mllib, to address problems that may arise in traditional data analysis. During the analysis we compare the situations in which different Spark modules are most suitable, first developing on a local cluster and finally submitting jobs to an Amazon cloud cluster to speed up modeling and analysis. The big-data pipeline is demonstrated on Weibo, using the 2012 mainland-China Weibo dataset provided by the Journalism and Media Studies Centre of the University of Hong Kong. We use RDDs, Spark SQL, and GraphX to extract feature values from Weibo users' posts and build a random-forest prediction model for the binary classification of whether a user is officially verified. / The rapid growth of data volume and the advance of data-analytics tools dramatically increase the challenge of adopting big-data analytics services. This thesis presents a reference blueprint for a big-data analytics pipeline for academia and companies considering such services. We propose Apache Spark as the big-data computing framework, whose MLlib library contains the two packages Spark.ml and Spark.mllib for building machine-learning models; this addresses traditional data-analytics problems. Within this pipeline, we discuss when each Spark module is suitable. We first use a local cluster to develop the data-analytics project and then submit the jobs to AWS EC2 clusters to accelerate performance. We demonstrate the proposed blueprint on the 2012 Weibo dataset. Finally, we use RDDs, Spark SQL, and GraphX to extract features from a large number of Weibo users' posts, and construct a prediction model of official certification for Weibo users with the random-forest algorithm.
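The final prediction step — a random forest deciding the binary "officially verified" label — reduces to a majority vote over its trees; the decision stumps and user features below are invented for illustration (the actual model would be trained with Spark MLlib's random forest on the extracted Weibo features):

```python
from collections import Counter

def forest_predict(trees, features):
    """Majority vote of the individual trees' predictions — how a random
    forest combines its trees into a single binary label."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical decision stumps over Weibo-style user features.
trees = [
    lambda f: f["followers"] > 10_000,       # verified users are followed,
    lambda f: f["posts_per_day"] > 1,        # post often,
    lambda f: f["mentions_received"] > 100,  # and get mentioned a lot
]
user = {"followers": 50_000, "posts_per_day": 0.5, "mentions_received": 900}
print(forest_predict(trees, user))  # → True (2 of 3 trees vote "verified")
```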
