1.
詞彙向量的理論與評估基於矩陣分解與神經網絡 / Theory and evaluation of word embedding based on matrix factorization and neural network
張文嘉, Jhang, Wun Jia
隨著機器學習在越來越多任務中有突破性的發展,特別是在自然語言處理問題上得到越來越多的關注,近年來,詞向量是自然語言處理研究中最令人興奮的部分之一。在這篇論文中,我們討論了兩種主要的詞向量學習方法:一種是傳統的矩陣分解,如奇異值分解;另一種是基於神經網絡的模型(如具有負採樣的Skip-gram模型(Mikolov等人,2013)),它是一種迭代演算法。我們提出一種挑選初始值的方法:使用奇異值分解得到的詞向量作為Skip-gram模型的初始值。結果發現,替換較佳的初始值後,在某些自然語言處理任務中得到明顯的提升。 / Recently, word embedding has been one of the most exciting areas of research in natural language processing. In this thesis, we discuss the two major learning approaches for word embedding. One is traditional matrix factorization, such as singular value decomposition; the other is based on a neural network model (e.g., the Skip-gram model with negative sampling (Mikolov et al., 2013b)), which is an iterative algorithm. It is known that an iterative process is sensitive to its initial starting values. We present an approach that initializes the Skip-gram model with negative sampling using word vectors obtained from singular value decomposition. Furthermore, we show that the refined starting points improve performance on the analogy task and succeed in capturing fine-grained semantic and syntactic regularities using vector arithmetic.
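The initialization idea in this abstract can be sketched in a few lines: build a PPMI co-occurrence matrix from a corpus, factor it with SVD, and use the scaled left singular vectors as starting values for skip-gram training. Everything below (the toy corpus, window size, and dimension) is an illustrative assumption, not the thesis code.

```python
import numpy as np

# Toy corpus; all data here are illustrative, not from the thesis.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [w for s in corpus for w in s.split()]
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Symmetric co-occurrence counts within a +/-2 window.
C = np.zeros((V, V))
for s in corpus:
    ws = s.split()
    for i, w in enumerate(ws):
        for j in range(max(0, i - 2), min(len(ws), i + 3)):
            if j != i:
                C[idx[w], idx[ws[j]]] += 1

# PPMI transform, then truncated SVD to get d-dimensional vectors.
P = C / C.sum()
pw = P.sum(axis=1, keepdims=True)
pc = P.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(P / (pw @ pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

d = 4
U, S, Vt = np.linalg.svd(ppmi)
# Scaled singular vectors would serve as SGNS starting values.
init_vectors = U[:, :d] * np.sqrt(S[:d])
print(init_vectors.shape)
```

In the thesis's setting, these vectors would replace the usual random initialization of the Skip-gram input matrix before iterative training begins.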
2.
應用大數據於杭州市房地產價格模型之建立 / The Application of Big Data Analytics on Real Estate Price Model of Hangzhou
郁嘉綾, Yu, Cia-Ling
互聯網的發展與近年來數據平台受到公私部門重視,資訊的取得與流通變得便捷。中國房地產文化目前有別於台灣,尚無實價登錄機制,且地域面積廣大,傳統估價模型可能無法直接應用。面對房地產背後眾多的影響因素,本研究將預測建模目標放在泡沫化尚不嚴重且較具有潛力的中國新一線城市杭州市,自新浪二手房網爬取杭州市房地產數據,並自國家統計局取得各地區行政支出數據,作為實證分析資料。結合自動程序爬蟲抓取數據、統計分析與機器學習方法,期望對中國房地產建立一混合非監督式與監督式學習之模型。
分群之後,建構模型採用的技術為C5.0、三層CHAID、五層CHAID與Neural Network;挑選出最適合的模型為使用混合模型後的C5.0決策樹方法,達到降低變數維度、同時提升或維持相當預測準確率的雙贏目標。模型中行政地區、面積、總樓層為最常出現的重要變數。
另外透過集群分析於行政支出的應用,發現2016年度杭州市投入的行政支出集中於余杭區、蕭山區、濱江區,成為賣屋及購屋者的第二項決策標準。 / In recent years, with the growth of the Internet and the increasing importance of data platforms in the public and private sectors, obtaining and sharing information has become easy. The culture of real estate in China differs from that of Taiwan; for instance, there is no actual-price registration system. Furthermore, traditional valuation models may not be directly applicable to China, given the vast geographical area of the mainland. Many factors influence a house price model. This study focuses on Hangzhou, because its real estate bubble is not as serious as in the first-tier cities and it is one of the new first-tier cities in China. The research data were crawled from the Sina second-hand housing website and the National Bureau of Statistics. Using automated web crawling, statistical analysis, and machine learning methods, we build a real estate model for China that combines unsupervised and supervised learning.
After clustering the Hangzhou second-hand housing data, this study used C5.0, three-layer Chi-Square Automatic Interaction Detector (CHAID), five-layer CHAID, and a Neural Network (NN). The goals are both to reduce dimensionality and to obtain better forecast accuracy. After comparing the final results, the clustering-plus-C5.0 model was chosen as the most appropriate house price model, achieving this win-win situation. Administrative region, area, and total floor are the three most frequent influential variables.
Applying cluster analysis to administrative expense data in Hangzhou, the study found that government resources are concentrated in Yuhang, Xiaoshan, and Binjiang. This can serve as a second decision-making criterion for house sellers and buyers.
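A minimal sketch of the hybrid pipeline this abstract describes: an unsupervised clustering step followed by a per-cluster supervised predictor. The synthetic data, the 2-means routine, and the per-cluster mean-price rule (a stand-in for the C5.0/CHAID trees used in the study) are all invented for illustration.

```python
import numpy as np

# Synthetic "listings": two size regimes, price roughly tracking area.
rng = np.random.default_rng(0)
area = np.concatenate([rng.normal(50, 5, 20), rng.normal(120, 10, 20)])
price = 2.0 * area + rng.normal(0, 5, 40)

# Step 1 (unsupervised): plain 2-means on area.
centers = np.array([area.min(), area.max()])
for _ in range(10):
    labels = np.abs(area[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([area[labels == k].mean() for k in (0, 1)])

# Step 2 (supervised): predict each listing's price by its cluster's
# mean price, a toy substitute for fitting a tree per cluster.
pred = np.array([price[labels == k].mean() for k in (0, 1)])[labels]
rmse = np.sqrt(np.mean((price - pred) ** 2))
print(round(rmse, 2))
```

Even this crude two-stage version beats a single global mean, which is the intuition behind clustering before fitting the final price model.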
3.
適用於中文史料文本之作者語言模型分析方法研究 / An enhanced writer language model for Chinese historical corpora
梁韶中, Liang, Shao Zhong
因應近年來數位典藏的趨勢日漸發展,越來越多珍貴中文歷史文本選擇進行數位保存,而保存的同時會面對文本的作者遺失或從缺,進而影響文本的完整性。本論文提出了一個適用於中文史料文本作者分析的方法,主要是透過語言模型的建構,為每一位潛在的作者訓練出一個專屬的語言模型;搭配不同的平滑方法,能避免某一受測文本單詞出現的機率為零而造成計算上的錯誤。本論文主要採用改良式Kneser–Ney平滑方法,該平滑方法因其會同時考慮到N詞彙語言模型高低頻詞的影響,而成為建構語言模型普遍選擇的平滑方式。
若僅將每一位潛在作者的所有文章合併訓練成單一的語言模型,會忽略掉許多特性。所以本論文在取得具有價值的歷史文本之外,又加入後設資料(Metadata)進行綜合分析,包括人工標記的主題分類統計資訊,使建構出來的語言模型更適配受測文本,增加預測結果的準確性;並加入額外的自定義字詞,以符合文本專有名詞的用詞習慣。此外,還會在一般建構語言模型的基礎上加入長字詞的權重,以確定字詞長度對預測準確度的影響。最後採用遞歸神經網路(Recursive neural networks)結合語言模型進行作者預測,與傳統的語言模型分析作進一步的比較。 / In recent years, the trend toward digital collections has been growing, and more and more precious Chinese historical corpora have been selected for digital preservation. During preservation, however, the author of a text may be lost or missing, which affects the integrity of the corpus. This thesis proposes a method for identifying the author of a Chinese historical text, mainly through the construction of language models: a dedicated language model is trained for each potential author, and smoothing methods are applied to avoid assigning zero probability to words in a test text, which would otherwise cause computational errors. This thesis mainly adopts interpolated modified Kneser-Ney smoothing, which takes into account the influence of both higher-order and lower-order n-gram frequencies; for this reason it has become a very popular choice for constructing language models.
Merging all the articles of each potential author into a single language model would ignore many features. Therefore, in addition to the valuable historical corpora themselves, this thesis incorporates metadata into the analysis, including statistics on manually labeled topic categories, so that the constructed language model better fits the test text and the prediction accuracy increases. Additional custom words are added to match the proper-noun usage of the corpus. On top of the standard language model construction, weights for long words are also added, to determine the relationship between word length and prediction accuracy. Finally, recursive neural network language models are used to predict authors and are further compared with the traditional language model analysis.
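The per-author scoring scheme this abstract describes can be illustrated with a toy version: one smoothed unigram model per candidate author, scoring a test fragment by log-likelihood and attributing it to the highest-scoring author. Add-one smoothing stands in here for the interpolated modified Kneser-Ney smoothing the thesis uses, and the texts are made up.

```python
import math
from collections import Counter

# Toy training texts, one per candidate author (illustrative only).
corpus = {
    "author_a": "the river flows east past the old temple".split(),
    "author_b": "soldiers marched north and soldiers camped".split(),
}
vocab = {w for words in corpus.values() for w in words}

def log_prob(words, counts, vsize):
    # Add-one smoothed unigram log-likelihood; avoids zero probability
    # for words unseen in an author's training text.
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vsize)) for w in words)

models = {a: Counter(ws) for a, ws in corpus.items()}
test = "the temple by the river".split()
scores = {a: log_prob(test, c, len(vocab)) for a, c in models.items()}
best = max(scores, key=scores.get)
print(best)
```

The thesis's version differs in using higher-order n-grams with Kneser-Ney discounting, plus metadata and word-length weighting, but the attribution step (argmax of per-author likelihoods) is the same.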
4.
基於圖像資訊之音樂資訊檢索研究 / A study of image-based music information retrieval
夏致群
以往的音樂資訊檢索方法多使用歌詞、曲風、演奏的樂器或一段音頻訊號來當作查詢的媒介,然而,在某些情況下,使用者沒有辦法清楚描述他們想要尋找的歌曲,如:情境式的音樂檢索。本論文提出了一種基於圖像的情境式音樂資訊檢索方法,可以透過輸入圖片來找尋相應的音樂。此方法中我們使用了卷積神經網絡(Convolutional Neural Network)技術來處理圖片,將其轉為低維度的表示法。為了將異質性的多媒體訊息映射到同一個向量空間,資訊網路表示法學習(Network Embedding)技術也被使用,如此一來,可以使用距離計算找回和輸入圖片有關的多媒體訊息。我們相信這樣的方法可以改善異質性資訊間的隔閡(Heterogeneous Gap),也就是指不同種類的多媒體檔案之間無法互相轉換或詮釋。在實驗與評估方面,首先利用從歌詞與歌名得到的關鍵字來搜尋大量圖片當作訓練資料集,接著實作提出的檢索方法,並針對實驗結果做評估。除了對此方法的有效性做測試外,使用者的回饋也顯示此檢索方法和其他方法相比是有效的。同時我們也實作了一個網路原型,使用者可以上傳圖片並得到檢索後的歌曲,實際的使用案例也將在本論文中被展示與介紹。 / Listening to music is indispensable to everyone, and music information retrieval systems help users find their favorite music. A common scenario is to search for songs based on a user's query. Most existing methods use descriptions (e.g., genre, instrument, or lyrics) or the audio signal of music as the query, and then retrieve the songs related to it. The limitation of this scenario is that users might find it difficult to describe what they really want to search for. In this paper, we propose a novel method, called "image2song," which allows users to input an image to retrieve related songs. The proposed method consists of three modules: a convolutional neural network (CNN) module, a network embedding module, and a similarity calculation module. For the processing of images, the CNN is adopted to learn representations of images. To map each entity (e.g., image, song, and keyword) into the same embedding space, a heterogeneous representation is learned by a network embedding algorithm from the information graph. This method is flexible because it is easy to add other types of multimedia data to the information graph. In the similarity calculation module, the Euclidean distance and cosine distance are used as our criteria to compare similarity, and we retrieve the most relevant songs accordingly. The experimental results show that the proposed method performs well.
Furthermore, we build an online image-based music information retrieval prototype system that showcases some examples from our experiments.
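The similarity-calculation module this abstract describes reduces to ranking songs by their distance to the image vector in the shared embedding space. The vectors below are invented for the sketch; in the thesis they would come from the CNN and network-embedding modules.

```python
import numpy as np

# Hypothetical embeddings already mapped into one shared space.
songs = {
    "song_rain": np.array([0.9, 0.1, 0.0]),
    "song_party": np.array([0.0, 0.2, 0.9]),
}
image_vec = np.array([0.8, 0.2, 0.1])  # made-up CNN output, projected

def cosine(a, b):
    # Cosine similarity, one of the two criteria named in the abstract.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank songs by similarity to the query image and return the best match.
ranked = sorted(songs, key=lambda s: cosine(image_vec, songs[s]), reverse=True)
print(ranked[0])
```

Euclidean distance would work the same way with `np.linalg.norm(a - b)` and ascending sort; the interesting part of the thesis is learning the shared space, not this final ranking step.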
5.
探索類神經網路於網路流量異常偵測中的時效性需求 / Exploring the timeliness requirement of artificial neural networks in network traffic anomaly detection
連茂棋, Lian, Mao-Ci
雲端的盛行使得人們做任何事都要透過網路,但總會有些有心人士利用惡意程式發動攻擊,或透過網路連線竊取資料。為了防止這些網路惡意攻擊,我們必須不斷檢查網路流量資料。然而在現今的雲端時代,網路資料非常龐大且複雜,若要檢查所有網路資料,不僅耗時而且非常沒有效率。
本研究使用TensorFlow與多個圖形處理器(Graphics Processing Unit, GPU)來實作類神經網路(Artificial Neural Networks, ANN)機制,用以分析網路流量資料,並得到一個可以判斷正常與異常網路流量的偵測規則;也設計一個實驗來驗證我們提出的類神經網路機制是否符合網路流量異常偵測的時效性和有效性。
在實驗過程中,我們發現使用更多的GPU可以減少訓練類神經網路的時間,並且在我們的實驗設計中使用三個GPU進行運算可以達到網路流量異常偵測的時效性。初步實驗結果顯示,我們提出機制的結果優於使用反向傳播算法訓練類神經網路所得到的結果。 / The prosperity of the cloud means that people do everything through the Internet, but some people with bad intentions use malicious programs to launch attacks or steal information through network connections. To prevent these cyber-attacks, we have to keep checking network traffic data. However, in the current cloud environment, network data are so huge and complex that checking all of them is not only time-consuming but also inefficient.
This study uses TensorFlow with multiple Graphics Processing Units (GPUs) to implement an Artificial Neural Network (ANN) mechanism that analyzes network traffic data and derives detection rules distinguishing normal from malicious traffic; we call this Network Traffic Anomaly Detection (NTAD).
Experiments are also designed to verify the timeliness and effectiveness of the derived ANN mechanism. During the experiments, we found that using more GPUs reduces training time, and that in our experimental design three GPUs suffice to meet the timeliness requirement of NTAD. Preliminary results show that the proposed mechanism outperforms an ANN trained with the standard backpropagation mechanism.
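A toy stand-in for the ANN traffic classifier this abstract describes: a single-layer network trained by plain gradient descent on synthetic two-feature "flows" (packet count, byte count). The features, data, and hyperparameters are all invented for illustration; the thesis itself uses TensorFlow on multiple GPUs.

```python
import numpy as np

# Synthetic flows: normal traffic vs. a heavier attack pattern.
rng = np.random.default_rng(1)
normal = rng.normal([10, 500], [2, 50], size=(50, 2))
attack = rng.normal([80, 4000], [10, 400], size=(50, 2))
X = np.vstack([normal, attack]) / np.array([100.0, 5000.0])  # scale features
y = np.array([0] * 50 + [1] * 50)  # 0 = normal, 1 = anomalous

# Single-layer network (logistic unit) trained by batch gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid output
    grad = p - y                          # gradient of cross-entropy loss
    w -= 0.5 * X.T @ grad / len(y)
    b -= 0.5 * grad.mean()

# Training accuracy of the learned detection rule.
acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(acc)
```

The timeliness question the thesis studies is orthogonal to this logic: the same training loop is distributed across GPUs so that the detection rule can be (re)derived fast enough for operational use.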