Global ETD Search

1	適用於中文史料文本之標記式主題模型分析方法研究 / An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora
陳奕安, Unknown Date
This thesis proposes a topic analysis method for Chinese historical texts. It is based on Labeled Latent Dirichlet Allocation (LLDA) and uses manually labeled Chinese documents to find the words related to specific topics. The algorithm adds topic seed-word information to strengthen the LDA clustering result, so that the clustered words are more closely related to their topics. With the spread of the Internet, the growth of information retrieval, and the expansion of digital archives, more and more physical books have been digitized and annotated with metadata, and how to apply text mining to these valuable historical texts has become an important research question. Identifying topics in large historical corpora is of particular interest to scholars, and LDA is a classic method for that task. We find that traditional LDA has problems describing topics after clustering, including high randomness in the topic categories and low readability of individual topics, which make later interpretation difficult. We therefore adopt Labeled LDA, a derivative of LDA that restricts the topic categories that can be generated and so reduces that randomness, and we add improvements that take Chinese word length and user-defined seed words into account. In the experiments, the topic words extracted by the improved algorithm were labeled manually by history experts, and the labels served as ground truth for information retrieval measures such as Mean Average Precision (MAP). The results show that clustering with word-length information and clustering with seed words both outperform the traditional topic model. We also compare the final results with words ranked by TF-IDF weighting, and the proposed method outperforms TF-IDF significantly.
Keywords: topic model; labeled topic model; Latent Dirichlet Allocation
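The Mean Average Precision (MAP) measure named in entry 1 can be illustrated with a minimal sketch. The abstract does not say how MAP was computed, so the function names, the per-topic dictionaries, and the choice to normalize by the number of expert-labeled relevant words are assumptions made for illustration only.

```python
def average_precision(ranked_words, relevant_words):
    """Average precision for one topic.

    ranked_words: topic words in the order the model ranks them.
    relevant_words: words an expert marked as relevant (hypothetical labels).
    Assumption: normalize by the total number of relevant words.
    """
    relevant = set(relevant_words)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, word in enumerate(ranked_words, start=1):
        if word in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, judgments):
    """Mean of the per-topic average precision over all topics in `rankings`."""
    scores = [average_precision(words, judgments.get(topic, ()))
              for topic, words in rankings.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

A ranking that places relevant words earlier scores higher, which is why the measure suits comparing topic-word lists from different models against the same expert labels.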
2	基於標記式主題模型之資料視覺化研究與實現 / A study of data visualization based on labeled topic model and its implementation
曾子芸, Unknown Date
With the explosive growth of textual information, more and more content is stored and transmitted as electronic text. As the volume grows, it becomes harder for users to grasp a whole corpus quickly. This study uses topic models and labeled topic models, text-mining methods from natural language processing, to identify the latent topics in a large corpus, and then uses data visualization to present them, so that the corpus can be explored from different angles through a variety of visual forms. The work has two parts: the text-mining algorithms and the design of the visualization. The interface was evaluated in a two-stage experiment (a task-oriented stage followed by an assigned-task stage) together with questionnaires measuring ease of use and usefulness. The questionnaire scores indicate that the interface helps experts carry out text-based research, and that it lets users with differing familiarity with the corpus grasp the overall picture of events in a large corpus more quickly.
Keywords: data visualization; text data visualization; topic model
3	基於意見探勘與主題模型之部落格食記剖析研究 / A Study of Opinion Mining and Topic Model Analysis on Food Diaries
賴柏帆 (Lai, Po Fan), Unknown Date
With the rise of Web 2.0, social media platforms carry a large share of how information is shared and received. In the food domain, people increasingly read restaurant reviews before dining, and blog posts, with their mix of text and photographs, are a common reference. Although such food diaries are more complete than short reviews, the evaluative remarks are scattered among purely descriptive text and there is usually no rating to consult, so readers must spend considerable effort before they can form an overall view of a restaurant. This study proposes a method for analyzing food diaries based on opinion mining and topic modeling. The amount of sentiment in blog posts about a restaurant is used to reflect positive or negative evaluation, comments are grouped into three aspects (food, service, and environment), and the restaurant receives an overall recommendation score from these aspects.
The corpus consists of 200 PIXNET food posts about four restaurants: 添好運台灣－台北站前店, 京星港式飲茶PART2, 金泰日式料理－內湖店, and 喀佈貍（一店）大眾和風串燒居酒洋食堂. An LDA (Latent Dirichlet Allocation) topic model clusters the sentences of the diaries so that sentences with similar topics fall into one group, and each group is then assigned to one of the three aspects. For example, the 喀佈貍（一店） corpus yields 10 topic groups: 6 for food, 2 for service, and 2 for environment. To distinguish positive from negative sentiment more effectively, the study uses semantic orientation (SO-PMI) to compute the polarity of opinion words that occur frequently in food diaries and builds an opinion lexicon for the domain. For validation, the results were compared with iPeen, an online restaurant review site; the average sentiment scores are close, which suggests agreement with public opinion. Compared with general review sites, the method works at a finer level of aspect and uses accumulated sentiment to reflect the bloggers' actual evaluation of the restaurant. The thesis closes with directions for future discussion and improvement.
Keywords: Opinion Mining; LDA; Topic Model; Restaurant Rating
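The SO-PMI polarity score mentioned in entry 3 can be sketched as follows. This is a generic formulation (sum of PMI with positive seed words minus sum of PMI with negative seed words, using document-level co-occurrence), not the thesis's own implementation; the seed lists, the additive smoothing constant, and the function name are assumptions.

```python
import math


def so_pmi(word, docs, pos_seeds, neg_seeds, smooth=0.01):
    """Semantic orientation of `word` from pointwise mutual information.

    docs: list of tokenized documents (each a list of words).
    pos_seeds / neg_seeds: hypothetical seed words of known polarity.
    smooth: additive smoothing (an assumption) so zero counts keep the log finite.
    A positive score suggests positive polarity, a negative score the opposite.
    """
    if not docs:
        raise ValueError("docs must not be empty")
    n = len(docs)
    doc_sets = [set(d) for d in docs]

    def doc_freq(*terms):
        # Number of documents that contain every term in `terms`.
        return sum(1 for d in doc_sets if all(t in d for t in terms))

    def pmi(a, b):
        p_ab = (doc_freq(a, b) + smooth) / n
        p_a = (doc_freq(a) + smooth) / n
        p_b = (doc_freq(b) + smooth) / n
        return math.log2(p_ab / (p_a * p_b))

    return (sum(pmi(word, p) for p in pos_seeds)
            - sum(pmi(word, q) for q in neg_seeds))
```

In this sketch a word that tends to share documents with the positive seeds gets a score above zero; the magnitude depends heavily on the smoothing value, so only the sign and relative order should be read from it.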
4	AppReco: 基於行為識別的行動應用服務推薦系統 / AppReco: Behavior-aware Recommendation for iOS Mobile Applications
方子睿 (Fang, Zih Ruei), Unknown Date
Mobile applications are now widely accepted and used. However, existing app recommendation systems rely mostly on users' actual usage and feedback, so they can hardly detect software that behaves maliciously behind the user interface, for example by stealing user data. We present AppReco, a systematic recommendation system for iOS apps that evaluates apps without requiring anyone to run them. AppReco examines apps with similar interests through static binary analysis, revealing their behaviors from the functions embedded in the executable. The analysis has three stages: (1) unsupervised learning on app descriptions, using Latent Dirichlet Allocation (LDA) for topic discovery and Growing Hierarchical Self-Organizing Maps (GHSOM) for hierarchical clustering; (2) static binary analysis of executables to discover embedded system calls; and (3) ranking apps that share topics with our scoring formula, according to their matched behavior patterns. To find apps with similar interests, AppReco discovers topics in the official descriptions and clusters apps with common topics. To evaluate apps, it applies static binary analysis to their executables to count invoked system calls and reveal embedded functions.
To recommend apps, AppReco analyzes similar-interest apps together with the behaviors of their executables and recommends those with fewer sensitive behaviors, such as advertising, access to personal information, internet connections, or use of social-network SDKs. We report our analysis of thousands of iOS apps from the Apple App Store, including most of the top 200 applications listed in each category.
Keywords: Recommender System; Mobile Application; Topic Model
5	應用情感分析於媒體新聞傾向之研究-以中央社為例 / Applying sentiment analysis to the tendency of media news: a case study of central news agency
吳信維 (Wu, Xin-Wei), Unknown Date
This research combines association rules with a new-word discovery algorithm to expand the lexicon and thereby improve the accuracy of word and sentence segmentation. Using an unsupervised sentiment analysis method, it collects news about the KMT and the DPP from the Central News Agency, builds topic models, and labels sentiment orientation. It then builds classification models with supervised learning and verifies the results.
The lexicon is expanded with an n-gram with a-priori algorithm. A total of 32,007 candidate terms were found, of which 28,838 are meaningful words; the abstract reports a success rate of up to 88%.
Two clustering approaches to topic modeling are compared: TFIDF-Kmeans and LDA. With TFIDF-Kmeans, the number of documents is far larger than the number of issue terms, so the TFIDF matrix is too sparse and clustering is poor. With LDA, in which topics are shared across documents and a document can carry several topics, topic classification accuracy exceeds 80%.
The research therefore suggests that, for texts with multiple topics, clustering issue terms with an LDA model performs better.
By combining different time intervals, the research presents the sentiment tendency of Central News Agency news in the three months before and after each of Taiwan's last five presidential elections, and examines how the sentiment of each topic category changed over those periods. The sentiment index generally peaks around polling day, and the results for the last three presidential elections show that the sentiment values of party-related news level off after the election. From the positive and negative sentiment statistics and the overall sentiment analysis, regardless of which party is in power, Central News Agency coverage of both the KMT and the DPP is positive and stable and does not particularly favor either party.
Keywords: sentiment analysis; LDA topic model; n-gram; a-priori
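The "n-gram with a-priori" lexicon expansion in entry 5 is not specified in detail, so the following is only a sketch of one common reading: count character n-grams level by level and, in the Apriori manner, count a longer n-gram only if both of its (n-1)-gram substrings were already frequent. The function name, the `min_support` threshold, and counting raw occurrences rather than documents are assumptions.

```python
from collections import Counter


def frequent_ngrams(texts, max_n=4, min_support=2):
    """Return {ngram: count} for character n-grams seen at least `min_support` times.

    Apriori-style pruning: an n-gram is counted only when its prefix and
    suffix of length n-1 were both frequent at the previous level.
    """
    frequent = {}
    previous = None  # frequent (n-1)-grams from the last level
    for n in range(1, max_n + 1):
        counts = Counter()
        for text in texts:
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue  # pruned: contains an infrequent sub-gram
                counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_support}
        if not previous:
            break  # no longer n-gram can be frequent
        frequent.update({g: counts[g] for g in previous})
    return frequent
```

The surviving n-grams are only candidates; the abstract indicates that the candidates were subsequently checked for whether they are meaningful words, a step this sketch does not attempt.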
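6	股市趨勢預測之研究 -財經評論文本情感分析 / Predict the trend in the stock by Sentiment analyzing financial posts
蔡宇祥 (Tsai, Yu Shiang), Unknown Date
Past research indicates that posts on social networking sites influence public sentiment and, through it, stock market fluctuations. For investors, being able to analyze large volumes of financial text from social media quickly, infer investor sentiment, and predict market movements could improve returns. Earlier work on text sentiment analysis has shown that supervised learning can achieve good classification with simple quantification, but its training data must carry predefined categories, so it cannot anticipate unknown categories. This study therefore uses deep learning to pick out stock-market articles from a large dataset and applies a sentiment analysis method for financial text that mixes supervised and unsupervised learning. Unsupervised learning determines the topic of Weibo financial posts, computes a sentiment index, and labels sentiment orientation; supervised learning then builds a classification model to predict the trend of the Shanghai index. Finally, trend-line charts drawn with visualization tools are used to find the topics that behave as leading indicators.
In the experiments, word2vec was used to retrieve relevant stock-market articles, which effectively filtered the texts to be analyzed. For topic modeling, LDA was adopted for topic labeling, because the number of documents exceeded the number of issue terms, which made the TFIDF matrix too sparse and K-means clustering ineffective. For sentiment labeling, the expanded sentiment lexicon judged word polarity better than NTUSD, and the trend line of the computed sentiment index effectively predicted the trend of the Shanghai index. Not every topic's sentiment index is a leading indicator: only those of the company-performance and Shanghai-index topics anticipated the index trend, so the classification model was built on the texts of these two topics. The classification model using the sentiment index was 7% more accurate than the model using index indicators alone, which confirms that sentiment analysis can improve prediction of the Shanghai index trend and help investors increase returns.
Keywords: sentiment analysis; Word2vec; LDA topic model; K-means; Shanghai stock index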
7	應用情感分析於指數型證券投資信託基金趨勢預測之研究 / Research into sentimental analysis to predict exchange-traded fund trend
黃泓銘 (Huang, Hung-Ming), Unknown Date
The asset scale of ETFs has grown rapidly in recent years, driven in part by steady economic growth in Asia, and the Yuanta Taiwan Top 50 ETF is favored by investors because of its large scale. Past research has shown that text on the Internet, such as news and tweets, affects public sentiment, and that public sentiment in turn affects stock prices. For investors, quickly analyzing sentiment in large volumes of online financial text in order to predict price movements could raise returns. However, hundreds of financial texts appear every day, and manual analysis cannot keep up. This research applies text mining to propose a sentiment-based price prediction model. Earlier sentiment analysis research has shown supervised methods to be accurate, but they cannot anticipate unknown categories. This research therefore combines supervised and unsupervised methods. Unsupervised learning determines the topic of the financial texts from the whole of 2016, computes a sentiment index for each document, and labels its sentiment polarity.
Supervised learning then combines the sentiment index with Taiwan stock indicators, international indicators, macroeconomic indicators, and technical indicators to build a classification model that predicts the price trend of the Yuanta Taiwan Top 50 ETF. For topic labeling, the number of documents was far larger than the number of issue terms, so the TF-IDF matrix was too sparse and the TF-IDF with K-means topic model classified poorly; the LDA model, in which all topics are shared by all documents, clustered terms better. For sentiment labeling, the expanded sentiment lexicon judged word polarity better than NTUSD. The classification model combining the sentiment index with technical indicators was 7% more accurate than the model using technical indicators alone, and adding indirect sentiment indicators brought accuracy to 71%. Sentiment analysis of financial text therefore improves prediction of the Yuanta Taiwan Top 50 price trend.
Keywords: sentiment analysis; LDA topic model; support vector machine (SVM); ETF
8	應用主題探勘與標籤聚合於標籤推薦之研究 / Application of topic mining and tag clustering for tag recommendation
高挺桂 (Kao, Ting Kuei), Unknown Date
Social tagging has been a popular way for users to interpret and share information since Web 2.0. As an alternative to traditional classification, it is convenient and flexible, so users can easily tag content. It also has drawbacks: a great deal of content carries no tags, and many tags are vague or imprecise, which weakens a system's ability to organize and classify tags. To address both problems, this study proposes an automated tag recommendation method that combines topic mining and tag clustering, with the aim of recommending suitable tags without manual work. The corpus consists of 2,500 popular Chinese articles from PIXNET blogs, each with more than 5,000 views; after preprocessing, 1,939 articles were used to train the model and 400 to test it. In the topic mining part, an LDA topic model computes the topic semantics of each article and associates them with existing tags, so that the topics of a new article can be predicted and topic-related tags recommended. Perplexity, a measure of model performance, is used to help choose the number of LDA topics, which avoids choosing it subjectively. In the tag clustering part, hierarchical clustering groups tags that have occurred together, in order to find tags with similar semantic concepts.
The stopping condition is a minimum co-occurrence count of 1, which removes the need to fix the number of clusters in advance and lets the method find a suitable number automatically. The topic mining results show that recommending tags from article topic semantics is feasible, and that the number of topics chosen with the help of perplexity gives more consistent results than other choices. The tag clustering results show that tags in the same group do share similar concepts. Finally, combining the two methods raised Top-1 to Top-5 accuracy by 14.1% on average and reached a Top-1 accuracy of 72.25%, which indicates that the method can recommend suitable tags and help users tag articles more conveniently and precisely after writing them.
Keywords: Tag recommendation; Topic model; Hierarchical clustering
9	應用情感型態分析於指數股票型基金趨勢研究-以台灣卓越50基金為例 / A study on the trend of exchange traded funds by sentiment pattern analysis in Yuanta Taiwan Top 50 ETF
林詠翔 (Lin, Yong-Xiang), Unknown Date
Research indicates that ETF assets have grown rapidly in recent years, and the Yuanta Taiwan Top 50 ETF (0050) is popular with investors because of advantages such as its large market scale. With the development of big data, text mining has matured, and this research proposes a sentiment-based price prediction model intended to raise investors' returns. Earlier work took single words in a document as the unit of analysis, which often runs into problems with synonyms and polysemy; this research therefore proposes a supervised learning method based on sentiment pattern analysis. To overcome the difficulty of obtaining training data for supervised learning, it also uses unsupervised learning for topic grouping and sentiment labeling. We build a dataset of Taiwan stock market news, select a lexicon of popular issue terms, and apply an unsupervised LDA topic model. The results show that during the 2016 presidential election media attention to company-related issues fell, and the number of such texts dropped sharply.
In the sentiment labeling stage, a sentiment lexicon that combines NTUSD, HowNet, and a self-expansion algorithm can assign a polarity to 10% of neutral terms and a sentiment orientation to 96% of the documents. Visualization shows that DIF-MACD can predict the long-term trend of 0050, while the news sentiment index performs well on short-term price fluctuations. Among the topic groups, the news sentiment indices of the macroeconomy and company-operations categories lead prices by about 1 to 2 days, which benefits the later prediction model. For the supervised step, we apply a Pattern Taxonomy Model (PTM) to Chinese text and compare it with a vector space model (VSM) and a support vector machine (SVM). The optimized PTM combined with the Taiwan Weighted Stock Index performs relatively well, with an F1-measure of up to 85%. We also find that news sentiment during non-trading hours can influence the price fluctuations of 0050.
Keywords: sentiment analysis; LDA topic model; pattern model; exchange-traded fund (ETF)
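Entry 9 refers to DIF-MACD without defining it. The sketch below assumes the common convention in which DIF is the 12-period EMA of closing prices minus the 26-period EMA, the MACD (signal) line is the 9-period EMA of DIF, and their difference is plotted as a histogram. The periods, the seeding of each EMA with the first value, and the function names are assumptions, not details taken from the thesis.

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1).

    Assumption: the series is seeded with its first value.
    """
    alpha = 2 / (span + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (dif, signal_line, histogram) lists for a series of closing prices."""
    dif = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    signal_line = ema(dif, signal)
    histogram = [d - s for d, s in zip(dif, signal_line)]
    return dif, signal_line, histogram
```

With steadily rising prices the fast EMA sits above the slow one, so DIF is positive; a flat price series gives DIF of zero throughout.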
10	運用財經文本情感分析於台灣電子類股價指數趨勢預測之研究 / Research of applying Sentimental Analysis on financial documents to predict Taiwan Electronic Sub-Index trend
劉羿廷, Unknown Date
The electronics industry is Taiwan's most competitive industry, and electronics stocks account for 69.49% of trading on the centralized market, so their fluctuations can strongly affect the whole Taiwan stock market. Many studies show that text on the Internet spreads quickly through social networks and affects public sentiment, which in turn affects stock prices. For investors, quickly analyzing large volumes of online financial text to infer investor sentiment and predict price trends could increase profit. However, nearly a hundred financial texts appear each day, and traditional manual sampling is inefficient and cannot handle this volume. Earlier sentiment analysis research has shown that supervised learning can classify well with simple quantification, but its training data must carry predefined categories, so it cannot identify unknown topics in the text. This research therefore proposes a sentiment analysis method for financial text that mixes supervised and unsupervised learning.
Unsupervised learning is applied to electronics-industry financial texts from the whole of 2014 to determine document topics, compute a sentiment index, and label sentiment orientation. Trend-line charts drawn with visualization tools are then used to find the topics whose sentiment indices are leading indicators. Supervised learning next combines the sentiment index with international, macroeconomic, Taiwan stock, and technical indicators (24 indirect sentiment indices in all) to build a classification model that predicts the trend of the Taiwan Electronic Sub-Index (TE). For topic labeling, the number of documents was far larger than the number of issue terms, so the TFIDF matrix was too sparse and the TFIDF-Kmeans topic model classified poorly; because documents cover multiple topics, the issue terms clustered by NPMI-Concor were too complex to summarize. The LDA model, in which all topics are shared by all documents, outperformed both in term clustering and topic classification, with a classification accuracy of 98%, and was used for topic labeling. For sentiment labeling, the sentiment lexicon built in this study judged word polarity better than NTUSD, and the trend line of the sentiment index followed the TE trend more closely than the MACD line that investors commonly use. Only the sentiment indices of documents about enterprise operation and macroeconomics anticipated the TE trend, so the classification model was built on these two topics. The model combining the sentiment index with technical indicators was 7% more accurate than the model using technical indicators alone, and adding the indirect sentiment indicators brought accuracy to 71%. Sentiment analysis therefore improves the accuracy of TE trend prediction and can raise investors' return on investment.
Keywords: sentiment analysis; big data; LDA topic model; support vector machine (SVM); Taiwan Electronic Sub-Index trend

1	適用於中文史料文本之標記式主題模型分析方法研究 / An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora 陳奕安 Unknown Date (has links) 本論文提出了一個適用於中文史料文本主題分析方法,主要是根據標記式隱含狄利克雷分布(Labeled Latent Dirichlet Allocation,LLDA) 演算法,使其可以透過人工標記的中文文本找出特定主題的相關詞彙。在我們提出的演算法中,我們加上主題種子字詞(Seed Words) 資訊,以增強 LDA 群聚過後的結果,使群聚過後的詞彙與主題的關聯度能夠獲得提昇。近年來,隨著網際網路的普及以及資訊檢索的蓬勃發展,同時由於數位典藏的資料成長,越來越多的實體書藉被編輯成數位版本並且加上後設資料(Metadata),在取得這些富有價值的歷史文本資料後,如何利用文字探勘技術(Text Mining)在這些資料上變成一項重要的研究議題。其中,如何從大量文本史料中辨識出文章主題更是許多學者感興趣的方向,而 LDA 主題模型則是在文字探勘領域中非常經典的方法。在此研究中我們發現傳統 LDA 對於群聚後的主題描述存在些許問題,包括主題類別的高隨機性以及個別主題的低易讀性,使得後續的解讀工作變得十分困難,因此我們採用了由 LDA 衍生出的標記式主題模型 Labeled LDA 演算法,限定能夠產生的主題類別以降低期隨機性,此外我們還加入了考量中文字詞的長度以及自定義的相關種子字詞等改進,使群聚出的主題詞彙能夠與主題更加相關,更加容易描述。實驗部分,我們利用改良後的演算法提取出主題詞彙,並進行人工標記,接著將標記的結果作為正確解答來計算平均準度均值(Mean Average Precision,MAP)等資訊檢索之評估方法作為評估,結果證實以長字詞以及種子字詞為考量所群聚出的結果皆優於傳統主題模型所群聚出的結果;此外,我們也將最終的結果與 TF-IDF 權重計算後的字詞進行比較,並由實驗結果可見其兩者之間的差異性。 / This paper proposes an enhanced topic model based on Labeled Latent Dirichlet Allocation (LLDA) for Chinese historical corpora to discover words related to specific topics. To enhance the traditional LDA performance and to increase the readability of its clustered words, we attempt to use the infor- mation of seed words and the Chinese word length into the traditional LDA algorithm. In this study, we find that the traditional LDA exists some prob- lems about topic descriptions after clustering. We therefore apply the Labeled LDA algorithm, which is derived from traditional LDA, with the proposed improvements of considering the lengths of the words and related seed words. In our experiments, Mean Average Precision (MAP) is used to evaluate our experiment results based on the topics words labeled manually by historical experts. The experimental results shows that the proposed method of consid- ering both Chinese word length information and seed words is better than the traditional LDA method. 
In addition, we compare the proposed results with the TF-IDF weighting scheme, and the proposed method also outperforms the TF-IDF method significantly. 主題模型標記式主題模型隱含狄利克雷分布
2	基於標記式主題模型之資料視覺化研究與實現 / A study of data visualization based on labeled topic model and its implementation 曾子芸 Unknown Date (has links) 隨著文字資訊的爆炸式增長，越來越多的訊息開始以電子文本的形式儲存及傳遞。但隨著文本內容資訊量不斷地增加，使用者也越來越難以快速地掌握文本全貌。因此本研究試圖透過主題模型（TopicModels）、標記式主題模型（Labeled Topic Models）演算法－在自然語言處理領域裡文本探勘的方法，識別出大規模文本中潛藏的主題訊息之後，再利用圖像視覺化在資訊表達上的優勢和效率，透過各種視覺化圖案的呈現從不同的角度來探索文本，形成一種嶄新的大規模文本閱讀與分析方式。本研究設計了兩階段實驗：第一階段任務導向性實驗、第二階段指定任務實驗，以及評估問卷來驗證本介面的易用性（ Ease-of-use ）和有用性（ Usefulness ）。並透過實驗問卷的分數結果驗證了，本研究所設計之介面在實務上的確能輔助專家學者進行文本相關研究，也能讓對文本熟悉程度不一的使用者在利用此介面探索文本的過程中，更快速地掌握大規模文本的事件全貌。 / With the explosion of text information, there are more and more data being recorded and transmitted in the form of texts. However, as the amount of textual information becomes larger, how to effectively and efficiently realize the information also becomes more difficult. This study attempts to use the Topics Models, text-mining techniques to identify the important topics in the large textual information. In addition, this study also aims to use the techniques of data visualization to present the most informative and valuable details within the large texts. There are two parts in this work: the first part is the introduction of text mining algorithms and the second part is the design of the data visualization.Moreover, in the experiments, we also conduct several surveys to verify the proficiency and usefulness and the visualization design. The results of the experiments and surveys, supports that our design provides an effective and efficient interface for users to understand a large set of texts, even for the experts familiar with the corpus. 資料視覺化文字資料視覺化主題模型
3	基於意見探勘與主題模型之部落格食記剖析研究 / A Study of Opinion Mining and Topic Model Analysis on Food Diaries 賴柏帆, Lai, Po Fan Unknown Date (has links) 隨著Web 2.0興起，社群網站在資訊傳遞與獲取所占比重相當高。以美食領域來看，人們在進餐廳前先行閱覽食記評論之情形越來越常見，而部落格文章因圖文並茂，常被消費者列入參考比較之來源。儘管這一類食記內容相對短篇食評來說較為完整，但評論分散於文章中，且多半沒有評分可供參考，讀者很難在第一時間獲悉評論樣貌，得花上一番心力進行閱覽，才能對餐廳整體有所評鑑。本研究提出一套基於意見探勘與主題模型的食記剖析方法，由部落格中各餐廳貼文情緒量來反映正負面評價，將提及評論歸納為「食物」、「服務」及「環境」三個評分面向，進而提供該家餐廳的整體推薦分數，供讀者快速參閱之。實驗語料自痞客邦美食類貼文中選定添好運台灣－台北站前店、京星港式飲茶PART2、金泰日式料理－內湖店以及喀佈貍（一店）大眾和風串燒居酒洋食堂，合計4家餐廳與200篇語料。透過LDA主題模型對食記敘述進行主題式分群，使擁有相近主題概念的句子分為一群，並歸類至各面向，例如喀佈貍（一店）之語料可分為10群主題語句，食物面向上有6群，服務與環境面向各為2群。另一方面，為了更有效辨別食記中含有的正負向情緒，本研究透過語意導向方法(SO-PMI)來計算食記中常出現情緒詞彙之極性，以建置該領域的意見詞詞庫。實驗結果方面，以線上餐廳評論網站－iPeen愛評網作為驗證對象，顯示其語料的平均情緒量相近，於大眾觀感與評價上傾向一致，且相較一般評論網站，本研究能從較細微的面向來切入，並以情緒量反映真實的餐廳評價。最後提出未來欲探討與改善之處，供後續研究參考之。 / As the time of Web 2.0 rise, social media platform plays a crucial role in transferring and receiving information. More and more people get used to reading the related posts before having meal. Because of its richness in content and referring photographs, blog posts are most frequently used for reference. Although the blog posts are more complete regarding their content than other short reviews, the actual reviews are scattered among words that are simply descriptions, and there are no grading scale to take as reference. These all together gives the reader a hard time to efficiently organize the overview of the review, and for them to, therefore, make the decision if they should go to the restaurant. Our study offers a method of analyzing food diaries based on opinion mining and topic model. The scale of emotion in a blog post about a restaurant is used as the reflection of its review's positive or negative. The comments are categorized into food, service and environment. And the restaurant will be graded based on these three aspects to further provide the user an overall score of recommendation. 
We collected total of 200 articles written on 4 restaurants in PIXNET, then categorized the contents using LDA (Latent Dirichlet Allocation) model base on their theme. The sentences with similar theme with be put into a group, then be further categorized to the three aspects that was mentioned earlier. On the other hand, to better distinguish if the emotion in certain food diary is positive or negative, our study calculated the polarity of common opinion-based words in food diaries using semantic orientation (SO-PMI), and built an opinion corpus specifically for food diaries. In terms of the result, using iPeen, a restaurant rating website, as test reference, it shows that the average scales of opinion of the restaurants we got using our method are close to iPeen, which in this case we can say are close to the public opinion and review. Furthermore, compare to common rating website, our study touches on even the minute aspect, and use the cumulative opinion to reflect the true blog authors' evaluation of the restaurant. Lastly, we would like to bring up what we intend to discuss and improve in the future for upcoming research's reference. 意見探勘 LDA 主題模型餐廳評分 Opinion Mining LDA Topic Model Restaurant Rating
4	AppReco: 基於行為識別的行動應用服務推薦系統 / AppReco: Behavior-aware Recommendation for iOS Mobile Applications 方子睿, Fang, Zih Ruei Unknown Date (has links) 在現在的社會裡，手機應用程式已經被人們接受與廣泛地利用，然而目前市面上的手機 App 推薦系統，多以使用者實際使用與回報作為參考，若有惡意行為軟體，在使用者介面後竊取使用者資料，這些推薦系統是難以查知其行為的，因此我們提出了 AppReco，一套可以系統化的推薦 iOS App 的推薦系統，而且不需要使用者去實際操作、執行 App。整個分析流程包括三個步驟：(1) 透過無監督式學習法的隱含狄利克雷分布(Latent Dirichlet Allocation, LDA)做出主題模型，再使用增長層級式自我組織映射圖(Growing Hierarchical Self-Organizing Map, GHSOM)進行分群。(2)使用靜態分析程式碼，去找出其應用程式所執行的行為。(3)透過我們的評分公式對於這些 App，進行評分。在分群 App 方面，AppReco 使用這些應用程式的官方敘述來進行分群，讓擁有類似屬性的手機應用程式群聚在一起；在檢視 App 方面，AppReco 透過靜態分析這些 App 的程式碼，來計算其使用行為的多寡；在推薦 App 方面，AppReco 分析類似屬性的 App 與其執行的行為，最後推薦使用者使用較少敏感行為(如使用廣告、使用個人資料、使用社群軟體開發包等)的 App。而本研究使用在 Apple App Store 上面數千個在各個類別中的前兩百名 App 做為我們的實驗資料集來進行實驗。 / Mobile applications have been widely used in life and become dominant software applications nowadays. However there are lack of systematic recommendation systems that can be leveraged in advance without users’ evaluations. We present AppReco, a systematic recommendation system of iOS mobile applications that can evaluate mobile applications without executions. AppReco evaluates apps that have similar interests with static binary analysis, revealing their behaviors according to the embedded functions in the executable. The analysis consists of three stages: (1) unsupervised learning on app descriptions with Latent Dirichlet Allocation for topic discovery and Growing Hierarchical Self-organizing Maps for hierarchical clustering, (2) static binary analysis on executables to discover embedded system calls and (3) ranking common-topic applications from their matched behavior patterns. To find apps that have similar interests, AppReco discovers (unsupervised) topics in official descriptions and clusters apps that have common topics as similar-interest apps. To evaluate apps, AppReco adopts static binary analysis on their executables to count invoked system calls and reveal embedded functions. 
To recommend apps, AppReco analyzes similar-interest apps together with the behaviors of their executables, and recommends to users the apps that have fewer sensitive behaviors, such as commercial advertisements, privacy information access, and internet connections. We report our analysis of thousands of iOS apps in the Apple App Store, including most of the listed top 200 applications in each category. 推薦系統 手機應用程式 主題模型 Recommender System Mobile Application Topic Model
5	應用情感分析於媒體新聞傾向之研究-以中央社為例 / Applying sentiment analysis to the tendency of media news: a case study of central news agency 吳信維, Wu, Xin-Wei Unknown Date (has links) 本研究目的在於結合關聯規則新詞發掘演算法來擴增詞庫，並藉此提高結斷詞句的精確度以及透過非監督式情感分析方法，從中央通訊社中抓取國民黨以及民進黨的相關新聞文本，建立主題模型與情緒傾向的標注。再藉由監督式學習方法建立分類模型並驗證其成果。　　本研究藉由n-gram with a-priori algorithm來進行斷詞斷句的詞庫擴增。共有32007組詞被發掘，於這些詞中具有真正意義的詞共有28838筆，成功率可達88%。　　本研究比較兩種分群方法建立主題模型，分別為TFIDF-Kmeans以及LDA。在TFIDF-Kmeans分群結果中，因為文本數量遠大於議題詞數量，造成TFIDF矩陣過於稀疏，造成分群效果不佳。在LDA的分群結果底下，因為LDA模型其多文章多主題共享的特性，主題分類的精準度更高達八成以上。故本研究認為在分析具有多主題特性之文本，採用LDA模型來進行議題詞分群會有較佳的表現。　　本研究透過結合不同的資料時間區間，呈現出中央通訊社的新聞文本在我國近五次總統大選前後三個月間的新聞情緒傾向。同時探討各主題模型中各類別於大選前後三個月之情緒傾向變化。可以觀察到大致上文本的情感指數高峰值會出現於投票日的時候，而近三次總統大選的結果顯示，相關的政黨新聞情感值會於選舉過後趨於平緩。而從新聞文本的正負向情感統計以及整體情緒傾向分析可以看出，不論執政黨為何，中央通訊社的新聞對於國民黨以及民進黨皆呈現了正向且平穩的內容，大抵不會特別偏向單一政黨 / The purpose of this research is to combine association rules with a new-word mining algorithm to expand the lexicon and thereby improve the accuracy of word segmentation, and, by capturing KMT and DPP news from the Central News Agency, to establish topic models and sentiment orientation labels through an unsupervised sentiment analysis method. Finally, by means of supervised learning methods, this research builds classification models and verifies the results. 　　This research uses n-gram with the a-priori algorithm to expand the lexicon for word and sentence segmentation. A total of 32007 words are found, of which 28838 have real meaning. The success rate reaches 88%. 　　In this research, we compare two clustering methods for building the topic model: TFIDF-Kmeans and LDA. With TFIDF-Kmeans, the TFIDF matrix is too sparse because the number of texts is far larger than the number of issue terms, which results in poor clustering. With LDA, because the model lets multiple documents share multiple topics, the accuracy of topic classification exceeds 80%. 
Therefore, this research suggests that LDA performs better for clustering issue terms when analyzing texts with multiple topics. 　　By combining different time intervals of data, this research presents the sentiment tendencies of the Central News Agency's news in the three months before and after the last five presidential elections in Taiwan. It also explores how the sentiment tendencies of the categories in each topic model change in the three months before and after the election. It can be observed that the sentiment peak of the texts generally appears on polling day, and the results of the last three presidential elections show that the sentiment value of the relevant party news levels off after the election. From the positive and negative sentiment statistics of the news texts and the analysis of overall sentiment tendencies, no matter which party is in power, the Central News Agency's news about the KMT and the DPP presents positive and stable content and is not particularly biased toward any single party. 情感分析 LDA主題模型 n-gram a-priori Sentiment analysis LDA N-gram A-priori
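As a rough illustration of the "n-gram with a-priori" lexicon expansion described in the abstract above, the following sketch mines frequent character n-grams level by level and prunes any candidate whose shorter substrings are not themselves frequent. The support threshold, the maximum length, the function name `frequent_ngrams`, and the toy phrases are all assumptions made for illustration; the thesis's actual filtering rules are not given in the abstract.

```python
from collections import Counter


def frequent_ngrams(sentences, max_n=4, min_support=2):
    """Level-wise (Apriori-style) mining of frequent character n-grams.

    An n-gram is counted only if both of its (n-1)-gram substrings were
    frequent at the previous level, which is the Apriori pruning rule.
    Returns {ngram: count} for every n-gram with count >= min_support.
    """
    frequent = {}      # all frequent n-grams found so far
    prev_level = None  # frequent (n-1)-grams from the previous pass
    for n in range(1, max_n + 1):
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                gram = s[i:i + n]
                if prev_level is not None and (
                    gram[:-1] not in prev_level or gram[1:] not in prev_level
                ):
                    continue  # pruned: a shorter substring is infrequent
                counts[gram] += 1
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break  # no longer n-gram can be frequent
        frequent.update(level)
        prev_level = level
    return frequent


# Toy input (made-up phrases, not data from the thesis).
toy_sentences = ["民進黨主席", "民進黨候選人", "國民黨主席"]
candidates = frequent_ngrams(toy_sentences, max_n=4, min_support=2)
```

In this toy run, 民進黨 survives as a candidate word because it appears twice, while 國民黨 is pruned because it appears only once; a practical system would still need a manual or statistical check of which candidates are meaningful words.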
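6	股市趨勢預測之研究 -財經評論文本情感分析 / Predict the trend in the stock by Sentiment analyzing financial posts 蔡宇祥, Tsai, Yu Shiang Unknown Date (has links) Previous research indicates that posts on social networking sites influence public sentiment, which in turn affects stock market fluctuations. For investors, being able to quickly analyze large volumes of financial text from social networking sites to infer investor sentiment and predict stock market movements would therefore improve investment returns. Past research on text sentiment analysis has shown that supervised learning methods can achieve good classification results through simple quantification, but the training data they use must have known categories defined in advance, so they cannot anticipate unknown categories. This study therefore uses a deep learning method to extract stock-market-related articles from a large data set and applies a sentiment analysis method for financial text that mixes supervised and unsupervised learning. Unsupervised learning is used to identify the topics of Weibo financial posts, compute sentiment indices, and label sentiment orientation, and supervised learning is used to build a classification model to predict the trend of the Shanghai index. Finally, trend line charts are analyzed with visualization tools to find the topics that behave as leading indicators. In the experimental results, on the deep learning side, this study uses word2vec to capture relevant stock-market articles, which effectively filters the texts to be analyzed. On the topic model side, we ultimately use LDA to label topics: because the number of texts is larger than the number of issue terms, the TFIDF matrix is too sparse and K-means clustering performs poorly, so the LDA topic model is adopted for subsequent topic labeling. For sentiment orientation labeling, the expanded sentiment lexicon judges word polarity scores better than NTUSD, and the trend line of the computed sentiment index can effectively predict the trend of the Shanghai index. In addition, not all topics' sentiment indices have leading characteristics; only the sentiment indices of the company performance and Shanghai index topics reflect the Shanghai index trend in advance, so this study uses the sentiment indices of texts in these two topics to build the classification model. Comparing the accuracy of the classification model that uses the sentiment index with that of the model using index indicators alone, the former is 7% more accurate. This confirms that sentiment analysis can effectively improve the accuracy of Shanghai index trend prediction and help investors increase their stock market returns. 情感分析 Word2vec LDA主題模型 K-means 上海股價指數 Sentiment analysis Word2vec LDA topic model K-means Shanghai Stock Index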
7	應用情感分析於指數型證券投資信託基金趨勢預測之研究 / Research into sentimental analysis to predict exchange-traded fund trend 黃泓銘, Huang, Hung-Ming Unknown Date (has links) 近年來ETF規模快速成長，亞洲區域經濟成長與穩步發展更是帶動國際ETF市場動力來源，而元大台灣50指數型證券投資信託基金因規模大，受到投資人的青睞。根據過去的研究指出，網路上的文本訊息會對群眾情緒造成影響，進而影響股價波動，對投資者而言，若能從大量網路財金快速分析投資者大眾情緒進而預測股價波動走勢，勢必可提高報酬率。然而，每日有上百篇的財金文本產生，人工分析耗時耗力，本研究採用文字探勘技術，提出一套情感分析的價格預測模型。過去文本情感分析的研究中已證實監督式學習方法可以透過簡單量化的方式達到良好的分類效果，然而，為解決監督式學習無法預期未知的限制，本研究透過非監督式學習將2016整年度的財金文本進行文章主題判別，計算情緒指數並標記文本情緒傾向，再來使用監督式學習結合台股資訊指標、國際指標、總體經濟指標、技術指標等，建立分類模型以預測元大台灣50ETF的價格趨勢。實驗結果中，主題標注方面，本研究發現因文本數量遠大於議題詞數量造成TF-IDF矩陣過於稀疏，使得TF-IDF結合K-means主題模型分類效果不佳。LDA主題模型基於所有主題被所有文章共享的特性，使得在字詞分群優於TF-IDF結合K-means。情緒傾向標注方面，證實本研究擴充後的情感詞集比起NTUSD有更好的字詞極性判斷效果。本研究透過比較情緒指數結合技術指標之分類模型與單純技術指標分類模型的準確率發現，前者較後者高出7%的準確率。進一步結合間接情緒指標的分類模型更有71%準確率，故證實財金文本的情感分析確實能有效提升元大台灣50的價格趨勢預測。 / Rapid and stable economic growth in Asia has driven rapid growth in the asset scale of ETFs worldwide in recent years. Yuanta Taiwan Top 50 ETF has gained investors' favor because of its large market scale. Past research has shown that text documents on the Internet, e.g. news and tweets, have a great effect on public emotion, and that public emotion can in turn affect stock prices. For investors, it is important to know how to analyze the underlying emotion in text documents to predict the stock trend. However, traditional manual analysis cannot cope with the large volume of financial text documents on the Internet. In past sentiment analysis research, supervised methods have been shown to achieve high accuracy, but they are limited in predicting unknown future trends. This research combines supervised and unsupervised methods to deal with these large volumes of financial text documents. Unsupervised methods are used to find the topic of each document, and the sentiment index of each document is then calculated to determine its sentiment polarity. 
Afterwards, a supervised method is used to build a prediction model with the sentiment index. According to the results, the LDA model performs better than TF-IDF with K-means. Moreover, the prediction model that includes the sentiment index has higher accuracy than the one that includes only the technical indicators. 情感分析 LDA主題模型 支援向量機 ETF Sentimental analysis LDA SVM ETF
8	應用主題探勘與標籤聚合於標籤推薦之研究 / Application of topic mining and tag clustering for tag recommendation 高挺桂, Kao, Ting Kuei Unknown Date (has links) 標記社群標籤是Web2.0以來流行的一種透過使用者詮釋和分享資訊的方式，作為傳統分類方法的替代，其方便、靈活的特色使得使用者能夠輕易地因應內容標註標籤。不過其也有缺點，除了有相當多無標籤標註的內容，也存在大量模糊、不精確的標籤，降低了系統本身組織分類標籤的能力。為了解決上述兩項問題，本研究提出了一種結合主題探勘與標籤聚合的自動化標籤推薦方法，期望能夠建立一個去人工過程的自動化標籤推薦規則，來推薦合適的標籤給使用者。本研究蒐集了痞客邦部落格中，點閱次數大於5000次的熱門中文文章共2500篇，經過前處理，並以其中1939篇訓練模型及400篇作為測試語料來驗證方法。在主題探勘部分，本研究利用LDA主題模型計算不同文章的主題語意，來與既有標籤作出關聯，而能夠針對新進文章預測主題並推薦主題相關標籤給它。其中，本研究利用了能評斷模型表現情形的混淆度(Perplexity)來協助選取LDA的主題數，改善了LDA需要人主觀決定主題數的問題；在標籤聚合部分，本研究以階層式分群法，將有共同出現過的標籤群聚起來，以便找出有相似語意概念的標籤。其中，本研究將分群停止條件設定為共現次數最少為1次，改善了分群方法需要設定分群數量才能有結果的問題，也使本方法能夠自動化的找出合適的分群數目。實驗結果顯示，依照文章主題語意來推薦標籤有一定程度的可行性，且以混淆度所協助選取的主題數取得一致性較好的結果。而依照階層式分群所分出的標籤群中，同一群中的標籤確實擁有相似、類似的概念語意。最後，在結合主題探勘與標籤聚合的方法上，其Top-1至Top-5的準確率平均提升了14.1%，且Top-1準確率也達到72.25%。代表本研究針對文章寫作及標記標籤的習性切入的做法，確實能幫助提升標籤推薦的準確率，也代表本研究確實建立了一個自動化的標籤推薦規則，能推薦出合適的標籤來幫助使用者在撰寫文章後，能夠更方便、精確的標上標籤。 / Tags are a popular way for users to interpret and share information, and as a substitute for traditional classification methods, their convenience and flexibility make it easy for users to label content. But tagging also has disadvantages: in addition to a considerable amount of untagged content, there are also many fuzzy and inaccurate tags. To solve these two problems, this study proposes a tag recommendation method that combines Topic Mining and Tag Clustering. In this study, we collected a total of 2500 articles from Pixnet as a corpus. In the Topic Mining part, this study uses the LDA model to compute the topic semantics of different articles and associate them with existing tags, so that topics can be predicted for new articles and topic-related tags recommended for them. The number of topics for the LDA model is selected with the help of Perplexity. In the Tag Clustering part, this study uses Hierarchical Clustering to group tags that have appeared together in order to find tags with similar semantic concepts. 
The stop condition is set to a minimum of 1 co-occurrence, which removes the need to set the number of clusters in advance. First, the Topic Mining results show that it is feasible to recommend tags according to the topic semantics of an article, and the experiments show that the number of topics chosen according to Perplexity gives better results than other numbers of topics. Second, the Tag Clustering results show that tags in the same group do have similar conceptual semantics. Last, experiments show that combining the two methods increases Top-1 to Top-5 accuracy by 14.1% on average and reaches a Top-1 accuracy of 72.25%, which indicates that our tag recommendation method can recommend appropriate tags to users. 標籤推薦 主題模型 階層式分群 Tag recommendation Topic model Hierarchical clustering
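The tag clustering step with a co-occurrence stop condition can be sketched as follows. This sketch assumes single-linkage merging, under which "keep merging while two groups share at least one co-occurring tag pair" reduces to finding connected components of the tag co-occurrence graph. The abstract does not state which linkage the thesis uses, and the function name `cluster_tags` and the sample tags are hypothetical.

```python
from itertools import combinations


def cluster_tags(tagged_articles, min_cooccurrence=1):
    """Group tags linked by at least `min_cooccurrence` shared articles.

    Assumes single-linkage agglomeration, so the result equals the connected
    components of the thresholded co-occurrence graph (found via union-find).
    """
    cooc = {}    # (tag_a, tag_b) with tag_a < tag_b -> number of shared articles
    parent = {}  # union-find parent pointers
    for tags in tagged_articles:
        unique = sorted(set(tags))
        for t in unique:
            parent.setdefault(t, t)
        for a, b in combinations(unique, 2):
            cooc[(a, b)] = cooc.get((a, b), 0) + 1

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), count in cooc.items():
        if count >= min_cooccurrence:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for t in parent:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())


# Toy input: each inner list is the tag set of one blog article (made up).
toy_articles = [
    ["ramen", "tokyo"],
    ["tokyo", "travel"],
    ["python", "pandas"],
    ["movie"],
]
tag_groups = cluster_tags(toy_articles, min_cooccurrence=1)
```

Because the threshold, not a target cluster count, decides when merging stops, the number of groups falls out of the data, which matches the motivation given in the abstract.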
9	應用情感型態分析於指數股票型基金趨勢研究-以台灣卓越50基金為例 / A study on the trend of exchange traded funds by sentiment pattern analysis in Yuanta Taiwan Top 50 ETF 林詠翔, Lin, Yong-Xiang Unknown Date (has links) 根據研究指出 ETF 資產規模近幾年快速成長，元大台灣卓越 50 基金因市場規模大等優勢受到投資人的青睞，賴以巨量資料的發展使得文字探勘技術成熟，故本研究希冀提出一套情感分析的價格預測模型，提升投資者的報酬率。過往學者以文章中的單詞作為文字探勘的分析單位，常會產生同義詞、多義詞的問題，因此提出情感型態分析的監督式學習方法建立模型。另外為了解決監督式學習難以取得訓練資料的限制，本研究混合非監督式學習方法進行主題分群與情緒傾向標注。本研究建立台灣股市新聞文本資料集，並篩選熱門議題詞詞庫，進行非監督式的 LDA 主題模型，發現在 2016 年總統選舉期間，媒體對於公司相關議題的注意力降低，使得相關的文本數量大幅減少;另外在情緒傾向標注階段，因混和了 NTUSD、知網及自行擴充演算法的情感詞庫，能夠將 10%中性詞彙產生極性判斷、96%的文本標注情緒傾向。視覺化工具分析結果指出，DIF-MACD 能夠預測台灣卓越 50 基金的長期走勢，而新聞情緒指數則在短期的價格波動上表現良好，且在主題模型分群中，總體經濟、公司維運類別的新聞情緒指數具有約 1-2 日領先指標特性，對於後續的價格預測模型有所助益。在監督式情感分析方法，為解決上述同義詞、多義詞的問題，本研究採用型態分類模型於中文文本，並與向量空間模型、支援向量機等方法做比較。實驗結果指出優化的型態分類模型，並結合台灣加權股價指數，表現相對良好，F1- Measure 可達 85%。進一步討論新聞情緒對於價格預測的重要性，發現在非交易時間序列中的新聞情緒，能夠對 0050 的價格波動產生影響。 / Past research points out that the scale of ETF assets has grown rapidly in recent years. Yuanta Taiwan Top 50 ETF is popular with investors because of advantages such as its large market scale. With the development of Big Data, Text Mining technology has matured. Thus, we propose a sentiment-analysis price forecasting model to raise investors' rate of return. Text Mining research has usually taken individual document terms as the unit of analysis, which often leads to problems with synonymy and polysemy. Therefore, this research proposes a supervised learning method of sentiment pattern analysis. In addition, to address the difficulty of obtaining training data for supervised learning, we mix in unsupervised learning methods to carry out topic clustering and sentiment orientation labeling. In this study, we build a dataset of Taiwan stock market news, filter a lexicon of popular issue terms, and apply an unsupervised LDA topic model. The results show that the number of news articles about companies dropped significantly during the 2016 Taiwan presidential election because media attention to company-related issues declined. 
Moreover, by mixing NTUSD, the HowNet knowledge database, and a self-expansion algorithm, we create a sentiment dictionary that can determine the polarity of 10% of neutral terms and label the sentiment orientation of 96% of documents. Data visualization shows that the DIF-MACD curve is able to predict the long-term trend of 0050, while the news sentiment index performs well on short-term price volatility. In addition, the news sentiment indices of the macroeconomy and company operations topics lead the price by about 1 to 2 days. Finally, we apply the Sentiment Pattern Taxonomy Model (PTM) to Chinese texts as the supervised learning method and compare it with VSM and SVM. The experimental results show that the optimized PTM combined with the Taiwan Weighted Stock Index performs best, with an F1-Measure of up to 85%. Beyond this, we find that news sentiment during non-trading time can influence the price volatility of 0050. 情感分析 LDA主題模型 型態模型 指數股票型基金 Sentimental analysis LDA Pattern model ETF
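For readers unfamiliar with the DIF-MACD curve referred to in the abstract above, a minimal sketch of the usual calculation follows. It assumes the conventional 12/26/9 periods and seeds each exponential moving average with the first observation; the abstract does not say which parameters or seeding the thesis used, and the function names are illustrative.

```python
def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1).

    The first output equals the first input (a common seeding choice,
    assumed here for simplicity).
    """
    alpha = 2.0 / (span + 1)
    out = [float(values[0])]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (dif, macd_line, histogram) for a list of closing prices.

    dif       = EMA(fast) - EMA(slow)
    macd_line = EMA(signal) of dif (often called the signal line)
    histogram = dif - macd_line
    """
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    dif = [f - s for f, s in zip(fast_ema, slow_ema)]
    macd_line = ema(dif, signal)
    histogram = [d - m for d, m in zip(dif, macd_line)]
    return dif, macd_line, histogram


# Toy series: a steadily rising price and a flat price (not real 0050 data).
rising_dif, rising_macd, rising_hist = macd([float(p) for p in range(1, 61)])
flat_dif, flat_macd, flat_hist = macd([50.0] * 60)
```

On a steadily rising series the fast average stays above the slow one, so DIF is positive; on a flat series all three outputs are zero. A sentiment index could be plotted against these curves to compare lead and lag, as the abstract describes.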
10	運用財經文本情感分析於台灣電子類股價指數趨勢預測之研究 / Research of applying Sentimental Analysis on financial documents to predict Taiwan Electronic Sub-Index trend 劉羿廷 Unknown Date (has links) 電子工業為台灣最具競爭力之產業,使得電子類股在集中市場成交比重高達 69.49%,可見電子類股的波動足以對整個台股市場造成相當大的影響。而許多研究指出,網路上的文本訊息藉由社會網路的催化而快速傳遞,會對群眾情緒造成影響,進而影響股價波動,故對於投資者而言,如果能快速分析大量網路財經文本來推測投資大眾情緒進而預測股價走勢,即可提升獲利。然而,每天有近百篇的財經文本產生,傳統的人工抽樣分析方式效率不彰且過於耗力, 已不足以負荷此巨量資料。過去文本情感分析的研究中已證實監督式學習方法可以透過簡單量化的方式達到良好的分類效果,但監督式學習方法所使用的訓練資料集須有事先定義好的已知類別,故其有無法預期未知類別的限制,造成無法判斷文本中可能存在的未知主題,所以本研究提出一套針對財經文本的混合監督式學習與非監督式學習之情感分析方法,透過非監督式學習將 2014 整年度的電子工業財經文本進行文本主題判別、情緒指數計算與情緒傾向標注。之後配合視覺化工具作趨勢線圖分析,找出具有領先指標特性之主題,接著再用監督式學習將其結合國際指標、總體經濟指標、台股指標、技術指標等,建立分類模型以預測台灣電子類股價指數走勢。在實驗結果中,主題標注方面,本研究發現因文本數量遠大於議題詞數量造成 TFIDF 矩陣過於稀疏,使得 TFIDF-Kmeans 主題模型分類效果不佳;而文本具有多主題之特性造成 NPMI-Concor 分群之議題詞過於複雜不易歸納,然而LDA 主題模型基於所有主題被所有文章共享的特性,使得在字詞分群與主題分類準確度都優於 TFIDF-Kmeans 和 NPMI-Concor 主題模型,分類準確度高達 98%,故後續採用 LDA 主題模型進行主題標注。情緒傾向標注方面,證實本研究擴充後的情感詞集比起 NTUSD 有更好的字詞極性判斷效果,計算出的情緒指數之趨勢線也較投資人常用的 MACD 之趨勢線更符合電子類股價指數之趨勢。此外,亦發現並非所有文本的情緒指數皆具有領先特性,僅企業營運主題與總體經濟主題之文本的情緒指數能提前反應電子類股價指數趨勢,故本研究用此二主題之文本的情緒指數來建立分類模型。接著,本研究透過比較情緒指數結合技術指標之分類模型與單純技術指標分類模型的準確率發現,前者較後者高出 7%的準確率。進一步結合間接情緒指標的分類模型更有高達 71%準確率,故證實了情感分析確實能有效提升電子股價類股指數趨勢預測準確度,以提升投資人之投資報酬率。 / The electronics industry is the most competitive industry in Taiwan, and its large trading volume can strongly influence the whole stock market. Many studies show that text documents on the Internet have a great effect on public emotion, and that public emotion can in turn affect stock prices. For investors, it is important to know how to analyze the underlying emotion in text documents and then use this information to predict the stock trend. However, traditional manual analysis cannot cope with the large volume of financial text documents on the Internet. In past sentiment analysis research, supervised methods have been shown to reach high accuracy, but they are limited in predicting future trends. 
This research proposes a solution that mixes supervised and unsupervised methods to deal with these large volumes of financial text documents. First, we use an unsupervised method to find the topic of each document and then calculate the sentiment index to judge the document's sentiment orientation. After that, we produce trend line charts with visualization tools to find out which topics' sentiment indices are leading indicators. Furthermore, we use a supervised method to integrate the sentiment index with 24 other indirect sentiment indices to build the prediction model. According to the results, the LDA model performs better than the TFIDF-Kmeans and NPMI-Concor models because of the characteristics of the documents. In addition, the sentiment dictionary built in this study judges word polarity more accurately than NTUSD. The trend of the sentiment index is closer to that of the Taiwan Electronic Sub-Index (TE) than the MACD line is. We also find that the sentiment indices produced from documents about enterprise operations and macroeconomics are leading indicators, so we use these to build the prediction model. Moreover, the prediction model that includes the sentiment index performs better than the one that includes only the technical indicators. In summary, the sentiment index can make predictions of the Taiwan Electronic Sub-Index trend more accurate and improve the return on investment. 情感分析 巨量資料 LDA 主題模型 支援向量機 電子類股價指數 Sentimental analysis Big Data LDA SVM Taiwan Electronic Sub-Index Trend
