中文文本探勘工具:主題分析、詞組關聯強度、相關句擷取 / Tools for Chinese Text Mining: Topic Analysis, Association Strengths of Collocations, Extraction of Relevant Statements

林書佑, Lin, Shu Yu Unknown Date (has links)
In an era of large-scale, rapid digitization, many fields rely increasingly on data mining and analysis techniques. In the digital humanities, the topic has drawn growing attention since the "International Conference of Digital Archives and Digital Humanities" began in 2009. The main aim is to combine digitized cultural artifacts with information analysis and visualization, so that interpretation at different levels builds a more complete picture of the artifacts. In this study, we construct a set of tools for analyzing Chinese corpora. The tools compute the association strength of key terms and collocations using latent semantic analysis, pointwise mutual information, Pearson's chi-squared test, typed dependencies distance, word2vec, and Gibbs sampling for latent Dirichlet allocation. They then apply clustering methods to identify possible topics and extract the relevant statements that match the clustering results, to assist humanities scholars in analysis and interpretation. By offering several perspectives on a corpus, the tools are intended to improve the efficiency of scholars who work with corpora. We used the "People's Daily", "New Youth", "United Daily News", and "China Times" as the Chinese corpora for experiments and testing. The results of analyzing "New Youth" with the tools were provided to humanities scholars as reference information and supporting evidence for their interpretation, and a paper was published at the "2015 International Conference of Digital Archives and Digital Humanities". We are currently assessing the effectiveness of the tools on a variety of Chinese corpora, and we plan to release them publicly so that more scholars can use them and save time on corpus analysis.
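
Of the association measures listed in this abstract, pointwise mutual information is the simplest to illustrate. The sketch below is not the thesis's tool: it assumes sentences that are already word-segmented, estimates probabilities as the fraction of sentences containing a word or a word pair, and uses toy data invented for the example. The function name `pmi_scores` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Return PMI for every unordered word pair that co-occurs in a sentence.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), where each probability is
    estimated as the fraction of sentences containing the word(s).
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))          # count each word once per sentence
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    return {
        (x, y): math.log2((c / n) / ((word_count[x] / n) * (word_count[y] / n)))
        for (x, y), c in pair_count.items()
    }


# Toy, pre-segmented sentences (hypothetical data, not from the thesis corpora).
sentences = [
    ["民主", "科學", "青年"],
    ["民主", "科學"],
    ["青年", "文學"],
    ["文學", "革命"],
]
scores = pmi_scores(sentences)
```

In this toy corpus 民主 and 科學 always appear together, so their PMI is positive, while 民主 and 青年 co-occur exactly as often as chance predicts and score zero. A real pipeline would need a word segmenter and some smoothing for rare pairs, neither of which is shown here.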

消費者輿情對跨境網購產品銷售量之影響:以淘寶網為例 / The Effects of Consumer Comments and Sentiments on Product Sales of Cross-border Shopping Websites: The Taobao Case

呂奕勳 Unknown Date (has links)
In recent years traditional online shopping has faced a series of market difficulties, such as price-cutting and competition from low-priced goods, which have slowed sales growth. Cross-border online shopping, by contrast, is booming and has become a new engine of economic activity and international trade. Because the context of cross-border online shopping is far more complex than that of traditional domestic online shopping, a company that wants to develop a new overseas market must first understand local consumers' behaviour and purchase decision process before it can formulate a good business strategy. It must also turn product-oriented services into customer-oriented ones to have a chance of finding a way out of the difficulties facing traditional online shopping. Eliciting and understanding the intrinsic value perceived by consumers is therefore the most important success factor in cross-border online shopping. This study attempts to extend research on traditional domestic online shopping to cross-border online shopping. It uses text mining, semantic sentiment analysis, and the k-means clustering algorithm to identify the common content types of consumers' reviews and the categories of the goods they purchased, and it tries to find the relationship between product sales and both product reviews and other factors on the cross-border shopping platform. Knowing why consumers use cross-border online shopping and what value they obtain from it, researchers can predict and analyse how cross-border online shopping will evolve. The results provide a reference for future academic studies and for the decisions of cross-border shopping platform operators.

運用資料探勘分析社會輿情與廣告影響房地產行情短期波動行為之研究 / A Study of Applying Data Mining to Find the Influence of Public Opinion and Advertisement on the Sales of Real Estate in the Short Run

張修維, Chang, Hsiu Wei Unknown Date (has links)
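In the Internet era, information is easy to receive, so the public is readily exposed to what the media publish, and the opinion words in this material indirectly reflect public sentiment toward specific topics. In real-estate media, when the property market of a particular area has good prospects and becomes a trading hot spot, sentiment-laden housing reports about that area, other news that affects the housing market, and advertisements often influence home-buying decisions. This study takes Taoyuan City and Taichung City, two of Taiwan's more active housing markets over the past five years, as its study areas. It seeks the short-term correlation between news sentiment and advertising on the one hand and housing transaction prices and volumes on the other, as well as the factors that affect these markets. After collecting Actual Price Registration transaction data and advertisements for Taoyuan and Taichung, we use text mining to analyze the relationship between overall housing-market sentiment and the price and volume of real estate in the two cities. We then cluster the news to find feature words, build a separate time series for each sentiment to examine its co-movement with real-estate price and volume, and combine these with advertising volume to find lead relationships between market price and volume and the influencing factors. Using a self-built artificial neural network, we build transaction volume prediction models for Taoyuan and Taichung and for two specific hot areas, Qingpu (青埔) and the 7th Redevelopment Zone (七期), and we sum the weights of the input variables to judge how strongly news sentiment affects transaction price and volume. The study first provides a classification of news sentiment: regional economic, regional social, regional environmental, regional political, tax-system, and election sentiment. Time series analysis then shows that the correlation between the total sentiment series and the transaction volume series exceeds 0.7 in both cities: 0.73 for Taoyuan and 0.81 for Taichung, both strongly positive. Among specific news categories, a comparison of the correlation coefficients of the two cities shows that tax-system sentiment, regional environmental sentiment, and regional social sentiment move most clearly with transaction volume, with coefficients of about 0.5 or higher, so news in these categories reflects in a timely way whether the public is optimistic or pessimistic about real estate in a given area. Leading-indicator analysis also confirms that sentiment and advertising lead transaction volume: in both Taoyuan and Taichung, sentiment leads transaction volume by one month. For the Qingpu Special District and the 7th Redevelopment Zone, a peak in advertising appears one to two months before each peak in transaction volume. The neural network models predict the direction of change well. The model trained on Taoyuan data and tested on Taichung data predicted 17 of 19 rises and falls, and the model trained on 70% of the mixed Taoyuan and Taichung data and tested on the remaining 30% predicted 10 of 14. Summing the input weights to measure each input variable's influence indicates that total sentiment, tax-system sentiment, and regional environmental sentiment are most strongly related to, and have the greatest influence on, transaction volume in the two markets. Finally, we exploit the finding that advertising peaks lead transaction peaks by one to two months. A model trained on a mix of Qingpu Special District data from October 2012 to February 2016 and 7th Redevelopment Zone data from October 2012 to December 2013, and tested on 7th Redevelopment Zone data from January 2014 to February 2016, correctly predicted three transaction peaks within the two-year test period. This shows that the model can estimate when transaction volume will peak by predicting the next period's advertising volume as a mediating variable. The model could be applied to assess how related policies affect market transaction volume once they are introduced, and it produces predictions quickly. For a specific market, predicting advertising and using its lead over transaction volume indicates when the next transaction peak is likely. Combined with an understanding of market sentiment, this can help real estate agents and developers time their advertisements more precisely and obtain the greatest benefit from them.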
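
The lead relationship described in this abstract (sentiment leading transaction volume by one month) can be checked with a lagged Pearson correlation. The following is a minimal sketch under stated assumptions, not the thesis's procedure: the monthly series are invented, the function names are hypothetical, and the lag with the highest correlation is simply taken as the lead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag], for lag >= 0 (in months)."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def best_lead(leader, follower, max_lag):
    """Return the lag (0..max_lag) at which the correlation is highest."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(leader, follower, k))


# Hypothetical monthly series: volume is built to follow sentiment by one month.
sentiment = [3, 5, 9, 4, 2, 6, 8, 3, 1, 5, 7, 4]
volume = [10] + [2 * s + 1 for s in sentiment[:-1]]

lead_months = best_lead(sentiment, volume, max_lag=3)
```

Because `volume` is constructed as a shifted linear function of `sentiment`, the search recovers a one-month lead. With real data the correlations would be noisier, and a significance test would be needed before reading a lead into them.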

以文件分類技術預測股價趨勢 / Predicting Trends of Stock Prices with Text Classification Techniques

陳俊達, Chen, Jiun-da Unknown Date (has links)
Changes in stock prices are the outcome of the decisions of many different investors in the securities market. The factors that move prices are numerous and complex, and news is one of them: news events are the main medium through which investors learn about a listed company's operations, and one of the main factors that lead investors to set or change their investment strategies. A stock's closing price level also provides hints about aggregate demand and supply in the market. A close above the previous close indicates that aggregate demand exceeded aggregate supply on that trading day; otherwise, aggregate demand was weaker than aggregate supply. Predicting the closing price level of an individual stock correctly in advance would therefore be profitable. For example, when a stock's current price is below its previous closing price, a correct prediction of the closing level tells the investor whether to buy or sell. In this thesis we propose a framework that uses news documents as the basis for predicting price movements, applying text mining and classification techniques to build a system that predicts the same-day trend of an individual stock's closing price in the Taiwan stock market. We propose and evaluate three classification models: a naïve Bayes model, a k-nearest neighbors model, and a hybrid model. Three experiments test the system's predictive performance by comparing the classifiers, the depth of the news sample data, and the breadth of the news sample data. Experimental results show that the proposed models mitigate a problem seen in related work, where overall accuracy is high but prediction performance differs greatly across categories. For the "UP" and "DOWN" categories, which determine whether investors profit, all three models achieve good average performance and perform better than the NewsCATS system, so they can serve as a useful reference for investment decisions.


連子杰, Lien,Tzu-Chieh Unknown Date (has links)
The data that investors analyze when making investment decisions fall into two main types: financial and non-financial. Because the formats of traditional financial data are inconsistent, investors may have to spend additional money and resources processing them, or even waste effort re-entering data. Meanwhile, non-financial information is increasingly important in investment decisions, but the sheer volume of disclosure makes it harder to read and search, and can even reduce readability. To address these two obstacles, this study applies text mining to the non-financial information related to business strategy in the annual reports of public companies, helping readers analyze and organize this semi-structured or even unstructured text efficiently. We use XBRL (eXtensible Business Reporting Language) as the source of financial data because it is independent of software platforms, can be freely downloaded and circulated on the Internet, and is interoperable and reusable. We also develop a new analytic method that connects non-financial and financial data through a linking mechanism. Finally, we use two system methodologies, the ROMC systems analysis approach and prototyping, to design and build a business strategy analysis decision support system. We hope the system helps investors understand and verify a target company's industry development and competitive strategy in a short time, and so improves the quality of their decisions.

透過專利、學術論文分析技術發展趨勢-以蝕刻技術為例 / Technology Trends Analysis via Patent and Scientific Publication - A Case Study of Etching

徐竣祈 Unknown Date (has links)
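Competition is ubiquitous in modern society. Nations and enterprises can know themselves and their rivals, and make correct decisions, only through industry competition analysis, enterprise competition analysis, market analysis, and technological forecasting. In technology industries, an enterprise that cannot keep track of technology trends, and so fails to invest early in R&D or adjust its business strategy, will soon find the market occupied by competitors. Fortunately, no technological invention jumps directly from the inventor's head to widespread application. It always passes through several successive stages, each of which makes its "practicality" and "usefulness" more mature. An enterprise that grasps the course of technological development and invests in R&D early can therefore maintain its competitive advantage. Patent information can be used to evaluate and forecast technology development, plan R&D or technology development projects, avoid wasting R&D resources by inadvertently infringing patent rights, and track enterprise activity and market demand. Many enterprises and government agencies have recognized the importance of patent analysis and invested considerable manpower and resources in it. However, there is a lag of at least 18 months between a patent's filing date and its publication date, so an enterprise that relies too heavily on patent analysis is likely to commit its subsequent R&D resources to a fiercely competitive technology "red ocean". To follow forward-looking technology closely, trend analysis of basic research is at least as important as patent trend analysis. As for methods, most existing bibliographic records concern science and technology, so bibliometrics has naturally become the mainstream tool for studying the overall development of science and technology. Beyond traditional quantitative analysis, using automated methods to mine implicit and useful knowledge from large volumes of documents has recently become a popular research topic, and mining techniques such as association analysis, clustering, and prediction are becoming indispensable tools for technological forecasting. Many past studies used bibliometrics to analyze scholarly papers or patent information, and in recent years studies using text mining for the same purpose have appeared, but the results of such analyses are fragmentary and incomplete. This study proposes an integrated approach that combines bibliometrics and text mining to analyze both scholarly papers (Science Citation Index Expanded) and patent information (Derwent World Patents Index), and compares the two to understand technology trends. Through a case analysis, it also explores the relationships within the proposed methodology itself. For the case, nanotechnology is currently the most popular direction of development in the technology industry, and its most representative industries are the semiconductor industry and the micro-electro-mechanical systems (MEMS) industry. The quality of etching process and equipment technology directly affects wafer yield and is one of the important technologies for the future development of nanotechnology, so the development trend of etching technology merits in-depth study. The conclusions of this study are as follows: 1. Technology trend analysis should proceed from the distant view (bibliometric analysis) to the close view (text mining), and should give equal weight to theory (paper data) and application (patent data), so that the results complement one another. 2. Scientific development and market demand are leading indicators of the patent technology life cycle. 3. Scientific development increases the commercial application of technology, while market demand strengthens the effect of innovation diffusion. 4. Basic research on etching technology is currently in the maturity stage, while technology development is in the growth stage. Given the demand for lighter, thinner, smaller, and higher-performance electronic products, etching technology is expected to continue to be applied commercially. 5. The leading position of US firms in etching technology has gradually been taken over by Asian enterprises, with South Korean semiconductor manufacturers the most active recently. Taiwan's TSMC and UMC have accumulated a solid technological foundation, but Taiwan still needs to strengthen basic research and industry-academia collaboration.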

應用kNN文字探勘技術於分析新聞評論 影響股價漲跌趨勢之研究 / The Study of Analyzing Comments of News for Influence of Stock Price Trends Prediction by Using Knn Text Mining

詹智勝, Chan, Chih Sheng Unknown Date (has links)
With the rapid development of the Internet, many users have shifted from traditional media to the Internet as their channel for knowledge and news. The messages users leave when they interact online, that is, Internet word-of-mouth, are also attracting growing attention. Meanwhile, people on fixed salaries cannot afford high housing prices and a high cost of living, so increasing personal wealth through investment has become very common, and stock market investment is the route the public values most. Internet news is immediate, and the comments that readers leave after reading and digesting it should contain more information than the news itself. Investors can use them to find the large amount of market news and information they imply. To help users uncover the meaning behind this large volume of data and to provide investment forecasts, this study collects a total of 1,068 Internet news articles and reader comments and divides them into training data and test data. After pre-processing with text mining and related techniques, we compute the similarity between training documents with kNN clustering, group the large amount of unknown data by similarity, use historical stock price information to analyze and interpret the features of the resulting clusters, and build a prediction model. Finally, the clustering results of the model are evaluated on the test data to predict stock price trends. The evaluation on the test data found that with k = 15 and a similarity threshold of 0.05, the F-measure reaches up to 56% for the "rise" cluster. The k value and the similarity threshold can be adjusted to obtain the most favorable model results.
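
The two parameters reported in this abstract, k and a similarity threshold, can be illustrated with a small nearest-neighbour vote over bag-of-words vectors. This is a hedged sketch, not the thesis's implementation: it assumes cosine similarity on raw term counts, similarity-weighted voting, and English toy tokens standing in for segmented news text. The names `cosine` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, tokens, k=3, threshold=0.05):
    """Predict a label from the k most similar training documents.

    train is a list of (token_list, label). Documents whose similarity to the
    query falls below `threshold` are ignored; None means no neighbour passed.
    """
    query = Counter(tokens)
    scored = [(cosine(query, Counter(doc)), label) for doc, label in train]
    neighbours = sorted((s for s in scored if s[0] >= threshold),
                        key=lambda s: s[0], reverse=True)[:k]
    if not neighbours:
        return None
    votes = Counter()
    for sim, label in neighbours:
        votes[label] += sim              # similarity-weighted vote (an assumption)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical tokens and labels).
train = [
    (["revenue", "growth", "record"], "up"),
    (["profit", "growth", "order"], "up"),
    (["loss", "lawsuit", "recall"], "down"),
    (["loss", "layoff", "decline"], "down"),
]

rise = knn_predict(train, ["record", "profit", "growth"])
fall = knn_predict(train, ["lawsuit", "loss"])
unknown = knn_predict(train, ["weather"])
```

Raising the threshold discards weakly related neighbours at the cost of leaving more documents unlabelled, which is one reason the abstract treats k and the threshold as values to tune.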

Textual data mining applications for industrial knowledge management solutions

Ur-Rahman, Nadeem January 2010 (has links)
In recent years knowledge has become an important resource for enhancing business, and many activities are required to manage knowledge resources well and help companies remain competitive in industrial environments. The data available in most industrial settings are complex, and multiple data formats may be generated to track the progress of projects, whether they concern developing new products or providing better services to customers. Knowledge discovery from different databases requires considerable effort, and data mining techniques serve this purpose by handling structured data formats. If, however, the data are semi-structured or unstructured, the combined efforts of data and text mining technologies may be needed to bring fruitful results. This thesis focuses on the discovery of knowledge from semi-structured or unstructured data formats through textual data mining techniques that automate the classification of textual information into two categories or classes, which can then be used to help manage the knowledge available in multiple data formats. Applications of different data mining techniques to discover valuable information and knowledge in the manufacturing and construction industries are explored as part of a literature review. The application of text mining techniques to semi-structured or unstructured data is discussed in detail. A novel integration of different data and text mining tools is proposed in the form of a framework in which knowledge discovery and its refinement are performed through clustering and Apriori association rule mining algorithms. Finally, the hypothesis of acquiring better classification accuracies is detailed by applying the methodology to case study data available in the form of Post Project Review (PPR) reports. The process of discovering useful knowledge, interpreting it, and utilising it has been automated to classify the textual data into two classes.

Managing Information System Integration Technologies--A Study of Text Mined Industry White Papers

Ravindran, Balaji 16 May 2003 (has links)
Industry white papers are increasingly used to explain the philosophy and operation of a product in a marketplace or technology context. Senior managers use these explanations for strategic planning in an organization. This research explores the effectiveness of white papers and the strategies by which managers can learn about technologies from them. The research was conducted by collecting industry white papers in the area of Information System Integration and gleaning relevant information with the text-mining tool Vantage Point. The text-mined information is analyzed to provide solutions for practical problems in the systems integration market. The indirect findings of the research are New System Integration Business Models, Methods for Calculating ROI of System Integration Project, and Managing Implementation Failures.

O algoritmo de aprendizado semi-supervisionado co-training e sua aplicação na rotulação de documentos / The semi-supervised learning algorithm co-training applied to label text documents

Matsubara, Edson Takashi 26 May 2004 (has links)
In Machine Learning, the supervised approach usually requires a large number of labeled training examples to induce accurate classifiers. However, labeling is often performed manually, which makes the process costly and time-consuming. By contrast, unlabeled examples are often inexpensive and easier to obtain than labeled ones. This is especially true for text classification tasks involving on-line data sources such as web pages, email, and scientific papers. Text classification is of great practical importance given the massive volume of text available on-line. Semi-supervised learning, a relatively new area of Machine Learning, blends supervised and unsupervised learning and has the potential to reduce the need for expensive labeled data when only a small set of labeled examples is available. This work describes the semi-supervised learning algorithm co-training, which requires the description of each example to be partitioned into two distinct views. The two views required by co-training can be easily obtained from textual documents through pre-processing. In this work, several extensions of the co-training algorithm were implemented. We also implemented a computational environment for text pre-processing, called PreTexT, in order to apply co-training to text classification problems. Experimental results using co-training on three data sets are described: two are related to text classification and the other to web-page classification. The results, which range from excellent to poor, show that co-training, like other semi-supervised learning algorithms, is affected by modelling assumptions in a rather complicated way.
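
The co-training loop described in this abstract can be sketched in a few dozen lines. The code below is an illustration under stated assumptions, not the thesis's implementation or its PreTexT environment: the two views are taken to be title tokens and body tokens, the base learner is a small multinomial naive Bayes written for the example, and each round every view's classifier moves its single most confident unlabeled example into the labeled set. All names and data are hypothetical.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, tokens):
        total = sum(self.prior.values())
        logp = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            logp[c] = math.log(self.prior[c] / total) + sum(
                math.log((self.counts[c][t] + 1) / denom)
                for t in tokens if t in self.vocab)   # unseen tokens are skipped
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {c: math.exp(v - m) / z for c, v in logp.items()}


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    return [NaiveBayes().fit([x[v] for x, _ in labeled], [y for _, y in labeled])
            for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for v, model in enumerate(fit_views(labeled)):
            scored = []
            for x in pool:
                proba = model.predict_proba(x[v])
                label = max(proba, key=proba.get)
                scored.append((proba[label], label, x))
            scored.sort(key=lambda s: s[0], reverse=True)
            # The most confident predictions of this view become new labels.
            for _, label, x in scored[:per_round]:
                labeled.append((x, label))
                pool.remove(x)
    return fit_views(labeled), labeled


# Toy data: view 0 = title tokens, view 1 = body tokens (hypothetical).
labeled = [
    ((["neural", "network"], ["training", "gradient", "layer"]), "ml"),
    ((["football", "league"], ["goal", "match", "team"]), "sport"),
]
unlabeled = [
    (["deep", "network"], ["gradient", "descent", "layer"]),
    (["league", "final"], ["team", "goal", "penalty"]),
    (["deep", "learning"], ["descent", "optimizer"]),
    (["cup", "final"], ["penalty", "shootout"]),
]
models, final_labeled = co_train(labeled, unlabeled)
```

In this toy run the first round labels the two examples that share words with the seed documents, and those new labels are what allow the second round to classify the remaining two. The same mechanism can also propagate early mistakes, which is consistent with the abstract's remark that results range from excellent to poor.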

