Global ETD Search

1	應用資訊擷取技術於企業評價財務項資料之取得 / An Application of Information Extraction in Collecting Financial Data for Business Valuation 賴哲霆, Lai, Jhe-Ting Unknown Date
The rapid growth of electronic resources on the Internet in recent years made search engines a convenient and efficient way to retrieve documents. As the volume of online resources and the number of users keep growing, however, keyword-based retrieval no longer satisfies users' more specific needs. Information extraction (IE) addresses this by extracting specific important facts from retrieved documents or by deriving specific relations among pieces of information. IE not only filters out unnecessary content but also produces the specific messages and summaries that users are interested in. Business valuation is a process of collecting, analyzing, and applying financial and non-financial information to appraise the value of a business; its results serve as a basis for business decisions and for pricing intangible assets in transactions. For companies in Taiwan, the financial statements, notes to financial statements, and financial news contain the information needed for business valuation, and they are published as web pages (HTML) and PDF files. This study therefore builds an information extraction system for Chinese financial data items, grounded in business valuation concepts, that takes these three kinds of documents as its data sources. From these heterogeneous sources the system extracts the correct financial data items together with the business valuation models they correspond to, so that valuation data can be collected automatically. Users can obtain relevant, valid valuation information and learn the valuation models in a short time, which improves the accuracy and efficiency of information processing.
Keywords: Information Extraction; Business Valuation; Financial Data
2	以型態辨識為主的中文資訊擷取技術研究 翁嘉緯, Chia-Wei Weng Unknown Date
With the rapid growth of the Internet, information extraction has become an important technology. Its goal is to turn unstructured text into structured information about a specific topic, which involves analyzing document content and then filtering and extracting the relevant text along with its meaning. Most information extraction systems to date focus on English documents, and research on Chinese information extraction is only now gathering pace. Since at least one fifth of the world's population speaks Chinese, this research is increasingly important. Chinese differs from English in many respects. In English, words are separated by spaces, so a computer can easily isolate each word in an input string. In Chinese there are no explicit boundaries between words. The usual solution is to match the characters of the input string against the entries of a dictionary to find word boundaries, but characters combine into words so flexibly and ambiguously that segmentation errors remain likely. This thesis therefore proposes an approach to Chinese information extraction that performs neither word segmentation nor part-of-speech tagging and instead combines pattern matching with finite state automata. Evaluated on government personnel appointment and dismissal directives published in the official gazette, the approach achieved 98% precision and 97% recall. It was also applied to other data domains, showing initial progress toward a cross-domain Chinese information extraction system and supporting the feasibility of the method.
Keywords: Information Extraction; Pattern based; Finite State Automata
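The segmentation-free idea in entry 2 can be illustrated with a small sketch. It is not the thesis's pattern set: the sentence template, the field names `name` and `post`, and the sample sentence are all assumptions made for illustration. A regular expression is used here as a stand-in for the pattern-plus-automaton approach, since it scans the raw characters directly with no word segmentation or part-of-speech tagging.

```python
import re

# Hypothetical template for an appointment notice: 任命 <name> 為 <post>.
# The character class covers common CJK ideographs; names are assumed
# to be 2 to 4 characters long (an illustrative assumption).
APPOINTMENT = re.compile(
    r"任命(?P<name>[\u4e00-\u9fff]{2,4}?)為(?P<post>[\u4e00-\u9fff]+)"
)

def extract_appointments(text):
    """Return one {'name', 'post'} record per template match in the raw text."""
    return [m.groupdict() for m in APPOINTMENT.finditer(text)]

records = extract_appointments("任命王小明為教育部次長。")
print(records)
```

The post field stops at the first non-ideograph character (here the full stop), so a real pattern set would likely need explicit delimiters for notices that run together.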
3	中文資訊擷取結果之錯誤偵測 / Error Detection on Chinese Information Extraction Results 鄭雍瑋, Cheng, Yung-Wei Unknown Date
Given a target subject and a text collection, information extraction (IE) techniques can populate a database in which each record is an instance of the subject documented in the collection. Even state-of-the-art IE techniques, however, produce results that contain errors, and detecting and correcting those errors by hand is labor-intensive and time-consuming. This validation cost remains a major obstacle to deploying practical IE applications that require high validity. This thesis proposes two error detection methods, one based on string graph structure and one based on string features. The first uses a graph structure to compare the characters in each record and the relations between characters, computes a matching score for each record with a formula, and uses the score to judge whether the record is erroneous. The second describes the surface characteristics of each string with feature values and then applies the C4.5 decision tree and SVM machine learning classifiers to label each record as valid or invalid. The two methods differ in that the first works directly on the raw strings and draws on the graph-theoretic notions of node position and neighboring nodes, whereas the second first transforms the raw strings into numeric features and then classifies them. The experiments use a database of IE results from the personnel appointment and dismissal notices of the Presidential Office Gazette (總統府人事任免公報) as test data. The results show that the proposed methods effectively detect invalid values, which reduces validation cost and helps ensure high-quality IE output. The work also provides analytical and empirical evidence that the usability of IE results can be improved in this way, supporting wider practical use of IE techniques.
Keywords: Error Detection; Information Extraction; Textual Data Profiling
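The feature-based method in entry 3 turns each extracted string into numbers before classification. The abstract does not list the actual features, so the four below (length, ideograph count, digit count, punctuation count) are assumptions chosen only to show the shape of such a transformation; a classifier such as C4.5 or an SVM would then be trained on vectors like these.

```python
import unicodedata

def string_features(value):
    """Map a raw extracted string to a small numeric feature vector.

    The feature set is illustrative, not the one used in the thesis.
    """
    cjk = sum(1 for ch in value if "\u4e00" <= ch <= "\u9fff")
    digits = sum(1 for ch in value if ch.isdigit())
    # Unicode general categories starting with "P" are punctuation.
    punct = sum(1 for ch in value if unicodedata.category(ch).startswith("P"))
    return {"length": len(value), "cjk": cjk, "digits": digits, "punct": punct}

print(string_features("王小明"))
print(string_features("王小明，12"))
```

A name field that unexpectedly contains digits or punctuation, as in the second call, yields a vector that differs visibly from a clean one, which is the kind of signal a classifier could learn from.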
4	轉換年報資料以擷取企業評價模型之非財務性資料項 / A Transformation Approach to Extract Annual Report for Non-Financial Category in Business Valuation 吳思宏, Wu, Szu-Hung Unknown Date
Following the recent wave of mergers and acquisitions, more and more people ask how much a business is worth and whether it still has good prospects. These questions concern not only investors but also accountants and business valuators. In the knowledge-based economy, businesses no longer create value mainly from fixed assets such as land, plants, and equipment but increasingly from intangible assets such as services, brands, and patents, which makes it harder to estimate what a business is really worth. All of this underlines the importance of business valuation. Before a valuation can be carried out, the data items required by the valuation model must be collected, and their quality determines the quality of the final result. Valuation data items are either financial or non-financial. Financial data items are clearly defined, so they are easier to collect than non-financial ones. Existing collection methods are inadequate for non-financial data items, and most valuators still process the data manually, which consumes considerable time and money and, through the risk of typing errors, lowers the correctness of the data. This thesis therefore proposes an approach that automatically extracts the non-financial business valuation data items from annual reports, with the aim of simplifying data collection and improving data correctness.
Keywords: Business Valuation; Data Extraction; Portable Document Format (PDF); Information Retrieval; Word Segmentation
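5	科技政策網站內容分析之研究 賴昌彥, Lai, Chang-Yen Unknown Date
With the rapid growth of World Wide Web (WWW) applications, the Internet is full of information resources of every kind, and managing and retrieving them effectively has become one of the key issues in information management. Search engines are the most common tool for finding information: they match the query string against an index table, find the relevant web pages, and return the results. Because web pages carry too little descriptive information, however, search engines return many irrelevant results and waste users' time. To address this problem from the perspective of information search, this study proposes an architecture that uses text mining to analyze the actual content of web pages, converts that content into dimensional information, and stores it in a multi-dimensional database, as a reference architecture for improving current information retrieval. From the perspective of information description, the study proposes describing web page metadata with RDF (Resource Description Framework). Describing web resources in this common data format provides a standard for using and expressing information across domains and eases communication between web applications. The aim is to remedy the shortcomings of current Internet resource description and substantially improve search quality.
Keywords: Information Retrieval; Information Extraction; Metadata; Text Mining; Resource Description Framework; Multi-Dimensional Database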
6	探索美國財務報表的主觀性詞彙與盈餘的關聯性:意見分析之應用 / Exploring the relationships between annual earnings and subjective expressions in US financial statements: opinion analysis applications 陳建良, Chen, Chien Liang Unknown Date
Subjective assertions in financial statements influence the judgments of market participants when they assess the value and profitability of the reporting corporations. Management therefore has a strong incentive to choose its wording carefully, concealing the negative and accentuating the positive. Because mining useful information by hand from the very large volume of text in financial statements is rarely feasible, this study uses an artificial intelligence based strategy to investigate the link between financial status, measured by annual earnings, and subjective multi-word expressions (MWEs) in US financial statements. Multi-word models tend to capture the semantic context of a sentence better than traditional unigram models, so we applied conditional random field (CRF) models to identify opinion patterns in the form of MWEs, and our approach outperformed previous work that employed unigram models. The empirical results also provide evidence for the common belief that the written disclosures in financial statements are often inconsistent with the reality indicated by the financial figures. In particular, when earnings change for the worse, management tends to play down the current shortfall with mild and ambiguous statements while firmly promising a bright future.
Keywords: opinion mining; natural language processing; sentiment analysis; financial text mining; information extraction
7	利用馬可夫邏輯網路模型與自動化生成的模板加強生醫文獻之語意角色標註 / Biomedical semantic role labeling with a Markov Logic network and automatically generated patterns 賴柏廷 Unknown Date
Background: Biomedical semantic role labeling (SRL) is a natural language processing technique that expresses sentences describing biological processes as predicate-argument structures (PASs). SRL often suffers from the imbalance problem among arguments, and learning the dependencies between arguments consumes considerable time and memory. Method: We constructed a Markov Logic Network (MLN) based SRL system that uses automatically generated SRL patterns to recognize constituents and candidate semantic roles simultaneously. Results and conclusions: We evaluated our method on the BioProp corpus. The experimental results show that it outperforms the state-of-the-art system. In addition, applying SRL patterns greatly reduces time and memory costs, and the automatically generated patterns help in developing further patterns. We believe the method can be improved further by adding new SRL patterns, such as patterns written manually by biologists, and that it offers a possible solution for processing large SRL corpora.
Keywords: Semantic Role Labeling; Natural Language Processing; Markov Logic Network; Machine Learning; Information Extraction
8	中文流行音樂詞曲情意關聯分析 / Conception association analysis between lyrics and music of Chinese popular music 林志傑, Lin, Chih Chieh Unknown Date
This thesis studies the association between the conception (情意, the affective meaning) of lyrics and that of music in Chinese popular music, and designs a lyrics recommendation system that, given a piece of music as the query, recommends lyrics matching its conception. In a broad sense, popular music consists of songs distributed through mass media for a general audience. Because of this mass appeal, the themes of pop lyrics are closely tied to everyday life and express the conception of the song clearly, the resonance a song evokes determines whether it has commercial value, and people often use pop songs to sing their own stories and feelings. The thesis therefore proposes recommending conception-matching lyrics for existing songs automatically. Pairing an old song with new lyrics gives it a different story and new life, a form of digital value-adding in which one melody carries many lyrics. The literature and the accounts of professional songwriters indicate that lyrics and music are matched in several ways, of which the match in conception is the most important, so the conception association between lyrics and music is the core of this research. Analyzing a large number of commercially released pop songs reveals clues about how lyrics and music are matched. For lyrics, we use the LSA (Latent Semantic Analysis) algorithm to extract conception features and compare the result with metaphor blending theory in linguistics. For music, we extract features that convey conception, including pitch, tonality, tempo, rhythm, chords, and timbre. We then use the CFA (Cross-Modal Factor Analysis) algorithm to learn an association model between the lyrics and music conception features, and build the recommendation system on that model. In the experiments, the association model (CFA Feature Model) reaches an average accuracy of 60.1% in the top 5 recommended lyrics and 41.4% in the top 50, compared with 45.1% and 28.6% for the control approach that considers only music features (Audio Feature Model). In a case study, we submitted several songs composed by students and found that the conception of the recommended lyrics was very similar to that of the students' original lyrics. These results indicate that the proposed model does recommend lyrics that suit the conception of the query music and can offer songwriters inspiration.
Keywords: Lyrics and music conception association analysis; Music conception analysis; Lyrics conception analysis; Cross modal association mining; Recommendation lyrics by song; Music Information Retrieval
9	文字背後的意含-資訊的量化測量公司基本面與股價（以中鋼為例） / Behind the words - quantifying information to measure firms' fundamentals and stock return (taking the China steel corporation as example) 傅奇珅, Fu, Chi Shen Unknown Date
This research collects the news stories about China Steel Corporation published in the Economic Daily News, United Daily News, and United Evening News. The articles are segmented with the Chinese Word Segmentation System of Academia Sinica and analyzed with a methodology that follows and extends Tetlock, Saar-Tsechansky, and Macskassy (2008). I examine whether a simple quantitative measure of language can be used to explain and predict an individual firm's accounting sales and stock returns. The two main findings are: 1. the fraction of positive words (commendatory terms) in firm-specific news stories forecasts high firm sales; 2. the firm's stock price briefly overreacts to the information embedded in negative words (derogatory terms) but efficiently incorporates the information embedded in positive words (commendatory terms). Taken together, these findings suggest that the linguistic content of news media captures otherwise hard-to-quantify aspects of a firm's fundamentals, which investors quickly incorporate into stock prices.
Keywords: Content Analysis; Textual Information; Informative Content; Text Mining; Information Effect; Critical Information Extraction; Commendatory Term; Derogatory Term; Positive words; Negative words; Fundamental Analysis; Stock Return Analysis
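The "simple quantitative measure of language" in entry 9 is a fraction of tone words in each article. A minimal sketch of that computation follows. It assumes the article has already been segmented into tokens, and the tiny word lists and sample tokens are hypothetical placeholders, not the lexicon used in the thesis.

```python
def tone_fractions(tokens, positive, negative):
    """Return (positive fraction, negative fraction) for one segmented article."""
    if not tokens:
        return 0.0, 0.0
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos / len(tokens), neg / len(tokens)

# Hypothetical segmented headline and word lists, for illustration only.
tokens = ["中鋼", "營收", "成長", "獲利", "衰退"]
positive_words = {"成長"}
negative_words = {"衰退"}
print(tone_fractions(tokens, positive_words, negative_words))
```

Fractions like these, computed per article or per period, would then serve as explanatory variables in regressions on sales or returns.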

1	應用資訊擷取技術於企業評價財務項資料之取得 / An Application of Information Extraction in Collecting Financial Data for Business Valuation 賴哲霆, Lai,Jhe-Ting Unknown Date (has links) 由於近幾年來網際網路電子資源的數量大量成長下，搜尋引擎技術的誕生為使用者帶來檢索資料文件上極高的便利與效率。但網路資源和使用者大量成長下，現有的關鍵字檢索技術已無法滿足使用者需求。然而「資訊擷取」就是將從檢索文件中擷取重要特定訊息或產生資訊間特定關係的一種技術。其不僅從文件中能過濾不必要的資訊，而且產生有興趣或特定的重要訊息和摘要。企業評價即為一套收集、分析與應用財務或非財務資訊來評價企業的價值，其評估的結果可做為企業決策和無形資產買賣訂價之依據。目前在國內企業的財務報表、財務附註和財經新聞內容皆有與企業評價所需重要訊息和資料，並以網頁和PDF格式呈現。因此，本研究將對國內企業財務報表、財務附註和財經新聞為資料來源，以企業評價概念基礎下建立中文財務項資料的資訊擷取系統。從這些不同的異質資料來源中，擷取正確的財務項資料與其所對應之企業評價模型，以達成自動擷取企業評價資料。使用者能在最短的時間內取得相關有效評價資訊和學習評價模型，使資訊處理品質能夠提昇正確性和效率性。 / Due to an increase in the wealth of electronic resources on the Internet in the past several years, the birth of the search engine has brought the utmost convenience and efficiency for users. However, searching for data by keyword retrieval techniques in information retrieval is not contented with some users’ specific demands due to a large number of network resources and users on the Internet. Information extraction (IE) is an improvement method which extracts the important specific event or produces specific relations among information from documents. IE can not only filter unnecessary information in any documents but also produce specific important messages and summaries that users are interested in. Business valuation is collecting, analyzing, and applying to financial or non- financial integral information to appraise the business value. The evaluated results are used in the commerce pricing for the business decision and intangible assets. There are specific information and events about business valuation stored in the Chinese financial statements, notes to financial statements, and financial news of Taiwan’s companies at present and data is presented by the HTML and PDF files. 
Hence, we developed an information extraction system of Chinese financial data for business valuation from the domestic business financial statements, notes to financial statements, and financial news as our data sources. We extracted the correct financial data and their corresponding business valuation model to achieve an automatic extraction in the financial data from these different heterogeneous data sources. Users can collect the relevant valid valuation information and learn valuation models concepts within a very short time to improve accuracy and efficiency in information processing quality. 資訊擷取企業評價財務項資料 Information Extraction Business Valuation Financial Data
2	以型態辨識為主的中文資訊擷取技術研究翁嘉緯, Chia-Wei Weng Unknown Date (has links) 隨著網際網路的蓬勃發展，資訊擷取(Information Extraction)已經成為一個非常重要的技術。資訊擷取的目標為從非結構化的文字資料中，為特定的主題整理出相關之結構化資訊，其所牽涉的問題，包括分析文件的內容，篩選、擷取出相關的文字及其對應的意義。到目前為止，大部份的資訊擷取系統都著重在英文文件上，對於中文文件資訊擷取技術的研究才正在如火如荼的展開，加上全世界至少超過1/5的人說中文，積極投入中文資訊擷取的研究就顯得非常重要。中文的描述方式與英文有著很大的不同。在英文，詞跟詞之間有著明顯的『空白』，電腦可以很輕易的區隔輸入字串中每個詞。但是在中文，詞跟詞之間並沒有明顯的界限，一般的處理情形為利用詞典，將一個輸入字串中的文字，比對詞典內的詞來當做斷詞的依據，不過由於字組成詞的變化程度相當大，斷詞錯誤的情形仍很可能出現。因此，在本篇研究論文我們提出不做斷詞、不做詞性分析，而利用『型態辨識』的方法搭配『有限狀態自動機』的運作方式，來處理中文資訊擷取的問題。在實驗方面，我們以『總政府人事任免公報』當作測試資料，其精確度高達98%，而回收率也達到了97%。此外，我們也應用到其他不同的資料領域，對於建立跨領域之中文資訊擷取系統有了初步的研究進展，充分印證了本資訊擷取方法處理中文資訊擷取問題的可行性。 / With the explosion of World Wide Web, information extraction has become a major technical area. The goal of information extraction is to transform non-structured text into structured data of specific topic. It involves analyzing, filtering and extracting relevant parts of text and the corresponding meaning. Most information extraction research mainly focuses on English text. On the other hand, research on Chinese information extraction has not received as much attention. Considering the fact that one-fifth population in the world are Chinese-speaking people, Chinese information extraction technology will become increasingly important. Chinese language is different with English in many aspects. In English, words are separated with space such that computers can easily distinguish each word in the input string. In Chinese, there are no spaces between characters to segment them into meaningful words. A general solution is to match characters of the input string to the words in the dictionary to find proper word boundary. Yet, much flexibility and ambiguity exist in the combination of characters into words. Many errors may occur in word segmentation. . In this thesis, we propose an approach to Chinese information extraction based on pattern matching and finite state automata, without relying on word segmentation and part-of-speech tagging. 
The approach was evaluated with “government personnel directives in official gazettes” as test data, and it achieved performance measure of 98% precision and 97% recall. Moreover, the approach was extended to other data domains. The results have showed initial progress on the research of multiple- domain Chinese information extraction system. 資訊擷取型態辨識有限狀態自動機 Information Extraction Pattern based Finite State Automata
3	中文資訊擷取結果之錯誤偵測 / Error Detection on Chinese Information Extraction Results 鄭雍瑋, Cheng, Yung-Wei Unknown Date (has links) 資訊擷取是從自然語言文本中辨識出特定的主題或事件的描述，進而萃取出相關主題或事件元素中的對應資訊，再將其擷取之結果彙整至資料庫中，便能將自然語言文件轉換成結構化的核心資訊。然而資訊擷取技術的結果會有錯誤情況發生，若單只依靠人工檢查及更正錯誤的方式進行，將會是耗費大量人力及時間的工作。在本研究論文中，我們提出字串圖形結構與字串特徵值兩種錯誤資料偵測方法。前者是透過圖形結構比對各資料內字元及字元間關聯，接著由公式計算出每筆資料的比對分數，藉由分數高低可判斷是否為錯誤資料；後者則是利用字串特徵值，來描述字串外表特徵，再透過SVM和C4.5機器學習分類方法歸納出決策樹，進而分類正確與錯誤二元資料。而此兩種偵測方法的差異在於前者隱含了圖學理論之節點位置與鄰點概念，直接比對原始字串內容；後者則是將原始字串轉換成特徵數值，進行分類等動作。在實驗方面，我們以「總統府人事任免公報」之資訊擷取成果資料庫作為測試資料。實驗結果顯示，本研究所提出的錯誤偵測方法可以有效偵測出不合格的值組，不但能節省驗證資料所花費的成本，甚至可確保高資料品質的資訊擷取成果產出，促使資訊擷取技術更廣泛的實際應用。 / Given a targeted subject and a text collection, information extraction techniques provide the capability to populate a database in which each record entry is a subject instance documented in the text collection. However, even with the state-of-the-art IE techniques, IE task results are expected to contain errors. Manual error detection and correction are labor intensive and time consuming. This validation cost remains a major obstacle to actual deployment of practical IE applications with high validity requirement. In this paper, we propose string graph structure and string feature-based methods. The former takes advantage of graph structure to compare characters and the relation between characters. Next step, we count the corresponding score via formula, and then the scores are takes to estimate the data correctness. The latter uses string features to describe a certain characteristics of each string, after that decision tree is generated by the C4.5 and SVM machine learning algorithms. And then classify the data is valid or not. These two detection methods have the ability to describe the feature of data and verify the correctness further. The difference between these two methods is that, we deal with string of row data directly in the previous method. Besides, it indicates the concept of node position and neighbor node in graphic theory. 
By contrast, the row string was transformed into feature value, and then be classified in the latter method. In our experiments, we use IE task results of government personnel directives as test data. We conducted experiments to verify that effective detection of IE invalid values can be achieved by using the string graph structure and string feature-based methods. The contribution of our work is to reduce validation cost and enhance the quality of IE results, even provide both analytical and empirical evidences for supporting the effective enhancement of IE results usability as well. 錯誤偵測資訊擷取文本資料描述 Error Detection Information Extraction Textual Data Profiling
4	轉換年報資料以擷取企業評價模型之非財務性資料項 / A Transformation Approach to Extract Annual Report for Non-Financial Category in Business Valuation 吳思宏, Wu, Szu-Hung Unknown Date (has links) 現今由於之前企業併購熱潮，使得企業到底價值多少？企業是否能夠還有前景？這些問題不僅僅是投資者所關心的問題，也同樣是會計師及企業評價者所關心的問題。又現今已邁入知識經濟時代，企業已從過去以土地、廠房、設備等固定資產來產生企業價值，轉而以服務、品牌、專利等無形資產為主要的企業價值時，企業的價值又要如何來估算。而這些問題都一再的顯示出“企業評價”的重要性。在進行企業評價之前，企業評價模型中之資料項的取得更是關係著最後評價結果的好壞。在企業評價資料項中，可分為財務性及非財務性。財務性資料項由於定義清楚，所以在資料的收集上較非財務性資料容易。但我們發現過往之資料收集方式並不足以應用在企業評價非財務性資料項的收集上，且現行大多採用人工處理資料的方式，不僅耗費大量時間及成本，又因人工輸入而有資料輸入錯誤之風險，使得資料的正確性大幅降低。故本研究提出一自動化擷取年報中企業評價非財務性資料項之方法，希望藉此方法達到簡化資料收集過程，提高資料的正確性。 / Because of the trend of the business combination, now, more and more people concern about “how much value does a business have?” And “does the business still have any perspectives?” This not only get investors’’ interest, but also the accountant and business valuator. Now we already get into a new economy, called knowledge-based economy. When the businesses are not just use fixed asset, such as facility, factory and land to earn money, but also earn their money by providing services, making brand, or sell patents for live, how to measure the business’s real value and what the real value for the business is. These problems all shows that the importance of “Business Valuation.” Before calculate the business value, the most important thing is to collect the data or data category for business valuation. There are two kinds of business valuation data item. One is financial data item; the other is non-financial data item. Because of the financial data item’s clear definition, the data collection process of financial data item is easier than non-financial data item. And the data collection in the past is not fit for today, and now most valuators use manual way to process these data. This way not only wastes the time and money, but also lowers the correctness and raises the risk of mistype during the process of data collection. 
In this thesis, we propose an approach to automatic extract business valuation data category from annual report by using the technology of data extraction. 企業評價資訊擷取 Portable Document Format ( PDF ) 資訊檢索斷詞 Business valuation Data extraction Portable Document Format ( PDF ) Information Retrieval Word Segmentation
5	科技政策網站內容分析之研究賴昌彥, Lai, Chang-Yen Unknown Date (has links) 面對全球資訊網(WWW)應用蓬勃發展，網際網路上充斥著各種類型的資訊資源。而如何有效地管理及檢索這些資料，就成為當前資訊管理的重要課題之一。在發掘資訊時，最常用的便是搜尋引擎，透過比對查詢字串與索引表格(index table)，找出相關的網頁文件，並回傳結果。但因為網頁描述資訊的不足，導致其回覆大量不相關的查詢結果，浪費使用者許多時間。為了解決上述問題，就資訊搜尋的角度而言，本研究提出以文字開採技術實際分析網頁內容，並將其轉換成維度資訊來描述，再以多維度資料庫方式儲存的架構。做為改進現行資訊檢索的參考架構。就資訊描述的角度，本研提出採用RDF(Resource Description Framework)來描述網頁Metadata的做法。透過此通用的資料格式來描述網路資源，做為跨領域使用、表達資訊的標準，便於Web應用程式間的溝通。期有效改善現行網際網路資源描述之缺失，大幅提昇搜尋之品質。資訊檢索資訊擷取元資料文字開採資源描述架構多維度資料庫 Information Retrieval Information Extraction Metadata Text Mining Resource Description Framework Multi-Dimensional Database
6	探索美國財務報表的主觀性詞彙與盈餘的關聯性:意見分析之應用 / Exploring the relationships between annual earnings and subjective expressions in US financial statements: opinion analysis applications 陳建良, Chen, Chien Liang Unknown Date (has links) 財務報表中的主觀性詞彙往往影響市場中的參與者對於報導公司價值和獲利能力衡量的決策判斷。因此，公司的管理階層往往有高度的動機小心謹慎的選擇用詞以隱藏負面的消息而宣揚正面的消息。然而使用人工方式從文字量極大的財務報表挖掘有用的資訊往往不可行，因此本研究採用人工智慧方法驗證美國財務報表中的主觀性多字詞 (subjective MWEs) 和公司的財務狀況是否具有關聯性。多字詞模型往往比傳統的單字詞模型更能掌握句子中的語意情境，因此本研究應用條件隨機域模型 (conditional random field) 辨識多字詞形式的意見樣式。另外，本研究的實證結果發現一些跡象可以印證一般人對於財務報表的文字揭露往往與真實的財務數字存在有落差的印象；更發現在負向的盈餘變化情況下，公司管理階層通常輕描淡寫當下的短拙卻堅定地承諾璀璨的未來。 / Subjective assertions in financial statements influence the judgments of market participants when they assess the value and profitability of the reporting corporations. Hence, the managements of corporations may attempt to conceal the negative and to accentuate the positive with "prudent" wording. To excavate this accounting phenomenon hidden behind financial statements, we designed an artificial intelligence based strategy to investigate the linkage between financial status measured by annual earnings and subjective multi-word expressions (MWEs). We applied the conditional random field (CRF) models to identify opinion patterns in the form of MWEs, and our approach outperformed previous work employing unigram models. Moreover, our novel algorithms take the lead to discover the evidences that support the common belief that there are inconsistencies between the implications of the written statements and the reality indicated by the figures in the financial statements. Unexpected negative earnings are often accompanied by ambiguous and mild statements and sometimes by promises of glorious future. 意見探勘自然語言處理語意分析財務報表文字探勘資訊擷取 opinion mining natural language processing sentiment analysis financial text mining information extraction
7	利用馬可夫邏輯網路模型與自動化生成的模板加強生醫文獻之語意角色標註 / Biomedical semantic role labeling with a Markov Logic network and automatically generated patterns 賴柏廷 Unknown Date (has links) 背景: 生醫文獻語意角色標註（Semantic Role Labeling, SRL）是一種自然語言處理的技術，其可用來將描述生物過程的語句以predicate-argument structures ( PASs ) 表示。SRL 經常受限於arguments的unbalance problem而且需要花費許多的時間和記憶體空間在學習 arguments 之間的相依性。方法: 我們提出一Markov Logic Network ( MLN ) -based SRL之系統，且此系統使用自動化生成之SRL 模板同時辨識constituents與候選之語意角色。結果及結論: 我們的方法在BioProp語料上來評估。實驗結果顯示我們的方法勝過目前最先進的系統。此外，使用SRL模板後，在時間及記憶體之花費上亦大幅的減少，而且我們自動化生成之模板亦能幫助建立這些模板。我們認為本論文提出之方法可以透過增加新的SRL模板例如：由生物學家手動寫的模板，而得到進一步的提升，而且本方法也為於需要處理大量SRL 語料時，提供一種可能的解法。 / Background: Biomedical semantic role labeling ( SRL ) is a natural language processing technique that expresses the sentences that describe biological processes as predicate-argument structures ( PASs ) . SRL usually suffers from the unbalanced problem of arguments and consuming time and memory on learning the dependencies between the arguments. Method: We constructed a Markov Logic Network ( MLN ) -based SRL system, and the system uses SRL patterns, which utilizes automatically generated approaches, to simultaneously recognize the constituents and candidates of semantic roles. Results and conclusions: Our method is evaluated on the BioProp corpus. The experimental result shows that our method outperforms the state-of-the-art system. Furthermore, after applying SRL patterns, the costs of the time and memory are greatly reduced, and our automatically generated patterns are helpful in the development of these patterns. We consider that our method can be further improved by adding new SRL patterns such as biological experts manually written patterns and it also provide a possible solution to process large SRL corpus. 語意角色標註自然語言處理馬可夫邏輯網路機器學習資訊擷取 Semantic Role Labeling Natural Language Processing Markov Logic Network Machine Learning Information Extraction
8	中文流行音樂詞曲情意關聯分析 / Conception association analysis between lyrics and music of Chinese popular music 林志傑, Lin, Chih Chieh Unknown Date (has links)
This thesis studies the association between the affection and conception of lyrics and music in Chinese popular music, and designs a lyrics recommendation system that, given a piece of query music, recommends lyrics matching its conception. Popular music, broadly defined, is music disseminated through mass media for a general audience. Because of this, its lyrics mostly deal with everyday life and clearly express the song's conception, and the resonance a song evokes determines whether it has commercial value for publication. People also often use popular songs to sing out their own stories and aspirations. We therefore propose automatically recommending lyrics that fit a song's conception, so that an existing song can be paired with new lyrics. A song paired with different lyrics tells a different story, which gives the original music new life and adds digital value through one melody with many lyrics.
The literature and the accounts of professional music creators indicate that lyrics and music in popular music go together in several ways, of which the match in affection and conception is the most important, so this association is the core of the research. Analyzing a large number of commercially released popular songs reveals clues to how lyrics and music are matched. For lyrics, we use the LSA (Latent Semantic Analysis) algorithm to extract conception features and compare them for similarity with metaphor and blending theory from linguistics. For music, we extract features that convey a song's conception: pitch, tonality, tempo, rhythm, chords, and timbre. We then use the CFA (Cross-Modal Factor Analysis) algorithm to learn an association model between the conception features of lyrics and music, and build the recommendation system on that model.
In the experiments, the model trained on the lyrics-music conception association (CFA Feature Model) reaches an average accuracy of 60.1% in the top 5 recommended lyrics and 41.4% in the top 50. The control approach, which considers only the music's conception features (Audio Feature Model), reaches 45.1% in the top 5 and 28.6% in the top 50. In a case study, we input several songs composed by students and found that the conception of the recommended popular-music lyrics was very similar to that of the students' original lyrics. These results indicate that the proposed model does recommend lyrics suited to the conception of the query music, and that it can give creators inspiration when writing lyrics and music.
Lyrics-music conception association analysis; Music conception analysis; Lyrics conception analysis; Cross modal association mining; Recommendation lyrics by song; Music Information Retrieval
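The top-5 accuracy reported for entry 8 can be read as a precision-at-k measure: the share of the first k recommended lyrics whose conception matches that of the query music, averaged over all queries. The abstract does not define the metric, so the Python sketch below is only one plausible reading. The function names, the string labels, and the sample data are all hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple


def precision_at_k(ranked_labels: Sequence[str], query_label: str, k: int) -> float:
    """Share of the first k recommendations whose label equals the query's label.

    Assumption: each lyric carries a single conception label. The result is
    divided by k even if fewer than k recommendations were returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for label in ranked_labels[:k] if label == query_label)
    return hits / k


def mean_precision_at_k(results: List[Tuple[str, Sequence[str]]], k: int) -> float:
    """Average precision_at_k over (query_label, ranked_labels) pairs."""
    if not results:
        return 0.0
    return sum(precision_at_k(labels, query, k) for query, labels in results) / len(results)


# Hypothetical data: two query songs, each with five ranked lyric labels.
sample_results = [
    ("sad", ["sad", "sad", "happy", "sad", "calm"]),
    ("happy", ["happy", "calm", "sad", "happy", "happy"]),
]
top5 = mean_precision_at_k(sample_results, k=5)
```

Passing a different `k` to the same function gives the corresponding figure for a longer or shorter recommendation list.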
9	文字背後的意含-資訊的量化測量公司基本面與股價（以中鋼為例） / Behind the words - quantifying information to measure firms' fundamentals and stock return (taking the China steel corporation as example) 傅奇珅, Fu, Chi Shen Unknown Date (has links)
This research collects news stories about China Steel Corporation from Economic Daily News, United Daily News, and United Evening News. The articles are segmented with the Chinese Word Segmentation System of Academia Sinica, and the analysis follows and extends the methodology of Tetlock, Saar-Tsechansky, and Macskassy (2008). I examine whether a simple quantitative measure of language can be used to explain and predict an individual firm's accounting sales and stock returns. The two main findings are: 1. the fraction of positive words (commendatory terms) in firm-specific news stories forecasts high firm sales; 2. the firm's stock price briefly overreacts to the information embedded in negative words (derogatory terms), whereas it efficiently incorporates the information embedded in positive words (commendatory terms). Taken together, these findings suggest that the linguistic content of news media captures otherwise hard-to-quantify aspects of firms' fundamentals, which investors quickly incorporate into stock prices.
Content Analysis; Textual Information; Informative Content; Text Mining; Information Effect; Critical Information Extraction; Commendatory Term; Derogatory Term; Positive words; Negative words; Fundamental Analysis; Stock Return Analysis
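The language measure in entry 9 rests on the fraction of positive or negative words in a firm-specific news story. The Python sketch below shows that computation under stated assumptions: the article is already segmented into tokens (the thesis uses the Academia Sinica segmenter, which is not called here), and the two small word lists and the sample tokens are hypothetical stand-ins, not the lexicons or data used in the study.

```python
from typing import Iterable, Set

# Hypothetical lexicons; the study's actual word lists are not given in the abstract.
POSITIVE_WORDS: Set[str] = {"成長", "新高"}
NEGATIVE_WORDS: Set[str] = {"壓力", "虧損"}


def word_fraction(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Return the share of tokens that appear in the given lexicon."""
    token_list = list(tokens)
    if not token_list:
        return 0.0  # an empty article carries no signal
    matches = sum(1 for token in token_list if token in lexicon)
    return matches / len(token_list)


# Hypothetical pre-segmented news sentence (8 tokens).
tokens = ["中鋼", "營收", "成長", "獲利", "創", "新高", "成本", "壓力"]
positive_fraction = word_fraction(tokens, POSITIVE_WORDS)  # 2 of 8 tokens
negative_fraction = word_fraction(tokens, NEGATIVE_WORDS)  # 1 of 8 tokens
```

In a study of this kind, such per-article fractions would then serve as explanatory variables for sales or returns; that regression step is outside this sketch.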
