561

應用文字探勘技術於英文文章難易度分類 / The Classification of the Difficulty of English Articles with Text Mining

許珀豪, Hsu, Po Hao
How English learners can pick, from the abundance of material online, articles whose difficulty matches their reading ability is a question worth studying. To raise the accuracy of difficulty classification, recent research has drawn on many difficulty-related features. This study classifies articles by linguistic difficulty features and by text features, separately and in combination, comparing each result against the official article levels to test whether combining the two feature sets improves accuracy. The classification is based on GEPT practice-test articles and proceeds in three stages. First, linguistic difficulty features form the dimensions of a feature vector; after computing each feature value, kNN assigns every article to the elementary, intermediate, or high-intermediate level, which serves as the accuracy baseline. Second, the GEPT articles are segmented into words, selected feature words form the vector dimensions, and TF-IDF supplies the feature values for text-feature classification. Third, both feature sets are combined into a single representation. The resulting F-measures are 0.61 for linguistic features and 0.47 for text features; the combined classification performs best, with an F-measure of 0.68. The hope is that this combined result can help learners find, among the vast number of English articles, those suited to their level so they can progress step by step; in future work, combining linguistic difficulty features and text features could classify English articles by both category and level, letting users choose what to read and improving learning outcomes.
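As a rough illustration of the classification step described above, the sketch below combines linguistic difficulty features and text features into one vector and applies a plain kNN majority vote. The feature values and feature words are invented for illustration, not taken from the thesis's GEPT data:

```python
from collections import Counter
import math

def knn_classify(query, examples, k=3):
    """Assign the majority label among the k nearest labelled vectors
    (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical combined vectors:
# [avg sentence length, avg word length, TF-IDF of two feature words]
train = [
    ([8.0, 4.1, 0.10, 0.02], "elementary"),
    ([9.5, 4.3, 0.12, 0.03], "elementary"),
    ([14.0, 4.9, 0.02, 0.20], "intermediate"),
    ([15.5, 5.1, 0.03, 0.22], "intermediate"),
    ([21.0, 5.8, 0.01, 0.35], "high-intermediate"),
    ([22.5, 6.0, 0.02, 0.33], "high-intermediate"),
]
print(knn_classify([15.0, 5.0, 0.02, 0.21], train))  # → intermediate
```

In practice the combined vector would have one dimension per linguistic feature plus one per selected feature word, but the voting mechanism is the same.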
562

應用文字探勘文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例 / A Study of Stock Price Prediction with Text Mining, Classification and Clustering Techniques in Taiwan Stock Market

薛弘業, Hsueh, Hung Yeh
This study investigates how company-specific news affects the Taiwan stock market. Historical trading data and news for three listed companies (HTC, TSMC, and Hon Hai) from June 2012 to May 2013 were collected; text mining extracts the features of each news item, and with historical data, technical-analysis indicators, and the kNN and 2-way kNN algorithms, the news is first classified and then clustered to build a prediction model. The model analyses how, and how strongly, news moves stock prices, and how the clusters with the largest price changes relate to rises, falls, and turning points. The results show that adding technical-analysis indicators improves classification accuracy; clustering within the up and down classes delineates each cluster's relation to price movement; and analysing the clusters with the largest changes raises investment accuracy to about 80%. The predicted turning points of the stock price provide a clear entry time for investment and ensure that an investor following the model over a 7-trading-day horizon can reliably and quickly earn returns of 2.82% to 22.03% at very low risk.
563

Étude comparative du vocabulaire de description de la danse dans les archives et du vocabulaire de représentation de la danse dans la littérature / A Comparative Study of the Vocabulary for Describing Dance in Archives and the Vocabulary for Representing Dance in Literature

Paquette-Bigras, Ève 03 1900
This research falls within the digital humanities, bringing the arts and information science into dialogue. In the last few decades, dance has become a research subject in its own right, so dance description in archives needs improvement, since the quality of description upstream strongly affects access downstream; automatic knowledge extraction seems to offer new possibilities here. The goal of this research is to contribute to information-management tools for dance archives by comparing a vocabulary for describing dance in archives with a vocabulary for representing dance in literature, the latter obtained through automatic knowledge extraction, looking for possible complementarity, particularly with respect to the vocabulary of aesthetic experience. First, tools for describing dance in archives are reviewed and the Collier Descriptor Thesaurus is analysed; this vocabulary for describing dance in archives turns out not to take aesthetic experience into account. Second, a structured vocabulary of the aesthetic experience of modern dance is extracted from a corpus of texts by the French writer Stéphane Mallarmé and analysed. Finally, the two vocabularies are compared to assess their complementarity in describing aesthetic experience. We conclude that archival thesauri built mainly from factual terms, such as the Collier Descriptor Thesaurus, can be enriched with terms describing aesthetic experience: the vocabulary for representing dance in literature complements, to a certain extent, the vocabulary for describing dance in archives. This initial experiment thus partly supports the relevance of knowledge-extraction methods for developing and maintaining information resources for performing arts such as dance.
564

應用探勘技術於社會輿情以預測捷運週邊房地產市場之研究 / A Study of Applying Public Opinion Mining to Predict the Housing Market Near the Taipei MRT Stations

吳佳芸, Wu, Chia Yun
With the convenience and immediacy of the Internet, online news has become one of the main channels through which the public receives and relays information, and the accumulated volume of news reflects public attention to, and sentiment about, particular topics in near real time. This study therefore applies opinion mining and sentiment analysis to extract valuable associations from real-estate news and, combined with machine learning, builds a prediction model for the housing market as a reference for purchase decisions. We collected 11,150 real-estate news articles from January 1, 2010 to June 30, 2014, together with 8,165 housing transactions within 250 meters of Taipei MRT stations. Opinion words extracted from the news drive a sentiment analysis, from which time series of housing-market sentiment and of transaction price and volume are built. Half-year moving averages, second-order moving averages, and growth slopes indicate whether public opinion on the housing market is optimistic or pessimistic, and the relation between sentiment and actual transactions is analysed to locate entry points into the market. Combining sentiment with environmental factors, a support vector machine (SVM) model predicts the housing market around individual stations. Empirically, market sentiment and transaction price and volume fluctuate with a consistent cycle and correlation, and the year before a new MRT line opens affects the whole MRT housing market; when the transaction line crosses above the sentiment line with both slopes rising, it marks a suitable entry point. The station-level prediction model reaches an average accuracy of 69.2% for stations on new MRT lines and 78% for hot stations, showing good predictive power for housing hotspots.
565

Learning with Sparsity: Structures, Optimization and Applications

Chen, Xi 01 July 2013
The development of modern information technology has enabled the collection of data of unprecedented size and complexity. Examples include web text data, microarray and proteomics data, and data from scientific domains such as meteorology. When learning from such high-dimensional, complex data, traditional machine learning techniques often suffer from the curse of dimensionality and unaffordable computational cost; yet learning from large-scale high-dimensional data promises big payoffs in text mining, gene analysis, and numerous other consequential tasks. Recently developed sparse learning techniques provide a suite of tools for understanding and exploring high-dimensional data from many areas of science and engineering. By exploiting sparsity, we can learn a parsimonious, compact model that is more interpretable and computationally tractable at application time; when the underlying model is indeed sparse, sparse learning methods yield more consistent models and much improved prediction performance. Existing methods, however, remain insufficient for modeling complex or dynamic structures in the data, such as those evidenced in pathways of genomic data, gene regulatory networks, and synonyms in text data. This thesis develops structured sparse learning methods, along with scalable optimization algorithms, to explore and predict high-dimensional data with complex structures. In particular, it addresses three aspects of structured sparse learning: (1) efficient and scalable optimization methods with fast convergence guarantees for a wide spectrum of high-dimensional learning tasks, including single- and multi-task structured regression, canonical correlation analysis, and online sparse learning; (2) learning dynamic structures of different types of undirected graphical models, e.g., conditional Gaussian or conditional forest graphical models; and (3) demonstrating the usefulness of the proposed methods in applications such as computational genomics and spatio-temporal climatological data. In addition, specialized sparse learning methods are designed for text mining applications, including ranking and latent semantic analysis. The last part of the thesis discusses future directions for high-dimensional structured sparse learning from both computational and statistical perspectives.
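A minimal illustration of why ℓ1-based sparse learning yields parsimonious models: the proximal operator of the ℓ1 norm (soft thresholding), which is the core update inside ISTA/FISTA-style proximal-gradient algorithms of the kind this line of work builds on. This is a generic sketch, not the thesis's own code:

```python
def soft_threshold(v, lam):
    """prox of lam * ||.||_1: shrink each entry toward zero and zero out
    entries with |v_i| <= lam -- this hard zeroing is what induces sparsity."""
    out = []
    for x in v:
        if x > lam:
            out.append(x - lam)
        elif x < -lam:
            out.append(x + lam)
        else:
            out.append(0.0)
    return out

w = soft_threshold([3.0, -0.5, 2.0, -4.0, 0.2], 1.0)
print(w)  # → [2.0, 0.0, 1.0, -3.0, 0.0]
```

In a proximal-gradient loop, this operator is applied after every gradient step on the smooth loss, so small coefficients are driven exactly to zero rather than merely close to it.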
566

巨量資料環境下之新聞主題暨輿情與股價關係之研究 / A Study of the Relevance between News Topics & Public Opinion and Stock Prices in Big Data

張良杰, Chang, Liang Chieh
In recent years, advances in technology, networks, and storage media have produced explosive growth in data, announcing the era of big data. With big data there is no longer a need to rely on traditional sampling to collect data, nor the risk that an inadequate sample fails to represent the population; once those limits fall away, the essence of big data lies in extracting valuable information from it. Social networking sites, for example, carry vast amounts of public opinion and interpersonal interaction, and scholars have found that their sentiment correlates positively with stock prices. This study applies the same idea to online news, which shares the characteristics of big data, crawling 30,879 economics news articles published by the Central News Agency from July 2013 to May 2014, and applies topic detection and tracking (TDT) together with sentiment analysis. Based on the similarity between news events, events are linked into networks, and the relation between news sentiment and the stock index is analysed. The results show that news events can be linked into specific news topics, that distinct topics can be identified within a large network, and that linking topics together yields a topic context. This offers a new way to digest huge volumes of news quickly and to trace news topics and events back effectively. As for sentiment and prices, news sentiment is found to influence fluctuations of the stock index, with a correlation coefficient of 0.733562; compared with the psychological line (PSY) and trading-willingness indicators, news sentiment proves a useful reference, to a certain degree, for judging stock prices.
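The reported figure is a correlation between the daily news-sentiment series and the stock index; a Pearson coefficient of that kind can be computed as below. The two series here are invented stand-ins, not the study's data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sentiment = [0.2, 0.5, 0.4, 0.7, 0.9]   # hypothetical daily sentiment scores
index = [8200, 8350, 8300, 8500, 8650]  # hypothetical index closes
print(pearson(sentiment, index))  # ≈ 0.99
```

A value near 0.73, as reported, would indicate a strong but not deterministic co-movement between sentiment and the index.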
567

法人說明會資訊對供應鏈機構投資人投資行為之影響-以我國半導體產業為例 / The Effect of Upstream Companies' Conference Call Information on Downstream Companies' Institutional Investors – An Example from the Semiconductor Industry in Taiwan

劉士豪, Liu, Shih Hao
Given the tightly linked supply chain of Taiwan's semiconductor industry, this study examines whether conference calls held by upstream IC design companies convey useful information to institutional investors of midstream and downstream manufacturing, packaging, and testing companies, i.e. whether conference-call information produces a vertical information-transfer effect along the semiconductor supply chain. The empirical results show that on the day a conference call is first reported in the newspapers, the upstream company's conference-call information indeed affects the holdings of institutional investors in mid- and downstream companies, who buy on good news and sell on bad news. This indicates that institutional investors, with their professional teams and private information, adjust their trading strategies ahead of the general public; the information-transfer effect is gradually diluted as supply-chain distance increases. Moreover, owing to geographic constraints, foreign institutional investors rely on conference-call information to adjust their holdings more than investment trusts and dealers do, likewise buying on good news and selling on bad.
568

USO DE TEORIAS NO CAMPO DE SISTEMAS DE INFORMAÇÃO: MAPEAMENTO USANDO TÉCNICAS DE MINERAÇÃO DE TEXTOS / The Use of Theories in the Information Systems Field: A Mapping Using Text Mining Techniques

Pinheiro, José Claudio dos Santos 17 September 2009
This work maps the use of theories in the information systems field, using information-retrieval techniques together with data mining and text mining methods. The theories addressed are Transaction Cost Economics (TCE), the Resource-Based View (RBV), and Institutional Theory (IT), chosen for their relevance as approaches in studies of investment allocation and information systems implementation. The empirical data are the textual content (in English) of the abstract and literature-review sections of articles published in Information Systems Research (ISR), Management Information Systems Quarterly (MISQ), and the Journal of Management Information Systems (JMIS) from 2000 to 2008. The results of text mining combined with data mining were compared against the EBSCO advanced search tool and proved more efficient at identifying content. Articles grounded in the three theories accounted for 10% of all articles in the three journals, and the most prolific publication years were 2001 and 2007.
569

Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico

Duque, Juliana Lilian 24 February 2012
In the medical field there is currently a large amount of unstructured (i.e., textual) information being produced in the literature. Given this volume, physicians and specialists cannot analyse all the relevant literature manually, which calls for techniques that automate document analysis. To identify relevant information, structure and store it in a database, and later discover meaningful relationships among the extracted items, this work proposes a paragraph-based process for extracting treatments from scientific papers in the biomedical domain. The hypothesis is that first searching for sentences containing complication terms improves the identification and extraction of treatment terms, because treatments mostly occur in the same sentence as a complication or in nearby sentences of the same paragraph. The methodology employs three information-extraction approaches from the literature: a machine-learning approach to classify the sentences of interest from which terms will be extracted; a dictionary-based approach using terms validated by a domain expert; and a rule-based approach. It was validated as a proof of concept on biomedical papers, specifically papers on Sickle Cell Anemia, covering sentence classification and identification of relevant terms. Sentence-classification accuracy was 79% for the complication classifier and 71% for the treatment classifier, obtained by combining the Support Vector Machine learning algorithm with a noise-removal and class-balancing filter. In identifying relevant terms, the proposed methodology reached a higher F-measure (42%) than manual classification (31%) and than the partial process without the complication classifier (36%). Despite the low recall overall, 100% recall was obtained for the distinct treatment terms, so the extraction process was not impaired and the working hypothesis of this study was confirmed.
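The F-measure figures above combine precision and recall over the extracted treatment terms; the calculation itself is simple set arithmetic, sketched below with invented term sets (not terms from the dissertation's corpus):

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F-measure of an extracted term set
    against a gold-standard term set."""
    tp = len(extracted & gold)  # true positives: terms found in both sets
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical treatment terms for illustration only
extracted = {"hydroxyurea", "blood transfusion", "penicillin", "aspirin"}
gold = {"hydroxyurea", "blood transfusion", "penicillin", "folic acid"}
print(precision_recall_f1(extracted, gold))  # → (0.75, 0.75, 0.75)
```

Note how 100% recall (every gold term recovered) is compatible with a modest F-measure whenever the extractor also returns spurious terms, which is the situation the abstract describes.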
570

Anotação semântica baseada em ontologia: um estudo do português brasileiro em documentos históricos do final do século XIX / Ontology-based semantic annotation: a study of Brazilian Portuguese in historical documents from the end of the 19th century

Pereira, Juliana Wolf 01 July 2014
This dissertation presents an approach to automatic semantic annotation of historical documents from the end of the 19th century that discuss the constitution of the mother tongue, the Portuguese language in Brazil. The objective is to generate a set of documents semantically annotated in accordance with a domain ontology. To provide this domain ontology, the Instrumento Linguístico ontology was built, and it supported the automatic semantic-annotation process. The annotation results were compared against a gold standard and showed a high degree of agreement, with F1-scores between 0.86 and 1.00. In addition, new documents on the domain under discussion could be located in a sample of the Revistas Brazileiras. These results demonstrate the efficacy of the automatic semantic-annotation approach.
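At its simplest, ontology-driven annotation can be pictured as matching ontology labels against the text and recording which concept each match instantiates. The sketch below uses invented labels and concept IDs, not the actual Instrumento Linguístico ontology:

```python
def annotate(text, ontology):
    """Return (surface form, concept, offset) for each ontology label
    found in the text (naive case-insensitive exact matching)."""
    lower = text.lower()
    hits = []
    for label, concept in ontology.items():
        start = lower.find(label.lower())
        if start != -1:
            hits.append((label, concept, start))
    return sorted(hits, key=lambda h: h[2])  # order by position in text

ontology = {  # hypothetical label -> concept mapping
    "grammatica": "il:Grammar",
    "lingua portugueza": "il:PortugueseLanguage",
    "diccionario": "il:Dictionary",
}
doc = "A grammatica e o diccionario fixam a lingua portugueza."
for surface, concept, offset in annotate(doc, ontology):
    print(offset, surface, concept)
```

A production annotator would add tokenization, morphological variants, and disambiguation, but the output shape, spans of text linked to ontology concepts, is the same idea evaluated against the gold standard above.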
