501 |
Um método para extração de palavras-chave de documentos representados em grafos / Abilhoa, Willyan Daniel 05 February 2014 (has links)
Previous issue date: 2014-02-05 / Fundação de Amparo à Pesquisa do Estado de São Paulo / Twitter is a microblog service that generates a huge amount of textual content daily. All this content needs to be explored by means of techniques such as text mining, natural language processing and information retrieval. In this context, automatic keyword extraction is a task of great usefulness that can be applied to indexing, summarization and knowledge extraction from texts. A fundamental step in text mining consists of building a text representation model. The model known as the vector space model, VSM, is the most well-known and used among these techniques. However, some difficulties and limitations of VSM, such as scalability and sparsity, motivate the proposal of alternative approaches. This dissertation proposes a keyword extraction method, called TKG (Twitter Keyword Graph), for tweet collections that represents texts as graphs and applies centrality measures for finding the relevant vertices (keywords). To assess the performance of the proposed approach, two different sets of experiments are performed and comparisons with TF-IDF and KEA are made, having human classifications as benchmarks. The experiments performed showed that some variations of TKG are invariably superior to others and to the algorithms used for comparison. / O Twitter é um serviço de microblog que gera um grande volume de dados textuais. Todo esse conteúdo precisa ser explorado por meio de técnicas de mineração de textos, processamento de linguagem natural e recuperação de informação com o objetivo de extrair um conhecimento que seja útil de alguma forma ou em algum processo. Nesse contexto, a extração automática de palavras-chave é uma tarefa que pode ser usada para a indexação, sumarização e compreensão de documentos. Um passo fundamental nas técnicas de mineração de textos consiste em construir um modelo de representação de documentos. O modelo chamado modelo de espaço vetorial, VSM, é o mais conhecido e utilizado dentre essas técnicas. No entanto, algumas dificuldades e limitações do VSM, tais como escalabilidade e esparsidade, motivam a proposta de abordagens alternativas. O presente trabalho propõe o método TKG (Twitter Keyword Graph) de extração de palavras-chave de coleções de tweets que representa textos como grafos e aplica medidas de centralidade para encontrar vértices relevantes, correspondentes às palavras-chave. Para medir o desempenho da abordagem proposta, dois diferentes experimentos são realizados e comparações com TF-IDF e KEA são feitas, tendo classificações humanas como referência. Os experimentos realizados mostraram que algumas variações do TKG são superiores a outras e também aos algoritmos usados para comparação.
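The graph-based idea behind TKG can be illustrated with a minimal sketch (not the thesis's exact algorithm): each tweet contributes edges to a word co-occurrence graph, and a vertex centrality measure ranks keyword candidates. The sample tweets and the choice of closeness centrality are assumptions made for illustration only.

```python
# A minimal sketch of graph-based keyword extraction in the spirit of TKG
# (not the thesis's exact algorithm): tweets become a word co-occurrence graph
# and vertex centrality is used to rank candidate keywords.
import itertools
import re
import networkx as nx

def extract_keywords(tweets, top_k=5):
    graph = nx.Graph()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        # connect words that co-occur within the same tweet
        for a, b in itertools.combinations(set(tokens), 2):
            weight = graph.get_edge_data(a, b, {}).get("weight", 0) + 1
            graph.add_edge(a, b, weight=weight)
    # closeness centrality as one possible centrality measure
    centrality = nx.closeness_centrality(graph)
    return sorted(centrality, key=centrality.get, reverse=True)[:top_k]

tweets = [
    "keyword extraction from tweets using graphs",
    "graph centrality finds relevant keywords in tweets",
    "tweets are short and sparse texts",
]
print(extract_keywords(tweets))
```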
|
502 |
Selecionando candidatos a descritores para agrupamentos hierárquicos de documentos utilizando regras de associação / Selecting candidate labels for hierarchical document clusters using association rules / Fabiano Fernandes dos Santos 17 September 2010 (has links)
Uma forma de extrair e organizar o conhecimento, que tem recebido muita atenção nos últimos anos, é por meio de uma representação estrutural dividida por tópicos hierarquicamente relacionados. Uma vez construída a estrutura hierárquica, é necessário encontrar descritores para cada um dos grupos obtidos, pois a interpretação destes grupos é uma tarefa complexa para o usuário, já que normalmente os algoritmos não apresentam descrições conceituais simples. Os métodos encontrados na literatura consideram cada documento como uma bag-of-words e não exploram explicitamente o relacionamento existente entre os termos dos documentos do grupo. No entanto, essas relações podem trazer informações importantes para a decisão dos termos que devem ser escolhidos como descritores dos nós, e poderiam ser representadas por regras de associação. Assim, o objetivo deste trabalho é avaliar a utilização de regras de associação para apoiar a identificação de descritores para agrupamentos hierárquicos. Para isto, foi proposto o método SeCLAR (Selecting Candidate Labels using Association Rules), que explora o uso de regras de associação para a seleção de descritores para agrupamentos hierárquicos de documentos. Este método gera regras de associação baseadas em transações construídas a partir de cada documento da coleção, e utiliza a informação de relacionamento existente entre os grupos do agrupamento hierárquico para selecionar candidatos a descritores. Os resultados da avaliação experimental indicam que é possível obter uma melhora significativa com relação à precisão e à cobertura dos métodos tradicionais. / One way to organize knowledge, which has received much attention in recent years, is to create a structural representation divided by hierarchically related topics. Once this structure is built, it is necessary to find labels for each of the obtained clusters, since most algorithms do not produce simple descriptions and the interpretation of these clusters is a difficult task for users. The related works consider each document as a bag-of-words and do not explicitly explore the relationship between the terms of the documents. However, these relationships can provide important information for deciding which terms must be chosen as descriptors of the nodes, and could be represented by association rules. This work aims to evaluate the use of association rules to support the identification of labels for hierarchical document clusters. Thus, this paper presents the SeCLAR (Selecting Candidate Labels using Association Rules) method, which explores the use of association rules for the selection of good candidates for labels of hierarchical clusters of documents. This method generates association rules based on transactions built from each document in the collection, and uses the relationship information between the nodes of the hierarchical clustering to select candidates for labels. The experimental results show that it is possible to obtain a significant improvement with respect to precision and recall of traditional methods.
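A minimal sketch of the underlying idea (not the SeCLAR method itself) is shown below: each document of a cluster is treated as a transaction of terms, association rules are mined with mlxtend's apriori, and terms appearing in strong rules become label candidates. The toy documents and thresholds are assumptions.

```python
# A minimal sketch (not the SeCLAR method itself) of using association rules
# mined from document "transactions" to propose label candidates for a cluster.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

cluster_docs = [  # hypothetical documents belonging to one cluster
    {"association", "rules", "mining", "support"},
    {"association", "rules", "confidence", "mining"},
    {"rules", "mining", "transactions"},
]
vocabulary = sorted(set().union(*cluster_docs))
onehot = pd.DataFrame([[term in doc for term in vocabulary] for doc in cluster_docs],
                      columns=vocabulary)

frequent = apriori(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)

# terms that appear in strong rules are kept as label candidates
candidates = set()
for _, rule in rules.iterrows():
    candidates |= set(rule["antecedents"]) | set(rule["consequents"])
print(candidates)
```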
|
503 |
Extraction des relations de causalité dans les textes économiques par la méthode de l’exploration contextuelle / Extraction of causal relations in economic texts by the contextual exploration method / Singh, Dory 21 October 2017 (has links)
La thèse décrit un processus d’extraction d’informations causales dans les textes économiques qui, contrairement à l’économétrie, se fonde essentiellement sur des ressources linguistiques. En effet, l’économétrie appréhende la notion causale selon des modèles mathématiques et statistiques qui aujourd’hui sont sujets à controverses. Aussi, notre démarche se propose de compléter ou appuyer les modèles économétriques. Il s’agit d’annoter automatiquement des segments textuels selon la méthode de l’exploration contextuelle (EC). L’EC est une stratégie linguistique et computationnelle qui vise à extraire des connaissances selon un point de vue. Par conséquent, cette contribution adopte le point de vue discursif de la causalité où les catégories sont structurées dans une carte sémantique permettant l’élaboration des règles abductives implémentées dans les systèmes EXCOM2 et SEMANTAS. / The thesis describes a process for extracting causal information which, contrary to econometrics, is essentially based on linguistic knowledge. Econometrics exploits mathematical or statistical models, which are now subject to controversy. So, our approach intends to complete or support the econometric models. It consists in automatically annotating textual segments according to the Contextual Exploration (CE) method. CE is a linguistic and computational strategy aimed at extracting knowledge according to points of view. Therefore, this contribution adopts the discursive point of view of causality, where the categories are structured in a semantic map. These categories allow the elaboration of abductive rules implemented in the systems EXCOM2 and SEMANTAS.
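The contextual exploration idea can be illustrated with a toy sketch (not the EXCOM2 or SEMANTAS systems): a linguistic indicator triggers a candidate annotation, and contextual clues in the same segment confirm or reject it. The indicator and clue lists below are invented for illustration.

```python
# A toy illustration of contextual-exploration-style annotation (not EXCOM2):
# a linguistic indicator triggers a rule, and contextual clues in the same
# sentence confirm or reject the "causal" annotation.
import re

INDICATORS = ["because of", "due to", "leads to", "results in"]
CONTEXT_CLUES = ["increase", "decrease", "growth", "decline", "rise", "fall"]

def annotate_causal(sentences):
    annotated = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_indicator = any(ind in lowered for ind in INDICATORS)
        has_clue = any(re.search(rf"\b{clue}\w*\b", lowered) for clue in CONTEXT_CLUES)
        annotated.append((sentence, has_indicator and has_clue))
    return annotated

sample = [
    "Inflation increased due to the rise in energy prices.",
    "The report was published because of a legal requirement.",
]
for sentence, is_causal in annotate_causal(sample):
    print(is_causal, "-", sentence)
```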
|
504 |
文字探勘在學生評鑑教師教學之應用研究 / A Study of Students’ Evaluation on Teacher’s Teaching with Text Mining / 彭英錡, Peng, Ying Chi Unknown Date (has links)
本研究旨在瞭解探討北部某C大學實施學生評鑑教師教學之現況,並探討大學生回答開放性問題對該課程的優點與建議,進行文字探勘分析。
本研究利用問卷調查,在期末課程結束前,利用上網方式,對該課程進行填答。問卷所得資料進行敘述統計、因素分析、信度分析、獨立樣本t檢定、單因子變異數分析、皮爾森相關、多元迴歸與R軟體進行詞彙權重、文字雲、主題模型和群集分析。本研究結論如下:
一、學生評鑑教師教學現況以教學態度感受程度最高。
二、問卷各題項以「教師教學態度認真負責,且授足所需授課之時數」平均分數最高。
三、回饋性建議肯定「教學目標明確」最高,最需改善「彈性調整教學內容」。
四、學生評鑑教師教學因學生年級和課程類別不同而有顯著差異。
五、學生評鑑教師教學成效與學習成績呈低相關,以「教學評量」有預測力。
六、重要詞彙與文字雲發現「教學」、「內容」、「喜歡」及「同學」共同詞彙。
七、各學院主題模型命名,主要有觀察,考試與教學內容。
八、各學院集群分析結果,學生重視教學內容、學習過程與收穫及考試。
根據上述結果提出建議,以供教育行政主管機關、教師及未來研究者之參考。 / The purpose of this study was to explore the current situation of student evaluation of teachers' teaching at C University in the north, and to analyze, through text mining, undergraduates' open-ended responses about the strengths of and suggestions for their courses.
Before the end of the course, an online questionnaire survey was used to gather personal information and the measurements applied in this research. The questionnaire data were analyzed with descriptive statistics, factor analysis, reliability analysis, independent-samples t tests, one-way ANOVA, Pearson correlation, and multiple regression, and with term weighting, word clouds, topic models, and cluster analysis in R. Conclusions obtained in this study are as follows:
1. Student ratings of instruction scored above average on teaching effectiveness, with "teaching attitude" rated highest.
2. The item with the highest average score in the questionnaire was "Teachers' teaching attitude is serious and responsible, and the required teaching hours are fully delivered."
3. In the open-ended feedback, "clear teaching objectives" received the most positive comments, while "flexible adjustment of teaching content" was most in need of improvement.
4. Student ratings of instruction differed significantly by student grade level and course type.
5. Student ratings of instructional effectiveness showed a low correlation with academic performance, with "teaching evaluation" having predictive power.
6. The important terms and word clouds revealed "Teaching", "Content", "Likes", and "Classmates" as common vocabulary.
7. The topic models of the colleges were mainly named "Observation", "Examination", and "Teaching content".
8. The cluster analysis of each college showed that students valued "Teaching content", "Learning process and gains", and "Examination".
Based on the findings above, suggestions were provided as a reference for educational administrators and teachers, and as a guide for future research.
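The thesis carried out the text mining steps in R; a comparable sketch in Python is given below, showing term weighting (TF-IDF) and a topic model (LDA) over a few hypothetical open-ended course comments. The comments and parameter choices are assumptions, not the study's data.

```python
# A comparable sketch in Python (the thesis used R): TF-IDF term weights and
# an LDA topic model over hypothetical open-ended course comments.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "the teaching content was clear and well organized",
    "exams were difficult but the teaching attitude was responsible",
    "I liked the course content and learned a lot from classmates",
]

# term weighting: TF-IDF scores summed over comments
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(comments)
print(dict(zip(tfidf.get_feature_names_out(), weights.sum(axis=0).A1.round(2))))

# topic model: LDA over raw term counts
counts = CountVectorizer().fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))  # topic mixture per comment
```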
|
505 |
財報文字分析之句子風險程度偵測研究 / Risk-related Sentence Detection in Financial Reports / 柳育彣, Liu, Yu-Wen Unknown Date (has links)
本論文的目標是利用文本情緒分析技巧,針對美國上市公司的財務報表進行以句子為單位的風險評估。過去的財報文本分析研究裡,大多關注於詞彙層面的風險偵測。然而財務文本中大多數的財務詞彙與前後文具有高度的語意相關性,僅靠閱讀單一詞彙可能無法完全理解其隱含的財務訊息。本文將研究層次由詞彙拉升至句子,根據基於嵌入概念的 fastText 與 Siamese CBOW 兩種句子向量表示法學習模型,利用基於嵌入概念模型中,使用目標詞與前後詞彙關聯性表示目標詞語意的特性,萃取出財報句子裡更深層的財務意涵,並學習出更適合用於財務文本分析的句向量表示法。實驗驗證部分,我們利用 10-K 財報資料與本文提出的財務標記資料集進行財務風險分類器學習,並以傳統詞袋模型(Bag-of-Words)作為基準,利用精確度(Accuracy)與準確度(Precision)等評估標準進行比較。結果證實基於嵌入概念模型的表示法在財務風險評估上比傳統詞袋模型有著更準確的預測表現。由於近年大數據時代的來臨,網路中的資訊量大幅成長,依賴少量人力在短期間內分析海量的財務資訊變得更加困難。因此如何協助專業人員進行有效率的財務判斷與決策,已成為一項重要的議題。為此,本文同時提出一個以句子為分析單位的財報風險語句偵測系統 RiskFinder,依照 fastText 與 Siamese CBOW 兩種模型,經由 10-K 財務報表與人工標記資料集學習出適當的風險語句分類器後,對 1996 至 2013 年的美國上市公司財務報表進行財報句子的自動風險預測,讓財務專業人士能透過系統的協助,有效率地由大量財務文本中獲得有意義的財務資訊。此外,系統會依照公司的財報發布日期動態呈現股票交易資訊與後設資料,以利使用者依股價的時間走勢比較財務文字型與數值型資料的關係。 / The main purpose of this paper is to evaluate the risk of listed companies' financial reports at the sentence level. Most past sentiment analysis studies focused on word-level risk detection. However, most financial keywords are highly context-sensitive, which may yield biased results. Therefore, to advance the understanding of financial textual information, this thesis broadens the analysis from the word level to the sentence level. We use two sentence-level models, fastText and Siamese CBOW, to learn sentence embeddings and facilitate financial risk detection. In our experiment, we use the 10-K corpus and a financial sentiment dataset which was labeled by financial professionals to train our financial risk classifier. Moreover, we adopt the Bag-of-Words model as a baseline and use accuracy, precision, recall and F1-score to evaluate the performance of financial risk prediction. The experimental results show that the embedding models lead to better performance than the Bag-of-Words model. In addition, this paper proposes RiskFinder, a web-based financial risk detection system built on the fastText and Siamese CBOW models. There are 40,708 financial reports in the system, and each risk-related sentence is highlighted according to the different sentence embedding models. Besides, our system also provides metadata and a visualization of financial time-series data for the corresponding company according to the release date of each financial report. This system considerably facilitates case studies in the field of finance and can be of great help in capturing valuable insight within large amounts of textual information.
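The core idea of the sentence-level risk classifier can be sketched as follows (this is not the actual RiskFinder pipeline): sentence vectors are obtained by averaging word embeddings, standing in for fastText or Siamese CBOW, and a classifier flags risk-related sentences. The labeled sentences are hypothetical.

```python
# A minimal sketch of the idea behind sentence-level risk detection (not the
# actual RiskFinder pipeline): sentence vectors from averaged word embeddings,
# then a classifier that flags risk-related sentences.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

sentences = [  # hypothetical labeled sentences (1 = risk-related)
    ("the company faces litigation risk from pending lawsuits", 1),
    ("adverse currency fluctuations may impair future earnings", 1),
    ("the company opened a new office in austin", 0),
    ("we released an updated version of our product", 0),
]
tokenized = [text.split() for text, _ in sentences]
w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=0)

def sentence_vector(tokens):
    # average of word vectors as a simple stand-in for fastText / Siamese CBOW
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.array([sentence_vector(t) for t in tokenized])
y = [label for _, label in sentences]
clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_vector("lawsuits may impair earnings".split())]))
```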
|
506 |
Statistické metody ve stylometrii / Statistical methods in stylometry / Dupal, Pavel January 2017 (has links)
The aim of this thesis is to provide an overview of some of the methods commonly used in the area of authorship attribution (stylometry). The text begins with a recap of the field's history from the end of the 19th century to the present, and the required terminology from text mining is presented and explained. What follows is a list of selected methods from the fields of multidimensional statistics (principal component analysis, cluster analysis) and machine learning (Support Vector Machines, Naive Bayes) and their application to stylometric problems, including several methods created specifically for use in this field (bootstrap consensus tree, contrast analysis). Finally, these same methods are applied to a practical problem of authorship verification based on a corpus built from the works of four internet writers.
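One of the surveyed method families can be illustrated with a tiny sketch: a Support Vector Machine trained on relative frequencies of common function words, a classic stylometric feature set. The texts, labels, and function-word list below are invented for illustration.

```python
# A tiny illustration of one method surveyed in the thesis: an SVM trained on
# relative frequencies of common function words. Texts and labels are invented.
import numpy as np
from sklearn.svm import SVC

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]

def features(text):
    tokens = text.lower().split()
    return np.array([tokens.count(w) / len(tokens) for w in FUNCTION_WORDS])

corpus = [
    ("the cat sat on the mat and it is warm in the sun", "author_a"),
    ("it is clear that the plan is good and it is cheap", "author_a"),
    ("of all the things to do in town that one is best", "author_b"),
    ("to go or not to go is the question of the day", "author_b"),
]
X = np.array([features(text) for text, _ in corpus])
y = [author for _, author in corpus]
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([features("it is the best of times and the worst of times")]))
```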
|
507 |
An ontological approach for monitoring and surveillance systems in unregulated markets / Younis Zaki, Mohamed January 2013 (has links)
Ontologies are a key component of information management, as they provide a common representation of any domain. Historically, the finance domain has suffered from a lack of efficiency in managing vast amounts of financial data and a lack of communication and knowledge sharing between analysts. In particular, with the growth of fraud in financial markets, cases are challenging, complex, and involve a huge volume of information. Gathering facts and evidence is often complex. Thus, the impetus for building a financial fraud ontology arises from the continuous improvement and development of financial market surveillance systems with high analytical capabilities to capture fraud, which is essential to guarantee and preserve an efficient market. This thesis proposes an ontology-based approach for financial market surveillance systems. The proposed ontology acts as a semantic representation of mining concepts from unstructured resources and other internet sources (corpus). The ontology contains a comprehensive concept system that can act as a semantically rich knowledge base for a market monitoring system. This could help fraud analysts to understand financial fraud practices, assist open investigations by managing relevant facts gathered for case investigations, provide early detection techniques for fraudulent activities, develop prevention practices, and share manipulation patterns from prosecuted cases with investigators and relevant users. The usefulness of the ontology is evaluated through three case studies, which not only help to explain how manipulation in markets works, but also demonstrate how the ontology can be used as a framework for the extraction process and for capturing information related to financial fraud, to improve the performance of surveillance systems in fraud monitoring. Given that most manipulation cases occur in unregulated markets, this thesis uses a sample of fraud cases from the unregulated markets. On the empirical side, the thesis presents examples of novel applications of text-mining tools and data-processing components, developing off-line surveillance systems that are fully working prototypes which could train the ontology in the most recent manipulation techniques.
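A toy fragment of such an ontology (not the one built in the thesis) could be expressed as RDF triples and queried with SPARQL, for example with rdflib; all class and property names below are hypothetical.

```python
# A toy fragment (not the thesis's ontology) showing how fraud-domain concepts
# and relations could be represented as RDF triples and queried with SPARQL.
# All class and property names here are hypothetical.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

FRAUD = Namespace("http://example.org/fraud#")
g = Graph()
g.bind("fraud", FRAUD)

# concept hierarchy
g.add((FRAUD.PumpAndDump, RDFS.subClassOf, FRAUD.MarketManipulation))
g.add((FRAUD.MarketManipulation, RDFS.subClassOf, FRAUD.FinancialFraud))

# an individual case linked to evidence gathered from text sources
g.add((FRAUD.case42, RDF.type, FRAUD.PumpAndDump))
g.add((FRAUD.case42, FRAUD.hasEvidence, Literal("promotional spam email batch")))

# retrieve every case of a manipulation subtype together with its evidence
query = """
PREFIX fraud: <http://example.org/fraud#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?case ?evidence WHERE {
    ?type rdfs:subClassOf fraud:MarketManipulation .
    ?case a ?type ;
          fraud:hasEvidence ?evidence .
}
"""
for row in g.query(query):
    print(row.case, row.evidence)
```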
|
508 |
Relevance is in the Eye of the Beholder: Design Principles for the Extraction of Context-Aware Information / Castellanos, Arturo 07 July 2016 (has links)
Since the 1970s, many approaches to representing domains have been suggested. Each approach maintains the assumption that the information about the objects represented in the Information System (IS) is specified and verified by domain experts and potential users. Yet, as more IS are developed to support a larger diversity of users such as customers, suppliers, and members of the general public (such as many multi-user online systems), analysts can no longer rely on a stable single group of people for complete specification of domains, to the extent that prior research has questioned the efficacy of conceptual modeling in these heterogeneous settings. We formulated principles for identifying basic classes in a domain. These classes can guide conceptual modeling, database design, and user interface development in a wide variety of traditional and emergent domains. Moreover, we used a case study of a large foster organization to study how unstructured data entry practices result in differences in how information is collected across organizational units. We used institutional theory to show how institutional elements enacted by individuals can generate new practices that can be adopted over time as best practices. We analyzed free-text notes to prioritize potential cases of psychotropic drug use, which was our tactical need. We showed that too much flexibility in how data can be entered into the system results in different styles, which tend to be homogeneous within organizational units but not across them. Theories in Psychology help explain the implications of the level of specificity and the inferential utility of the text encoded in the unstructured notes.
|
509 |
Locating Information in Heterogeneous log files / Localisation d’information dans les fichiers logs hétérogènes / Saneifar, Hassan 02 December 2011 (has links)
Cette thèse s'inscrit dans les domaines des systèmes Question Réponse en domaine restreint, la recherche d'information ainsi que le TALN. Les systèmes de Question Réponse (QR) ont pour objectif de retrouver un fragment pertinent d'un document qui pourrait être considéré comme la meilleure réponse concise possible à une question de l'utilisateur. Le but de cette thèse est de proposer une approche de localisation de réponses dans des masses de données complexes et évolutives décrites ci-dessous. De nos jours, dans de nombreux domaines d'application, les systèmes informatiques sont instrumentés pour produire des rapports d'événements survenant, dans un format de données textuelles généralement appelé fichiers log. Les fichiers logs représentent la source principale d'informations sur l'état des systèmes, des produits, ou encore les causes de problèmes qui peuvent survenir. Les fichiers logs peuvent également inclure des données sur les paramètres critiques, les sorties de capteurs, ou une combinaison de ceux-ci. Ces fichiers sont également utilisés lors des différentes étapes du développement de logiciels, principalement dans un objectif de débogage et de profilage. Les fichiers logs sont devenus un élément standard et essentiel de toutes les grandes applications. Bien que le processus de génération de fichiers logs soit assez simple et direct, l'analyse de fichiers logs pourrait être une tâche difficile qui exige d'énormes ressources de calcul, de temps et de procédures sophistiquées. En effet, il existe de nombreux types de fichiers logs générés dans certains domaines d'application qui ne sont pas systématiquement exploités d'une manière efficace en raison de leurs caractéristiques particulières. Dans cette thèse, nous nous concentrerons sur un type de fichiers logs générés par des systèmes EDA (Electronic Design Automation). Ces fichiers logs contiennent des informations sur la configuration et la conception des Circuits Intégrés (CI) ainsi que les tests de vérification effectués sur eux. Ces informations, très peu exploitées actuellement, sont particulièrement attractives et intéressantes pour la gestion de conception, la surveillance et surtout la vérification de la qualité de conception. Cependant, la complexité de ces données textuelles complexes, c.-à-d. des fichiers logs générés par des outils de conception de CI, rend difficile l'exploitation de ces connaissances. Plusieurs aspects de ces fichiers logs ont été moins soulignés dans les méthodes de TALN et d'Extraction d'Information (EI). Le grand volume de données et leurs caractéristiques particulières limitent la pertinence des méthodes classiques de TALN et d'EI. Dans ce projet de recherche, nous cherchons à proposer une approche qui permet de répondre automatiquement aux questionnaires de vérification de qualité des CI selon les informations se trouvant dans les fichiers logs générés par les outils de conception. Au sein de cette thèse, nous étudions principalement "comment les spécificités de fichiers logs peuvent influencer l'extraction de l'information et les méthodes de TALN ?". Le problème est accentué lorsque nous devons également prendre leurs structures évolutives et leur vocabulaire spécifique en compte. Dans ce contexte, un défi clé est de fournir des approches qui prennent les spécificités des fichiers logs en compte tout en considérant les enjeux qui sont spécifiques aux systèmes QR dans des domaines restreints.
Ainsi, les contributions de cette thèse consistent brièvement en : > Proposer une méthode d'identification et de reconnaissance automatique des unités logiques dans les fichiers logs afin d'effectuer une segmentation textuelle selon la structure des fichiers. Au sein de cette approche, nous proposons un type original de descripteur qui permet de modéliser la structure textuelle et le layout des documents textuels. > Proposer une approche de localisation de réponse (recherche de passages) dans les fichiers logs. Afin d'améliorer la performance de la recherche de passages ainsi que de surmonter certaines problématiques dues aux caractéristiques des fichiers logs, nous proposons une approche d'enrichissement de requêtes. Cette approche, fondée sur la notion de relevance feedback, consiste en un processus d'apprentissage et une méthode de pondération des mots pertinents du contexte qui sont susceptibles d'exister dans les passages adaptés. Cela dit, nous proposons également une nouvelle fonction originale de pondération (scoring), appelée TRQ (Term Relatedness to Query), qui a pour objectif de donner un poids élevé aux termes qui ont une probabilité importante de faire partie des passages pertinents. Cette approche est également adaptée et évaluée dans les domaines généraux. > Etudier l'utilisation des connaissances morpho-syntaxiques au sein de nos approches. A cette fin, nous nous sommes intéressés à l'extraction de la terminologie dans les fichiers logs. Ainsi, nous proposons la méthode Exterlog, adaptée aux spécificités des logs, qui permet d'extraire des termes selon des patrons syntaxiques. Afin d'évaluer les termes extraits et d'en choisir les plus pertinents, nous proposons un protocole de validation automatique des termes qui utilise une mesure fondée sur le Web associée à des mesures statistiques, tout en prenant en compte le contexte spécialisé des logs. / In this thesis, we present contributions to the challenging issues which are encountered in question answering and locating information in complex textual data, like log files. Question answering systems (QAS) aim to find a relevant fragment of a document which could be regarded as the best possible concise answer for a question given by a user. In this work, we are looking to propose a complete solution to locate information in a special kind of textual data, i.e., log files generated by EDA design tools. Nowadays, in many application areas, modern computing systems are instrumented to generate huge reports about occurring events in the format of log files. Log files are generated in every computing field to report the status of systems, products, or even causes of problems that can occur. Log files may also include data about critical parameters, sensor outputs, or a combination of those. Analyzing log files, as an attractive approach for automatic system management and monitoring, has been enjoying a growing amount of attention [Li et al., 2005]. Although the process of generating log files is quite simple and straightforward, log file analysis could be a tremendous task that requires enormous computational resources, a long time and sophisticated procedures [Valdman, 2004]. Indeed, there are many kinds of log files generated in some application domains which are not systematically exploited in an efficient way because of their special characteristics. In this thesis, we are mainly interested in log files generated by Electronic Design Automation (EDA) systems.
Electronic design automation is a category of software tools for designing electronic systems such as printed circuit boards and Integrated Circuits (IC). In this domain, to ensure the design quality, there are some quality check rules which should be verified. Verification of these rules is principally performed by analyzing the generated log files. In the case of large designs, where the design tools may generate megabytes or gigabytes of log files each day, the problem is to wade through all of this data to locate the critical information we need to verify the quality check rules. These log files typically include a substantial amount of data. Accordingly, manually locating information is a tedious and cumbersome process. Furthermore, the particular characteristics of log files, especially those generated by EDA design tools, raise significant challenges for the retrieval of information from the log files. The specific features of log files limit the usefulness of manual analysis techniques and static methods. Automated analysis of such logs is complex due to their heterogeneous and evolving structures and the large non-fixed vocabulary. In this thesis, with each contribution, we answer questions raised in this work by the data specificities or domain requirements. We investigate throughout this work the main concern "how can the specificities of log files influence the information extraction and natural language processing methods?". In this context, a key challenge is to provide approaches that take the log file specificities into account while considering the issues which are specific to QA in restricted domains. We present different contributions as below: > Proposing a novel method to recognize and identify the logical units in the log files to perform a segmentation according to their structure. We thus propose a method to characterize complex logical units found in log files according to their syntactic characteristics. Within this approach, we propose an original type of descriptor to model the textual structure and layout of text documents. > Proposing an approach to locate the requested information in the log files based on passage retrieval. To improve the performance of passage retrieval, we propose a novel query expansion approach to adapt an initial query to all types of corresponding log files and overcome difficulties such as vocabulary mismatch. Our query expansion approach relies on two relevance feedback steps. In the first one, we determine the explicit relevance feedback by identifying the context of questions. The second phase consists of a novel type of pseudo relevance feedback. Our method is based on a new term weighting function, called TRQ (Term Relatedness to Query), introduced in this work, which gives a score to terms of the corpus according to their relatedness to the query. We also investigate how to apply our query expansion approach to documents from general domains. > Studying the use of morpho-syntactic knowledge in our approaches. For this purpose, we are interested in the extraction of terminology in the log files. Thus, we introduce our approach, named Exterlog (EXtraction of TERminology from LOGs), to extract the terminology of log files. To evaluate the extracted terms and choose the most relevant ones, we propose a candidate term evaluation method using a measure, based on the Web and combined with statistical measures, taking into account the context of log files.
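The passage retrieval with query expansion described above can be illustrated with a simplified sketch (the TRQ function itself is not reproduced here): passages are ranked by TF-IDF cosine similarity, and frequent terms from the top-ranked passage serve as a crude pseudo-relevance-feedback expansion of the query. The example passages mimic EDA log lines but are invented.

```python
# A simplified illustration (not the thesis's TRQ formula) of passage retrieval
# with pseudo relevance feedback: the query is expanded with terms taken from
# the top-ranked passage, then the search is rerun.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "timing verification completed with zero setup violations",
    "setup violations detected on clock domain crossing paths",
    "power analysis report generated for the top-level design",
]
query = "setup violations"

vectorizer = TfidfVectorizer()
P = vectorizer.fit_transform(passages)

def rank(q):
    scores = cosine_similarity(vectorizer.transform([q]), P).ravel()
    return scores.argsort()[::-1]

# first pass: initial ranking
first = rank(query)

# pseudo relevance feedback: take terms of the best passage as expansion terms
# (a crude stand-in for the TRQ term-weighting function)
top_passage = passages[first[0]]
expansion = [t for t in top_passage.split() if t not in query.split()][:3]
expanded_query = query + " " + " ".join(expansion)

print("expanded query:", expanded_query)
print("reranked passages:", [passages[i] for i in rank(expanded_query)])
```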
|
510 |
Contribuições para a construção de taxonomias de tópicos em domínios restritos utilizando aprendizado estatístico / Contributions to topic taxonomy construction in a specific domain using statistical learning / Maria Fernanda Moura 26 October 2009 (has links)
A mineração de textos vem ao encontro da realidade atual de se compreender e utilizar grandes massas de dados textuais. Uma forma de auxiliar a compreensão dessas coleções de textos é construir taxonomias de tópicos a partir delas. As taxonomias de tópicos devem organizar esses documentos, preferencialmente em hierarquias, identificando os grupos obtidos por meio de descritores. Construir manual, automática ou semi-automaticamente taxonomias de tópicos de qualidade é uma tarefa nada trivial. Assim, o objetivo deste trabalho é construir taxonomias de tópicos em domínios de conhecimento restrito, por meio de mineração de textos, a fim de auxiliar o especialista no domínio a compreender e organizar os textos. O domínio de conhecimento é restrito para que se possa trabalhar apenas com métodos de aprendizado estatístico não supervisionado sobre representações bag of words dos textos. Essas representações independem do contexto das palavras nos textos e, conseqüentemente, nos domínios. Assim, ao se restringir o domínio espera-se diminuir erros de interpretação dos resultados. A metodologia proposta para a construção de taxonomias de tópicos é uma instanciação do processo de mineração de textos. A cada etapa do processo propõem-se soluções adaptadas às necessidades específicas de construção de taxonomias de tópicos, dentre as quais algumas contribuições inovadoras ao estado da arte. Particularmente, este trabalho contribui em três frentes no estado da arte: seleção de atributos n-gramas em tarefas de mineração de textos, dois modelos para rotulação de agrupamento hierárquico de documentos e modelo de validação do processo de rotulação de agrupamento hierárquico de documentos. Além dessas contribuições, ocorrem outras em adaptações e metodologias de escolha de processos de seleção de atributos, forma de geração de atributos, visualização das taxonomias e redução das taxonomias obtidas. Finalmente, a metodologia desenvolvida foi aplicada a problemas reais, tendo obtido bons resultados. / Text mining provides powerful techniques to help with the current need to understand and organize huge amounts of textual documents. One way to do this is to build topic taxonomies from these documents. Topic taxonomies can be used to organize the documents, preferably in hierarchies, and to identify groups of related documents and their descriptors. Constructing high quality topic taxonomies, either manually, automatically or semi-automatically, is not a trivial task. This work aims to use text mining techniques to build topic taxonomies for well defined knowledge domains, helping the domain expert to understand and organize document collections. By using well defined knowledge domains, only unsupervised statistical methods are used, with a bag-of-words representation for textual documents. These representations are independent of the context of the words in the documents as well as in the domain. Thus, if the domain is well defined, fewer misinterpretations of the results are expected. The proposed methodology for topic taxonomy construction is an instantiation of the text mining process. At each step of the process, some solutions are proposed and adapted to the specific needs of topic taxonomy construction. Among these solutions there are some innovative contributions to the state of the art.
Particularly, this work contributes to the state of the art in three different ways: the selection of n-gram attributes in text mining tasks, two models for hierarchical document cluster labeling, and a validation model for the hierarchical document cluster labeling process. Additional contributions include adaptations and methodologies for choosing attribute selection processes, attribute representation, taxonomy visualization, and reduction of the obtained taxonomies. Finally, the proposed methodology was also validated by successfully applying it to real problems.
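A minimal sketch of the overall pipeline (not the thesis's specific labeling or validation models): hierarchical clustering of TF-IDF vectors, with each group labeled by its highest-weighted terms. The documents and cluster count are assumptions for illustration.

```python
# A minimal sketch of the general pipeline (not the thesis's labeling models):
# hierarchical clustering of TF-IDF vectors, with each group labeled by its
# highest-weighted terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "soil analysis for crop rotation planning",
    "crop yield prediction from soil samples",
    "cattle vaccination schedules and animal health",
    "animal health monitoring in dairy cattle herds",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
terms = vectorizer.get_feature_names_out()

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
for cluster in sorted(set(labels)):
    centroid = X[labels == cluster].mean(axis=0)
    top_terms = [terms[i] for i in np.argsort(centroid)[::-1][:3]]
    print(f"cluster {cluster}: {top_terms}")
```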
|