451 |
Análise de algoritmos de agrupamento para base de dados textuais / Analysis of clustering algorithms for textual databases. Almeida, Luiz Gonzaga Paula de, 31 August 2008 (has links)
Previous issue date: 2008-08-31 / The increasing amount of digitally stored text makes it necessary to develop computational tools that allow access to information and knowledge in an efficient and effective manner. This problem is extremely relevant in biomedical research, since most of the knowledge generated is recorded in scientific articles, and access to these must be as easy and fast as possible.
The research field known as Text Mining deals with the problem of identifying new information and knowledge in text databases. One of its tasks is to find groups of correlated texts in databases, a problem known as text clustering. For clustering, text databases must be transformed into the commonly used Vector Space Model, in which texts are represented by vectors of the occurrence frequencies of the words and terms present in the databases. The set of vectors forms a matrix, named the document-term matrix, which is usually sparse and of high dimension. Normally, to attenuate the problems caused by these features, a subset of terms is selected, giving rise to a new document-term matrix with reduced dimensions, which is then used by the clustering algorithms.
This work presents two algorithms for term selection and an evaluation of the clustering algorithms k-means, spectral clustering and graph partitioning on five pre-classified databases. The databases were pre-processed by previously described methods. The results indicate that the term selection algorithms implemented increased the performance of the clustering algorithms used, and that k-means and spectral clustering outperformed graph partitioning.
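The pipeline this abstract describes, a sparse document-term matrix of word frequencies that is reduced by term selection before clustering, can be illustrated with a minimal sketch. The toy corpus and the document-frequency criterion below are illustrative assumptions, not the two selection algorithms actually proposed in the thesis:

```python
from collections import Counter

def build_doc_term_matrix(docs):
    """Build a document-term matrix (raw term frequencies) from tokenized texts."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for w, c in Counter(doc).items():
            row[index[w]] = c
    return vocab, matrix

def select_terms_by_df(vocab, matrix, k):
    """Keep only the k terms occurring in the most documents (document frequency)."""
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
    keep = sorted(sorted(range(len(vocab)), key=lambda j: -df[j])[:k])
    new_vocab = [vocab[j] for j in keep]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return new_vocab, new_matrix

docs = [["cell", "gene", "protein"],
        ["gene", "protein", "pathway"],
        ["stock", "market"]]
vocab, m = build_doc_term_matrix(docs)
v2, m2 = select_terms_by_df(vocab, m, 2)
print(v2)  # → ['gene', 'protein']
```

The reduced matrix `m2` is what a clustering algorithm such as k-means would then consume instead of the full sparse matrix.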
|
452 |
O fenômeno blockchain na perspectiva da estratégia tecnológica: uma análise de conteúdo por meio da descoberta de conhecimento em texto / The blockchain phenomenon from the perspective of technology strategy: a content analysis through knowledge discovery in text. Fernandes, Marcelo Vighi, 27 August 2018 (has links)
Previous issue date: 2018-08-27 / The Information and Communication Technologies (ICT) revolution made companies realize the importance of technology strategy for their survival. Blockchain is a decentralized transaction and data management technology first developed for the bitcoin digital currency. Interest in blockchain technology has increased since the term was coined, making this phenomenon one of the main topics of research and publication on the Web. The main objective of this work is to understand how the blockchain phenomenon is impacting technology strategy. To that end, an exploratory study was conducted using the Knowledge Discovery in Text (KDT) process, with text mining tools, to collect and analyze the content of a set of news articles published on the Web about blockchain technology. In total, 2,605 English-language web news articles about blockchain, published between 2015 and 2017, were extracted. As a result of the study, 6 propositions were generated, showing that this phenomenon is impacting the technology strategy of the financial industry by directing the focus of this sector toward the implementation of solutions on decentralized architectures. It was also verified that companies' strategic technological focus boosted the development of private blockchain technologies. The benefits brought by this technology to cross-border payment systems, reducing intermediaries and improving processes, were also identified. In addition, it was possible to map out that this technology has the potential to affect transactions through a common electronic platform. Regarding the degree of maturity of this technology, the findings from the news analysis were discussed in light of the theory of the diffusion of innovation, leading to the conclusion that this technology is at the threshold between the Innovators and Early Adopters categories. The map produced by this research will help companies and professionals identify opportunities to target their technology strategies toward blockchain technology.
|
453 |
Towards systems pharmacology models of druggable targets and disease mechanisms. Knight-Schrijver, Vincent, January 2019 (has links)
The development of essential medicines is being slowed by a lack of efficiency in drug development, as ninety per cent of drugs fail at some stage during clinical evaluation. This attrition is seen not because of a reduction in pharmaceutical research expenditure, nor is it caused by a declining understanding of biology; if anything, both are increasing. Instead, drugs are failing because we are unable to effectively predict how they will work before they are given to patients. This is due to the limitations of the current methods used to evaluate a drug's toxicity and efficacy prior to its development; quite simply, these methods do not account for the full complexity of biology in humans. Systems pharmacology models are a likely candidate for increasing the efficiency of drug discovery, as they seek to comprehensively model the fundamental biology of disease mechanisms in a quantitative manner. They are computational models, designed and hailed as a strategy for making well-informed and cost-effective decisions on drug viability and target druggability, and they therefore attempt to reduce this time-consuming and costly attrition. Using text mining and text classification, I present a landscape of systems pharmacology models in the literature, one that has grown from humble roots through step-wise increases in our understanding of biology. Furthermore, I develop a case for the predictive capability of systems pharmacology models by constructing a model of interleukin-6 signalling for rheumatoid arthritis. This model shows that druggable target selection is not necessarily an intuitive task, as it results in an emergent but unanswered hypothesis concerning safety concerns for a monoclonal antibody. Finally, I show that predictive classification models can also be used to explore gene expression data in a novel workflow by attempting to predict patient response classes to an influenza vaccine.
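The text-classification step used above to chart the modelling literature can be sketched with a toy classifier. The corpus, labels, and the choice of multinomial naive Bayes here are illustrative assumptions; the abstract does not specify which classifier the thesis uses:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train a multinomial naive Bayes model over token counts."""
    class_docs = defaultdict(int)      # documents per class
    class_words = defaultdict(Counter) # token counts per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        class_words[label].update(tokens)
        vocab.update(tokens)
    return class_docs, class_words, vocab, sum(class_docs.values())

def classify_nb(model, tokens):
    """Pick the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, class_words, vocab, total = model
    best, best_lp = None, float("-inf")
    for label, ndocs in class_docs.items():
        lp = math.log(ndocs / total)
        denom = sum(class_words[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((class_words[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([(["drug", "target", "protein"], "pharma"),
                  (["drug", "dose", "patient"], "pharma"),
                  (["market", "price"], "finance"),
                  (["price", "stock"], "finance")])
print(classify_nb(model, ["drug", "protein"]))  # → pharma
```

In practice such a model would be trained on abstracts labeled "systems pharmacology model" versus "other" to track the growth of the field.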
|
454 |
Descoberta de conhecimento aplicado à base de dados textual de saúde / Knowledge discovery applied to a textual health database. Barbosa, Alexandre Nunes, 26 March 2012 (has links)
Previous issue date: 2012 / UNISINOS - Universidade do Vale do Rio dos Sinos / This study proposes a process for investigating the content of a database of descriptive and pre-structured data from the health domain, more specifically the area of Rheumatology. For the investigation of the database, three sets of interest were composed. The first was formed by one class of descriptive content related only to the area of Rheumatology in general, and another whose content belongs to other areas of medicine. The second and third sets were constituted after statistical analysis of the database: one formed by the descriptive content associated with the five most frequent ICD codes, and another formed by the descriptive content associated with the three most frequent ICD codes related exclusively to the area of Rheumatology. These sets were pre-processed with classic techniques such as stopword removal and stemming. In order to extract patterns whose interpretation results in knowledge, classification and association techniques were applied to the sets of interest, aiming to relate the textual content that describes symptoms of diseases to the pre-structured content that defines the diagnosis of these diseases. These techniques were implemented by applying the Support Vector Machines classification algorithm and the Apriori algorithm for extracting association rules. For the development of this process, theoretical references concerning data mining were researched, including a survey and review of scientific publications on text mining related to Electronic Medical Records, focusing on the content of the databases used, the pre-processing and mining techniques employed in the literature, and the reported results. The classification technique used in this study achieved over 80% accuracy, demonstrating the algorithm's ability to correctly label health data related to the domain of interest. Associations between textual content and pre-structured content were also discovered which, according to expert analysis, may raise questions about the use of certain ICD codes at the place of origin of the data.
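The Apriori step described above, relating symptom terms to diagnosis codes, can be sketched with a small level-wise implementation. The transaction encoding (each record as a set of symptom terms plus its ICD code) and the toy thresholds are illustrative assumptions:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find frequent itemsets by Apriori's level-wise search."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    def support(items):
        return sum(1 for t in sets if items <= t) / n
    level = {i for i in {frozenset([x]) for t in sets for x in t}
             if support(i) >= min_support}
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # join k-itemsets into (k+1)-itemset candidates, prune by subsets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates
                 if support(c) >= min_support
                 and all(frozenset(sub) in frequent for sub in combinations(c, k))}
        k += 1
    return frequent

def rules(frequent, min_conf):
    """Derive association rules A -> B with confidence >= min_conf."""
    out = []
    for items, sup in frequent.items():
        for r in range(1, len(items)):
            for a in combinations(items, r):
                conf = sup / frequent[frozenset(a)]
                if conf >= min_conf:
                    out.append((set(a), set(items - frozenset(a)), conf))
    return out

tx = [{"joint pain", "stiffness", "M05"},
      {"joint pain", "M05"},
      {"joint pain", "fever", "J11"}]
freq = apriori(tx, 0.6)
for a, b, conf in rules(freq, 0.9):
    print(sorted(a), "->", sorted(b), round(conf, 2))  # → ['M05'] -> ['joint pain'] 1.0
```

A rule whose consequent is an ICD code is exactly the kind of symptom-to-diagnosis association the experts then review.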
|
455 |
Contribuições para a construção de taxonomias de tópicos em domínios restritos utilizando aprendizado estatístico / Contributions to topic taxonomy construction in a specific domain using statistical learning. Moura, Maria Fernanda, 26 October 2009 (has links)
Text mining provides powerful techniques to help with the current need to understand and organize huge amounts of textual documents. One way to do this is to build topic taxonomies from these documents. Topic taxonomies can be used to organize the documents, preferably in hierarchies, and to identify groups of related documents and their descriptors. Constructing high-quality topic taxonomies, whether manually, automatically or semi-automatically, is not a trivial task. This work aims to use text mining techniques to build topic taxonomies for well-defined knowledge domains, helping the domain expert understand and organize document collections. Because the knowledge domains are well defined, only unsupervised statistical methods are used, with a bag-of-words representation for the textual documents. These representations are independent of the context of the words in the documents, and consequently in the domains; thus, when the domain is well defined, fewer misinterpretations of the results are expected. The proposed methodology for topic taxonomy construction is an instantiation of the text mining process. At each step of the process, solutions are proposed and adapted to the specific needs of topic taxonomy construction, among them some innovative contributions to the state of the art. In particular, this work contributes to the state of the art in three ways: the selection of n-gram attributes in text mining tasks, two models for hierarchical document cluster labeling, and a validation model for hierarchical document cluster labeling. Additional contributions include adaptations and methodologies for choosing attribute selection processes, attribute generation, taxonomy visualization, and reduction of the obtained taxonomies. Finally, the proposed methodology was validated by successfully applying it to real problems.
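The n-gram attribute generation mentioned above can be sketched as follows. The toy corpus and the simple document-frequency cutoff are illustrative assumptions; the thesis proposes its own, more elaborate selection criteria:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word n-grams of a tokenized text."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_attributes(docs, n_max=2, min_df=2):
    """Candidate attributes: all 1..n_max-grams occurring in >= min_df documents."""
    df = Counter()
    for tokens in docs:
        # count each n-gram once per document (document frequency)
        seen = {g for n in range(1, n_max + 1) for g in ngrams(tokens, n)}
        df.update(seen)
    return sorted(g for g, c in df.items() if c >= min_df)

docs = [["text", "mining", "process"],
        ["text", "mining", "tasks"],
        ["data", "mining"]]
print(ngram_attributes(docs))  # → ['mining', 'text', 'text mining']
```

Multi-word attributes such as "text mining" often make far better cluster descriptors for taxonomy labeling than their constituent unigrams.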
|
456 |
Investigating data quality in question and answer reports. Mohamed Zaki Ali, Mona, January 2016 (has links)
Data Quality (DQ) has been a long-standing concern for a number of stakeholders in a variety of domains, and has become a critically important factor for the effectiveness of organisations and individuals. Previous work on DQ methodologies has mainly focused on either the analysis of structured data or the business-process level, rather than analysing the data itself. Question and Answer Reports (QAR) are gaining momentum as a way to collect responses that can be used by data analysts, for instance in business, education or healthcare. Various stakeholders benefit from QAR, such as data brokers and data providers, and in order to effectively analyse and identify the common DQ problems in these reports, the various stakeholders' perspectives should be taken into account, which adds further complexity to the analysis. This thesis investigates DQ in QAR through an in-depth DQ analysis and provides solutions that can highlight potential sources and causes of the problems that result in "low-quality" collected data. The thesis proposes a DQ methodology appropriate for the context of QAR, consisting of three modules: question analysis, medium analysis and answer analysis. In addition, a Question Design Support (QuDeS) framework is introduced to operationalise the proposed methodology through the automatic identification of DQ problems. The framework includes three components: question domain-independent profiling, question domain-dependent profiling and answer profiling. The proposed framework has been instantiated to address one example of a DQ issue, namely the Multi-Focal Question (MFQ). We introduce the MFQ as a question with multiple requirements: it asks for multiple answers. QuDeS-MFQ (the implemented instance of the QuDeS framework) implements two components of QuDeS for MFQ identification: question domain-independent profiling and question domain-dependent profiling.
The proposed methodology and framework were designed, implemented and evaluated in the context of the Carbon Disclosure Project (CDP) case study. The experiments show that we can identify MFQs with 90% accuracy. This thesis also discusses the challenges encountered, including the lack of domain resources for domain knowledge representation (such as a domain ontology), the complexity and variability of the structure of QAR, the variability and ambiguity of terminology and language expressions, and understanding stakeholders' or users' needs.
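To make the multi-focal question idea concrete, here is a minimal domain-independent heuristic: flag a question as possibly multi-focal when it contains several interrogative words, several question marks, or a coordinated request. This heuristic is an illustrative assumption, not the profiling actually implemented in QuDeS-MFQ:

```python
import re

WH_WORDS = {"what", "when", "where", "who", "why", "how", "which"}

def looks_multi_focal(question):
    """Heuristic: does this question appear to ask for several answers at once?"""
    q = question.lower()
    wh_count = sum(1 for w in re.findall(r"[a-z']+", q) if w in WH_WORDS)
    # coordinated requests ("and", "as well as") joined to an interrogative
    conj = len(re.findall(r"\band\b|as well as", q))
    return q.count("?") > 1 or wh_count > 1 or (wh_count >= 1 and conj >= 1)

print(looks_multi_focal(
    "What is your total revenue and how do you report emissions?"))  # → True
print(looks_multi_focal("What is your annual revenue?"))             # → False
```

A question flagged this way asks for multiple answers in one slot, which is precisely the kind of design flaw that degrades the quality of the collected responses.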
|
457 |
Towards the French Biomedical Ontology Enrichment / Vers l'enrichissement d'ontologies biomédicales françaises. Lossio-Ventura, Juan Antonio, 09 November 2015 (has links)
Big Data in the biomedical domain poses a major challenge: the analysis of large volumes of heterogeneous data (e.g. video, audio, text, images). Ontologies, conceptual models of reality, can play a crucial role in biomedicine to automate data processing, querying, and the matching of heterogeneous data. Various resources exist in English, but considerably fewer are available in French, and there is a strong lack of related tools and services to exploit them. Initially, ontologies were built manually; in recent years, a few semi-automatic methodologies have been proposed. Semi-automatic construction/enrichment of ontologies is mostly induced from texts using natural language processing (NLP) techniques. NLP methods have to take into account the lexical and semantic complexity of biomedical data: (1) lexical refers to the complex biomedical phrases to be considered, and (2) semantic refers to sense and context induction of the terminology. In this thesis, we propose methodologies for the enrichment/construction of biomedical ontologies based on two main contributions, in order to tackle the previously mentioned challenges. The first contribution concerns the automatic extraction of specialized biomedical terms (lexical complexity) from corpora. New ranking measures for single- and multi-word term extraction methods have been proposed and evaluated. In addition, we present the BioTex software, which implements the proposed measures. The second contribution concerns concept extraction and the semantic linkage of the extracted terminology (semantic complexity). This work seeks to induce semantic concepts for new candidate terms and to find their semantic links, i.e. the most relevant locations for these terms within an existing biomedical ontology. We thus proposed a concept extraction approach that integrates new terms into the MeSH ontology. The quantitative and qualitative evaluations, conducted by experts and non-experts on real data, highlight the relevance of these contributions.
|
458 |
Etude terminologique de la chimie en arabe dans une approche de fouille de textes / A terminological study of chemistry in Arabic using a text mining approach. Albeiriss, Baian, 07 July 2018 (has links)
Despite the importance of an international nomenclature, the field of chemistry still suffers from some linguistic problems, linked in particular to its simple and complex terminological units, which can hinder scientific communication. Arabic is no exception, especially since its agglutinative script, generally written without vowels, raises enormous ambiguity problems, in addition to the recurring use of borrowings. The question is how to represent the simple and complex terminological units of this specialized language; in other words, how to formalize their terminological characteristics by studying the mechanisms of the morphosyntactic construction of chemistry terms in Arabic. This study should lead to the establishment of a semantic disambiguation tool, intended as a tool for extracting Arabic chemistry terms and their relations. A relevant search in Arabic cannot be done without an automated language processing system; automatic processing of corpora written in Arabic cannot be done without linguistic analysis; and this linguistic analysis, more precisely this terminological study, is the basis for building the rules of an identification grammar to identify chemistry terms in Arabic. The construction of this identification grammar requires modelling morphosyntactic patterns from their observation in corpora, and leads to the definition of grammar rules and constraints.
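The core of an identification grammar of the kind described above is matching part-of-speech patterns over tagged text. A minimal sketch follows; the transliterated tokens, tags and the two patterns (NOUN+NOUN for annexation constructs, NOUN+ADJ for noun-adjective terms) are toy assumptions, not the grammar developed in the thesis:

```python
def match_patterns(tagged, patterns):
    """Extract term candidates whose POS-tag sequence matches a grammar pattern."""
    tags = [t for _, t in tagged]
    found = []
    for n in sorted({len(p) for p in patterns}):
        for i in range(len(tagged) - n + 1):
            if tuple(tags[i:i + n]) in patterns:
                found.append(" ".join(w for w, _ in tagged[i:i + n]))
    return found

# Toy transliterated phrase meaning "iron oxide and sulfuric acid"
tagged = [("uksid", "NOUN"), ("alhadid", "NOUN"), ("wa", "CONJ"),
          ("hamd", "NOUN"), ("kibriti", "ADJ")]
patterns = {("NOUN", "NOUN"), ("NOUN", "ADJ")}
print(match_patterns(tagged, patterns))  # → ['uksid alhadid', 'hamd kibriti']
```

In a full system, the pattern set would be induced from corpus observation and enriched with morphological constraints to cope with agglutination and unvowelized forms.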
|
459 |
Biagrupamento heurístico e coagrupamento baseado em fatoração de matrizes: um estudo em dados textuais / Heuristic biclustering and coclustering based on matrix factorization: a study on textual data. Alexandra Katiuska Ramos Diaz, 16 October 2018 (has links)
Biclustering and coclustering are data mining tasks that extract relevant information from data and have been applied successfully in a wide variety of domains, including those involving textual data, the focus of this research. In biclustering and coclustering, similarity criteria are applied simultaneously to the rows and the columns of the data matrices, grouping objects and attributes at the same time and enabling the discovery of biclusters/coclusters. Their definitions nevertheless vary according to their nature and objectives, and coclustering can be seen as a generalization of biclustering. When applied to textual data, these tasks require a vector space model representation, which commonly produces spaces characterized by high dimensionality and sparsity, degrading the performance of many algorithms. This work analyzes the behavior of the Cheng and Church biclustering algorithm and of the Non-Negative Block Value Decomposition (NBVD) coclustering algorithm in the context of textual data. Quantitative and qualitative experimental results are reported for synthetic datasets created with different levels of sparsity and for a real dataset. The results are evaluated in terms of biclustering-specific measures, internal clustering measures computed on the row projections of the biclusters/coclusters, and the information generated. The analysis clarifies the difficulties these algorithms face in the experimental environment, as well as whether they can provide distinctive information that is useful to the field of text mining. Overall, the analyses show that the NBVD algorithm is better suited to high-dimensional, highly sparse datasets. The Cheng and Church algorithm, although it achieved good results with respect to its own objectives, produced results of low relevance in the context of textual data.
|
460 |
Discours de presse et veille stratégique d'évènements. Approche textométrique et extraction d'informations pour la fouille de textes / News Discourse and Strategic Monitoring of Events. Textometry and Information Extraction for Text Mining
MacMurray, Erin, 2 July 2012 (has links)
This research examines two automatic text mining methods, information extraction and Textometry, both applied to the strategic monitoring of economic events. Information extraction identifies and labels units of information, named entities (companies, places, people), which serve as entry points for the analysis of economic activities or events such as mergers, bankruptcies, and partnerships involving these actors. The Textometric method, by contrast, applies a set of statistical models to study the distribution of words in large corpora, with the goal of bringing out the significant characteristics of the textual data. In this research, Textometry, an approach traditionally considered incompatible with extraction-based mining, is substituted for it in order to obtain information on economic events in discourse. Several textometric analyses (characteristic elements and co-occurrences) are carried out on a corpus of digitized news feeds, and their results are compared with the knowledge produced by an information extraction procedure. Each approach contributes differently to the processing of textual data, and the two yield complementary analyses of the corpus. The comparison concludes by presenting what both text mining methods contribute to the strategic monitoring of events.
|