141 |
CASSANDRA: drug gene association prediction via text mining and ontologies
Kissa, Maria 28 January 2015
The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light on complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce CASSANDRA, a fully corpus-based and unsupervised algorithm which uses the MEDLINE indexed titles and abstracts to infer drug gene associations and assist drug repositioning. CASSANDRA measures the Pointwise Mutual Information (PMI) between biomedical terms derived from Gene Ontology (GO) and Medical Subject Headings (MeSH). Based on the PMI scores, drug and gene profiles are generated, and candidate drug gene associations are inferred by computing the relatedness of their profiles.
Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The algorithm can successfully identify direct drug gene associations with high precision and prioritize them over indirect drug gene associations. Validation shows that the profiles statistically derived from the literature perform as well as (and at times better than) the manually curated profiles.
In addition, we examine CASSANDRA’s potential for drug repositioning. For all FDA-approved drugs repositioned over the last 5 years, we generate profiles from publications before 2009 and show that the new indications rank high in these profiles. In summary, co-occurrence based profiles derived from the biomedical literature can accurately predict drug gene associations and provide insights into potential repositioning cases.
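The PMI-based profiling described in this abstract can be illustrated with a minimal sketch. It is not the CASSANDRA implementation: the toy documents, the function names and the use of cosine similarity as the "relatedness" measure are assumptions made for illustration only, since the abstract does not specify them.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): each "abstract" is reduced to the set of terms it mentions.
DOCS = [
    {"aspirin", "inflammation", "PTGS2"},
    {"aspirin", "inflammation"},
    {"PTGS2", "inflammation"},
    {"insulin", "glucose"},
]

def pmi_table(docs):
    """Return PMI (log base 2) for every pair of terms that co-occur in a document."""
    n = len(docs)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    pmi = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_ab = count / n
        pmi[pair] = math.log2(p_ab / ((term_counts[a] / n) * (term_counts[b] / n)))
    return pmi

def profile(entity, pmi):
    """Profile of an entity: PMI score of every term it co-occurs with."""
    prof = {}
    for pair, score in pmi.items():
        if entity in pair:
            (other,) = pair - {entity}
            prof[other] = score
    return prof

def cosine(p, q):
    """Cosine similarity of two sparse profiles (assumed relatedness measure)."""
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

PMI = pmi_table(DOCS)
drug_gene = cosine(profile("aspirin", PMI), profile("PTGS2", PMI))
drug_other = cosine(profile("aspirin", PMI), profile("insulin", PMI))
print(drug_gene, drug_other)
```

In this toy corpus the drug and the gene share a strongly associated term, so their profiles are more related than those of two entities that never share context.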
|
142 |
Identidade e imagem da marca: uma análise comparativa em uma empresa do setor de serviços de telecomunicações / Brand identity and image: a comparative analysis of a company in the telecommunications services sector
Garcia, Fernanda Cunha 20 June 2016
In a highly connected society, eager for information and technological innovation and marked by constantly changing consumption behavior, brand management strategy plays a growing role. As competition among companies intensifies, a brand that manages to differentiate itself in consumers' minds becomes strong. This matters even more in the service sector, where the consumer experience and the definition and maintenance of brand values are vital to the continued strength of both the brand's identity and its image. These aspects are seen as a communication process in which the way the image develops in consumers' minds stems from how the identity is constructed and transmitted to them (DE CHERNATONY; DRURY; SEGAL-HORN, 2004). Given this dynamic and complex scenario, this study aims to identify and analyze possible convergences or divergences between the identity built by the organization and the brand image perceived by consumers of a telecommunications services company. To that end, the model proposed by De Chernatony, Drury and Segal-Horn (2004), which addresses the transformation of identity into brand image, was used as the theoretical basis, specifically from the perspective of Pontes (2009). According to Pontes, customers are more motivated to buy and consume products that they believe have an image complementary to the one they hold of themselves, and he proposes the existence of several selves: the perceived self, which refers to the opinions of the organization's employees and managers about the brand; the ideal self, which concerns the effective brand identity as conceived by its leaders, the vision of what it should be; the social self, which shows how managers think consumers see the brand; the apparent self, formed by customers' image of the brand; and finally the real self, an integrated composite of all these views.
To this end, a case study was conducted at a regionally operating telecommunications company, using a combined qualitative and quantitative approach. The company's view was obtained through semi-structured interviews with marketing managers and through analysis of documents related to brand strategy. The consumers' point of view was examined using text mining techniques applied to unstructured internal data: posts about the brand collected from Facebook and Twitter, and customers' interactions with the company through these same social networks. The results demonstrated the importance of the concepts of brand identity and brand image and how they are interrelated. In addition, the qualitative analysis showed that the view of the marketing executives is very close to and aligned with that of the Brand Book, indicating a cohesive discourse that is well disseminated within the organization. On the other hand, customers made no specific comments about the brand, so it was not possible to determine how consumers evaluate the image of Algar Telecom. Nevertheless, other aspects relevant to consolidating the brand identity could be identified, such as a considerable number of complaints, especially about internet service, and customers' concern about service delivery.
Dissertation (Master's)
|
143 |
Contribution à la méthode de conception inventive par l'extraction automatique de connaissances des textes de brevets d'invention / Toward an automatic extraction of inventive design method knowledge from patents
Souili, Wendemmi Moukassa Achille 31 August 2015
Patents are industrial property titles that give their holders a monopoly over the patented invention. They also contain a kind of history of the evolution of the artifact. In this context, designers very often search patent documents in order to benefit from the knowledge they contain and to structure the inventive process. Developed to assist designers in their innovation efforts, the Inventive Design Method (IDM) is grounded in the dialectical model. IDM has clarified the concepts involved in describing the evolution of technical systems and artifacts. These items are often of interest to designers and are essential for understanding the underlying problem and for collecting all the characteristics that can be acted upon, together with the effect of their variations on the artifact. This thesis first analyzes the patent document from a linguistic point of view in order to establish its typology. It then identifies, within patent documents, the knowledge likely to be useful to IDM and formalizes it in the form of a computer program. The proposed approach comes from text mining. It is based on linguistic markers and uses lexico-syntactic patterns from the field of natural language processing. This method of extracting the concepts useful to IDM makes it possible to establish an initial map of the past and possible future evolutions of the artifact's characteristics. It also greatly facilitates the preliminary analysis of knowledge about the artifact.
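As a rough illustration of what a marker-based lexico-syntactic pattern can look like, the sketch below uses a single regular expression to pull "element, direction of change, parameter" triples out of English sentences. The marker list, the pattern and the function name are hypothetical and far simpler than anything a real patent-mining system would need; they are not taken from the thesis.

```python
import re

# Hypothetical linguistic markers mapped to the direction of the change they signal.
MARKERS = {"increases": "+", "improves": "+", "reduces": "-", "decreases": "-"}

# Pattern: determiner + element + marker verb + (optional "the") + parameter.
PATTERN = re.compile(
    r"\b(?:the|a|an)\s+(?P<element>[a-z][a-z ]*?)\s+"
    r"(?P<marker>" + "|".join(MARKERS) + r")\s+(?:the\s+)?"
    r"(?P<parameter>[a-z][a-z ]*?)(?=\s+of\b|[.,;]|$)",
    re.IGNORECASE,
)

def extract_effects(sentence):
    """Return (element, direction, parameter) triples found in one sentence."""
    return [
        (m["element"].lower(), MARKERS[m["marker"].lower()], m["parameter"].lower())
        for m in PATTERN.finditer(sentence)
    ]

print(extract_effects("The thicker wall increases the rigidity of the housing."))
print(extract_effects("A porous coating reduces friction."))
```

Triples of this kind could serve as raw material for a map of how an artifact's characteristics vary, though real patent text would require part-of-speech information and many more patterns.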
|
144 |
CSR i VD:n har ordet – En kvantitativ innehållsanalys på OMXS30 / CSR in the CEO-letters - A quantitative content analysis on OMXS30
Nilsson, Henrik, Palmgren, Marcus, Ngorsungnoen, Martin January 2017
Annual reports are a frequent object of study in accounting research. They are an important communication channel between a company and its various stakeholders. The primary purpose of the report is to give stakeholders information about the company's earnings and financial position. The character of annual reports has evolved over the years and has been enriched with, among other things, pictures and additional textual sections. This study examines the narrative section of the annual report known as the 'CEO-letter' (Swedish: 'VD:n har ordet'). The section is an important part of the annual report's voluntary elements and is one of the most widely read parts. At the same time, the CEO-letter, unlike most other parts of the annual report, is unregulated. This gives the company and its chief executive officer considerable opportunity to communicate legitimacy-creating messages to stakeholders. Previous research shows that activities related to CSR have a positive impact on stakeholders' attitudes towards the company. CSR is a broad concept that covers the relationship and the responsibilities between companies and society. Successfully signaling these activities to stakeholders has been shown to be of great importance to companies. This study aims to give better insight into how Swedish companies use the CEO-letter to highlight CSR. It also examines whether the use of CSR-related terms in CEO-letters has increased. The study finds no correlation between companies' profit margins and the share of CSR terms in the CEO-letter. Nor could a relationship be discerned between the share of CSR terms and the share of terms linked to financial results. On the other hand, we were able to show a linear increase in CSR terms between 2006 and 2015. Terms linked to results have decreased, albeit not linearly. Please note that the thesis is written in Swedish.
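The two quantities this abstract relies on, the share of CSR terms in a text and a linear trend over years, can be sketched as follows. The term list, the sample data and the function names are hypothetical; the thesis's actual dictionary and statistical procedure are not given in the abstract.

```python
import re

# Hypothetical CSR dictionary; the study's real term list is not part of the abstract.
CSR_TERMS = {"sustainability", "environment", "responsibility", "climate"}

def csr_share(text, terms=CSR_TERMS):
    """Fraction of word tokens in the text that belong to the CSR dictionary."""
    words = re.findall(r"[a-zåäö]+", text.lower())
    return sum(w in terms for w in words) / len(words) if words else 0.0

def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept; assumes at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

share = csr_share("Our climate work and sustainability goals")
slope, intercept = linear_trend([2006, 2007, 2008], [0.01, 0.02, 0.03])
print(share, slope)
```

A positive slope over the yearly shares would correspond to the kind of increase the abstract reports; whether the increase is statistically linear requires a proper regression test, which this sketch omits.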
|
145 |
Finding conflicting statements in the biomedical literature
Sarafraz, Farzaneh January 2012
The main archive of life sciences literature currently contains more than 18,000,000 references, and it is virtually impossible for any human to stay up-to-date with this large number of papers, even in a specific sub-domain. Not every fact that is reported in the literature is novel and distinct. Scientists report repeat experiments, or refer to previous findings. Given the large number of publications, it is not surprising that information on certain topics is repeated over a number of publications. From consensus to contradiction, there are all shades of agreement between the claimed facts in the literature, and considering the volume of the corpus, conflicting findings are not unlikely. Finding such claims is particularly interesting for scientists, as they can present opportunities for knowledge consolidation and future investigations. In this thesis we present a method to extract and contextualise statements about molecular events as expressed in the biomedical literature, and to find those that potentially conflict with each other. The approach uses a system that detects event negations and speculation, and combines those with contextual features (e.g. type of event, species, and anatomical location) to build a representational model for establishing relations between different biological events, including relations concerning conflicts. In the detection of negations and speculations, rich lexical, syntactic, and semantic features have been exploited, including the syntactic command relation. Different parts of the proposed method have been evaluated in the context of the BioNLP 09 challenge. The average F-measures for event negation and speculation detection were 63% (with precision of 88%) and 48% (with precision of 64%) respectively.
An analysis of a set of 50 extracted event pairs identified as potentially conflicting revealed that 32 of them showed some degree of conflict (64%); 10 event pairs (20%) needed a more complex biological interpretation to decide whether there was a conflict. We also provide an open source integrated text mining framework for extracting events and their context on a large-scale basis using a pipeline of tools that are available or have been developed as part of this research, along with 72,314 potentially conflicting molecular event pairs that have been generated by mining the entire body of accessible biomedical literature. We conclude that, whilst automated conflict mining would need more comprehensive context extraction, it is feasible to provide a support environment for biologists to browse potentially conflicting statements and facilitate data and knowledge consolidation.
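The core idea of pairing events that agree on context but differ in polarity can be sketched in a few lines. The `Event` fields, the sample events and the matching rule below are simplifying assumptions for illustration; the thesis's representational model is richer and also handles speculation.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Event:
    event_type: str   # e.g. "expression"
    theme: str        # the molecule the event is about
    species: str      # contextual feature
    location: str     # contextual feature (anatomical location)
    negated: bool     # output of a negation detector
    source: str       # hypothetical document identifier

def potential_conflicts(events):
    """Pairs of events that describe the same claim in the same context with opposite polarity."""
    pairs = []
    for a, b in combinations(events, 2):
        same_claim = (a.event_type, a.theme) == (b.event_type, b.theme)
        same_context = (a.species, a.location) == (b.species, b.location)
        if same_claim and same_context and a.negated != b.negated:
            pairs.append((a, b))
    return pairs

EVENTS = [
    Event("expression", "geneX", "human", "liver", False, "doc-1"),
    Event("expression", "geneX", "human", "liver", True, "doc-2"),
    Event("expression", "geneX", "mouse", "liver", True, "doc-3"),
]
CONFLICTS = potential_conflicts(EVENTS)
print([(a.source, b.source) for a, b in CONFLICTS])
```

The third event is not flagged because its species differs, which mirrors the abstract's point that context is needed before two opposite statements can be called conflicting.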
|
146 |
Discovering relations between indirectly connected biomedical concepts: Research Article
Tsatsaronis, George, Weissenborn, Dirk, Schroeder, Michael 04 January 2016
BACKGROUND:
The complexity and scale of the knowledge in the biomedical domain have motivated research work towards mining heterogeneous data from both structured and unstructured knowledge bases. To this end, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses this problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations, that potentially characterize a biomedical relation are mined using distant supervision.
RESULTS:
It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation, two frequent biomedical relations, namely 'has target' and 'may treat', are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8, a substantial improvement over random classification, which shows that good predictions can be prioritized by following the suggested approach.
CONCLUSIONS:
Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.
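A path pattern in the sense used above is the sequence of relation labels along a path between two concepts. The sketch below enumerates such patterns in a tiny directed graph; the edges, concept names and the length limit are invented for illustration, and the learning step that would consume these patterns as features is left out.

```python
from collections import defaultdict

# Hypothetical knowledge graph: (source concept, relation, target concept).
EDGES = [
    ("drugA", "inhibits", "proteinX"),
    ("proteinX", "involved_in", "diseaseY"),
    ("drugA", "similar_to", "drugB"),
    ("drugB", "may_treat", "diseaseY"),
]

def build_graph(edges):
    """Adjacency list mapping each concept to its outgoing (relation, target) edges."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph

def path_patterns(graph, start, goal, max_len=3):
    """Relation sequences of all simple paths from start to goal with at most max_len edges."""
    patterns = []
    stack = [(start, (), frozenset({start}))]
    while stack:
        node, relations, seen = stack.pop()
        if node == goal and relations:
            patterns.append(relations)
            continue
        if len(relations) == max_len:
            continue
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                stack.append((nxt, relations + (relation,), seen | {nxt}))
    return patterns

GRAPH = build_graph(EDGES)
PATTERNS = path_patterns(GRAPH, "drugA", "diseaseY")
print(sorted(PATTERNS))
```

Under distant supervision, pairs already known to stand in a relation such as 'may treat' would label these patterns as positive examples for a classifier.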
|
147 |
Release of the MySQL based implementation of the CTS protocol
Tiepmar, Jochen January 2016
In a project called "A Library of a Billion Words" we needed an implementation of the CTS (Canonical Text Services) protocol that is capable of handling a text collection containing at least 1 billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities without being compliant with the specifications of CTS (Tiepmar et al., 2013). The purpose of this paper is to describe and evaluate the MySQL based implementation now that it fulfills the specifications version 5.0 rc.1, and to mark it as finished and ready to use. Further information, online instances of CTS for all described datasets and binaries can be accessed via the project's website.
Reference
Tiepmar J, Teichmann C, Heyer G, Berti M and Crane G. 2013. A new Implementation for Canonical Text Services. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH).
|
148 |
Status Quo der Textanalyse im Rahmen der Business Intelligence / The status quo of text analysis in the context of business intelligence
Schieber, Andreas, Hilbert, Andreas January 2014
Against the background of the growing volume of unstructured data available to companies, this paper addresses the opportunities that business intelligence opens up for companies when targeted analysis makes it possible to capture, filter and evaluate the meaning of these data. In general, the goal of business intelligence is to support decisions made within the company (on the basis of structured data). The additional analysis of unstructured data, i.e. internal company documents or texts from the Web 2.0, increases this potential and serves to broaden business understanding and improve decision-making. The paper not only explains the concepts and methods that enable such analyses, but also presents case examples that demonstrate their usefulness.
1 Introduction
2 Business Intelligence
2.1 Definition
2.2 Framework
2.3 Analysis-oriented BI and data mining
3 Text Mining
3.1 Points of contact with other disciplines
3.2 Definition
3.3 Process model according to HIPPNER & RENTZMANN (2006a)
3.3.1 Task definition
3.3.2 Document selection
3.3.3 Document preparation
3.3.4 Text mining methods
3.3.5 Interpretation / evaluation
3.3.6 Application
4 Potential of text analysis
4.1 Extending CRM
4.2 Alternative to market research
5 Conclusion and outlook
References
|
149 |
Entwicklung eines generischen Vorgehensmodells für Text Mining / Development of a generic process model for text mining
Schieber, Andreas, Hilbert, Andreas 29 April 2014
Against the background of growing interest in computer-assisted text analysis in research and practice, this paper develops a generic process model for text mining processes on the basis of current literature. The aim of the paper is to structure the extensive activities involved and thereby reduce the complexity of text mining projects. The research goal rests on the fact that a systematic literature review conducted beforehand identified no detailed, application-neutral process models for text mining. Building on the findings of the literature review, the resulting model therefore contains both inductively derived components from specific approaches and components deduced from literature-based requirements. The evaluation of the artifact confirms the usefulness of the process model in comparison with the previous state of research.
1 Introduction
1.1 Motivation
1.2 Research goal and methodology
1.2.1 Systematic literature review
1.2.2 Design science research approach
1.3 Structure of the paper
2 State of research
2.1 Terminology
2.2 Characteristics of process models for text mining
2.3 Activities in the text mining process
2.4 Summary
3 Requirements for a generic process model
3.1 Structural requirements
3.2 Functional requirements
3.3 Summary
4 Development of the model
4.1 Task definition
4.2 Document selection and examination
4.3 Document preparation
4.3.1 Linguistic preparation
4.3.2 Technical preparation
4.4 Text mining methods
4.5 Evaluation of results
4.6 Application
4.7 Summary
4.7.1 Overall model
4.7.2 Feedback loops
5 Evaluation
5.1 Evaluation design
5.2 Measurement and analysis
6 Conclusion and outlook
References
Appendix
A1 Application-neutral process models
A2 Effects of lemmatization and stemming on the interpretability of texts
A3 Overall model
|
150 |
Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources
Hänig, Christian 17 April 2013
This thesis aims to develop a Relation Extraction algorithm to extract knowledge from automotive data. Most approaches to Relation Extraction are evaluated only on newspaper data dealing with general relations from the business world, and their applicability to other data sets is not well studied.
Part I of this thesis deals with theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text which then can be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data.
The task of Relation Extraction can essentially be approached with pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents the machine learning approaches and linguistic foundations that are essential for syntactic annotation of textual data and for Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data. The findings are that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data. There are several reasons for this. Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on.
Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task, and algorithms achieving accurate tagging exist. None of them disambiguates high-frequency words, however; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, so it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous, high-frequency words bearing semantic information (e.g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type’s different syntactic functions. Evaluation shows that tagging accuracy is raised significantly.
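To give a feel for context clustering, the sketch below groups the occurrences of an ambiguous word by their immediate left and right neighbours. It is a deliberately crude stand-in, assuming a greedy Jaccard-overlap rule, a hand-picked threshold and invented sentences; it is not the algorithm developed in the thesis.

```python
def contexts(sentences, target):
    """One feature set (left and right neighbour) per occurrence of the target word."""
    features = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                features.append(frozenset({("L", left), ("R", right)}))
    return features

def cluster(feature_sets, threshold=0.3):
    """Greedy clustering: join the first cluster whose pooled features overlap enough."""
    clusters = []
    for fs in feature_sets:
        for members in clusters:
            pooled = frozenset().union(*members)
            if len(fs & pooled) / len(fs | pooled) >= threshold:
                members.append(fs)
                break
        else:
            clusters.append([fs])
    return clusters

# Invented repair-order style sentences in which "pump" is used as noun and as verb.
SENTENCES = [
    ["the", "pump", "failed"],
    ["to", "pump", "fuel"],
    ["the", "pump", "leaks"],
    ["to", "pump", "air"],
]
CLUSTERS = cluster(contexts(SENTENCES, "pump"))
print(len(CLUSTERS), [len(c) for c in CLUSTERS])
```

Each resulting cluster can be read as one candidate syntactic function of the word, which a tagger could then treat as a separate word type.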
An approach to unsupervised syntactic parsing (Chapter 8) is developed to meet the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant to Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction more than deep parsing does. Distinguishing endocentric from exocentric constructions improves phrase labeling. The proposed parser, unsuParse, uses the preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all the required criteria and achieves competitive results on standard evaluation setups.
Syntactic Relation Extraction (Chapter 9) is an approach that exploits syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on the entity distributions found in a corpus and thus makes it possible to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types of the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the task of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts.
To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real world data unveiling challenges that do not occur in high quality newspaper corpora.
|