141 |
Dealing with unstructured data : A study about information quality and measurement / Hantera ostrukturerad data : En studie om informationskvalitet och mätning
Vikholm, Oskar January 2015
Many organizations have realized that the growing amount of unstructured text may contain information that can be used for different purposes, such as decision-making. By using so-called text mining tools, organizations can extract information from text documents. Within military and intelligence activities, for example, it is important to go through reports and look for entities such as names of people, events, and the relationships between them when criminal or other interesting activities are being investigated and mapped. This study explores how information quality can be measured and what challenges this involves. It builds on Wang and Strong's (1996) theory of how information quality can be measured. The theory is tested and discussed against empirical material consisting of interviews from two case organizations. The study identifies two important aspects to take into consideration when measuring information quality: context dependency and source criticism. Context dependency means that the context in which information quality is measured must be defined based on the consumer's needs. Source criticism means that it is important to take the information's original source, and how reliable it is, into consideration. Further, the terms data quality and information quality are often used interchangeably, which means that organizations need to decide which of the two they really want to measure. One of the major challenges in developing software for entity extraction is that the system needs to understand the structure of natural language, which is very complicated.
|
142 |
CASSANDRA: drug gene association prediction via text mining and ontologies
Kissa, Maria 28 January 2015
The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light on complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce CASSANDRA, a fully corpus-based and unsupervised algorithm which uses the MEDLINE indexed titles and abstracts to infer drug-gene associations and assist drug repositioning. CASSANDRA measures the Pointwise Mutual Information (PMI) between biomedical terms derived from Gene Ontology (GO) and Medical Subject Headings (MeSH). Based on the PMI scores, drug and gene profiles are generated, and candidate drug-gene associations are inferred by computing the relatedness of their profiles.
Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The algorithm can successfully identify direct drug-gene associations with high precision and prioritize them over indirect drug-gene associations. Validation shows that the profiles derived statistically from the literature perform as well as (and at times better than) the manually curated profiles.
In addition, we examine CASSANDRA's potential for drug repositioning. For all FDA-approved drugs repositioned over the last 5 years, we generate profiles from publications before 2009 and show that the new indications rank high in these profiles. In summary, co-occurrence-based profiles derived from the biomedical literature can accurately predict drug-gene associations and provide insights into potential repositioning cases.
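As a rough illustration of the co-occurrence scoring mentioned in the abstract, the sketch below estimates PMI between term pairs from document frequencies. The toy documents, the term names, and the `pmi_scores` function are all hypothetical; the abstract does not describe how CASSANDRA estimates probabilities or compares profiles, so this is only a generic PMI example, not the author's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs):
    """Return PMI for every pair of terms that co-occur in at least one document.

    docs: list of sets of terms, one set per document.
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with probabilities estimated
    as the fraction of documents containing the term(s).
    """
    n = len(docs)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n
        scores[pair] = math.log2(p_xy / ((term_counts[x] / n) * (term_counts[y] / n)))
    return scores

# Toy, made-up annotations (assumption: not real MEDLINE data).
toy_docs = [
    {"aspirin", "inflammation"},
    {"aspirin", "inflammation"},
    {"aspirin", "apoptosis"},
    {"apoptosis"},
]
scores = pmi_scores(toy_docs)
```

In this toy corpus the pair that co-occurs more often than chance would predict gets a positive score, and the pair that co-occurs less often gets a negative one. A profile in the sense of the abstract could then be a vector of such scores for one drug or gene, but that step is left out here.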
|
143 |
Identidade e imagem da marca: uma análise comparativa em uma empresa do setor de serviços de telecomunicações
Garcia, Fernanda Cunha 20 June 2016
In a highly connected society that is hungry for information and technological innovation and whose consumption patterns are constantly changing, brand management strategy occupies an increasingly important place. As competition among companies intensifies, a brand that manages to differentiate itself in consumers' minds becomes strong. This is even more important in the service sector, where the consumer experience and the definition and upholding of the brand's values are vital to the continued strength of both its identity and its image. These aspects are seen as a communication process in which the way the image develops in consumers' minds stems from how the identity is constructed and conveyed to them (DE CHERNATONY; DRURY; SEGAL-HORN, 2004). Given this dynamic and complex scenario, this study aims to identify and analyze possible convergences or divergences between the identity built by the organization and the brand image perceived by consumers of a telecommunications services company. To this end, the model proposed by De Chernatony, Drury and Segal-Horn (2004), which addresses the transformation of identity into brand image, was used as the theoretical basis, more specifically from the point of view of Pontes (2009). According to Pontes, customers are more motivated to buy and consume products that they believe have an image complementary to the one they have of themselves, and he proposes the existence of several selves: the perceived self, which refers to the opinions of the organization's employees and managers about the brand; the ideal self, which concerns the effective brand identity conceived by its leaders, the vision of what it should be; the social self, which shows how managers think consumers see the brand; the apparent self, formed by customers' image of the brand; and finally the real self, an integrated composite of all these views.
A case study was conducted at a regionally operating telecommunications company, using a combined qualitative and quantitative approach. The company's view was obtained through semi-structured interviews with marketing managers and through analysis of documents related to brand strategy. The consumers' point of view was examined using text mining techniques applied to internal unstructured data, drawn from brand-related posts collected on Facebook and Twitter and from customers' interactions with the company through these social networks. The results demonstrated the importance of the concepts of brand identity and brand image and how they are interrelated. In addition, the qualitative analysis showed that the marketing executives' view is very close to and aligned with the Brand Book, indicating a cohesive discourse that is well disseminated within the organization. On the other hand, no specific comments about the brand were observed in the customers' posts, and it was therefore not possible to identify how consumers evaluate the image of Algar Telecom. Nevertheless, other aspects relevant to consolidating the brand identity could be identified, such as a considerable number of complaints, especially about internet service, as well as customers' concern about service provision. / Dissertation (Master's)
|
144 |
Contribution à la méthode de conception inventive par l'extraction automatique de connaissances des textes de brevets d'invention / Toward an automatic extraction of inventive design method knowledge from patents
Souili, Wendemmi Moukassa Achille 31 August 2015
Patents are industrial property titles that give their holders a monopoly over the patented invention. They also contain a kind of history of the artifact's evolution. Designers therefore often search patent documents in order to draw on the knowledge they contain and to structure the inventive process. Developed to assist designers in their innovation approach, the Inventive Design Method (IDM) follows the model of dialectics. IDM has specified the concepts involved in describing the evolution of technical systems and artifacts. These items are often of interest to designers and are essential to understanding the underlying problem and to collecting all the characteristics that can be acted upon, together with the effect of their variations on the artifact. This thesis first analyzes the patent document from a linguistic point of view in order to establish its typology. It then identifies, within patent documents, the knowledge likely to be useful to IDM and formalizes it as a computer program. The proposed approach comes from text mining: it is based on linguistic markers and uses lexico-syntactic patterns from the field of natural language processing. This method of extracting concepts useful to IDM makes it possible to establish a kind of initial map of past and possible changes in the artifact's characteristics. It also greatly facilitates the preliminary analysis of knowledge about the artifact.
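To make the idea of linguistic markers concrete, here is a minimal sketch of a single lexico-syntactic pattern applied to patent-like sentences. The markers ("in order to", "so as to"), the sample text, and the `extract_goals` helper are assumptions chosen for illustration; the abstract does not list the markers or patterns actually used in the thesis.

```python
import re

# Hypothetical marker pattern: a purpose marker followed by the phrase it
# introduces, captured up to the next punctuation mark.
MARKER = re.compile(r"\b(?:in order to|so as to)\s+([^.,;]+)", re.IGNORECASE)

def extract_goals(text):
    """Return the phrases introduced by a purpose marker in the text."""
    return [match.group(1).strip() for match in MARKER.finditer(text)]

# Made-up sentences in the style of a patent description.
sample = (
    "The blade is coated in order to reduce friction. "
    "The housing is ribbed so as to increase stiffness, without adding mass."
)
goals = extract_goals(sample)
print(goals)  # ['reduce friction', 'increase stiffness']
```

A real system would need many such patterns plus part-of-speech information to separate parameters, problems, and partial solutions; a bare regular expression like this one only shows the principle.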
|
145 |
CSR i VD:n har ordet – En kvantitativ innehållsanalys på OMXS30 / CSR in the CEO-letters - A quantitative content analysis on OMXS30
Nilsson, Henrik, Palmgren, Marcus, Ngorsungnoen, Martin January 2017
Annual reports are a frequent object of study in accounting research. They are an important communication channel between a company and its various stakeholders. The primary purpose of the report is to give stakeholders information about the company's earnings and financial position. The character of annual reports has evolved over the years and has been enriched with, among other things, pictures and supplementary textual sections. This study examines the narrative part of the annual report known as the 'CEO letter' ('VD:n har ordet'). The section is an important part of the annual report's voluntary elements and is one of the most widely read. At the same time, unlike most other parts of the annual report, the CEO letter is unregulated. This gives the company and the CEO considerable scope to communicate legitimacy-creating messages to stakeholders. Previous research shows that activities related to CSR have a positive impact on stakeholders' attitudes towards the company. CSR is a broad concept that covers the relationship and responsibilities between companies and society. Successfully signalling these activities to stakeholders has proven to be of great importance to companies. This study aims to give better insight into how Swedish companies use the CEO letter to highlight CSR. It also investigates whether the use of CSR-related terms in CEO letters has increased. The study finds no correlation between companies' profit margin and the share of CSR terms in the CEO letter. Nor could a relationship be discerned between the share of CSR terms and the share of terms linked to financial results. However, the study found a linear increase in CSR terms between 2006 and 2015. Terms linked to results decreased, albeit not linearly. Please note that the thesis is written in Swedish.
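The "share of CSR terms" in a CEO letter can be pictured as a simple ratio of dictionary hits to total words. The sketch below assumes a tiny, invented term list and a naive tokenizer; the thesis's actual word list, preprocessing, and statistical tests are not given in the abstract.

```python
import re

# Hypothetical CSR dictionary (assumption: the study's real term list is not shown).
CSR_TERMS = {"sustainability", "environment", "responsibility"}

def csr_share(text, terms=CSR_TERMS):
    """Return the fraction of word tokens in the text that are CSR terms."""
    tokens = re.findall(r"[a-zåäö]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in terms for token in tokens) / len(tokens)

# Made-up sentence standing in for a CEO letter.
share = csr_share("Sustainability and profit: our responsibility.")
print(share)  # 0.4
```

Computing this share per company and year would yield the kind of series on which a trend over 2006 to 2015 could then be tested.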
|
146 |
Finding conflicting statements in the biomedical literature
Sarafraz, Farzaneh January 2012
The main archive of life sciences literature currently contains more than 18,000,000 references, and it is virtually impossible for any human to stay up to date with this large number of papers, even in a specific sub-domain. Not every fact that is reported in the literature is novel and distinct. Scientists report repeat experiments, or refer to previous findings. Given the large number of publications, it is not surprising that information on certain topics is repeated over a number of publications. From consensus to contradiction, there are all shades of agreement between the claimed facts in the literature, and considering the volume of the corpus, conflicting findings are not unlikely. Finding such claims is particularly interesting for scientists, as they can present opportunities for knowledge consolidation and future investigations. In this thesis we present a method to extract and contextualise statements about molecular events as expressed in the biomedical literature, and to find those that potentially conflict with each other. The approach uses a system that detects event negation and speculation, and combines these with contextual features (e.g. type of event, species, and anatomical location) to build a representational model for establishing relations between different biological events, including relations concerning conflicts. In the detection of negation and speculation, rich lexical, syntactic, and semantic features have been exploited, including the syntactic command relation. Different parts of the proposed method have been evaluated in the context of the BioNLP 09 challenge. The average F-measures for event negation and speculation detection were 63% (with precision of 88%) and 48% (with precision of 64%) respectively.
An analysis of a set of 50 extracted event pairs identified as potentially conflicting revealed that 32 of them (64%) showed some degree of conflict; 10 event pairs (20%) needed a more complex biological interpretation to decide whether there was a conflict. We also provide an open source integrated text mining framework for extracting events and their context on a large scale, using a pipeline of tools that are available or have been developed as part of this research, along with 72,314 potentially conflicting molecular event pairs that have been generated by mining the entire body of accessible biomedical literature. We conclude that, whilst automated conflict mining would need more comprehensive context extraction, it is feasible to provide a support environment for biologists to browse potentially conflicting statements and facilitate data and knowledge consolidation.
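The abstract reports F-measure and precision but not recall. Assuming the F-measure is the balanced F1 score, recall can be back-calculated as shown below. Because the reported figures are averages, the result is only indicative, and `implied_recall` is a hypothetical helper rather than anything from the thesis.

```python
def implied_recall(f1, precision):
    """Solve F1 = 2 * P * R / (P + R) for R, given F1 and P."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for these values")
    return f1 * precision / denominator

# Figures quoted in the abstract (F-measure, precision).
negation_recall = implied_recall(0.63, 0.88)
speculation_recall = implied_recall(0.48, 0.64)
```

Under that assumption the implied recall is roughly 0.49 for negation detection and 0.38 for speculation detection, which suggests that both detectors are tuned more towards precision than coverage.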
|
147 |
Discovering relations between indirectly connected biomedical concepts: Research Article
Tsatsaronis, George, Weissenborn, Dirk, Schroeder, Michael 04 January 2016
BACKGROUND:
The complexity and scale of the knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases. To this end, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses this problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations that potentially characterize a biomedical relation, are mined using distant supervision.
RESULTS:
It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation, two frequent biomedical relations, namely 'has target' and 'may treat', are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC of up to 0.8, a great improvement over random classification, which shows that good predictions can be prioritized by following the suggested approach.
CONCLUSIONS:
Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.
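The notion of a path pattern can be sketched on a toy graph: each pattern is the sequence of relation labels along one simple path between two concepts. The graph, the labels, and `path_patterns` below are invented for illustration, and edges are treated as directed; the paper's graph construction and its distant-supervision learning step are not reproduced.

```python
from collections import defaultdict

def path_patterns(edges, source, target, max_len=3):
    """Collect relation-label sequences along simple paths from source to target.

    edges: iterable of (head, relation, tail) triples, treated as directed.
    max_len: maximum number of edges allowed in a path.
    """
    graph = defaultdict(list)
    for head, relation, tail in edges:
        graph[head].append((relation, tail))
    patterns = set()

    def walk(node, visited, labels):
        if node == target and labels:
            patterns.add(tuple(labels))
            return
        if len(labels) == max_len:
            return
        for relation, neighbour in graph[node]:
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, labels + [relation])

    walk(source, {source}, [])
    return patterns

# Made-up triples (assumption: not taken from the paper's knowledge graph).
toy_edges = [
    ("drugA", "inhibits", "proteinX"),
    ("proteinX", "associated_with", "diseaseY"),
    ("drugA", "similar_to", "drugB"),
    ("drugB", "may_treat", "diseaseY"),
]
patterns = path_patterns(toy_edges, "drugA", "diseaseY")
```

Here two indirect patterns connect the drug to the disease. In a learning setup, counts of such patterns for concept pairs known to stand in a relation could serve as features for a classifier.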
|
148 |
Release of the MySQL based implementation of the CTS protocol
Tiepmar, Jochen January 2016
In a project called "A Library of a Billion Words" we needed an implementation of the CTS protocol capable of handling a text collection containing at least 1 billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities, which was not yet compliant with the CTS specifications (Tiepmar et al., 2013). The purpose of this paper is to describe and evaluate the MySQL-based implementation now that it fulfils version 5.0 rc.1 of the specifications, and to mark it as finished and ready to use. Further information, online instances of CTS for all described datasets, and binaries can be accessed via the project's website.
Reference: Tiepmar J, Teichmann C, Heyer G, Berti M and Crane G. 2013. A new Implementation for Canonical Text Services. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH).
|
149 |
Status Quo der Textanalyse im Rahmen der Business Intelligence
Schieber, Andreas, Hilbert, Andreas January 2014
Against the background of the growing volume of unstructured data available to companies, this paper examines the opportunities that Business Intelligence opens up when targeted analysis is used to capture, filter, and evaluate the meaning of these data. In general, the goal of Business Intelligence is to support decisions made in the company (on the basis of structured data). Additionally analysing unstructured data, i.e. internal company documents or texts from the Web 2.0, increases this potential and serves to broaden business understanding and improve decision-making. The paper not only explains the concepts and methods that enable these analyses but also presents case examples that demonstrate their usefulness.
1 Introduction
2 Business Intelligence
2.1 Definition
2.2 Framework
2.3 Analysis-oriented BI and data mining
3 Text Mining
3.1 Points of contact with other disciplines
3.2 Definition
3.3 Process model according to HIPPNER & RENTZMANN (2006a)
3.3.1 Task definition
3.3.2 Document selection
3.3.3 Document preparation
3.3.4 Text mining methods
3.3.5 Interpretation / evaluation
3.3.6 Application
4 Potential of text analysis
4.1 Extending CRM
4.2 Alternative to market research
5 Conclusion and outlook
References
|
150 |
Entwicklung eines generischen Vorgehensmodells für Text Mining
Schieber, Andreas, Hilbert, Andreas 29 April 2014
Against the background of the growing interest in computer-assisted text analysis in research and practice, this paper develops a generic process model for text mining on the basis of current literature. The aim of the paper is to structure the extensive activities involved and thereby reduce the complexity of text mining projects. The research goal rests on the fact that a systematic literature review carried out beforehand identified no detailed, application-neutral process models for text mining. Building on the findings of the literature review, the resulting model therefore contains both inductively grounded components taken from specific approaches and components deduced from literature-based requirements. The evaluation of the artifact confirms the usefulness of the process model in comparison with the previous state of research.
1 Introduction
1.1 Motivation
1.2 Research goal and methodology
1.2.1 Systematic literature review
1.2.2 Design science research approach
1.3 Structure of the paper
2 State of research
2.1 Terminology
2.2 Characteristics of process models for text mining
2.3 Activities in the text mining process
2.4 Summary
3 Requirements for a generic process model
3.1 Structural requirements
3.2 Functional requirements
3.3 Summary
4 Development of the model
4.1 Task definition
4.2 Document selection and examination
4.3 Document preparation
4.3.1 Linguistic preparation
4.3.2 Technical preparation
4.4 Text mining methods
4.5 Evaluation of results
4.6 Application
4.7 Summary
4.7.1 Overall model
4.7.2 Feedback loops
5 Evaluation
5.1 Evaluation design
5.2 Measurement and analysis
6 Conclusion and outlook
References
Appendix
A1 Application-neutral process models
A2 Effects of lemmatization and stemming on the interpretability of texts
A3 Overall model
|