401 |
Určení základního tvaru slova / Determination of basic form of words
Šanda, Pavel, January 2011
Lemmatization is an important preprocessing step for many applications of text mining. The lemmatization process is similar to stemming, with the difference that it does not only determine the word stem but tries to determine the basic form of the word, using the Brute Force and Suffix Stripping methods. The main aim of this paper is to present methods for the algorithmic improvement of Czech lemmatization. The created training data sets are part of this paper and can be freely used for student and academic work dealing with similar problems.
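As a loose illustration of the suffix-stripping approach mentioned above, the following Python sketch replaces the longest matching word ending with a candidate basic-form ending; the suffix table is invented for illustration and is not the rule set developed in the thesis.

    # Toy suffix-stripping lemmatizer: the longest matching suffix is replaced
    # by a candidate ending; the suffix table here is illustrative only.
    SUFFIX_RULES = [
        ("ování", "ovat"),   # e.g. "testování" -> "testovat"
        ("ech", ""),         # locative plural ending
        ("ům", ""),          # dative plural ending
        ("ami", "a"),        # instrumental plural of feminine nouns
        ("y", "a"),          # "ženy" -> "žena"
    ]

    def lemmatize(word):
        """Return a guessed basic form by stripping the longest known suffix."""
        word = word.lower()
        for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)] + replacement
        return word  # fall back to the surface form

    print(lemmatize("ženy"))       # -> "žena"
    print(lemmatize("testování"))  # -> "testovat"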
|
402 |
Sumarizace dokumentů na webu / Summarization of Documents from the Web
Škurla, Ján, January 2012
The topic of this master's thesis is the summarization of documents on the web. First, it deals with the issue of acquiring text from the web using a wrapper; an overview of wrappers that served as inspiration for the implementation is given. The thesis also covers various methods for creating summaries from text data (Luhn's, Edmundson's and KPC). The design of an application for text data extraction and summarization is also part of this work; the application is based on the Java platform and the Swing graphics library.
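A rough Python sketch of the Luhn-style scoring idea mentioned above (the thesis itself is implemented on the Java platform; the stop-word list and cut-offs below are assumptions):

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}  # assumed list

    def luhn_summary(text, num_sentences=2, top_k=10):
        """Score sentences by the density of 'significant' (frequent, non-stop) words."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = re.findall(r"[a-zA-Z]+", text.lower())
        freq = Counter(w for w in words if w not in STOPWORDS)
        significant = {w for w, _ in freq.most_common(top_k)}

        def score(sentence):
            tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
            hits = [i for i, t in enumerate(tokens) if t in significant]
            if not hits:
                return 0.0
            span = hits[-1] - hits[0] + 1   # window containing all significant words
            return len(hits) ** 2 / span    # Luhn's significance factor

        ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
        return " ".join(s for s in sentences if s in ranked)  # keep original order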
|
403 |
Metody shlukování textových dat / Textual Data Clustering Methods
Miloš, Roman, January 2011
Clustering of text data is one of the tasks of text mining. It divides documents into categories based on their similarity, and these categories make it easier to search the documents. This thesis describes the current methods used for text document clustering. From these methods we chose Simultaneous Keyword Identification and Clustering of text documents (SKWIC), which should achieve better results than standard clustering algorithms such as k-means. An application implementing this algorithm is designed and implemented, and in the end we compare SKWIC with the k-means algorithm.
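As a point of reference for the k-means comparison mentioned above, a minimal TF-IDF/k-means baseline might look as follows; SKWIC itself additionally learns per-cluster keyword weights and is not reproduced here. The toy corpus is invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "stock markets fell on inflation fears",
        "central bank raises interest rates",
        "the team won the championship final",
        "injured striker misses the next match",
    ]  # toy corpus

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)  # cluster id per document, e.g. [0 0 1 1]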
|
404 |
Algorithmes de machine learning en assurance : solvabilité, textmining, anonymisation et transparence / Machine learning algorithms in insurance : solvency, textmining, anonymization and transparency
Ly, Antoine, 19 November 2019
En été 2013, le terme de "Big Data" fait son apparition et suscite un fort intérêt auprès des entreprises. Cette thèse étudie ainsi l'apport de ces méthodes aux sciences actuarielles. Elle aborde aussi bien les enjeux théoriques que pratiques sur des thématiques à fort potentiel comme l'Optical Character Recognition (OCR), l'analyse de texte, l'anonymisation des données ou encore l'interprétabilité des modèles. Commençant par l'application des méthodes du machine learning dans le calcul du capital économique, nous tentons ensuite de mieux illustrer la frontière qui peut exister entre l'apprentissage automatique et la statistique. Mettant ainsi en avant certains avantages et différentes techniques, nous étudions alors l'application des réseaux de neurones profonds dans l'analyse optique de documents et de texte, une fois extrait. L'utilisation de méthodes complexes et la mise en application du Règlement Général sur la Protection des Données (RGPD) en 2018 nous a amenés à étudier les potentiels impacts sur les modèles tarifaires. En appliquant ainsi des méthodes d'anonymisation sur des modèles de calcul de prime pure en assurance non-vie, nous avons exploré différentes approches de généralisation basées sur l'apprentissage non-supervisé. Enfin, la réglementation imposant également des critères en termes d'explication des modèles, nous concluons par une étude générale des méthodes qui permettent aujourd'hui de mieux comprendre les méthodes complexes telles que les réseaux de neurones. / In summer 2013, the term "Big Data" appeared and attracted a lot of interest from companies. This thesis examines the contribution of these methods to actuarial science. It addresses both theoretical and practical issues on high-potential topics such as Optical Character Recognition (OCR), text analysis, data anonymization and model interpretability. Starting with the application of machine learning methods to the calculation of economic capital, we then try to better illustrate the boundary that may exist between machine learning and statistics. Highlighting certain advantages and different techniques, we then study the application of deep neural networks to the optical analysis of documents and text, once extracted. The use of complex methods and the implementation of the General Data Protection Regulation (GDPR) in 2018 led us to study its potential impacts on pricing models. By applying anonymization methods to pure premium calculation models in non-life insurance, we explored different generalization approaches based on unsupervised learning. Finally, as regulations also impose criteria in terms of model explanation, we conclude with a general study of methods that now allow a better understanding of complex methods such as neural networks.
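As a loose, hypothetical illustration of generalization driven by unsupervised learning (not the procedure used in the thesis; the column names, clustering step and cluster count are assumptions), quasi-identifiers in a policyholder table can be coarsened to cluster-level ranges:

    import pandas as pd
    from sklearn.cluster import KMeans

    # Toy policyholder table; 'age' and 'vehicle_power' act as quasi-identifiers.
    df = pd.DataFrame({
        "age": [23, 25, 47, 52, 31, 29],
        "vehicle_power": [90, 110, 75, 80, 120, 95],
    })

    # Group similar records, then generalize each quasi-identifier to its cluster range.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)
    df["cluster"] = labels
    for col in ["age", "vehicle_power"]:
        ranges = df.groupby("cluster")[col].agg(["min", "max"])
        df[col] = df["cluster"].map(lambda c: f"{ranges.loc[c, 'min']}-{ranges.loc[c, 'max']}")

    print(df.drop(columns="cluster"))  # generalized records, one value range per cluster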
|
405 |
Nachrichtenklassifikation als Komponente in WEBIS / News Classification as a Component in WEBIS
Krellner, Björn, 25 September 2006
This diploma thesis describes the further development of a prototype for news classification and its integration into the existing web-oriented information system (WEBIS).
Classifications carried out with the resulting software are presented and compared with previous findings.
|
406 |
CASSANDRA: drug gene association prediction via text mining and ontologies
Kissa, Maria, 20 January 2015
The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light on complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce CASSANDRA, a fully corpus-based and unsupervised algorithm which uses MEDLINE-indexed titles and abstracts to infer drug gene associations and assist drug repositioning. CASSANDRA measures the Pointwise Mutual Information (PMI) between biomedical terms derived from Gene Ontology (GO) and Medical Subject Headings (MeSH). Based on the PMI scores, drug and gene profiles are generated, and candidate drug gene associations are inferred by computing the relatedness of their profiles.
Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The algorithm can successfully identify direct drug gene associations with high precision and prioritize them over indirect drug gene associations. Validation shows that the statistically derived profiles from literature perform as well as (and at times better than) the manually curated profiles.
In addition, we examine CASSANDRA's potential for drug repositioning. For all FDA-approved drugs repositioned over the last 5 years, we generate profiles from publications before 2009 and show that the new indications rank high in these profiles. In summary, co-occurrence-based profiles derived from the biomedical literature can accurately predict drug gene associations and provide insight into potential repositioning cases.
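A toy sketch of the co-occurrence/PMI profile idea described above; the terms and counts below are invented, whereas the real system builds profiles over GO and MeSH terms from MEDLINE titles and abstracts.

    import math
    from collections import Counter
    from itertools import combinations

    # Each "document" is the set of terms indexed for one abstract (toy data).
    docs = [
        {"aspirin", "inflammation", "PTGS2"},
        {"aspirin", "PTGS2", "pain"},
        {"metformin", "AMPK", "glucose"},
        {"metformin", "glucose", "diabetes"},
    ]

    term_count = Counter(t for d in docs for t in d)
    pair_count = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))
    n = len(docs)

    def pmi(a, b):
        """Pointwise mutual information of two terms from document co-occurrence."""
        joint = pair_count[frozenset((a, b))]
        if joint == 0:
            return 0.0
        return math.log((joint / n) / ((term_count[a] / n) * (term_count[b] / n)))

    def profile(term, vocab):
        return {t: pmi(term, t) for t in vocab if t != term}

    def cosine(p, q):
        keys = set(p) | set(q)
        dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
        na = math.sqrt(sum(v * v for v in p.values())) or 1.0
        nb = math.sqrt(sum(v * v for v in q.values())) or 1.0
        return dot / (na * nb)

    vocab = {"inflammation", "pain", "glucose", "diabetes", "AMPK"}
    drug, gene = profile("aspirin", vocab), profile("PTGS2", vocab)
    print(cosine(drug, gene))  # relatedness of the drug and gene profiles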
|
407 |
Knowledge Integration and Representation for Biomedical Analysis
Alachram, Halima, 04 February 2021
No description available.
|
408 |
Value Creation From User Generated Content for Smart Tourism Destinations
Celen, Mustafa; Rojas, Maximiliano, January 2020
This paper aims to show how User Generated Content (UGC) can create value for Smart Tourism Destinations (STDs). The analysis is applied to five different cases in the region of Stockholm to derive patterns and opportunities of value creation generated by UGC in tourism. The findings are also discussed in terms of improving decision making, the possibilities of new business models, and the importance of technological improvements for STDs. Finally, thoughts on models are presented for researchers and practitioners interested in exploiting UGC in information-intensive industries, and mainly in tourism.
|
409 |
New Computational Methods for Literature-Based Discovery
Ding, Juncheng, 05 1900
In this work, we leverage recent developments in computer science to address several challenges in current literature-based discovery (LBD) solutions. First, existing LBD solutions either cannot use semantics or are too computationally complex. To address this, we propose OverlapLDA, a generative model based on topic modeling, which has been shown to be both effective and efficient in extracting semantics from a corpus. We also introduce an inference method for OverlapLDA and conduct extensive experiments to show its effectiveness and efficiency in LBD. Second, we expand LBD to a more complex and realistic setting, in which more than one concept can connect the input concepts and the connectivity pattern between concepts can be more complex than a chain. Current LBD solutions can hardly complete the LBD task in this new setting. We simplify the hypotheses as concept sets and propose LBDSetNet, based on graph neural networks, to solve this problem. We also introduce different training schemes based on self-supervised learning to train LBDSetNet without relying on comprehensive labeled hypotheses, which are extremely costly to obtain. Our comprehensive experiments show that LBDSetNet outperforms strong baselines on simple hypotheses and addresses complex hypotheses.
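A minimal topic-modeling sketch using standard LDA as a stand-in (OverlapLDA is the thesis's own model and is not reproduced here; the toy corpus is invented). Shared topics across documents hint at intermediate concepts that may link two input concepts.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    abstracts = [
        "magnesium deficiency linked to migraine headaches",
        "calcium channel blockers reduce migraine frequency",
        "magnesium regulates calcium channel activity",
        "stress hormones and migraine onset in patients",
    ]  # toy corpus standing in for MEDLINE abstracts

    counts = CountVectorizer(stop_words="english")
    X = counts.fit_transform(abstracts)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = counts.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-4:][::-1]]
        print(f"topic {k}: {top}")  # top words per topic, candidate linking concepts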
|
410 |
Language Engineering for Information Extraction
Schierle, Martin, 12 July 2011
Accompanied by the cultural development to an information society and knowledge economy and driven by the rapid growth of the World Wide Web and decreasing prices for technology and disk space, the world's knowledge is evolving fast, and humans are challenged with keeping up.
Despite all efforts at data structuring, a large part of this human knowledge is still hidden behind the ambiguity and fuzziness of natural language. Domain language in particular poses new challenges through its specific syntax, terminology and morphology. Companies willing to exploit the information contained in such corpora are often required to build specialized systems instead of being able to rely on off-the-shelf software libraries and data resources. The engineering of language processing systems is, however, cumbersome, and the creation of language resources, the annotation of training data and the composition of modules is often more of an art than a science. The scientific field of Language Engineering aims at providing reliable information, approaches and guidelines on how to design, implement, test and evaluate language processing systems.
Language engineering architectures have been a subject of scientific work for the last two decades and aim at building universal systems of easily reusable components. Although current systems offer comprehensive features and rely on an architecturally sound basis, there is still little documentation about how to actually build an information extraction application. The selection of modules, methods and resources for a particular use case requires a detailed understanding of state-of-the-art technology, application demands and the characteristics of the input text. The main assumption underlying this work is the thesis that a new application can only occasionally be created by simply reusing standard components from different repositories. This work recapitulates existing literature about language resources, processing resources and language engineering architectures to derive a theory of how to engineer a new system for information extraction from a (domain) corpus.
This thesis was initiated by Daimler AG to prepare and analyze unstructured information as a basis for corporate quality analysis. It is therefore concerned with language engineering in the area of Information Extraction, which targets the detection and extraction of specific facts from textual data. While other work in the field of information extraction is mainly concerned with the extraction of location or person names, this work deals with automotive components, failure symptoms, corrective measures and their relations of arbitrary arity.
The ideas presented in this work are applied, evaluated and demonstrated on a real-world application dealing with quality analysis of automotive domain language. To achieve this goal, the underlying corpus is examined and scientifically characterized, and algorithms are selected with respect to the derived requirements and evaluated where necessary. The system comprises language identification, tokenization, spelling correction, part-of-speech tagging, syntax parsing and a final relation extraction step. The extracted information is used as an input to data mining methods such as an early warning system and a graph-based visualization for interactive root cause analysis. It is finally investigated how the unstructured data facilitates these quality analysis methods in comparison to structured data. The acceptance of these text-based methods in the company's processes further proves the usefulness of the created information extraction system.
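A drastically simplified sketch of such a pipeline; the lexicons, spelling list and extraction pattern below are invented for illustration, and the actual system described in the thesis is far more elaborate.

    import re

    GERMAN_HINTS = {"der", "die", "das", "und", "nicht"}   # toy language-ID lexicon
    SPELLING = {"pum": "pump", "leakin": "leaking"}        # toy correction list
    COMPONENTS = {"pump", "valve", "sensor"}               # toy component lexicon
    SYMPTOMS = {"leaking", "noisy", "blocked"}             # toy symptom lexicon

    def tokenize(text):
        return re.findall(r"[a-zäöüß]+", text.lower())

    def identify_language(tokens):
        return "de" if sum(t in GERMAN_HINTS for t in tokens) >= 2 else "en"

    def correct(tokens):
        return [SPELLING.get(t, t) for t in tokens]

    def tag(tokens):
        """Crude stand-in for POS/semantic tagging using lexicon lookup."""
        def label(t):
            if t in COMPONENTS:
                return "COMPONENT"
            if t in SYMPTOMS:
                return "SYMPTOM"
            return "OTHER"
        return [(t, label(t)) for t in tokens]

    def extract_relations(tagged):
        """Pair every component with every symptom found in the same record."""
        comps = [t for t, l in tagged if l == "COMPONENT"]
        symps = [t for t, l in tagged if l == "SYMPTOM"]
        return [(c, "HAS_SYMPTOM", s) for c in comps for s in symps]

    record = "Customer reports the pum is leakin near the valve"
    tokens = correct(tokenize(record))
    print(identify_language(tokens))        # -> "en"
    print(extract_relations(tag(tokens)))   # -> [('pump', 'HAS_SYMPTOM', 'leaking'), ...]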
|