21

Sémantické anotace / Semantic annotations

Dědek, Jan, January 2012
Four relatively separate topics are presented in the thesis. Each topic represents one particular aspect of the Information Extraction discipline. The first two topics are focused on our information extraction methods based on deep language parsing. The first topic relates to how deep language parsing was used in our extraction method in combination with manually designed extraction rules. The second topic deals with a method for the automated induction of extraction rules using Inductive Logic Programming. The third topic of the thesis combines information extraction with rule-based reasoning. The core of our extraction method was experimentally reimplemented using semantic web technologies, which allows the extraction rules to be saved in so-called shareable extraction ontologies that do not depend on the original extraction tool. The last topic of the thesis deals with document classification and fuzzy logic. We investigate the possibility of using information obtained by information extraction techniques for document classification. Our implementation of the so-called Fuzzy ILP Classifier was experimentally applied to document classification.
22

Deep Learning for Document Image Analysis

Tensmeyer, Christopher Alan, 01 April 2019
Automatic machine understanding of documents from image inputs enables many applications in modern document workflows, digital archives of historical documents, and general machine intelligence, among others. Together, the techniques for understanding document images comprise the field of Document Image Analysis (DIA). Within DIA, the research community has identified several sub-problems, such as page segmentation and Optical Character Recognition (OCR). As the field has matured, there has been a trend of moving away from heuristic-based methods, designed for particular tasks and domains of documents, and towards machine learning methods that learn to solve tasks from examples of input/output pairs. Within machine learning, a particular class of models, known as deep learning models, has established itself as the state of the art for many image-based applications, including DIA. While traditional machine learning models typically operate on features designed by researchers, deep learning models are able to learn task-specific features directly from raw pixel inputs. This dissertation is a collection of papers that propose several deep learning models to solve a variety of tasks within DIA. The first task is historical document binarization, where an input image of a degraded historical document is converted to a bi-tonal image to separate foreground text from background regions. The next part of the dissertation considers document segmentation problems, including identifying the boundary between the document page and its background, as well as segmenting an image of a data table into rows, columns, and cells. Finally, a variety of deep models are proposed to solve recognition tasks. These tasks include whole document image classification, identifying the font of a given piece of text, and transcribing handwritten text in low-resource languages.
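As an illustration of the binarization task described above, here is a minimal sketch, assuming PyTorch, of a fully convolutional network that maps a grayscale page image to per-pixel foreground probabilities. It is a toy under our own assumptions, not the architecture proposed in the dissertation.

```python
# Toy fully convolutional binarizer: grayscale page in, per-pixel logits out.
# Illustrative only -- not the dissertation's model.
import torch
import torch.nn as nn

class TinyBinarizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),  # 1x1 conv produces a per-pixel logit
        )

    def forward(self, x):  # x: (N, 1, H, W), grayscale intensities in [0, 1]
        return self.net(x)

model = TinyBinarizer()
page = torch.rand(1, 1, 256, 256)            # stand-in for a degraded page image
target = (page > 0.5).float()                # stand-in for a ground-truth mask
loss = nn.BCEWithLogitsLoss()(model(page), target)    # per-pixel training loss
bitonal = (torch.sigmoid(model(page)) > 0.5).float()  # final bi-tonal image
```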
23

Extended Multidimensional Conceptual Spaces in Document Classification

Hadish, Mulugeta, January 2008
No description available.
24

[en] A MULTI-AGENT FRAMEWORK FOR SEARCH AND FLEXIBILIZATION OF DOCUMENT CLASSIFICATION ALGORITHMS / [pt] UM FRAMEWORK MULTI-AGENTES PARA BUSCA E FLEXIBILIZAÇÃO DE ALGORITMOS DE CLASSIFICAÇÃO DE DOCUMENTOS

JOAO ALFREDO PINTO DE MAGALHAES, 18 June 2003
We are living in the information age, where knowledge is constantly being created at a rate never seen before. This is mainly due to the Internet, which changed all the existing paradigms of information exchange between people. Through the net, it is possible to publish or exchange whole works, reaching an audience impossible to reach through previously existing means. However, an excess of information can be harmful: too much information can be equal to no information at all. Our work was to build a multi-agent framework for the search and flexibilization of textual document classification algorithms in a specific domain. We built an infrastructure that separates the concerns of document search and selection (the platform) from those of the classification algorithm used (an application of the separation-of-concerns concept). This makes it possible not only to plug in existing algorithms, but also to generate new ones that take into account domain-specific characteristics of the documents. Four instances were generated from the framework: a webclipping application, a component supporting knowledge management, a search engine for websites, and an application for the semantic web.
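The separation-of-concerns design in this abstract lends itself to a short sketch: the platform handles document search and selection while classification hides behind an interface, so algorithms can be swapped per domain. The code below is a hypothetical illustration in Python; every name in it is ours, not the framework's.

```python
# Hypothetical sketch of the platform/algorithm split; names are illustrative.
from abc import ABC, abstractmethod

class DocumentClassifier(ABC):
    """Pluggable classification concern."""
    @abstractmethod
    def classify(self, text: str) -> str: ...

class KeywordClassifier(DocumentClassifier):
    """A trivial domain-specific algorithm plugged into the platform."""
    def __init__(self, keywords: dict):
        self.keywords = keywords

    def classify(self, text: str) -> str:
        lowered = text.lower()
        return max(self.keywords,
                   key=lambda label: sum(w in lowered for w in self.keywords[label]))

class SearchPlatform:
    """Search/selection concern; knows nothing about how classification works."""
    def __init__(self, classifier: DocumentClassifier):
        self.classifier = classifier

    def process(self, documents: list) -> dict:
        return {doc: self.classifier.classify(doc) for doc in documents}

platform = SearchPlatform(KeywordClassifier(
    {"finance": ["invoice", "payment"], "legal": ["contract", "clause"]}))
print(platform.process(["invoice for payment due", "contract clause review"]))
```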
25

Méthodes de classifications dynamiques et incrémentales : application à la numérisation cognitive d'images de documents / Incremental and dynamic learning for document image : application for intelligent cognitive scanning of documents

Ngo Ho, Anh Khoi, 19 March 2015
This thesis addresses the problem of dynamic classification in stationary and non-stationary environments, tolerant of variations in the amount of training data and able to adjust its models to the variability of the incoming data. For that purpose, we propose a solution in which independent one-class SVM classifiers coexist, each with its own incremental learning procedure, so that no classifier suffers crossed influences emanating from the configuration of the other classifiers' models. The originality of our proposal lies in exploiting the former knowledge preserved in the SVM models (each SVM's own history, represented by the set of support vectors found) and combining it with the knowledge brought by new data as it arrives. The proposed classification model (mOC-iSVM) is exploited through three variations, each using the models' history differently. Our contribution stands against a state of the art that to date offers no solution handling, at the same time, concept drift, the addition or deletion of concepts, and the fusion or division of concepts, while providing a privileged framework for interaction with the user. Within the ANR DIGIDOC project, our approach was applied to several scenarios of image-stream classification that can occur in real digitization campaigns. These scenarios validated an interactive use of our incremental classification solution to classify images arriving in a stream so as to improve the quality of the digitized images.
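The retraining-on-support-vectors mechanism described above can be approximated in a few lines. The sketch below, built on scikit-learn's OneClassSVM under our own simplifying assumptions, is not the thesis's mOC-iSVM implementation; it only mirrors the idea of combining a model's remembered support vectors with each newly arrived batch.

```python
# Sketch: one incremental one-class SVM whose "memory" is its support vectors.
# In the mOC-iSVM setting, one such classifier would exist per concept/class.
import numpy as np
from sklearn.svm import OneClassSVM

class IncrementalOneClassSVM:
    def __init__(self, nu=0.1, gamma="scale"):
        self.svm = OneClassSVM(nu=nu, gamma=gamma)
        self.memory = None  # support vectors kept from earlier steps

    def partial_fit(self, X_new):
        # Retrain on old support vectors + new data, then keep only the
        # resulting support vectors as the model's compressed history.
        X = X_new if self.memory is None else np.vstack([self.memory, X_new])
        self.svm.fit(X)
        self.memory = self.svm.support_vectors_
        return self

    def score(self, X):
        return self.svm.decision_function(X)  # higher = more typical of the class

rng = np.random.default_rng(0)
clf = IncrementalOneClassSVM()
for batch in rng.normal(size=(3, 50, 2)):  # three batches arriving in a stream
    clf.partial_fit(batch)
print(clf.score(rng.normal(size=(5, 2))))
```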
26

Analýza sentimentu s využitím dolování dat / Sentiment Analysis with Use of Data Mining

Sychra, Martin, January 2016
The theme of this work is sentiment analysis, primarily from the perspective of informatics and marginally from a linguistic point of view. The linguistic part discusses the term sentiment and language methods for its analysis, e.g. lemmatization, POS tagging, and the use of stop-word lists. More attention is paid to the structure of the sentiment analyzer, which is based on machine learning methods (support vector machines, Naive Bayes, and maximum entropy classification). On the basis of this theoretical background, a functional analyzer is designed and implemented. The experiments focus mainly on comparing the classification methods and on the benefits of the individual preprocessing methods. The success rate of the constructed classifier reaches up to 84% in cross-validation.
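The three learning methods named above map directly onto standard scikit-learn estimators (maximum entropy classification corresponds to logistic regression). The following sketch, with toy data of our own, shows the kind of cross-validated comparison the experiments describe:

```python
# Compare SVM, Naive Bayes and maximum entropy (logistic regression) on
# TF-IDF features with English stop-word removal; toy data, illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great movie", "terrible plot", "loved every minute", "waste of time"] * 10
labels = [1, 0, 1, 0] * 10

for clf in (LinearSVC(), MultinomialNB(), LogisticRegression()):
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(f"{type(clf).__name__}: {scores.mean():.2f}")
```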
27

Contribution à la construction d’ontologies et à la recherche d’information : application au domaine médical / Contribution to ontology building and to semantic information retrieval : application to medical domain

Drame, Khadim, 10 December 2014
This work aims at providing efficient access to relevant information despite the increasing volume of digital data. Toward this end, we studied the benefit of using an ontology to support an information retrieval (IR) system. We first described a methodology for constructing ontologies: a mixed method combining natural language processing techniques, for extracting knowledge from text, with the reuse of existing semantic resources for the conceptualization step. We also developed a method for aligning French and English terms in order to enrich the resulting ontology terminologically. The application of our methodology resulted in a bilingual ontology dedicated to Alzheimer's disease. We then proposed algorithms for supporting ontology-based semantic IR, using concepts from the ontology both to describe documents automatically and to reformulate queries. We were particularly interested in: 1) the extraction of representative concepts from corpora, 2) their disambiguation, 3) a vectorial weighting scheme adapted to concepts, and 4) query expansion. These algorithms were used to implement a semantic IR portal dedicated to Alzheimer's disease. Further, because the content of the documents to be indexed is not always fully available, we exploited incomplete information to identify the concepts that nevertheless describe the documents. Toward this end, we proposed two methods for classifying documents from a large corpus: one based on the k-nearest-neighbors algorithm and the other on explicit semantic analysis. Both methods were evaluated on large standard collections of biomedical documents within an international challenge.
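The first of the two methods, classification by k nearest neighbors, can be sketched compactly. The snippet below is our own toy illustration with plain TF-IDF features; the thesis applies the approach with concept-based representations on large biomedical collections.

```python
# k-nearest-neighbors document classification over vectorized texts (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["amyloid plaques in the cortex", "hippocampal atrophy on imaging",
        "insulin resistance markers", "fasting glucose metabolism study"]
labels = ["alzheimer", "alzheimer", "diabetes", "diabetes"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(docs, labels)
print(knn.predict(["cortical amyloid imaging"]))  # majority label of neighbors
```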
28

Applying Natural Language Processing to document classification / Tillämpning av Naturlig Språkbehandling för dokumentklassificering

Kragbé, David, January 2022
In today's digital world, we produce and use more electronic documents than ever before, and this trend is far from slowing down. In particular, more and more companies now need to process a considerable number of documents to deal with their clients' requests. Scaling this process often requires building an automatic document treatment pipeline. Since the treatment of a document depends on its content, those pipelines rely heavily on an automatic document classifier to correctly process the documents received. Such a document classifier should be able to receive a document of any type and output its class based on the document's text content. In this thesis, we designed and implemented a machine learning pipeline for automated classification of insurance claims documents. In order to find the best pipeline, we created several combinations of classifiers (a logistic regression classifier and a random forest classifier) and embedding models (Fasttext and Doc2vec). We then compared the performance of all the pipelines using the precision and accuracy metrics. We found that a pipeline composed of a Fasttext embedding model combined with a logistic regression classifier was the most performant, yielding a precision of 85% and an accuracy of 86% on our dataset.
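A minimal sketch of the winning combination, Fasttext sentence embeddings feeding a logistic regression classifier, is given below. It assumes the `fasttext` Python package and a pre-trained vector file; the file name `cc.en.300.bin` and the toy claims texts are illustrative stand-ins, not artifacts of the thesis.

```python
# Fasttext embeddings + logistic regression, the best pipeline found.
# Assumes a pre-trained Fasttext model file is available locally.
import fasttext
import numpy as np
from sklearn.linear_model import LogisticRegression

ft = fasttext.load_model("cc.en.300.bin")  # illustrative pre-trained model

def embed(texts):
    # get_sentence_vector requires single-line input strings
    return np.array([ft.get_sentence_vector(t.replace("\n", " ")) for t in texts])

train_texts = ["water damage claim for kitchen ceiling", "rear-end collision report"]
train_labels = ["property", "vehicle"]

clf = LogisticRegression().fit(embed(train_texts), train_labels)
print(clf.predict(embed(["flooded basement claim"])))
```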
29

Employing a Transformer Language Model for Information Retrieval and Document Classification : Using OpenAI's generative pre-trained transformer, GPT-2 / Transformermodellers användbarhet inom informationssökning och dokumentklassificering

Bjöörn, Anton, January 2020
As the information flow on the Internet keeps growing, it becomes increasingly easy to miss important news that does not have mass appeal. Combating this problem calls for increasingly sophisticated information retrieval methods. Pre-trained transformer-based language models have shown great generalization performance on many natural language processing tasks. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), generalizes to information retrieval and classification of online news articles written in English, with the purpose of comparing this approach with the more traditional method of Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The aim is to shed light on how useful state-of-the-art transformer-based language models are for the construction of personalized information retrieval systems. Using transfer learning, the smallest version of GPT-2 is trained to rank and classify news articles, achieving results similar to the purely TF-IDF-based approach. While the average Normalized Discounted Cumulative Gain (NDCG) achieved by the GPT-2-based model was about 0.74 percentage points higher, the sample size was too small to give these results high statistical certainty.
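One straightforward way to press a pre-trained transformer into retrieval service is to pool its hidden states into document vectors and rank articles by cosine similarity to a query vector. The sketch below, using Hugging Face's transformers library, is our own minimal illustration, not the thesis pipeline, which fine-tuned the model via transfer learning.

```python
# Rank news articles against a query using mean-pooled GPT-2 hidden states.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")          # smallest GPT-2 version
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(text):
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state      # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # one 768-d vector per text

query = embed("central bank raises interest rates")
articles = ["Fed announces rate hike", "Local team wins championship"]
ranked = sorted(articles,
                key=lambda a: torch.cosine_similarity(query, embed(a), dim=0).item(),
                reverse=True)
print(ranked)  # most relevant article first
```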
30

Gestion de l’hétérogénéité d’un SI de classification documentaire multifacette et positionnement dans l’environnement des ECM. / Management of heterogeneity of a documentary multifaceted classification Information System and position in the ECM environment.

Ankoud, Manel, 19 December 2014
Knowledge organization is a discipline practiced by librarians, archivists, information specialists, IT professionals, and document professionals generally. It encompasses all the activities, studies, and research that develop and handle the processes of organizing and presenting the documentary resources useful to an organization. In this context, the ANR Miipa-Doc project aims to explore new bottom-up indexing methods, using descriptor terms formulated by individuals rather than chosen from a pre-established list, for organizing complex documentary content within large companies, and to design the corresponding software architecture. Our contribution to this project is to manage the heterogeneity of an information system for organizing documentary content, based on a business-oriented approach and a faceted, folksonomy-based KOS (knowledge organization system). For this management we propose an incremental model-driven approach, derived from MDE (model-driven engineering) and based on meta-models to guarantee evolvability. After implementing the HyperTaging prototype, which realizes both approaches, we propose an evaluation process that positions this prototype, and any documentary classification information system, in the ECM environment, based on fine-grained and specific evaluation criteria.
