1 |
Etude de la confusion des descripteurs locaux de points d'intérêt : application à la mise en correspondance d'images de documents / Study of keypoints and local features confusion : document images matching scenario
Royer, Emilien, 24 October 2017
This work attempts to bridge the classical computer vision community and the document image analysis and recognition (DAR) community. Specifically, we address keypoint detectors and local feature descriptors in images. Because these were designed for real-world images, they are poorly suited to document images, whose visual characteristics differ. Our approach addresses the confusion between descriptors, which tend to lose their discriminative power on document images. Our main contribution is an algorithm that reduces the confusion potentially present in a set of feature vectors from the same image, using a probabilistic approach that filters out highly confusing vectors. This design lets us apply descriptor extraction algorithms without modifying them, which establishes a bridge between the two fields.
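To make the idea of filtering confusing feature vectors concrete, here is a minimal sketch. It is not the thesis's algorithm: the abstract does not specify the probabilistic criterion, so a plain nearest-neighbour distance threshold is swapped in as a stand-in. The function name `filter_confusing` and the parameter `min_distance` are hypothetical.

```python
import math


def filter_confusing(descriptors, min_distance):
    """Return indices of descriptors that are not 'confusing'.

    ASSUMPTION: a descriptor counts as confusing when another descriptor
    from the same image lies closer than `min_distance` (Euclidean).
    This simple threshold stands in for the probabilistic criterion
    mentioned in the abstract, whose details are not given there.
    """
    kept = []
    for i, d in enumerate(descriptors):
        # Distance to the closest *other* descriptor in the same image.
        nearest = min(
            (math.dist(d, other) for j, other in enumerate(descriptors) if j != i),
            default=math.inf,  # a lone descriptor has nothing to be confused with
        )
        if nearest >= min_distance:
            kept.append(i)
    return kept


# Two near-duplicate descriptors and one distinctive descriptor.
example = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
kept_indices = filter_confusing(example, min_distance=1.0)
```

Because the filter only selects among descriptors that were already computed, the extraction algorithm itself stays unmodified, which is the property the abstract emphasizes.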
|
2 |
Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction
Packer, Thomas L, 01 October 2014
Lists of records in machine-printed documents contain much useful information. As one example, the thousands of family history books scanned, OCRed, and placed online by FamilySearch.org probably contain hundreds of millions of fact assertions about people, places, family relationships, and life events. Data like this cannot be fully utilized until a person or process locates the data in the document text, extracts it, and structures it with respect to an ontology or database schema. Yet, in the family history industry and other industries, data in lists goes largely unused because no known approach adequately addresses all of the costs, challenges, and requirements of a complete end-to-end solution to this task. The diverse information is costly to extract because many kinds of lists appear even within a single document, differing from each other in both structure and content. The lists' records and component data fields are usually not set apart explicitly from the rest of the text, especially in a corpus of OCRed historical documents. OCR errors and the lack of document structure (e.g., HTML tags) make list content hard to recognize for a software tool developed without a substantial amount of highly specialized, hand-coded knowledge or machine learning supervision. Making an approach that is not only accurate but also sufficiently scalable in terms of time and space complexity to process a large corpus efficiently is especially challenging. In this dissertation, we introduce a novel family of scalable approaches to list discovery and ontology population. Its contributions include the following. We introduce the first general-purpose methods of which we are aware for both list detection and wrapper induction for lists in OCRed or other plain text.
We formally outline a mapping between in-line labeled text and populated ontologies, effectively reducing the ontology population problem to a sequence labeling problem, opening the door to applying sequence labelers and other common text tools to the goal of populating a richly structured ontology from text. We provide a novel admissible heuristic for inducing regular expression wrappers using an A* search. We introduce two ways of modeling list-structured text with a hidden Markov model. We present two query strategies for active learning in a list-wrapper induction setting. Our primary contributions are two complete and scalable wrapper-induction-based solutions to the end-to-end challenge of finding lists, extracting data, and populating an ontology. The first has linear time and space complexity and extracts highly accurate information at a low cost in terms of user involvement. The second has time and space complexity that are linear in the size of the input text and quadratic in the length of an output record and achieves higher F1-measures for extracted information as a function of supervision cost. We measure the performance of each of these approaches and show that they perform better than strong baselines, including variations of our own approaches and a conditional random field-based approach.
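As a rough illustration of how in-line labeled text can stand in for a populated structure, the sketch below groups labeled tokens into records. It is an assumption-laden toy and not the dissertation's formal mapping: the label scheme (`O` for ignored tokens, a `START:` prefix marking the first field token of a record), the field names, and the function `labels_to_records` are all hypothetical.

```python
def labels_to_records(tokens, labels):
    """Group labeled tokens into one dict per record.

    ASSUMPTIONS (hypothetical scheme, not taken from the dissertation):
      - "O" marks a token that carries no field value;
      - "START:<field>" marks the first field token of a new record;
      - any other label is a field name, and consecutive tokens with the
        same field name within a record are joined by a space.
    """
    records = []
    for token, label in zip(tokens, labels):
        if label == "O":
            continue
        if label.startswith("START:"):
            records.append({})
            label = label[len("START:"):]
        if not records:
            continue  # field token before any record start: skipped in this sketch
        record = records[-1]
        record[label] = f"{record[label]} {token}" if label in record else token
    return records


tokens = ["1.", "John", "Smith", "b.", "1850", "2.", "Mary", "Jones", "b.", "1852"]
labels = ["O", "START:name", "name", "O", "birth_year",
          "O", "START:name", "name", "O", "birth_year"]
records = labels_to_records(tokens, labels)
```

Once the output is expressed as per-token labels like these, any sequence labeler that emits one label per token could in principle feed such a grouping step.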
|